Get this episode by subscribing to our podcast: search for 80,000 Hours wherever you get your podcasts.
Prof Philip Tetlock is a social science legend. Over forty years he has researched whose forecasts we can trust, whose we can’t and why – and developed methods that allow all of us to be better at predicting the future.
After the Iraq WMDs fiasco, the US intelligence services hired him to figure out how to ensure they’d never screw up that badly again. The result of that work – Superforecasting – was a media sensation in 2015.
It described Tetlock’s Good Judgement Project, which found forecasting methods so accurate they beat everyone else in open competition, including thousands of people in the intelligence services with access to classified information.
Today he’s working to develop the best forecasting process ever by combining the best of human and machine intelligence in the Hybrid Forecasting Competition, which you can start participating in now to sharpen your own judgement.
In this interview we describe his key findings and then push to the edge of what’s known about how to foresee the unforeseeable:
- Should people who want to be right just adopt the views of experts rather than apply their own judgement?
- Why are Berkeley undergrads worse forecasters than dart-throwing chimps?
- Should I keep my political views secret, so it will be easier to change them later?
- How can listeners contribute to his latest cutting-edge research?
- What do we know about our accuracy at predicting low-probability high-impact disasters?
- Does his research provide an intellectual basis for populist political movements?
- Was the Iraq War caused by bad politics, or bad intelligence methods?
- What can we learn about forecasting from the 2016 election?
- Can experience help people avoid overconfidence and underconfidence?
- When does an AI easily beat human judgement?
- Could more accurate forecasting methods make the world more dangerous?
- How much does demographic diversity line up with cognitive diversity?
- What are the odds we’ll go to war with China?
- Should we let prediction tournaments run most of the government?
Listen to it.
If you subscribe to The 80,000 Hours Podcast, you can listen at leisure on your phone, speed up the conversation if you like, and find out about future episodes. You can do so by searching for ‘80,000 Hours’ in your podcasting app (include the comma).
Below you’ll find the full interview, along with a coaching application form, brief summary and extra resources to learn more.
Three key points
“…if when the President went around the room and he asked his advisors how likely is Osama to be in this mystery compound, if each advisor had said 0.7, what probability should the President conclude is the correct probability? Most people say well, it’s kind of obvious, the answer is 0.7. But the answer is only obvious if the advisors are clones of each other. If the advisors all share the same information and are reaching the same conclusion from the same information, the answer is probably very close to 0.7.
But imagine that one of the advisors reaches the 0.7 conclusion because she has access to satellite intelligence. Another reaches that conclusion because he has access to human intelligence. Another one reaches that conclusion because of code breaking, and so forth. So the advisors are reaching the same conclusion, 0.7, but are basing it on quite different data sets processed in different ways. What’s the probability now? Most people have the intuition that the probability should be more extreme than 0.7. And the question then becomes how much more extreme? … You want to extremise that in proportion to the diversity of the viewpoints among the forecasters who are being aggregated.”
“the John Cleese, Michael Gove perspective that Expert Political Judgment somehow justified not listening to expert opinion about the consequences of Brexit struck me as a somewhat dangerous misreading of the book. It’s not that I’m saying that the experts are going to be right, but I would say completely ignoring them is dangerous.
It’s very hard to strike the right balance between justified skepticism of pseudo-expertise, and there’s a lot of pseudo-expertise out there and there’s a lot of over-claiming by legitimate experts. So justified skepticism is very appropriate, obviously – but then you have this kind of know-nothingism, which you don’t want to blur over into that. So you have to strike some kind of balance between the two, and that’s what the new preface is about in large measure.”
“There’s an active debate among researchers in the field about the degree to which calibration training generalizes. I could get you to be well-calibrated in judging poker hands; is that going to generalize to how calibrated you are on the weather? Is that going to generalize to how well-calibrated you are on the effects of rising interest rates?
The effects of transfer of training are somewhat on the modest side so you want to be really careful about this. I would say, oh gosh. You really want to concentrate your training efforts into things you care about. So if it’s philanthropic activities, I think you want to have people make judgements on projects that are quite similar to that to get the maximal benefit.
“I’m not saying that transfer of training is zero, although some people do say that. I think it’s too extreme to say that transfer of training is zero, and I think the transfer of training is greater if people not only get practice at doing it but if people understand what calibration is…”
Articles, books and blog posts discussed in the show
- When powerful people make dumb choices it hurts us all. Here’s how to fix it.
- The case for epistemic modesty
- The case against epistemic modesty
- Hybrid Forecasting Competition
- Good Judgement Open tournament
- Superforecasting book
- Explanation of Robin Hanson’s futarchy idea
- Expert Political Judgement book, 2nd edition
- Podcast: You want to do as much good as possible and have billions of dollars. What do you do?
- Speeding up social science 10-fold, how to do research that’s actually useful, & why plenty of startups cause harm
- Is it time for a new scientific revolution? Julia Galef on how to make humans smarter, why Twitter isn’t all bad, and where effective altruism is going wrong
Hi listeners, this is the 80,000 Hours podcast, the show about the world’s most pressing problems and how you can use your career to solve them. I’m Rob Wiblin, Director of Research at 80,000 Hours.
Today I’m speaking with one of my intellectual heroes, Prof Philip Tetlock. His main research area has been the concept of good judgment and the impact of accountability on judgment and choice.
He has spent 35 years collecting forecasts from a wide range of people, in order to figure out how good their predictions are, and whose predictions are more reliable than others. His research is unusual for its depth and rigour, as it has now included tens of thousands of participants and millions of predictions.
I only had Philip for an hour, and I didn’t want to waste the first few minutes having him describe his research in general.
So I’ll briefly describe what he has found, by reading an extract from a new preface for Expert Political Judgment: How Good Is It? How Can We Know?
The studies that provided the foundation for Expert Political Judgment were forecasting tournaments held in the 1980s and 1990s during which experts assessed the probabilities of a wide range of global events—from interstate violence to economic growth to leadership changes. By the end, we had nearly 30,000 predictions that we scored for accuracy using a rigorous system invented by and named for a statistically savvy meteorologist, Glenn Brier. One gets better Brier scores by assigning probabilities “closer” to reality over the long term, where reality takes on the value of 1.0 when the predicted event occurs and zero when it does not. Lower Brier scores are therefore good. Indeed, a perfect score of zero indicates uncanny clairvoyance, an infallible knack for assigning probabilities of 1.0 to things that happen and of zero to things that do not. The worst possible score, 2.0, indicates equally uncanny inverse clairvoyance, an infallible knack for getting everything wrong. And if we computed the long-term average of all the chimps’ dart-tossing forecasts, we would converge on 0.5, the same maximum-uncertainty judgment rational observers would make in guessing purely stochastic binary outcomes such as coin flips—which would earn chimps and humans alike the chance-accuracy baseline Brier score of 0.5.
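The scoring rule described in that extract can be sketched in a few lines. This is a minimal illustration rather than the tournaments' actual scoring code: it assumes binary questions and the two-sided form of the Brier score, which is the form that runs from 0 to 2 as the extract says. The function name `brier_score` is ours.

```python
from typing import Sequence


def brier_score(forecasts: Sequence[float], outcomes: Sequence[int]) -> float:
    """Mean two-sided Brier score for binary questions.

    forecasts: probability assigned to the event happening (0.0 to 1.0).
    outcomes:  1 if the event happened, 0 if it did not.
    Returns a value between 0.0 (perfect) and 2.0 (always confidently wrong).
    """
    if len(forecasts) != len(outcomes) or len(forecasts) == 0:
        raise ValueError("need equally long, non-empty forecasts and outcomes")
    total = 0.0
    for p, o in zip(forecasts, outcomes):
        # Squared error on "event happens" plus squared error on "event does not happen".
        total += (p - o) ** 2 + ((1 - p) - (1 - o)) ** 2
    return total / len(forecasts)


if __name__ == "__main__":
    print(brier_score([1.0, 0.0], [1, 0]))  # clairvoyant forecaster
    print(brier_score([0.0, 1.0], [1, 0]))  # inverse clairvoyant
    print(brier_score([0.5, 0.5], [1, 0]))  # dart-tossing chimp averaging 0.5
```

Under these assumptions the three cases give 0, 2 and 0.5, matching the perfect score, the worst possible score and the chance baseline named in the extract.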
The headline result of the tournaments was the chimp sound-bite, but EPJ’s central findings were more nuanced. It is hard to condense them into fewer than five propositions, each a mouthful in itself:
• Overall, EPJ found over-confidence: experts thought they knew more about the future than they did. The subjective probabilities they attached to possible futures they deemed to be most likely exceeded, by statistically and substantively significant margins, the objective frequency with which those futures materialized. When experts judged events to be 100 percent slam-dunks, those events occurred, roughly, 80 percent of the time, and events assigned 80 percent probabilities materialized, on average, roughly 65 percent of the time.
• In aggregate, experts edged out the dart-tossing chimp but their margins of victory were narrow. And they failed to beat: (a) sophisticated dilettantes (experts making predictions outside their specialty, whom I labeled “attentive readers of the New York Times”—a label almost as unpopular as the dart-tossing chimp); (b) extrapolation algorithms which mechanically predicted that the future would be a continuation of the present. Experts’ most decisive victory was over Berkeley undergraduates, who pulled off the improbable feat of doing worse than chance.
• But we should not let terms like “overall” and “in aggregate” obscure key variations in performance. The experts surest of their big-picture grasp of the deep drivers of history, the Isaiah Berlin–style “hedgehogs,” performed worse than their more diffident colleagues, or “foxes,” who stuck closer to the data at hand and saw merit in clashing schools of thought. That differential was particularly pronounced for long-range forecasts inside experts’ domains of expertise. The more remote the day of reckoning with reality, the freer the well-informed hedgehogs felt to embellish their theory-driven portraits of the future, and the more embellishments there were, the steeper the price they eventually paid in accuracy. Foxes seemed more attuned to how rapidly uncertainty compounds over time—and more resigned to the eventual appearance of inherently unpredictable events, Black Swans, that will humble even the most formidable forecasters.
• A tentative composite portrait of good judgment emerged in which a blend of curiosity, open-mindedness, and unusual tolerance for dissonance were linked both to forecasting accuracy and to an awareness of the fragility of forecasting achievements. For instance, better forecasters were more aware of how much our analyses of the present depend on educated guesswork about alternative histories, about what would have happened if we had gone down one policy path rather than another (chapter 5). This awareness translated into openness to ideologically discomfiting counterfactuals. So, better forecasters among liberals were more open to the possibility that the policies of a second Carter administration could have prolonged the Cold War, whereas better forecasters among conservatives were more open to the possibility that the Cold War could have ended just as swiftly under Carter as it did under Reagan. Greater open-mindedness also protected foxier forecasters from the more virulent strains of cognitive bias that handicapped hedgehogs in recalling their inaccurate forecasts (hindsight bias) and in updating their beliefs in response to failed predictions (cognitive conservatism).
• Most important, beware of sweeping generalizations. Hedgehogs were not always the worst forecasters. Tempting though it is to mock their belief-system defenses for their often too-bold forecasts—like “off-on-timing” (the outcome I predicted hasn’t happened yet, but it will) or the close-call counterfactual (the outcome I predicted would have happened but for a fluky exogenous shock)—some of these defenses proved quite defensible. And, though less opinionated, foxes were not always the best forecasters. Some were so open to alternative scenarios (in chapter 7) that their probability estimates of exclusive and exhaustive sets of possible futures summed to well over 1.0. Good judgment requires balancing opposing biases. Over-confidence and belief perseverance may be the more common errors in human judgment but we set the stage for over-correction if we focus solely on these errors and ignore the mirror image mistakes, of under-confidence and excessive volatility.
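The over-confidence finding in the first proposition is a statement about calibration: group forecasts by the probability the forecaster stated, then compare that number with how often those events actually happened. Here is a rough sketch of that check. The function name `calibration_table`, the rounding into 0.1-wide bins and the sample data are all our own assumptions for illustration, not data or code from the book.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def calibration_table(
    forecasts: Sequence[float], outcomes: Sequence[int]
) -> Dict[float, Tuple[int, float]]:
    """Map each stated probability (rounded to one decimal place) to
    (number of forecasts, observed frequency of the event)."""
    buckets: Dict[float, List[int]] = defaultdict(list)
    for p, o in zip(forecasts, outcomes):
        buckets[round(p, 1)].append(o)
    return {
        stated: (len(hits), sum(hits) / len(hits))
        for stated, hits in sorted(buckets.items())
    }


if __name__ == "__main__":
    # Invented data shaped like the pattern in the extract: "100 percent" calls
    # that come true 80 percent of the time, "80 percent" calls at 65 percent.
    forecasts = [1.0] * 5 + [0.8] * 20
    outcomes = [1, 1, 1, 1, 0] + [1] * 13 + [0] * 7
    for stated, (n, observed) in calibration_table(forecasts, outcomes).items():
        print(f"stated {stated:.1f}  n={n:2d}  observed {observed:.2f}")
```

A well-calibrated forecaster would show observed frequencies close to the stated probabilities in every row. An over-confident one shows observed frequencies pulled towards 0.5, as in the invented data here.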
With that lengthy introduction out of the way, now I bring you, Philip Tetlock.
Prof Tetlock is the Annenberg University Professor at the University of Pennsylvania. He is also co-principal investigator of The Good Judgment Project, a multi-year study of the feasibility of improving the accuracy of probability judgments of high-stakes, real-world events.
He has written 200 articles in peer-reviewed journals and several books including Superforecasting: The Art and Science of Prediction; Expert Political Judgment: How Good Is It? How Can We Know?; Unmaking the West: What-if Scenarios that Rewrite World History; and Counterfactual Thought Experiments in World Politics.
Robert Wiblin: Thanks for coming on the podcast, Philip.
Philip Tetlock: My pleasure.
Robert Wiblin: So we planned to talk about how people can conduct really valuable social science research and perhaps even build on your own work, but first, you have a new crowdsourcing tournament going on now, don’t you, called Hybrid Mind?
Philip Tetlock: Well, I wouldn’t claim that it belongs to me. It belongs to IARPA, the Intelligence Advanced Research Projects Activity, which is the same operation in the US intelligence community that ran the earlier forecasting tournament. The new one is called the Hybrid Forecasting Competition, and it, I think, represents a very important new development in forecasting technology. It pits humans against machines against human-machine hybrids, and they’re looking actively for human volunteers.
So hybridforecasting.com is the place to go if you want to volunteer.
Robert Wiblin: I just signed up a couple of hours ago. How is it different from The Good Judgment Project’s open tournament?
Philip Tetlock: Much more emphasis on statistical and artificial intelligence tools for aggregating, forecasting cues, and combining them in what competitors hope will be optimal ways.
Robert Wiblin: Right. So you’re going to take judgements that people submit as humans and then try to combine them with judgements from a statistical process, and then try to produce a combination of the two that’s better than either one?
Philip Tetlock: Well, there will be different competitors, and the different competitors have different ideas about what’s optimal. The people who volunteer will be randomly assigned to one of those forecasting competitors, much the same way that people were assigned last time.
Robert Wiblin: Interesting. Okay. What are you hoping to learn that’s different?
Philip Tetlock: Well, there are a lot of unknowns. It may seem obvious that machines will have an advantage when you’re dealing with complex quantitative problems. It would be very hard for humans to do better than machines when you’re trying to forecast, say, patterns of economic growth in OECD countries where you have very rich, pre-quantified time series, cross-sectional data sets, correlation matrices, lots of macro models. It’s hard to imagine people doing much better than that, but it’s not impossible because the models often overfit.
Insofar as the better forecasters are aware of turbulence on the horizon and appropriately adjust their forecasts, they could even have an advantage on turf where we might assume machines would be able to do better.
So there’s a domain, I think, of questions where there’s kind of a presumption among many people who observe these things that the machines have an advantage. Then there are questions where people sort of scratch their heads and say how could the machines possibly do questions like this? Here, they have in mind the sorts of questions that were posed, many of the questions that were posed anyway, on the earlier IARPA forecasting tournament, the one that led to the discovery of superforecasters.
These are really hard questions about how long is the Syrian civil war going to last in 2012? Is the war going to last another six months or another 12 months? When the Swiss and French medical authorities do an autopsy on Yasser Arafat, will they discover polonium? It’s hard to imagine machines getting a lot of traction on many of these quite idiosyncratic context-specific questions where it’s very difficult to conjure any kind of meaningful statistical model.
Although, when I say it’s hard to construct those things, it doesn’t mean it’s impossible.
Robert Wiblin: I’ve been using The Good Judgment … I’ve been in the tournament for The Good Judgment Project for the last few years. I found it pretty hard to make good judgements or to find ways to improve the numbers that are already on there because, typically, they’re just pretty reasonable as far as I can tell.
For example, I was looking at the Chilean election today to try to figure out who was going to become president. I just don’t know about the Chilean elections, so you throw some difficult ones at us.
When you take those numbers that people like me are putting in there, how much processing do you do afterwards?
Philip Tetlock: Are we referring back to the earlier forecasting tournament from 2011 to 2015?
Robert Wiblin: Yeah.
Philip Tetlock: There’s quite a bit of statistical razzmatazz going on, although it turned out that the winning algorithm was pretty simple. If you’re not comfortable with logarithms, it might be a little bit perplexing. The core idea behind the algorithm was you would take a weighted average of the most recent forecasts of the best forecasters, and then … insofar as those five forecasters are quite different from each other, you would do something called extremising.
So that’s the argument for cognitive diversity. If you have a number of different people reaching the same conclusion via different inferential pathways, you should have more confidence in that conclusion than if they’re all clones of each other. The example I used in the Superforecasting book was the example from the advisors to President Obama when he was making the decision about whether to launch the Navy SEALs at a large house in the Pakistani city of Abbottabad.
The thought experiment runs like this, that if when the President went around the room and he asked his advisors how likely is Osama to be in this compound, this mystery compound, if each advisor had said 0.7, what probability should the President conclude is the correct probability? Most people sort of look at you and say well, it’s kind of obvious, the answer is 0.7, but the answer is only obvious if the advisors are clones of each other. If the advisors all share the same information and are reaching the same conclusion from the same information, the answer is probably very close to 0.7.
Imagine that one of the advisors reaches the 0.7 conclusion because she has access to satellite intelligence. Another reaches that conclusion because he has access to human intelligence. Another one reaches that conclusion because of code breaking, and so forth. So the advisors are reaching the same conclusion, 0.7, but are basing it on quite different data sets processed in different ways. What’s the probability now? Most people have the intuition that the probability should be more extreme than 0.7, and the question then becomes how much more extreme?
That’s where the statistical razzmatazz comes in, where you want to extremise and you want to extremise the weighted average of the most recent forecast of the best forecasters. You want to extremise that in proportion to the diversity of the viewpoints among the forecasters who are being aggregated.
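For readers who want to see the shape of the idea, here is a minimal sketch of "weighted average, then extremise". The interview does not give the formula, so two things here are assumptions on our part: the transform p^a / (p^a + (1 - p)^a), a common way to push a probability away from 0.5 that is equivalent to multiplying the log-odds by a, and the exponent `a` itself, which stands in for "how diverse are the forecasters". How diversity should map onto `a` is not specified in the conversation, and the function names are ours.

```python
from typing import Sequence


def weighted_average(probs: Sequence[float], weights: Sequence[float]) -> float:
    """Weighted mean of probability forecasts; weights need not sum to 1."""
    if len(probs) != len(weights) or len(probs) == 0:
        raise ValueError("need equally long, non-empty probs and weights")
    return sum(p * w for p, w in zip(probs, weights)) / sum(weights)


def extremise(p: float, a: float) -> float:
    """Push p away from 0.5 when a > 1; a == 1 leaves p unchanged.

    Equivalent to multiplying the log-odds of p by a.
    """
    return p ** a / (p ** a + (1 - p) ** a)


if __name__ == "__main__":
    # Three advisors who each say 0.7, weighted equally.
    aggregate = weighted_average([0.7, 0.7, 0.7], [1.0, 1.0, 1.0])
    print(extremise(aggregate, 1.0))  # clones of each other: no extremising
    print(extremise(aggregate, 2.0))  # independent evidence: more extreme than 0.7
```

With `a = 1.0` the aggregate stays at 0.7, the "clones" case. With `a = 2.0`, an arbitrary value chosen only for illustration, it moves to roughly 0.84, the "different data sets" case.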
That’s kind of a long answer, but does that make sense?
Robert Wiblin: Yeah, no, it makes complete sense. It’s quite interesting because within the Centre for Effective Altruism where I’ve been working the last few years, we often get people to independently come up with probability estimates for different things before we discuss something, and then after we discuss it.
We’ve never done this thing of then combining them and then saying well, if we’re all on one side, then that should make us even more confident than the average of our answers. But perhaps we shouldn’t, anyway, because we’re all clones of one another or something like that or we all have access to too similar information, but that’s maybe something we should consider doing.
Philip Tetlock: Well, well-functioning groups that are very good at overcoming biases like failing to share distinctive information, groups that are effective at that, you want to be careful about extremising. For example, it wasn’t a good idea to extremise the judgements of superforecasting teams.
Robert Wiblin: Because they’d spoken to one another too much.
Philip Tetlock: They were in effect self-extremising.
Robert Wiblin: Interesting. Okay, so they were already adjusting for this.
Philip Tetlock: Well, I think when you applied those statistical tools to that situation, the result was worse than you would have otherwise got.
Robert Wiblin: Nice. So if people want to participate in the latest forecasting tournament and see if they can be forecasters themselves, and just contribute to your research, I think they can sign up at … I believe it was hybridforecasting.com.
Philip Tetlock: That’s correct.
Robert Wiblin: You can start making forecasts about all kinds of events including North Korea’s nukes and elections around the world. Wars, when they’ll start and when they’ll end. It’s pretty fun if you’re into that.
Philip Tetlock: It’s going to be a huge challenge and you won’t just be competing against human beings, you’ll be competing against artificial forms of intelligence.
Robert Wiblin: Right. Who are you backing?
Philip Tetlock: Well, we started to talk about that at the beginning, the pros and cons, and you can make a good case that the machines are going to have an advantage on one set of problems and you can make a good case the machines are going to have a very hard time getting any traction on another set like the Yasser Arafat autopsy problems or very idiosyncratic problems. Then there are things that are sort of in the middle that are hard to classify. Actually, I think most of life is in the middle and hard to classify.
Robert Wiblin: Yeah. Let’s talk for a little bit about the latest edition of your book, Expert Political Judgment. It has a new preface. Does it have any other changes?
Philip Tetlock: No, that’s the core change.
Robert Wiblin: Yeah, right. So I’ve read the new preface and I could sense a little bit of frustration on your part because it’s been a bit over 10 years since the original version came out. I think you feel that you’ve been somewhat misunderstood. One of the ways that you’ve been misunderstood is as endorsing populism, the idea that we’ve had enough of experts and experts didn’t know that much, anyway. They don’t know more than random people.
What’s actually going on there? You did say something like that, but then perhaps it’s been distorted.
Philip Tetlock: I mean I was always a big fan of Monty Python and John Cleese. I think John Cleese was a brilliant comedian, he may still be a brilliant comedian, but the John Cleese, Michael Gove perspective that Expert Political Judgment somehow justified not listening to expert opinion about the consequences of Brexit struck me as a somewhat dangerous misreading of the book. It’s not that I’m saying that the experts are going to be right, but I would say completely ignoring them is dangerous.
It’s very hard to strike the right balance between justified skepticism of pseudo-expertise, and there’s a lot of pseudo-expertise out there and there’s a lot of over-claiming by legitimate experts, even. So justified skepticism is very appropriate, obviously, but then you have this kind of know-nothingism, which you don’t want to blur over into that. So you have to strike some kind of balance between the two, and that’s what the preface is about in large measure.
Robert Wiblin: A good point that you raised in a bunch of your books is that pundits who get on television tend to be extremely overconfident because that’s much more entertaining to listen to. I think they’re almost worse than random because they’re extremely confident about exciting things that they want to say that are kind of contrarian.
There’s also this criticism of intellectuals in general that they kind of become too obsessed with their own particular field and they end up with potentially quite radical views just ’cause they’ve studied only economics or that’s the only thing that they know and think about. Do you think it’s the case that academics could even be worse than just a random person off the street? Is there any evidence that you’ve found for that?
Philip Tetlock: I do not have evidence that PhD level academics are worse than the average person off the street. What we do have evidence for is something that Daniel Kahneman called The Attentive Reader of the New York Times hypothesis. He coined that phrase when this research was in a very early phase back in 1987, 1988 when we were colleagues at Berkeley.
What that means is you get a boost from being an attentive reader of The New York Times, or The Wall Street Journal to be bipartisan here. You get a boost from being an attentive reader of the news. So moving from nothing to being an attentive reader of the elite press, The Economist, Financial Times, whatever, the elite press, there is a boost. That boost is substantially greater than the boost you get moving from being an attentive reader of the elite press to having a PhD in China Studies. You hit a point of diminishing marginal predictive returns for knowledge depressingly quickly if you’re an academic.
Robert Wiblin: And you haven’t found that experts can be poorly calibrated, even the hedgehogs?
Philip Tetlock: Well, the hedgehogs were certainly worse calibrated in the early work, especially for their longer-range forecasts. That’s true.
Robert Wiblin: Another thing you say in the preface is that an expert’s biggest victory was over a bunch of Berkeley undergraduates who managed to pull off the improbable feat of doing worse than chance. What’s the deal there?
Philip Tetlock: Well, they just didn’t know what they were doing. They didn’t know how little they knew. So they were overconfident in messy ways. I don’t think it’s possible to do systematically worse than chance for prolonged periods of time. I wouldn’t treat that as a … That was a curious feature of the data back then.
Robert Wiblin: I imagine some people might listen to that and worry that students at these liberal universities are just getting stuck in an echo chamber forming very extreme views that aren’t too justified. I know a bunch of people at Berkeley. I live in Berkeley myself. Should they worry about that or do you think undergraduates around the world are just overconfident?
Philip Tetlock: I worked at Berkeley for many years. I spent more of my academic career at Berkeley than any other university, and that’s going to remain true even if I stay at Penn until I’m 75. The echo chamber hypothesis, I think there is something to worry about there. I worry about campus intolerance. I worry about echo chambers. I think our data are fairly clear on the benefits of cognitive diversity in forecasting.
I used the term cognitive diversity so we don’t conflate it with the way diversity is used in the more political legal sphere.
Robert Wiblin: Do you have any evidence about how much that kind of demographic diversity, like across ethnic or gender lines … Does that match up with cognitive diversity?
Philip Tetlock: I suspect it does to some degree. It’s an empirical question. We’re not in a great position to answer that question because the people who volunteer for these forecasting tournaments tend to be quite disproportionately male, and well-educated, and having somewhat of a quantitative inclination. So a bit of a Silicon Valley personality profile there.
Robert Wiblin: So maybe you could improve the forecast by trying to spread out from that demographic group. I guess that they would also lack diversity of knowledge, potentially.
Philip Tetlock: It’s going to hinge a lot on the match between the forecasters’ predispositions and knowledge base and the types of questions you’re asking, I think. Although, in the early forecasting tournaments they really did strive to ask an extremely heterogeneous array of questions. It was as though IARPA was looking for the forecasting equivalent of general intelligence.
Robert Wiblin: So what other ways do you think … have you found that your work has been misinterpreted that frustrate you?
Philip Tetlock: Well, I think the biggest one is the one you’ve already touched on. It’s the populist know-nothingism which I think is dangerous for democracy. It’s dangerous for the well-being of our society. It is true that there is a lot of expert over-claiming so it’s understandable that some very smart people have veered in a know-nothing direction, but it strikes me as very much an overreaction.
If only, I mean what I’ve always hoped for, is that more major institutions would take forecasting tournaments seriously so we could develop some kind of institutional track record; who tends to be more accurate about what? It would be more difficult, I think, for people like Michael Gove to dismiss skeptics of Brexit, opponents of Brexit, if those opponents had pretty good track records.
Robert Wiblin: Yeah, right. So what should just a typical person do if they hear experts talking about something that they know more about than the listener does? Should they kind of go somewhere between common sense and their judgment and what the expert is saying?
Philip Tetlock: These are really hard questions. I draw a distinction in the new preface to this in the second edition between domains in which expertise is more likely to have really strong … should receive more difference than in other domains. The domains in which expertise should receive … the most effort in the domains in which experts get quick clear feedback on the accuracy of their judgements.
People get quick, clear feedback on the accuracy on their judgements … Well, to take an extreme case, if you’re an expert poker player. If you’re an expert chess player, but there are many other domains of life where people get pretty quick, clear feedback, as well. All other things being equal, you want to have a surgeon who’s done the procedure a thousand times, not 10 times. All other things being equal, you want to have a pilot who’s had a lot of experience, and so forth.
Even the most developed know-nothings I suspect, when to it comes to picking a pilot or picking a surgeon, are willing to concede that there is genuine expertise. The bulk of the criticism, I think, has been aimed at political and economic experts who are making judgements about complex, macro trends, and nobody knows for sure how close the world came to other possible worlds. We don’t know whether … We don’t know how Brexit’s going to work out, but if the economy does substantially better than people were worrying, there’s going to be an argument well, it would have done much better, still, if we had …
You see that kind of argument for even seemingly pretty clear-cut cases of pretty egregious errors. Like say the Bush administration decision to go to war over weapons of mass destruction in Iraq in 2003. There are relatively few defenders of that decision who look back with historical hindsight at all the bloody sequelae of the Iraq war and say boy, that was a really good decision to go there. They say, well, Saddam Hussein was a monster. They say this is so bad, though. Wouldn’t we have been better sticking with the monster we knew rather than…
Robert Wiblin: ISIS.
Philip Tetlock: Well, all of the things that happened between 2003 and 2017. Massive casualties in both Iraq and Syria.
Robert Wiblin: How unevenly distributed is forecasting ability, while we’re discussing elitism? How many people does it take to be as good as a single super forecaster?
Philip Tetlock: Well, that’s a tough one. I don’t-
Robert Wiblin: The number you’ve calculated.
Philip Tetlock: I’m sure there are people who are better at the internal data analysis than I am who could give you a more confident answer, but it’s somewhere between 10 and 35 or so, I would say.
Robert Wiblin: Oh, wow. Oh, wow. Okay. So it’s a lot.
Philip Tetlock: It’s quite a few, yeah.
Robert Wiblin: Interesting. Okay.
Philip Tetlock: I’ve got a fairly wide confidence band around that.
Robert Wiblin: Sure. Okay, so it sounds like it’s quite useful to become a superforecaster. What’s kind of the best training that you’re aware of, for listeners, if they want to get better at forecasting things?
Philip Tetlock: That’s really easy: practice, practice, practice. Same way you get to Carnegie Hall.
Robert Wiblin: Right, okay. So they should sign up to the tournament and start making predictions.
Philip Tetlock: Sign up for tournaments, yeah. Don’t be embarrassed about being wrong. It’s not so terrible to be wrong.
Robert Wiblin: What about calibration training? Is there anything that gets provided to businesses or governments or the super forecasters? Any advice that you give them or processes that they can go through to get even better, other than practicing?
Philip Tetlock: There’s an active debate among researchers in the field about the degree to which calibration training generalizes. I could get you to be well-calibrated in judging poker hands; is that going to generalize to how calibrated you are on the weather? Is that going to generalize to how well-calibrated you are on the effects of rising interest rates?
The effects of transfer of training are somewhat on the modest side so you want to be really careful about this. I would say, oh gosh. You really want to concentrate your training efforts into things you care about. So if it’s philanthropic activities, I think you want to have people make judgements on projects that are quite similar to that to get the maximal benefit.
I’m not saying that transfer of training is zero, although some people do say that. I think it’s too extreme to say that transfer of training is zero and, I think the transfer of training is greater if people not only get practice at doing it but if people understand what calibration is, and if people understand what resolution is, if people understand there’s a degree of tension between calibration and resolution in many situations.
The deeper an understanding people have of the metrics, how the metrics work, what it means to metricize one’s judgements, I think the greater the transfer of training you’re going to get. I think that’s one of the benefits of being a super forecaster is that they really did develop a fairly nuanced understanding of how the subjective probability metrics work.
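To make the calibration and resolution vocabulary concrete, here is a minimal sketch of the standard Murphy decomposition of the Brier score for yes/no questions. It is an illustration only, not the scoring code used in the tournaments; the function name `brier_decomposition` and the toy forecast data are assumptions made for this example.

```python
from collections import defaultdict

def brier_decomposition(forecasts, outcomes):
    """Murphy decomposition of the Brier score for binary events.

    Forecasts are grouped by their exact stated probability, so that
    brier == reliability - resolution + uncertainty.
    Lower reliability means better calibration; higher resolution means
    the forecaster separates events that happen from those that do not.
    """
    n = len(forecasts)
    base_rate = sum(outcomes) / n
    bins = defaultdict(list)
    for p, o in zip(forecasts, outcomes):
        bins[p].append(o)
    reliability = sum(
        len(obs) * (p - sum(obs) / len(obs)) ** 2 for p, obs in bins.items()
    ) / n
    resolution = sum(
        len(obs) * (sum(obs) / len(obs) - base_rate) ** 2 for obs in bins.values()
    ) / n
    uncertainty = base_rate * (1 - base_rate)
    brier = sum((p - o) ** 2 for p, o in zip(forecasts, outcomes)) / n
    return brier, reliability, resolution, uncertainty

# Hypothetical data: ten questions, five of which resolve "yes".
outcomes = [1, 1, 1, 1, 0, 1, 0, 0, 0, 0]
hedger = [0.5] * 10                 # always says 50%
bold = [0.9] * 5 + [0.1] * 5        # commits to 90% or 10%

hedger_scores = brier_decomposition(hedger, outcomes)
bold_scores = brier_decomposition(bold, outcomes)
```

On this toy data the hedger is perfectly calibrated but has zero resolution, while the bold forecaster gives up a little calibration and gains a lot of resolution, ending with the better (lower) Brier score. That is one way to see the tension between the two metrics.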
Robert Wiblin: So some of my friends say that I shouldn’t be too public about my political views because by saying what I believe now, it’s going to make it more difficult for me to update my beliefs in future because I’ve already publicly tied myself to the mast and I’ve said no, I think this candidate’s good or that candidate’s bad. Is that something that people should consider, being more circumspect?
Philip Tetlock: Yes. Some of the classic work that Leon Festinger inspired with cognitive dissonance theory bore on exactly the hypothesis you’re advancing there, which is that public commitment tends to freeze attitudes into place. So it becomes more difficult to change attitudes that you publicly committed yourself to. It becomes more emotionally painful to change them. It also becomes more cognitively difficult because once you make a public commitment to a position, the direction of thought tends to shift. You spend more time thinking of reasons how you could be right and other people could be wrong as opposed to thinking about plausible objections that reasonable critics could raise and that you might factor into your position.
The former form of thinking, which is what happens when people make public commitments, we call that defensive bolstering. The major function of thought is to generate reasons why you’re right and other people are wrong, critics are wrong, as opposed to the more open-minded preemptive self-criticism pattern of thinking that often occurs among more open-minded people prior to commitment. You scan the universe for plausible critics and plausible objections and try to factor them into your position.
So you’re right. Once you take a strong public position, once you become a Paul Krugman or a Bret Stephens or once you really become identified with a world view, it becomes very, very hard to acknowledge that you might have been wrong about this or that. The whole picture starts to crack. It’s a difficult thing.
So yes, there is a case to be made. I try, when I teach, to be very agnostic and I want people from quite a wide range of political views to feel comfortable. Obviously, there’s some political views that are so outrageous that I can’t take seriously, but I try to have an expansive conception of tolerance when I teach. I think the quality of debate is better when minority points of view feel free to be aired. So I think that’s all to the better.
I’m not all that effective as a political advocate, anyway, so I think my comparative advantage is going to be in advancing the science of judgment and choice, and the science of forecasting, in the remaining years of my career. It’s not going to be as an amateur politician.
Robert Wiblin: Are you a super forecaster yourself?
Philip Tetlock: No. I could tell you a story about that. I actually thought I could be, I would be. So in the second year of the forecasting tournament, by which time I should’ve known enough to know this was a bad idea, I decided I would enter into the forecasting competition and make my own forecasts. If I had simply done what the research literature tells me would’ve been the right thing and looked at the best algorithm that distills the most recent forecast or the best forecast and then extremises as a function of the diversity of the views within, if I had simply followed that, I would’ve been the second best forecaster out of all the super forecasters. I would have been like a super, super forecaster.
However, I insisted … What I did is I struck a kind of compromise. I didn’t have as much time as I needed to research all the questions, so I deferred to the algorithms with moderate frequency. I often tweaked them. I often said they’re not right about that, I’m going to tweak this here, I’m going to tweak this here. The net effect of all my tweaking effort was to move me from being in second place, which I would’ve been if I’d mindlessly adopted the algorithmic prediction, to about 35th place. So that was … I fell 33 positions thanks to the cognitive effort I devoted there.
So I think there’s a bit of a morality tale.
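For readers unfamiliar with extremising, the sketch below shows one common way to do it: average the individual forecasts in log-odds space, then push the result away from 50% by a constant factor. This is an assumption-laden illustration, not the tournament's actual algorithm, which as described above also weighted recent and better forecasts and tuned the extremising to the diversity of views. The function name `extremized_mean` and the default factor `a=2.0` are hypothetical.

```python
import math

def logit(p):
    """Convert a probability to log-odds."""
    return math.log(p / (1 - p))

def inv_logit(x):
    """Convert log-odds back to a probability."""
    return 1 / (1 + math.exp(-x))

def extremized_mean(probs, a=2.0):
    """Average forecasts in log-odds space, then scale by `a`.

    a == 1.0 returns the plain log-odds average; a > 1.0 pushes the
    aggregate toward 0 or 1. The value of `a` here is illustrative,
    not an estimate from any tournament data.
    """
    mean_log_odds = sum(logit(p) for p in probs) / len(probs)
    return inv_logit(a * mean_log_odds)

crowd = [0.65, 0.70, 0.75]          # hypothetical individual forecasts
plain = extremized_mean(crowd, a=1.0)
pushed = extremized_mean(crowd, a=2.0)
```

The intuition is that if several forecasters independently arrive at around 70% using different information, the pooled evidence justifies more confidence than any one of them expressed, so `pushed` lands further from 50% than `plain`.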
Robert Wiblin: It’s a little bit like the active traders versus the index fund.
Philip Tetlock: It is. It is. Quite a good parallel.
Robert Wiblin: So in trying to become a better forecaster, are there any heuristics that you’ve adopted or any practices that you’ve adopted that have been helpful from your own work?
Philip Tetlock: I learned a lot from the super forecasters. I mean, it’s wonderful to have research subjects who are your teachers, and the super forecasters have, in many ways, been my teachers. I’ve had the privilege of watching their debates go on over the years and it’s deeply informative.
What have I learned from the super forecasters? What do I do differently? I’m not going to say … Most of the things they do I was aware of as being best practices, but there were certain ways of doing things that I didn’t quite get, and I appreciate them more now that I’ve seen the super forecasters in action up close across a variety of problems. One is this technique that in the book Superforecasting I name after Fermi, I call it Fermi estimation, which is the tendency to take problems and decompose them into their unknowns. That’s a very useful heuristic, especially for these really weird problems like whether Yasser Arafat’s remains are going to score positive for polonium in either the Swiss or the French autopsy.
Watching the super forecasters take a question like that, that initially really just looked like a hopeless head-scratcher, and turn it into something tractable, that was an interesting thing to behold. That ties into another thing they do, which is they look for what would… Kahneman draws a distinction between the inside and the outside view approaches to forecasting, and the super forecasters are much more likely than regular mortals to look at things from the standpoint of the outside view.
So if they’re at a wedding, they’re more likely to say what’s the divorce rate for this sociodemographic group when they’re asked how likely is the couple to get divorced. A, they’re not going to be that offended, they’re going to say yeah, it’s a real probability and it’s a probability … maybe it’s 25%, maybe it’s 35% depending on the sociodemographics. They’re not going to say something like well, look at how in love they seem to be and how happy everybody seems to be, it’s outrageous to imagine them getting divorced; 95% they’re going to stay married.
If you assign a probability because you get sucked up into the atmospherics and the enthusiasm of the moment and you start off around 95%, it’s going to take you a long time to adjust that probability so you’re in the ballpark of plausibility. Much better if you start your plausibility estimates around something that’s plausible.
So on the Yasser Arafat situation, you could ask a question … you say what statistical comparison classes are there for this sort of weird autopsy for a leader of a pharma-terrorist organization. You’d say, well, you could ask the question when major medical and political, legal authorities believe there are reasonable grounds for suspecting an autopsy is needed, how often does the autopsy reveal something that was not … had not previously been revealed and it may indicate murder? If you phrase it that way, you can actually make more headway.
Robert Wiblin: There’s a very active debate in the effective altruism community at the moment about how much people should adopt the inside view versus the outside view, and how much they should just defer to mass opinion on important questions, or just defer to the average view of a bunch of experts. Do you have any views on that? I mean obviously, the argument of … There’s some people promoting a very radical view, basically, that you should almost ignore your own inside view and only look at the reported views of other people, or give your own inside view no more weight than anyone else’s. Do you think that’s a good approach to having more accurate beliefs?
Philip Tetlock: I’ve never been able to impose that kind of monastic discipline on myself. The division between the inside and the outside view is blurry on close inspection. I mean, if you start off your date with a base rate probability of divorce for the couple being 35%, then you … Information comes in about quarrels or about this or about that, you’re going to move your probabilities up or down. That’s kind of inside view information, and that’s proper belief updating.
Getting the mechanics of belief updating right is very tricky and there’s a problem of both cognitive conservatism, under-adjusting your beliefs in response to new evidence, and also the problem of excess volatility, over-adjusting and spiking around too much. Both of which can obviously degrade accuracy.
I think a categorical prohibition on the inside view is way too extreme, but starting off with your first guess with the most plausible outside views is pretty demonstrably sound practice.
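The "start with the outside view, then adjust" practice can be sketched with the odds form of Bayes' rule: begin from a base rate and multiply the odds by a likelihood ratio for each piece of case-specific evidence. The 35% base rate comes from the divorce example above; the likelihood ratio of 2 for "frequent quarrels" is a made-up number for illustration, and the function name `update` is hypothetical.

```python
def update(prior, likelihood_ratio):
    """Bayesian update in odds form.

    likelihood_ratio = P(evidence | event) / P(evidence | no event).
    Values above 1 raise the probability, values below 1 lower it.
    """
    prior_odds = prior / (1 - prior)
    posterior_odds = prior_odds * likelihood_ratio
    return posterior_odds / (1 + posterior_odds)

# Outside-view start: 35% base rate of divorce for this group.
from_base_rate = update(0.35, 2.0)

# Inside-view start: "95% they stay married", i.e. 5% divorce.
from_wedding_glow = update(0.05, 2.0)
```

With the same evidence, the forecaster who started at the base rate moves to a bit over 50%, while the one who started at 5% is still below 10%. A poorly chosen starting point takes a lot of evidence to correct.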
Robert Wiblin: Yeah, so I kind of misspoke when I said not allowing the inside view. It’s more the argument that the outside view that you should take is to look at the probability estimates of people who seem like they know about the area and say well, the experts, on average, say that this is 70% likely and take that perspective and say well, when the experts say 70% likely, in fact it is 70% likely, so my reference class is based on the opinion of experts and their probability judgements. And then you don’t try to tinker with it and add your own judgment. That’s one approach that you could take.
Philip Tetlock: Right. The mistake that I made in the second year of the forecasting tournament was tinkering, whereas if I’d just left well enough alone, I would’ve been a super super forecaster. But only by virtue of cheating, looking at the algorithmic combination of the best forecasters.
So after my second year experience, you’d say well, maybe Tetlock has converted to that view, right? There’s merit in it. I don’t know why, but I worry about it, too. I have this feeling it goes too far, but I’m hard pressed to give you an evidentiary basis for that. I mean, you could imagine … I mean, the argument about index funds is if index funds become too common, no one will have incentives to do the research anymore, and then markets will cease to be efficient, and then presumably it’ll become efficient for … The incentives will return again. So it will oscillate back and forth, neck and neck, in an unstable equilibrium between the professional traders and the index fund people.
Would you want to have a situation where the expert community felt it was getting automatic deference? Oh, gee… Well, I think we’re so far removed from that, anyway. I don’t see it as an imminent problem, but it is interesting that there are thoughtful people inside The Centre for Effective Altruism who take that strong a position; it’s an interesting position. It’s one that I should be sympathetic to given what happened to me in the second year of the forecasting tournament, but it’s one that I’m wary of. And I’m going to have to think hard about exactly why I’m wary of it, but I am wary.
Robert Wiblin: Well, I’m very sympathetic to it. I agree it’s extremely hard to actually impose that rigor on yourself to always just be thinking well, my opinion is no more informative than the stated views of another person. So even though I’m in my head 24/7, I’m not going to give any greater weight to my own perspective on things. Very difficult to do.
My guess is if you were just trying to make accurate predictions about the future, then that would be a pretty good way to do it. As much as you have an unusual contrarian view, it’s true that you might be right, but most of the time you’re wrong if most people don’t agree with you and can’t be convinced.
The tricky implementation question is who are the set of experts who you should be looking at? There’s some people who think that it’s a very broad range of people and you should average them all, and there’s others who think it’s only a small number of people, and in fact, it might be quite straightforward for me to become one of those experts, become one of the top five people, and then we could average the views of me and four other people, which gives you a bit more room to skew the results based on your own opinion.
Philip Tetlock: I think that’s a really good point about what’s the composition of the expert pool from which we’re … on which you’re doing aggregation operations. I think the 2016 election was a cautionary tale about a couple of things. One is the opinion bubbles that many academics live in—that it was quite unthinkable to them that their acquaintances could be that radically unrepresentative.
And the other is just the perils of extremising. I mean, one of the great poll aggregators, Sam Wang at Princeton, had a probability of a Hillary victory at around 95%. Now, Nate Silver, who comes out relatively well in this situation, he comes with a probability just before the election of around 70%. If Nate had been extremising, if he had been saying, look, virtually all the polls are pointing toward a Hillary victory, she even has a margin … She has borderline margins of victory in these swing states, if we aggregate and take these different polls with different strengths and we aggregate them, any given individual poll might have a 55 or 60% Hillary victory.
Many 55 or 60% probabilities from polls of varying quality from different places would tip you toward a Sam Wang kind of 95% extremising judgment, whereas he throttled back. He didn’t go beyond 70%. The reason was … I think it was an intuition that there was a problem with correlated measurement and that these things weren’t as independent as you might hope. They didn’t have the sort of independence that would justify strong extremising, therefore he throttled back. That, in fact, proved to be a pretty good judgment.
Of course, that’s an individual case. It’s a very important case. Perhaps history would have unfolded somewhat differently if there had been a Nate Silver advising the Democratic campaign, but maybe not.
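The point about correlated measurement can be illustrated with a toy calculation. The sketch below treats each poll as a piece of evidence, starts from 50/50, and adds up log-odds. It is not a reconstruction of either aggregator's model. The discount `n / (1 + (n - 1) * rho)` is the textbook "effective sample size" heuristic for equally correlated measurements, borrowed here as an assumption, and the function name `combine_polls` and the inputs are hypothetical.

```python
import math

def combine_polls(probs, rho=0.0):
    """Combine per-poll win probabilities into one estimate.

    rho is an assumed common correlation between poll errors.
    rho == 0 treats every poll as independent evidence (strong
    extremising); rho == 1 treats them all as one poll.
    """
    n = len(probs)
    n_eff = n / (1 + (n - 1) * rho)      # effective number of independent polls
    mean_log_odds = sum(math.log(p / (1 - p)) for p in probs) / n
    total_log_odds = n_eff * mean_log_odds
    return 1 / (1 + math.exp(-total_log_odds))

polls = [0.60] * 8                        # eight polls, each implying 60%
independent = combine_polls(polls, rho=0.0)
correlated = combine_polls(polls, rho=0.5)
```

Under these made-up inputs, treating the eight polls as independent gives a combined probability above 95%, while an assumed correlation of 0.5 brings it back to roughly two in three. How much to throttle back depends entirely on how independent the measurements really are.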
Robert Wiblin: Around that time, a lot of my friends were saying there’s just no way … including some of my family was saying that there’s no way that Trump can win. I said the prediction markets say it’s about 25% so I’m just going to think it’s 25%. I’m not going to sleep easy. There’s more categories than yes, no, and maybe.
Philip Tetlock: Yeah. It’s perfectly conceivable to say we live in a 25% likely world right now as opposed to this 75% likelihood world.
Robert Wiblin: 25% likely things happen all the time.
Robert Wiblin: Are there any things that you’d prefer that we weren’t able to predict, where it would actually be bad if we knew more?
Philip Tetlock: Oh, you mean like when I’m going to die, things like that?
Robert Wiblin: I’m thinking more on the geopolitical level, like what if … Would it be bad if countries could predict the behavior of other countries or other adversaries more accurately? Could that be more unstable?
Philip Tetlock: What an interesting question. I think most intelligence agencies around the world work with the tacit assumption that they’re better off if their probability judgements are more accurate. Is it possible that feeding accurate probability estimates to risk-seeking politicians could lead them to make bad decisions? Absolutely. Intelligence agencies don’t make the decisions; they simply inform the decision makers.
I’m hard pressed to say that accuracy is a bad thing, but I certainly can see that it’s easy to imagine situations in which, feeding accurate probability judgements to decision makers who have reckless utility functions, things can blow up.
Robert Wiblin: Yeah, right. But because-
Philip Tetlock: So then the question becomes do you want intelligence agencies to offer false probability? If the intelligence agency thinks they’re going to misinterpret the probabilities, should the intelligence agencies be retreating back into vague verbiage forecasts and say there’s a distinct possibility and nobody quite understands what anybody’s really said but they … they think they’ve learned something but they haven’t really learned anything.
Robert Wiblin: I imagine there could be some downsides to having intelligence agencies puppeteering politicians.
Philip Tetlock: Well, there’s always a debate about whether intelligence agencies have gone too far in influence. The term is sometimes usurpation. I mean, are they taking over the role of policy makers? My impression is the intelligence agencies are very careful about avoiding that.
I think a lot undoubtedly hinges on the personal relationships that exist between top level intelligence agency executives and the senior policy makers they advise. I suspect when they’re sitting in private talking to each other that it becomes more difficult to figure out who’s …
Robert Wiblin: Right.
Philip Tetlock: In principle, there’s a very clear legal statutory division of labor between intelligence agencies and policy makers. Intelligence agencies are about just the facts, ma’am.
Robert Wiblin: Yeah. At 80,000 Hours, we’re particularly interested in forecasting really extreme and unlikely events, things that might be less than 1% likely to occur, and part of the reason is that we think that these issues are quite neglected and often not very well handled because people are so bad at predicting how likely they are to be. That something that is 1% likely, a typical person might think that it’s only a one in a million chance.
What have you managed to learn, if anything, about predicting these quite unlikely events? Are we less accurate at forecasting those? Have we managed to learn anything?
Philip Tetlock: That is one of the most difficult of all the questions you could’ve asked. How do you go about assessing accuracy in the tails, in the tails of the probability distribution, and how fat are the tails of the probability distribution? It’s a kind of argument Taleb has raised. How good are we at distinguishing events that are 1% likely from events that are .000001% likely? Some people claim they’ve been able to make a lot of money in financial markets because they have a better appreciation of the fragility of the mainstream financial models, which rest too heavily on Gaussian premises, normal bell curve kinds of premises.
So the question you’re asking is how much have we learned about how accurate people can … If I’m understanding your question correctly, you’re asking how much can we learn or have we learned about how accurate people are in the tail regions, say between 5% probability and .00005%. The short answer to that is that it’s very hard.
Robert Wiblin: Because you have to have such a large number of forecasts to even get a few things to happen.
Philip Tetlock: You do, you do. Now, there are interesting things you can do to assess the logical coherence of people’s probability estimates of things like a pandemic that kills 100 million people in the next three years or a nuclear war that kills a substantial number of people. Although, nuclear war may be edging up above the 5% probability zone in some experts’ minds because of events in the Korean peninsula.
But yeah, you can assess the logical coherence of people’s estimates. So you can say well, you can get people’s judgements of how likely events are within a six month period, or a one year period, or a two year period, or a four year period, and if people’s judgements … You can do this in a between subjects design, so different people are making judgements on different things, or you can do it in a subtle repeated measures design where it’s not obvious to people that their consistency is being checked.
If people show a phenomenon known as temporal scope insensitivity, that’s a worrisome sign that they don’t know what they’re talking about when it comes to making judgements of low probability events. Temporal scope insensitivity would mean that you think that events are just about as likely six months into the future as they are four years into the future or eight years into the future.
We saw some of that, for example, in the work we did in the ACE tournament with the Syrian civil war. There were occasions where people seemed to be making judgements that the likelihood of a peace deal in three months, six months, 12 months were more or less the same.
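A temporal scope check of this kind can be sketched in a few lines. The code below assumes forecasts of the form "the event happens by month T" and uses a constant-hazard model as a rough benchmark for how much the probability should grow with the horizon. The constant-hazard assumption, the function names, and the example numbers are all illustrative, not taken from the tournament data.

```python
def implied_probability(p_short, t_short, t_long):
    """Probability of the event by t_long implied by p_short over t_short,
    assuming the chance per unit time stays constant (an assumption)."""
    return 1 - (1 - p_short) ** (t_long / t_short)

def is_monotone(forecasts):
    """forecasts maps horizon in months -> P(event occurs by then).
    Cumulative probabilities must never fall as the horizon grows."""
    ordered = [forecasts[t] for t in sorted(forecasts)]
    return all(a <= b for a, b in zip(ordered, ordered[1:]))

def scope_ratio(forecasts):
    """Longest-horizon forecast divided by the shortest.
    A ratio close to 1 signals near-flat, scope-insensitive forecasts."""
    horizons = sorted(forecasts)
    return forecasts[horizons[-1]] / forecasts[horizons[0]]

# Hypothetical forecasts of a peace deal within 3, 6 and 12 months.
flat = {3: 0.10, 6: 0.11, 12: 0.12}
benchmark_12m = implied_probability(flat[3], 3, 12)
```

Here the forecasts pass the bare logical test of not decreasing with the horizon, but the 12-month number barely moves from the 3-month one, whereas the constant-hazard benchmark would put it at about 34%. A gap that large does not prove the forecaster is wrong, since hazards can change over time, but it is the kind of pattern worth flagging.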
Robert Wiblin: Which can’t be right. Given that it sounds like we don’t have that many research findings here, maybe it would be good if we could just put some questions to super forecasters who have been trained and shown to do well on the 50/50 or the 90% likely to 10% likely range, and then see what they have to say about these tail events. They have a better shot at having accurate judgment there than otherwise, but we’ll never be sure.
Philip Tetlock: I think that’s certainly one useful thing to do. I think another useful thing to do is to check the logical coherence of their probability estimates. Are they showing, for example, temporal scope sensitivity or spatial scope sensitivity? What’s the likelihood of a flu epidemic in 10 Chinese provinces versus 20?
I think logical consistency checks, I think some cautious empirical generalization from people who are good in one zone to people who are good in another zone is warranted, and I think, also, there are techniques for generating question clusters. So if you think that … So Graham Allison just came out with a book on the future of US/China relations. Using historical base rates and what he calls hegemonic transitions in which one great power seems to be being eclipsed or superseded by another great power … As China’s power rises and US power doesn’t rise as fast, the argument being that …
When you look at world history, hegemonic transitions are a particularly dangerous time for major wars. So if you were just to use that historical base rate as he suggests, and he doesn’t have a lot of confidence in the base rate but he says it’s one of the few things we have. He thinks the probability of war is probably a little higher than 50% in the 21st century, which is disturbing. The thought that two super-powers could go to war. You get one of those exponential functions, right? World War I to World War II, the casualty level of World War II, the bloodiest war ever, was about 50 or 60 million. Now we’re going to move up to 500 or 600 million. It’s unthinkably bad.
You take something like a scenario of that sort, the hegemonic transition and the prospect of a major US/China war sometime by the mid-21st century, or you take another big scenario like a tech revolution that’s going to dislocate major labor markets by mid-21st century, and you ask what sorts of things would we expect to observe in the relatively near-term future if this longer term future were likely to occur? We populate our forecasting tournament with clusters of questions that are designed to have some degree of diagnosticity vis a vis the bigger thing that we’re interested in predicting.
I think you could do that for these lower probability events, as well. The likelihood of, I don’t know, some kind of bird flu jumping across species in the next three or four years. What sorts of things do the biochemists and the epidemiologists tell us would need to occur and would we be likely to observe as early warning indicators if we were on that historical trajectory? I’d say the probability is less than five in a thousand or 10 thousand or 100 thousand.
You could ask about advances that would have some diagnosticity vis a vis that target. You could assess how accurate people are as forecasters of that and you could create a kind of early warning index. For good things, an early opportunity index.
Robert Wiblin: Do any of your tournaments have forecasts on the probability of war with China? If it’s actually 20%, then I might have to start making preparations.
Philip Tetlock: Well, it really hinges on what timeframe you’re talking about. Graham Allison was talking about history on a very extended scale. Hegemonic transitions occur over the course of a century, often, and Chinese GDP is approaching US GDP. Chinese military power is still substantially under US but growing fairly rapidly. These are among the warning signs. Then, of course, you have these kind of tinderbox or catalyst situations that could trigger a war like the Korean peninsula or something in the South China Sea or whatnot.
Robert Wiblin: I’m not sure that there will be much to prepare for, anyway. I don’t think there will be anyone left to execute my will, so probably not worth writing.
Robin Hanson has proposed an entire model of government based on prediction markets called Futarchy where you would elect a parliament to kind of choose what outcome society wants, and then you would have people betting on these prediction markets to say how the utility function of society as a whole would be affected, positively or negatively, by different policies that could be implemented. And then, if the prediction market consistently says that implementing a policy would raise the social welfare function that the parliament has chosen, then that would make it become a law. What do you think of that?
Philip Tetlock: I know Robin and Robin was one of the people who was a participant in the earlier forecasting tournament; we experimented with prediction markets ourselves. We think they’re powerful tools for improving accuracy. We don’t use prediction markets that much ourselves, we prefer forecasting tournaments and competitions among individuals and statistical algorithms applied to individual forecasts. We believe that we can perform as well or better than prediction markets using those tools.
I’m sure Robin has thought through these issues about once you have very high policy stakes hinging on either a prediction market or a forecasting tournament, you create incentives for people to try to skew the results.
Robert Wiblin: I think his claim there is that yes, you would have a big business that would come in and try to skew the results and then other people would come along and try to take their money because they’ll just be able to take the other side of the bet easily.
Philip Tetlock: Right, but then there’s that … Was it Keynes or Soros who said the market can stay irrational longer than you can stay solvent?
Robert Wiblin: Yeah, that’s true. And they can also take advantage of the risk aversion on the other side of the bet if they have a lot of access to money.
Philip Tetlock: I’m very sympathetic to the idea that policy makers should be informed by thoughtful, probabilistic judgements distilled through the best known scientific methods. I think the world would be a better place if we proceeded along those lines. I think Robin and I are fundamentally on the same side, we just have slightly different approaches.
Robert Wiblin: Given the value of good foresight to society as a whole, do you think it would be worth trying to create a school, either a high school or perhaps a university course, where the main goal is just to produce incredibly good super forecasters and the entire curriculum is focused around making them have good judgment and learn what they need to know about forecasting?
Philip Tetlock: I think that’s an excellent idea.
Robert Wiblin: I’ve thought about how exciting that would be to try to get that funded and up and running in the Bay Area. Unfortunately, I already have a job, but if anyone’s listening and would like to start a high school focused around a super forecasting tournament, then get in touch and I’ll see if I can help you.
Philip Tetlock: Do you know Stephen Kosslyn?
Robert Wiblin: I actually don’t.
Philip Tetlock: He’s the Chief Academic Officer of Minerva, that university that’s based in the Bay Area.
Robert Wiblin: Yeah, I know some people from Minerva. So that’s a potential hub for it.
Philip Tetlock: From what I’ve read about Stephen Kosslyn’s work, I think he would be quite sympathetic to that idea.
Robert Wiblin: Okay. I’ll send him an email.
Philip Tetlock: And they have a very unusual curriculum.
Robert Wiblin: They seem willing to be quite experimental.
Philip Tetlock: They are. Much more so than mainstream academic institutions where to create a major in super forecasting would be an academic career in itself.
Robert Wiblin: If someone could produce a training course like that, one that would allow people to improve their judgment, wouldn’t that just be of incredible value to hedge funds and people in the financial industry? They’d be able to make just a whole lot more money, you would think. It kind of surprises me that this isn’t a bigger business or that there aren’t more people trying to start for-profits based on the research in your book Superforecasting.
Philip Tetlock: I think that that’s a good point. There certainly has been a lot of interest in the super forecasting project in the financial industry, so I wouldn’t say it’s been ignored by any means. We’re just in the very early stages right now of discovering what is possible to teach and train by way of improving forecasting.
I mean, in each of the four years of the IARPA tournament, the first generation tournament from 2011 to 2015, we did test training modules that produced improvements in probabilistic accuracy of between 8 and 12% each year in randomized controlled experiments. So that was a … I think that was an impressive demonstration, and it’s in the scientific literature now in Jon Baron’s journal Judgment and Decision Making.
That, I think, was an important finding but bear in mind that those training modules were developed in the context of a particular tournament in which certain types of questions were being asked and we needed to create performance sanctions. We were looking for ways of winning and we were not only doing research, we were also trying to win a tournament at the same time.
There are advantages now that we’re not as actively engaged in tournaments at the moment. There are advantages to stepping back and designing these kinds of training systems and contouring them around the needs of specific cognitive niches. That’s one of the things, I think, that we’re doing right now.
Robert Wiblin: At some point in the next decade or two, you’re probably going to retire from active research on these questions. What things do you worry you’re going to leave unsolved that perhaps someone in the listening audience might be able to work on?
Philip Tetlock: One of my deepest hopes, as I lay out in the preface for the new Expert Political Judgment, is that there will be a new generation of forecasting tournaments that focus not just on the accuracy of answers. I mean, everything in the first generation of forecasting tournaments is really about accurate answers. I think the questions are reasonably good, but it wasn’t a major focus of the research to generate good questions.
In fact, we don’t even have a very clear understanding of what it means to generate good forecasting questions. We know what forecasting accuracy means thanks to a range of scoring rules. There are very well defined criteria for judging the accuracy of forecasting judgements, but there aren’t such well defined criteria for judging how probative or insightful or creative forecasting questions are. What sorts of questions should we be asking about the fourth industrial revolution, or about Sino-American relations, or the future of the Eurozone, or the future of Islam? What sorts of questions should we be asking about those topics?
Inserting them into forecasting tournaments in ways that advance the conversation, make the larger societal conversation about those issues richer and deeper. So that requires bridging what I call rigor and relevance. The first generation tournaments are really strong on rigor and assessing probabilistic accuracy, but they weren’t so strong on ensuring relevance to these issues that we all deeply care about. It’s making the connection … it’s bridging the rigor/relevance divide. It’s having questions that are simultaneously rigorous but relevant to big issues.
There’s a tension between doing that because if you want to train people to be more accurate probabilistic judges, you want to ask questions that are going to come due every three or six months that are very specific and well-defined. You’re not going to be asking questions about the atmospherics of US/China relations, you’re going to be asking questions about what happened in the South China Sea, or what happened here or there, and be very grounded.
Developing methods of connecting the very specific rigorous questions you need to give people clear feedback and learn to get better, how to link those kinds of questions with these larger questions that dominate policy debates. I think one of the key tools for that is, I alluded to it briefly earlier, question clustering. Developing clusters of questions about US/China relations or the fourth industrial revolution that cumulatively …
Each question in itself isn’t a deal maker or a deal breaker. Each question in itself doesn’t tell you there’s going to be a nuclear war between US and China or that the Kurzweilian scenario in 2045 is going to come to pass. Each one sheds a little bit of light on that and each question sheds a different type of light so that the questions are … Each question in the cluster has some degree of diagnostic relevance to the big theme, but the questions within the cluster are not highly correlated with each other.
So you want items within the cluster, questions within a cluster that are not tightly correlated with each other but that are highly correlated with a big abstraction. So you want questions about … If it’s fourth industrial revolution, you want questions about, let’s say, driverless Ubers or Lyfts picking people up in major US cities by 2020. You want a question about AI systems winning world poker championships, multi-player world poker championships. You want something about robotic spending in the US exceeding 200 billion dollars by 2020.
You want a lot of questions, but you don’t want them to be too overlapping with each other. You want each of them to be somewhat independent, but you want each of them, also, to be relevant to the same theme. That’s the art of generating probative question clusters that retain … keeping the rigor of a forecasting tournament but gaining more relevance by posing carefully selected question clusters.
That is really hard. We have been working on it, and other people have, too, but it’s a great challenge and I’m very much hoping that we will make progress on that before I retire.
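To make the clustering idea concrete, here is a minimal sketch of one way to pick a cluster: prefer questions that track the big theme, and skip any that overlap heavily with a question already chosen. Everything in it is an assumption for illustration. The function `pick_cluster`, the `max_overlap` threshold, the question labels (including the invented `driverless_trucks_2020`) and all the correlation numbers are hypothetical, and this is not a method described in the interview.

```python
from itertools import combinations


def pick_cluster(relevance, overlap, size, max_overlap=0.5):
    """Greedily choose up to `size` questions.

    relevance: question -> assumed correlation with the big theme.
    overlap:   frozenset({q1, q2}) -> assumed correlation between two questions.
    Candidates are visited from most to least theme-relevant, and a candidate
    is skipped if it overlaps too strongly with any question already chosen.
    """
    chosen = []
    for q in sorted(relevance, key=relevance.get, reverse=True):
        if all(abs(overlap[frozenset((q, c))]) <= max_overlap for c in chosen):
            chosen.append(q)
        if len(chosen) == size:
            break
    return chosen


# Hypothetical relevance of each question to "fourth industrial revolution".
relevance = {
    "driverless_rides_2020": 0.60,
    "driverless_trucks_2020": 0.55,  # invented near-duplicate question
    "ai_wins_poker": 0.50,
    "robotics_spending_200bn": 0.45,
}

# Hypothetical pairwise correlations: mostly low, except the two
# driverless-vehicle questions, which largely measure the same thing.
overlap = {frozenset(pair): 0.2 for pair in combinations(relevance, 2)}
overlap[frozenset(("driverless_rides_2020", "driverless_trucks_2020"))] = 0.9

cluster = pick_cluster(relevance, overlap, size=3)
print(cluster)
```

With these made-up numbers the second driverless question is dropped even though it is more theme-relevant than the poker or robotics questions, because it adds little independent information.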
Robert Wiblin: It seems like to pursue your dream of better predictions in government, there’s kind of two different paths you could take. One would be to become a researcher like you, and another one would be to become an advocate, either as a public intellectual or in the media or perhaps within the government itself. Would you like to see more advocates as well as more researchers?
Which do you think is the greatest bottleneck? Just to demonstrate how controversial this issue can be, when we published our profile on improving decision making in government we opened with a story of the Iraq war and the estimates about weapons of mass destruction. Quite a lot of people got back to us and said you’re completely misunderstanding this. That wasn’t about bad forecasting it was about bad politics, and it wouldn’t have mattered what the intelligence services had said because they were just being bullied by Dick Cheney.
I think that’s not your view, but there’s a whole lot of things to potentially talk about.
Philip Tetlock: Well, it’s certainly the case that intelligence agencies are sometimes ignored and it’s certainly the case intelligence can be politicized. Putting aside whatever happened in Iraq in 2003, I think most intelligence agencies around the world thought there was a better than even chance that Saddam was up to something suspicious. Saddam, we now know with historical hindsight, was actually trying to create that impression because he didn’t want to appear weak.
Robert Wiblin: Didn’t work out too well for him.
Philip Tetlock: No, it didn’t. It didn’t work out too well for him. Virtually everybody thought it was something but the idea that it was a slam dunk or 100% probability that there were weapons of mass destruction, I think that was either an egregious intelligence error or it was somehow politicization. I don’t know for sure what the actual truth of the matter is, but it’s a serious problem either way.
With respect to activism, I’m not temperamentally an activist. I’m temperamentally a researcher and it’s what I know how to do, but I have been tempted by some forms of quasi activism. One of them is a project I’ve started on called the alpha pundit challenge, which is an effort to take the commentary of leading pundits like Tom Friedman or Martin Wolf or very prominent people in the UK, US, elsewhere who offer opinions on various subjects and often offer implicit forecasts as well.
To extract those implicit forecasts and have intelligent readers go through them and impute probability ranges to them. So Larry Summers in 2016 thought the probability of a US recession was one in three and the super forecasters happened to think it was 10%. Didn’t happen. In any given case, we don’t know who was right because they’re probabilities … But cumulatively, if over a number of judgements of this sort we find that Larry Summers really was exaggerating the probability of a recession in 2015 or ’16, that sort of thing will be picked up.
Insofar as pundits come to believe that there is some reasonably well-connected and prominent monitoring agency that’s watching the implicit forecasts they’re making, you could make a case that they would become more circumspect and more thoughtful about what they claim because they would fear that their credibility would wax or wane as a function of accuracy. Right now, pundit credibility does not wane very much as a function of accuracy because they have quite effectively mastered the art of appearing to go out on a limb without actually going out on a limb by using vague verbiage forecasts like there’s a distinct possibility of this happening, which readers will tell you could mean something as low as 20% or something as high as 75 or 80%, which keeps you very comfortably positioned on both sides of maybe no matter what happens.
If Putin does invade the Baltics, you can say …
Robert Wiblin: You’ll always be vindicated.
Philip Tetlock: I told you a distinct possibility, and if he doesn’t, you can say I merely said it was possible. You have this kind of perpetual immunity from falsification. So I think that is a deep problem and I think it needs an activist push. I think it needs an activist big institutional sponsor that’s prepared to push that agenda. It would be great if we could learn how to automate the text analysis, automate the prediction extraction, even automate the probability imputation. Although, I think for probability imputation you really do need to have panels of intelligent readers of different ideological persuasions reading it as a way of showing that the fix is not in, that it’s legit.
Robert Wiblin: And blind them to who said it, presumably.
Philip Tetlock: Depending on the purpose, yes.
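For readers who want to see how imputed probabilities could be scored, here is a minimal sketch using the Brier score, one common scoring rule for probability forecasts. It is an illustration only: the interview does not say which rule the alpha pundit challenge would use, the one-sided binary form below is an assumption, and the function names `brier` and `mean_brier` are hypothetical. The inputs are the figures mentioned above: one in three versus 10% for a 2016 US recession that did not happen, and a "distinct possibility" read as anywhere from 20% to 80%.

```python
def brier(prob, happened):
    """Squared error between a stated probability and the outcome (1 or 0).

    0.0 is a perfect forecast, 1.0 is the worst possible; lower is better.
    """
    outcome = 1.0 if happened else 0.0
    return (prob - outcome) ** 2


def mean_brier(forecasts):
    """Average Brier score over (probability, happened) pairs."""
    return sum(brier(p, h) for p, h in forecasts) / len(forecasts)


# The 2016 US recession example: the event did not happen.
summers_2016 = brier(1 / 3, happened=False)           # about 0.111
superforecasters_2016 = brier(0.10, happened=False)   # about 0.010

# A vague "distinct possibility" cannot be scored until readers impute a
# number, and the score swings widely depending on which number they pick.
low_reading = brier(0.20, happened=True)    # about 0.64 if the event happens
high_reading = brier(0.80, happened=True)   # about 0.04 if the event happens

print(summers_2016, superforecasters_2016, low_reading, high_reading)
```

A single question says little about who was right, so in practice the comparison would rest on `mean_brier` taken over many imputed forecasts from the same pundit.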
Robert Wiblin: Yeah, so that’s a really interesting idea for activism. Coming back to research, if someone wanted to be the next Dr. Tetlock, what should they study early on in their career and where would be the ideal place for them to study?
Philip Tetlock: I don’t think it would probably be where I came from, which is from psychology. It’d be more likely to come from public policy schools, either in business or independent public policy schools like Kennedy or Wilson, or perhaps economics.
Robert Wiblin: I thought that you might say statistics or potentially even computer science given you’re going in an AI direction now.
Philip Tetlock: Well, I’m not an AI researcher myself, I’m studying predictions about AI which is really quite a different thing. I hope to be a consumer of AI techniques in some of the work I’m doing, but I’m not actually generating those techniques myself. I was thinking a field that combined technical expertise with another, like economics would be more likely, but it could be the right combination includes AI.
Robert Wiblin: Are there any mentors or PhD supervisors or particular schools that you think would be great for people to go to for grad school?
Philip Tetlock: Meaning if their goal is-
Robert Wiblin: If they wanted to specialize in forecasting and accuracy and good judgment?
Philip Tetlock: That’s a surprisingly difficult question.
Robert Wiblin: Are there people at UPenn who could be good to study with?
Philip Tetlock: There are many good people in many different places. It’s not that I’m not thinking of anybody at all, it’s just that I’m having too many competing associations. I’m thinking there are good people, very good people, at all of the usual suspect places.
Robert Wiblin: Okay. So you’re not short of choices.
Philip Tetlock: There’s not a particular place that stands out to say oh, you must go there because that’s where the future is happening. I don’t see that.
Robert Wiblin: Interesting. What about if you’re thinking about potentially going to grad school to work on this kind of research? Are there any, again, conferences or internships or volunteering that people could do to build a professional network in the area and experiment with whether it’s the right thing for them to do long term?
Philip Tetlock: There are. We certainly work with a lot of different people, so that would be … We’re one place and people should feel free to write to me. Because of the demands on my time I’ll probably have to refer it to some segment of my collaboration network.
I’m very, very fond of a particular computer scientist here at Penn who I think is just marvelous, and he was the guy who invented the algorithm that was the winning algorithm in the ACE tournament that I was talking about earlier that goes under the extremising label like the Osama/Obama story that I told earlier. His name is Lyle Ungar and he’s a terrific mentor and a terrific human being, and he played a huge role in the success of The Good Judgment Project.
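As a rough illustration of what extremising means, the sketch below averages several forecasters' probabilities and then pushes the average away from 50%. This is an assumed, commonly described form of the transform, not necessarily the exact algorithm used in the tournament. The function names `extremize` and `aggregate`, the exponent `a = 2.0` and the three input probabilities are all hypothetical.

```python
def extremize(p, a=2.0):
    """Push probability p away from 0.5 using p^a / (p^a + (1 - p)^a).

    a > 1 makes the forecast more extreme, a = 1 leaves it unchanged.
    The default a = 2.0 is an arbitrary choice for illustration.
    """
    return p ** a / (p ** a + (1 - p) ** a)


def aggregate(probs, a=2.0):
    """Average the individual forecasts, then extremize the average."""
    mean = sum(probs) / len(probs)
    return extremize(mean, a)


# Three hypothetical forecasters who each lean towards "yes".
crowd = [0.6, 0.7, 0.8]
pooled = aggregate(crowd)  # the plain average is 0.7; this comes out near 0.84

print(pooled)
```

The intuition is that when forecasters who hold different pieces of information all lean the same way, the pooled evidence is stronger than any single forecast suggests, so the aggregate is moved closer to 0 or 1.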
Robert Wiblin: Outside of academic research, are there any organizations that you think would be really helpful to work for? I guess you’ve got quite a lot of funding through IARPA, so I suppose it would be good to have supporters continuing to work in IARPA in the intelligence services. Is there anywhere else?
Philip Tetlock: Well, Open Philanthropy has been supportive of the work. The Carnegie Corporation has been somewhat supportive of the work. MacArthur has expressed some interest. Are you talking about institutions like The World Bank or places like that?
Robert Wiblin: Potential funders is certainly one, but I guess I was also wondering are there any businesses that are doing cutting-edge research in this area?
Philip Tetlock: Well, my friend Michael [inaudible 00:53:42], who is an adjunct professor of finance at Columbia; you’ve probably heard of him. Wherever he goes there’s interesting activity on the intersection between judgment, choice and finance. He has a really deep and visceral understanding of both the psychology of forecasting and the statistics of forecasting.
Robert Wiblin: We often give advice to young people on how they can build up their skills in order to advance their career. I guess one way you can build up your scores is to try to become a forecaster. Do you know of any examples of these random retirees or people who are just at home, in your tournaments, becoming super forecasters and then getting hired by Goldman Sachs or the intelligence services to produce a good forecast for them?
Philip Tetlock: You’d probably be better asking Terry Murray, who is the former project manager of The Good Judgment Project and now the CEO of Good Judgment Inc, which is another place, by the way, interns might want to consider going. Good Judgment Inc could be a good place to start.
Robert Wiblin: Is there any other advice, in general, that you would like to give people who might want to contribute to your research agenda in the future?
Philip Tetlock: Most people, I think, who are listening to this podcast are … have career tracks and lives and there’s an embeddedness. You’re on a certain track. I think you want to be careful not to ask too much of people. All I would say is it would be helpful if people read the news in a different way. If they read the news from the standpoint of the principles of super forecasting, if they think about what are the implicit probabilistic claims that are in here.
Are people trying to influence me? Are the people who are making the claims playing a pure accuracy game? If they were playing a pure accuracy game, wouldn’t they be more clear-cut about what exactly they’re claiming? The climate of political debate has deteriorated to such a degree that … I guess I’m an optimist. I’m an enlightenment optimist. I believe that enlightenment values will ultimately triumph. I don’t know why I still continue to believe that so much, but I do. It may be irrational belief perseverance on my part, but I think that the current polarization, nasty polarization, and politicization, and the name calling, and so forth, that politicization will eventually die down and that people will recognize the value of intellectual temperance and thoughtfulness.
It is a tough sell. The ideals of super forecasting, I think, are a tough sell in any administration in the 20th century or early 21st century. It’s become an even tougher sell now. I do hold out this hope that the long … that the long arc of history is going to bend toward enlightenment values because there is such a competitive advantage to be had in doing that, but you’re getting me into a somewhat morose philosophical mood here, Robert.
Robert Wiblin: I’m not sure when you say that people will produce more accurate forecasts and politics will be better, whether that’s a prediction or just more of a dream.
Philip Tetlock: Well, institutions that generate more … the institutions that are guided by more realistic probability estimates of the consequences of their policy options will, on average, do better over the long term. I think that’s a fairly uncontroversial claim.
Robert Wiblin: Well, it’s been fantastic to talk to you. You’re a very busy guy and I’ve taken up quite a lot of your time, so I should let you go. Thanks so much for making time for the 80,000 Hours podcast.
Philip Tetlock: It’s a real pleasure talking with you. I’m very sympathetic to the goals of effective altruism, and I’m always glad to talk with you guys.
Robert Wiblin: Great, all right. Well, maybe we can talk again about future findings in your research.
Philip Tetlock: Yes, I think we should do that. This has been a very enjoyable conversation.
Robert Wiblin: Fantastic, have a great day.
Philip Tetlock: Okay, bye.
As always I hope you enjoyed that episode. We’re going to take a break from the podcast for a couple of weeks while 80,000 Hours is working on its annual review. But I have a number of great interviews already recorded including ones with Professor Will MacAskill about moral philosophy and the effective altruism community, Jan Leike from DeepMind about advances in artificial intelligence, Anders Sandberg from the Future of Humanity Institute about colonising space, and Michelle Hutchinson about how to set up a new academic institute.
So we’ll be back in December with plenty of new content.
Remember if you’d like to try improving your forecasting ability then you can sign up for the latest forecasting tournament at hybridforecasting.com.
And if you’re at all interested in doing similar research to that which Tetlock has been doing the last 35-40 years, or are keen to find other very valuable social science research, then you should definitely get in touch with us to get free personalised coaching. You can do that with the link in the show notes or the associated blog post.
We know a lot of people in this area and can offer plenty of guidance to help you advance your career.
Thanks so much, talk to you in a couple of weeks.