I am actually quite skeptical of most of the stories that people tell about why an intervention worked in one place and why it didn’t work in another place. Because I think a lot of those stories are constructed after the fact, and they’re just stories that I don’t think are very credible. But that said, I don’t want to say that we can learn nothing. I would just say that it’s very, very hard to learn things. But, what’s the alternative?

Eva Vivalt

If we have a study on the impact of a social program in a particular place and time, how confident can we be that we’ll get a similar result if we study the same program again somewhere else?

Dr Eva Vivalt is a lecturer in the Research School of Economics at the Australian National University. She compiled a huge database of impact evaluations in global development – including 15,024 estimates from 635 papers across 20 types of intervention – to help answer this question.

Her finding: not confident at all.

The typical study result differs from the average effect found in similar studies so far by almost 100%. That is to say, if existing studies of an education program find that it improves test scores by 0.5 standard deviations on average, the next result is as likely to be negative or greater than 1 standard deviation as it is to be between 0 and 1 standard deviations.
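To make the arithmetic behind that claim concrete, here is a minimal sketch. It assumes "typical" means the median study and measures each new result's gap from the prior average as a share of that average; the function name and the list of new results are hypothetical illustrations, not figures from Eva's database. A 100% gap from 0.5 lands exactly at 0 or at 1, which is why a median gap of about 100% puts roughly half of new results outside the 0 to 1 range.

```python
def relative_deviation(new_result: float, prior_mean: float) -> float:
    """Absolute gap between a new result and the prior mean, as a share of the prior mean."""
    if prior_mean == 0:
        raise ValueError("prior_mean must be non-zero")
    return abs(new_result - prior_mean) / abs(prior_mean)


# 0.5 standard deviations is the prior average used in the text above.
prior_mean = 0.5

# Hypothetical new study results, in standard deviations (illustration only).
hypothetical_results = [-0.1, 0.0, 0.25, 0.5, 1.0, 1.2]

for result in hypothetical_results:
    deviation = relative_deviation(result, prior_mean)
    inside_range = 0.0 <= result <= 1.0
    print(f"result={result:+.2f}  deviation={deviation:.0%}  between 0 and 1: {inside_range}")
```

Results with a deviation above 100% are exactly the ones that are negative or larger than 1 standard deviation.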

She also observed that results from smaller studies conducted by NGOs – often pilot studies – tended to look promising. But when governments tried to implement scaled-up versions of those programs, performance would drop considerably.

For researchers hoping to figure out what works and then take those programs global, these failures of generalizability and ‘external validity’ should be disconcerting.

Is ‘evidence-based development’ writing a cheque its methodology can’t cash?

Should we invest more in collecting evidence to try to get reliable results?

Or, as some critics say, is interest in impact evaluation distracting us from more important issues, like national economic reforms that can’t be tested in randomised controlled trials?

We discuss these questions as well as Eva’s other research, including Y Combinator’s basic income study where she is a principal investigator.

Get this episode by subscribing to our podcast on the world’s most pressing problems and how to solve them: type 80,000 Hours into your podcasting app.

Topics include:

  • What is the YC basic income study looking at, and what motivates it?
  • How do we get people to accept clean meat?
  • How much can we generalize from impact evaluations?
  • How much can we generalize from studies in development economics?
  • Should we be running more or fewer studies?
  • Do most social programs work or not?
  • The academic incentives around data aggregation
  • How much can impact evaluations inform policy decisions?
  • How often do people change their minds?
  • Do policy makers update too much or too little in the real world?
  • How good or bad are the predictions of experts? How does that change when looking at individuals versus the average of a group?
  • How often should we believe positive results?
  • What’s the state of development economics?
  • Eva’s thoughts on our article on social interventions
  • How much can we really learn from being empirical?
  • How much should we really value RCTs?
  • Is an Economics PhD overrated or underrated?

The 80,000 Hours podcast is produced by Keiran Harris.


I guess one main takeaway as well is that we should probably be paying a little more attention to sampling variance in terms of thinking of the results of studies. Sampling variance is just the kind of random noise that you get, especially when you’ve got very small studies. And some small studies just happen to find larger results. So I think if we try to separate that out a bit and a little bit down-weight those results that are coming from studies of small sample sizes, that certainly helps a bit.
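One common way to down-weight noisy small studies is inverse-variance weighting, where each estimate counts in proportion to one over its squared standard error. The sketch below is an illustration of that general technique, not necessarily the method used in Eva's papers, and the study values are made up for the example.

```python
from math import sqrt

# Hypothetical (effect estimate, standard error) pairs, in standard deviations.
# The first two mimic small, noisy studies with large effects; the last two
# mimic larger, more precise studies with smaller effects.
studies = [(0.80, 0.40), (0.60, 0.30), (0.10, 0.05), (0.15, 0.06)]


def simple_mean(studies):
    """Unweighted average of the effect estimates."""
    return sum(effect for effect, _ in studies) / len(studies)


def inverse_variance_mean(studies):
    """Average weighted by 1 / SE^2, plus the standard error of that average."""
    weights = [1.0 / se ** 2 for _, se in studies]
    pooled = sum(w * effect for w, (effect, _) in zip(weights, studies)) / sum(weights)
    pooled_se = sqrt(1.0 / sum(weights))
    return pooled, pooled_se


unweighted = simple_mean(studies)
pooled, pooled_se = inverse_variance_mean(studies)
print(f"simple mean:             {unweighted:.3f}")
print(f"inverse-variance mean:   {pooled:.3f} (SE {pooled_se:.3f})")
```

With these made-up numbers, the weighted average sits much closer to the precise studies than the simple average does, because the small studies contribute very little weight.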

I’ve got this other paper with Aidan Coville of the World Bank where we are looking at precisely some of the biases that policy-makers have. One of the bigger ones is that people are perfectly happy to update on new evidence when it’s good news, but they really hate to update based on bad news. So for example, say you think that a conditional cash transfer program will increase enrollment rates by three percentage points. Then we randomly show you some information that says either it’s five or it’s one. Well, if we show you information that says it’s five, you’re like, “Great, it’s five.” If we show you information that says it’s one, you’re like, “Eh, maybe it’s two.” So we see that kind of bias.
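The asymmetry in that example can be put as a single number: how far the belief moved, as a fraction of the gap between the prior and the new information. This is a rough sketch using only the figures from the quote (prior of three, signals of five and one, updated beliefs of five and two); the function name is hypothetical and this is not presented as the measure used in the paper.

```python
def update_weight(prior: float, signal: float, posterior: float) -> float:
    """Fraction of the gap between prior and signal that the belief moved.

    1.0 means the person adopted the new information fully; 0.0 means no update.
    """
    if signal == prior:
        raise ValueError("signal equals prior, so there is nothing to update on")
    return (posterior - prior) / (signal - prior)


# Numbers from the example above, in percentage points of enrollment.
good_news = update_weight(prior=3.0, signal=5.0, posterior=5.0)
bad_news = update_weight(prior=3.0, signal=1.0, posterior=2.0)

print(f"update on good news: {good_news:.0%}")
print(f"update on bad news:  {bad_news:.0%}")
```

In the quoted example, the good news is taken on board in full while the bad news only moves the belief halfway.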

At this point I’m getting roughly an email every week asking for advice on collecting priors, because I think researchers are very interested in collecting priors for their projects. It makes sense from their perspective. They’re highly incentivized to do so because it helps not just with all this updating work, but also for them personally. It’s like, “Well, now nobody can say that they knew the results of my study all along.” Like, “I can tell them, ‘Well, this is what people thought beforehand, and this is the benefit of my research.’” And also, if I have null results, then it makes the null results more interesting, because we didn’t expect that.

So, the researchers are incentivized to gather these things, but I think that, given that, we should be doing it a little bit more systematically to be able to say some interesting things. For example, one thing is that people’s priors might, on average, be pretty accurate. This is what we saw when we gathered our researchers’ priors: they were quite accurate on average. As individuals, they were off by quite a lot. There’s the kind of wisdom of the crowds thing.

But, if you think that you could get some wisdom of the crowds and that people are pretty accurate overall, if you aggregate, well that actually suggests that it could be a good yardstick to use in those situations where we don’t have RCTs. And it could even help us figure out where should we do an RCT, where are we not really certain what the effect will be and we need an RCT to come in and arbitrate, as it were.
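The wisdom-of-the-crowds point can be illustrated with a small simulation. This is a sketch under strong assumptions that are not from the study: each forecaster's prior is the true effect plus independent, unbiased noise, and the true effect, noise level, and number of forecasters are all made-up values.

```python
import random
from statistics import mean

rng = random.Random(0)  # fixed seed so the run is reproducible

TRUE_EFFECT = 3.0       # hypothetical true effect, in percentage points
NOISE_SD = 2.0          # hypothetical spread of individual forecasts
N_FORECASTERS = 50      # hypothetical number of researchers surveyed

# Each forecast is the truth plus independent noise (an assumption).
forecasts = [TRUE_EFFECT + rng.gauss(0.0, NOISE_SD) for _ in range(N_FORECASTERS)]

# Average error of individuals versus the error of the averaged forecast.
mean_individual_error = mean(abs(f - TRUE_EFFECT) for f in forecasts)
crowd_error = abs(mean(forecasts) - TRUE_EFFECT)

print(f"average individual error: {mean_individual_error:.2f}")
print(f"error of the crowd mean:  {crowd_error:.2f}")
```

The error of the averaged forecast can never exceed the average individual error, and it shrinks as individual errors cancel out. If forecasters share a common bias, averaging does not remove it, so the independence assumption matters here.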

About the show

The 80,000 Hours Podcast features unusually in-depth conversations about the world's most pressing problems and how you can use your career to solve them. We invite guests pursuing a wide range of career paths — from academics and activists to entrepreneurs and policymakers — to analyse the case for and against working on different issues and which approaches are best for solving them.

The 80,000 Hours Podcast is produced and edited by Keiran Harris. Get in touch with feedback or guest suggestions by email.
