I am actually quite skeptical of most of the stories that people tell about why an intervention worked in one place and why it didn’t work in another place. Because I think a lot of those stories are constructed after the fact, and they’re just stories that I don’t think are very credible. But that said, I don’t want to say that we can learn nothing. I would just say that it’s very, very hard to learn things. But, what’s the alternative?

Eva Vivalt

If we have a study on the impact of a social program in a particular place and time, how confident can we be that we’ll get a similar result if we study the same program again somewhere else?

Dr Eva Vivalt is a lecturer in the Research School of Economics at the Australian National University. She compiled a huge database of impact evaluations in global development – including 15,024 estimates from 635 papers across 20 types of intervention – to help answer this question.

Her finding: not confident at all.

The typical study result differs from the average effect found in similar studies so far by almost 100%. That is to say, if all existing studies of an education program find that it improves test scores by 0.5 standard deviations – the next result is as likely to be negative or greater than 1 standard deviation, as it is to be between 0-1 standard deviations.

She also observed that results from smaller studies conducted by NGOs – often pilot studies – would often look promising. But when governments tried to implement scaled-up versions of those programs, their performance would drop considerably.

For researchers hoping to figure out what works and then take those programs global, these failures of generalizability and ‘external validity’ should be disconcerting.

Is ‘evidence-based development’ writing a cheque its methodology can’t cash?

Should we invest more in collecting evidence to try to get reliable results?

Or, as some critics say, is interest in impact evaluation distracting us from more important issues, like national economic reforms that can’t be tested in randomised controlled trials?

We discuss these questions as well as Eva’s other research, including Y Combinator’s basic income study where she is a principal investigator.

Get this episode by subscribing to our podcast on the world’s most pressing problems and how to solve them: type 80,000 Hours into your podcasting app.

Questions include:

  • What is the YC basic income study looking at, and what motivates it?
  • How do we get people to accept clean meat?
  • How much can we generalize from impact evaluations?
  • How much can we generalize from studies in development economics?
  • Should we be running more or fewer studies?
  • Do most social programs work or not?
  • The academic incentives around data aggregation
  • How much can impact evaluations inform policy decisions?
  • How often do people change their minds?
  • Do policy makers update too much or too little in the real world?
  • How good or bad are the predictions of experts? How does that change when looking at individuals versus the average of a group?
  • How often should we believe positive results?
  • What’s the state of development economics?
  • Eva’s thoughts on our article on social interventions
  • How much can we really learn from being empirical?
  • How much should we really value RCTs?
  • Is an Economics PhD overrated or underrated?

The 80,000 Hours podcast is produced by Keiran Harris.


I guess one main takeaway as well is that we should probably be paying a little more attention to sampling variance in terms of thinking of the results of studies. Sampling variance is just the kind of random noise that you get, especially when you’ve got very small studies. And some small studies just happen to find larger results. So I think if we try to separate that out a bit and a little bit down-weight those results that are coming from studies of small sample sizes, that certainly helps a bit.

I’ve got this other paper with Aidan Coville of the World Bank where we are looking at precisely some of the biases that policy-makers have. And one of the bigger ones is that people are perfectly happy to update on new evidence when it’s good news. But people really hate to update based on bad news. So for example, say you think that the effect of a conditional cash transfer program on enrollment rates is that maybe it’ll increase enrollment rates by three percentage points. And then we can randomly show you some information that either says it’s five or it’s one. Well if we show you information that says it’s five, you’re like “great, it’s five.” If we show you information that says it’s one, you’re like “eh, maybe it’s two.” So we see that kind of bias.
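One way to make the asymmetry in that example concrete is to compute how much of the gap between the prior and the new information each update closes. This is only a sketch using the illustrative numbers from the quote (a prior of three, signals of five and one); the function name `implied_update_weight` is made up for this example and is not taken from the paper.

```python
def implied_update_weight(prior, signal, posterior):
    """Fraction of the gap between prior and signal that the update closes."""
    return (posterior - prior) / (signal - prior)


prior = 3.0  # believed effect on enrollment, in percentage points

# Good news: the signal says 5 and the belief moves all the way to 5.
good_news_weight = implied_update_weight(prior, signal=5.0, posterior=5.0)

# Bad news: the signal says 1 but the belief only moves to 2.
bad_news_weight = implied_update_weight(prior, signal=1.0, posterior=2.0)

print(good_news_weight, bad_news_weight)  # 1.0 0.5
```

In this stylized case the good news is taken at full weight and the bad news at half weight, which is the kind of gap the quote describes.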

I’m getting at this point an email every week roughly asking for advice on collecting priors, because I think researchers are very interested in collecting priors for their projects because it makes sense from their perspective. They’re highly incentivized to do so because it helps with, not just with all this updating work, but also for them, personally, it’s like, “Well now nobody can say that they knew the results of my study all along.” Like, “I can tell them ‘well, this is what people thought beforehand and this is the benefit of my research.’” And also, if I have null results, then it makes the null results more interesting, because we didn’t expect that.

So, the researchers are incentivized to gather these things but I think that, given that, we should be doing that a little bit more systematically to be able to say some interesting things. For example, one thing is that people’s priors might, on average, be pretty accurate. So this is what we saw with the researchers: when we gathered our researchers’ priors, they were quite accurate on average. As individuals, they were off by quite a lot. There’s the kind of wisdom of the crowds thing.

But, if you think that you could get some wisdom of the crowds and that people are pretty accurate overall, if you aggregate, well that actually suggests that it could be a good yardstick to use in those situations where we don’t have RCTs. And it could even help us figure out where should we do an RCT, where are we not really certain what the effect will be and we need an RCT to come in and arbitrate, as it were.
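The wisdom-of-the-crowds point can be illustrated with a small simulation. This is a sketch under strong assumptions that do not come from the study: the true effect, the number of forecasters and the size of their errors are all invented, and each forecaster is assumed to be unbiased. Averaging helps much less if everyone errs in the same direction.

```python
import random
import statistics

random.seed(0)

true_effect = 3.0  # hypothetical true effect, percentage points

# Hypothetical forecasters: unbiased on average, but individually noisy.
forecasts = [random.gauss(true_effect, 2.0) for _ in range(50)]

# How far off the typical individual is.
individual_errors = [abs(f - true_effect) for f in forecasts]
typical_individual_error = statistics.mean(individual_errors)

# How far off the average forecast is.
crowd_error = abs(statistics.mean(forecasts) - true_effect)

print(f"typical individual error: {typical_individual_error:.2f}")
print(f"error of the average forecast: {crowd_error:.2f}")
```

The error of the average forecast can never exceed the average individual error, and when individual errors point in different directions it is usually far smaller.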


Robert Wiblin: Hi listeners, this is the 80,000 Hours Podcast, the show about the world’s most pressing problems and how you can use your career to solve them. I’m Rob Wiblin, Director of Research at 80,000 Hours.

Before we get into it just a few quick announcements.

If you think of yourself as part of the effective altruism community you should fill out the 2018 effective altruism survey. This helps keep track of who is involved, how they’re trying to improve the world, and what they believe. I’ll put a link in the show notes and associated blog post.

If you want to get a high impact job you should check out our job board, which was recently updated with new vacancies. You can find it at 80000hours.org/job-board/. It’s where we list the positions we’re most excited about filling.

Finally I just wanted to give a shout out to our producer Keiran Harris who has been doing a great job editing the episodes and generally helping to improve the show.

And without further ado, I bring you Eva Vivalt.

Robert Wiblin: Today I’m speaking with Dr. Eva Vivalt. Eva is a lecturer in the Research School of Economics at the Australian National University and the founder of AidGrade, a research institute that pools together hundreds of global development studies in order to provide actionable advice.

Eva has a PhD in Economics and an MA in Mathematics from UC Berkeley, and an MPhil in Development Studies from Oxford University. She’s also previously worked at the World Bank. She’s a vegan, a Giving What We Can member, and principal investigator on Y Combinator Research’s randomized controlled trial of the basic income.

Thanks for coming on the podcast.

Eva Vivalt: Thank you. Great to be here.

Robert Wiblin: So, we’re going to talk a bit about your career as an economist and the various findings that you’ve had in your research over the last five years. But first, what are your main research interests these days? Is there any way of summarizing it? Is there a core topic that you’re looking into?

Eva Vivalt: So a lot of my work is on really how to make better evidence-based policy decisions. And part of that, that I’ve recently gotten into, is looking more at priors that people may have, both policy makers and researchers. And there’s lots, actually, to say about priors. But I think that’s a direction that my research has gone recently that actually relates quite well to some of the previous stuff, the linkage being evidence-based policy.

Robert Wiblin: There’s a lot of heavy material to cover there later on in the show. But to warm up let’s talk first about Y Combinator’s basic income study – what is the study looking at and what motivates it?

Eva Vivalt: Yeah, no, I’m really excited by this study. So essentially the study is to give out $1000 per month for either three or five years to a bunch of individuals who are randomly selected. So the randomization is at this individual level, it’s not actually like giving, for example, everybody in an area the program. There’s a control group as well that still gets some nominal amount too, hopefully, so that they continue to answer surveys and such. We’re looking at a variety of outcomes. Things like time use, for example, like most economists would say that if you give people money they should actually work a little bit less, that’s completely a rational thing to do. But if they are working less, what are they doing with their time instead? Because it could be actually really good for people to work less if they are, for example, getting more education so they can get a better job in the future. Or taking care of their kids, et cetera, et cetera. There’s all sorts of productive uses of time that one might find otherwise adding a lot of value. There’s health outcomes, education outcomes.

I should say this program is targeted to relatively poorer individuals and relatively younger individuals because the thought is it could actually change people’s trajectory over time. Those are kind of the areas where we might expect the money to go a bit farther and to see slightly larger effects.

Robert Wiblin: Interesting. Okay. So, given that it’s from Y Combinator, which is a tech startup accelerator, is it kind of motivated by the concern that everyone’s going to lose their jobs because of technology? Or is it just more prosaic issues around equality and lack of opportunity in the United States?

Eva Vivalt: I think there’s a variety of motivations here. So I think in the background somewhere there is this concern about technology potentially displacing workers. I think there’s also some genuine utopian ideal of people should be able to do …

Robert Wiblin: They shouldn’t have to be wage slaves.

Eva Vivalt: Yeah, yeah, yeah. It’s not all negative if people lose their jobs, because people could lose jobs in a good way. Nobody actually wants hard work in some regards. To be fair, it’s not really a great test of what happens if people lose jobs per se, because to do that what you’d want is a randomized controlled trial in which you fire people, which is not likely to happen anytime soon.

Robert Wiblin: Not going to get past the ethics board.

Eva Vivalt: Yeah. But I think this is more motivated by the idea that you can imagine some worlds in which what you would want to do is expand the social safety net. And if you are expanding the social safety net, this could be one relatively efficient way of doing so, and so let’s look at what the effects of this particular kind of program would be. And you might imagine that some kind of program like this would probably start out with targeting relatively poorer individuals even though a true basic income program would target everybody.

Robert Wiblin: So what do you expect to find? Given past studies that are similar. And also, how many people are in this study?

Eva Vivalt: We have about 1000 people in the treatment group, 2000 control and then this larger super control group for which we just have administrative data. It’s actually a decent sized experiment and there’ve not been … the most similar studies in the States are some of the negative income tax experiments and EITC from the ’70s, there’s also I guess the Alaska Permanent Fund. The other similar ones I would say would be Moving to Opportunity and the Oregon Health Insurance Experiment. But these are all like … they’ve all got quite a lot of differences actually.

So, Alaska Permanent Fund; everybody just gets a certain transfer. So, that one actually is universal. It’s not very much of a transfer and you’ve got to use different approaches to evaluate it since everybody gets it. Oregon health insurance, well obviously that’s health insurance. Negative income tax experiments, those were quite old and had a lot of differential attrition issues.

Like I say, by now I think most economists would expect some effects on labor supply. There’s loads of papers on labor supply elasticity. I think there’s a little bit less on what people do with their time otherwise. One thing we’re doing is designing this custom time use app that people can put on their phones so we can sort of ping them and ask, “Hey, what are you doing right now?”

Robert Wiblin: Is there a key uncertainty that it’s trying to resolve? Like will people quit their jobs? Or will they become happier? Or will they spend more time on leisure with their family? That kind of thing.

Eva Vivalt: Yeah, so rather than one key outcome, we’ve got like lots of different families of outcomes. So we’ve got health outcomes, we’ve got education outcomes, we’ve got financial health, we’ve got subjective wellbeing, we’ve got this kind of employment/time use/income stuff. We’ve actually even got some more behavioral things like political outcomes, do people have more or less inter-group prejudice and other-regarding preferences, that kind of thing. So, we’ve got actually quite a lot of things. Also doing some things relating to work on scarcity, that people under a lot of economic pressure might make worse decisions. Is that a short-term effect? A long-term effect? That kind of thing.

So there’s actually quite a lot of outcomes and sometimes when I talk to people about it they get a little bit confused. We’re looking at so many different things, but I think for a study of this kind of cost it’s actually really good to get a lot of different outcomes from it.

Robert Wiblin: I just quickly did the maths and it looks like it should cost like 100 million dollars.

Eva Vivalt: Not quite, but still quite high up there. Yeah.

Robert Wiblin: I was just thinking, if you’ve got 1,000 people and you’re giving them $12,000 each, that would come to $12 million each year of the study, plus then the control group and all the other on-costs and so on. It depends how long you run it, but it’s a pretty serious expense.
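The back-of-the-envelope arithmetic can be written out as follows. This is a rough sketch covering only the transfers to the treatment group, using the figures mentioned in the conversation (about 1,000 people at $1,000 per month). The split between the three-year and five-year arms is not given, so the two totals are only bounds, and payments to the control group, survey costs and overheads are left out.

```python
treatment_group = 1_000   # approximate number of people receiving transfers
monthly_transfer = 1_000  # dollars per person per month

annual_transfer_cost = treatment_group * monthly_transfer * 12


def transfer_cost(years):
    """Total transfers if everyone in the treatment group were paid for `years` years."""
    return annual_transfer_cost * years


print(f"per year: ${annual_transfer_cost:,}")      # per year: $12,000,000
print(f"if all three-year: ${transfer_cost(3):,}")  # if all three-year: $36,000,000
print(f"if all five-year: ${transfer_cost(5):,}")   # if all five-year: $60,000,000
```

On these assumptions the transfers alone come to somewhere between $36 million and $60 million, which fits the exchange above: below $100 million, but still a large budget.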

Do you worry about having too many outcome variables? Or I suppose, you’ll be smart enough to adjust for the multiple testing problem.

Eva Vivalt: Yeah, we’re adjusting for that. We’re basically — within a type of thing, so like health, we’ll consider these as sort of like separate subject areas. So, there’ll be like a paper on health, a paper on financial health, et cetera. And then within each of those papers we’ll do all the appropriate family-wise error corrections, et cetera.
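The conversation does not say which family-wise error correction the team will use. As an illustration only, here is a sketch of one standard option, the Holm step-down procedure, applied to a made-up family of p-values for a single outcome area. The function name `holm_reject` and all the numbers are hypothetical.

```python
def holm_reject(p_values, alpha=0.05):
    """Holm step-down procedure.

    Returns a list of booleans, True where the null hypothesis is rejected
    while controlling the family-wise error rate at `alpha`.
    """
    m = len(p_values)
    # Indices of the hypotheses, ordered from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        # The threshold loosens as we move to larger p-values.
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # once one test fails, every larger p-value fails too
    return reject


# Made-up p-values for four outcomes within one family (say, health).
health_p_values = [0.001, 0.04, 0.03, 0.20]
decisions = holm_reject(health_p_values)
print(decisions)  # [True, False, False, False]
```

Without any correction, three of these four outcomes would pass a 0.05 threshold; after the correction only the first does. That is the point of adjusting within each family of outcomes.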

Robert Wiblin: Yeah. Are you going to preregister the analysis do you think?

Eva Vivalt: Yes, we will.

Robert Wiblin: Excellent. That’s great.

So what’s your role in the whole thing? There’s quite a significant number of people involved right?

Eva Vivalt: Yeah, no, this is a great project. For the PIs it’s myself and Elizabeth Rhodes, who’s a recent PhD grad from Michigan, David Broockman, who’s a Stanford GSB assistant professor, and Sarah Miller, who’s a health economist at the business school at Michigan. So those are the PIs and then we’ve got like a larger advisory board. We’re trying to keep in touch with both relevant academics, a bunch of senior researchers, as well as people obviously who are involved in other similar projects that we try to continue to talk with.

Robert Wiblin: And what’s your niche?

Eva Vivalt: Well I’m just one of the PIs. “Just”, with quote marks. I think I was originally brought on board partially for experience with impact evaluations and sort of these large-scale trials.

Robert Wiblin: Yeah. When might we hope to see results from it? It’d be some years out.

Eva Vivalt: Yeah it will. The shortest treatment arm, that’s three years out. Actually we’d be gathering data slightly before the very end of it because what we don’t want to do is do the survey at the end of the three years and then we get the effect of people coming off the program, that kind of transition effect. We’ve got a baseline survey, midline survey and endline survey, and we’ve got a bunch of little intermediate surveys along the way that people can do just quickly by themselves on mobile. And for the big surveys, we’re going to do the last of those like two and a half years in or so.

And even if we get like some early results, we’re not going to release the bulk of things until at least the end of that three year arm because things can always change and we don’t … because it’s a very high-profile study, what we don’t want is people to come away with some idea of how things went a year in and then three years in things have changed a lot but nobody listens to it. And it could also like affect some of the narrative. We don’t want the subjects to hear about themselves in the media, right? That would not be great.

Robert Wiblin: That would be disastrous really.

Another exciting thing you’re working on outside of your core research agenda is how to get people to accept ‘clean meat’, which we’ve recently done a few episodes on. That paper is called Effective Strategies for Overcoming the Naturalistic Heuristic: Experimental Evidence on Consumer Acceptance of ‘Clean’ Meat.

What did you look at in that study?

Eva Vivalt: Yeah, so we were interested in a few things. We were interested in looking at … I assume you’ve covered clean meat; clean meat is essentially, you can think of it as lab-grown meat or synthetic meat or some other kind of unpalatable terms, if you like.

Robert Wiblin: It’s the rebranding of that.

Eva Vivalt: Yeah, it’s the rebranding of that. So meat not from animals directly. Some people have got a knee-jerk reaction that, “Ew, this is disgusting. It’s not natural,” and so this is what we’re calling this naturalistic heuristic, that sort of prevents people from being interested in clean meat. And we’re looking at ways of overcoming that. We tried various methods like directly saying “look, things that are natural aren’t necessarily good and vice versa.” We tried another appeal that was more trying to get them to think about things that they are quite happy with even though they are unnatural. So maybe, prompt some sort of cognitive dissonance there. Like if they don’t like clean meat they should also not like a lot of other things that they do like.

Robert Wiblin: Vaccines.

Eva Vivalt: Yeah, yeah, and I mean there’s lots of foods that something has happened to them. Like they’re fermented or they just changed a lot from the past anyways. Like corn nowadays looks nothing like corn a long time ago, chickens nowadays look nothing like chickens a long time ago, et cetera. And we also looked at giving people sort of a descriptive norms type of approach of; other people are very excited about clean meat so maybe you should be, too.

It’s a little bit tentative but it seemed like the approach that was sort of trying to prompt cognitive dissonance by telling them about how there’s all these other unnatural goods that they like was maybe doing the best. The downside though is it did seem like quite a lot of … more people than I would have thought were actually quite negative towards clean meat. And especially, almost nothing did as well as … we had one treatment where we didn’t know, a priori, how poorly people would respond to it. So we thought we’re going to prime some people with negative social information so that at least there’s some people for whom they’ve got some kind of anti, you know, they’ve got some kind of naturalistic [crosstalk].

Robert Wiblin: Some prejudice against it.

Eva Vivalt: Yeah, exactly. And it turned out, that priming effect, was pretty much bigger than anything else we found, which is kind of disappointing because you can imagine that the very first thing that other companies who produce conventional meat products will do, most likely, is to try to attack clean meat as like-

Robert Wiblin: Gross.

Eva Vivalt: Yeah. So that was a little bit unfortunate.

And we also did another study where we were looking at the effects of knowing about clean meat on ethical beliefs because we thought actually if the … to some extent your ethical beliefs could be a function of what you think is like fairly easy to do. And so if you think that there is a good alternative out there, it could actually potentially change your views towards animals more generally, or the environment. So we were using this negative priming as an instrument for people thinking more or less positively towards clean meat and then looking at the effect on ethical beliefs, and there was actually some evidence that people were changing at least their stated ethical beliefs. I think we need to do a few more robustness checks there, but it was still quite surprising.

Robert Wiblin: Yeah. Why do you think the ‘embrace unnaturalness’ message worked the best? Do you have a theory there?

Eva Vivalt: My best guess is that it had something to do with cognitive dissonance and the fact that it was a relatively mild way of putting things. People don’t tend to like fairly strong messages against what they hold dear. We weren’t really undermining or trying to undermine what they were valuing, we were just saying, “Look, even by your own judgements here, to be consistent with your own things” …

Robert Wiblin: ‘You’re right about these other things, so why not be right about this one too’.

Eva Vivalt: Exactly.

Robert Wiblin: ‘You’re so smart’.

Eva Vivalt: It’s a very positive message in a way.

Robert Wiblin: How clear cut was the result? Are you pretty confident that that was the best one?

Eva Vivalt: You know I’m not 100% confident. So this is why I don’t want to oversell it because one could say this one was the one that sort of lasted the longest. We had like some follow ups. But at least in the short run, it could have also been — the descriptive norms might have done pretty well as well. So like it depends on whether you think — how we should weight the different rounds of data that we collected, right? And so we kind of pre-specified we were interested in the follow up but if you weren’t interested in that, if you thought that actually the early data should be somewhat informative about the later data, maybe the later data was just a bad draw, for example, then, you know.

So I wouldn’t lean too, too hard on it.

Robert Wiblin: Yeah. I mean I think that the naturalistic heuristic is one of the most consistently harmful heuristics that people apply because it causes them to, in my view at least, reach the wrong answer about so many different issues. And I wonder if there’s potential to have a nonprofit that just relentlessly pursues this point that being unnatural is not bad, being natural is not good. That would help with clean meat, but also just so many other things as well.

Eva Vivalt: That’s a fair point, and while doing this we got introduced to so many people who are doing so much interesting work on vaccines, et cetera, that, you know … Yeah, I think that especially in the future as biotech in general becomes better, et cetera, et cetera, there’s going to be so many new products that are unnatural that could plausibly benefit from such a message.

Robert Wiblin: We just need a generic pro-unnaturalness organization that can kind of be vigilantes and go to whatever new unnatural thing people don’t like.

Eva Vivalt: Yes, exactly.

Robert Wiblin: Well, it sounds like clean meat is just kind of being developed now so there’s probably going to be … we’ll want to try out a whole lot of other messages, because you’ve only tried out three here. Were there any other messages that you considered including that you would like to see other people test?

Eva Vivalt: Hmm, that’s a good question. Things don’t come to mind at this moment but I do think there’s a lot more room for further research here. Especially, one thing I don’t know about … I’m imagining that people are using unnaturalness … they seem to also think it’s unnatural and therefore it’s not healthy and therefore it’s all this other stuff. But I think there could be more done to break that down a little bit more because presumably you could in fact, at least theoretically, think that something is unnatural without thinking it’s necessarily unhealthy.

Robert Wiblin: So, you’ve written a paper that’s been pretty widely cited in the last few years called “How Much Can We Generalize From Impact Evaluations?” That was your job market paper, right?

Eva Vivalt: Yep.

Robert Wiblin: So that’s the work that you did during your PhD that you’re using to try and get a job, which we might talk about later. But, what question were you trying to answer with this paper?

Eva Vivalt: Yeah. So at the time that I was writing it, there was quite a lot of impact evaluation being done on various topics like de-worming, bednets, et cetera. But not so much of an effort to synthesize all the results. And so I’d started this nonprofit research institute, AidGrade, to gather all the results from various impact evaluations and try to say something more systematic about them.

But in the course of doing so I was kind of shocked to see how much results really varied. And I think if you talk to researchers they’ll say, “oh yeah, we know that things vary. Of course, they vary. There’s obviously all these sources of heterogeneity.” But I think that the language people use when talking to the general public or to funders is actually quite a bit different. And there, you know, things get really simplified. So I think there’s a bit of a disconnect.

And anyways, I was investigating a little bit some of the potential sources of heterogeneity. I mean, at that point, what I’m looking at is observational data, even if the data are coming from RCTs, because I’m just looking at the results that the various papers found. So I can’t definitively say the sources of the heterogeneity, but I could at least look for correlates of that and also try to say something about how, in a way, we should be thinking about generalizability. And how there are some metrics that we can use that can help us estimate the generalizability of our own results.

Robert Wiblin: So basically, you’re trying to figure out if we have a study in a particular place and time that has an outcome, how much can we say that that result will apply to other places and times that this same question could be studied. Is that one way of putting it?

Eva Vivalt: Yeah, because you’ll never actually have exactly the same setting ever again. Even if you do it in the same place, things hopefully would have changed from the first time you did it. So we might naturally expect to have different results. And then the issue is, well by how much? And how can we know that?

Robert Wiblin: All right. So I’m the kind of guy who, when they load up a paper, skips the method section, skips straight to the results. So, how much can we generalize from studies in development economics?

Eva Vivalt: Not terribly much, I’m afraid to say. This was really disheartening to me at the time. Gotten over it a bit, but yeah. I guess one main takeaway as well is that we should probably be paying a little more attention to sampling variance in terms of thinking of the results of studies. Sampling variance is just the kind of random noise that you get, especially when you’ve got very small studies. And some small studies just happen to find larger results. So I think if we try to separate that out a bit and a little bit down-weight those results that are coming from studies of small sample sizes, that certainly helps a bit.
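Down-weighting small studies is often done with inverse-variance weights, where each estimate is weighted by one over its squared standard error. The sketch below uses made-up effect sizes and standard errors. It shows the general idea rather than the specific method used in the paper, and `inverse_variance_mean` is a name invented for this example.

```python
def inverse_variance_mean(estimates, standard_errors):
    """Average the estimates, weighting each by 1 / (standard error squared)."""
    weights = [1.0 / se ** 2 for se in standard_errors]
    weighted_sum = sum(w * y for w, y in zip(weights, estimates))
    return weighted_sum / sum(weights)


# Made-up results: one small, noisy study with a big effect
# and two large, precise studies with modest effects.
estimates = [0.80, 0.10, 0.15]        # effect sizes in standard deviations
standard_errors = [0.40, 0.05, 0.05]  # the small study has the large standard error

simple_mean = sum(estimates) / len(estimates)
weighted_mean = inverse_variance_mean(estimates, standard_errors)

print(f"simple mean:   {simple_mean:.3f}")
print(f"weighted mean: {weighted_mean:.3f}")
```

The unweighted average is pulled up to about 0.35 by the small study, while the weighted average stays close to the two precise studies at about 0.13.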

Another thing that came out, and this is just an observational correlation, but one of the more interesting ones and I think it’s now part of the dialogue you hear from people, is that results from smaller studies that were done with an NGO, potentially as a pilot before government scale-up, those ones were initially more promising. And then the scale-ups didn’t live up to the hype as it were. Like the government-implemented larger versions of the same programs, or similar programs, they didn’t seem to do so well. So that’s a little bit disconcerting, if we think that generally we start as researchers by studying these interventions in smaller situations in the hopes that when we scale it up we’ll find the same effects.

Robert Wiblin: Hmm. So is the issue there that NGOs do these pilot studies and for those pilot studies they’re a bit smaller and the people who are running them are very passionate about it, so they run them to a very high standard? Or they offer the intervention to a very high standard. But then when it’s scaled up, the people who are doing it don’t have as much money or they don’t know what they’re doing. And so the results tend to be much worse?

Eva Vivalt: Yeah, I think that’s part of it. There could also be like a targeting aspect of this. You start with the places where you think there’s going to be particularly high effects. And then, as you scale it up, you might end up expanding the treatment to some people who are not going to benefit as much. And that would be, actually, completely fine. The worse story is where the initial NGO, or the initial study, everybody was very excited about it and put a lot of effort into it. And then maybe capacity constraints worsened it when it was being scaled up. So, that’s a little more disconcerting I guess.

Robert Wiblin: Right. So let’s just back up a little bit. You said the answer is that we can’t generalize very much from these development studies. What is your measure of generalizability, statistically? And on a scale between zero and one, where do we stand?

Eva Vivalt: Yeah, so that’s an excellent question. One of the things I argue for in my paper is that we should be caring about this true inter-study variance term, which I and some other people like Andrew Gelman call tau-squared, and which one has to estimate; you don’t know it up front. But this is a pretty good measure of, well, the true inter-study variance.

And there’s also a related figure that that ties into, which is called the I-squared. Where you’ve got essentially the proportion of the variance that’s not just sampling error. And that’s nice because it’s a unitless metric that’s well established in the meta-analysis literature. And it kind of ranges from zero to one and it’s very much related to this pooling factor, where if you’re trying to think about how much to weight a certain study, you might think of putting some weight on that study and some weight on all the other studies in that area.

And if you’re doing that, there’s some weight that you can put on one individual study’s result and that would range between zero and one. And similarly, for the weight you put on all the other studies’ results. I’m not sure if that completely answered your question.
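To give a feel for these quantities, here is a sketch that estimates tau-squared and I-squared and then computes a pooling factor for each study. It uses the DerSimonian-Laird method-of-moments estimator, a common frequentist choice, which is an assumption of this example and may differ from the estimation approach in the paper. All numbers are made up, and the function names are invented for illustration.

```python
def dersimonian_laird(estimates, standard_errors):
    """Return (tau_squared, i_squared) from study estimates and standard errors."""
    k = len(estimates)
    w = [1.0 / se ** 2 for se in standard_errors]
    sum_w = sum(w)
    # Inverse-variance weighted mean, ignoring heterogeneity.
    fixed_mean = sum(wi * yi for wi, yi in zip(w, estimates)) / sum_w
    # Cochran's Q: weighted squared deviations from that mean.
    q = sum(wi * (yi - fixed_mean) ** 2 for wi, yi in zip(w, estimates))
    c = sum_w - sum(wi ** 2 for wi in w) / sum_w
    tau_squared = max(0.0, (q - (k - 1)) / c)
    # Share of the observed variation not attributable to sampling error.
    i_squared = max(0.0, (q - (k - 1)) / q) if q > 0 else 0.0
    return tau_squared, i_squared


def pooling_factor(standard_error, tau_squared):
    """Weight placed on the pooled mean rather than on the study's own estimate."""
    return standard_error ** 2 / (standard_error ** 2 + tau_squared)


# Made-up effect sizes from four equally precise studies.
estimates = [0.10, 0.50, -0.20, 0.30]
standard_errors = [0.10, 0.10, 0.10, 0.10]

tau2, i2 = dersimonian_laird(estimates, standard_errors)
weights_on_pooled_mean = [pooling_factor(se, tau2) for se in standard_errors]

print(f"tau-squared: {tau2:.4f}")
print(f"I-squared:   {i2:.3f}")
print(f"weight on the pooled mean: {weights_on_pooled_mean[0]:.3f}")
```

Both I-squared and the pooling factor lie between zero and one. In this invented example I-squared is high, so each study's own estimate gets most of the weight and the other studies tell you comparatively little about it, which is the low-generalizability case.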

Robert Wiblin: Yeah.

Eva Vivalt: But there are these metrics you can use, and I would completely agree, and I was trying to push for initially that … I mean, I’m still trying to push for it, but I think it’s now more accepted that we should be thinking of generalizability as something that is non-binary that lies somewhere between zero and one.

Robert Wiblin: So, what is tau-squared? I saw this in the paper, but to be honest I didn’t really understand what it actually is. Is this some kind of partition of the variance that’s due to … I just don’t know.

Eva Vivalt: Yeah, no worries. So essentially, yeah, you can think of it as some measure of …. Okay, you’ve got a whole bunch of different results from different studies. Some of that variation is just due to sampling variance. So if you think of these studies as all replications, I mean they’re not, but if you were to think of them as replications then the only source of variance would be the sampling variance because you’d be drawing an observation from some distribution. And you’d be drawing a slightly different observation, so you’d get a little bit of noise there naturally …

Robert Wiblin: So that’s just some studies get lucky and some studies get unlucky in a sense. So they have higher or lower numbers just because of what individuals they happened to include?

Eva Vivalt: Yeah, exactly. And so if you’re then thinking okay well we’re not actually really in a case of replications. We’re actually in a case where there is a different effect size in every place that we do the study because there’s so much heterogeneity. Like, there’s other contextual factors or whatnot. Well, then you’ve got not just this sampling variance, but also some additional sort of true latent heterogeneity that you need to estimate.

Robert Wiblin: That the effect was different in the different cases.

Eva Vivalt: Exactly. Exactly. So, I’m just arguing for separating the two of these things out. And then trying to say, well this is the true heterogeneity.
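As a rough illustration of separating the two sources of variation, here is a sketch of the DerSimonian-Laird method-of-moments estimator of tau-squared, a standard choice in the meta-analysis literature. It is not necessarily the estimator used in the paper, which may rely on a hierarchical Bayesian model instead, and the effect sizes below are made up.

```python
def dersimonian_laird_tau2(effects, std_errors):
    """Method-of-moments (DerSimonian-Laird) estimate of inter-study variance."""
    if len(effects) != len(std_errors) or len(effects) < 2:
        raise ValueError("need at least two studies with matching standard errors")
    weights = [1.0 / se ** 2 for se in std_errors]  # inverse-variance weights
    sum_w = sum(weights)
    pooled = sum(w * y for w, y in zip(weights, effects)) / sum_w
    # Cochran's Q: weighted squared deviations from the pooled mean.
    q = sum(w * (y - pooled) ** 2 for w, y in zip(weights, effects))
    df = len(effects) - 1
    scale = sum_w - sum(w ** 2 for w in weights) / sum_w
    # With sampling error alone, Q is about df on average; the excess is
    # attributed to true heterogeneity. Negative estimates are truncated at zero.
    return max(0.0, (q - df) / scale)


# Hypothetical effect sizes, each with a standard error of 0.05.
identical = dersimonian_laird_tau2([0.10, 0.10, 0.10], [0.05, 0.05, 0.05])
spread_out = dersimonian_laird_tau2([-0.20, 0.10, 0.40], [0.05, 0.05, 0.05])
print(identical, spread_out)
```

In the first case the studies agree more closely than sampling noise alone would predict, so the estimate is zero; in the second the spread is far larger than the standard errors can explain, so most of it is counted as true heterogeneity.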

And you could go even a step further and say well, maybe we can model some of the variation. And maybe we want to think that the important thing in terms of generalizing is how much unmodeled heterogeneity there is. Like how much we can’t explain. Like if we can say that, for example, well I’ve got a conditional cash transfer program and I want to know the effects on enrollment rates and maybe I think baseline enrollment rates are really important in determining that. Because it’s probably easier to do a better job in improving the enrollment rate from 75% than from 99%, right? It’s just a little bit easier. So, you can say okay well then I’ve got some model where baseline enrollment rates are an input into that model. And then after accounting for baseline enrollment rates, what’s sort of the residual unexplained heterogeneity in results. Because that’s going to be the limiting factor on how much I can actually extrapolate from one setting to another accurately.
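To illustrate "residual unexplained heterogeneity", the sketch below compares tau-squared before and after accounting for one moderator, using a method-of-moments estimator for meta-regression. Everything here is an assumption for illustration: the data are invented so that effects fall almost exactly on a line in baseline enrollment, and the paper's own models may differ.

```python
def tau2_unconditional(effects, std_errors):
    """DerSimonian-Laird estimate of inter-study variance with no moderators."""
    w = [1.0 / se ** 2 for se in std_errors]
    s0 = sum(w)
    pooled = sum(wi * y for wi, y in zip(w, effects)) / s0
    q = sum(wi * (y - pooled) ** 2 for wi, y in zip(w, effects))
    scale = s0 - sum(wi ** 2 for wi in w) / s0
    return max(0.0, (q - (len(effects) - 1)) / scale)


def tau2_residual(effects, std_errors, moderator):
    """Residual inter-study variance after a weighted linear fit on one moderator."""
    k = len(effects)
    if k < 3:
        raise ValueError("need at least three studies to fit an intercept and slope")
    w = [1.0 / se ** 2 for se in std_errors]
    # Weighted sums for the 2x2 normal equations (intercept and slope).
    s0 = sum(w)
    s1 = sum(wi * x for wi, x in zip(w, moderator))
    s2 = sum(wi * x * x for wi, x in zip(w, moderator))
    r0 = sum(wi * y for wi, y in zip(w, effects))
    r1 = sum(wi * x * y for wi, x, y in zip(w, moderator, effects))
    det = s0 * s2 - s1 * s1
    if det == 0:
        raise ValueError("moderator must vary across studies")
    intercept = (s2 * r0 - s1 * r1) / det
    slope = (s0 * r1 - s1 * r0) / det
    # Weighted residual sum of squares around the fitted line.
    q_resid = sum(wi * (y - intercept - slope * x) ** 2
                  for wi, x, y in zip(w, moderator, effects))
    # Trace term of the method-of-moments estimator for meta-regression.
    t0 = sum(wi ** 2 for wi in w)
    t1 = sum(wi ** 2 * x for wi, x in zip(w, moderator))
    t2 = sum(wi ** 2 * x * x for wi, x in zip(w, moderator))
    scale = s0 - (s2 * t0 - 2 * s1 * t1 + s0 * t2) / det
    return max(0.0, (q_resid - (k - 2)) / scale)


# Hypothetical studies: effect on enrollment (percentage points) vs baseline rate (%).
baseline = [60, 70, 80, 90, 95]
effects = [8.1, 5.9, 4.0, 2.1, 0.9]
std_errors = [0.5] * 5
raw = tau2_unconditional(effects, std_errors)
residual = tau2_residual(effects, std_errors, baseline)
print(raw, residual)
```

In this contrived example nearly all of the apparent heterogeneity disappears once baseline enrollment is in the model, which is the situation where extrapolating to a new setting with a known baseline would be most accurate. Real data would rarely be this clean.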

Robert Wiblin: Okay. So a tau-squared of one would indicate that all of them had the same effect in every case that they were implemented. And a zero would indicate that it was totally random, the effect that it would have in each different circumstance. Is that right?

Eva Vivalt: Not quite, actually. Sorry, I might have explained this a little bit funny. So there is something that ranges between zero and one, which is either the I-squared or this pooling term. But the tau-squared itself, you can think of it as a kind of variance. It’s going to really be in terms of the units of whatever the thing was initially. So if it’s conditional cash transfers on enrollment rates, enrollment rates are maybe in percentage points. So then the variance would relate to those units of enrollment rates. And so that’s actually a great point because it’s going to be very difficult to compare the tau-squared of one particular outcome to the tau-squared of a completely different intervention’s effect on a completely different outcome because those things are going to be in different units entirely.

That’s one advantage of I-squared relative to tau-squared, is that I-squared is unitless. It kind of scales things. So that does run between zero and one, and does not depend on the units. Although it’s not 100% straightforward either. I mean, that has also got some drawbacks.
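The unit dependence of tau-squared, and the unit independence of I-squared, can be checked numerically. The sketch below uses the common Q-based definition of I-squared and the DerSimonian-Laird tau-squared; both are standard textbook formulas rather than something taken from the paper, and the numbers are hypothetical.

```python
def heterogeneity_summary(effects, std_errors):
    """Return (tau2, i2) using Cochran's Q and the DerSimonian-Laird estimator."""
    weights = [1.0 / se ** 2 for se in std_errors]
    sum_w = sum(weights)
    pooled = sum(w * y for w, y in zip(weights, effects)) / sum_w
    q = sum(w * (y - pooled) ** 2 for w, y in zip(weights, effects))
    df = len(effects) - 1
    scale = sum_w - sum(w ** 2 for w in weights) / sum_w
    tau2 = max(0.0, (q - df) / scale)
    # I-squared: share of total variation not attributable to sampling error.
    i2 = max(0.0, (q - df) / q) if q > 0 else 0.0
    return tau2, i2


# The same hypothetical results expressed as proportions and as percentage points.
as_proportions = heterogeneity_summary([-0.20, 0.10, 0.40], [0.05, 0.05, 0.05])
as_percentage_points = heterogeneity_summary([-20.0, 10.0, 40.0], [5.0, 5.0, 5.0])
print(as_proportions)
print(as_percentage_points)
```

Multiplying every effect and standard error by 100 multiplies tau-squared by 10,000 but leaves I-squared unchanged, which is why only the latter can be compared directly across outcomes measured in different units.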

I’m trying to summarize the paper here, but I guess if one’s really super interested in these issues I would just recommend reading the paper.

Robert Wiblin: Taking a look at it.

Eva Vivalt: It goes in much greater detail. I’m simplifying a bit here.

Robert Wiblin: Sure, okay. We’ll definitely stick up a link to it.

So let’s say that we had a new intervention that no one really knew anything about. And then one trial was done of it in a particular place, and it found that it improved the outcome by one standard deviation. Given your findings, how should we expect it to perform in a different situation. Presumably less than one standard deviation improvement, right?

Eva Vivalt: Yeah. I mean, to be honest, one standard deviation improvement is just huge. Enormous.

Robert Wiblin: I was just saying that because one’s a nice round number.

Eva Vivalt: Oh yeah. But the typical intervention is going to be more like 0.1 rather than one. So if I saw one somewhere, I’d be like, wow, that’s got to be a real outlier. That was a very high draw. So I would be skeptical just for that reason.

Robert Wiblin: Okay, so I’ve got 0.1. What might you expect then if it was done somewhere else?

Eva Vivalt: Well, it’s going to depend a lot on the intervention and the outcome. And if I’m using some more complicated model. I think the best way to answer those questions is to look at a specific intervention and a specific outcome and try to model as much of the heterogeneity as possible. And there’s not going to be any substitute for that, really.

What I’m looking at in my paper is trying to say something like, well that might be so. But still, what can we say about looking across all the interventions, across all the outcomes? And that’s where I pick up patterns like if it’s done by an NGO, if it’s a relatively smaller program it tends to have higher effects. But that’s a little bit hand-wavy. I think the best way to answer those questions in terms of what do I really find is to go to that particular intervention, that particular outcome.

But what I can say is that even with one study’s results, and now this is pretty weak but it’s still true, there’s still a relationship, is that if you look at the heterogeneity of results within the study, that actually does predict the heterogeneity of results across studies. I mean, weakly. And there’s no reason for it to necessarily be true, but it is a stylized fact that one could use.

Robert Wiblin: Hey, I just wanted to interject that I later emailed Eva to see if there was any rule of thumb we could use to get a sense of how bad the generalisability is from one study to another.

One option is to say that:

The median absolute amount by which a predicted effect size differs from the true value given in the next study is 99%. In standardized values, the average absolute value of the error is 0.18, compared to an average effect size of 0.12.

So, colloquially, if you say that your naive prediction was X, well, it could easily be 0 or 2*X; that’s how badly this estimate was off on average. In fact, it’s as likely to be outside the range between 0 and 2*X as inside it.
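To see what this rule of thumb implies, the sketch below assumes, purely for illustration, that the next study's result is normally distributed around the naive prediction, with the spread chosen so that the median absolute error is 99% of the prediction. The normality assumption and the example prediction of 0.12 are not from the paper.

```python
from statistics import NormalDist


def next_result_distribution(prediction, median_rel_error=0.99):
    """Normal distribution for the next result, centred on the prediction.

    The spread is set so that the median absolute error equals
    median_rel_error * |prediction|.
    """
    half_width = median_rel_error * abs(prediction)
    # For a normal distribution, half of the mass lies within about 0.6745 sd.
    sd = half_width / NormalDist().inv_cdf(0.75)
    return NormalDist(mu=prediction, sigma=sd)


prediction = 0.12  # hypothetical naive prediction, in standard deviations
dist = next_result_distribution(prediction)
p_between_0_and_2x = dist.cdf(2 * prediction) - dist.cdf(0.0)
p_wrong_sign = dist.cdf(0.0)
print(round(p_between_0_and_2x, 3), round(p_wrong_sign, 3))
```

Under these assumptions, roughly half of follow-up results land between 0 and 2*X, and roughly a quarter come out with the opposite sign to the prediction.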

This wouldn’t be rigorous enough to satisfy an expert in the field, but it’s good enough for us here. Back to the interview.

Robert Wiblin: Okay. So did you find out under what circumstances results are more generalizable and when they’re less generalizable?

Eva Vivalt: Yeah. So again this is a little bit hand-wavy and I think a little bit less the point of the paper, because like I say, even though these studies are mostly RCTs, when I’m looking at them, at that point it’s as though I’ve got observational data. Because the studies are selected in various ways that … where people even choose to do the studies is selected and I’m just looking at this data. But despite that, if you do the naïve thing of doing ordinary least squares regression of your effect sizes on various study characteristics…. So I mentioned bigger programs and government-implemented programs tend to do worse. There’s not much of a general trend in other things. In particular, it doesn’t seem to matter so much if it’s an RCT or not. Or where it was done.

Actually, one thing I did find is, you can’t even necessarily just say …. So often you hear from policy-makers and researchers, “well we’ve got results from one particular country. So at least we know how it works in that country.” And actually, I would disagree with that. Because even within a country, if you’ve got multiple results from the same country, they don’t predict each other very well. And it makes sense if you think about, you know, I don’t think anybody would say within the US, “oh yeah, well results from Massachusetts are going to be very similar to results from Texas,” or something like that. Right? Even within a country there’s so much variation that maybe it’s no better than taking results from a completely different area of the globe. But it’s still not that great and I can’t actually even find any kind of statistically significant relationship within a country.

Robert Wiblin: Isn’t this pretty damning? Why would we bother to do these studies if they don’t generalize to other situations? It seems like we can’t learn very much from them.

Eva Vivalt: Yeah, so that’s a great devil’s advocate type question. I’m still, despite all this, an optimist that we’re learning something. Right? Because part of it is that this way of looking at it doesn’t model all the little factors. I mean, I am actually quite skeptical of most of the stories that people tell about why an intervention worked in one place and why it didn’t work in another place. Because I think a lot of those stories are constructed after the fact, and they’re just stories that I don’t think are very credible. But that said, I don’t want to say that we can learn nothing. I would just say that it’s very, very hard to learn things. But, what’s the alternative?

Robert Wiblin: Well, I guess, potentially using one’s intuition. But one thing you could say looking at this, is that it’s not really worth running these studies. An alternative view would be that because each study is less informative than we thought, we have to run even more of them. Do you have a view between those two different ways of responding?

Eva Vivalt: Yeah. I would argue for running more of them, but not in a completely senseless manner. I think we can still say something about …. There are ones which are higher variance, where we could learn more, where the value of information of doing another study is going to be higher.

So, I guess part of this depends on, sorry to get into technical details but …

Robert Wiblin: No, go for it.

Eva Vivalt: … the decision problem we think people are faced with. Right? Because if you think that what a policy-maker really cares about in making their decision is whether some result is statistically significant, and better than some other result in a statistically significant way, well, okay, then that’s a different problem from one where they’re okay with something that has, say, a 20% chance of working better than the alternative.

So think of this all in terms of: there is some problem that a policy-maker is trying to solve, and then within that problem you’ve got the ability to run studies or not run studies. And the value of information of running each of those things is going to be different depending on how much underlying heterogeneity there is.

Just to be a little bit simpler about this, the intuition is that if you’ve got … I mean, the studies that are the most valuable to run would be the ones where you don’t know very well a priori what’s going to happen. You’ve got a higher degree of uncertainty up front. But where you think there is a good upswing potential, as it were, right? Like it could overtake the best possible outcome.

Robert Wiblin: A lot of value of information, I think is the …

Eva Vivalt: Yeah, exactly.
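The intuition that a study is worth more when prior uncertainty is high and the program could plausibly overtake the best alternative can be sketched with a textbook value-of-information calculation. This is a simplified illustration, not the model from the paper: it assumes a normal prior, a study that measures the effect in the decision-maker's own context with normal noise, and a decision-maker who picks whichever option has the higher expected effect. All names and numbers are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def expected_max(mean, sd, threshold):
    """E[max(Z, threshold)] for Z ~ Normal(mean, sd)."""
    if sd == 0:
        return max(mean, threshold)
    z = (mean - threshold) / sd
    return threshold + (mean - threshold) * STD_NORMAL.cdf(z) + sd * STD_NORMAL.pdf(z)


def value_of_study(prior_mean, prior_sd, study_se, outside_option):
    """Expected gain in outcome from running one study before deciding."""
    prior_var = prior_sd ** 2
    # Spread of the posterior mean as anticipated before the study is run.
    preposterior_sd = prior_var / sqrt(prior_var + study_se ** 2)
    with_study = expected_max(prior_mean, preposterior_sd, outside_option)
    without_study = max(prior_mean, outside_option)
    return with_study - without_study


# Same prior mean and alternative; only the prior uncertainty differs.
confident = value_of_study(prior_mean=0.10, prior_sd=0.05, study_se=0.05, outside_option=0.12)
uncertain = value_of_study(prior_mean=0.10, prior_sd=0.20, study_se=0.05, outside_option=0.12)
print(confident, uncertain)
```

The more uncertain prior yields a much larger expected gain from the study, because there is more room for the result to reveal that the new program beats the alternative.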

Robert Wiblin: Okay. We’ll come back to some of those issues later because you have other papers that deal with how these RCTs can inform policy-makers.

But let’s just talk a little bit more about your method here. So, how did you collect all of this data on all these different RCTs. It sounds like an enormous hassle?

Eva Vivalt: Yeah. I wouldn’t recommend it. I mean, obviously one has to do it. But, oh, my goodness. I think I was very lucky actually to have a lot of great help from various RAs over the course of several years, through AidGrade, who were gathering and double-checking and sometimes triple-checking some of this data. Everything, all the data was gathered by two people. And if their results disagreed in some way, and their inputs disagreed then a third person would come in and arbitrate. So that’s how we got all of the characteristics of the different studies coded up. All the effect sizes.
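The double-entry process described here, with a third person arbitrating disagreements, can be sketched as a small reconciliation routine. The field names and the `arbitrate` callback are hypothetical stand-ins; this is not AidGrade's actual tooling.

```python
def reconcile(coder_a, coder_b, arbitrate):
    """Merge two independent codings of one study, field by field.

    coder_a, coder_b : dicts mapping field name -> coded value
    arbitrate        : callable(field, value_a, value_b) -> final value,
                       standing in for the third person who resolves conflicts
    """
    final = {}
    for field in sorted(set(coder_a) | set(coder_b)):
        a, b = coder_a.get(field), coder_b.get(field)
        # Agreement is accepted as-is; any mismatch is escalated.
        final[field] = a if a == b else arbitrate(field, a, b)
    return final


# Hypothetical codings of one study by two research assistants.
first = {"country": "Kenya", "is_rct": True, "effect_size": 0.12}
second = {"country": "Kenya", "is_rct": True, "effect_size": 0.21}
escalated = []


def third_coder(field, a, b):
    escalated.append(field)
    return a  # placeholder decision; a real arbiter would re-read the paper


coded = reconcile(first, second, third_coder)
print(coded, escalated)
```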

I am hopeful that in the future, we’re going to be able to do a lot more with automated reading of these papers. You would think that’s absolutely crazy, but I think it works pretty well so far. I mean, not of the actual results tables. I think the results tables are actually the hardest task in a way, because you need to really know what a particular result represents. Is this a regression with controls, is it with whatever else. What methods, et cetera. But for basic characteristics of studies, like where was it done, was it an RCT or not, those kinds of things, actually we’ve had pretty good success with some pilot studies trying to read that automatically through natural language processing.

And that, I think, is really the best hope for the future. Because studies are coming out so quickly these days that I think to keep abreast of all of the literature and all the various topics — I mean, it’s even more of a constraint for the medical literature where there’s loads of studies and new ones coming out all the time. Meta-analyses can go out of date quite quickly and they’re not really incentivized properly in the research community so the only way to get people to actually do them and keep the evidence up-to-date in some sense is by at least making the process easier.

I don’t think that it can be ever 100% done by computer. I think you’re still going to need some inputs from people. But if you can reduce the amount of effort it takes by 80% or 90% and just have people focus on the harder questions and the harder parts of that, that would be a huge benefit.

Robert Wiblin: Do you think there’s enough of this data aggregation? Or are there too few incentives for people to do this in academia?

Eva Vivalt: No, I think the incentives are all wrong. Because researchers, they want to do the first paper on a subject. Or ideally, if not the first then the second. The third is even worse than that. And by the time you get to do a meta-analysis, well that’s kind of the bottom of the bin in some regards. You think it would be more highly valued, but it’s not.

Robert Wiblin: Wouldn’t you get a lot of citations from that? Because people would trust the results of a meta-analysis more than the individual papers.

Eva Vivalt: I think that’s fair. And you can get some fairly well cited meta-analyses. Unfortunately, citations are just not the criterion that’s really used for evaluating research in economics. I know it is more so in other fields, but not so much in economics where it really is the journal that matters.

Robert Wiblin: So the journals that publish that kind of thing just aren’t viewed as the most prestigious?

Eva Vivalt: Yeah, that’s exactly right.

Robert Wiblin: I’ve also heard that in fields where collecting a big dataset, especially a historical dataset, is what enables you to ask a lot of new questions, there are perhaps too few incentives to put it together. Because you do all of the work of putting it together, then you publish one paper about it, and then other people will use the same dataset to publish lots of papers themselves. And in a sense you don’t get the full fruit of all of the initial work that you did. Is that a possibility here, where other people can now access this dataset of all of these different RCTs that you’ve compiled and so you don’t … Kind of, they drank a bit of your milkshake in a sense.

Eva Vivalt: I wouldn’t put it that strongly, both because I’m actually quite happy if other people do things with the data and also because …. It depends I guess where you are at in the process. I think for people who are just finishing up their PhD, for example, it’s actually very good to show that you can compile a very large dataset because that’s what a lot … a lot of research depends on having very good data and if you can show that you can collect really good data then that’s great for you. Obviously you also want to publish well based on that. That’s, I guess, a separate question.

Robert Wiblin: So, what are the biggest weaknesses of this study? Do you think that we should trust this result, that results aren’t that generalizable? Or is this something that could be overturned with future research?

Eva Vivalt: I don’t think it’s really in danger of being overturned per se. That’s just a function of the fact we’re doing social science and there are all sorts of things that can change and that matter for your treatment effects. So, yeah, I’m not tremendously concerned about that.

Robert Wiblin: So what kinds of studies did you include in this particular dataset? For example, you were looking at development studies.

Eva Vivalt: Yeah.

Robert Wiblin: If you looked instead at, say, education studies in the developed world, might you get different results in a different domain or field?

Eva Vivalt: Maybe. I think the bigger difference, though, would probably be with things that are less, at least intuitively, context-specific. Things like health …

Robert Wiblin: Medicine.

Eva Vivalt: Yeah, exactly. So for example in our data, actually the things that almost varied more were the health interventions. But that’s because we weren’t controlling for things like baseline incidence of disease or any of those kinds of things.

Robert Wiblin: Right.

Eva Vivalt: And if you do control for those, then, I mean, we weren’t doing that in the general analysis, but if you do control for them then actually the heterogeneity is a lot smaller. So, things that have a clearer, more straightforward causal effect, there we might expect to see slightly different results.

Robert Wiblin: Hmm. So kind of antibiotics will usually treat the same disease anywhere. But I suppose in these studies they actually have different impacts because in different places people have the underlying disease at different levels.

Eva Vivalt: Yeah, exactly. Yeah. I mean, everybody I think at this point would agree that things like de-worming et cetera depend on what the baseline prevalence of the worms, or whatever, is. And once you control for those things, then you actually … Because there’s some very clear mechanisms through which these things work, there are fewer things that can go wrong. Whereas the more general social science type thing, there’s so many factors that feed into what the treatment effects ultimately are, so it’s a little bit messier.

Robert Wiblin: So you wrote another paper called “How Much can Impact Evaluations Inform Policy Decisions?” Which, I can imagine, was partly informed by this other paper. Do you want to explain what you found there?

Eva Vivalt: Sure. So that paper is looking a bit at, well the fact that if we do try to put this into some kind of framework where a policy-maker is deciding between different options, and they’re always going to want to choose the thing that has the highest effect. Well, given the heterogeneity we observe, how often would they actually change their mind? You know, if the outside option takes some particular value. So, yeah, it’s quite related.

We also tried to use some priors that we had collected. Some predictions that policy-makers had made about the effects of particular programs.

Robert Wiblin: So just to see if I’ve understood the set-up correctly, you’ve got this modeled agent, which I guess is a politician or a bureaucrat or something. And they’ve got some background thing that they could spend money on; perhaps this is spending more money on schools or whatever else. And they think that they know how good that is. And so that’s somewhere they could stick the money. And then you’re thinking of the value of a study on another thing, that might be better, or might be worse. And the bureaucrat, say, even though there haven’t been any studies done yet, or not many, has some belief about how good this other option is, this new option. But they’re not sure about it, and they would somewhat change their mind if a randomized controlled trial were done. And then you want to see, well, how often would that trial cause them to actually change their decision and go for this alternative option?

Eva Vivalt: Yeah, that’s exactly it. You’re putting it much better than I did.
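The set-up just described can be mimicked with a small simulation: draw a true effect from the decision-maker's prior, draw a noisy trial result, update, and count how often the choice flips. This is a stripped-down sketch with a perfectly Bayesian agent and made-up numbers, not the simulation from the paper.

```python
import random


def share_of_trials_that_change_decision(prior_mean, prior_sd, study_se,
                                         outside_option, n_sims=20000, seed=0):
    """Fraction of simulated trials after which the agent switches options.

    Assumes at least one of prior_sd and study_se is positive.
    """
    rng = random.Random(seed)
    # Weight the Bayesian agent places on the trial result (normal-normal model).
    shrink = prior_sd ** 2 / (prior_sd ** 2 + study_se ** 2)
    default_is_new_program = prior_mean > outside_option
    changed = 0
    for _ in range(n_sims):
        true_effect = rng.gauss(prior_mean, prior_sd)
        trial_estimate = rng.gauss(true_effect, study_se)
        posterior_mean = prior_mean + shrink * (trial_estimate - prior_mean)
        chooses_new_program = posterior_mean > outside_option
        if chooses_new_program != default_is_new_program:
            changed += 1
    return changed / n_sims


# Hypothetical cases: the alternative is slightly better vs far better than the prior.
close_call = share_of_trials_that_change_decision(0.10, 0.10, 0.05, outside_option=0.12)
clear_cut = share_of_trials_that_change_decision(0.10, 0.10, 0.05, outside_option=0.40)
print(close_call, clear_cut)
```

When the two options look similar beforehand, a single trial changes the decision often; when the outside option is far ahead, almost no trial result is surprising enough to matter.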

Robert Wiblin: So, what did you find? Is there any way of communicating how often people do change their mind? And maybe perhaps what’s the monetary value of these studies?

Eva Vivalt: That’s an excellent question. So, we didn’t actually connect it to actual monetary value because that depends a bit upon what you think the value of some of these outcomes is. We did this a little bit abstractly, trying to compare two programs that — one was 90% of the value of another one, or 50%. But we weren’t actually making assumptions on the final, the last mile type part of “well yeah, but what is this actually worth?” I mean, that’s going to depend a bit on what the actual outcomes and the values of the outcomes are.

So, I wish I had a better answer is what I’m trying to say.

Robert Wiblin: Okay. So in the abstract you wrote, “We show that the marginal benefits of a study quickly fall and when a study will be the most useful in making a decision in a particular context is also when it will have the lowest external validity,” which is a bit counter-intuitive. And then also, “The results highlight that leveraging the wisdom of the crowds can result in greater improvements in policy outcomes than running an additional study.”

Did you want to explain those sentences?

Eva Vivalt: Sure. So, yeah. I think one of the interesting things is the statement that when a study will be most useful is when it will have the lowest external validity. That relates to the point that, in a sense, when’s the study going to be most useful? It’s going to be the most useful when it surprises us and was really different. When’s it going to be the most different? Well, when we’re not going to be able to generalize much from it, when it’s got some underlying factors that make it a little bit weird in some way. It’s going to be the highest value in that setting, but if you try to think about extrapolating from it….

Robert Wiblin: So is it not so much that that study can’t be generalized to other things that makes it valuable. But rather that other things can’t already be generalized to this one? So this is a more unique case?

Eva Vivalt: Yeah. And I mean, it could go either way in the sense that if you think that the other studies haven’t found this particular thing, and this particular thing is a bit unique, well, likewise, you wouldn’t expect this unique thing to say much about those other ones either. So, again, this is a little bit abstract because you can try to think about, “well, yes, but does this new thing tell us something about some other, more complicated underlying models of the world as to why this one happened to be so surprising?” But yeah, that’s just the general intuition.

And then with respect to leveraging the wisdom of the crowds, well, we did look at different kinds of ways of making decisions. We looked at a dictator making a decision all by themselves versus a collective of various bureaucrats voting and just using a majority voting rule to try to decide which particular intervention to do. And there, because people can frequently be wrong, actually adding additional people to the set of people who are making the decision can lead to substantial benefits in terms of the actual … in choosing the right program afterwards. There were actually some simulations in which it performed better.

Robert Wiblin: Are you saying that running these broad surveys is potentially more informative than an RCT? And I guess also presumably cheaper as well? Or at least in the model.

Eva Vivalt: Yeah, so I guess … So in the model it’s more a matter of how many people are making the decision and how many people’s inputs are being fed into this process. So, I guess if you’ve got a more democratic decision making process or you involve more people, their priors are more likely to be correct in that case. Sort of like their aggregate prior. And the benefits of just doing that can be higher than the benefits of doing an RCT. I mean, it depends a little bit on all sorts of underlying parameters here. But there were at least some simulations for which that was definitely true, where adding additional people helping to make the decision resulted in better decisions than running an additional study.
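A classic way to see why adding decision-makers can help is the Condorcet jury calculation below. It swaps in a simpler model than the paper's: each voter is assumed to pick the better program independently with the same probability, which is an assumption made here for illustration only.

```python
from math import comb


def majority_correct_probability(n_voters, p_correct):
    """Chance that a simple majority of independent voters picks the better program."""
    if n_voters < 1 or n_voters % 2 == 0:
        raise ValueError("use an odd number of voters to avoid ties")
    needed = n_voters // 2 + 1
    return sum(
        comb(n_voters, k) * p_correct ** k * (1 - p_correct) ** (n_voters - k)
        for k in range(needed, n_voters + 1)
    )


# A lone decision-maker versus committees, with modestly informed voters.
for n in (1, 3, 11):
    print(n, round(majority_correct_probability(n, 0.6), 3))

# The same committees when each voter is worse than a coin flip.
for n in (1, 3, 11):
    print(n, round(majority_correct_probability(n, 0.4), 3))
```

When individual voters are right more often than not, larger groups choose the better program more reliably; when they are right less than half the time, adding voters makes the collective decision worse.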

Robert Wiblin: So, what surprised you the most from these simulations that you were running? Was there anything that you didn’t expect?

Eva Vivalt: Well I don’t think I was expecting that result, to be honest. Also, obviously, it does depend on the quality of the priors that people initially have, right? Like if you actually do have very highly uninformed individuals, then aggregating more highly uninformed priors is not going to help you.

Robert Wiblin: Shit in, shit out.

Eva Vivalt: Yeah, basically.

Robert Wiblin: I get to swear on my own show.

Eva Vivalt: Well, I could just say that you said it.

Robert Wiblin: So, do you think that we should run more studies, or less, on the basis of this paper?

Eva Vivalt: Well, I don’t think that’s the right … It’s not like we … There’s not a real trade-off here. Have more democratic decision making processes or run additional studies. We can do both. So I think more studies still is going to help, but so is actually taking that evidence into consideration and also having more people help to make decisions and hopefully balance out some of the errors that are made because, actually a lot of … I mean, I’ve also done some work looking at how policy-makers interpret evidence from studies and update.

Robert Wiblin: So you modeled bureaucrats or politicians as these Bayesian agents who I guess update perfectly. Was that right?

Eva Vivalt: At least in this paper. There’s another paper that does not do it, but yeah.

Robert Wiblin: Yeah. What kind of deviations might you expect? Do you think they might update too much or too little in the real world?

Eva Vivalt: Well I think, actually, so I’ve got this other paper with Aidan Coville of the World Bank where we are looking at precisely some of the biases that policy-makers have. And one of the bigger ones is that people are perfectly happy to update on new evidence when that goes in a nice, positive — when it’s good news. But people really hate to update based on bad news. So for example, if you think that the effects of a conditional cash transfer program on enrollment rates is that maybe they’ll increase enrollment rates by three percentage points. And then we can randomly show you some information that either says it’s five or it’s one. Well if we show you information that says it’s five, you’re like “great, it’s five.” If we show you information that says it’s one, you’re like “eh, maybe it’s two.” So we see that kind of bias. We also-

Robert Wiblin: It’s interesting because if you update negatively or if you update downwards then you’re creating a much greater possibility for future exciting positive updates. You can’t have positive updates without negative updates as well.

Eva Vivalt: Well, that’s fair I guess.

Robert Wiblin: I guess they’re not thinking that way.

Eva Vivalt: Present bias or something. No, I don’t know.

And it kind of makes sense intuitively, because one of the initial reasons for why we’re considering this particular bias in the first place is… I think a situation that will be very familiar to people who engage with policy-makers is, you know, you’re asked to do an impact evaluation. You come back saying, “oh yeah, this thing showed no effect.” And people are like, “oh really? It must be the impact evaluation that’s wrong.”
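The asymmetric updating in the cash-transfer example above can be written as a toy rule in which good news and bad news get different weights. The weights below are hypothetical values chosen only to reproduce the illustrative numbers from the conversation (a prior of three, news of five or one); they are not estimates from the study.

```python
def optimistic_update(prior, signal, weight_good=1.0, weight_bad=0.5):
    """Move from the prior toward the signal, moving less when the news is bad.

    weight_good, weight_bad : hypothetical fractions of the gap that are closed
                              for good and bad news respectively
    """
    weight = weight_good if signal >= prior else weight_bad
    return prior + weight * (signal - prior)


prior = 3.0  # expected effect on enrollment, in percentage points
after_good_news = optimistic_update(prior, 5.0)
after_bad_news = optimistic_update(prior, 1.0)
print(after_good_news, after_bad_news)
```

An unbiased updater would use the same weight in both directions, so equal-sized surprises would move beliefs by equal amounts.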

Robert Wiblin: I wonder, it’s notorious that impact evaluations within bureaucracies that want to protect their own programs are too optimistic. But I wonder, it’s a bit like, kind of everyone overstates how tall they are on dating sites but at the end of the day, you end up knowing how tall someone is, because everyone overstates by the same amount. And I wonder if looking at these impact evaluations you kind of figure out what’s the truth or what’s right on average just by saying “well was it extremely good or was it merely good?” You just adjust everything down by a bit.

Eva Vivalt: That’s a good point. That’s a good point. Yeah, no, fair enough. I mean the other thing that …

Robert Wiblin: I suppose that would just end up rewarding even more extreme lying.

Eva Vivalt: Yeah. And that’s not the only bias that people have got either, right? So another thing that we were looking at is how people were taking or not taking the variance into consideration. So in the simplest idea, you can think of this as just sampling variance. But you can also look at heterogeneity across studies. And basically, people were not updating correctly based on confidence intervals. That might be the easiest way of framing it.

And we did try to break that down a bit, and try to say, “well okay but why is that? Are they misinterpreting what a confidence interval is? Is it some kind of aggregation failure? Is it just that they’re ignoring all new information, and so obviously they’re going to be caring less about confidence intervals than somebody who actually does take information into consideration and does actually update at all?”

So we did try to break it down in several ways. And yeah, it does seem like people are not taking the variance into account as a Bayesian would.

Robert Wiblin: Oh, hold on. So you’re saying they just look at the point results and not at how uncertain it was?

Eva Vivalt: Yeah, pretty much. I mean, they do look a little bit at how uncertain it was, but not as much as they should if they were fully Bayesian. If they were actually Bayesian then they would care more about the confidence intervals.

Robert Wiblin: Right. So if it would be a small study that kind of gets a fluky extreme result, people over-rely on that kind of thing.

Eva Vivalt: Yeah, exactly.
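To show what "caring about the confidence intervals" would look like, the sketch below contrasts a precision-weighted Bayesian update with an updater who moves a fixed fraction toward any point estimate. The normal prior, its standard deviation, the fixed fraction, and the intervals are all assumptions made for illustration.

```python
def std_error_from_ci(lower, upper, z=1.96):
    """Back out a standard error from a symmetric 95% confidence interval."""
    return (upper - lower) / (2 * z)


def bayesian_posterior_mean(prior_mean, prior_sd, estimate, std_error):
    """Normal-normal update: each source is weighted by its precision."""
    prior_precision = 1.0 / prior_sd ** 2
    data_precision = 1.0 / std_error ** 2
    total = prior_precision + data_precision
    return (prior_precision * prior_mean + data_precision * estimate) / total


def variance_neglect_posterior_mean(prior_mean, estimate, fixed_weight=0.5):
    """Updater who ignores the interval and always moves the same fraction."""
    return prior_mean + fixed_weight * (estimate - prior_mean)


prior_mean, prior_sd = 3.0, 1.0
# Two hypothetical studies with the same point estimate but different precision.
precise = bayesian_posterior_mean(prior_mean, prior_sd, 5.0, std_error_from_ci(4.5, 5.5))
noisy = bayesian_posterior_mean(prior_mean, prior_sd, 5.0, std_error_from_ci(1.0, 9.0))
neglectful = variance_neglect_posterior_mean(prior_mean, 5.0)
print(round(precise, 2), round(noisy, 2), neglectful)
```

The Bayesian moves almost all the way to the estimate from the precise study and only a little for the noisy one, while the variance-neglecting updater lands in the same place regardless of how wide the interval was.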

Robert Wiblin: That doesn’t surprise me.

So what is the latest on your work on priors? Is that related to this paper?

Eva Vivalt: So, it is. This is one of the things that I’ve been up to. So for this particular one, we were looking at biases that policy-makers might have and biases in updating.

So, you start out with a Bayesian model and say, “okay, well look, but people aren’t Bayesian. How can we modify this model and have some kind of quasi-Bayesian model?” And so we were looking at two biases: this kind of optimism I was talking about and this variance neglect, which you can think of as some kind of extension neglect more broadly, related to the hot hand fallacy or gambler’s fallacy for people who are into the behavioral economics literature.

And we basically … It was a really simple study. We just collected people’s priors. We then showed them some results from studies, and then we got their posteriors. And we presented information in different ways, because we were also interested in knowing if the way in which we present information can also help people overcome biases if they are biased. So if you’ve got a problem, what’s the solution? And we did this not just for policy-makers, but also for researchers, for practitioners like NGO operational staff, that kind of thing. We also got a side sample of MTurk participants.

And these biases actually turned out to be pretty general. And the big thing on the solution side is more information will encourage people to update more on the evidence. So I guess if you’re in that situation of, you’ve got some bad news, come bearing a lot of data and that should help at least a little bit. So, you know, more quantiles of the data, that kind of thing. Maximum, minimum values, you know, the whole range of as many statistics as you can really.

Robert Wiblin: Hold on. So your main finding was in order to accept a negative result, people have to be confronted with the overwhelming evidence so that they can’t ignore it?

Eva Vivalt: Yeah, at least it should help.

Robert Wiblin: Were there any other discoveries?

Eva Vivalt: The other kinds of things that we’ve been doing … We have actually collected priors in a whole bunch of different settings so actually I’m in the process, also with a grad student, of trying to look at some additional biases that policy makers may have. Like omission bias, status quo bias, where people don’t want to actually change, deviate, from decisions that were made in the past where they would have to do something differently, or take action. Like there might be some bias towards inaction.

Robert Wiblin: Or at least not changing your action. Not shutting down the program.

Eva Vivalt: Yeah. Yeah, yeah. I mean, the kinds of things that bureaucracies are typically sort of criticized for. But more specifically, on the priors, we’ve also asked experts to predict effects of various impact evaluations. One thing that I’m really excited about is trying to more systematically collect priors in the future. And so, I’ve been talking with many people actually, including Stefano DellaVigna and Devin Pope, who’ve got these great papers on expert predictions, about setting up some larger websites so that in the future people could more systematically collect priors for their research projects.

I’m getting at this point an email every week roughly asking for advice on collecting priors, because I think researchers are very interested in collecting priors for their projects because it makes sense from their perspective. They’re highly incentivized to do so because it helps with, not just with all this updating work, but also for them, personally, it’s like, “Well now nobody can say that they knew the results of my study all along.” Like, “I can tell them ‘well, this is what people thought beforehand and this is the benefit of my research.’” And also, if I have null results, then it makes the null results more interesting, because we didn’t expect that.

So, the researchers are incentivized to gather these things but I think that, given that, we should be doing that a little bit more systematically to be able to say some interesting things about like … well, for example, one thing is that people’s priors might, on average, be pretty accurate. So this is what we saw with the researchers, when we gathered our researchers’ priors, that they were quite accurate on average. As individuals, they were off by quite a lot. There’s the kind of wisdom of the crowds thing.

But, if you think that you could get some wisdom of the crowds and that people are pretty accurate overall, if you aggregate, well that actually suggests that it could be a good yardstick to use in those situations where we don’t have RCTs. And it could even help us figure out where should we do an RCT, where are we not really certain what the effect will be and we need an RCT to come in and arbitrate, as it were.

So I think there’s a lot more to do there that could be of pretty high value.

Robert Wiblin: Right, okay. So, I’ve got a number of questions here. I guess, so the question we’re trying to answer, well at least one of them, is: how good are experts as a whole at predicting the likeliest outcome of a study that you’re going to conduct? Or, to put it another way, the impact of an intervention. And, I guess, the stuff that I’ve read is that experts, at least individual experts, are not very reliable. But you’re saying that if you systematically collect the expectations of many different experts, then on average they can be surprisingly good.

Eva Vivalt: Yeah. Yeah. I would say that. I think that like, again, it sort of depends a bit on — this is why it would be really nice to get systematic data across many, many different situations. Because it could just be that the ones that we’ve looked at so far are not particularly surprising, but there probably are some situations in which people are able to predict things less well, and it would be nice to know are there some characteristics of studies that can help to tell us when experts are going to be good or bad at predicting this kind of thing.

But I would agree that any one individual expert is going to be fairly widely off, I think.
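
The gap between individual and aggregate accuracy can be illustrated with a small simulation. This is only a sketch: it assumes each expert's guess is the true effect plus independent, unbiased noise, and the true effect, noise level and number of experts are made up. If experts shared a common bias, averaging would not remove it.

```python
import random
from statistics import mean

random.seed(0)  # fixed seed so the run is repeatable

truth = 0.2  # hypothetical true effect
# Hypothetical experts: each guess is the truth plus independent noise.
guesses = [random.gauss(truth, 0.3) for _ in range(200)]

# How far off the typical individual expert is.
avg_individual_error = mean(abs(g - truth) for g in guesses)
# How far off the averaged forecast is.
crowd_error = abs(mean(guesses) - truth)

print("typical individual error:", round(avg_individual_error, 3))
print("error of the average guess:", round(crowd_error, 3))
```

The error of the averaged guess can never exceed the average individual error, and with independent noise it is usually far smaller.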

Robert Wiblin: So how do you actually solicit these priors or these expectations from these experts? Have you figured out the best way of doing that?

Eva Vivalt: Yeah, so that’s an excellent question. And we tried several different things. By now, I think I’ve got a pretty good idea of what works. So, in some sense the gold standard, if people can understand it, which is a big if, is to ask people to put weights in different bins because then you can get the distributions of their priors as well. Like not just a mean, but sort of how much uncertainty is captured in that.

But that’s quite hard for most people to do. People aren’t really used to thinking of their beliefs as putting weights in bins.

Robert Wiblin: Not even people in this field of social science?

Eva Vivalt: Not really. I mean, the researchers are a bit better at it, but in any case, at least what we’ve done, is even when talking with researchers it’s better to try to be perfectly clear about what the bins mean and go through all that kind of thing beforehand.

The other thing is, if you are asking sort of more of lay public, is it’s probably better to move to asking them to sort of give ranges, as it were. So, you know, what is a value such that you think that it’s less than a 10% chance it’ll fall below this value, or less than 10% chance it’ll fall above this value, or different quantiles… I mean, you then have to make some assumptions about the actual distribution because people can give you a range but if you really want to get at some of the updating questions, you need to know a little bit more. Like, you want to know whether those distributions are normal or not. And you don’t know whether things are normally distributed if you just have three points, right?
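
As a sketch of the extra assumption being described, the snippet below turns an elicited 10th and 90th percentile into a full prior by assuming the belief is normal. The normality assumption is exactly the part the elicited range cannot confirm; the helper name and the numbers are hypothetical.

```python
from statistics import NormalDist


def normal_from_quantiles(q10, q90):
    """Fit a normal distribution to an elicited 10th and 90th percentile.

    Assumes the respondent's belief is normal, which two points cannot verify.
    """
    z90 = NormalDist().inv_cdf(0.90)  # about 1.2816
    mean = (q10 + q90) / 2.0
    sd = (q90 - q10) / (2.0 * z90)
    return NormalDist(mean, sd)


# Hypothetical answer: "10% chance it is below 0.0, 10% chance it is above 0.2".
prior = normal_from_quantiles(q10=0.0, q90=0.2)
print("implied mean:", prior.mean)
print("implied sd:", round(prior.stdev, 4))
```

A skewed or fat-tailed belief could produce the same two answers, which is why bins, where respondents can manage them, carry more information.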

Robert Wiblin: Yeah, yeah. So that sounds like a really exciting research agenda, but we’ve got to push on because there’s quite a lot of other papers that you’ve published in the last few years that I want to talk about.

Another one that you’ve written up, which is a bit more hopeful, is ‘How Often Should We Believe Positive Results: Assessing The Credibility Of Research Findings In Development Economics.’ And of course, most of social science is facing a replication crisis where we’re just finding that many published results in papers don’t pan out when you try to do the experiment again. What did you find in development economics?

Eva Vivalt: Yeah, so actually the situation was a lot better than I would have initially thought. So I think this is actually quite a positive result. It could be biased from the kinds of studies that we included. Like we had a lot of conditional cash transfers in there. They tend to have very large sample sizes, so they’re kind of like the best case scenario. But nonetheless, the false positive report probabilities are actually quite small.

Robert Wiblin: Are you able to describe the method that you applied in that paper? Obviously, you weren’t replicating lots of these studies, you must have used some other method to reach this conclusion.

Eva Vivalt: Yep. Well, there’s quite a lot of nice literature here that I can refer people on to. The false positive and false negative report probabilities, the equations for how to calculate those are coming from out of a paper by Wacholder et al. There’s some other people who’ve also looked at this. Where essentially the probability that you’ve got a false positive or a false negative depends a bit on the priors that you’ve got.

So for example, if you think of some study that is looking at, I don’t know, something we really don’t believe to exist, like extra sensory perception or something, right? And if you found some positive result for that well, nobody’s going to trust a study that shows that ESP is real. And to really show that credibly, you would need to have lots of studies with really, precisely estimated coefficients.

Again, your priors are going into it, the statistical significance or your p-values that you’ve found would go into it and that’s just an equation you can sort of write out.
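
For reference, a minimal sketch of that equation, in the standard form of the false positive report probability: the share of significant results that are false positives, given a significance level (or observed p-value), the study's power, and the prior probability that the effect is real. The function name and example inputs are hypothetical and are not taken from the paper.

```python
def fprp(alpha, power, prior):
    """False positive report probability.

    alpha: significance level, or the observed p-value
    power: probability of a significant result if the effect is real
    prior: prior probability that the effect is real
    """
    false_positives = alpha * (1.0 - prior)
    true_positives = power * prior
    return false_positives / (false_positives + true_positives)


# Hypothetical cases.
plausible = fprp(alpha=0.05, power=0.8, prior=0.5)       # well powered, plausible
esp_like = fprp(alpha=0.05, power=0.8, prior=0.001)      # very implausible claim
underpowered = fprp(alpha=0.05, power=0.2, prior=0.5)    # plausible but low power
print(round(plausible, 3), round(esp_like, 3), round(underpowered, 3))
```

With these made-up inputs, a significant result for a highly implausible hypothesis is almost certainly a false positive, even from a well-powered study.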

The other thing is that there are these type S and type M errors that Andrew Gelman and some co-authors talk about. And these are the probability that if you’ve got a statistically significant result, it’s actually of the right sign-

Robert Wiblin: So it’s positive rather than negative, or negative rather than positive.

Eva Vivalt: Yeah, yeah. Because you would be surprised, but it’s actually true that if you’ve got low-powered results, then even if you find something statistically significant, there is some probability that the true value is negative when you see something that says it’s positive, or vice versa.

Robert Wiblin: Yeah, and then there’s type M errors?

Eva Vivalt: Yeah, so this is the same kind of thing except for magnitude. So, you’ve found some significant result and it has a certain magnitude, but chances are that’s actually incorrect in some way. Like it’s most likely inflated in value, so the truth is likely to lie lower than that.
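
Both quantities can be computed for a hypothetical study design. The sketch below assumes the estimate is normally distributed around a positive true effect with a known standard error. Here the type S rate is the chance that a significant estimate has the wrong sign, and the type M figure is the average factor by which significant estimates overstate the true effect. The function name and inputs are illustrative only.

```python
from statistics import NormalDist


def sign_and_magnitude_errors(true_effect, std_error, alpha=0.05):
    """Power, type S rate and type M ratio; assumes true_effect > 0."""
    std = NormalDist()
    crit = std.inv_cdf(1 - alpha / 2) * std_error  # significance cut-off
    z_hi = (crit - true_effect) / std_error
    z_lo = (-crit - true_effect) / std_error
    p_hi = 1 - std.cdf(z_hi)   # significant with the correct (positive) sign
    p_lo = std.cdf(z_lo)       # significant with the wrong (negative) sign
    power = p_hi + p_lo
    type_s = p_lo / power
    # Expected |estimate| over significant results, from truncated-normal means.
    abs_sum = (true_effect * p_hi + std_error * std.pdf(z_hi)) \
        + (-true_effect * p_lo + std_error * std.pdf(z_lo))
    type_m = abs_sum / power / true_effect
    return power, type_s, type_m


# Hypothetical low-powered study: true effect 2, standard error 8.1.
power, type_s, type_m = sign_and_magnitude_errors(true_effect=2.0, std_error=8.1)
print(round(power, 3), round(type_s, 3), round(type_m, 1))
```

In this made-up design, power is only a few percent, a sizeable minority of significant results point the wrong way, and significant estimates overstate the true effect many times over.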

Robert Wiblin: So how did you put together this information to try to figure out what fraction of results were accurate? I’m not quite understanding that.

Eva Vivalt: Sure, sure, sure. So, the main source of data that we used here is we had to get a whole bunch of expert beliefs, because these were inputs into the equations. And to get the expert beliefs we did one thing that’s not 100% kosher, but is the best kind of approximation we could do, which is that we didn’t want to wait until a lot of impact evaluations were over. Like a lot of the other work that I’ve done on priors, also with Aiden, we are actually waiting until all the results of the real studies come out. But for this we wanted a bunch of results to use already, as it were. So what we did was we used AidGrade’s database of impact evaluation results and we said, “Okay let’s go to topic experts,” like people who have, for example, done a study on a conditional cash transfer program, and then ask them “which of all these other programs have you heard about?”

They were also all conditional cash transfers programs but, you know, ones by other people. And then for the ones that they hadn’t heard about, we asked them to make up to five predictions about the effects that those studies would find. We’d describe the studies to them in great detail and then got their best guess.

Then, using this data we could say something about the false positive report probability, because then we’ve got the p-value that each study found and we’ve got what we’re considering to be the prior probability of some kind of non-null effect. We needed, actually, them to also give a certain value below which they would consider the study to have not been successful. Like, if the conditional cash transfer program doesn’t improve enrollment rates by, I don’t know, 5 percentage points then it’s not successful, because we wanted to … All these equations deal with sort of like, the likelihood that some particular hypothesis is true. For us we wanted … there’s like some critical threshold above which we would think that it had an effect, versus not have an effect. Some meaningful effect. The minimum meaningful, kind of like the minimum detectable effect size.

So we create this probability of attaining this non-null effect, given the distribution of priors and given this particular cut-off threshold. And those are just inputs to this equation, along with the power of the study.
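
One way to picture that input is to take an expert's elicited bin weights and add up the mass at or above the cut-off. This is a sketch rather than the paper's procedure: the bins, weights and threshold are invented, and mass is assumed to be spread uniformly inside any bin that the threshold cuts through.

```python
# Hypothetical elicited prior over the effect on enrollment, in percentage
# points: (bin lower edge, bin upper edge) -> probability weight.
bins = {(-5, 0): 0.05, (0, 5): 0.35, (5, 10): 0.40, (10, 15): 0.20}


def prob_at_least(bins, threshold):
    """Prior probability that the effect is at or above the threshold."""
    total = sum(bins.values())
    above = 0.0
    for (lo, hi), weight in bins.items():
        if lo >= threshold:
            above += weight  # whole bin lies above the cut-off
        elif hi > threshold:
            # Threshold splits the bin: assume uniform mass within it.
            above += weight * (hi - threshold) / (hi - lo)
    return above / total


prior_success = prob_at_least(bins, threshold=5)
print("prior probability of a meaningful effect:", round(prior_success, 3))
```

The resulting probability plays the role of the prior in the false positive report probability calculation.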

Robert Wiblin: Right. Okay. I think I understand now. So, you’ve got all of these different studies looking at the effect size on different outcomes, and they have different levels of power. So different kind of sample sizes and different variances in them. And then, you’re collecting priors from a bunch of different subject matter experts, and then you’re thinking, “Well, if we took those priors and updated appropriately based on the results in those studies, how often would we end up forming the wrong conclusion?” Or is actually just that; what if you took the point estimate from that study, how often would you be wrong relative to if you’d updated in a Bayesian way? Is the second right? Or am I totally wrong?

Eva Vivalt: So I would think of it in a different way. If you see a positive, significant result, there’s some probability that it just happened to be that way by chance and there’s some probability that that’s a true thing.

Robert Wiblin: And especially if it was unlikely to begin with, then it may well still probably be wrong, because of, kind of-

Eva Vivalt: Yes.

Robert Wiblin: Regression to the mean effect.

Eva Vivalt: Yeah, if you think that it’s really unlikely a priori and you observe it, it’s more likely to be a false positive. If you’re under-powered to begin with, it’s more likely to be a false positive. If it’s got a p-value of 0.049, it’s more likely to be a false positive. So, these are all just sort of factors that go into it and you could do the same kind of thing for false negatives actually. Yep.

Robert Wiblin: Okay. Well, let’s push on.

You did another paper on specification searching, which is the practice where people who are writing a paper try out a whole lot of different specifications to try to, I guess, get the answer that they’ll like and they publish just the results of that, like, to show you. And you were trying to figure out how common this practice is in different disciplines and researchers using different methods. How did you try to do that and what did you find?

Eva Vivalt: Yeah, this paper is similar in methodology to some papers by Gerber and Malhotra and others, where … and also there’s some work by Brodeur et al. looking at essentially the distribution of statistics. Say you’ve got a bunch of different studies, you’ve got a bunch of different t-statistics from each of those results, what you would expect is that there’s going to be some smooth distribution of those statistics. I mean, hopefully. But what you actually observe in the data is there’s some lumpiness and in particular there tends to be some slightly lower density of results that are just marginally insignificant than you would expect, and some sort of bump in the distribution just above the threshold for statistical significance, which is usually at the 0.05 level. So 1.96.

So you’ll see like, relatively few results around 1.95 and relatively more results than you had anticipated having around 1.97. That’s the general intuition, right?
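
That intuition can be sketched as a simple caliper-style check: count the test statistics falling in a narrow band just below 1.96 and just above it, then ask how surprising the split would be if a result were equally likely to land on either side. This is an illustration in the spirit of that literature, not the paper's actual procedure; the band width and the t-statistics are invented.

```python
from math import comb


def caliper_test(t_stats, threshold=1.96, width=0.1):
    """Counts just below/above the threshold and a one-sided binomial p-value."""
    below = sum(1 for t in t_stats if threshold - width <= abs(t) < threshold)
    above = sum(1 for t in t_stats if threshold <= abs(t) < threshold + width)
    n = below + above
    # P(at least `above` of n results land above) if each side is equally likely.
    p_value = sum(comb(n, k) for k in range(above, n + 1)) / 2 ** n
    return below, above, p_value


# Hypothetical t-statistics collected from a set of papers.
t_stats = [0.5, 1.2, 1.88, 1.91, 1.95, 1.97, 1.97, 1.98, 1.98, 1.99,
           2.00, 2.00, 2.01, 2.02, 2.03, 2.04, 2.04, 2.8, 3.1]
below, above, p_value = caliper_test(t_stats)
print(below, above, round(p_value, 4))
```

A small p-value here says only that the split around the threshold is lopsided. As the discussion that follows makes clear, interpreting it as specification searching is more credible the narrower the band.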

Robert Wiblin: Yeah. And that’s an indication that people were fishing around to find the specification that would just get them over the line to be able to publish.

Eva Vivalt: Exactly. But I mean it’s not as straightforward as just that because you can imagine that … what is that distribution supposed to look like in reality? And there’s other reasons why you might expect to see some more statistically significant results. For example, people design the studies such that they can find significant results in the first place. So, it’s not 100% straightforward to just say, “Oh yeah well we’ve got a lot of significant results and therefore it must be specification searching.” I think it becomes more credible that it is specification searching if you can say, “Yeah but it’s within a really small band, right around the threshold for significance.” As you expand the band out a little bit, I think you could try to argue-

Robert Wiblin: There are other possible explanations.

Eva Vivalt: Yeah, exactly. That like, people are designing this study very cleverly just to get significance. Although, honestly to be fair, I think it’s difficult to swallow that people are designing the study perfectly appropriately to just barely get statistical significance, right? I mean it’s so hard to predict what the effects will be anyways, and then your hands are a little bit tied from the fact that generally when you’re doing this you have got a given budget and you can’t really exceed that budget anyways. So you’re dealing with a certain sample size and having to adapt your study accordingly. It’s not like you’ve got free rein to perfectly maximize.

Robert Wiblin: Okay, so the alternative innocent explanation is that people can anticipate ahead of time what the effect size will be, and then they choose the sample size that will allow them to get a p-value just below 0.05. So they’ll be able to publish the paper at minimum cost.

Eva Vivalt: Yeah.

Robert Wiblin: But in reality it’s just it’s a bit hard to believe that that explains most of what’s going on, especially given that we just know that lots of academics in fact do do specification searching.

Eva Vivalt: Yeah. It’s just people don’t have as fine control over the design of study as you would perhaps anticipate because funding is somewhat out of their hands. Also, because any one given paper is going to be looking at so many different outcomes, so how can you really design a study so that you are just barely significant for outcome A and B and C, you know? And so like it becomes a little bit implausible. But that would be the best case for the contrary view.

Robert Wiblin: Yeah. Okay. So you looked for this suspicious clumping of p-values or effect sizes across a whole lot of different methods and disciplines, and what did you find?

Eva Vivalt: Yeah, actually the situation seemed a lot better for RCTs than non-RCTs, which is kind of understandable if you think about it because I think RCTs generally have an easier time getting published these days anyways. It could be reflecting that, that you don’t need to engage in specification searching if you’ve got an RCT and people are more likely to publish your results anyways, even if they’re null.

The other thing is that things do seem to be changing a little bit over time. In particular the non-RCTs, as time goes on they become more and more significant, as it were. Let’s just not lean too hard on this explanation but it could be, in the old days, maybe you would lie and say, “Well, I’ve got a non-RCT and it found a value of 1.97.” People would be like, “Oh, okay. 1.97, I believe that.” And nowadays if you see 1.97 everybody’s like, “Wait a second.” So now, you’ll see values that are more like 2.1 or something, right? It’s like values that are a little bit farther out there and more significant.

Robert Wiblin: I see. Okay, so you’re saying because people have learned that this is kind of an indication of specification searching, people have to go even further and find specifications that get them an even more significant result so it doesn’t look suspicious.

Eva Vivalt: Yeah, maybe, yeah. That would be the intuition. Again, I can’t like 100% say, but it would be consistent with that at least.

Robert Wiblin: It sounds to me like you’ve been doing quite a lot of work on this Bayesian approach. Looking into priors and updating, based on those. Does it feel like development economics is becoming more Bayesian? And is that a good thing?

Eva Vivalt: You know, actually, honestly, I believe it is and that’s really exciting. These days I don’t have to worry quite so much about … I’m definitely hardcore Bayesian and I think that it’s a little bit easier for me to talk about things that rely on a Bayesian interpretation.

Robert Wiblin: Do you think there are any downsides to Bayesian methods being applied more often? I guess one thing I worry about is people kind of fiddling with the priors in order to get the outcome that they want. Or perhaps there’s a bit more flexibility and there’s more possibility for specification searching.

Eva Vivalt: Hmm. Honestly we’re probably not going to go down the route of being … I don’t see the discipline as becoming fully Bayesian any time in the near future. I just don’t see the likelihood of that. What I do think though is that … so it is true that what researchers do and what policy makers do could be a bit different. It might be fine to be different. I’ve heard the argument that researchers should be very concerned about getting unbiased estimates and policy makers … there’s this bias-variance tradeoff that I care actually very passionately about and that others care passionately about as well, I believe.

Robert Wiblin: Did you want to explain what that is?

Eva Vivalt: Sure. The bias-variance tradeoff is essentially saying that you’ve got several sources of prediction error. You’ve got some error due to possible biases, you’ve got some error due to variance and you’ve got some other idiosyncratic error. And this is something that is generally true in all contexts, in all ways, and comes up in different ways.

An example is: if you think of nearest neighbor matching, if you want you can include more neighbors, and if you include more neighbors you’ve got more observations, so you’ve got more precise estimates. Like lower variance estimates. But on the other hand, if you’re including more neighbors, you’ve got some worse matches. So you’re increasing your bias. And so, all estimation approaches are going to have some error due to bias and some error due to variance. And economists have focused really narrowly on producing unbiased estimates, and if all you care about is prediction error … I know Andrew Gelman takes this view and so do I and so do other people like Rachael Meager I think and others. We’re like, “Well hang on, why do we care just so much about getting unbiased estimates?” You also care about having precise estimates, too. It would help for prediction error to maybe accept a little bit of bias.
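
A small worked example of accepting a little bias in exchange for lower variance is sketched below. It swaps in a simpler case than nearest neighbor matching: a sample mean shrunk towards zero. Mean squared error is squared bias plus variance, and every number here is hypothetical.

```python
def mse_of_shrunk_mean(true_mean, sd, n, shrink):
    """Mean squared error of (shrink * sample mean) as an estimate of true_mean."""
    bias = (shrink - 1.0) * true_mean       # zero when shrink == 1
    variance = shrink ** 2 * sd ** 2 / n    # shrinking reduces variance
    return bias ** 2 + variance


# Hypothetical setting: small true effect, noisy outcome, ten observations.
unbiased = mse_of_shrunk_mean(true_mean=0.2, sd=1.0, n=10, shrink=1.0)
shrunk = mse_of_shrunk_mean(true_mean=0.2, sd=1.0, n=10, shrink=0.8)
print("unbiased estimator MSE:", round(unbiased, 4))
print("shrunk (biased) estimator MSE:", round(shrunk, 4))
```

In this made-up setting the deliberately biased estimator has the lower overall error. With a much larger true effect or a much larger sample, the comparison would go the other way.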

And the argument I’ve heard is that “maybe researchers should be unbiased, but policy makers interpreting the evidence, it’s okay to accept a bit more bias there.” Maybe the … you don’t need every person at every layer to be reducing prediction error as much as possible. I think that like in practical terms, if you’re an effective altruist, et cetera, you do care about minimizing prediction error regardless of the source. But then it’s a slightly separate question to say what researchers should be doing per se.

Robert Wiblin: So I’ll stick up links to both Andrew Gelman’s blog and a description of the bias-variance tradeoff.

So as I understand it you’re saying that there’s different statistical methods that you could use that would be systematically too optimistic or pessimistic, but would be more precise, is that right? And in general, people go for something that’s neither too optimistic or pessimistic, but is not as precise as it might be. It has like larger average mistakes, and it’s just not clear why we’ve chosen that particular approach.

Eva Vivalt: Yeah. So, there’s a nice diagram that you can throw up if you’re putting links to things that sort of shows the bias-variance tradeoff really, really nicely, I think. Where you’ve got prediction error on one axis and you’ve got different curves of error for if you’ve got biased estimates or if you’ve got estimates with high variance, low precision. Your total prediction error is going to be some function of both of these things as well as some other error. And economists have focused really quite a lot on getting unbiased estimates.

You would think that if anywhere this consideration might come up a little bit in the process of using machine learning because there there’s a lot of techniques that are biased that people accept. Like Lasso or ridge regressions and all sorts of other things, but even there, if you talk to people who are actually involved with these kinds of methods, they’re highly focused on getting unbiased estimates so that the rest of the profession accepts them, which I think is kind of a shame in some regards.

But again, I want to be a little bit agnostic because I’m not 100% sure actually myself what is the best way of going about it, I just feel that at least at the time of making a policy decision, we should be minimizing overall prediction error regardless of the source of that error. Whether it’s bias or variance. I’m not sure what the researchers should do. That’s, I think, like I said, a slightly separate problem. But I do think we’re not paying attention to prediction error as much as we should.

Robert Wiblin: Alright. Let’s turn now to some of the implications of this work and some research that we’ve done for people involved in the effective altruism movement. So we wrote this article, ‘Is It Fair To Say That Most Social Interventions Don’t Work?’ Ben Todd worked on that and put it up last year. It’s one of the articles on our site that I like, I think, the most out of all of them. And the reason we looked into it is that in a lot of our talks, for many years, we’ve been saying most social interventions, if you look at them, don’t work. That’s on the basis of looking at lots of randomized controlled trials and saying, well, most of them seem to produce null results; the interventions that they’re looking at don’t seem to be helping.

But then we had some doubts about that, because we’re thinking “it’s possible you’re getting false negatives for example, and it’s possible that an intervention works in some circumstances and not others.” So, is there anything that you want to say about that article possibly? We could walk through the various different moves that we take and then try to reach a conclusion about it.

Eva Vivalt: Yeah, it’s a really difficult question because, like you say, there are lots of things that go into it. Null results could just be underpowered. The other big thing is that unfortunately we tend to do impact evaluations in some of the better situations in the first place, and this would sort of work in the other direction. Like, so many impact evaluations just fall apart and never happen and we don’t actually observe their outcomes because the study just fell apart.

So yeah, it’s hard to say, to be honest, but happy to walk through-

Robert Wiblin: Sure, sure. Okay. So one of the things is; only some interventions are ever evaluated and they’re probably ones that are better than others, because you would only bother spending the money on an RCT if it looks really positive. Do you have any sense of how big that effect is?

Eva Vivalt: Honestly, I don’t, but I will say that there’ve been some people looking at the impact evaluations that don’t end up happening. Like David McKenzie and some other people were trying to pool together some estimates of this. And I think that problem is actually quite large. It’s not necessarily that it’s … it’s a little bit distinct from the problem that we only try to study those things that have some chance of being really highly effective. It’s also that even within a particular topic, that is highly effective or that we suspect is highly effective, the ones that end up happening are the better instantiations of that particular program. Like the government in that particular area had it more together or whatever else. So we’re getting biased estimates as well that way.

Robert Wiblin: Okay. So, we kind of start with this quote from David Anderson, who does research in this area, and he says it looks like 75% of social interventions, that he’s seen, have weak or no effects. And this suggests that it might even be worse than that because there’s all of these programs that aren’t even being evaluated, which are probably worse. Maybe it’s 80 or 90% of social interventions have small or no effects.

But there’s other things that we need to think about. So, there’s lots of different outcomes that you could look at when you’re studying an intervention. You might think “you’ve got this change in a school, should it be expected to improve their math scores or their English scores or how much they enjoy being at school?” All of these different things, which I guess pushes in the direction of being over-optimistic because the papers that get published can kind of fish for whichever one they found a significant effect in. But even if we were honestly reporting the results, it then just becomes kind of unclear which were the things that you kind of expected to have an effect on anyway. It just makes it quite confusing. What actually are we saying when we say 75% of things have weak or no effects? Was it just a primary effect or on many of them?

Eva Vivalt: Yeah. That’s totally fair because oftentimes a study will throw in all sorts of random other things that they don’t actually honestly anticipate there being an effect on, but if you’re doing the study anyways, why not?

Robert Wiblin: Yeah, yeah, yeah. So it turns out that this change at the school didn’t make the students happier, would you expect it to anyway? Maybe they were just curious about that. So it’s really unclear what you’re sampling across.

Then there’s this issue of: we said no effect or weak effects is often how this quote is given, but then what is a weak effect? That’s just kind of a subjective judgment. Is it relative to the cost? Is it relative to the statistical significance? Is it material? Again, that just kind of muddies the water, and if you think about it, it becomes a much more subjective kind of claim. Do you have anything to add to that?

Eva Vivalt: Not really. I mean-

Robert Wiblin: Does this come up in your own research?

Eva Vivalt: I mean, to me, what I would find the important question is actually in some ways … I realize that obviously for the purposes of this post that you’ve put together with Ben Todd, et cetera, I think that the question is really interesting, of which have any effect whatsoever, but I would a little bit think that another important question would be “which matter relative to some other outside” … I guess, which matter at all is a good question, but I always think about what is the outside option, and what the outside option is really matters.

So when you were talking about weak effects, yeah probably they are talking about statistical significance, but you can also think of weak effects as like “sure it has an effect but so what? We can do so much better.”

Robert Wiblin: Mm-hmm (affirmative), yeah. And then, I think the part of the article that you helped with was moving from talking about individual studies, where very often you get null results to meta analyses where you combine different studies. And then more often, I think you find that an intervention works, at least on average. Do you want to talk about that?

Eva Vivalt: Yeah, if you’ve got some underpowered studies then combining them does tend to improve the situation slightly. It depends a little bit on exactly how you’re doing it and what kinds of things you’re including, but I’d say by and large you do end up with … because you’re essentially adding some power when you do a meta-analysis, by at least partially pooling results from different studies.

Robert Wiblin: And so you can pick up smaller effects.

Eva Vivalt: Yeah.

Robert Wiblin: Which means that, I guess, more of them become … like just jump over the line of being positive or material or observable.

Eva Vivalt: Becoming significant, not necessarily-

Robert Wiblin: Statistically.

Eva Vivalt: Yeah exactly. It could be like a very small effect, but …
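The point about pooling adding power can be illustrated with a minimal sketch. It assumes fixed-effect (inverse-variance) pooling, which is a simplification of the partial pooling mentioned above, and the four study estimates and standard errors are hypothetical numbers chosen for illustration, not values from any real database.

```python
import math

def pool_fixed_effect(estimates, std_errors):
    """Inverse-variance (fixed-effect) pooled estimate and its standard error."""
    weights = [1.0 / se ** 2 for se in std_errors]
    total_weight = sum(weights)
    pooled = sum(w * e for w, e in zip(weights, estimates)) / total_weight
    pooled_se = math.sqrt(1.0 / total_weight)
    return pooled, pooled_se

# Hypothetical underpowered studies: same small effect, large standard errors.
estimates = [0.10, 0.10, 0.10, 0.10]
std_errors = [0.08, 0.08, 0.08, 0.08]

# z-statistic of each study on its own, then of the pooled estimate.
individual_z = [e / se for e, se in zip(estimates, std_errors)]
pooled, pooled_se = pool_fixed_effect(estimates, std_errors)
pooled_z = pooled / pooled_se

print("individual z:", individual_z)
print("pooled estimate:", pooled, "pooled SE:", pooled_se, "pooled z:", pooled_z)
```

Under these assumed numbers no single study clears the conventional 1.96 threshold, but the pooled estimate does, even though the effect size itself is unchanged and still small.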

Robert Wiblin: Well there’s a bunch of other moves that we make here, or adjustments up and down, but what we were trying to kind of get at is how much of a gain do you get by picking the best interventions or trying to be evidence based rather than just picking something at random. And I think the conclusion that we reached after looking at all of this, is that it’s perhaps not as much as you might … or people who are extremely supportive of doing more empirical work might hope, because, one, the measurements are somewhat poor. So there’s often a good chance that you think you’ve picked the best intervention from a pool but in fact you’ve gotten it wrong.

But also that even if there’s like a small fraction of the interventions that you might be sampling from that are much more effective than others, even if you choose at random, you still have a reasonable chance of picking one of those anyway. Which means that, let’s say that there’s like 10 different interventions and only one of them works. If you pick at random, you can’t do worse than a tenth as well as definitely picking the best one because you have a one in ten chance of picking it anyway.

Which I guess is perhaps something that I think effective altruism hadn’t thought as much about. We often tended to compare the very best interventions with the very worst ones, but it’d be a very peculiar strategy to try to find the very worst ones and do those. Instead you should really compare your attempt at picking the best intervention with kind of picking at random among things that have been studied. In which case the multiple in effectiveness that you get probably isn’t going to be huge. Do you have any comments on that?
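The ten-intervention example can be written out as simple arithmetic. This is a minimal sketch: the first list encodes the case described above (one of ten interventions works), while the second list of more tightly clustered values is a hypothetical added purely for contrast.

```python
def gain_over_random(values):
    """Ratio of the best intervention's value to the expected value of a random pick."""
    expected_random = sum(values) / len(values)
    return max(values) / expected_random

# Ten interventions, only one of which works (value 1), the rest do nothing.
one_in_ten = [1.0] + [0.0] * 9

# Hypothetical contrast case: five interventions with fairly similar effects.
clustered = [1.0, 0.9, 0.8, 0.7, 0.6]

print("one works out of ten:", gain_over_random(one_in_ten))
print("tightly clustered:", gain_over_random(clustered))
```

In the first case, reliably picking the winner is about ten times as good as choosing at random, which is the upper bound the example points to. In the clustered case the multiple is much closer to one.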

Eva Vivalt: Yeah. I mean this is a little bit similar to when I was trying to look at like how much we can learn from an impact evaluation. I had to make assumptions about what that outside option is that the policy makers are considering. And just sort of based on the distribution of effects that I saw in AidGrade’s database, it’s actually reasonable that a lot of these projects, a lot of interventions have got somewhat similar effect sizes, at least without taking cost-effectiveness into consideration. Obviously I’d love to take costs into consideration but it’s very hard to because like 10% of studies say anything about costs and then it’s not very credible when they do say it.

But things were pretty tightly distributed. So I tried some different specifications. Like I was saying, trying out 50% of the effect of another program or 90% of the effect of another program, like how well can you distinguish between two programs, one of which is 90% of the value of the other one, as it were. You have to make some pretty strong assumptions there. Things do seem to be … so, I don’t know. That’s how I’ve gone about it in the past.

Robert Wiblin: Things seem to be fairly clumped together, you’re seeing?

Eva Vivalt: Well, out of the ones in AidGrade’s database, and again without taking costs into consideration. I’m not trying to make a broader claim than that because there’s just no data.

Robert Wiblin: Right. Okay, so I was just about to bring this up next, which is like four years ago or so, Robin Hanson responded to one of your graphs from AidGrade, which seemed to suggest that if you looked at effect sizes in terms of standard deviation improvements then you kind of found a normal distribution of effect sizes and it wasn’t that widely dispersed, as you’re saying. And he was saying “well this was a bit in conflict with the standard line that people in effective altruism give, which is that there’s massive distributions in how cost effective different approaches are. That it’s not just normal, but it’s lognormal or power law distributed, or something like that. Which gives you much greater dispersion between the best and the average and the worst.”

Did you ever respond to that? Because I think we ended up concluding it might be a bit of a misunderstanding.

Eva Vivalt: I think that, yeah … so there’s two things that are certainly not included. One thing I just alluded to is costs. That’s saying nothing about the cost-effectiveness of a particular intervention and I would love to have been able to produce those graphs for the cost-effectiveness. But, like I say, the thing is that papers just don’t report costs, and they should. But they don’t do. So it’s really hard for me to come in as an outsider to each of these papers and say, “Oh yeah, but actually I know what the costs are.”

One could make strong assumptions about those and try to infer what costs are from other studies, et cetera, but it’s quite hard to do and not very credible. So I’m sure one could do it, but probably not in an academic setting. I haven’t been pursuing it but I would love for other people to pursue it and I’m sure that other people are pursuing it.

Robert Wiblin: Well the other thing, if you want to move to cost-effectiveness you also have to think about the actual welfare gain from the different improvements.

Eva Vivalt: Exactly. So, that’s the other thing I was going to say: then how can you actually value these outcomes? Because the outcomes are pretty … they don’t have intuitive value to them, right? How do you value an extra year in school versus a centimeter of height, right? How do you think about that kind of thing? What does that actually mean in terms of value? So then you need some additional mapping from the outcomes to something that we value.

Robert Wiblin: So, yeah. Is it possible that we start with this normal distribution of standard deviation changes and then because costs per recipient are so wildly distributed and the benefits per standard deviation improvement are so wildly distributed that you still get very wide dispersion in the cost effectiveness of different interventions?

Eva Vivalt: You could do.

Robert Wiblin: Mm-hmm (affirmative), you could. Yeah.

Eva Vivalt: I just have not a very clear sense of that because I don’t have a clear sense of the costs.
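The possibility raised here, that similar effect sizes could still hide wide dispersion in cost-effectiveness, can be sketched with toy numbers. Everything below is hypothetical: the intervention names, effect sizes, and per-recipient costs are invented for illustration, and the sketch ignores the further step of valuing each standard deviation of improvement.

```python
# Hypothetical interventions: (name, effect in standard deviations, cost per recipient in dollars).
interventions = [
    ("A", 0.10, 2.0),
    ("B", 0.12, 20.0),
    ("C", 0.15, 200.0),
    ("D", 0.08, 5.0),
]

def spread(values):
    """Ratio of the largest value to the smallest."""
    return max(values) / min(values)

effects = [effect for _, effect, _ in interventions]
effect_per_dollar = [effect / cost for _, effect, cost in interventions]

effect_spread = spread(effects)
cost_effectiveness_spread = spread(effect_per_dollar)

print("best/worst effect size:", effect_spread)
print("best/worst effect per dollar:", cost_effectiveness_spread)
```

With these assumed inputs the effect sizes differ by less than a factor of two, while effect per dollar differs by more than a factor of sixty. Whether real costs vary this much is exactly the open question, since so few studies report them.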

Robert Wiblin: Okay, it’s not just positive. Other people could look at this and try to figure that out.

Eva Vivalt: Yeah, yeah, and I really hope somebody does.

Robert Wiblin: I guess there’s also the Disease Control Priorities Project, of course, has produced cost effectiveness estimates for lots of different health treatments and find that they’re extremely widely dispersed. But I think that their resourcing per intervention that they’re looking at isn’t so good, and very often they rely on modeling rather than empirical results, which might be causing them to overstate the variance because some of it is just mistakes on their part.

Eva Vivalt: I see. Yeah, no, I’ve heard a little bit about that. That makes a lot of sense. I think that one thing that is certainly necessary and I hope happens in the near future is some attempt at also adding values to these other things that we might care about. Like all the educational stuff, et cetera, to sort of be able to compare them with health interventions, et cetera. Because the same kind of way that they do the disability adjusted life years, et cetera, they could do for some kind of more general well being.

Robert Wiblin: Right, yeah.

So I really want to try to pin you down a little bit on how valuable is being empirical? Because it seems like you’ve got some positive results and some negative results, you’ve got the generalizability doesn’t seem so good so can we really learn so much? On the other hand it looked like some of your research suggests that in fact most of the results that show positive effects are kind of right about that. And then we’ve got to consider I guess the cost of doing these different studies and whether people actually respond to it in government. Did you have … you’ve been working in this area for five or ten years now, have you updated in favor of empirical social science or against it?

Eva Vivalt: I think it’s the only game in town to be honest. As much as we may criticize some of the things that come out of standard research, I guess the only answer in terms of what to do next is more of the same. And with some improvements, but more is better. And I think people are a little bit more aware of and focused on addressing some of the limitations in past research, both in terms of — people are thinking more now about the differences in scale up. People are thinking a bit more now about how results actually feed into the policy process. So, for me I think there’s incremental change, but I’m certainly pro-empirical work because what’s the alternative? It’s not-

Robert Wiblin: Well, I think there are alternatives. One is, as you were saying, just survey people on their expectations about what works, even before you’ve run any studies. And it could just be that that gets you a lot of the way and it costs very little, so maybe we should just do that and then screw the RCTs, or only do them occasionally.

Eva Vivalt: I don’t want to rule out the possibility that we can learn something from … I think we can learn more using observational data, which priors would also be similar to. And I don’t want to rule out that we can learn something from those, I just, maybe this is just a matter of semantics. Like I would still consider that, in some sense, empirical work because what you could do is try to say, “well yes but I want to try to”-

Robert Wiblin: A systematic survey.

Eva Vivalt: “Figure out which are the … yeah. Figure out the situations in which this is actually relatively okay,” and then it’s some approximation strategy that’s still not quite valid but better than nothing. I think that’s still somewhat empirical.

Robert Wiblin: Yeah. You’re the third person that I’ve brought this question up with on the show, and I think part of it is that I feel a little bit of guilt from I used to work at Giving What We Can, and we pushed this view that people should base their giving on randomized control trial results. Pretty hard. And now I do just have some doubts about whether we had at least strong evidence at the time to know that basing your decisions on that actually is such good advice. One is; the results often aren’t that reliable, they don’t generalize, they’re expensive to deliver, so maybe we should use other methods.

I guess you’re saying that these surveys of opinion are empirical in their own way, but they’re kind of a different sort of empiricism. Speaking to people who have some experience on the ground of what they think works and what doesn’t, and then getting their kind of just overall judgment about how this system that they’ve observed functions and what they expect might move it in one way or another. That’s kind of what the RCT movement was pushing against, was this like, “Oh, just rely on experts to kind of intuitively know what works and what doesn’t.”

But RCTs have their own problems and maybe that method was more cost effective in its own way.

Eva Vivalt: Yeah, I mean I don’t know, I wouldn’t necessarily pit these against each other. You can even think that there could be some ways of integrating them. Like, for example, one thing I’ve thought somebody should do, and I’m saying this because I hope that somebody who is listening actually goes and does this, because I don’t want to do it myself but I hope somebody does it. I think we could use observational data a lot more, and one thing that we could do pretty easily is, if you’ve got observational data, one thing that the RCT community might like is if you tried to use it to design better RCTs, right? If you can say things like “well RCTs are good because they can help to really determine what is causing what and what mechanisms work to” … But you could have like different kinds of models of the world in mind and different kinds of mechanisms that could be at play.

In particular, there’s this literature on Bayesian networks, which is you can think of these essentially as like graphs or graphical models as well. So there’s this … sorry to be a little bit … there’s a lot to explain to try to-

Robert Wiblin: No, no, no, go for it.

Eva Vivalt: Describe this idea. Graphical models is an alternative way of looking at the world. You’ll often hear of like directed acyclic graphs, et cetera. Nobody uses them per se in mainstream economics and I think that’s actually, in some respects, fine because there is some work that sort of shows that graphical models are equivalent to the more like Heckman-style structural models, and Rubin’s Potential Outcomes framework, which most of economics is based in Rubin’s Potential Outcomes framework these days. These things are all kind of equivalent to one another in that a theorem in one is a theorem in another, and they’re just emphasizing different things.

One thing that graphical models are good at emphasizing is that you can have all sorts of different graphs, as it were. You could have some things causing some other things in different ways and different relationships between different things. And if you’re doing an RCT, presumably, you want to be able to say something at the end of the day. To put it in terms of the graphical models approach, to say something like “this is the graph and these are mechanisms.”

But one thing that graphical models approach sort of gets at is that there can be situations in which, if you really trust the graph, if you really trust the model, the way the world works, you can use observational data to get things that look a lot like causality from just correlations. And it’s using the structure of the graph, that’s essentially your assumptions, that’s the structure that is added so that you can seemingly get causality from correlation, when normally, as we know, correlation is not causality. One thing that I think could-

Robert Wiblin: Would you believe the results from something like that?

Eva Vivalt: Well I could in some situations. I think it’s very few situations but I have encountered one or two times a situation where somebody was wanting to do an impact evaluation and I was like, “Why don’t you just look at the probabilities and update according to Bayes rule?” And you would get something that was so close to causality … it would actually be very credible as a causal estimate. So the example that Pearl likes to talk about, and some others have written about is: suppose you’re looking at the effect of smoking on lung cancer and historically tobacco companies like to argue that there was some third unmodeled factor that was both leading to people smoking and leading to lung cancer. I don’t know what this third factor is.

Robert Wiblin: Health neglect, in general.

Eva Vivalt: Yeah, yeah. But if you knew that there was some other factor between smoking and lung cancer. Like, for example, tar in people’s lungs that you could measure, then you could just look at the population prevalence of each of these little bits. So like what is the probability of cancer given tar in lungs, tar in lungs given smoking? And sort of string all these things together and compare them with just the probability of having cancer in the population and get something that looks pretty convincing.
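The "string all these things together" step corresponds to what Pearl calls the front-door adjustment. The sketch below assumes binary variables, that tar fully mediates the effect of smoking on cancer, and that the hidden third factor affects smoking and cancer but not tar directly. All probabilities are hypothetical numbers invented for illustration.

```python
def front_door(p_smoke, p_tar_given_smoke, p_cancer_given_tar_smoke, smoke):
    """P(cancer | do(smoke)) via the front-door formula:
    sum over t of P(t | smoke) * sum over s' of P(cancer | t, s') * P(s')."""
    total = 0.0
    for tar in (0, 1):
        p_tar = p_tar_given_smoke[smoke] if tar == 1 else 1.0 - p_tar_given_smoke[smoke]
        inner = 0.0
        for s_prime in (0, 1):
            p_s_prime = p_smoke if s_prime == 1 else 1.0 - p_smoke
            inner += p_cancer_given_tar_smoke[(tar, s_prime)] * p_s_prime
        total += p_tar * inner
    return total

def observed(p_tar_given_smoke, p_cancer_given_tar_smoke, smoke):
    """Plain conditional P(cancer | smoke), with no causal adjustment."""
    p_tar = p_tar_given_smoke[smoke]
    return (p_tar * p_cancer_given_tar_smoke[(1, smoke)]
            + (1.0 - p_tar) * p_cancer_given_tar_smoke[(0, smoke)])

# Hypothetical population quantities (assumptions, not real data).
p_smoke = 0.3                                  # P(smoker)
p_tar_given_smoke = {0: 0.1, 1: 0.9}           # P(tar = 1 | smoke)
p_cancer_given_tar_smoke = {                   # P(cancer = 1 | tar, smoke)
    (0, 0): 0.05, (0, 1): 0.10,
    (1, 0): 0.20, (1, 1): 0.30,
}

causal_effect = (front_door(p_smoke, p_tar_given_smoke, p_cancer_given_tar_smoke, 1)
                 - front_door(p_smoke, p_tar_given_smoke, p_cancer_given_tar_smoke, 0))
naive_difference = (observed(p_tar_given_smoke, p_cancer_given_tar_smoke, 1)
                    - observed(p_tar_given_smoke, p_cancer_given_tar_smoke, 0))

print("front-door causal effect:", causal_effect)
print("naive observed difference:", naive_difference)
```

With these assumed numbers the naive comparison of smokers and non-smokers overstates the effect, because part of the raw gap comes from the hidden factor. The adjusted estimate is only as credible as the assumed graph.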

I think if you’re interested in this Michael Nielsen has a nice blog post that I actually gave some of my students at ANU to look at, because I think it’s a very good description of how these kinds of models work. One thing, to get back to what initially got me down this route is to say something like, look one thing somebody should do is to try to look at how we could use Bayesian networks to say something about what kinds of graphs are more likely than other graphs, given the observational data. Because using the observational data, you can try to infer what kinds of graphs are more likely than other graphs. And if you’ve got like … if you can narrow down the set of possible ways in which the world works to a couple of competing options then that’s something that you can test pretty easily between with RCTs. But like, why don’t we at least use observational data to lower down the options, as it were, before … use the big guns, like RCTs, where it’s like really necessary to but if there’s other smaller problems where we don’t need to then great.

Robert Wiblin: That’s really interesting. We should definitely get that link and put it in the blog post attached to the show, because I’m very interested to learn more about that but it’s not something I know very much about.

Eva Vivalt: Yeah, no, happy to share.

Robert Wiblin: Is this kind of a new fashion or a new interest amongst statisticians?

Eva Vivalt: I don’t think so to be honest. I think it’s actually really unfortunate. I do wish that there was more attention paid to observational data.

Robert Wiblin: Another approach that people are suggesting, and Lant Pritchett has been pushing really hard lately, is to stop looking at just like rigorous evidence that we can really pin down and instead go for really big hits that, admittedly, we might not have such strong evidence for but they have higher expected value.

So he was writing a lot about this last year and he wrote this article for the Center For Global Development called ‘The Perils of Partial Attribution’ and he finished it with this paragraph: “In a very strange turn of events the organizations and supporters of the wildly successful ‘team development’ are under pressure to sacrifice actions that can produce trillions in gains, in the economy, in education, in health, through systemic transformation. And instead development actors are being pressured to do only actions for which ‘rigorous evidence’ proves ‘what works’ but that leads inevitably to a focus on individualized actions known to produce at best mere millions, but for which the donors and external development actors can take direct causal credit. But there are real dangers from the perils of partial attribution in which individual actors care more about what they can take credit for than what there is in the collective team success.”

So, I think he’s basically saying there, that there’s things that we could do where we’d never really know who caused the outcome that would be a lot more valuable than just scaling up things that are proven to work where everyone knows who caused what. Do you actually have any views on that general issue?

Eva Vivalt: So I guess I would also relate it to like this idea of larger interventions that are perhaps hard to study versus … in some regards lots of people have said, concurrent with the rise of RCTs is also this rise of a focus on very — smaller questions, as it were, that are more tractable. But, to some extent I’m actually quite sympathetic to the smaller issues just because it’s really hard to know what’s going on with the large ones. And it’s entirely possible that we are leaving things on the table by not sort of focusing on the big ideas. Honestly, I don’t know.

Robert Wiblin: Yep. I guess that’s the issue of it, because you can’t measure the effect of these other things.

Eva Vivalt: Exactly.

Robert Wiblin: It’s very hard to-

Eva Vivalt: And surprisingly I don’t know.

Robert Wiblin: Yeah, yeah, yeah. You’re being a good empiricist.

Well I guess I suppose he’s saying that one way that he thinks people are biased against this kind of big picture hits-based giving in development is that, kind of, the credit is spread too thin, and so each person kind of discounts their impact on the outcome. And so they underweight how valuable they’ve been because they can never really demonstrate that they definitely had an impact. And, I guess, if we set up bureaucracies or we set aid agencies the right way then they’re not going to care that much about their impact on policy reform because they’ll never be able to prove to a satisfactory standard that it was they who caused the outcome, which could lead to a lot of neglect of some important questions. Could you imagine that?

Eva Vivalt: Yeah, honestly I’m not sure what is the right way of thinking about this problem because it seems like it would depend a lot on how the policy makers … and I keep on saying “policy makers”, some people don’t like that term. So “bureaucrats”, whatever you want, institutions that the bureaucrats happen to be nested within. It kind of depends a lot on how they’re making these decisions in the end and what they’re caring about because in some regards a stronger factor in terms of what they fund is just that they don’t tend to … like it’s actually relatively few people still today who care about evidence and who care about impact evaluation results.

I think actually I wouldn’t be terribly concerned that there are all these policy makers who, they really just care about the impact found in an impact evaluation because actually I think, if anything, probably too few people do, and people may be getting more of their guidance from, oh, they happen to have a best friend from high school who tells them that this is the best thing to do. And now they’re minister of finance or minister of whatever and they can do what they want. I would worry more about those kinds of things rather than … I mean, just as a first order consideration. But to be honest again, I don’t really know, I wouldn’t consider myself an expert in this topic. Those are my priors but aggregate them with some other priors.

Robert Wiblin: Sure, sure, sure.

Okay, so let’s move on to talking a bit about your career. We should do a full episode at some point about economics PhDs, but I just had a few questions. You finished your career … sorry your PhD … that was premature. You finished your PhD about two years ago and then you applied to a whole lot of different places and, if I recall, when you were writing your PhD you were a little bit nervous about where you would end up landing, but you ended up getting a whole bunch of good offers. What’s the process by which economics PhD grads actually get their jobs?

Eva Vivalt: Sure. So I’ll slightly correct you and say that I’d actually done my PhD a little bit earlier but then the thing is I didn’t go on the academic job market. I got kind of like an early offer at the World Bank. So I was finishing up at Berkeley, I was in my fourth year, I got the offer, I said, “Oh thank you very much.” I mean, nobody goes out in their fourth year on the job market, so that was a bit too early for the academic job market but I said, “Thank you very much for this job, I’ll take it.” And then when I was at the World Bank I was like actually I think I’d prefer to be in academia so then I did a post-doc and then I went from the post-doc a couple of years ago on the job market.

But, yeah, the PhD process, to get a job … actually, this was quite similar from the post-doc because I didn’t do the post-doc too, too late. In fact there was still a couple of people from my original cohort at Berkeley who were going on the job market at that time. I think it would be different for more senior markets, but if you’re still on the junior market and you’re going through the normal process, there’s a very centralized market in that-

Robert Wiblin: It’s kind of communist really.

Eva Vivalt: It’s very strange. Almost all the first round interviews are done at this one annual meeting. So you go to these meetings and you do your first round interviews and basically everybody is there. All sorts of schools from all over the world, not just US schools. You do some short interviews there over the course of just a few days. Then you do second rounds, or fly-outs, at the different schools that invite you to a second round and then out of those some share of those also make you an offer.

So, that’s the general process. It’s, like I said, highly centralized, and it’s also a little bit of a crapshoot, I’m sorry to say. Because there’s a lot of heterogeneity and lots of things that just randomly happen and the more times that I observe it, the more times I see like really weird random stuff happening. So, as much as economists like to believe in efficient markets and as nice as this process is relative to some other disciplines where they’ve got a lot more hoops to jump through, it is a little bit strange. There is a lot of just random noise that also enters in.

Robert Wiblin: It sounds pretty stressful because of the randomness. How did you manage to do well? I guess was it just luck as well?

Eva Vivalt: Yeah, I don’t know. I mean there’s not much that you can do at that point, I believe. You try to have the best job market paper you possibly can, and then the only other thing you can really do is apply to a lot of places, hopefully get good letters from people. But a lot of this is things that most people in grad school are not going to know very much about, right? If you’re a grad student, do you really know if people can write you a good letter or not? Like how do you really know that kind of thing? It’s also somewhat hard to necessarily figure out while you’re still in grad school really what constitutes a really good job market paper, and it’s tough.

Robert Wiblin: Is this something that you can game while you’re doing your PhD? To get academics to like and write about you?

Eva Vivalt: I think that is fair and I’ve, to be perfectly frank, I’ve occasionally observed … I think there’s a little bit of tradeoff, too, because I think sometimes the things that help you get ahead are not necessarily the things that are useful or important questions. And so, on the rare occasion I’ve seen somebody who’s been very cynical who has entirely played to what they thought was a hot topic and been rewarded for that — good for them for that, that doesn’t always even work out necessarily, so I’m not trying to say that that’s like a straightforward way of doing well either because you can end up working on a hot but unimportant topic and nobody likes you in the end anyways.

Robert Wiblin: Have you ever compromised your career to do research that was more valuable than otherwise and do think that went well or do you regret it?

Eva Vivalt: Well, I don’t know. I think that if I were, strictly speaking, trying to have the absolute best career possible I probably would not have chosen the topics that I’ve chosen to work on. So I’ve probably gone a little bit far along that direction, but I’m actually perfectly happy with that. I’m quite proud of the topics I’ve taken on and wouldn’t really have it any other way.

I remember one time I was talking with somebody who specifically was saying they thought that I would have done better on the job market if I’d had a more conventional paper. This was somebody who had done quite well for themselves, not in a kind of gaming way I don’t think. They had done quite well for themselves, but I was just thinking to myself, “actually I’m kind of proud of what I did in a way that I would actually prefer to have my life than their life in some regards.”

I’m obviously being purposefully vague here on some of the details. But there are trade-offs to be made I think, and it’s not that … I don’t want to be too, too cynical. I think that academics do in fact reward important work, it’s just that they also reward a lot of other stuff too.

Robert Wiblin: No doubt. Yeah, so we often recommend an economics PhD to smart students who have good quant skills because it just opens up so many options after they graduate. Do you think the path is kind of overrated or underrated now?

Eva Vivalt: It’s hard to say, because what is true is that I think people end up with an inflated view, always, routinely, of how things are going to go for them. It’s tough because I think if people have in mind that they’re going to do in the top of their class and get a great academic placement, they’re probably wrong. Almost certainly wrong.

Robert Wiblin: Yeah, usually.

Eva Vivalt: [crosstalk 01:54:54] a lot of the placements that you see on web pages, those are also even sort of like the upper bound because they don’t take into consideration that a lot of people fail to get tenure and go to other places afterwards and may leave some people off the website entirely, et cetera, et cetera.

So there’s all sorts of biases in the process. The other thing is that a lot of people when thinking whether grad school might be right for them, are thinking with the framework in mind of, “Oh well I’ve done really well in classes so far and so therefore I’m going to do really well,” but then what they don’t take into consideration is that a) everybody else has also done very well in classes so far and b) it’s kind of a different skill set actually to do good research than to do well in classes.

I think there is actually quite a lot more work that goes into doing good research than most people realize.

The other side is actually I really love my life. I think it’s fantastic and I’m very happy with everything pretty much at this particular moment. But at the same time, it is quite a lot harder than I think people think.

Robert Wiblin: Yeah, how high is the bar for getting into a good economics PhD program?

Eva Vivalt: So the bar for that is already very high and then out of that there is an even higher bar depending on what you want to do afterwards. If you’re not interested in academia but want to do other things, a PhD can also help for other things I suppose. But most people who want to do it, do it because they want to go into academia. In fact, if you don’t want to go into academia, don’t tell anybody because they’re less likely to want to advise you, that’s the general advice that’s given.

I’m not sure where exactly the line is right, right now, but I remember, just to give some kind of background, certainly when I was at Berkeley there was some time that I was helping with … so the grad students could help with admissions but in a very limited way. Basically the ones that didn’t meet some cut-off thresholds were thrown into a pit that grad students would sift through to try to pull out some … scavenge some gems from the losers in the pile.

Robert Wiblin: This sounds grim.

Eva Vivalt: Yeah I know. And Berkeley was one of the better schools, I think, for that. I think a lot of other schools if you don’t meet the cut-off then there’s no sort of recourse. There’s no grad students looking over, trying to find you again.

Robert Wiblin: Is there any kind of clear indication that you can get of whether your maths skills are up to the task?

Eva Vivalt: Not super-clear. Certainly you should at least look at like the quantitative score that you get on the GRE and if it’s not like either perfect or really close to perfect then …

Robert Wiblin: Reconsider.

Eva Vivalt: No matter whether or not you’re cut out for it, you might perfectly well be cut out for it but nobody will accept you. The other thing is, obviously the better you are at math, the better time you’ll have of it, particularly because I think economists are, frankly, biased towards research that involves a lot of math, even when it’s not necessary.

Robert Wiblin: So very top tier schools are very selective, but is there much point going to a second tier school? And I’m thinking especially if you don’t want to become an academic, you instead want to go and work at the World Bank or some other organization that can make a difference, or go into policy. Is that still a good option?

Eva Vivalt: I think if you’re interested in going into policy, et cetera, there’s especially some good schools around DC that are good for that, that have a lot of interactions with the World Bank, et cetera. So like University of Maryland, et cetera. Schools around there that are maybe like not considered to be top economic schools but are fairly well integrated into the policy world, that can still actually help.

Robert Wiblin: Hmm, okay, well I look forward to doing a full episode on this topic at some point in the future, maybe with you. But as it’s one of our top paths, I think we need some more content on it. Of course, we have the career review on this topic, which I’ll link to so people can take a look at it if they’re considering doing an economics PhD like you did.

I’ve got to let you go, but I guess just one final question: what’s the strangest or most ironic thing that you’ve found in your research over the last five or ten years? Is there anything that’s made you laugh?

Eva Vivalt: One thing that is kind of ironic: some of my earlier work looked at how much context really matters, and some of the pushback that I got at the time was from people saying, “Yeah, but we know that already. We know context matters and that results wouldn’t otherwise generalize.” In some of my more recent work, I’ve run discrete choice experiments where we gave people the option to choose which of two studies they thought was going to be more informative. We did that with policy makers, but we also did it with researchers. And a little bit contrary to what you might think, given how much researchers said they cared about context, they didn’t seem to in these discrete choice experiments. So, I would say, not to pick on policy makers: there’s probably something to be learned on both sides there.

Robert Wiblin: It’s easier to see other people’s mistakes than your own, that’s for sure.

Eva Vivalt: Yep.

Robert Wiblin: My guest today has been Eva Vivalt. Thanks for coming on The 80,000 Hours Podcast, Eva.

Eva Vivalt: Thank you.

Robert Wiblin: Remember if you’re listening all the way to the end, you’re the sort of person who should check out our job board at 80000hours.org/job-board.

The 80,000 Hours Podcast is produced by Keiran Harris.

Thanks for joining – talk to you next week.

About the show

The 80,000 Hours Podcast features unusually in-depth conversations about the world’s most pressing problems and how you can use your career to solve them. We invite guests pursuing a wide range of career paths, from academics and activists to entrepreneurs and policymakers, to analyse the case for and against working on different issues and which approaches are best for solving them.

The 80,000 Hours Podcast is produced and edited by Keiran Harris. Get in touch with feedback or guest suggestions by email.
