#147 – Spencer Greenberg on stopping valueless papers from getting into top journals


Can you trust the things you read in published scientific research? Not really. About 40% of experiments in top social science journals don’t produce the same result when repeated.

Two key reasons are ‘p-hacking’ and ‘publication bias’. P-hacking is when researchers run many slightly different statistical tests until they find a way to make findings appear statistically significant when they’re actually not — a problem first discussed over 50 years ago. And because journals are more likely to publish positive than negative results, you might be reading about the one time an experiment worked, while the 10 times it was run and got a ‘null result’ never saw the light of day. This phenomenon, publication bias, is one we’ve understood for 60 years.

Today’s repeat guest, social scientist and entrepreneur Spencer Greenberg, has followed these issues closely for years.

He recently checked whether p-values, which indicate roughly how likely a result that strong would be to arise by pure chance alone, could tell us how likely an outcome would be to recur if an experiment were repeated. From his sample of 325 replications of psychology studies, the answer seemed to be yes. According to Spencer, “when the original study’s p-value was less than 0.01, about 72% replicated — not bad. On the other hand, when the p-value is greater than 0.01, only about 48% replicated. A pretty big difference.”

To do his bit to help get these numbers up, Spencer has launched an effort to repeat almost every social science experiment published in the journals Nature and Science, to see whether the results hold up. (So far they’re two for three.)

According to Spencer, things are gradually improving. For example, he sees more raw data and experimental materials being shared, which makes it much easier to check the work of other researchers.

But while progress is being made on some fronts, Spencer thinks there are other serious problems with published research that aren’t yet fully appreciated. One of these Spencer calls ‘importance hacking’: passing off obvious or unimportant results as surprising and meaningful.

For instance, do you remember the sensational paper that claimed government policy was driven by the opinions of lobby groups and ‘elites,’ but hardly affected by the opinions of ordinary people? Huge if true! It got wall-to-wall coverage in the press and on social media. But unfortunately, the paper’s statistical model could explain only 7% of the variation in which policies were adopted. Basically, the researchers just didn’t know what made some campaigns succeed while others didn’t — a point one wouldn’t learn without reading the paper and diving into confusing tables of numbers. Clever writing made the result seem more important and meaningful than it really was.

Another paper Spencer describes claimed to find that people with a history of trauma explore less. That experiment actually featured an “incredibly boring apple-picking game: you had an apple tree in front of you, and you either could pick another apple or go to the next tree. Those were your only options. And they found that people with histories of trauma were more likely to stay on the same tree. Does that actually prove anything about real-world behaviour?” It’s at best unclear.

Spencer suspects that importance hacking of this kind causes damage on a similar scale to the issues mentioned above, like p-hacking and publication bias, but is much less discussed. His replication project tries to identify importance hacking by comparing how a paper’s findings are described in the abstract to what the experiment actually showed. But the cat-and-mouse game between academics and journal reviewers is fierce, and it’s far from easy to stop people exaggerating the importance of their work.

In this wide-ranging conversation, Rob and Spencer discuss the above as well as:

  • When you should and shouldn’t use intuition to make decisions.
  • How to properly model why some people succeed more than others.
  • The difference between what Spencer calls “Soldier Altruists” and “Scout Altruists.”
  • A paper that tested dozens of methods for forming the habit of going to the gym, why Spencer thinks it was presented in a very misleading way, and what it really found.
  • Spencer’s experiment to see whether a 15-minute intervention could make people more likely to sustain a new habit two months later.
  • The most common way for groups with good intentions to turn bad and cause harm.
  • And Spencer’s low-guilt approach to a fulfilling life and doing good, which he calls “Valuism.”

Get this episode by subscribing to our podcast on the world’s most pressing problems and how to solve them: type ‘80,000 Hours’ into your podcasting app. Or read the transcript below.

Producer: Keiran Harris
Audio mastering: Ben Cordell and Milo McGuire
Transcriptions: Katy Moore

Highlights

Experimental evidence on how to *actually* go to the gym more

Spencer Greenberg: One study that I think really makes it clear how hard behaviour change is, is this really huge study that was run fairly recently. I think it was on tens of thousands of people who were gym members, and they tried to get them to go to the gym more often.

Spencer Greenberg: The basic idea is they got tonnes of researchers — I think it was like 30 different scientists working in small teams — to develop behaviour change interventions. And then they took these tens of thousands of people — like 61,000 participants who already had gym memberships — and used these text message interventions, 54 different interventions that the scientists developed, to try to get them to go to the gym more often.

Spencer Greenberg: Here are the ones I thought were more promising, more interesting. One is giving people bonuses after they mess up. So the basic idea is if you fail to go to the gym when you wanted to, you’ll be told you’re going to get a special bonus if you recover from this mistake. So the next day, if you go at the time you planned, you’ll get extra points. And I think this one probably is not a false positive, because actually in the top five, this occurred twice: there were two slight variations on it, and they both were in the top five. So that seems really promising to me.

Rob Wiblin: How can people apply that in their normal life? You have this issue of falling off the wagon that a lot of people have when they’re trying to change their habits. I suppose you need to have an extra reward for yourself if you miss a day and then you manage to get back on the next day. That’s maybe like an intervention point where you’re particularly able to make a difference by geeing yourself up.

Spencer Greenberg: Right. And I think the key is to think about a failure as not, “Now I’m screwed up and now it’s not even worth it.” It’s like, “No, no, no, wait. Now I can recover and I should feel really happy if I am able to recover.” Because think about doing a habit: you’re going to have failure days, inevitably. If you can’t recover, then you’re pretty screwed. I think that’s just a reminder that the recovery piece is as important as doing the habit in the first place.
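To make the recovery-bonus rule concrete, here’s a toy sketch in Python. The names, point values, and logic are hypothetical illustrations of the idea described above, not the study’s actual implementation (which delivered interventions by text message):

```python
# Toy sketch of the "bonus after a missed day" rule (hypothetical values).
from dataclasses import dataclass

BASE_POINTS = 10      # points for attending at the planned time
RECOVERY_BONUS = 5    # extra points for bouncing back after a miss

@dataclass
class Member:
    points: int = 0
    missed_yesterday: bool = False

def record_day(member: Member, attended: bool) -> None:
    """Award points for attendance, with a bonus for recovering from a miss."""
    if attended:
        member.points += BASE_POINTS
        if member.missed_yesterday:
            member.points += RECOVERY_BONUS  # reward getting back on the wagon
    member.missed_yesterday = not attended

m = Member()
for went in [True, False, True]:  # attend, miss, recover
    record_day(m, went)
print(m.points)  # 10 + (10 + 5) = 25
```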

Spencer Greenberg: So the other one is really quite interesting. They gave people a choice of a gain frame and a loss frame for the points they earned. The idea is when you go to the gym successfully at the time you planned, you earn points, right? So that’s a gain frame. But you can equivalently think of it as you start with all the points, and every time you don’t go to the gym, you lose points.

And with this intervention, they actually let people choose. They said, “Do you want to have this many points at the beginning and every time you don’t go, you lose them? Or do you want to have zero points at the beginning and every time you go, you get them?” And of course it’s the same number of points either way. But by letting people choose, they found people actually seem to go to the gym more often. And we don’t know for sure that it’s not a false positive, but I think it’s kind of cute. And if it actually works, that’s pretty cool.
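As a quick arithmetic check of the equivalence Spencer describes (the numbers below are made up, not from the study):

```python
# The two framings award identical totals; only the description differs.
planned, attended, pts = 20, 14, 10

gain_frame = attended * pts                              # start at 0, earn per session
loss_frame = planned * pts - (planned - attended) * pts  # start full, lose per miss

assert gain_frame == loss_frame == 140
```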

The factors that predict success

Rob Wiblin: I suppose there are all of these folk theories about what determines success, which usually highlight one particular thing. You know, it’s just grit, it’s about your ability to stick to it, it’s just luck, or it’s just intelligence, or something. People reach, I think, for these simpler folk theories, but what they really want to know is which of these factors empirically is the most important in determining success. Do you know of any evidence that can help people narrow down what’s most important?

Spencer Greenberg: So what’s really tough, if you’re looking at really high levels of success, is it’s hard to get good sample sizes, right? Because you can kind of make a collection of some of the most successful people in history, but then you’re just like anecdotally investigating each of them. On the other hand, if we’re thinking about more ordinary levels of success — like being good at your job and marrying someone you like and so on — it’s a lot easier to get datasets, so I think more is known about that kind of success.

And if we think about that literature, one thing that comes up a lot in a work context is conscientiousness — for a wide range of work, but not all types of work. For many different types of jobs, you don’t want to be too low in conscientiousness. There could be diminishing returns. I think that’s an open question: do you really want to be at the 99.9999th percentile of conscientiousness, or is that too much? I suspect that at some point it actually becomes dysfunctional, but at least up to a point, it does seem to predict job performance.

IQ generally predicts job performance across a very wide range of jobs. So that’s a helpful one.

Then spending time training at a particular skill that’s relevant for that job clearly is really important — but the type of training matters tremendously. Just the number of hours someone has spent doing a thing is a much weaker predictor than the number of hours they’ve spent with high-quality feedback. If you think about someone who just plays chess every day: yes, they’re going to get better at chess. But compare that to someone who plays chess every day, and at the end of the day breaks down what they did badly, looks at what the best chess engine said they should have done instead, compares it to what they did, and tries to figure out why. Even if you control for the amount of time they spend, the second type of player is going to become vastly better at chess.

Four improvements in social science research

Rob Wiblin: What do you think the state of play is these days regarding this social science reform movement?

Spencer Greenberg: Yeah, it is kind of depressing reading papers from the 70s, and being like, “Well, yep, they’re pointing out all the things that need to be fixed.” I think we still have a long way to go to make science better, a very long way to go. However, I do think there are glimmers of hope.

Spencer Greenberg: So one thing I’m seeing, and this is just anecdotal, is more data being shared. I see more researchers being like, “Here’s my data, go check it out.” And also more material sharing as well, like, “Here are the materials I used for this study.” The Open Science Framework has been a really positive force, where they really are encouraging people to share their materials in an open way. It’s a nice platform for helping you do that. So that’s really cool.

Spencer Greenberg: I think there are way more replication projects, where people are trying to replicate results, so that’s really great.

Spencer Greenberg: There’s also this idea of Registered Reports. Chris Chambers has really been an advocate for this. They’re quite an amazing idea, where basically they get journals to agree to accept a paper before the study has been run. So basically the journal knows exactly what the study is going to be, but they don’t yet know the results, nor do the researchers. The journal agrees to accept it, and then the research team goes and runs the study, and it gets published regardless of whether it’s a positive or negative result. And this is really nice, because it reduces the incentive to p-hack your results just to show some cool finding, because your paper is going to be published either way.

Spencer Greenberg: And I would say just generally more awareness, like increased scepticism, is probably helpful, because it means people know they’re going to be scrutinised a bit more for their research methods.

Biggest barriers to better social science research

Spencer Greenberg: I think currently something like 40% of papers in top social science journals don’t replicate, but it’s pretty dependent on what field it is. And I think ideally we should get that down to something like 15% not replicating, or something like that. You’re never going to get to zero, because there’s always things that could happen — it could be just bad luck or weird chance or stuff like that — but I think it’s just significantly too high a replication failure rate.

Spencer Greenberg: And the basic answer is that it’s an incentive problem, fundamentally. That is the super high-level answer. There’s interesting things to unpack there about what would it mean to make better incentives — but at the end of the day, if you’re a social scientist in academia, you need to get publications in top journals in order to stay in the field, and to get those tenure-track roles, and eventually to get tenure. And if you can’t get published in the top journals, you basically will get squeezed out.

Spencer Greenberg: So there’s kind of a double incentive whammy here.

Spencer Greenberg: One is that if you’re kind of doing fishy methods, you might have a competitive advantage over people who are really playing fairly, right? Because maybe the fishy methods let you publish more often. So that’s really, really bad.

Spencer Greenberg: And the second thing is that eventually you’re going to end up with a field that gets filled with the people that are doing the fishy methods, and then that becomes a norm. If you see other people doing fishy things and you’re like, “I guess that’s how research is done,” then that’s obviously going to have a negative effect on the whole field. And so one thing that’s really great about the open science movement is that by pushing back against these norms, it’s trying to create a new set of norms, a better set of norms.

Importance hacking

Rob Wiblin: Are there any issues with science practice that are a big deal that you think listeners to the show might not have heard so much about, or might not appreciate how important they are?

Spencer Greenberg: Yeah, absolutely. One really big one — that I think actually might be on the same order of magnitude as p-hacking in terms of how important it is, but is really not well known and doesn’t really have a standardised name — is something we call “importance hacking.”

Spencer Greenberg: You can get your result in a top journal by tricking the reviewers into thinking that it was a valuable or interesting finding when in fact it was essentially a valueless or completely uninteresting finding.

Spencer Greenberg: One example that comes to mind is this paper, “Testing theories of American politics: Elites, interest groups, and average citizens.” The basic idea of the paper was they were trying to see what actually predicts what ends up happening in society, what policies get passed. Is it the view of the elites? Is it the view of interest groups? Or is it the view of what average citizens want?

Spencer Greenberg: And they have a kind of shocking conclusion. Here are the coefficients that they report: Preference of average citizens, how much they matter, is 0.03. Preference of economic elites, 0.76. Oh, my gosh, that’s so much bigger, right? Alignment of interest groups, like what the interest groups think, 0.56. So almost as strong as the economic elites. So it’s kind of a shocking result. It’s like, “Oh my gosh, society is just determined by what economic elites and interest groups think, and not at all by average citizens,” right?

Spencer Greenberg: So this often happens to me when I’m reading papers. I’m like, “Oh, wow, that’s fascinating.” And then I come to like a table in Appendix 7 or whatever, and I’m like, “What the hell?” And so in this case, the particular line that really throws me for a loop is the R² number. The R² measures the percentage of variance that’s explained by the model. So this is a model where they’re trying to predict what policies get passed using the preferences of average citizens, economic elites, and interest groups. Take it all together into one model. Drum roll: what’s the R²? 0.07. They’re able to explain 7% of the variance of what happens using this information.

Rob Wiblin: OK, so they were trying to explain what policies got passed and they had opinion polls for elites, for interest groups, and for ordinary people. And they could only explain 7% of the variation in what policies got up? Which is negligible.

Spencer Greenberg: So my takeaway is that they failed to explain why policies get passed. That’s the result. We have no idea why policies are getting passed.
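For readers who want the definition: R² is one minus the ratio of the model’s squared prediction errors to the total variance of the outcome. Here’s a minimal sketch, with made-up data rather than the paper’s, of what an R² around 0.07 looks like:

```python
# Minimal R-squared computation (illustrative data only, not the paper's).
import numpy as np

def r_squared(y: np.ndarray, y_hat: np.ndarray) -> float:
    """R^2 = 1 - SS_residual / SS_total: the share of variance explained."""
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

rng = np.random.default_rng(1)
x = rng.normal(size=500)                # predictor, e.g. elite preferences
y = 0.27 * x + rng.normal(size=500)     # outcome only weakly driven by it
slope, intercept = np.polyfit(x, y, 1)  # fit a simple linear model
y_hat = slope * x + intercept
print(round(r_squared(y, y_hat), 2))    # roughly 0.07: ~7% of variance explained
```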

Identifying p-hackers

Rob Wiblin: Are there any techniques for tackling p-hacking that I might not have heard about?

Spencer Greenberg: Well, one that I just think is really cool is this technique developed by Simonsohn, Nelson, and Simmons called a p-curve analysis. The basic idea is you take a bunch of p-values that a researcher has published, either from one paper or from a bunch of their papers. And think about it as you’re making a histogram of all the p-values they found that were less than 0.05. So you’re kind of looking at the distribution of how often did they get different values: How often did they get ones between 0.04 and 0.05? How often did they get ones between 0.03 and 0.04?

Spencer Greenberg: Then we can think about what we should expect to see. So in a world where all of their results are false, but they don’t do any sketchy methods (it just happens that nothing they study actually exists), we should expect a uniform distribution: every p-value is equally likely. That just follows from the definition of a p-value: if there’s no real effect, p-values are uniformly distributed, flat. And p-curves just look at the values less than 0.05. So in that case, the histogram we’re talking about here would just be uniformly distributed between 0 and 0.05.

Spencer Greenberg: What if they lived in a world where they were doing everything cleanly — they weren’t using p-hacking — but some of their effects were real? How would that change the distribution? Well, real effects tend to produce low p-values — lower than chance alone would give — so you’re going to get a bump on the left side of the histogram. At around 0.01 or 0.02, you’re going to see a bulge: more results than that flat, uniform distribution would predict. So if you have a shape like that, it indicates that they are finding real effects that are not p-hacked.

Spencer Greenberg: So then what if they’re in a situation where they’re just p-hacking the heck out of the results, and mostly they’re false positives that they’re publishing? Well, if you think about what p-hacking is doing, it’s getting some results, many of which are greater than p=0.05, so you can’t really publish them in most journals. And then you’re either doing fishy things to get the p-value down, or you’re just throwing away the ones that happen to be just above 0.05 and you’re keeping the ones that happen to fall below it. And so what you get is a bulge of too many results on the right of the distribution, right around 0.05. So that means if you have a bulge on the left, you’re probably finding a bunch of real effects. If you have a bulge on the right, you’re probably finding a bunch of false effects.

Rob Wiblin: Wow. OK, interesting. So people, I guess they’ll find a particular researcher, or find some particular research topic, or a whole bunch of papers on some general theme, and then grab all the p-values and see what distribution they have: whether the bulge is towards 0 or the bulge is towards 0.05. And then they can say, does this literature as a whole have this problem, or does it not?

Spencer Greenberg: Exactly. You could do it on all the primary results of one researcher. You could do it on major results from a whole field. Potentially, if there’s one paper that had like seven studies in it, and they had a bunch of p-values per study, you could even try to do that. But there are some caveats. I mean, this is far from a perfect technique, but I think it’s a really innovative idea.

Rob Wiblin: If people were doing this on the regular, and you applied it to a specific researcher over the course of their entire career, then they’d know they can’t end up with this shape that has a bulge towards the 0.05 side. That would be really chastening for them, because it limits what they can do. Even though you can’t tell which specific papers have real results and which ones don’t, it places far more constraints on what they can publish within their entire body of work.

Spencer Greenberg: Yeah, I think if there was some kind of really strong norm where everyone had these curves published and everyone checked each other’s curves regularly, it could create an incentive like that.
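To see why those bulge shapes emerge, here’s a minimal simulation sketch in Python. It’s illustrative only, assuming simple two-sample t-tests, with optional stopping standing in for p-hacking; it is not Simonsohn, Nelson, and Simmons’s actual procedure:

```python
# P-curve intuition: the shape of significant p-values (p < 0.05)
# under three regimes (illustrative simulation, not the authors' method).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def sim_pvalues(effect: float, n_sims: int = 5000, n: int = 30) -> np.ndarray:
    """Two-sample t-tests with a given true effect size (in SD units)."""
    a = rng.normal(0.0, 1.0, (n_sims, n))
    b = rng.normal(effect, 1.0, (n_sims, n))
    return stats.ttest_ind(a, b, axis=1).pvalue

def sim_phacked(n_sims: int = 2000) -> np.ndarray:
    """Optional stopping under the null: keep adding subjects and re-testing,
    stopping as soon as p dips below 0.05 (one classic p-hacking move)."""
    ps = []
    for _ in range(n_sims):
        a = list(rng.normal(0, 1, 10))
        b = list(rng.normal(0, 1, 10))
        p = stats.ttest_ind(a, b).pvalue
        while p >= 0.05 and len(a) < 100:
            a.append(rng.normal())
            b.append(rng.normal())
            p = stats.ttest_ind(a, b).pvalue
        ps.append(p)
    return np.array(ps)

def p_curve(p: np.ndarray) -> np.ndarray:
    """Share of significant p-values in five bins: 0-0.01, ..., 0.04-0.05."""
    sig = p[p < 0.05]
    counts, _ = np.histogram(sig, bins=np.linspace(0, 0.05, 6))
    return (counts / counts.sum()).round(2)

print("no real effects (flat):      ", p_curve(sim_pvalues(0.0)))
print("real effects (left bulge):   ", p_curve(sim_pvalues(0.5)))
print("p-hacked nulls (right bulge):", p_curve(sim_phacked()))
```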

When to go with your gut

Spencer Greenberg: The FIRE Framework is all about when you should go with your gut. And basically, what I claim is that in these four cases, usually you should just go with your gut. It’s not worth going into deep reflection.

The first case is for fast decisions. The F in FIRE stands for “fast.” So imagine that you’re driving down the highway and suddenly a car going the wrong way in your lane is careening towards you. You don’t have time for reflection. You’ve just got to decide: you’re going to the left, you’re going to the right. What are you doing? And just act, right? So for a fast decision, you’ve got to go with your gut.

Spencer Greenberg: The second is the I in FIRE. It stands for “irrelevant” decisions. These are just very low-stakes decisions that really don’t matter, where using your reflection is probably just not worth the investment of your conscious mind. You could just be spending your time thinking about something more important. This would be like, you’re at the salad place, and you’re trying to decide, “Do I get carrots in my salad?” It’s like, does it really matter? If your gut tells you to get carrots, get carrots.

Spencer Greenberg: The R stands for “repetitious” decisions. So think about someone who’s a chess master, who’s played thousands of games of chess with feedback on how they did. That person is going to develop an incredible intuition for what is a good chess move. It doesn’t mean that they can’t spend the time thinking about it, but it means that they don’t necessarily have to.

There’s this amazing example of Magnus Carlsen playing chess against three people who are pretty good — they’re not like top, top people in the world, but three people who are pretty good. And he only gets a few seconds per move, so he really doesn’t even have time to reflect. And he beats them all. But the craziest part is he’s blindfolded, so he has to keep all three boards in his mind simultaneously and make each move within a few seconds. Think about how ridiculous that is.

So at some point, you’ve done something enough that your intuition is just built up. I think really the key here is intuition is not magic. Sometimes people treat intuition as magic: “Your gut knows all these things.” No, your intuition has to learn, right? So if you’re in an environment where you’ve done a type of decision many times — and, really key thing, you’ve gotten feedback, so your intuition was able to update — then you can often trust your gut.

Spencer Greenberg: The E is for “evolutionary” decisions. There are certain things that are hardcoded in animals — and, people being animals, they’re hardcoded in us. They’re not always right, but they’re pretty good heuristics. Like if you are looking at a piece of food and it smells horrible, just don’t eat it. It’s just not worth it. It might make you really sick.


About the show

The 80,000 Hours Podcast features unusually in-depth conversations about the world's most pressing problems and how you can use your career to solve them. We invite guests pursuing a wide range of career paths — from academics and activists to entrepreneurs and policymakers — to analyse the case for and against working on different issues and which approaches are best for solving them.

The 80,000 Hours Podcast is produced and edited by Keiran Harris. Get in touch with feedback or guest suggestions by emailing [email protected].
