#217 – Beth Barnes on the most important graph in AI right now — and the 7-month rule that governs its progress

AI models today have a 50% chance of successfully completing a task that would take an expert human one hour. Seven months ago, that number was roughly 30 minutes — and seven months before that, 15 minutes.

These are substantial, multi-step tasks requiring sustained focus: building web applications, conducting machine learning research, or solving complex programming challenges.

Today’s guest, Beth Barnes, is CEO of METR (Model Evaluation & Threat Research) — the leading organisation measuring these capabilities.

Beth’s team has been timing how long it takes skilled humans to complete projects of varying length, then seeing how AI models perform on the same work.

The resulting paper from METR, “Measuring AI ability to complete long tasks,” made waves by revealing that the planning horizon of AI models was doubling roughly every seven months. It’s regarded by many as the most useful AI forecasting work in years.

The companies building these systems aren’t just aware of this trend — they want to harness it as much as possible, and are aggressively pursuing automation of their own research.

That’s both an exciting and troubling development, because it could radically speed up advances in AI capabilities, accomplishing what would have taken years or decades in just months. That itself could be highly destabilising, as we explored in a previous episode: Will MacAskill on AI causing a “century in a decade” — and how we’re completely unprepared.

And having AI models rapidly build their successors with limited human oversight naturally raises the risk that things will go off the rails if the models at the end of the process lack the goals and constraints we hoped for.

Beth thinks models can already do “meaningful work” on improving themselves, and she wouldn’t be surprised if AI models were able to autonomously self-improve in as little as two years from now — in fact, she says, “It seems hard to rule out even shorter [timelines]. Is there 1% chance of this happening in six, nine months? Yeah, that seems pretty plausible.”

While Silicon Valley is abuzz with these numbers, policymakers remain largely unaware of what’s barrelling toward us — and given the current lack of regulation of AI companies, they’re not even able to access the critical information that would help them decide whether to intervene. Beth adds:

The sense I really want to dispel is, “But the experts must be on top of this. The experts would be telling us if it really was time to freak out.” The experts are not on top of this. Inasmuch as there are experts, they are saying that this is concerning. … And to the extent that I am an expert, I am an expert telling you you should freak out. And there’s not especially anyone else who isn’t saying this.

Beth and Rob discuss all that, plus:

  • How Beth now thinks that open-weight models are a good thing for AI safety, and what changed her mind
  • How our poor information security means there’s no such thing as a “closed-weight” model anyway
  • Whether we can see if an AI is scheming in its chain-of-thought reasoning, and the latest research on “alignment faking”
  • Why just before deployment is the worst time to evaluate model safety
  • Why Beth thinks AIs could end up being really good at creative and novel research — something humans tend to think is beyond their reach
  • Why Beth thinks safety-focused people should stay out of the frontier AI companies — and the advantages smaller organisations have
  • Areas of AI safety research that Beth thinks are overrated and underrated
  • Whether it’s feasible to have a science that translates AI models’ increasing use of nonhuman language or ‘neuralese’
  • How AI is both similar to and different from nuclear arms racing and bioweapons
  • And much more besides!

This episode was originally recorded on February 17, 2025.

Video editing: Luke Monsour and Simon Monsour
Audio engineering: Ben Cordell, Milo McGuire, Simon Monsour, and Dominic Armstrong
Music: Ben Cordell
Transcriptions and web: Katy Moore

Highlights

Can we see AI scheming in the chain of thought?

Rob Wiblin: So o1 is the newish OpenAI reasoning model. I guess it’s able to handle more complex questions and it’s able to go away and use a whole bunch of computational resources to think about its answer and review it and extend it before it gives the answer. So it has this kind of notepad on which it’s able to do a bunch of reasoning and then think about what it’s written down and go back and revise it. And then eventually, after all of that, it spits out an answer.

In the case where it’s a totally safe model and maybe not that capable, that would be actually indistinguishable from the case where it’s actually incredibly schemey. It’s actually able to figure out, “Wait. This is an evaluation. I probably shouldn’t reveal that I’m scheming or that I am capable of doing these dangerous things, because that would maybe get me changed.” So it could be reasoning through that in the chain of thought, saying, “I need to pretend to not be able to do this,” and you would never know.

Beth Barnes: Right. And beyond just not looking at it at all, there’s also the question of, if you looked at it, is the chain of thought faithful? Is it actually tracking the real reasoning that the model is doing? And is it intelligible to humans, or is it some kind of gobbledygook?

Or has it been trained in some way that you would expect that it’s no longer faithful to the underlying reasoning? You can imagine if it’s being trained simultaneously to get high performance on tasks, but not do any thinking that humans think looks suspicious — if that thinking is still helpful for achieving high performance on the tasks — then you would expect one of the possible outcomes is that it then just does this reasoning that looks innocuous, but in fact it’s using that computation to reason about things that are not innocuous.

Rob Wiblin: Hold on, let’s unpack that. You’re saying that within that chain of thought that it’s engaging in, inasmuch as we just reward it for getting the right answer at the end, the kind of stuff that it’s writing down on the pad during its process of thinking, I suppose it could just end up being disconnected from what the answer is.

I suppose if we were looking at that and rewarding it based on whether we thought the chain of reasoning looked nice to us, then it could learn that it’s got kind of two tasks: one is to come up with some justification that looks good to people who are reviewing it and rewarding it, and also to separately come up with the right answer. But these things don’t actually have to be causally related to one another.

But you’re saying inasmuch as we just pressure it to come up with the right answer, it could end up engaging in reasoning that we can’t understand or follow at all. And I suppose at the limit, it could just write stuff that looks like gobbledygook to us, but it understands it; it kind of has its own private language or something that allows it to reason better than it can in English.

And you’re just saying, inasmuch as we don’t check the chain of thought, we don’t know which of these worlds we’re in?

Beth Barnes: Right. And there are some examples of the chain of thought that OpenAI published in the system card that I think have some signs of this kind of becoming unintelligible — in that it’s sort of weird shorthand and ungrammatical sort of half-sentences in various places. So it clearly has drifted somewhat from the prior of kind of normal English, and it’s unclear, and just the meaning is sometimes confusing.

And the drift at the moment, from these 10 examples that I’ve looked at, doesn’t seem like enough that it could obviously be doing this long reasoning about gaming the evals, or deceiving the humans, or getting a high score, without that showing up at all. But we’re definitely moving in that direction.

Rob Wiblin: So you’re saying there’s a handful of cases where they’ve shown us what it reported that it was thinking. And it’s a bit weird because it’s not grammatical. Normally these models speak incredibly eloquently; they never make a grammatical or punctuation mistake or anything like that. But in this case they are.

And you’re like, is that a sign that there’s something going on with this text that is not just what’s written on the tin, that maybe it’s encoding extra information in there and that’s why it reads as weird to us?

Beth Barnes: Yeah. It sort of contradicts itself, or it’s just definitely confusing and weird to read. It seems like you can tell roughly what it’s thinking about, but it’s definitely just like we’re not totally preserving fidelity to English as humans would understand it. And the more training you do, the more that’ll go away.

And also, I think people are just actively trying to figure out things that you can do that are more efficient than having the model output these English language tokens, and more like having the model pass inscrutable lists of numbers to itself.

So I think this is something that we’ve started to take for granted: that these models have this very nice safety property of they’re really not very good at reasoning, except for out loud in English. So you can really just kind of see what they’re thinking and see what they’re doing, and it seems we’d have to get to a very high level of intelligence before you would be worried about the model doing elaborate reasoning in a single forward pass.

But I think we’re maybe heading for a world where that’s not true anymore. This is a thing that is a bit stressful; I hope people are going to work on making sure that it continues to be interpretable.

We have to test models' honesty even before they're used inside AI companies

Beth Barnes: I think in general the most emphasis has been on this idea of pre-deployment testing: before you start serving this model on your API to the general public, you do some testing in house, and maybe ideally with third-party oversight or whatever.

The problem with this is that that’s not necessarily the best time to check or intervene. Ideally, before you start training a model and before you commit a huge amount of expensive compute to producing this artefact, you look at the most capable models you have already — you look at how far away they are from hitting various thresholds where you would need more mitigations, and you look at what mitigations you’re on track to have by when. And based on that, you decide whether it’s a good idea to go ahead with this big new training run and what level of scaleup would be safe.

And you might want to do an intermediate evaluation to be like, OK, how much has the model improved so far? Are we on track with our predictions? Are we getting near one of these thresholds? Because if you don’t do that, and then you get to the end of the training run and you’re like, “We now have this model, and it turns out it’s very capable and it’s above all these thresholds and we don’t have any of the mitigations in place.”

Then one thing is, as soon as you create the model, you create the ability for it to be stolen or misused. And if you don’t have good internal controls set up, people internally could be doing all sorts of things with it.

And also, you now have this artefact that you’ve invested an enormous amount of money into, and it is hard to not use that — and especially to not use it internally. Everyone’s curious, like, “Can I try out this new model? Use it for my new research ideas,” whatever.

And the thing about deployment is, unless you’re open sourcing the model, it is actually quite reversible. So if you released a model, at least now external experts can interact with it and have some oversight and could sound the alarm if it turns out it has super dangerous bioweapon-enabling capabilities or something, and then the lab could undeploy it.

I think there’s something that you would like evals to rule out, which is that companies are building arbitrarily super intelligent and dangerous models internally and the public has absolutely no ability or oversight to know that that’s happening or decide whether it’s appropriate to then scale up and build an even better model.

So in some sense — this is a bit of a hot take — it’s like pre-deployment evals could actually be bad, because delaying deployment could be bad if that means that you miss the window to be able to sound the alarm about, “Wait, this model is super scary. You shouldn’t build a more capable model.” You don’t want the gap between what labs have internally and what anyone has had meaningful oversight of to be arbitrarily large.

Rob Wiblin: Just to repeat that back to make sure that I’ve understood: you’re saying the whole eval mentality at the moment, at least from the companies, is, “We need to check what our models are able to do and willing to do before the public has access to them, before we’re selling it as a product.” And that’s pre-deployment evaluation.

And you’re saying actually maybe what is more important — certainly on the margin at this point — is the evaluations that we do and the risk assessment that we do before we even train the model, or certainly before people inside the company have access to it and start using it.

And the reason is that if you only think about deployment to the public, then there’s no limit on the thing that you could train internally. You could make a model that’s incredibly dangerous. Maybe it’s incredibly good at recursively self-improving; it would be able to survive in the wild and avoid shutdown if it escaped.

Beth Barnes: And it’s just on your servers. Covertly, it’s corrupted a bunch of the lab compute and is making it look like something else, but it’s actually doing its own little intelligence explosion there.

Rob Wiblin: Or even if not something as strange as that, someone could easily steal it or exfiltrate it, or it could be misused inside the company, or someone could incompetently use it inside the company.

It's essential to thoroughly test relevant real-world tasks

Beth Barnes: So we’ve been building this suite of diverse autonomy-requiring tasks, where the model has basically access to a computer and it can do stuff on the computer and use tools and run code and do whatever to accomplish the task…

And the main thing we’ve done that is different is we’ve had humans attempt basically all of these tasks, and we record how long it takes the humans to do the task — humans who in fact have the expertise, so hopefully we’re not timing that they had to go away and learn about this thing. But for someone who has the requisite knowledge, how long is it? How many steps do they need? How long does it take them?

So then we can get this meaningful Y-axis on which to measure model capabilities, which is diverse autonomous tasks in terms of how long they take humans. So if you order all the tasks by how long they take humans, where is the point that the model gets 50% success?

We do see a relationship between length of task for humans and how successful the models are. It’s not perfect. It’s not like they can do all the tasks up to one hour and then none of the ones above; it’s like as you double the time, it gets increasingly unlikely that the model can actually do it…
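To make that concrete, here’s a minimal sketch (not METR’s actual code, and with invented task times and outcomes) of how a “50% time horizon” can be read off: fit a logistic curve of model success against the log of how long each task took a skilled human, then solve for the task length where predicted success crosses 50%.

```python
# Hedged illustration: estimating a "50% time horizon" from made-up
# (human time, model success) pairs. METR's real procedure is more involved.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical results: minutes a skilled human needed per task, and whether
# the model completed the same task (1 = success, 0 = failure).
human_minutes = np.array([2, 5, 10, 15, 30, 30, 60, 60, 120, 240, 480])
model_success = np.array([1, 1,  1,  1,  1,  0,  1,  0,   0,   0,   0])

# Success falls off roughly as task length doubles, so regress on log2(time).
X = np.log2(human_minutes).reshape(-1, 1)
clf = LogisticRegression().fit(X, model_success)

# The fitted curve crosses 50% where its linear term is zero:
#   coef * log2(t) + intercept = 0   =>   t = 2^(-intercept / coef)
horizon = 2 ** (-clf.intercept_[0] / clf.coef_[0, 0])
print(f"Estimated 50% time horizon: ~{horizon:.0f} human-minutes")
```

Ordering tasks by how long they take humans and reading off where that fitted curve crosses 50% is what produces the “can do a one-hour task” style numbers discussed here.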

Rob Wiblin: OK, so you’ve got this general issue that people have lots of different tests of how impressive these AI models are. But there’s always an issue of how actually impressive is that outcome? We find out, wow, now it’s able to do this thing. But is that impressive? Does that really matter? Is that going to move the needle to applications in the real world? Is this going to allow it to act autonomously? It can be a little bit hard to understand intuitively whether any of it matters.

So you’ve come up with a modification of this approach. Firstly, you think about a very wide range of different tasks — so it’s harder for you to teach to the test, to just teach the model to do one very narrow thing while there’s something else nearby, or something else that actually matters in the real world, that it can’t do at all. It just can only play chess super well or something like that. So a lot of diversity of tasks.

And then in terms of measuring how impressive the performance is, you think about it in terms of how long would it have taken a human to do? Or how long did it, in your tests, take for a human to be able to accomplish this task? Are there any simple examples of the tasks? Ones that non-computer people could understand?

Beth Barnes: There’s the ML research ones, like, “Train this model to improve its performance on this dataset” or something. There are some very easy, very general research ones, like, “Go and look up this fact on Wikipedia” or something. There’s some that are like, “Build a piece of software that does these things.”

Rob Wiblin: Yeah, OK. So they’re substantial tasks, the kind of things that you might delegate to a staff member inside a company. And I guess you’ve gotten software engineers or people who generally would be able to do these kinds of things, and said, on average, this takes a human 15 minutes, it takes them half an hour, it takes them an hour, it takes them two hours.

And you find that there’s quite a strong relationship between whether the model can and can’t do it, and how long it would have taken a human being to do it — which I guess speaks to the fact that it’s hard for these models to stay on task for a long period of time. Or people talk about coherence: things that require them to think about, “How am I going to do this?” over a long period of time. It requires many different steps that have to be chained together. They currently kind of struggle with that. Is that right?

Beth Barnes: Yeah. I think there’s various different explanations for why longer tasks are harder. There’s different ways in which they can be harder.

One is just if there’s a longer sequence of things that you need to do — so you need to get them all right, and there’s a higher probability that you’ll get completely stumped by one of them. You could also imagine either the human just has to think really hard, or they try a bunch of times until they succeed. So it’s not like one totally straightforward measure.

But there’s also this argument that, as we do more RL training on tasks, it’s much more expensive to do RL training on longer tasks than on shorter tasks. And it’s a much harder learning problem if you just get a signal at the end of what in the middle you should have done differently, whereas if the task only has two steps, it’s much easier to learn what you should have done differently.

So in general, there are a bunch of reasons to expect that the shorter tasks will be easier for models, and the models will be better trained to be able to do short tasks than long tasks.

Rob Wiblin: And the upshot is… you could say that, a year ago, the models would have a 50% chance of succeeding at something that would take a typical human 15 minutes; six months later they had a 50% chance of succeeding at something that would take 30 minutes; six months later they have a 50% chance of succeeding at something that would take a human an hour. Seems like it’s roughly doubling about every six months. Could you say what the numbers there are?

Beth Barnes: Yeah, six months is about right. We’re looking all the way from GPT-2 in 2019, where it’s at a few seconds or something like that, to the latest models. And yeah, over that time there’s a pretty good fit to a doubling of something like every six months. And we also did a sort of ablation [sensitivity analysis] both over the actual noise in the data and over different methodological choices we could make, and the range of estimates you get is basically between three months and one year.
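The doubling time itself comes from a trend fit across successive models. Here’s a minimal sketch of that arithmetic, using invented numbers rather than METR’s data: regress the log of each model’s 50% time horizon against its release date, and the inverse of the slope gives the doubling time.

```python
# Hedged illustration with made-up data points, loosely in the spirit of
# "a few seconds in 2019, about an hour now".
import numpy as np

years    = np.array([2019.0, 2020.5, 2022.0, 2023.0, 2024.0, 2024.5, 2025.0])
horizons = np.array([0.05,   0.5,    4.0,    10.0,   30.0,   45.0,   60.0])  # 50% horizons in human-minutes

# Straight-line fit in log2 space: log2(horizon) ~ slope * year + intercept,
# so `slope` is the number of doublings per year.
slope, intercept = np.polyfit(years, np.log2(horizons), 1)

doubling_time_months = 12.0 / slope
print(f"Implied doubling time: ~{doubling_time_months:.1f} months")  # roughly seven months for this toy data
```

Repeating the fit while varying the noise in the data and the methodological choices is what gives the three-month-to-one-year range Beth mentions.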

Recursively self-improving AI might even be here in two years — which is alarming

Rob Wiblin: OK, just to make sure that we underline the most important messages here: AIs currently have a 50% chance of doing something on a computer that an expert might do in two hours. This is doubling every three to 12 months. So they’re becoming substantially more useful agents: they’re able to do things over quite long periods of time and it’s improving rapidly.

They’re able to make a substantial difference to machine learning research now. They’re already used a great deal within the companies. They are probably capable of doing a whole lot more stuff than is even appreciated at this point — because if you really try hard, then you can get much more out of them than if you just make a cursory effort.

All of this leads to the conclusion that probably we should expect recursively self-improving AI pretty shockingly soon, relative to expectations that we might have had five or 10 years ago — that two years wouldn’t be at all surprising, which is alarming to me.

Beth Barnes: Yes, I think this is very alarming. I think people should be very alarmed. The scientist in me wants to add a bunch of caveats to the two-hour number and the doubling time, whatever. But I think yes, it really doesn’t seem like two years would be surprising. It seems hard to rule out even shorter [timelines]. Is there 1% chance of this happening in six, nine months? Yeah, that seems pretty plausible.

Whatever the exact shape of the recursive self-improvement, it still seems like we can kind of see a path to significantly superhuman intelligence that doesn’t require a mysterious intelligence explosion. It’s relatively clear how you would apply the intermediate capabilities that we expect to get in order to get something that’s just obviously pretty scary.

Maybe models will continue to be really non-robust in a way that’s hard to fix, but I don’t think we care about whether we’re very confident this is going to happen — we care about whether we’re very confident it’s not going to happen.

It just feels like we’re at the point where even if you only care about current humans, the stakes are very high. Like a 1% chance of takeover risk, which itself has a 10% chance of human extinction or something, that’s a lot of deaths.

So I think we should really care about numbers that are a whole number of percent. I mean, I think we should care about numbers that are lower than that, but basically I would expect the world in general to agree that it’s not worth imposing a whole-number-percent risk of completely destroying human civilisation for getting the benefits of AI sooner.

Rob Wiblin: A little bit sooner, yeah. I think people in Silicon Valley, or certainly people in the AI industry, these sorts of companies, they’ve seen these results and they’re flipping out basically. Or their expectations are what we’re seeing here: they just see a very clear path to superhuman reasoning, to superhuman ability to improve AI, to therefore a software-based intelligence explosion. So this is kind of the main thing that people talk about in the city that we’re in right now.

It feels like policymakers are a little bit aware of this — certainly AI is in the policy conversation a bit — but by comparison, I think they haven’t yet flipped out to the extent that they probably would if they fully appreciated what was coming down the pipeline.

Beth Barnes: Right.

Rob Wiblin: One of the hopes of this show is to disseminate this information so that people can flip out suitably. Is there anything more you want to say to someone who’s in DC, who’s like, “I don’t know that I really buy any of this. AI’s not my area, or I feel like these tech people are kind of losing their heads a little bit”?

Beth Barnes: It’s kind of hard for me to have that perspective because I’ve just been focused on AI since 2017 or something, and I was previously thinking that it was further off, or that I might be totally wrong about this whole thing. But it feels like we’ve just increasingly gotten evidence that these concerns are sensible. We’re starting to see models exhibit alignment faking, all these things we’re concerned about, and it seems like the capabilities are coming very fast.

So this seems like pretty obviously the most important thing, even from a very personal, selfish perspective: for someone who’s young and healthy, AI catastrophe just might be your highest mortality risk in the next few years.

I think the other sense that people have that I really want to dispel is, “But the experts must be on top of this. The experts would be telling us if it really was time to freak out.” I’m like, the experts are not on top of this. There are not nearly enough people. METR is super starved for talent. We just can’t do a bunch of basic things that are obviously good to do.

And inasmuch as there are experts, they are saying that this is a concerning risk. There was the report that Bengio and others wrote, which was sort of: what is the overall scientific consensus on this risk? And it was like, “Yes, this seems very plausible.”

Rob Wiblin: “We’re not sure whether we will or won’t all die, but it’s totally plausible.”

Beth Barnes: Yeah. And to the extent that I am an expert, I am an expert telling you you should freak out. And there’s not especially anyone else who isn’t saying this.

Do we need external auditors doing AI safety tests, not just the companies themselves?

Beth Barnes: I think often individual safety-concerned people inside companies have a rosier picture of how the company interacts with third parties, like, “We love safety, so of course we would be supporting all of these things. And if it’s not happening, the bottleneck must be something that’s not us, because we really love safety.”

I think Nick Joseph said something like this: that the bottleneck on getting external oversight of the RSP is people who are technically competent to do the evaluations externally — which I very much disagree with, because I think METR has the technical competence. I think our evals and elicitation and things are better than stuff that any lab has published, basically…

Also, it doesn’t necessarily need to be bottlenecked on the technical competence of the external evaluator to literally run all the experiments and set up the compute and things. There’s another paradigm, where the companies run the evaluations internally and they write up what they did, and the external evaluator goes through it with them and discusses it, like… “Maybe you need to do this other experiment here. This thing was insufficient for this reason.”

Or an embedded evaluator paradigm, where someone from an external third-party org is embedded on the team in the lab that’s running the evaluations, and is keeping an eye on everything…

There’s just lots of things here which we’ve proposed and been open to, and it’s not like the companies are all banging down our door to do it…

I think maybe people have a confusion about METR, and have been assuming that METR is a formal evaluator that has arrangements with all the companies to do evaluations for every model.

This is not the case. I wouldn’t want to describe any of the things that we’ve done thus far as actually providing meaningful oversight. There’s a bunch of constraints, including the stuff we were doing was under NDA, so we didn’t have formal authorisation to alert anyone or say if we thought things were concerning.

So yeah, think of it much more as we’ve been trying to prototype what this could look like. And because we’re a small nonprofit, it’s easier for us to be more nimble and try out different things, and lower stakes for the labs to engage with us. So we’ve been excited about raising the bar on how good the oversight actually is.

And labs more want a “Can you promise that you will run an eval on all of our models when we want to, so that we can say we’ve run an eval on them?” sort of thing. And that’s not what we’re excited about, if we don’t think that that proposed procedure is actually providing meaningfully more safety assurance than not.

Rob Wiblin: In general, people think that — at least within the scope of what you’re trying to accomplish — the evaluations that you’ve created are very good. I haven’t heard that many criticisms of them on the substance. And you’re saying the companies could come to you and say, “We would love for you to run all these things on our models before they go out, and for you to play with it as much as possible in order to figure out how dangerous they are.” But they are not requesting that. They’re not actively making that easy for you.

Beth Barnes: I mean, we have gotten asks like that from a bunch of labs. I think the problem is that our goals are like prototyping and de-risking what could provide really good safety assurance.

So that includes better governance mechanisms, in that this actually feeds into decisions or at least is shared with the people who are making the decisions. And we have some carveout for being able to say we made a recommendation and the lab chose to not go with that recommendation, or that employees get to see what was shared with us so that they can check that it was accurate. So there’s governance mechanisms that we could be improving on.

There’s the pre-scaleup or pre-internal deployment versus external deployment, where we think this specific time of external deployment is not that interesting; we’re much more interested in what are the best things you have internally and wanting to know that companies are not building arbitrarily powerful things internally.

There’s other stuff, just like quality of access and information sharing: what agent scaffolding has this model been trained to use, and can you share that with us? And a bunch of stuff that is very important to know to be able to elicit the full capabilities of models, things like: can we see the chain of thought?

Rob Wiblin: I guess they would say to do things the way that you want is a lot of staff time, and they’re sprinting all the time in order to try to keep up. Maybe they tend to play the cards very close to their chest, and this would involve sharing commercially sensitive information with you, so they’re nervous about doing that or perhaps their lawyers at least are nervous about doing that.

Are there other things that they could say in their defence that have any reasonableness to them? Or to what extent would you think it’s fair versus maybe closer to being excuses?

Beth Barnes: Yeah, it definitely just is effort… The cost is more attention from more senior people at the company or something, I would guess, and thinking about how much they really want to do. … The things we’ve proposed are like sharing with us anything that’s not compartmentalised within the company. If your 1,000 employees already know this, three people at METR knowing an anonymised version of it is not really going to increase the surface area very much.

Or taking staff time on the inside: it’s a much smaller fraction of your staff time than it is of METR staff time.

Or spending 1% of the effort that’s spent on the model on capabilities research on evaluations, or a few percent of that effort, would be enough to do this.

I don’t know, there’s stuff about token budgets and things. And OpenAI had their superalignment compute commitment, which was supposed to be much larger numbers than anything that we were asking for.

Rob Wiblin: Twenty percent of their compute up to that point.

Beth Barnes: Right, 20% of the compute. So it’s not like we’re hitting those demands.

I think there are a bunch of things which are just technically and logistically challenging for them. If we’re just like, can the API please not be broken and not keep going down while we’re trying to use it? It’s like, well, before it’s stabilised for public deployment, there’ll just be issues.

A case against safety-focused people working at frontier AI companies

Rob Wiblin: Over the years, I think many people have gone into working at the various AI companies with the vision that one way they’re going to be helpful is shifting the average beliefs of people who work at the companies. Maybe they’ll get a chance to talk to their colleagues and convince them that concerns about how AGI might go are more legitimate than they might think. I guess they’re also just shifting the weight of opinion within the companies, and hopefully moving their culture in a positive, more cautious direction.

What do you think the track record of that is? Is that a reasonable approach for people to take going forward?

Beth Barnes: I basically think the track record of that particular theory of change looks pretty bad. I think if you just look at the history of people who tried to do that, there’s just a lot of people giving up and leaving.

People have started at one lab trying to influence there, and then have kind of given up on that and left.

I think there are probably some people who’ve done this and it’s gone relatively well, but I think they’re very much in the minority.

I think one reason it makes sense to go to a lab is if you think your comparative advantage is institutional politicking, and that’s basically what you plan to do with most of your work. And if you’re very good at that, I think maybe you can do a better job.

Another reason is if you are going to be doing the implementation of safety things at that company. I think if you’re just sort of advancing alignment research generally, it’s better to be outside of a lab — because it’s easier to share it with everyone else and collaborate, and easier for other labs to onboard it if it’s not this thing from a competitor.

Rob Wiblin: You said something a little bit spicy earlier, which is that you think possibly Redwood, with just a handful of people, is maybe producing about as much alignment research as all of the major companies put together. I assume that’s not actually literally true, but… Maybe you do think it’s literally true?

What would the barriers be to the companies producing more alignment publications than they are? It’s a little bit hard to understand, because the resourcing would be so uneven there.

Beth Barnes: Yeah. I may be biased in what I’m exposed to, because I’m just much more aware of the stuff that Redwood has done, but it’s just actually not implausible to me that that is the kind of current ratio.

So reasons companies are less productive: there’s stuff we mentioned a bit before about maybe actually the infrastructure and speed at which you’re able to do research is slower. I think there’s different incentives and pressure to do the thing that will unblock shipping the next generation model, rather than do the research that’s most important in the longer run.

I think to some extent it’s just like too many people in the same place or something, where it’s just like a marginal person is less useful when you already have a tonne of people.

I think also maybe companies are more narrowly focused — this is similar to the short-term point, but it’s both short in terms of time horizon and in terms of locality. They’re just thinking about things at their own company, and not about what is most important for the field as a whole, and how to disseminate that and make it easy for other people to adopt.

And there are a bunch of barriers to publishing, and that’s maybe a reason that some people have left: they’ve been unhappy about not being able to publish their work enough.

Rob Wiblin: I see. Presumably they’re a lot more cautious now about what they share and what they publish, and I imagine there’s a lot of levels of review that stand between someone having some insight and it going out. I guess especially if it’s related to risks that the company might be creating, you can see a lot of people wanting to scrutinise that before it’s shared.

Beth Barnes: Yeah. The comms team in there editing the blog post, being like, “Can we make it sound a bit more optimistic here?”

Maybe another thing that I think is overrated: I feel like sometimes people conflate alignment — in the sense of making sure the model is trying to do what we want, and not trying to do extremely bad things that we definitely don’t want — with making the model better at following complicated instructions, or other things that are more just making the product better or enhancing the model’s capabilities in the direction of following elaborate rules about what we want.

And I would just be more excited about progress that’s progress on the problem when humans don’t know which output is better, but the model knows a bunch of stuff about what’s going on. That seems more useful than, “The humans want the model to follow this complicated set of instructions, so we did the things to make the model better at following the complicated instructions, because otherwise it was forgetting that it’s supposed to also have this policy or something.” That doesn’t feel like it’s making progress on the core problem.

Open-weighting models is often good, and Beth has changed her attitude about it

Rob Wiblin: Is open weighting models in general good or bad, from your point of view?

Beth Barnes: This is something I’ve changed my mind on a fair amount…

Originally I was like, this seems very dangerous if the problem is either that these things can be misused or that they can be dangerous even without a human trying to misuse them. It seems like you don’t want them to be everywhere, and not be able to be like, “Actually, this is a bad idea. Let’s undo that.” It’s an irreversible action, and it opens up a lot more surface area for things to go wrong — if you think that, at least for some of the threat models, there might be a big offence/defence imbalance.

So I think if you’re trying to keep the risk very low, it really feels like you can’t open source anything that capable. Because again, even with current models, it seems hard to rule out that with other advancements in scaffolding or fine-tuning or something, it would then become quite easy to make this model into something very dangerous.

I’ve shifted my perspective generally on how good it is to keep everything inside the labs. Sort of similar to the thing about, would you rather just the labs know about what capabilities are coming or would you rather that everyone knows?

I think in practice, the open source models have been used to do a bunch of really good safety work, and this has been important and good. And having a more healthy, independent, outside-of-companies field of research seems good both for the object-level safety progress you made, and for policymakers having some independent people they can ask — who actually understand what’s going on, and are able to do experiments and things.

Also, in general, I have become more sceptical of some kind of lab exceptionalism. I think a lot of the companies have this, like, “We’re the only responsible ones. We’ll just keep everything to ourselves, and the government will be too slow to notice, and everyone else is irresponsible.” I just think this becomes less plausible the more independent labs are saying this, and the less they actually respond to how responsible the other actors are. And if you trust the companies less, you want less of that.

I think it’s also one of those things where everyone always thinks that you’re the exception — “It’s fine, but we’ll just actually be good and actually be responsible. We don’t need oversight.” And sunlight is the best disinfectant. Oversight is really important, and security mindset and secrecy and locking things down can be bad.

Rob Wiblin: Seems like there’s a reasonable amount to unpack there. The basic reason I’ve heard for why actually open sourcing has been better than perhaps people like me feared a few years ago is that it’s been an absolute boon for alignment and safety research, because it means external people don’t have to actually go work at the labs — they can do things independently, which is really useful for all the reasons that we’ve talked about, if they have access to basically frontier models outside of the companies.

And I guess separately from that, it sounds like you’re saying that the companies having a very secrecy-focused mindset, wanting to hold all of the information about the models within themselves, that that’s a dangerous mentality. And maybe it’s actually helpful to have the weights open and things being used all over the place, so that people can see for themselves the chain of thought, for example, and modify the models and see what is possible if they’re improved — things that wouldn’t be possible if you only accessed it through an API that’s highly limited.

Beth Barnes: I think there’s also a point about maybe you just think some of the companies are bad actors, or there are people at them who would be bad actors, and you would prefer that the rest of the world also has access to the tech…

I think the case for certain types of transparency and openness is much stronger than the case for specifically open sourcing weights. But I do think the open sourcing thus far feels pretty clearly net positive, just because of the impact on safety research. I do think you would have thought that it ought to be harder to argue that we’re in an arms race if people are open sourcing the models. But somehow this doesn’t seem to have stopped anyone.

Rob Wiblin: Never let common sense get in the way of the narrative that you would like to spin for your commercial interest! Just to clarify, you’re talking about the release of DeepSeek R1, which is this super weapon that the Chinese developed and then immediately gave to us for free, in exchange for nothing. It really shows that we’re in a very intense arms race with the Chinese to I guess give one another our greatest technology.

Beth Barnes: [laughs] Yeah.

Rob Wiblin: It is extraordinary that that did not preclude the arms race narrative. I guess you could say that maybe this shows that they will be able to make something in future that they won’t share with us. But you’d really think literally giving away your weapons designs would be definitely an olive branch to your enemy.

Beth Barnes: You would have thought.

