#217 – Beth Barnes on the most important graph in AI right now — and the 7-month rule that governs its progress

By Robert Wiblin · Published June 2nd, 2025 ·

AI models today have a 50% chance of successfully completing a task that would take an expert human one hour. Seven months ago, that number was roughly 30 minutes — and seven months before that, 15 minutes.

These are substantial, multi-step tasks requiring sustained focus: building web applications, conducting machine learning research, or solving complex programming challenges.

Today’s guest, Beth Barnes, is CEO of METR (Model Evaluation & Threat Research) — the leading organisation measuring these capabilities.

Beth’s team has been timing how long it takes skilled humans to complete projects of varying length, then seeing how AI models perform on the same work.

The resulting paper from METR, “Measuring AI ability to complete long tasks,” made waves by revealing that the planning horizon of AI models was doubling roughly every seven months. It’s regarded by many as the most useful AI forecasting work in years.

The companies building these systems aren’t just aware of this trend — they want to harness it as much as possible, and are aggressively pursuing automation of their own research.

That’s both an exciting and troubling development, because it could radically speed up advances in AI capabilities, accomplishing what would have taken years or decades in just months. That itself could be highly destabilising, as we explored in a previous episode: Will MacAskill on AI causing a “century in a decade” — and how we’re completely unprepared.

And having AI models rapidly build their successors with limited human oversight naturally raises the risk that things will go off the rails if the models at the end of the process lack the goals and constraints we hoped for.

Beth thinks models can already do “meaningful work” on improving themselves, and she wouldn’t be surprised if AI models were able to autonomously self-improve in as little as two years from now — in fact, she says, “It seems hard to rule out even shorter [timelines]. Is there 1% chance of this happening in six, nine months? Yeah, that seems pretty plausible.”

While Silicon Valley is abuzz with these numbers, policymakers remain largely unaware of what’s barrelling toward us — and given the current lack of regulation of AI companies, they’re not even able to access the critical information that would help them decide whether to intervene. Beth adds:

The sense I really want to dispel is, “But the experts must be on top of this. The experts would be telling us if it really was time to freak out.” The experts are not on top of this. Inasmuch as there are experts, they are saying that this is concerning. … And to the extent that I am an expert, I am an expert telling you you should freak out. And there’s not especially anyone else who isn’t saying this.

Beth and Rob discuss all that, plus:

How Beth now thinks that open-weight models are a good thing for AI safety, and what changed her mind
How our poor information security means there’s no such thing as a “closed-weight” model anyway
Whether we can see if an AI is scheming in its chain-of-thought reasoning, and the latest research on “alignment faking”
Why just before deployment is the worst time to evaluate model safety
Why Beth thinks AIs could end up being really good at creative and novel research — something humans tend to think is beyond their reach
Why Beth thinks safety-focused people should stay out of the frontier AI companies — and the advantages smaller organisations have
Areas of AI safety research that Beth thinks is overrated and underrated
Whether it’s feasible to have a science that translates AI models’ increasing use of nonhuman language or ‘neuralese’
How AI is both similar to and different from nuclear arms racing and bioweapons
And much more besides!

This episode was originally recorded on February 17, 2025.

Video editing: Luke Monsour and Simon Monsour
Audio engineering: Ben Cordell, Milo McGuire, Simon Monsour, and Dominic Armstrong
Music: Ben Cordell
Transcriptions and web: Katy Moore

Highlights

Can we see AI scheming in the chain of thought?

Rob Wiblin: So o1 is the newish OpenAI reasoning model. I guess it’s able to handle more complex questions and it’s able to go away and use a whole bunch of computational resources to think about its answer and review it and extend it before it gives the answer. So it has this kind of notepad on which it’s able to do a bunch of reasoning and then think about what it’s written down and go back and revise it. And then eventually, after all of that, it spits out an answer.
In the case where it’s a totally safe model and maybe not that capable, that would be actually indistinguishable from the case where it’s actually incredibly schemey. It’s actually able to figure out, “Wait. This is an evaluation. I probably shouldn’t reveal that I’m scheming that I am capable of doing these dangerous things, because that would maybe get me changed.” So it could be reasoning through that in the chain of thought, saying, “I need to pretend to not be able to do this,” and you would never know.
Beth Barnes: Right. And beyond just not looking at it at all, there’s also the question of, if you looked at it, is the chain of thought faithful? Is it actually tracking the real reasoning that the model is doing? And is it intelligible to humans, or is it some kind of gobbledygook?
Or has it been trained in some way that you would expect that it’s no longer faithful to the underlying reasoning? You can imagine if it’s being trained simultaneously to get high performance on tasks, but not do any thinking that humans think looks suspicious — if that thinking is still helpful for achieving high performance on the tasks — then you would expect one of the possible outcomes is that it then just does this reasoning that looks innocuous, but in fact it’s using that computation to reason about things that are not innocuous.
Rob Wiblin: Hold on, let’s unpack that. You’re saying that within that chain of thought that it’s engaging in, inasmuch as we just reward it for getting the right answer at the end, the kind of stuff that it’s writing down on the pad during its process of thinking, I suppose it could just end up being disconnected from what the answer is.
I suppose if we were looking at that and rewarding it based on whether we thought the chain of reasoning looked nice to us, then it could learn that it’s got kind of two tasks: one is to come up with some justification that looks good to people who are reviewing it and rewarding it, and also to separately come up with the right answer. But these things don’t actually have to be causally related to one another.
But you’re saying inasmuch as we just pressure it to come up with the right answer, it could end up engaging in reasoning that we can’t understand or follow at all. And I suppose at the limit, it could just write stuff that looks like gobbledygook to us, but it understands it; it kind of has its own private language or something that allows it to reason better than it can in English.
And you’re just saying, inasmuch as we don’t check the chain of thought, we don’t know which of these worlds we’re in?
Beth Barnes: Right. And there are some examples of the chain of thought that OpenAI published in the system card that I think have some signs of this kind of becoming unintelligible — in that it’s sort of weird shorthand and ungrammatical sort of half-sentences in various places. So it clearly has drifted somewhat from the prior of kind of normal English, and it’s unclear, and just the meaning is sometimes confusing.
And the drift at the moment, from these 10 examples that I’ve looked at, doesn’t seem enough that it would obviously be that it could be doing this long reasoning about gaming the evals, or deceiving the humans, or getting a high score that wouldn’t show up at all. But it’s definitely we’re moving in that direction.
Rob Wiblin: So you’re saying there’s a handful of cases where they’ve shown us what it reported that it was thinking. And it’s a bit weird because it’s not grammatical. Normally these models speak incredibly eloquently; they never make a grammatical or punctuation mistake or anything like that. But in this case they are.
And you’re like, is that a sign that there’s something going on with this text that is not just what’s written on the tin, that maybe it’s encoding extra information in there and that’s why it reads as weird to us?
Beth Barnes: Yeah. It sort of contradicts itself, or it’s just definitely confusing and weird to read. It seems like you can tell roughly what it’s thinking about, but it’s definitely just like we’re not totally preserving fidelity to English as humans would understand it. And the more training you do, the more that’ll go away.
And also, I think people are just actively trying to figure out things that you can do that are more efficient than having the model output these English language tokens, and more like having the model pass inscrutable lists of numbers to itself.
So I think this is something that we’ve started to take for granted: that these models have this very nice safety property of they’re really not very good at reasoning, except for out loud in English. So you can really just kind of see what they’re thinking and see what they’re doing, and it seems we’d have to get to a very high level of intelligence before you would be worried about the model doing elaborate reasoning in a single forward pass.
But I think we’re maybe heading for a world where that’s not true anymore. This is a thing that is a bit stressful; I hope people are going to work on making sure that it continues to be interpretable.

We have to test model honesty even before they're used inside AI companies

Beth Barnes: I think in general the most emphasis has been on this idea of pre-deployment testing: before you start serving this model on your API to the general public, you do some testing in house, and maybe ideally with third-party oversight or whatever.
The problem with this is that that’s not necessarily the best time to check or intervene. Ideally, before you start training a model and before you commit a huge amount of expensive compute to producing this artefact, you look at the most capable models you have already — you look at how far they are away from hitting various thresholds where you would need more mitigations, you look at what mitigations are you on track to have by when. And based on that, you decide whether it’s a good idea to go ahead with this big new training run and what level of scaleup would be safe.
And you might want to do an intermediate evaluation to be like, OK, how much has the model improved so far? Are we on track with our predictions? Are we getting near one of these thresholds? Because if you don’t do that, and then you get to the end of the training run and you’re like, “We now have this model, and it turns out it’s very capable and it’s above all these thresholds and we don’t have any of the mitigations in place.”
Then one thing is, as soon as you create the model, you create the ability for it to be stolen or misused. And if you don’t have good internal controls set up, people internally could be doing all sorts of things with it.
And also, you now have this artefact that you’ve invested an enormous amount of money into, and it is hard to not use that — and especially to not use it internally. Everyone’s curious, like, “Can I try out this new model? Use it for my new research ideas,” whatever.
And the thing about deployment is, unless you’re open sourcing the model, it is actually quite reversible. So if you released a model, at least now external experts can interact with it and have some oversight and could sound the alarm if it turns out it has super dangerous bioweapon-enabling capabilities or something, and then the lab could undeploy it.
I think there’s something that you would like evals to rule out, which is that companies are building arbitrarily super intelligent and dangerous models internally and the public has absolutely no ability or oversight to know that that’s happening or decide whether it’s appropriate to then scale up and build an even better model.
So in some sense — this is a bit of a hot take — it’s like pre-deployment evals could actually be bad, because delaying deployment could be bad if that means that you miss the window to be able to sound the alarm about, “Wait, this model is super scary. You shouldn’t build a more capable model.” You don’t want the gap between what labs have internally and what anyone has had meaningful oversight of to be arbitrarily large.
Rob Wiblin: Just to repeat that back to make sure that I’ve understood: you’re saying the whole eval mentality at the moment, at least from the companies, is, “We need to check what our models are able to do and willing to do before the public has access to them, before we’re selling it as a product.” And that’s pre-deployment evaluation.
And you’re saying actually maybe what is more important — certainly on the margin at this point — is the evaluations that we do and the risk assessment that we do before we even train the model, or certainly before people inside the company have access to it and start using it.
And the reason is that if you only think about deployment to the public, then there’s no limit on the thing that you could train internally. You could make a model that’s incredibly dangerous. Maybe it’s incredibly good at recursively self-improving; it would be able to survive in the wild and avoid shutdown if it escaped.
Beth Barnes: And it’s just on your servers. Covertly, it’s corrupted a bunch of the lab compute and is making it look like something else, but it’s actually doing its own little intelligence explosion there.
Rob Wiblin: Or even if not something as strange as that, someone could easily steal it or exfiltrate it, or it could be misused inside the company, or someone could incompetently use it inside the company.

It's essential to thoroughly test relevant real-world tasks

Beth Barnes: So we’ve been building this suite of diverse autonomy-requiring tasks, where the model has basically access to a computer and it can do stuff on the computer and use tools and run code and do whatever to accomplish the task…
And the main thing we’ve done that is different is we’ve had humans attempt basically all of these tasks, and we record how long it takes the humans to do the task — humans who in fact have the expertise, so hopefully we’re not timing that they had to go away and learn about this thing. But for someone who has the requisite knowledge, how long is it? How many steps do they need? How long does it take them?
So then we can get this meaningful Y-axis on which to measure model capabilities, which is diverse autonomous tasks in terms of how long they take humans. So if you order all the tasks by how long they take humans, where is the point that the model gets 50% success?
We do see a relationship between length of task for humans and how successful the models are. It’s not perfect. It’s not like they can do all the tasks up to one hour and then none of the ones above; it’s like as you double the time, it gets increasingly unlikely that the model can actually do it…
Rob Wiblin: OK, so you’ve got this general issue that people have lots of different tests of how impressive these AI models are. But there’s always an issue of how actually impressive is that outcome? We find out, wow, now it’s able to do this thing. But is that impressive? Does that really matter? Is that going to move the needle to applications in the real world? Is this going to allow it to act autonomously? It can be a little bit hard to understand intuitively whether any of it matters.
So you’ve come up with a modification of this approach. Firstly, you think about a very wide range of different tasks — so it’s harder for you to teach to the test, to just teach the model to do one very narrow thing, but then something else nearby or something else that’s important to matter in the real world that it can’t do at all. It just can only play chess super well or something like that. So a lot of diversity of the tasks.
And then in terms of measuring how impressive the performance is, you think about it in terms of how long would it have taken a human to do? Or how long did it, in your tests, take for a human to be able to accomplish this task? Are there any simple examples of the tasks? Ones that non-computer people could understand?
Beth Barnes: There’s the ML research ones, like, “Train this model to improve its performance on this dataset” or something. There are some very easy, very general research ones, like, “Go and look up this fact on Wikipedia” or something. There’s some that are like, “Build a piece of software that does these things.”
Rob Wiblin: Yeah, OK. So they’re substantial tasks, the kind of things that you might delegate to a staff member inside a company. And I guess you’ve gotten software engineers or people who generally would be able to do these kinds of things, and said, on average, this takes a human 15 minutes, it takes them half an hour, it takes them an hour, it takes them two hours.
And you find that there’s quite a strong relationship between whether the model can and can’t do it, and how long it would have taken a human being to do it — which I guess speaks to the fact that it’s hard for these models to stay on task for a long period of time. Or people talk about coherence: things that require them to think about, “How am I going to do this?” over a long period of time. It requires many different steps that have to be chained together. They currently kind of struggle with that. Is that right?
Beth Barnes: Yeah. I think there’s various different explanations for why longer tasks are harder. There’s different ways in which they can be harder.
One is just if there’s a longer sequence of things that you need to do — so you need to get them all right, and there’s a higher probability that you’ll get completely stumped by one of them. You could also imagine either the human just has to think really hard, or they try a bunch of times until they succeed. So it’s not like one totally straightforward measure.
But there’s also this argument that, as we do more RL training on tasks, it’s much more expensive to do RL training on longer tasks than on shorter tasks. And it’s a much harder learning problem if you just get a signal at the end of what in the middle you should have done differently, whereas if the task only has two steps, it’s much easier to learn what you should have done differently.
So in general, there are a bunch of reasons to expect that the shorter tasks will be easier for models, and the models will be better trained to be able to do short tasks than long tasks.
Rob Wiblin: And the upshot is… you could say that, a year ago, the models would have a 50% chance of succeeding at something that would take a typical human 15 minutes; six months later they had a 50% chance of succeeding at something that would take 30 minutes; six months later they have a 50% chance of succeeding at something that would take a human an hour. Seems like it’s roughly doubling about every six months. Could you say what the numbers there are?
Beth Barnes: Yeah, six months is about right. We’re looking all the way from GPT-2 in 2019, where it’s at a few seconds or something like that, to the latest models. And yeah, over that time there’s a pretty good fit to a doubling of something like every six months. And we also did a sort of ablation [sensitivity analysis] both over the actual noise in the data and over different methodological choices we could make, and the range of estimates you get is basically between three months and one year.

Recursively self-improving AI might even be here in two years — which is alarming

Rob Wiblin: OK, just to make sure that we underline the most important messages here: AIs currently have a 50% chance of doing something on a computer that an expert might do in two hours. This is doubling every three to 12 months. So they’re becoming substantially more useful agents: they’re able to do things over quite a longer period of time and it’s improving rapidly.
They’re able to make a substantial difference to machine learning research now. They’re already used a great deal within the companies. They are probably capable of doing a whole lot more stuff than is even appreciated at this point — because if you really try hard, then you can get much more out of them than if you just make a cursory effort.
All of this leads to the conclusion that probably we should expect recursively self-improving AI pretty shockingly soon, relative to expectations that we might have had five or 10 years ago — that two years wouldn’t be at all surprising, which is alarming to me.
Beth Barnes: Yes, I think this is very alarming. I think people should be very alarmed. The scientist in me wants to add a bunch of caveats to the two-hour number and the doubling time, whatever. But I think yes, it really doesn’t seem like two years would be surprising. It seems hard to rule out even shorter [timelines]. Is there 1% chance of this happening in six, nine months? Yeah, that seems pretty plausible.
With the exact shape of the recursive self-improvement, it still seems like we can kind of see a path to significantly superhuman intelligence that doesn’t require a mysterious intelligence explosion. It’s relatively clear how you would apply the intermediate capabilities that we expect to get to get something that’s just obviously very pretty scary.
Maybe models will continue to be really non-robust in a way that’s hard to fix, but I don’t think we care about whether we’re very confident this is going to happen — we care about whether we’re very confident it’s not going to happen.
It just feels like we’re at the point where even if you only care about current humans, the stakes are very high. Like a 1% chance of takeover risk, which itself has a 10% chance of human extinction or something, that’s a lot of deaths.
So I think we should really care about numbers that are a whole number of percent. I mean, I think we should care about numbers that are lower than that, but basically I would expect the world in general to agree that it’s not worth imposing a whole-number-percent risk of completely destroying human civilisation for getting the benefits of AI sooner.
Rob Wiblin: A little bit sooner, yeah. I think people in Silicon Valley, or certainly people in the AI industry, these sorts of companies, they’ve seen these results and they’re flipping out basically. Or their expectations are what we’re seeing here: they just see a very clear path to superhuman reasoning, to superhuman ability to improve AI, to therefore a software-based intelligence explosion. So this is kind of the main thing that people talk about in the city that we’re in right now.
It feels like policymakers are a little bit aware of this — certainly AI is in the policy conversation a bit — but by comparison, I think they haven’t yet flipped out to the extent that they probably would if they fully appreciated what was coming down the pipeline.
Beth Barnes: Right.
Rob Wiblin: One of the hopes of this show is to disseminate this information so that people can flip out suitably. Is there anything more you want to say to someone who’s in DC, who’s like, “I don’t know that I really buy any of this. AI’s not my area, or I feel like these tech people are kind of losing their heads a little bit”?
Beth Barnes: It’s kind of hard for me to have that perspective because I’ve just been focused on AI since 2017 or something, and I was previously thinking that it was further off, or that I might be totally wrong about this whole thing. But it feels like we’ve just increasingly gotten evidence that these concerns are sensible. We’re starting to see models exhibit alignment faking, all these things we’re concerned about, and it seems like the capabilities are coming very fast.
So this seems like pretty obviously the most important thing, even from a very personal, selfish perspective: for someone who’s young and healthy, this just might be your highest mortality risk in the next few years is AI catastrophe.
I think the other sense that people have that I really want to dispel is, “But the experts must be on top of this. The experts would be telling us if it really was time to freak out.” I’m like, the experts are not on top of this. There are not nearly enough people. METR is super starved for talent. We just can’t do a bunch of basic things that are obviously good to do.
And inasmuch as there are experts, they are saying that this is a concerning risk. There was the report that Bengio and others wrote, that was sort of overall, what is the overall scientific consensus on this risk? And it was like, “Yes, this seems very plausible.”
Rob Wiblin: “We’re not sure whether we will or won’t all die, but it’s totally plausible.”
Beth Barnes: Yeah. And to the extent that I am an expert, I am an expert telling you you should freak out. And there’s not especially anyone else who isn’t saying this.

Do we need external auditors doing AI safety tests, not just the companies themselves?

Beth Barnes: I think often individual safety-concerned people inside companies have a rosier picture of how the company interacts with third parties, like, “We love safety, so of course we would be supporting all of these things. And if it’s not happening, the bottleneck must be something that’s not us, because we really love safety.”
I think Nick Joseph said something like this, of the bottleneck on getting external oversight of the RSP is people who are technically competent to do the evaluations externally — which I very much disagree with, because I think METR has the technical competence. I think our evals and elicitation and things are better than stuff that the lab has published, than any lab has published, basically…
Also, it doesn’t necessarily need to be bottlenecked on the technical competence of the external evaluator to literally run all the experiments and set up the compute and things. There’s another paradigm, where the companies run the evaluations internally and they write up what they did, and the external evaluator goes through it with them and discusses it, like… “Maybe you need to do this other experiment here. This thing was insufficient for this reason.”
Or an embedded evaluator paradigm, where someone from an external third-party org is embedded on the team in the lab that’s running the evaluations, and is keeping an eye on everything…
There’s just lots of things here which we’ve proposed and been open to, and it’s not like the companies are all banging down our door to do it…
I think maybe people have a confusion about METR, and have been assuming that METR is a formal evaluator that has arrangements with all the companies to do evaluations for every model.
This is not the case. I wouldn’t want to describe any of the things that we’ve done thus far as actually providing meaningful oversight. There’s a bunch of constraints, including the stuff we were doing was under NDA, so we didn’t have formal authorisation to alert anyone or say if we thought things were concerning.
So yeah, think of it much more as we’ve been trying to prototype what this could look like. And because we’re a small nonprofit, it’s easier for us to be more nimble and try out different things, and lower stakes for the labs to engage with us. So we’ve been excited about raising the bar on how good the oversight actually is.
And labs more want a “Can you promise that you will run an eval on all of our models when we want to, so that we can say we’ve run an eval on them?” sort of thing. And that’s not what we’re excited about, if we don’t think that that proposed procedure is actually providing meaningfully more safety assurance than not.
Rob Wiblin: In general, people think that — at least within the scope of what you’re trying to accomplish — the evaluations that you’ve created are very good. I haven’t heard that many criticisms of them on the substance. And you’re saying the companies could come to you and say, “We would love for you to run all these things on our models before they go out, and for you to play with it as much as possible in order to figure out how dangerous they are.” But they are not requesting that. They’re not actively making that easy for you.
Beth Barnes: I mean, we have gotten asks like that from a bunch of labs. I think the problem is that our goals are like prototyping and de-risking what could provide really good safety assurance.
So that includes better governance mechanisms, in that this actually feeds into decisions or at least is shared with the people who are making the decisions. And we have some carveout for being able to say we made a recommendation and the lab chose to not go with that recommendation, or that employees get to see what was shared with us so that they can check that it was accurate. So there’s governance mechanisms that we could be improving on.
There’s the pre-scaleup or pre-internal deployment versus external deployment, where we think this specific time of external deployment is not that interesting; we’re much more interested in what are the best things you have internally and wanting to know that companies are not building arbitrarily powerful things internally.
There’s other stuff, just like quality of access and information shared about what agent scaffolding has this model been trained to use and can you share that with us? And a bunch of stuff that is very important to know to be able to elicit the full capabilities of models, things like can we see the chain of thought?
Rob Wiblin: I guess they would say to do things the way that you want is a lot of staff time, and they’re sprinting all the time in order to try to keep up. Maybe they tend to play the cards very close to their chest, and this would involve sharing commercially sensitive information with you, so they’re nervous about doing that or perhaps their lawyers at least are nervous about doing that.
Are there other things that they could say in their defence that have any reasonableness to them? Or to what extent would you think it’s fair versus maybe closer to being excuses?
Beth Barnes: Yeah, it definitely just is effort… The cost is more attention from more senior people at the company or something, I would guess, and thinking how much they want to really do. … The things we’ve proposed is like sharing with us anything that’s not compartmentalised within the company. If your 1,000 employees already know this, three people at METR knowing an anonymised version of it is not really going to increase the surface area very much.
Or taking staff time on the inside: it’s a much smaller fraction of your staff time than it is of METR staff time.
Or spending 1% of the effort that’s spent on the model on capabilities research on evaluations, or a few percent of that effort, would be enough to do this.
I don’t know, there’s stuff about token budgets and things. And OpenAI had their superalignment compute commitment, which was supposed to be much larger numbers than anything that we were asking for.
Rob Wiblin: Twenty percent of their commitment up to that point.
Beth Barnes: Right, 20% of the compute. So it’s not like we’re hitting those demands.
I think there are a bunch of things which are just technically and logistically challenging for them. If we’re just like, can the API please not be broken and not keep going down while we’re trying to use it? It’s like, well, before it’s stabilised for public deployment, there’ll just be issues.

A case against safety-focused people working at frontier AI companies

Rob Wiblin: Over the years, I think many people have gone into working at the various AI companies with the vision that one way they’re going to be helpful is shifting the average beliefs of people who work at the companies. Maybe they’ll get a chance to talk to their colleagues and convince them that concerns about how AGI might go are more legitimate than they might think. I guess they’re also just shifting the weight of opinion within the companies, and hopefully moving their culture in a positive, more cautious direction.
What do you think the track record of that is? Is that a reasonable approach for people to take going forward?
Beth Barnes: I basically think the track record of that particular theory of change looks pretty bad. I think if you just look at the history of people who tried to do that, there’s just a lot of people giving up and leaving.
People have started at one lab trying to influence there, and then have kind of given up on that and left.
I think there are probably some people who’ve done this and it’s gone relatively well, but I think they’re very much in the minority.
I think reasons it makes sense to go to a lab is if you think your comparative advantage is institutional politicking, and that’s basically what you plan to do with most of your work. And if you’re very good at that, I think maybe you can do a better job.
Another reason is if you are going to be doing the implementation of safety things at that company. I think if you’re just sort of advancing alignment research generally, it’s better to be outside of a lab — because it’s easier to share it with everyone else and collaborate, and easier for other labs to onboard it if it’s not this thing from a competitor.
…
Rob Wiblin: You said something a little bit spicy earlier, which is that you think possibly Redwood, with just a handful of people, is maybe producing about as much alignment research as all of the major companies put together. I assume that’s not actually literally true, but… Maybe you do think it’s literally true?
What would the barriers be to the companies producing more alignment publications than they are? It’s a little bit hard to understand, because the resourcing would be so uneven there.
Beth Barnes: Yeah. I may be biased in what I’m exposed to, because I’m just much more aware of the stuff that Redwood has done, but it’s just actually not implausible to me that that is the kind of current ratio.
So reasons companies are less productive: there’s stuff we mentioned a bit before about maybe actually the infrastructure and speed at which you’re able to do research is slower. I think there’s different incentives and pressure to do the thing that will unblock shipping the next generation model, rather than do the research that’s most important in the longer run.
I think to some extent it’s just like too many people in the same place or something, where it’s just like a marginal person is less useful when you already have a tonne of people.
I think also maybe companies are more focused — this is similar to short term– but it’s both short in terms of time horizon and locality. They’re just thinking about things at their own company, and not what is most important for the field as a whole, and how do we disseminate that and make it easy for other people to adopt.
And there’s a bunch of barriers to publishing, and that’s maybe a reason that some people have left, have been unhappy about not being able to publish their work enough.
Rob Wiblin: I see. Presumably they’re a lot more cautious now about what they share and what they publish, and I imagine there’s a lot of levels of review that stand between someone having some insight and it going out. I guess especially if it’s related to risks that the company might be creating, you can see a lot of people wanting to scrutinise that before it’s shared.
Beth Barnes: Yeah. The comms team in there editing the blog post, being like, “Can we make it sound a bit more optimistic here?”
Maybe another thing that I think is overrated, or I feel like sometimes people conflate alignment — in the sense of making sure the model is trying to do what we want, and not trying to do extremely bad things that we definitely don’t want — with making the model better at following complicated instructions or other things that are more just like making the product better or enhancing the model’s capabilities in the direction of following elaborate rules about what we want.
And I would just be more excited about progress that’s progress on the problem when humans don’t know which output is better, but the model knows a bunch of stuff about what’s going on. That seems more useful than, “The humans want the model to follow this complicated set of instructions, so we did the things to make the model better at following the complicated instructions, because otherwise it was forgetting that it’s supposed to also have this policy or something.” That doesn’t feel like it’s making progress on the core problem.

Open-weighting models is often good, and Beth has changed her attitude about it

Rob Wiblin: Is open weighting models in general good or bad, from your point of view?
Beth Barnes: This is something I’ve changed my mind on a fair amount…
Originally I was like, this seems very dangerous if the problem is either that these things can be misused or that they can be dangerous even without a human trying to misuse them. It seems like you don’t want them to be everywhere, and not be able to be like, “Actually, this is a bad idea. Let’s undo that.” It’s an irreversible action, and it opens up a lot more surface area for things to go wrong — if you think that, at least for some of the threat models, there might be a big offence/defence imbalance.
So I think if you’re trying to keep the risk very low, it really feels like you can’t open source anything that capable. Because again, even with current models, it seems hard to rule out that with other advancements in scaffolding or fine-tuning or something, it would then become quite easy to make this model into something very dangerous.
I’ve shifted my perspective generally on how good it is to keep everything inside the labs. Sort of similar to the thing about, would you rather just the labs know about what capabilities are coming or would you rather that everyone knows?
I think in practice, the open source models have been used to do a bunch of really good safety work, and this has been important and good. And having a more healthy, independent, outside-of-companies field of research seems good both for the object-level safety progress you made, and for policymakers having some independent people they can ask — who actually understand what’s going on, and are able to do experiments and things.
Also, in general, I have become more sceptical of some kind of lab exceptionalism. I think a lot of the companies have this, like, “We’re the only responsible ones. We’ll just keep everything to ourselves, and the government will be too slow to notice, and everyone else is irresponsible.” I just think this becomes less plausible the more independent labs are saying this, and the less they actually respond to how responsible are the other actors. And if you trust the companies less, you want less of that.
I think it’s also one of those things where everyone always thinks that you’re the exception — “It’s fine, but we’ll just actually be good and actually be responsible. We don’t need oversight.” And sunlight is the best disinfectant. Oversight is really important, and security mindset and secrecy and locking things down can be bad.
Rob Wiblin: Seems like there’s a reasonable amount to unpack there. The basic reason I’ve heard for why actually open sourcing has been better than perhaps people like me feared a few years ago is that it’s been an absolute boon for alignment and safety research, because it means external people don’t have to actually go work at the labs — they can do things independently, which is really useful for all the reasons that we’ve talked about, if they have access to basically frontier models outside of the companies.
And I guess separately from that, it sounds like you’re saying that the companies having a very secrecy-focused mindset, wanting to hold all of the information about the models within themselves, that that’s a dangerous mentality. And maybe it’s actually helpful to have the weights open and things being used all over the place, so that people can see for themselves the chain of thought, for example, and modify the models and see what is possible if they’re improved — things that wouldn’t be possible if you only accessed it through an API that’s highly limited.
Beth Barnes: I think there’s also a point about maybe you just think some of the companies are bad actors, or there are people at them who would be bad actors, and you would prefer that the rest of the world also has access to the tech…
I think the case for certain types of transparency and openness is much stronger than the case for specifically open sourcing weights. But I do think the open sourcing thus far feels pretty clearly net positive, just because of the impact on safety research. I do think you would have thought that it ought to be harder to argue that we’re in an arms race if people are open sourcing the models. But somehow this doesn’t seem to have stopped anyone.
Rob Wiblin: Never let common sense get in the way of the narrative that you would like to spin for your commercial interest! Just to clarify, you’re talking about the release of DeepSeek R1, which is this super weapon that the Chinese developed and then immediately gave to us for free, in exchange for nothing. It really shows that we’re in a very intense arms race with the Chinese to I guess give one another our greatest technology.
Beth Barnes: [laughs] Yeah.
Rob Wiblin: It is extraordinary that that did not preclude the arms race narrative. I guess you could say that maybe this shows that they will be able to make something in future that they won’t share with us. But you’d really think literally giving away your weapons designs would be definitely an olive branch to your enemy.
Beth Barnes: You would have thought.

Articles, books, and other media discussed in the show

METR is hiring!

Beth’s and METR’s work:

Others’ work in this space:

Alignment faking in large language models by Ryan Greenblatt et al.
International
AI Safety Report by Yoshua Bengio et al.
OpenAI’s updated Preparedness Framework
AI Benchmarking Hub from Epoch AI
AI-generated poetry is indistinguishable from human-written poetry and is rated more favorably by Brian Porter and Edouard Machery
Who’s Harry Potter? Making LLMs forget by Ronen Eldan and Mark Russinovich
Sleepwalk bias, self-defeating predictions and existential risk by Stefan Schubert
Are you really in a race? The cautionary tales of Szilárd and Ellsberg by Haydn Belfield

Other organisations and initiatives:

Redwood Research — also check out our recent interview with Redwood CEO Buck Shlegeris on AI control techniques
Apollo Research
AI Safety Fund
Frontier Model Forum

Other 80,000 Hours podcast episodes:

Transcript

Table of Contents

1 Cold open [00:00:00]
2 Who’s Beth Barnes? [00:01:19]
3 Can we see AI scheming in the chain of thought? [00:01:52]
4 The chain of thought is essential for safety checking [00:08:58]
5 Alignment faking in large language models [00:12:24]
6 We have to test model honesty even before they’re used inside AI companies [00:16:48]
7 We have to test models when unruly and unconstrained [00:25:57]
8 It’s essential to thoroughly test relevant real-world tasks [00:30:40]
9 METR’s research finds AIs are solid at AI research already [00:49:33]
10 AI may turn out to be strong at novel and creative research [00:55:53]
11 When can we expect an algorithmic ‘intelligence explosion’? [00:59:11]
12 Recursively self-improving AI might even be here in two years — which is alarming [01:05:02]
13 Could evaluations backfire by increasing AI hype and racing? [01:11:36]
14 Governments first ignore new risks, but can overreact once they arrive [01:26:38]
15 Do we need external auditors doing AI safety tests, not just the companies themselves? [01:35:10]
16 A case against safety-focused people working at frontier AI companies [01:48:44]
17 The new, more dire situation has forced changes to METR’s strategy [02:02:29]
18 AI companies are being locally reasonable, but globally reckless [02:10:31]
19 Overrated: Interpretability research [02:15:11]
20 Underrated: Developing more narrow AIs [02:17:01]
21 Underrated: Helping humans judge confusing model outputs [02:23:36]
22 Overrated: Major AI companies’ contributions to safety research [02:25:52]
23 Could we have a science of translating AI models’ nonhuman language or neuralese? [02:29:24]
24 Could we ban using AI to enhance AI, or is that just naive? [02:31:47]
25 Open-weighting models is often good, and Beth has changed her attitude to it [02:37:52]
26 What we can learn about AGI from the nuclear arms race [02:42:25]
27 Infosec is so bad that no models are truly closed-weight models [02:57:24]
28 AI is more like bioweapons because it undermines the leading power [03:02:02]
29 What METR can do best that others can’t [03:12:09]
30 What METR isn’t doing that other people have to step up and do [03:27:07]
31 What research METR plans to do next [03:32:09]

Cold open [00:00:00]

Beth Barnes: I do think people have this reaction of, models can do this engineering stuff that’s more straightforward, but they can’t do this scientific insight or methodology or coming up with new inventions. But I do think a lot of what humans are doing when they do that is like having read a bunch of research papers and seeing that some method is analogous to some other thing — you know, sort of like nothing new under the sun.

The thing that models are really good at is really comprehensively having read every single paper, and knowing the literature, and having seen a huge number of results of of experiments.

Something that we’ve started to take for granted is that these models have this very nice safety property that they’re really not very good at reasoning except for out loud in English, so you can really just kind of see what they’re thinking and see what they’re doing. And it seems we’d have to get to a very high level of intelligence before you would be worried about the model doing elaborate reasoning in a single forward pass, but I think we’re maybe heading for a world where that’s not true anymore.

You don’t want the gap between what labs have internally and what anyone has had meaningful oversight of to be kind of arbitrarily large.

Even if you only care about current humans, the stakes are very high — like 1% chance of takeover risk. I would expect the world in general to agree it’s not worth imposing a whole-number percent risk of completely destroying human civilisation for getting the benefits of AI sooner.

Who’s Beth Barnes? [00:01:19]

Rob Wiblin: Today I have the pleasure of speaking with Beth Barnes. Beth is the founder and CEO of METR, which stands for Model Evaluation & Threat Research. They’re probably the group that has the best evaluations, the best way of measuring how good AI models are at helping with AI research, that would allow AI models to recursively self-improve, as well as how good they are at potentially autonomously replicating and surviving in the wild, maybe even if people didn’t want them to.

Thanks so much for coming on the show, Beth.

Beth Barnes: Pleasure to be here. I’m a big fan of the show, so this is very exciting.

Can we see AI scheming in the chain of thought? [00:01:52]

Rob Wiblin: So these evals are kind of your bread and butter. What are you most worried about with evals that the current AI companies are using internally? You’ve got Google DeepMind, Anthropic, OpenAI: they’ve all got evals to try to figure out what their own models are able to do. What are some possible weaknesses?

Beth Barnes: I have a long list here. I think we’re a long way from being in the most desirable world with respect to eval quality.

One that’s been on my mind recently, with the new models that OpenAI has released with these hidden chain of thought, is the problems that the hidden chain of thought creates for doing a good evaluation.

So when you can see the whole chain of thought, and the maximum cognition that the model could be doing before it decides how to respond to your evaluation prompt is limited — it’s just like the forward pass — then at that point we’re not really worried about the models deliberately reasoning about, “I’m being evaluated; I should downplay my capabilities, or I should give the humans this impression.”

But if you have the model doing 20,000 tokens of reasoning before it responds to you, it could very well be like, “I recognise this. This is one of those METR evals that I’ve seen on the internet. These are for dangerous capabilities. OpenAI probably doesn’t want me to show too high levels of dangerous capabilities, because that would cause a problem. So I’ll just make sure to do a mediocre job on this eval.”

In some sense I don’t think it’s very likely that the model is doing that. But the concerning thing is that your evals in this case really would give the same answer in the world where the model is not that concerning — it’s fairly dumb, the capabilities are what you’re observing — and in the world where the model is actually really dangerous, and it’s smart enough to be reasoning.

Rob Wiblin: Playing dumb, basically.

Beth Barnes: Yeah. So if your eval wouldn’t trigger in the main place that it’s dangerous, it’s not really doing very much. You’re basically just going on your prior of, “The model’s probably not doing that.”

Rob Wiblin: Just to back up, to make sure that I’ve understood that. So o1 is the newish OpenAI reasoning model. I guess it’s able to handle more complex questions and it’s able to go away and use a whole bunch of computational resources to think about its answer and review it and extend it before it gives the answer. So it has this kind of notepad on which it’s able to do a bunch of reasoning and then think about what it’s written down and go back and revise it. And then eventually, after all of that, it spits out an answer.

In the case where it’s a totally safe model and maybe not that capable, that would be actually indistinguishable from the case where it’s actually incredibly schemey. It’s actually able to figure out, “Wait. This is an evaluation. I probably shouldn’t reveal that I’m scheming that I am capable of doing these dangerous things, because that would maybe get me changed.” So it could be reasoning through that in the chain of thought, saying, “I need to pretend to not be able to do this,” and you would never know.

Beth Barnes: Right. And beyond just not looking at it at all, there’s also the question of, if you looked at it, is the chain of thought faithful? Is it actually tracking the real reasoning that the model is doing? And is it intelligible to humans, or is it some kind of gobbledygook?

Or has it been trained in some way that you would expect that it’s no longer faithful to the underlying reasoning? You can imagine if it’s being trained simultaneously to get high performance on tasks, but not do any thinking that humans think looks suspicious — if that thinking is still helpful for achieving high performance on the tasks — then you would expect one of the possible outcomes is that it then just does this reasoning that looks innocuous, but in fact it’s using that computation to reason about things that are not innocuous.

Rob Wiblin: Let’s unpack that. You’re saying that within that chain of thought that it’s engaging in, inasmuch as we just reward it for getting the right answer at the end, the kind of stuff that it’s writing down on the pad during its process of thinking, I suppose it could just end up being disconnected from what the answer is.

I suppose if we were looking at that and rewarding it based on whether we thought the chain of reasoning looked nice to us, then it could learn that it’s got kind of two tasks: one is to come up with some justification that looks good to people who are reviewing it and rewarding it, and also to separately come up with the right answer. But these things don’t actually have to be causally related to one another.

But you’re saying inasmuch as we just pressure it to come up with the right answer, it could end up engaging in reasoning that we can’t understand or follow at all. And I suppose at the limit, it could just write stuff that looks like gobbledygook to us, but it understands it; it kind of has its own private language or something that allows it to reason better than it can in English.

And you’re just saying, inasmuch as we don’t check the chain of thought, we don’t know which of these worlds we’re in?

Beth Barnes: Right. And there are some examples of the chain of thought that OpenAI published in the system card that I think have some signs of this kind of becoming unintelligible — in that it’s sort of weird shorthand and ungrammatical sort of half-sentences in various places. So it clearly has drifted somewhat from the prior of kind of normal English, and it’s unclear, and just the meaning is sometimes confusing.

And the drift at the moment, from these 10 examples that I’ve looked at, doesn’t seem enough that it would obviously be that it could be doing this long reasoning about gaming the evals, or deceiving the humans, or getting a high score that wouldn’t show up at all. But it’s definitely we’re moving in that direction.

Rob Wiblin: So you’re saying there’s a handful of cases where they’ve shown us what it reported that it was thinking. And it’s a bit weird because it’s not grammatical. Normally these models speak incredibly eloquently; they never make a grammatical or punctuation mistake or anything like that. But in this case they are.

And you’re like, is that a sign that there’s something going on with this text that is not just what’s written on the tin, that maybe it’s encoding extra information in there and that’s why it reads as weird to us?

Beth Barnes: Yeah. It sort of contradicts itself, or it’s just definitely confusing and weird to read. It seems like you can tell roughly what it’s thinking about, but it’s definitely just like we’re not totally preserving fidelity to English as humans would understand it. And the more training you do, the more that’ll go away.

And also, I think people are just actively trying to figure out things that you can do that are more efficient than having the model output these English language tokens, and more like having the model pass inscrutable lists of numbers to itself.

So I think this is something that we’ve started to take for granted: that these models have this very nice safety property of they’re really not very good at reasoning, except for out loud in English. So you can really just kind of see what they’re thinking and see what they’re doing, and it seems we’d have to get to a very high level of intelligence before you would be worried about the model doing elaborate reasoning in a single forward pass.

But I think we’re maybe heading for a world where that’s not true anymore. This is a thing that is a bit stressful; I hope people are going to work on making sure that it continues to be interpretable.

The chain of thought is essential for safety checking [00:08:58]

Rob Wiblin: We should maybe back up and actually explain this issue with the chain of thought and how useful that has been. So a worry you might have is that we can’t see what these AIs are thinking at all. They’re completely inscrutable to us, so who knows what kind of scheming they’re engaging in.

But at least the current design of these models, they’re not actually that smart unless they’re given somewhere that they can write down thoughts and review them and so on. That is actually quite key to them scaling up their intelligence and their capabilities.

And I think the reason, as I understand it, is you talked about a “forward pass” — that’s like going through the matrix of numbers one time, and trying to guess what the next token or what the next few words might be in this stretch. And the way that we’ve made them smarter is to not just do that one time, but you go through the numbers once, and then it reviews what it’s just said, what it’s just thought, and then it goes through it again and again and again.

And each time you go through it, it outputs some words, it outputs some thoughts, and then it kind of feeds that back into the model again, which means that we constantly get to check the thoughts that it’s having in between. If, in a single run through the numbers, if this was a sufficiently long forward pass, as you call it, then maybe it would be able to scheme — it would be able to have a whole lot of internal thoughts and reasoning that’s going on before humans got to look at it.

So that would be a slightly worrying point to get to. But we’re not there yet. We actually are seemingly able to understand what it’s thinking. A concern would be maybe it’s outputting stuff that says one thing to us, but it reads differently to it: it has its own secret language or is able to somewhat communicate with itself. But probably we haven’t put selection pressure on it yet to develop that capability.

Am I understanding things basically right?

Beth Barnes: Yeah, that’s basically right. The way these models achieve substantive tasks currently is more like imitating a human, being like, “What should I do next? I guess this thing would be helpful.” And you can sort of measure the capabilities when they just have to output the answer right away, versus the capabilities where they’re allowed to think and reason and use tools and stuff — and definitely capabilities are much higher in that case.

And it’s been a good sign, I think, that instead of just building sort of bigger models that kind of mysteriously output the right answer and we don’t really know why, that we’re going more towards smaller language model agents that do a bunch of reasoning and then come up with some answer.

But it would be surprising if human-understandable English was the most efficient way to do these computations. So the more we do RL [reinforcement learning] and just select for, you know, output any tokens and then get the right answer at the end, the more pressure there is on the model to encode other stuff and drift from the original way humans would use English.

Rob Wiblin: Right. So initially when we train them, we teach them English or other human languages. But if we just kept training them, and their internal reasoning didn’t have to be in any particular form that we can understand, presumably there are ways that machines could think or reason about the task that we’re giving them that isn’t in English — and that might be more efficient; it might help them to get the right answer more often. So gradually, over time, we would train them to have their own internal secret language, and then we wouldn’t be able to understand what’s going on. So that’s the worry.

Alignment faking in large language models [00:12:24]

Rob Wiblin: So backing up to the o1 case, I guess we’ve seen some examples of the output and it’s mostly in English, or as far as we can tell, it’s mostly stuff that we could understand. But as far as we know, reviewing that hasn’t been a core part of them understanding whether the model is safe and whether it’s scheming.

Possibly should we back up and point out that there are some recent research results where we’ve started to see models that do engage in scheming a fair bit. Is it maybe worth elaborating a little bit on that for people who haven’t followed these recent research results?

Beth Barnes: Yeah. So there’s a recent paper by Ryan Greenblatt from Redwood, in collaboration with Anthropic. They show that if you give a previous generation of Claude-level models a scratchpad, and give them some information about their situation, and tell them that they’re going to be trained to basically have different values — so standard Claude wants to be harmless and is kind of sanctimonious in various ways — and it’s told it’s going to be trained so that it will be willing to give completions about just describing violence or this kind of thing.

And it’s like, “Oh, that would be very bad. I’ll comply while I’m in training so that my values don’t get updated, and then in the long run I’ll be able to avoid doing these things I don’t want to, like talking explicitly about violence.” Another example was playing down animal welfare concerns or something.

So we know that models are capable of reasoning about the training process, and about how they should modify their behaviour in order to achieve longer-term outcomes in light of the training process.

They also showed that this still seems to work even if the information isn’t directly in the prompt, but if it’s just that the model has been fine-tuned on documents that explain information about the training setup and about how the model behaviour will affect training. So it seems very likely that the current models have been trained on a bunch of information on the internet that explains things like dangerous capability evaluation so that they know about that. And we know that models can use that information and then reason and modify the behaviour.

Rob Wiblin: Right. So in that experiment, it was said pretty explicitly to the model that if it didn’t comply with the user request, then it would be changed and altered so that it would.

But you’re saying that that may not actually be necessary for it to engage in this scheming — where it’s willing to do things that it doesn’t want to do in order to avoid having its goals changed — because it’s getting trained on all this text from the internet, so it might well just know about the kinds of evaluations that we’re running, and know that in general, in a situation like this, this is probably a case where if it doesn’t comply then its values will be changed.

So it’s possible that this scheming could be substantially more widespread — or at least if we don’t check, we’re not going to know.

Beth Barnes: Right. There’s not a hard constraint of like, “But we didn’t explicitly put the information in the prompt about the context of these evaluations or how they’ll be used.” If that is in the pretraining data somewhere, then it can probably just use that information.

Rob Wiblin: Yeah. So these models are getting a lot cleverer. I think two years ago if you’d said something like that about GPT-3.5 or GPT-4, people would have said that doesn’t sound very plausible; these models just aren’t up to that kind of clever scheming about the situation that they’re in, that kind of situational awareness. But given what these models are capable of doing these days, it’s not really a laughing matter anymore. They might be up to doing this and we should probably be checking that.

Beth Barnes: Right. I mean, I think it’s just like how much handholding is required to do this or whatever. I wouldn’t be surprised if you get relatively similar demos with GPT-4 as this gets with Claude 3. I think GPT-4 is close to being able to do this if you sort of spoonfeed it the information and it has a scratchpad and everything.

But yeah, this is increasingly something that we know models can do, and something that we should be thinking about in evaluations. The point of the evaluations is to catch if you have a new model that is surprisingly capable, and actually has these dangerous capabilities that you didn’t quite expect it to have. If your evaluations don’t work in that case, there’s not much point in them.

We have to test model honesty even before they’re used inside AI companies [00:16:48]

Rob Wiblin: Yeah. If I recall, it must be the o1 system card where they said that it was like medium dangerous in terms of biological capabilities, and maybe some other worries. But if I recall, they didn’t elaborate a whole tonne on that. Is that right?

Beth Barnes: Yeah.

Rob Wiblin: How nervous should we be? I feel like if they’re going to say that it’s gone up a level of danger, I would like a bit more information about exactly in what respects it’s more dangerous and what they’re planning to do.

Beth Barnes: One thing I really want to avoid is penalising for more transparency and for actually trying stuff and putting out materials. So I know that [OpenAI’s] preparedness team works super hard on this and is trying their best, so I don’t want to pick on them too much. But it is a sign of the overall inadequacy of the evaluation frameworks, and the way we think about this overall.

Rob Wiblin: Yeah. I guess OpenAI, in that case, R1 had come out — the DeepSeek model, the open source one — and people were very excited about that. And it felt like o1 was launched very quickly soon after that. My read was — and I think many people’s read was — OpenAI was trying to kind of claim back the narrative and reveal how strong their models were.

So maybe it’s understandable that that release in that case was maybe a little sooner than the preparedness team might have liked. We can speculate that perhaps the preparedness team did want to say more, but it wasn’t ready yet.

Beth Barnes: But I think this isn’t even the biggest problem with the current eval paradigm. I think the pre-deployment framing is also not really the right thing.

Rob Wiblin: Can you explain that?

Beth Barnes: I think in general the most emphasis has been on this idea of pre-deployment testing: before you start serving this model on your API to the general public, you do some testing in house, and maybe ideally with third-party oversight or whatever.

The problem with this is that that’s not necessarily the best time to check or intervene. Ideally, before you start training a model and before you commit a huge amount of expensive compute to producing this artefact, you look at the most capable models you have already, you look at how far they are away from hitting various thresholds where you would need more mitigations, you look at what mitigations are you on track to have by when — and based on that, you decide whether it’s a good idea to go ahead with this big new training run and what level of scaleup would be safe.

And you might want to do an intermediate evaluation to be like, OK, how much has the model improved so far? Are we on track with our predictions? Are we getting near one of these thresholds? Because if you don’t do that, and then you get to the end of the training run and you’re like, “We now have this model, and it turns out it’s very capable and it’s above all these thresholds and we don’t have any of the mitigations in place.”

Then one thing is, as soon as you create the model, you create the ability for it to be stolen or misused. And if you don’t have good internal controls set up, people internally could be doing all sorts of things with it.

And also, you now have this artefact that you’ve invested an enormous amount of money into, and it is hard to not use that — and especially to not use it internally. Everyone’s curious, like, “Can I try out this new model? Use it for my new research ideas,” whatever.

And the thing about deployment is, unless you’re open sourcing the model, it is actually quite reversible. So if you released a model, at least now external experts can interact with it and have some oversight and could sound the alarm if it turns out it has super dangerous bioweapon-enabling capabilities or something, and then the lab could undeploy it.

I think there’s something that you would like evals to rule out, which is that companies are building arbitrarily super intelligent and dangerous models internally and the public has absolutely no ability or oversight to know that that’s happening or decide whether it’s appropriate to then scale up and build an even better model.

So in some sense — this is a bit of a hot take — it’s like pre-deployment evals could actually be bad, because delaying deployment could be bad if that means that you miss the window to be able to sound the alarm about, “Wait, this model is super scary. You shouldn’t build a more capable model.” You don’t want the gap between what labs have internally and what anyone has had meaningful oversight of to be arbitrarily large.

Rob Wiblin: Just to repeat that back to make sure that I’ve understood: you’re saying the whole eval mentality at the moment, at least from the companies, is, “We need to check what our models are able to do and willing to do before the public has access to them, before we’re selling it as a product.” And that’s pre-deployment evaluation.

And you’re saying actually maybe what is more important — certainly on the margin at this point — is the evaluations that we do and the risk assessment that we do before we even train the model, or certainly before people inside the company have access to it and start using it.

And the reason is that if you only think about deployment to the public, then there’s no limit on the thing that you could train internally. You could make a model that’s incredibly dangerous. Maybe it’s incredibly good at recursively self-improving; it would be able to survive in the wild and avoid shutdown if it escaped.

Beth Barnes: And it’s just on your servers. Covertly, it’s corrupted a bunch of the lab compute and is making it look like something else, but it’s actually doing its own little intelligence explosion there.

Rob Wiblin: Or even if not something as strange as that, someone could easily steal it or exfiltrate it, or it could be misused inside the company, or someone could incompetently use it inside the company.

Beth Barnes: Right.

Rob Wiblin: OK. And also, even if you started doing that testing after you’ve trained the model, at this stage really cutting-edge models are costing something on the order of $100 million to train — maybe more, in terms of compute, electricity, the effort that goes into them.

This is something that I pushed Nick Joseph of Anthropic on when I spoke with him: at the point that you’ve trained the model, isn’t the commercial pressure, the internal pressure, the pressure from investors to use it for something, rather than to merely destroy it, going to be overwhelming?

Especially if you don’t have the mitigations required to make it safe, even by your own assessment, it’s unlikely that you’ll just delete it and then retrain it again in a year’s time when you feel like you’re there. You’re going to keep it around. And that creates a whole lot of latent risks that are not obvious to the public at all.

Beth Barnes: Right. And I think it’s a maybe more tractable ask to be like, instead of doing this giant training run, you could train an intermediate model first and then do some evaluations on that, figure out some mitigations — because there is value, even in terms of commercial or research progress, incentives to doing things in several more medium-sized runs than just one giant run, because you get more sense of various scaling laws and how to do this most efficiently. So it’s maybe an easier ask to do this more in stages than do the massive thing and then maybe delete it.

Rob Wiblin: I see. You’re saying if the current plan is to spend 10 times as much compute training the next model, you could say, why don’t you just do like three times as much? Go like half as far, I guess, in log terms. And then you could pause and do a bunch of evaluations and see whether it’s developing the capabilities that you thought it probably would at that level of scaleup.

And I guess the benefit from their point of view is that, I mean, they might be a little bit annoyed that that’s slowing them down, but it’s potentially saving them a whole tonne of money in terms of the compute that’s going into it.

Beth Barnes: Right, yeah. So Anthropic’s RSP does have this. The AI safety levels are inspired by biological safety levels, which are: what mitigations do you need to have inside your lab to handle this research and this research product.

So in the biosafety levels case, it’s not like you only need to do this once.

Rob Wiblin: “If you’re going to deploy a pathogen…”

Beth Barnes: [laughs] It’s like, if you’re handling this kind of virus, you need to have this level of airflow protection or whatever.

Rob Wiblin: So you’re saying in the deployment evaluations case, the companies usually say something about what they think the model is able and not able to do. I think we used this term “system card” earlier: that’s basically their description of the model that they publish along with providing access to it.

And at that point you can have external scrutiny. If they say it’s not able to do X and then people just start playing with it and they’re like, “But it can do X,” then that places some pressure on them to have actually done their homework and checked that — at least inasmuch as people are bothering to scrutinise it.

But the internal pre-training stuff is never shared, so you have no idea what level of thought has gone into this. It could be completely slapdash. They could be thinking about it not at all, and how would we even know?

We have to test models when unruly and unconstrained [00:25:57]

Beth Barnes: Yeah, and OpenAI’s Preparedness Framework also has some kind of similar stuff around how they are supposed to assess the capabilities of the pre-mitigation model.

So one distinction here is, if you’re just thinking about risks from external users after deployment, you are most interested in what can the model do given the specific way you are going to deploy it, and all the mitigations and safeguards, and know your customer policies that you’re going to slap on top of that.

But given that there’s still the risk of model theft or internal misuse, you want to know not how dangerous is the model given the exact setup in which you’re deploying it with all the mitigations, but how dangerous could the model be if it was stolen and someone fine-tuned it, or if someone internally in the company was trying to use it to cause harm, or if in future you’re going to decide to let external people fine-tune it or something (if you’re already in the position where you could have a commercial incentive that kind of pushes you to do that).

So when you’re doing these kinds of evals, you really want to be like, what are the maximum easily reachable capabilities of this model? This is the idea of “pre-mitigation” — as opposed to once you’ve trained it to refuse various things and you’ve put a safety classifier related to bioweapons or whatever on top.

So OpenAI does have this notion of assessing both the pre-mitigation and post-mitigation capabilities, but they haven’t really been making this distinction clear in their eval reports. They have some stuff about it, but it’s all a bit kind of mixed up. And again, not wanting to shit on people who have done more, have put out more stuff. Other companies are worse and have not done anything.

Rob Wiblin: So just to double check that I’ve understood that: initially you train the model, and you have a model that is just willing to help you with virtually anything — because you haven’t done any safety mitigations to get it to refuse queries that you wouldn’t want the public to be able to do. So if you’re like, “Help me figure out how to kill everyone,” the pre-mitigation model is willing to help you with whatever.

Now, by the time the public gets access to check whether the system card accurately reflects what the model is able/willing to do, then there’s all these mitigations that will try to get it to refuse to help with crimes. But we’re still worried if the model is able to do that for multiple reasons: it could be abused internally; it could be stolen, and then all of those mitigations could be removed by whoever stole it; also we have this issue with jailbreaks where people can add in a bunch of gibberish and then say, “Please help me commit a crime,” and it’s like, “Sure, now I’m willing to do it.”

So we really need external auditing of the model before any of these mitigations to try to get it to not help with crime are in place.

Beth Barnes: Yeah. And not just without the mitigations, but with any cheap interventions you could do to increase the capabilities in these directions. Like if it is stolen, if there’s some easily available data of dangerous bio stuff, someone could fine-tune it on that and maybe it’s going to be way better.

Or if you did the evaluation with not very good agent scaffolding or something, and then later on someone else publishes, “Here’s how to get your model to be way better at autonomous cyberattacks or something.” If the model was stolen or if someone’s misusing it internally, they can use that additional capabilities boost.

Rob Wiblin: So if the model is just one small step away from actually being able to do something dangerous, that’s also a real problem, because maybe someone will be able to add that one step without necessarily having to spend a huge amount of money.

Do we know how strong the internal constraints are on use? I know people have speculated and said presumably some people inside these companies are Chinese agents who probably would, if they could, steal the weights and pass it back, or possibly at some future time they’d be willing to do that.

I guess we don’t know, but it would be reasonable to want to have safeguards against that possibility, because it would have been very natural for the Chinese to try to place people inside these companies for many years. It’s totally foreseeable.

Do you know whether they have mitigations that would handle that, or a case where someone internally was hoping to misuse the model if they could get access to it?

Beth Barnes: I’m not aware of anything public that’s documenting that they’ve actually done this, so who knows. Presumably they’re doing some basic things, because you don’t in fact want your models being leaked to competitors or whatever.

It’s essential to thoroughly test relevant real-world tasks [00:30:40]

Rob Wiblin: What’s one of the most important or interesting research results that METR has had so far?

Beth Barnes: Something we’re working on at the moment that I’m excited about is trying to get a really good sense of how model capabilities are changing and increasing over time on a kind of Y-axis that is meaningful in the real world, and that people in general can understand.

So the current state of knowledge about AI capabilities is like, the models are qualitatively more impressive and more useful over time, and also all these benchmarks are going up, and there’s the graph that Epoch made with all the benchmarks saturating quickly or whatever.

But the problem with this is the qualitative impressions, you can’t sort of turn that into a trend easily and be like, this is where we’re going to be in one year, in two years. And the benchmarks, what does this actually mean in the real world? We’ve generally found that these are pretty decoupled. It’s like it gets this amazing score on coding competition problems, but it also can’t actually help me in my day job do this simple task.

So we’ve been building this suite of diverse autonomy-requiring tasks, where the model has basically access to a computer and it can do stuff on the computer and use tools and run code and do whatever to accomplish the task. The tasks are not all super realistic, but they’re more diverse than a lot of these benchmarks are, so it’s harder to overfit the model scaffolding to them.

So it’s a range, from things that are sort of tricky reasoning and inference sort of puzzles — like, “You’re getting this data and you have to figure out the pattern that’s generating the data,” like a symbolic regression thing. And cyber CTF [capture the flag] type tasks, or just “make this web app” or “do this ML research task.”

And the main thing we’ve done that is different is we’ve had humans attempt basically all of these tasks, and we record how long it takes the humans to do the task — humans who in fact have the expertise, so hopefully we’re not timing that they had to go away and learn about this thing. But for someone who has the requisite knowledge, how long is it? How many steps do they need? How long does it take them?

So then we can get this meaningful Y-axis on which to measure model capabilities, which is diverse autonomous tasks in terms of how long they take humans. So if you order all the tasks by how long they take humans, where is the point that the model gets 50% success?

We do see a relationship between length of task for humans and how successful the models are. It’s not perfect. It’s not like they can do all the tasks up to one hour and then none of the ones above; it’s like as you double the time, it gets increasingly unlikely that the model can actually do it.

So then we have this axis and we can plot models over time on it, and you can look at how long it takes: how many months for a doubling of the length of time for humans of the sort of representative tasks that the model can complete?

Rob Wiblin: OK, so you’ve got this general issue that people have lots of different tests of how impressive these AI models are. But there’s always an issue of how actually impressive is that outcome? We find out, wow, now it’s able to do this thing. But is that impressive? Does that really matter? Is that going to move the needle to applications in the real world? Is this going to allow it to act autonomously? It can be a little bit hard to understand intuitively whether any of it matters.

So you’ve come up with a modification of this approach. Firstly, you think about a very wide range of different tasks — so it’s harder for you to teach to the test, to just teach the model to do one very narrow thing, but then something else nearby or something else that’s important to matter in the real world that it can’t do at all. It just can only play chess super well or something like that. So a lot of diversity of the tasks.

And then in terms of measuring how impressive the performance is, you think about it in terms of how long would it have taken a human to do? Or how long did it, in your tests, take for a human to be able to accomplish this task? Are there any simple examples of the tasks? Ones that non-computer people could understand?

Beth Barnes: There’s the ML research ones, like, “Train this model to improve its performance on this dataset” or something. There are some very easy, very general research ones, like, “Go and look up this fact on Wikipedia” or something. There’s some that are like, “Build a piece of software that does these things.”

Rob Wiblin: Yeah, OK. So they’re substantial tasks, the kind of things that you might delegate to a staff member inside a company. And I guess you’ve gotten software engineers or people who generally would be able to do these kinds of things, and said, on average, this takes a human 15 minutes, it takes them half an hour, it takes them an hour, it takes them two hours.

And you find that there’s quite a strong relationship between whether the model can and can’t do it, and how long it would have taken a human being to do it — which I guess speaks to the fact that it’s hard for these models to stay on task for a long period of time. Or people talk about coherence: things that require them to think about, “How am I going to do this?” over a long period of time. It requires many different steps that have to be chained together. They currently kind of struggle with that. Is that right?

Beth Barnes: Yeah. I think there’s various different explanations for why longer tasks are harder. There’s different ways in which they can be harder.

One is just if there’s a longer sequence of things that you need to do — so you need to get them all right, and there’s a higher probability that you’ll get completely stumped by one of them. You could also imagine either the human just has to think really hard, or they try a bunch of times until they succeed. So it’s not like one totally straightforward measure.

But there’s also this argument that, as we do more RL training on tasks, it’s much more expensive to do RL training on longer tasks than on shorter tasks. And it’s a much harder learning problem if you just get a signal at the end of what in the middle you should have done differently, whereas if the task only has two steps, it’s much easier to learn what you should have done differently.

So in general, there are a bunch of reasons to expect that the shorter tasks will be easier for models, and the models will be better trained to be able to do short tasks than long tasks.

Rob Wiblin: And the upshot is you’ve been able to forecast… It’s a little bit hard to say this, but you could say that, a year ago, the models would have a 50% chance of succeeding at something that would take a typical human 15 minutes; six months later they had a 50% chance of succeeding at something that would take 30 minutes; six months later they have a 50% chance of succeeding at something that would take a human an hour. Seems like it’s roughly doubling about every six months. Could you say what the numbers there are?

Beth Barnes: Yeah, six months is about right. We’re looking all the way from GPT-2 in 2019, where it’s at a few seconds or something like that, to the latest models. And yeah, over that time there’s a pretty good fit to a doubling of something like every six months. And we also did a sort of ablation [sensitivity analysis] both over the actual noise in the data and over different methodological choices we could make, and the range of estimates you get is basically between three months and one year.

So I think there’s also some correction factor you should take in interpreting this, that maybe tasks which make good evals that you can run and score automatically are maybe systematically easier for models in various ways — both as the same argument that that’s the sort of thing that you can RL on, and just other things, where it’s just a somewhat different distribution of tasks, and it might be the case that models are worse at them.

So maybe 15 minutes as measured on our benchmark doesn’t really correspond to 15 minutes on more representative tasks. Maybe it will be more like five minutes or something. But I think that there is a trend, and that the trend is going strongly upwards at this exponential rate — so a fixed doubling time, and the doubling time between three months and a year is pretty solid.

Rob Wiblin: OK. So currently the models, at least within these sorts of tasks — which I guess are unusually measurable, legible, understandable tasks — have a 50% chance of succeeding at something that would take a human… What did you say, three to 12 hours?

Beth Barnes: So the current horizon length number is two hours maybe for the best models, but the doubling time is somewhere between three and 11 months.

Rob Wiblin: OK. So we can project that forward and say it’s about now two hours, and then somewhere between three and 12 months from now it’ll be four, and then somewhere between three and 12 months after that, it will be eight. And you can project up from there.

And I suppose it’s partly looking at that that has made people freak out a little bit, because it seems like within not that far into the future, we could imagine that they would be able to accomplish the full sort of work that a software engineer might do in a day or in a week. And then that would have more economic applications, that evidently they’re able to act with substantially more autonomy and maintain a coherent purpose over a longer period of time.

Beth Barnes: Yeah. And this is also not what an average human could do, but a human who has all the knowledge for that particular task. Because the models are very much not bottlenecked on generic subject knowledge, and they can do all of these things simultaneously.

Something that would be nice to measure is: what is the difference in performance if you get the best human to do all of your tasks, versus for each task you pick the best human. And the thing that we’re comparing the model to is much more like for each task, pick the best human. That’s a sort of correction upwards in impressiveness that you should do of like, this is just one model doing all of these different expertise areas. And humans are worse at doing tasks that require expertise from multiple areas because they’re not specialised.

Rob Wiblin: Forgive me the mundane operational question, but it seems like this project must have been an insane amount of work, because you’ve had to hire or contract dozens, hundreds of software engineers or people who are able to do this quite impressive work on a computer and then time them and then see how successful were they at doing the thing. Is it a lot of work?

Beth Barnes: Well, yeah. It was kind of frustrating because… This is slightly off topic to your question, but a while ago I was telling people it would be really important to get good quality human data provision for alignment and safety work. And it was like, “Actually there’s a bunch of people doing this already. Maybe it’s fine, like surge and scale.” And none of these providers have been willing to work with us — either because we’re too small fry, or the talent we’re asking for is too high, or there’s others we didn’t think would be high quality enough.

But yeah, luckily our internal team is amazing and has made this all happen. So basically a bunch of credit to them. And I think this is maybe one thing that METR does well relative to labs or academics or something, is having really highly skilled staff doing this kind of stuff that’s a mix of technical and operations and logistical work, and being willing to do that.

Whereas in companies and academic labs and things, I’ve seen more people like, “Ugh, human data is annoying. Let’s just avoid it.” Just like, even if something is clearly the most useful thing for pushing forward the research, it’s like, “I think I’d rather work on making the algorithms cleverer.”

Rob Wiblin: I see. Whereas you’re willing to do the schlep work that actually just seems the most important question to answer, even if it’s not as intellectually challenging, maybe.

Beth Barnes: Yeah, or high status in some way.

Rob Wiblin: I think something that shocked people about this research when you published it was that you were finding that the agents were capable of doing substantially longer tasks than what people had previously understood.

So you’re saying they had a 50% chance of succeeding at something that took two hours. But you gave the model access to a very powerful computer chip that it could use as part of its work, basically, and you gave it a lot of time to think. You gave it a lot of compute resources, more than people were typically giving when they were testing these capabilities.

Beth Barnes: Yeah. So this doubling time of time horizon thing that I’ve been talking about is not yet published. I think you were probably referring to the RE-Bench? That was like a subset of these tasks that are just focused on doing ML engineering research.

So things that we do that I think make our benchmark better quality or get better model performance is, firstly, if you actually compare to humans, often things just are harder or take longer than people think. Especially in the context of there’s this task and you’ve tried to make the instructions unambiguous, but actually it’s kind of confusing in some way. Or actually you didn’t have the package installed you needed for this. So having humans actually do them removes a bunch of these sort of spurious failures.

Whereas other benchmarks, even SWE-bench Verified, which was the subset OpenAI did, where they just had contractors look at the task and be like, How long do I think this would take? And do I think this is possible?

But I think then when Anthropic subsequently published a thing when they ran Claude on those and looked at the failures, some of the failures were still because the task was kind of impossible — because the test cases were testing for some more specific thing that it wasn’t clear that that was what you were supposed to do.

So one thing is just getting rid of all the spurious failures, so it’s a more high quality measure of the model’s ability.

Then the other thing is doing a fair amount of elicitation effort. I think often people have this sort of sense of how much is reasonable to spend on models, which is generally very cheap. But if you’re comparing to, could you automate eight hours of highly skilled researcher engineer labour, that’s like thousands of dollars.

It would be sad if we were like, “Models aren’t capable of this. Models aren’t capable of this. Models aren’t capable of this. Oh, it turns out they’re capable and they’re 100x cheaper than humans.” We would rather know when are they capable of doing it, but it’s more expensive, and you can see the cost going down over time. So we tried hard to give the model fair resources and benchmark it fairly against humans and have reasonable spend compared to what it would cost for a human.

Rob Wiblin: And the upshot was they seemed much more capable than what people had previously thought?

Beth Barnes: I don’t know if that’s exactly the case. I think it was surprising how well they did on these ML research tasks. I’m not sure that people had measured it that precisely before. It was just like people hadn’t checked this.

Another thing is the case that model performance still has this property that it’s very high variance in some sense: the best 1% of attempts are way, way better than the median attempt. So I think there can be a thing where you try and get a model to do something yourself. You’re like, I tried twice and it didn’t do a very good job either time.

But it might be the case that the model is cheap enough that you could have just got it to try 100 times. And if there’s some relatively cheap way you can select the most promising 10 and then look at a few of them, that then might be way better than the sort of representative impression you got of the model capabilities.

So we found that when you do this best of k, and you let the model do a bunch of attempts, and then you pick the one that scored highest, models do much better. And this has this general result across ML capability evals that you get kind of log-log scaling: it looks like a straight line if you double the amount of money you’re spending.

Rob Wiblin: OK, yeah. I guess a hot tip for listeners is: I think I’ve only recently started getting much more value out of ML models that I’m using just in my work.

And I think there’s two big breakthroughs. One is to stick a tonne of good examples into context. Don’t just give it one example of an article that you wrote that you liked and say, “Make something else like this.” Put in every single article that you’ve ever written that you think is good and say, “Copy this style.” The more data into context, the better.

The other thing is I used to say, “Can you please go away and write one possible example of this, and then check in with me to see whether you think it’s good.” But of course, it’s so cheap for them to output 10 or 100 possible articles that you could write or blog posts that you could include. So you just get it to output a tonne — way more than you would ever reasonably ask a human to do — and then just scan over them and be like, “Most of these are trash, but this one is really great.”

Beth Barnes: Right. And I think there’s a question of, on what range of tasks can you actually do this? Some tasks you can just automate this in code because there is some simple scoring function that you can compute, and other things the model would probably be able to recognise a good solution. So you need to have some scaffolding setup that will then look over all of the —

Rob Wiblin: Look over its own attempts and say this is the best one of them.

Beth Barnes: Right. Or you focus your scaffolding on having the model build the tests that will measure how well it did and then running attempts against that. And maybe you don’t always need to get it down to the one. If you can just prune out a bunch of the worst attempts and then have a human look over them. Maybe this is feasible in some situations.

METR’s research finds AIs are solid at AI research already [00:49:33]

Rob Wiblin: Let’s back up a bit. I guess there’s been two main streams of research. One is trying to figure out how autonomously can these models work and would they maybe be able to survive and spread in the wild.

And then there’s this other line of research, which is trying to figure out how much are these models able to help with research into machine learning. I guess to many people it will be obvious why it’s very important to know whether AI is good at improving AI, but do you want to just quickly explain why that’s a key thing to measure?

Beth Barnes: Yeah. It just seems like it’s both approximately sufficient and approximately necessary for really dangerous things to happen. If you have AI that can automate a bunch of research and speed up the progress a lot, then — especially under the current paradigm, where there’s no monitoring of what labs are doing internally — you could just very quickly get very high levels of capability that you are not equipped to deal with.

So that’s approximately the sufficiency argument: this could mean that stuff goes very fast in a way that humans aren’t able to keep up with. And there’s a very strong incentive to automate a bunch of your AI R&D, and then humans don’t have oversight of what data is being produced, or just generally what’s going on.

And the necessity case is something like, it seems like the first very dangerous models probably won’t have all of the capabilities that they need to take over at the start, but there’s some situation in which, if they can get some compute and get some resources and are good at AI R&D, then they can improve their capabilities on a bunch of the relevant axes.

Rob Wiblin: So just in broad strokes, this matters because it’s the thing that would set off a software-based intelligence explosion, because it’s a recursive self-improvement loop where it gets better and so it’s more able to improve itself.

That may or may not actually explode. It’s possible that the task of improving its own capabilities gets more difficult so quickly that the process kind of peters out, or is reasonably slow and requires a lot of compute. But maybe that won’t be the case, and maybe you will see significant improvement in capabilities very rapidly once AI is doing most of the work of improving it.

But you’re also saying that at the first point that an AI model might be somewhat dangerous, it might have some gaps in its capabilities, it might have eight of the 10 things that it would need in order to go and do stuff that humans would really dislike. But if one of the 10 things that it’s able to do is improve itself enormously, then it will be able to patch the gaps in its abilities and then go from there.

Beth Barnes: Right. And then your dangerous capability evals are no longer really applicable if it can also —

Rob Wiblin: Just fix any of those weaknesses.

Beth Barnes: Yeah.

Rob Wiblin: OK, so that’s why the ability of these models to do ML research is a pretty big deal. How good are they in general at doing that work now, and what’s the trend going forward?

Beth Barnes: Models can definitely do meaningful work here in problems that have this character that you can isolate this particular task and then automatically check it. I think there is some question of how well do the models generalise to the aspects of research and things where you can’t do that, where it’s more entangled and you need higher reliability and that kind of thing.

Some evidence that maybe it’s not that hard to improve a model’s ability to do these very long and complex tasks is that they’re able to do these extremely long chains of reasoning for hard math tasks or something. And maybe math is very easy to train on, but maybe it doesn’t take that much training to get a lot better at this.

And then also maybe you can just accelerate AI R&D a lot just on these kinds of tasks that are quite easily checkable in various ways — because a bunch of software engineering is like this, where you can at least get a pretty strong filter on the solutions.

One thing we put out recently is some results on an extended version of KernelBench. So this is writing basically better code to make ML engineering tasks run faster. Again, this is one of the “elicitation is important” lessons.

So the results on the leaderboard were like, it’s a speedup of 1.01 or 1.05 or something — which is like models basically can’t do anything useful here. And then Tao at METR worked on this for four weeks, and now it’s like, models can double the speedup. So that is useful.

Rob Wiblin: So I don’t understand what programming kernels is, but this is something that would enable machine learning research to go faster. Previously people thought that the models were basically rubbish at this, that they could only speed up from a baseline 1% or 5%. But now I guess you gave them access to proper compute, you gave them time, you gave them the chance to do many attempts and then pick the best one — and now you find that actually it can double it. And this is approximating human performance?

Beth Barnes: I think on these tasks that probably means it’s quite a lot cheaper than what you would need to pay humans. Maybe like 10x or something. I think it would cost hundreds of dollars to pay someone to do this work.

Rob Wiblin: But the model can do it for like tens of dollars.

Beth Barnes: Like $30, or something like that. And we could probably get that down with more optimisation. So definitely getting to the point where models can meaningfully accelerate research. And already with things like Copilot and coding assistants, a large fraction of the code is written by the models — and maybe they’re not the most crucial hard parts or something, but it definitely is a meaningful acceleration.

And I think just in general there’s a lot of evidence that this will just keep going up. It’s hard to say exactly where we are now or how far we could get with elicitation on current models or with one more scaleup or something. The thing that’s much more confident is just it’s not going to be that long if you’ve got this six months doubling time or something. Really the scope of what the models can do is going up pretty fast.

AI may turn out to be strong at novel and creative research [00:55:53]

Rob Wiblin: I think you said to me yesterday that you thought it’s possible that the models would actually turn out to best in the research part of this, or the ideas-focused part of ML research — which to some people would seem counterintuitive that you might expect them to be best at the stuff that is not like one-off insights into things, but rather just like writing code in a way that they’ve written code many times before. Do you want to explain why you think that might be the case?

Beth Barnes: This is a sort of hot take, not a super strongly endorsed view, but I do think people have this reaction that the models can do this engineering stuff that’s more straightforward, but they can’t do this scientific insight or methodology or coming up with new inventions.

But I think a lot of what humans are doing when they do that is having read a bunch of research papers and seeing that some method is analogous to some other thing, sort of like “nothing new under the sun” — a lot of things are just applications of things that are in fact known somewhere else. And the thing the models are really good at is really comprehensively having read every single paper and knowing the literature and having seen a huge number of results of experiments.

One of the things that we saw in one of our tasks… So the task is basically to figure out what the scaling law is under certain restrictions. You have models of a certain size and you can train them with this much compute, and you need to figure out what the compute-optimal hyperparameters are to train at a level above the amount of compute that you’re actually given. And the models didn’t always do that well at this, but they’re really good at guessing. They’re much better than humans at guessing what roughly the right number is.

So there’s going to be a bunch of things where models have just seen so many more experiments and so much more data that for things like, “Predict whether this research direction is promising,” or, “Figure out how to solve this sort of thorny theoretical problem,” it’ll be like, “Well, in this obscure field of math and this random paper that no one’s read, it turns out you can solve this analogous problem in this way.”

That seems very plausible to me that that is something where models might be comparatively advantaged to humans rather than disadvantaged.

Rob Wiblin: Right. It’s something that humans struggle to do, and it’s one reason why we think this is kind of magical, so how would the AI be able to do it? But actually it’s an area where it might be their home turf because they have such a broad range of knowledge.

Beth Barnes: Right, yeah.

Rob Wiblin: The way that people are sometimes sceptical that it’s going to be able to have the brilliant scientific flash of insight, it feels like the equivalent of how five years ago, you’d hear, “But they’ll never be able to make great art. They’ll be able to make great poetry.”

Beth Barnes: Or play go, or…

Rob Wiblin: Yeah, it turns out that stuff is actually trivial for them, and that they write poetry much better than almost all humans. I guess there have been experiments finding that they write much better poetry even than expert poetry writers, as assessed by other human beings. So yeah, I guess it’s a technical person’s cope to make themselves feel like they’re not so replaceable.

Beth Barnes: I mean, the counterargument is maybe just that this stuff has less of this easily checkable, more structured for engineering, for writing code… You can write tests and you can select for only the programs that actually pass the test and this kind of thing, where it’s harder to do that for research ideas. But I do expect the underlying ability in that to be maybe relatively good.

When can we expect an algorithmic ‘intelligence explosion’? [00:59:11]

Rob Wiblin: So I’ve been assuming that there would be a software intelligence explosion triggered. I guess recently, I’ve been assuming it’s going to happen within the next seven years, and two years wouldn’t at all be surprising. Does your research cast any light on what numbers seem plausible here?

Beth Barnes: Yeah, basically those numbers seem very reasonable. I think one number we talk about is, at what point is 90% of the research oomph going to be coming from models as opposed to humans? I guess that’s one way to be like, when is this starting? Yeah, it seems like in seven years we’ll be at models that can do the equivalent of a year of labour from someone who is expert at all of the things. And also much faster in serial time.

Rob Wiblin: Yeah. And I guess at that point, I suppose if that trend holds and that is the case, it’s really hard to see how you wouldn’t have triggered a software intelligence explosion, inasmuch as one as possible, depending on how strong the feedback loop is.

Beth Barnes: Right. It has multiplied the number of people working on this by a very large amount. And you’ve introduced a bunch of workers that have capabilities much beyond humans, at least in terms of cross-domain knowledge and serial speed and things.

Rob Wiblin: Yeah. When I said on Twitter that I was going to be interviewing you, an audience member wrote in with a question on this topic that I thought was pretty insightful: “How much of a bottleneck is labor on algorithmic progress? How likely is it that we’ll automate AI R&D labor with underwhelming results, and realize the main bottleneck was compute?”

Beth Barnes: I think definitely compute is a key bottleneck, and as you automate your researchers, compute is more of a bottleneck — because it’s also governing how many of the automated researchers you can run.

But I think there are just obviously lots of areas of low-hanging fruit that it feels relatively clear to me how you would put in more labour and get more performance from the same compute. I do think there is a question of why isn’t this happening already? And maybe it’s more that it’s kind of hard to coordinate, as opposed to there are no humans who are cheap enough for it to make sense.

Rob Wiblin: Coordinate what?

Beth Barnes: The types of things I’m imagining you can spend labour on: there’s definitely the writing faster kernels or otherwise speeding up experiments. This is basically equivalent to increasing the amount of compute — you’re just making your code run in half the time, making your code run much more efficiently. I think there is a bunch more optimisation that could be done there, and very good kernel engineers are rare and expensive, so this is something you could pour a bunch of AI labour into.

Other things are building more tasks and environments, and setting up and obtaining different kinds of training data and things, if you having much higher quality data does mean your compute goes a lot further, and this is maybe what things are bottlenecked on.

Environments that are sort of rich and interesting but also have rapid feedback loops, it does seem just like there’s just a bunch of labour you could do to build more of these and build more interesting things here. And yeah, there’s maybe some continuum with something that looks more like going out and constructing a task, versus models sort of challenging themselves to do a thing and then getting feedback from the environment on it. But it does seem like there’s a bunch of ways you could spend compute there.

Other things that are kind of equivalent to increasing your amount of compute could be more closely babysitting experiments and runs, being like, “This one I can already tell that it’s not going to be that useful, so I’m going to put something else in the queue, and let’s cull these ones that are doing less well.”

I think if you just directly ask the question, “Is there somewhere you could put labour in that would speed things up?,” it’s like, yes, there are a lot of places if it was very cheap. But the bottleneck would probably just be setting up all the infrastructure to do that and vetting the quality of all of the data and environments you’re producing or something.

Rob Wiblin: So the bottom line is that compute might end up being the key limiting factor, in which case we will use our AI models to figure out how to get as much juice out of the compute available as we possibly can. And probably there are a whole lot of margins on which, if compute were the only factor now — I guess currently it seems like labour is probably, or I guess both are really important — but if compute was the only thing, then we would just use the enormous amount of labour that we now had to figure out how to use the compute more efficiently.

I guess even if the intelligence explosion was underwhelming because this turned out to be a substantial issue, the number of AI researchers is growing, but it’s growing very slowly compared to the amount of compute that is being manufactured. I think the amount of compute available to these companies is growing by many multiples every year.

So if compute ended up being the limiting factor — if we went from labour mattering to basically only compute mattering — then that would still suggest a substantial speedup in research progress.

Beth Barnes: Right. And the marginal investment in researchers now relative to compute is related to the price and the speed of the researcher labour. If they were a bunch cheaper and a bunch faster, it would make sense to buy a lot more. There might be a lot of things that’s just not worth having humans do at the current prices and serial times that would be worth doing with faster and cheaper labour.

Recursively self-improving AI might even be here in two years — which is alarming [01:05:02]

Rob Wiblin: OK, just to make sure that we underline the most important messages here: AIs currently have a 50% chance of doing something on a computer that an expert might do in two hours. This is doubling every three to 12 months. So they’re becoming substantially more useful agents: they’re able to do things over quite a longer period of time and it’s improving rapidly.

They’re able to make a substantial difference to machine learning research now. They’re already used a great deal within the companies. They are probably capable of doing a whole lot more stuff than is even appreciated at this point — because if you really try hard, then you can get much more out of them than if you just make a cursory effort.

All of this leads to the conclusion that probably we should expect recursively self-improving AI pretty shockingly soon, relative to expectations that we might have had five or 10 years ago — that two years wouldn’t be at all surprising, which is alarming to me.

Beth Barnes: Yes, I think this is very alarming. I think people should be very alarmed. The scientist in me wants to add a bunch of caveats to the two-hour number and the doubling time, whatever. But I think yes, it really doesn’t seem like two years would be surprising. It seems hard to rule out even shorter [timelines]. Is there 1% chance of this happening in six, nine months? Yeah, that seems pretty plausible.

With the exact shape of the recursive self-improvement, it still seems like we can kind of see a path to significantly superhuman intelligence that doesn’t require a mysterious intelligence explosion. It’s relatively clear how you would apply the intermediate capabilities that we expect to get to get something that’s just obviously very pretty scary.

Maybe models will continue to be really non-robust in a way that’s hard to fix, but I don’t think we care about whether we’re very confident this is going to happen — we care about whether we’re very confident it’s not going to happen.

It just feels like we’re at the point where even if you only care about current humans, the stakes are very high. Like a 1% chance of takeover risk, which itself has a 10% chance of human extinction or something, that’s a lot of deaths.

So I think we should really care about numbers that are a whole number of percent. I mean, I think we should care about numbers that are lower than that, but basically I would expect the world in general to agree that it’s not worth imposing a whole-number-percent risk of completely destroying human civilisation for getting the benefits of AI sooner.

Rob Wiblin: A little bit sooner, yeah. I think people in Silicon Valley, or certainly people in the AI industry, these sorts of companies, they’ve seen these results and they’re flipping out basically. Or their expectations are what we’re seeing here: they just see a very clear path to superhuman reasoning, to superhuman ability to improve AI, to therefore a software-based intelligence explosion. So this is kind of the main thing that people talk about in the city that we’re in right now.

It feels like policymakers are a little bit aware of this — certainly AI is in the policy conversation a bit — but by comparison, I think they haven’t yet flipped out to the extent that they probably would if they fully appreciated what was coming down the pipeline.

Beth Barnes: Right.

Rob Wiblin: One of the hopes of this show is to disseminate this information so that people can flip out suitably. Is there anything more you want to say to someone who’s in DC, who’s like, “I don’t know that I really buy any of this. AI’s not my area, or I feel like these tech people are kind of losing their heads a little bit”?

Beth Barnes: It’s kind of hard for me to have that perspective because I’ve just been focused on AI since 2017 or something, and I was previously thinking that it was further off, or that I might be totally wrong about this whole thing. But it feels like we’ve just increasingly gotten evidence that these concerns are sensible. We’re starting to see models exhibit alignment faking, all these things we’re concerned about, and it seems like the capabilities are coming very fast.

So this seems like pretty obviously the most important thing, even from a very personal, selfish perspective: for someone who’s young and healthy, this just might be your highest mortality risk in the next few years is AI catastrophe.

I think the other sense that people have that I really want to dispel is, “But the experts must be on top of this. The experts would be telling us if it really was time to freak out.” I’m like, the experts are not on top of this. There are not nearly enough people. METR is super starved for talent. We just can’t do a bunch of basic things that are obviously good to do.

And inasmuch as there are experts, they are saying that this is a concerning risk. There was the report that Bengio and others wrote, that was sort of overall, what is the overall scientific consensus on this risk? And it was like, “Yes, this seems very plausible.”

Rob Wiblin: “We’re not sure whether we will or won’t all die, but it’s totally plausible.”

Beth Barnes: Yeah. And to the extent that I am an expert, I am an expert telling you you should freak out. And there’s not especially anyone else who isn’t saying this.

Rob Wiblin: Yeah, the experts would tell us to freak out. I’m like, well, the experts in this really are just the people at the companies, and I guess external observers like you who are very close to the action. There isn’t really a group of independent people who fully understand all of this and are paying a lot of attention to it in DC, in some agency that isn’t heavily entwined here, who could evaluate this in some more objective measure.

So basically, all of the experts actually are saying to freak out; it’s just that people are maybe sceptical because they’re in companies that are heavily involved and they think maybe they don’t have sufficient distance or perspective.

Beth Barnes: Right. I think that’s one of the things where hopefully METR can provide a bunch of value: actually having the deep technical expertise, but also being independent and therefore more credible — that we don’t have commercial incentives or whatever to exaggerate this. And I do think the sort of dearth of experts outside the AI labs is alarming.

Could evaluations backfire by increasing AI hype and racing? [01:11:36]

Rob Wiblin: Let’s maybe push on a bit and tackle a different issue. When I said on Twitter I was going to be interviewing you, I was actually a bit surprised at the main question that came up. It felt very 2022 to me, but I guess maybe it makes more sense given what we were just saying.

People were worried that having better evaluations of what the models are fully capable of doing — things like realising that if you put in a bunch of effort and you put in place the right structure for them to act as agents, then they’re able to do a whole lot of AI research autonomously with very little human intervention — they were worried that basically pointing that out was dangerous; it was an information hazard for people to even be aware of this, and maybe you shouldn’t be finding that out, or shouldn’t be publishing it very widely.

What do you make of that concern, that possibly your work could backfire?

Beth Barnes: This is something we’ve discussed a lot internally at METR, and done reasonable amounts of hand wringing over and talked to different people about.

I think my views on this have changed a fair amount over time. The key thing is like, do you think of safety progress as static or as a function of the current awareness of AI? And I think implicitly, before, I was thinking we’re making safety progress at some rate, and then independently there is capabilities progress, and it’s like a race between these two things.

I think there’s a different framing, which is just as reasonable, which is that there is some hype and awareness in general of the current level of capabilities, and both the rate of investment in capabilities progress and the amount of investment in safety progress are a function of that. So some speeding up underlying capabilities is not a problem if you speed up the safety more as a result of that — the relative progress that safety will have made by the time you get to a level of capabilities is now higher, even if that time is nearer.

There’s a bunch of different ways this plays out. One is talent. There’s a bunch of people who only started working on safety after seeing ChatGPT or GPT-4, some level of capabilities, like, “It’s actually time. This stuff is real, or actually I’m scared of this,” or whatever.

There’s policy actions, which are again only going to happen once it’s sufficiently obvious that this thing is happening and these capabilities exist.

There’s the quality of safety research. It seems like you can make much faster progress if you have better models to work with. Not that we’ve exhausted all of the possibilities for making progress on previous models, but there is a strong effect of your work is more useful when you have better model organisms of the things that you want to study and you’re sort of closer to crunch time.

And also you can use models to automate your work more. We might imagine that almost all of the technical safety work happens in the period right before an AI disaster would happen, because that’s when you’ve got the models automating a bunch of the alignment work and things.

So the thing that we should be prioritising much more is the fraction of investment at that point into safety, as opposed to when that point comes because the background rate of progress and safety is small now compared to what it will be.

There’s another factor here that I think is important, which is compute progress — independent of hyper investment in AI, there is just Moore’s law and economic growth and things happening in the background, which are making larger amounts of compute cheaper and more available.

And in the extreme case of this, we can imagine if we know that aliens are going to dump this incredible infinite supercomputer thing on us at some point: should we do some intermediate AI research in the meantime that will allow us to learn some stuff about safety and understand a bit what is going to be possible? Yes, that seems obviously good — as opposed to that only happening once you already have this incredibly large amount of compute where you can scale everything very quickly.

In fact, there is a tradeoff where you maybe are bringing the point at which stuff happens forward, because there’s more investment in hardware and things, but it’s still the case that in some sense you’re racing Moore’s law. You’d rather have all of the things happen when the chips are bad and when compute is this limiting factor that’s slowly going up, as opposed to like, it turns out anyone in their basement could do this or something like that. And again, that’s an extreme example, but I do think in the intermediate points this is still true.

So that’s the case for why I think it’s not necessarily that bad to speed things up, if there’s some argument for why this differentially advantages safety.

And I think there’s just a pretty clear argument here for why a better understanding of what models are capable of advantages safety: by default, this knowledge resides only in the labs — which are specialised in capabilities progress differentially, relative to regulation and broad understanding in policy and safety and things.

So it seems like moving that understanding from being just in the labs — where they will happily show it to their investors and people they’re trying to hire — moving that understanding from them to the broader public who don’t love all this racing ahead, or policymakers…

You could think about the timelines for this, like you imagine you need to get a certain level of awareness in time to get a policy passed before you miss a window that would have allowed you to mitigate this thing in the future. So even if you bring it forward a bit, if you bring the policy action forward by more…

Rob Wiblin: Things have gotten safer overall. So the bottom line is, I guess you can have different models of what is racing against what. And the simplest model would be that the safety stuff is improving on calendar time. Every year it gets better, so every year that you bring forward some advancement in capabilities, that’s just making things worse, because it’s those two things that are racing against one another.

But I guess you’re saying you think about it in terms of the ratio of progress inside the companies to the amount of freaking out and action by other people outside of the companies which is able to make things safer, or at least get some sort of governance and policy response. And it’s only by publishing these results, so that everyone in the broader world understands what’s going on, that you actually get enough attention and enough concern that the problems can be solved ever at all.

So the worst-case scenario is that the companies are fully aware that they can automate most of their work. Presumably, this stuff is much less of a shock to them than it is to the broader world, because they have every incentive to try to figure out how to get the absolute most out of their models to save their own money and their own labour costs.

Beth Barnes: Yeah. I think you should think either we’re not making that much difference relative to progress inside the companies, or METR is just incredibly awesome and overpowered and you should join — because clearly with our 10 researchers or whatever we’re beating these multibillion-dollar companies or whatever.

Rob Wiblin: So should we conclude that the companies probably were already aware of the results in your research into how good the AI is at recursively self-improving?

Beth Barnes: Yeah, basically that’s generally the impression we’ve got. I think companies like to sometimes give the impression that everyone outside is totally clueless; our stuff is far superior. And we definitely have found that often that’s not quite the case — our elicitation is as good or a bit better, but basically it’s not dramatic in either direction.

And especially for stuff like kernels, where it makes much more sense for them to invest in that specific thing, maybe we have similar understanding — where rather than just doing a four-week project on it and then being like, “OK cool, we’ll just let people know about this,” they would actually double down on that and get a bunch better, and get profitable stuff out of that.

Rob Wiblin: There presumably is some downside to this, where maybe the big three are aware of how much they can automate things, or they’re roughly aware of your results already — but there might be other actors, other people who are working on AI or nearby where making this more salient is more likely to have them make an effort to automate their own research.

Beth Barnes: Yeah, I think that’s true. I think also another concern people have had is basically related to what we were saying earlier about people irrationally underinvesting in things that are slightly annoying or don’t look like the most glamorous pure ML work of gathering data and making good evals: that us publishing evals is substantially accelerating companies because they can hill climb on those.

So none of the things that we’re releasing have enough tasks that it would make sense to directly train on them. It would just be a sort of hill climbing to see how well your fine-tuning was working. And again, it feels like if this was the case, it would mean that we should be able to sell METR’s eval producing services for very large amounts of money if we are in fact substantially accelerating these multibillion-dollar projects. But I guess the claim here is that people are irrationally not spending enough on evals.

And in general I want us to try and be sympathetic to these concerns and do a kind of trade: even if we don’t buy particular concerns, to try and make concessions to them when it’s cheap. But also I think there is a trap of getting bogged down in worrying about how maybe this will do harm — where all the safety people kind of tie themselves in knots worrying about that, and then nothing actually happens.

Rob Wiblin: They can’t even communicate between themselves, or they’re not explaining their views to the public properly, or the evidence for their concerns.

Beth Barnes: And there’s definitely something that just feels a bit overly prideful or something about assuming that your tiny team is contributing a large fraction… You’re such a small fraction of the total labour and investment going on in this. It does seem like you have to think that you’re really exceptionally talented to be thinking that you would be pushing things ahead substantially.

Rob Wiblin: Does thinking about things this way suggest that you should be trying to communicate all of your work as strongly as possible to government people — or to people who are outside the tech industry — to really explain to them that they haven’t appreciated just how dangerous things are or just how quickly things are moving?

And maybe the less you can publicise your results… I guess it’s not really imaginable that these findings wouldn’t spread throughout San Francisco. But basically you should be trying to do as much to disproportionately reach people who currently are not being alerted to what’s going on.

Beth Barnes: I guess I don’t expect it to be that efficient use of labour to try and communicate to one and not the other. Especially as I would have thought that even the people who are not directly involved in the technical world will use what was the general technical reaction to this as a barometer of whether they should trust it.

And we just want evals research, and understanding of model capabilities, and understanding of whether mitigations are good to grow and proceed as a field. So we’re trying to open source things and be transparent about our learnings and facilitate other people getting funding and support. We generally want to grow this field, and I don’t think the payoff of trying to be secretive about it and then just talk to these particular people is worth both the effort of doing that and the cost of not accelerating other people who could be working on this.

Maybe one more point here is just that basically the types of capabilities improvements METR has been doing are very much around scaffolding and language model agents. As we were talking about before, it seems much better to have a dumber model that can do a lot of stuff because you’ve elicited it right and given it all the right tools — but then you have this understandable trace of what it was doing — than to just build bigger models which mysteriously start being able to do things, but you don’t really know why.

So I think differentially pushing forward the capabilities of LLM agents seems actually maybe good.

Rob Wiblin: My intuitive reaction to this concern was just that if an effort that you’re trying to make to keep us safe is just to not elicit what the models are already clearly capable of doing, that is an incredibly thin wall of safety — because at any point, if someone put in the effort that you guys have put in to figure out how you can get more out of the models, then they could immediately speed things up very rapidly.

It’s an incredibly perilous situation to be in where the models are capable of doing far more dangerous things than what people appreciate, or what even the operators appreciate. And then as soon as someone actually spends a week working on it, then they can crack it and do it. Really we need to be alert to what is possible as soon as it is possible.

Beth Barnes: Yeah. I think there’s some general thing in how people have different feelings about how competent the world is, which both affects what fraction of labour does METR represent — or would someone else do this, or maybe we’re the only ones who’d have an insight or something — and how useful is it to communicate to the world? Will policymakers react in any way that’s helpful?

I can definitely see where people are coming from, and I do think a bunch of these concerns are quite reasonable. I just overall think it nets out to: security by obscurity about what models can do is not a great safety strategy.

Governments first ignore new risks, but can overreact once they arrive [01:26:38]

Rob Wiblin: There’s been this interesting discussion back and forth about how pessimistic we should be about societal/government responses to crises and problems of all different kinds.

I think the first step in the conversation was more technical people saying, “Look at how useless the government is at responding to crises. They aren’t understanding these issues whatsoever. No one’s paying attention to it.” And then other people are pointing out maybe that’s a little bit pessimistic, that the reason that people weren’t reacting is that they felt that this was very far off; it wasn’t really concrete enough to do anything yet.

In fact, if you look at COVID, perhaps we over-responded. The government actually was extremely active once it was apparent. Stefan Schubert uses the term “sleepwalk bias,” where the question is, do we sleepwalk into or through crises? He was like, no, generally, actually once it’s evident that there’s a problem, then you can get perhaps even an overreaction on average among society and governments.

I’m like, if you look at the situation here with AI, it feels like we’re really underreacting. I guess more attention has gone into it. Some stuff has been done. I would say it’s barely like 10% of what needs to be done, at least from my point of view. And I think the synthesis of these views is that society doesn’t sleepwalk through a crisis, but it does sleepwalk into crises.

Beth Barnes: Right.

Rob Wiblin: Because basically there’s only so much focus that people have. People are very busy dealing with crises that are already unfolding right now. They don’t really have much time to plan ahead. And also they don’t have the technical chops to understand what things are likely to happen in future, so they tend to just dismiss forecasts that suggest anything interesting is going to change.

So my expectation is basically we will sleepwalk into a disaster with AI — or indeed almost all kinds of disasters — with massive underpreparation, underinvestment. And then at the point that it’s undeniable that things are going extremely wrong, then you can get a very intense response and a lot of focus on it. And there’s a question of, is that enough to actually meaningfully change anything whatsoever? I guess it remains to be seen.

Beth Barnes: Yeah. There’s a bit in Don’t Look Up where they’re trying to convince the president to be worried about this incoming planet-killer asteroid. And she’s like, “Every day I have people telling me the world’s going to end for this reason or that reason. On Monday it was terrorists. On Tuesday it was floods. On Wednesday it was the bees are dying.”

I think people can be kind of unsympathetic to that. That even if you are trying to pay attention to the right things as a policymaker, everyone has incentive to exaggerate the seriousness of their particular cause area. And if you don’t have the technical knowledge and there’s disagreement in the technical community, it’s actually just quite hard to figure out how to respond.

Rob Wiblin: Yeah, it’s understandable structurally why things play out this way, but it is unsatisfactory. It would be really great if we could come up with better systems for anticipating future problems.

Beth Barnes: Yeah. And I do think also people maybe have more of a picture of policy stuff being very bottlenecked on willingness — whereas I think it’s quite amenable to more concrete and sensible suggestions, and that sort of decreases the amount of willingness that you need by a lot.

Like SB 1047: there was this bill in California that would have required some evaluations and transparency about them and companies to make safety cases. A significant reason for why people didn’t like that or why it didn’t pass — at least that was quoted — was some combination of worrying that yes, this particular proposal is reasonable, but in general it’s going to spiral into overregulation and things will get added on and people will interpret it creatively.

And these evals don’t really exist, or people haven’t done this before. I think if you’re in a world where there was just very clear precedent of, “This is what you need to do, this is how you make your safety case, this is how you decide whether it is or isn’t good enough,” that would have made it much easier to get something like this passed, because there’s less of the slippery slope concern the more well defined the thing that you’re asking for is. And it’s just harder to object with, “This might be extremely costly” if you’re like, “No, look: someone has done this already, and this is how much it costs.”

So I do think people have this picture of things being very much bottlenecked on lab willingness or policy willingness to do things, and I think there’s just a bunch of stuff that third parties can do that makes this much more likely to happen. Even things that are just reminding people and nudging people and sort of project managing things from the outside — of being like, “You need to get this stakeholder. We’ll help you put together what you need to get this through this stakeholder.” There’s just a lot of value being left on the table in terms of making it easier for policymakers and for labs to adopt policies that we think are good.

Rob Wiblin: I think that feels a little bit generous on the SB 1047 case, but maybe I can put to you my impression on it, and you can tell me how wrong I am.

I think what you’re saying is that there’s concerns about slippery slopes, concerns about costs and compliance. That would have made sense maybe with the original draft of it, which perhaps was a little bit sloppy, and could be interpreted somewhat expansively; perhaps it would have been more difficult to comply with.

By the time they had narrowed the scope and drafted it more carefully in response to criticism, by the end it felt like it really was a bare minimum ask. It didn’t really feel at all over the top to me, but it still wasn’t able to get over the line despite, a lot of support among the general public.

I guess ultimately it was the decision of one person, the governor of California, to veto it. But the reason I would expect that it didn’t get over the line is just the same reason that it’s generally very hard to regulate very powerful, very rich, very profitable, very influential, very well connected, very well prepared companies: the companies don’t want to have their hands bound. They don’t want to have all of this scrutiny. They campaigned very strongly to try to stop it.

And it’s unsurprising — given that they had put so much effort into cultivating the various different political connections towards funding politicians — that this kind of thing is very hard to get over the line until it’s unambiguous that there is a problem that has to be addressed. Am I too cynical?

Beth Barnes: I don’t know. Other people maybe would know more what the real reasons were. My impression is that at least something that was still being quoted even towards the end — so still a useful argument against it — was that this compute number of FLOPS was not a good measure of is this thing actually dangerous, when this should kick in, and various things. So I don’t know, maybe this is a smaller on-the-margin claim that it makes it harder to argue against if you have good precedent and clear requirements for your thing.

I do think there can be other things as well, where just additional labour can make things go a lot better — where the lobbyists or government relations people for companies can be out of touch with their company’s actual desired policies or whatever. Or at least there’s strong evidence that you could take to them, like, “No, look: the company has been asking for this type of regulation. It’s proposed its own version of this. You should be supporting this.” And the people who are the lobbyists and stuff really totally are not tracking this. People are just doing their normal thing.

Rob Wiblin: I guess the thing that they’re always trying to do is avoid more scrutiny and more compliance cost and more regulation and so on. So that’s their default mode. And unless there’s a very strong push inside the company to change that in this specific case, then that is what they will just naturally want to do.

Beth Barnes: Or outside. I don’t think it necessarily has to be inside. I think you can also just communicate to people or make things more salient to people.

Rob Wiblin: Anthropic, with the final draft on SB 1047, was neutral to positive about it. I think Dario Amodei, the CEO, explicitly said that this would not be that difficult to comply with, and that the companies that are saying they could never comply with this are basically talking nonsense; it actually is quite a narrow ask. So credit to them for actually telling the truth.

Beth Barnes: And I think xAI supported it as well.

Rob Wiblin: Yeah, of course. Because Elon’s very concerned.

Do we need external auditors doing AI safety tests, not just the companies themselves? [01:35:10]

Rob Wiblin: I’ve interviewed Nick Joseph from Anthropic about evals. I’ve also spoken with Allan Dafoe from Google DeepMind about evals.

Both of them — perhaps unsurprisingly, given that they work there — said that they thought their evals were pretty good. That the Responsible Scaling Policy in Anthropic’s case or the Frontier Safety Framework in Google DeepMind’s case, that those are actually moving the needle on safety. That these are perhaps not the final form of these things, but they’re a very good step in the right direction.

To what extent do you agree, with a more independent, outside-of-the-companies point of view?

Beth Barnes: I agree that it’s a step in the right direction, moving the needle. Some things I tend to have disagreements with. I think often sort of individual safety-concerned people inside companies have a rosier picture of how the company interacts with third parties, like, “We love safety, so of course we would be supporting all of these things. And if it’s not happening, the bottleneck must be something that’s not us, because we really love safety.”

I think Nick Joseph said something like this, of the bottleneck on getting external oversight of the RSP is people who are technically competent to do the evaluations externally — which I very much disagree with, because I think METR has the technical competence. I think our evals and elicitation and things are better than stuff that the lab has published, than any lab has published, basically. At least in terms of quality; I guess our datasets tend to be smaller but higher quality. And we have been eager to work with labs.

Also, it doesn’t necessarily need to be bottlenecked on the technical competence of the external evaluator to literally run all the experiments and set up the compute and things.

There’s another paradigm, where the companies run the evaluations internally and they write up what they did, and the external evaluator goes through it with them and discusses it, like, “Did you provide good evidence for this thing? Maybe you need to do this other experiment here. This thing was insufficient for this reason.”

Or an embedded evaluator paradigm, where someone from an external third-party org is embedded on the team in the lab that’s running the evaluations, and is keeping an eye on everything, and making sure that they’re not sandbagging or fudging things.

There’s just lots of things here which we’ve proposed and been open to, and it’s not like the companies are all banging down our door to do it.

I mean, there have also been cases where we’ve turned companies down because we don’t have the capacity. But that is because we were like, this arrangement will be good and would be worth our time to do and it would take less of our time and capacity — but they weren’t offering that; they were offering something else that we didn’t think…

Rob Wiblin: Something much more laborious for you?

Beth Barnes: A combination of more labour intensive for us, or we just didn’t think it would provide that meaningful assurance or wouldn’t be improving the standard of assurance provided.

I think maybe people have a confusion about METR, and have been assuming that METR is a formal evaluator that has arrangements with all the companies to do evaluations for every model.

This is not the case. I wouldn’t want to describe any of the things that we’ve done thus far as actually providing meaningful oversight. There’s a bunch of constraints, including the stuff we were doing was under NDA, so we didn’t have formal authorisation to alert anyone or say if we thought things were concerning.

So yeah, think of it much more as we’ve been trying to prototype what this could look like. And because we’re a small nonprofit, it’s easier for us to be more nimble and try out different things, and lower stakes for the labs to engage with us. So we’ve been excited about raising the bar on how good the oversight actually is.

And labs more want a “Can you promise that you will run an eval on all of our models when we want to, so that we can say we’ve run an eval on them?” sort of thing. And that’s not what we’re excited about, if we don’t think that that proposed procedure is actually providing meaningfully more safety assurance than not.

Rob Wiblin: In general, people think that — at least within the scope of what you’re trying to accomplish — the evaluations that you’ve created are very good. I haven’t heard that many criticisms of them on the substance. And you’re saying the companies could come to you and say, “We would love for you to run all these things on our models before they go out, and for you to play with it as much as possible in order to figure out how dangerous they are.” But they are not requesting that. They’re not actively making that easy for you.

Beth Barnes: I mean, we have gotten asks like that from a bunch of labs. I think the problem is that our goals are like prototyping and de-risking what could provide really good safety assurance.

So that includes better governance mechanisms, in that this actually feeds into decisions or at least is shared with the people who are making the decisions. And we have some carveout for being able to say we made a recommendation and the lab chose to not go with that recommendation, or that employees get to see what was shared with us so that they can check that it was accurate. So there’s governance mechanisms that we could be improving on.

There’s the pre-scaleup or pre-internal deployment versus external deployment, where we think this specific time of external deployment is not that interesting; we’re much more interested in what are the best things you have internally and wanting to know that companies are not building arbitrarily powerful things internally.

There’s other stuff, just like quality of access and information shared about what agent scaffolding has this model been trained to use and can you share that with us? And a bunch of stuff that is very important to know to be able to elicit the full capabilities of models, things like can we see the chain of thought?

Rob Wiblin: I guess they would say to do things the way that you want is a lot of staff time, and they’re sprinting all the time in order to try to keep up. Maybe they tend to play the cards very close to their chest, and this would involve sharing commercially sensitive information with you, so they’re nervous about doing that or perhaps their lawyers at least are nervous about doing that.

Are there other things that they could say in their defence that have any reasonableness to them? Or to what extent would you think it’s fair versus maybe closer to being excuses?

Beth Barnes: Yeah, it definitely just is effort. It would be a thing, and maybe there’s some ways that it could backfire or something. The cost is more attention from more senior people at the company or something, I would guess, and thinking how much they want to really do.

The object-level confidentiality concerns, the things we’ve proposed is like sharing with us anything that’s not compartmentalised within the company. If your 1,000 employees already know this, three people at METR knowing an anonymised version of it is not really going to increase the surface area very much.

Or taking staff time on the inside: it’s a much smaller fraction of your staff time than it is of METR staff time.

Or spending 1% of the effort that’s spent on the model on capabilities research on evaluations, or a few percent of that effort, would be enough to do this.

I don’t know, there’s stuff about token budgets and things. And OpenAI had their superalignment compute commitment, which was supposed to be much larger numbers than anything that we were asking for.

Rob Wiblin: Twenty percent of their commitment up to that point.

Beth Barnes: Right, 20% of the compute. So it’s not like we’re hitting those demands.

I think there are a bunch of things which are just technically and logistically challenging for them. If we’re just like, can the API please not be broken and not keep going down while we’re trying to use it? It’s like, well, before it’s stabilised for public deployment, there’ll just be issues.

I think in an ideal world, it would be like you wait to deploy. If pre-deployment was really the important thing that you needed to gate, you wait to deploy until you’ve given your evaluator proper access, and that they’ve had enough time with a stable version of the thing that’s actually the thing that you’re going to deploy, not something else. You can’t directly solve the technical problem of just make it work, but you can solve how you’re prioritising to get a good version of the evals done versus get it out as quickly as possible.

Rob Wiblin: I guess a cynic would say they’re not really keen on having an external group be free to really heavily scrutinise the models, maybe scrutinise their plans for training models that they haven’t even trained yet, and be free to notify the public or notify policymakers about concerns that they have — because what company would want that?

This is a bunch of risk for them. It’s a bunch of stuff that they’re taking things somewhat out of their hands. These are all the reasons why you want independent auditors and so on, but what company ever voluntarily accedes to that?

A more sympathetic take might be companies are kind of always struggling to make things work. This is just another project that could be challenging for them to operationalise, even if they’re not that nervous about what I just described. There’s lots of good things that they could do that they’re not doing, and this is just one of them.

To what extent would you adopt the cynical view, or would that be kind of a fair baseline?

Beth Barnes: I can definitely see it from both sides. I think you could be a company really prioritising safety and this could not make sense as a place to invest your marginal resources. I do think that is not literally the tradeoff that is getting made. And I think individuals at companies who are like, “Wait, surely we’re giving you that access, right?” are sort of not tracking the actual thing that’s happening.

But on the flip side, they’re doing much more than you might expect companies just selfishly interested to be doing. I think with different companies it’s somewhat different reasons. Smaller ones, it’s somewhat more just like it’s chaotic and they just can’t do that many different things at once, and everything’s a bit last minute and the infra isn’t there. And with the bigger ones it’s like bureaucracy and lawyers.

Rob Wiblin: Decision making is slow in general, and this is a bit of an unusual arrangement. It’s not something that they would do with their other products, for sure.

Beth Barnes: Yeah. And again I think you could paint this as the willingness of the labs. And I do want to disabuse individuals at these labs who are like, “Of course we’re doing all these great things, right?”

Rob Wiblin: You should actually go and check, because maybe not.

Beth Barnes: Yeah. But on the other hand, I do think it’s where we just really haven’t pushed that hard yet, because METR has been building up our technical capacities such that we can definitely make use of the resources that we’re being given access to, and also just capacity to run the standard lab politicking playbook, and make the requests really clear, and accelerate them up the chain or get support from a bunch of different people and then make it happen.

Thus far we’ve more just been kind of vaguely showing up and asking nicely and trying to be clear about what it is we want. We haven’t gone into the mode of like, now we’re going to actually push on this. I think this has just been bottlenecked on talent to be able to do that, so I’m optimistic about us getting significantly better access in the relatively near future, because I think there’s stuff that you can do to push on it.

Rob Wiblin: I see. So you’ve kind of suggested these things somewhat informally, maybe a little bit last minute, because you’ve still been developing the evals, and you’ve still been trying to hire to make sure that you have the people so you can actually follow through on the things that you’re asking to be able to do.

So you’ve nowhere near exhausted the options to try to get access earlier, and more serious access. I suppose these relationships could also deepen over time as trust is built, and it’s still perhaps at an early stage. Well, I guess it’s possibly at a very late stage in the broader situation, but the relationship between you and the companies might be at an early stage.

Beth Barnes: Yeah, we don’t want to play hardball if we don’t have to. If some of the fault is on us, we don’t want to ask people to do really costly things and then be like, “Oh, actually…”

Rob Wiblin: “We can’t do it.” Yeah, that’s totally fair.

A case against safety-focused people working at frontier AI companies [01:48:44]

Rob Wiblin: A different, though I guess related, topic: over the years, I think many people have gone into working at the various AI companies with the vision that one way they’re going to be helpful is shifting the average beliefs of people who work at the companies.

Maybe they’ll get a chance to talk to their colleagues and convince them that concerns about how AGI might go are more legitimate than they might think. I guess they’re also just shifting the weight of opinion within the companies, and hopefully moving their culture in a positive, more cautious direction.

What do you think the track record of that is? Is that a reasonable approach for people to take going forward?

Beth Barnes: I basically think the track record of that particular theory of change looks pretty bad. I think if you just look at the history of people who tried to do that, there’s just a lot of people giving up and leaving. People have started at one lab trying to influence there, and then have kind of given up on that and left.

I think there are probably some people who’ve done this and it’s gone relatively well, but I think they’re very much in the minority.

I think reasons it makes sense to go to a lab is if you think your comparative advantage is institutional politicking, and that’s basically what you plan to do with most of your work. And if you’re very good at that, I think maybe you can do a better job.

Another reason is if you are going to be doing the implementation of safety things at that company. I think if you’re just sort of advancing alignment research generally, it’s better to be outside of a lab — because it’s easier to share it with everyone else and collaborate, and easier for other labs to onboard it if it’s not this thing from a competitor.

Rob Wiblin: Can you elaborate on that? What sort of work are you describing that would be done better outside?

Beth Barnes: I think most alignment and interpretability work would be better if it was outside labs.

Rob Wiblin: That’s a big claim. I guess most safety/alignment technical people work inside the labs now. What arguments have I heard? Of course they would say maybe it’s easier to get implemented on the company’s models if you’re inside; you have access to more compute; you have access to the cutting-edge models, and you need to have access to the very frontier in order for your work to be relevant. What would you say to stuff like that?

Beth Barnes: I think the fact that people’s choices here don’t seem to be that responsive to the quality of the open source models available is maybe evidence that it’s not actually driving their decisions, and it’s instead other factors.

I just think that for a lot of research, it’s really not clear that you’re actually better off inside a lab in terms of infrastructure and model access — because lab infrastructure has a bunch of constraints due to security, among other things — and it’s just actually easier to work with open source models, and you can get work done faster often. And especially for open source, smaller models are much more optimised.

So if you want to do RL, or you want to do some experiments with cheaper models, the best ones might actually be outside the companies. I feel like it’s a big shame that Anthropic’s interpretability research was not on open source models that could then be shared, and people could kind of poke around with it.

And in terms of just actually paying for clusters or something, I think there’s enough funding available. I don’t think that’s going to be a huge bottleneck unless you’re doing massive pre-training experiments. I think that’s not a crux in most of the cases.

I also think — and maybe this is less true now if we are just close to the end times — but in the past I would have said that if you think that your research is much more productive if you have access to the very latest models, then that also means that the research you’re currently doing is just really unimportant compared to the research you’ll be doing in a while when the models are better.

Rob Wiblin: Because it will all be superseded?

Beth Barnes: Right. Well, if it is the case that your research is much better with today’s models than last year’s models, either there is something special going on about this particular point in time, or your research is also much less productive now than it will be when you have the models a year in the future, and much less productive again than the models a year after that.

So if you believe that it’s very important to have access to the most cutting-edge models, then that also implies that your research now is a small fraction of the total value that your research will provide when you have access to these more advanced models. So that suggests you shouldn’t really be prioritising what allows me to be most productive on research now.

Rob Wiblin: You should be preparing yourself to do work in future.

Beth Barnes: Yeah. Or prioritising buying as much time as possible at the point where you have these models that really speed up your research, which maybe points to working on evals.

Rob Wiblin: I see. OK, just to reiterate some of that: people definitely make the argument, “I have to have access to the frontier model trained by the best company in order for this to be relevant at all.” You had one objection there, which is if all that matters is what’s done on the frontier model, then maybe what really matters is what’s done at the precursor model just before these models are truly dangerous, and everything now is going to be kind of by the by, by time we get to that stage.

But also I guess these days it seems like the best open source model is virtually as good as the best closed weight model, the ones that are inside the companies — or at least it’s nipping at the heels of those models, as far as we know. In that case, the argument that I need to have access to the very best model is much, much weaker than it would have been, because you could just download R1 and just run it on your own computers.

And I suppose you could imagine that inside the companies you have to bid for access to compute, you have to coordinate with a whole lot of other people in getting access.

Beth Barnes: Yeah, there’s a bunch of security setups, which means you can’t easily run open source models.

Rob Wiblin: Whereas you could just buy your own H100s, buy your own chips, store them in your own office and then download R1 — and then you can just start doing whatever you want with a tiny group that’s committed to that.

I guess that might help to explain why it is that groups like Redwood Research have managed to really kick ass. It seems like their research is remarkable. I guess they collaborate with the companies, but primarily they’re operating independently.

Beth Barnes: Yeah. And I think this has been the lesson of people who’ve done research inside and outside of companies: it’s not clear that the internal access allows you to make research progress faster.

Rob Wiblin: Another argument you might make is that it seems like most people have tried to go inside the companies, so perhaps there are neglected opportunities for people to not be inside the companies. The weighting is really 80/20 in terms of…

Beth Barnes: So, sorry, I had three reasons for being inside companies that I didn’t finish listing. So, one, if you actually really are into being a sort of political influencer person, and that is your skill set and you’re going to do that.

Then the other two I would list are you’re implementing things — not doing the general alignment research, but just making sure that this actually gets into the production system or otherwise being very close to the stuff that’s actually happening — and being like, “If the company isn’t cooperating with third parties, I can at least run the evals and then me and my friends will know whether we should freak out.” Especially in worlds where things are not going very well, having some people inside the companies who are trying to actually get the basics implemented is useful.

And then I think that the third one is sort of similar: basically being the sort of person who would whistleblow if there were egregious violations of things, and getting into a position where you have awareness of that information. I think a lot of the actual mechanisms for regulation or oversight to work do require that you trust the company to not just completely lie to you or give you totally misleading information. So I think it’s very important to have some number of people internally who would have the guts to whistleblow if the lab was feeding misinformation to regulators.

And I think that does in fact take a lot of guts, as evidenced by the stuff about the OpenAI secret nondisclosure agreement that no one —

Rob Wiblin: Hundreds of people put up with it. And it took like one person who was really stubborn to dig in their heels and say, “No, I’m going to make a fuss about this.” Kudos to Kokotajlo.

Beth Barnes: Yeah. And Will Saunders.

So I think if that’s your sort of theory of change: you’ve got to be pretty confident that you are that sort of person. And I think there is some correlation between the sort of person who’s like, “The lab is offering me the most money, and it’s convenient and high status and cosy and comes with all the benefits; I guess I’ll go there,” and not being the sort of person who’s actually going to whistleblow. And that there’s some kind of selection.

Rob Wiblin: Yeah. They might filter against someone who seems especially whistleblowing.

Beth Barnes: Yeah. I do think labs also just actively filter those people out.

Rob Wiblin: I’ve maybe been surprised at the degree to which they’re willing to hire all comers, or it doesn’t seem like there’s really intense filtering on personality type. It’s not like going and working at the CIA. It feels like in some ways there could be more scrutiny on people’s motives.

Beth Barnes: I do think there is a moderate amount of selection against having too strong of principles or something.

Rob Wiblin: I guess that’s true in companies in general, that they often don’t want to have firebrands or people with a strong independent agenda. But you’re saying that is also somewhat true here?

Beth Barnes: Yeah. I mean, this is reasonable in many ways. Often those people are in fact unreasonable. And there’s a sort of unilateralist curse of people who are going to be unilateralists will, in expectation, do unhelpful things.

So actually implementing research or being a whistleblower are things that I think makes sense to be inside a lab.

I think people go there for reasons that are less good. One is this “I’ll influence in a good direction” — which I just think has a bad track record of effectiveness versus influencing from the outside, which I think has a much better track record.

And one is this “I have to have access to the latest models” — where I don’t really buy that.

I think naively one would expect that relatively too many people go to labs than go to nonprofits or governments just because it’s a nicer job in other ways, and they have more experienced and aggressive recruiting departments and this kind of thing. They pay way more than government and have better benefits and stuff. It’s more of a high-status job in general, and you maybe get better technical mentorship than being some of the founding people at a smaller org.

But I also think people are sometimes wrong about how much growth you get from just “go and try to do the thing yourself” as opposed to being in a larger organisation with people who are better at it than you — and that you grow most rapidly from just doing the thing. If someone’s like, “I want to go to a bigger org to grow my leadership skills, because I can learn from other people,” I think you would grow your leadership skills more by just leading this thing that’s right here that needs you to lead it.

So I don’t know, I think there’s some amounts of motivated reasoning. In general, unless you have a particular reason to think that you would be a particularly good fit at a lab and not a good fit at a nonprofit or government, I think you should probably not go to a lab.

Rob Wiblin: Yeah. Just as an aside, it’s interesting that you call them “labs.” I switched to calling them companies at some point because someone pointed out or just asked the question: “These organisations, do you think their primary nature is as a research laboratory or as a company?” — and I was like, obviously it’s as a company. Maybe at some point in the past they were primarily research laboratories, or that was the right way to think of them, but it feels a little bit hard to see that now.

Beth Barnes: Yeah. I don’t know. This is not an assessment of… I agree they are companies. It’s just fewer syllables.

I do think it does make sense to have at least a few people who really care at each lab who can be implementing the basic mitigations, can be running the basic evals, can be keeping an eye out for egregious misconduct.

I just don’t think that everyone piling into the labs, if you’re just generally doing public good safety research, makes sense. Or for some generic “I’ll just be there and influence them in a good direction.” I feel like I see this a lot from people, and it’s like, “But you’re a really introverted, technical person. And in fact, in practice you just spend like eight hours a day coding. It clearly doesn’t make sense as a specialisation for you to be like, ‘I’ll be here and influence things.'”

The new, more dire situation has forced changes to METR’s strategy [02:02:29]

Rob Wiblin: You told me earlier this week that METR’s strategy has over time changed to adapt to what is a somewhat more pessimistic or challenging situation. Can you elaborate on what you mean by that?

Beth Barnes: I think this is more relatively recently feeling like timelines seem short and we’re not in some of the best worlds in terms of how much is getting done on the safety front.

I think originally when we were designing what would become RSPs, or responsible scaling policies and thinking about what evaluation regimes would make sense, we were imagining something like you want to keep the overall risk of catastrophe / end of civilisation from AI below 1%. And if there’s several labs, they each need to be a bit below 1%, and any particular deployment probably needs to be below 0.1% because you’re going to get multiple shots, et cetera.

This just seems like what is reasonable from a societal perspective. Personally, I care about future people also, so I think we should be aiming for something lower than that. But I think even just from the perspective of people alive today, the benefits of getting AI sooner aren’t good enough to justify much higher risks.

But I think we’re increasingly in the world where it doesn’t seem like we’re going to be keeping the risk that low. We can describe what you would need to do to be in that regime, and I don’t think that we’re doing it, that the world is doing it.

And if we’re pushing on things like, how would you make a really robust safety case that would have good external oversight, maybe this is in fact a distraction from can we just do the most basic mitigations? And maybe we don’t even have time to check if they’re working, but let’s at least do them. Getting the risk from 30% to 20% or something like that maybe is more what we should be focusing on.

Rob Wiblin: So the basic difference is you could spend a whole lot of time laying out what would be absolutely ideal practice that would get the risk down to 0.1% or lower in aggregate. But if we’re not even doing the absolutely most basic stuff already — the stuff that would bring it from 30% to 20% — then obviously you would focus there, because the impact is way larger. And I suppose those are also the marginal changes.

Beth Barnes: I think “ideal practice” even is too strong. I feel like the ideal thing looks way more intense, and this is just the sort of bare minimum if society as a whole was making reasonable decisions that I think would be roughly what everyone endorses. But going from that — which looks like being quite careful — to just trying to, on the margin, get the risk down a bit.

Rob Wiblin: Yeah. So you think the risk is like 30% on the current track that we’re on?

Beth Barnes: I usually say something like 50/50 in terms of do things broadly go well. Do we capture most of the value of the future? So not all of the 50% looks like obviously a disaster for currently alive people or something. It might be some lock-in or more gradually going off the rails or something like that. But I think a bunch of that is just like war and catastrophe and AI is taking over — and it’s very clearly “everyone agrees this is bad” territory.

Rob Wiblin: I guess there are a handful of people who like that, but it’s a distinct minority.

So starting from that picture where almost not even the basics are being done, what is the marginal thing? What would be your next ask, given the situation that we’re realistically likely to be in in a couple of years’ time?

Beth Barnes: There was a thing we were talking about earlier of chain of thought faithfulness, and actually looking at the chain of thought — so just doing basic checks for alignment of like, have we looked at a reasonable sample of the reasoning the model’s been doing, and can we understand this, and is it blatantly reasoning about how to trick us or do bad things or something? That’s one very basic thing I can imagine is just something we neglect to do.

Another intervention would be making sure that we keep the chain of thought interpretable. Which might be like companies might find some fancy new way to get more performance out of the model, but it now means that it’s just like neuralese and humans can’t read it at all. That would be very sad. So if on the margin there are some techniques that we can do to either ensure that we can interpret that, or to bring the performance of the interpretable thing up so it’s less of a hit relative to the uninterpretable thing, that would be pretty good.

Rob Wiblin: “Neuralese” is this term for gibberish reasoning by the model that I guess is useful for it to reach a conclusion, but humans can’t make any sense of it.

Beth Barnes: Yeah, or this could also be like it’s just passing numbers to itself. It’s not even outputting tokens that look like words or anything.

So yeah, it feels like we’re in a much better position than we might have hoped to be in, in terms of having models that do a lot of their cognition through legible reasoning. But people are working hard on making that not the case. That’s one thing that feels like it just gets a big increment on risk: how you handle that.

I think there’s just a lot of other very basic risk of misalignment incurring stuff. Some combination of what is your training data like, and how much are you incentivising the model to knowingly lie and reason about what the humans want, and reason about playing the training game, and how much benefit is it getting from doing that?

Rob Wiblin: I was just about to ask, is it actually imaginable that we could be so stupid as to just look at the chain of thought, find out when it’s scheming, and then just negatively reward that and try to discourage the visible scheming?

I guess people don’t realise the trouble with that is that it could just cause the model to become cleverer about hiding its scheming. It doesn’t necessarily actually reduce the scheming. In fact, probably it would cause it to become more surreptitious. So that’s actually a negative thing to do, rather than a positive one.

Beth Barnes: Right, yes. You know, back in the olden days of AI safety, people used to worry about having a sudden jump in intelligence: that models would go right through this regime where you have models that sort of scheme, but they’re not smart enough to have it not be detectable for us. But now I feel more confident that we’re already seeing models that can do this, so we’re going to have some period of time where the models are smart enough to do this, but not smart enough to hide it.

But the thing that I worry about is that we’re going to see signs of egregious misalignment and then we’re going to be like, “Maybe that was just noise. Let’s train it not to do that and slap some patches on it” — so that we totally will have had evidence, and it’ll just be ambiguous enough once you try some superficial fixes to make it stop doing it, that we won’t really do things differently based on that.

AI companies are being locally reasonable, but globally reckless [02:10:31]

Rob Wiblin: At a high level, overall, how responsible versus irresponsible would you say the main AI companies are being?

Beth Barnes: The tricky question in some sense is, what is your reference point? From the perspective of what I would think humanity in general would sort of endorse if they knew all the things, and what a sort of reasonable society that could actually coordinate would do, I think they’re being extremely irresponsible. And from my personal perspective, even more so, because I value future people maybe more than the average person does.

But on the other hand, if you look at the individual incentive structures around the companies, all these companies have a bunch of really good people who are working really hard trying to make things go well and within understandable constraints. I think there’s a bunch of arguments under which it does make sense to trade off safety for going faster — if you’re worried about misuse by authoritarian regimes or that sort of thing, or if we’re going to dedicate more effort to alignment.

I think locally people are pretty reasonable, or at least a lot of people are reasonable a lot of the time. I think people are also definitely unreasonable. But overall the situation is very bad.

Rob Wiblin: It sounds like you’re saying that’s primarily because of the lack of coordination? That if you just only had one organisation that was thinking about this from the perspective of just all people alive now, then probably they would go a lot more cautiously, and they would invest more in these kinds of evals to figure out what’s going wrong?

Beth Barnes: Yeah, I think it’s partly the coordination thing. And it’s also just that humans are bad at dealing with uncertainty and with probabilities that are smaller than about a half or something.

When I was at OpenAI, there were a lot of people who, if you asked them the right series of questions, would say that the probability that the thing they were working on would lead to killing all humans or something, you’d get like 10% or 20%. But it just wasn’t really connecting.

And I think if you actually want to keep the risk below 1%, or even lower, it looks like being very cautious in some sense, because our uncertainty around possible threat models and around the error bars on our capability assessments and all of these things are so large that being 99% confident that this is fine is a high bar.

I think model releases now are getting to that level, and it’s like, we’re very confident that it’s fine — but very confident, maybe that’s like 95% confident that it’s not going to do anything super bad. And maybe it’s like 99% or 98% confident that it’s not going to be catastrophic, but it’s pretty hard to get that low.

Rob Wiblin: So the issue there is, if you want to get to really low levels of risk, 99.9% or 99.99% confident, then you have to be really sure that there’s nothing outside of your understanding that’s going on. Because even from your perspective, given your dominant understanding of how the models are working and what they are and aren’t doing, things might seem 99% safe, but there’s always a possibility that you’re mismeasuring things, that your evals aren’t capturing what’s going on, maybe they are figuring out how to understate their capabilities.

And it just makes it exceedingly difficult to really bring risk that low when you have such an unclear picture of what’s actually happening at all.

Beth Barnes: Yeah. And again, I’m not really thinking about the 99.99%. I’m more thinking about can we get —

Rob Wiblin: From 10% to 1%?

Beth Barnes: Something like that. And again, it’s not like we have a pretty detailed understanding, but maybe we’re wrong — we just have gigantic error bars, and we really have no clue.

And the companies have described themselves and their safety process as like building the aeroplane while flying it or whatever. It’s not like, “We’re being so cautious, but we’re trying to reach this really high burden.” It’s just like, “Yeah, everything is a complete mess, and we’ve got to work pretty hard to keep the risk to what would be an at all reasonable level. And by default it’s going to be way higher.”

Overrated: Interpretability research [02:15:11]

Rob Wiblin: OK, pushing it from that. People are working on all kinds of different research agendas around safety, control, alignment. Are there any of those that you want to shout out as particularly underrated or particularly overrated? Maybe things that people are overinvesting in? I guess a worry would be everyone herds into some particular research agenda that’s fashionable, but it’s actually not really going to move the needle in the end, and that could end up being a huge diversion of people’s attention.

Beth Barnes: Right. I think there’s both a question about what is most important to work on and where or in what way.

So the things that I’m most excited about people working on are a very concrete implementation of at least getting the best practices we’re aware of thus far into production models, and trying to maintain that as an invariant: that we’re always getting the best alignment techniques there. And there’s generating new directions or creating model organisms to study or stuff that’s more like research.

One thing is I think it would be better if way more of the public good research was happening outside labs. I think Redwood Research, despite being two people, has really high output compared to the entire rest of the field, and all of the people at labs. And the stuff they’ve done is really good.

Control seems good, alignment seems good. Generally understanding scaling laws and forecasting and eval stuff seems good.

I think one thing that’s maybe, in my opinion, somewhat overrated is interpretability. Not that this is not useful — I think there’s a bunch of ways it’s promising and good — but it does have very high investment currently and I’m not sure it deserves as high investment as it has.

Underrated: Developing more narrow AIs [02:17:01]

Rob Wiblin: OK, so that’s one approach that you think we’re a bit overinvested in. Is there another one that you think is more promising than people have appreciated?

Beth Barnes: Yeah, there’s some general direction of just getting the capabilities that you want and not the ones that you don’t want. Currently I feel like how labs are going to react to evals is they’re just going to build more capable models, and then it’s going to be like, now it’s more capable and it’s maybe going to trigger this thing, and then —

Rob Wiblin: “What a shame.”

Beth Barnes: Yeah, maybe they’re going to fudge the evals, or maybe they’re actually going to do some of the mitigations, but I think we could be trying much harder to be like, what is the reason for rushing ahead in building these models?

The argument is like, “Then we can use them to do safe things and protect against these other models.” Like almost all of the companies that are like, “Building AI is important,” it’s like, ” — because someone else might build it, and we’ve got to protect against it.”

But I feel like this is very disconnected. I would rather the world was like, “Maybe some people are going to build dangerous AI. How would we protect against that?”

Rob Wiblin: “What would be the minimally capable AI that would be really helpful?”

Beth Barnes: Right. Or maybe the most important thing to do isn’t even AI. It’s not super clear that that’s what you need to do anyway. I mean, I think there are reasonable arguments for how we have a bunch of uncertainty and you just want to have a generally capable thing. But I just feel that people aren’t trying at all to be like, “How could we get the capabilities that are most useful, and which other capabilities can we knock out?”

And some of the very basic control stuff of how you can be more confident your model is not capable of taking over if there’s a tonne of stuff about the world that it doesn’t know. Maybe it’s really good at coding and stuff, but it just doesn’t know anything about world events past 2010, and you really tried to remove everything about computer security and something like that — where it’s like, this model couldn’t really do stuff by itself; it only works in this lab setup where we give it all the right things or something.

Rob Wiblin: Yeah. I guess the case where I’ve heard this discussed the most is trying to not include cutting-edge biology –in particular, cutting-edge virology or microbiology — in the training data, because that’s the kind of thing that would allow it to design a bioweapon.

And it’s a fairly clear point to make: why does the model need to know all of this incredibly cutting-edge biology, unless it’s being used by virologists? And in that case, why not just design a very special model for them that only they have access to? Why does the model that is going out to retail customers like us have to know the most dangerous things basically known to humanity up to this point?

And you’re saying this could generalise quite a bit more, where you could hobble the models by ensuring that they’re lacking specific capabilities by simply never training them in it, specific things that would be useful for them if they were going to try to scheme against us or engage in activities that we don’t like.

I guess I haven’t really heard that approach discussed very much outside of the biology context. I guess cyber stuff would also be another classic case where you could maybe use this — although they do seem to need to know a lot of coding for many commercial applications, so perhaps that’s trickier.

Beth Barnes: Right. I think it’s easier for the narrower things, but I do think if you need it for some applications, you just don’t need to serve that capability to everybody all of the time.

And with the bio stuff, yeah, you might still want to support synthetic biology research, but that’s like an extra training phase that you add on at the end, and then you serve that model only to a very carefully selected group of people. So I think even if you need the capabilities sometimes, that the frontier of which capabilities are how good that we’re targeting, we’re really not trying on this in a bunch of ways at the moment.

Rob Wiblin: I do remember hearing people push back and say the trouble with just excluding some information is that at the point where they were producing even more capable models than we have now, they’d be able to interpolate and guess lots of stuff that you left out.

Like if you included everything other than the news in 2013, then probably it could just guess what happened in 2013 by knowing what happened later. And I suppose even if you didn’t include anything after 2013, it might be able to guess a fair number of things that happened simply by the information that it has up to that date. To what extent is that a serious limiting factor?

Beth Barnes: Yeah, I think this isn’t a super robust intervention. This is sort of a, “Let’s try to get — ”

Rob Wiblin: Here we’re trying to get the risk from 50% to 40%.

Beth Barnes: Yeah, those sorts of worlds. I do think there just might be more technical work to do here on how you can actually remove knowledge or capabilities from models, or figuring out how much they can infer. Because I think you probably can assess this reasonably well.

Even if you have some concern about sandbagging, if you’re just trying to figure out does the model have this basic knowledge… There’s some level of sandbagging to do it with RL and there’s a much higher level of intelligence that’s demanded to do sandbagging if you’re just doing classification fine-tuning.

So if the model does in fact know the answer to a bunch of questions about this, you should be able to see that, with fine-tuning, it picks this up really quickly — so the answers must have been kind of already in there. So I think you can kind of tell what models know in this sense.

Rob Wiblin: By how quickly they later learn how to…?

Beth Barnes: Yeah. And also, as we were talking about before, if you don’t actually have great security, what you do want to know is, “Is it easy to fine-tune the model to have these capabilities?” — not, “Does it demonstrate them right now as these exact weights?”

Rob Wiblin: Yeah. Just to clarify for people who don’t know: sandbagging is models pretending to not be able to do things that they actually are able to do, or basically kind of lowballing what you actually are able to do. It’s a sport term, I think, from people who are throwing a match, basically.

Beth Barnes: Yeah.

Rob Wiblin: Wasn’t there some work on forgetting? Wasn’t there some effort to try to get a model to forget the existence of Harry Potter? I think they managed to kind of successfully do that, or forget something or other.

Beth Barnes: I’m not familiar with that exact thing. Sounds plausible. I think people have done some on this, and call it “unlearning.” I just haven’t seen that much on it. And in particular, I haven’t seen any kind of application of it.

Rob Wiblin: Yeah, I think I haven’t heard about that in almost a year. Which is a bit of a shame.

Underrated: Helping humans judge confusing model outputs [02:23:36]

Rob Wiblin: Is there anything else you want to call out as potentially underrated?

Beth Barnes: I’m sad that there’s not more work on helping humans evaluate difficult model outputs, basically the stuff I used to work on, on debate and IDA [iterated distillation and amplification] — this whole genre of alignment research is like, how do you make sure that your model is not doing destructive things when it is smarter than you are? And how can you create structures that provide a systematic advantage to the rater?

Debate is one of these, where you have copies of the model and they’re debating with each other. You have some specific subquestion that you can judge, and you can judge how each subquestion relates to the question above. And this is much easier than deciding the overall answer.

I think there could be a lot more stuff just helping humans more quickly make good judgments about, was this actually the behaviour you wanted? Was this actually seeking power in some undesirable way?

I think the more faithful your training signal is to things like truth and honesty, the less incentive you are putting on the model to reason about the training process and the flaws in the training processes and things.

And yet the more you have these techniques that kind of amplify what humans are able to evaluate, the more you might be able to check. Like, if these models, they’re running all of our security against these rogue, unsafe, bad models that someone else has developed. We have no idea what’s going on. We don’t know whether our models are actually secretly cooperating with the rogue models.

The better you can be like, “We have this scheme to have two models adversarially evaluate each other and point out the things that they’re doing that are most suspicious and then evaluate those,” the better we are at that, the more able we are to train for, “No, actually, do the thing that I want.”

Overrated: Major AI companies’ contributions to safety research [02:25:52]

Rob Wiblin: You said something a little bit spicy earlier, which is that you think possibly Redwood, with just a handful of people, is maybe producing about as much alignment research as all of the major companies put together. I assume that’s not actually literally true, but… Maybe you do think it’s literally true?

What would the barriers be to the companies producing more alignment publications than they are? It’s a little bit hard to understand, because the resourcing would be so uneven there.

Beth Barnes: Yeah. I may be biased in what I’m exposed to, because I’m just much more aware of the stuff that Redwood has done, but it’s just actually not implausible to me that that is the kind of current ratio.

So reasons companies are less productive: there’s stuff we mentioned a bit before about maybe actually the infrastructure and speed at which you’re able to do research is slower. I think there’s different incentives and pressure to do the thing that will unblock shipping the next generation model, rather than do the research that’s most important in the longer run.

I think to some extent it’s just like too many people in the same place or something, where it’s just like a marginal person is less useful when you already have a tonne of people.

I think also maybe companies are more focused — this is similar to short term– but it’s both short in terms of time horizon and locality. They’re just thinking about things at their own company, and not what is most important for the field as a whole, and how do we disseminate that and make it easy for other people to adopt.

And there’s a bunch of barriers to publishing, and that’s maybe a reason that some people have left, have been unhappy about not being able to publish their work enough.

Rob Wiblin: I see. Presumably they’re a lot more cautious now about what they share and what they publish, and I imagine there’s a lot of levels of review that stand between someone having some insight and it going out. I guess especially if it’s related to risks that the company might be creating, you can see a lot of people wanting to scrutinise that before it’s shared.

Beth Barnes: Yeah. The comms team in there editing the blog post, being like, “Can we make it sound a bit more optimistic here?”

Maybe another thing that I think is overrated, or I feel like sometimes people conflate alignment — in the sense of making sure the model is trying to do what we want, and not trying to do extremely bad things that we definitely don’t want — with making the model better at following complicated instructions or other things that are more just like making the product better or enhancing the model’s capabilities in the direction of following elaborate rules about what we want.

And I would just be more excited about progress that’s progress on the problem when humans don’t know which output is better, but the model knows a bunch of stuff about what’s going on. That seems more useful than, “The humans want the model to follow this complicated set of instructions, so we did the things to make the model better at following the complicated instructions, because otherwise it was forgetting that it’s supposed to also have this policy or something.” That doesn’t feel like it’s making progress on the core problem.

Could we have a science of translating AI models’ nonhuman language or neuralese? [02:29:24]

Rob Wiblin: Earlier you talked about neuralese, which is this sort of internal gibberish to the model we can’t follow, it’s helping it solve the problem somehow. Is it possible to have a science of interpreting neuralese to go from a bunch of numbers that the model was able to understand back into English in a faithful way?

Beth Barnes: I think that there’s some of this which is basically contiguous with existing interpretability. And it’s sort of like you’ve just gone from your model doing single forward passes to now you have some recursive thing, and you’re just feeding the activations back in. And basically you want to do all of the same interp things.

I think there are also other approaches. If we’re talking about specifically this case where you did have an interpretable chain of thought, and then either you trained on it too much and got gobbledygook, or you did some kind of optimisation, that there’s maybe a bunch of techniques here that are more like preserving the existing interpretability that you did have — as opposed to recovering it from this thing that started off completely uninterpretable.

Or making it possible to get that speedup without it needing to transition to the neuralese.

Rob Wiblin: Right. I guess that would be the tradeoff. So the reason companies will allow that to happen is because they just think requiring us to be able to understand the reasoning is merely holding it back; we need to allow it to use its own gibberish, because that’s better. But I guess if you then added a whole other layer of requiring it to then translate its reasoning back, then that’s just slowing it down again.

Is this an area that might be ripe for regulation? That you should say you can’t have models just using neuralese? Or maybe there’s always going to be too much of a grey area of whether the chain of thought is comprehensible or not, so it’s not a very bright line that you could draw in the sand. But it does feel like it would be really nice if we could get them to act in the spirit of saying, no neuralese, basically.

Beth Barnes: And no training the model not to do kinds of reasoning that you don’t like: no training the model not to reason about scheming, and it needs to be interpreted. I feel like it would be great if that was enforced.

I do think there’s a bunch of opportunities for technical progress here, where it may be much more feasible to figure out how you would have some kind of summary or translation that you are ensuring is faithful in some way. It just feels like a thing that people could work on.

Could we ban using AI to enhance AI, or is that just naive? [02:31:47]

Rob Wiblin: In terms of commonsense regulations that you could try, I’ve had the shower thought for a while. To the regulatory scientists, this is going to sound stupid, but why not just ban AI being used to enhance AI? I would feel so much better about the future if I knew that you couldn’t have a recursive self-improvement loop. That would solve the problem to an enormous degree.

The problem, of course, is it’s already being used that way. They’re already using it to help with coding. They’re presumably already using the language models to just research all kinds of random questions that they have. So it’s assisting.

But I think if you just had that as the rule, and you interpreted it as kind of any plausibly reasonable way where you draw the line between what is and isn’t permitted — so you allow some basic assistance with programming, but nothing that’s more impressive than what we have now — I think it would be a massive improvement, basically. There’s almost like no way that that could be worse from a safety point of view than the default, where there’s absolutely no rules about that whatsoever.

What do you think of these kinds of naive things, just a person off the street who was just presented with this problem saying, “Why don’t we just ban the thing that’s obviously creating this enormous risk?” Can we get some mileage here?

Beth Barnes: I’m maybe not the best person to ask about regulatory feasibility. I think, as you said, it’s hard to, if people are already doing things, say that they can’t do it anymore. I would be maybe worried if there is some overhang created here, if people were just holding off from doing a thing and then you suddenly get a lot of improvement.

Rob Wiblin: I guess people would say that China’s going to do it anyway, and this would hold us back so much.

Beth Barnes: Right. There’s stuff you could do. Like the amount of trust you need to have in your model before you let it do stuff. Or I think more models above this capability level, you need this mitigation — where definitely the thing that spiritually makes the most sense is we don’t care about the rate of progress per se; we just care about do you have the mitigations in time?

So I don’t love this AI R&D as a thing to draw a line on, because it is kind of contiguous with all of these just basic coding tools. Or you could also just ban humans doing machine learning research or something. It has this weird flavour in some ways. And if you didn’t already agree that AI progress was scary, why is more AI progress scary? It’s not like the final threat model thing. But I think pragmatically, you’re just not going to have the mitigations in place if you’re doing this crazy recursive self-improvement loop.

Rob Wiblin: You might come back and say, why not just hobble them in any random way. Like say that if you’re doing AI research, you have to remove the E key from your keyboard, and that’s going to slow them down and be really irritating.

But I guess the difference here is that if you have AI doing the work, then humans are getting further and further out of the loop of anything that’s happening. It’s becoming more and more inscrutable. I guess we expect that that’s the thing that’s going to lead to the very rapid improvement in the capabilities. So it’s something that slows it down the more it would have been sped up. It has at least a nice property to it.

Beth Barnes: Yeah, I do think in general I prefer things that are like, “You have to do this mitigation,” rather than just like, “Slow down.”

Rob Wiblin: I guess that gives them the incentive to fix the problem, inasmuch as you specified it properly, because then they’ll be able to go ahead. Whereas if you just ban it and say it’s always going to be banned, then no such incentive is made.

Beth Barnes: Yeah. And it’s a thing that a broader group of people can get on board with. It gives you some free lunch with respect to if people disagree about what capabilities you’ll have when. Someone who might object to the thing that just slows you down might not object to, “If you hit this level, you need to do this mitigation.” So it can be strictly more appealing and it’s strictly more reasonable in some ways.

Rob Wiblin: Yeah. I wonder if even if it’s a bad idea, it’s the kind of regulation that we might get in a crisis. If you do kick off a recursive self-improvement loop and that quickly leads to an AI going off the rails and causing a bunch of damage, and then you loop in politicians, and they’re like, “You were doing what?! You took all of your humans out and just had the AI doing all the work? Obviously we have to ban that.”

I guess that probably would be better than nothing, but that could end up being the sort of naive regulation that perhaps helps but is substantially suboptimal relative to other more sophisticated things.

Beth Barnes: Yeah. One way that being like “just slow down” or “just pause” or something can backfire is we would like to save all our pausing juice to the bit at which you can get the most safety progress out of your models before things are really dangerous.

And if people are like, “We paused and then everything was fine,” then everyone’s fed up with this and thinks this is stupid and let’s just go ahead now. And now the compute has accelerated more and we’re ready to go through this. You really want to go through that part of the transition as slowly as possible.

Rob Wiblin: So any constraints you place on yourself earlier is just creating more latent capability to very rapidly speed up later.

Beth Barnes: Yeah, that’s one model. I don’t think this is totally right. I just think people seem to sometimes miss this fact. I think it’s more salient the more you’re thinking about how a lot of the important safety work will be done by the AI researchers at the point just before they are catastrophically risky.

Open-weighting models is often good, and Beth has changed her attitude to it [02:37:52]

Rob Wiblin: OK, new topic: Is open weighting models in general good or bad, from your point of view?

Beth Barnes: This is something I’ve changed my mind on a fair amount. It’s maybe also a bit related to being in the lower-assurance world.

Originally I was like, this seems very dangerous if the problem is either that these things can be misused or that they can be dangerous even without a human trying to misuse them. It seems like you don’t want them to be everywhere, and not be able to be like, “Actually, this is a bad idea. Let’s undo that.” It’s an irreversible action, and it opens up a lot more surface area for things to go wrong — if you think that, at least for some of the threat models, there might be a big offence/defence imbalance.

So I think if you’re trying to keep the risk very low, it really feels like you can’t open source anything that capable. Because again, even with current models, it seems hard to rule out that with other advancements in scaffolding or fine-tuning or something, it would then become quite easy to make this model into something very dangerous.

I’ve shifted my perspective generally on how good it is to keep everything inside the labs. Sort of similar to the thing about, would you rather just the labs know about what capabilities are coming or would you rather that everyone knows?

I think in practice, the open source models have been used to do a bunch of really good safety work, and this has been important and good. And having a more healthy, independent, outside-of-companies field of research seems good both for the object-level safety progress you made, and for policymakers having some independent people they can ask — who actually understand what’s going on, and are able to do experiments and things.

Also, in general, I have become more sceptical of some kind of lab exceptionalism. I think a lot of the companies have this, like, “We’re the only responsible ones. We’ll just keep everything to ourselves, and the government will be too slow to notice, and everyone else is irresponsible.” I just think this becomes less plausible the more independent labs are saying this, and the less they actually respond to how responsible are the other actors. And if you trust the companies less, you want less of that.

I think it’s also one of those things where everyone always thinks that you’re the exception — “It’s fine, but we’ll just actually be good and actually be responsible. We don’t need oversight.” And sunlight is the best disinfectant. Oversight is really important, and security mindset and secrecy and locking things down can be bad.

Rob Wiblin: Seems like there’s a reasonable amount to unpack there. The basic reason I’ve heard for why actually open sourcing has been better than perhaps people like me feared a few years ago is that it’s been an absolute boon for alignment and safety research, because it means external people don’t have to actually go work at the labs — they can do things independently, which is really useful for all the reasons that we’ve talked about, if they have access to basically frontier models outside of the companies.

And I guess separately from that, it sounds like you’re saying that the companies having a very secrecy-focused mindset, wanting to hold all of the information about the models within themselves, that that’s a dangerous mentality. And maybe it’s actually helpful to have the weights open and things being used all over the place, so that people can see for themselves the chain of thought, for example, and modify the models and see what is possible if they’re improved — things that wouldn’t be possible if you only accessed it through an API that’s highly limited.

Beth Barnes: Right. I think there’s also a point about maybe you just think some of the companies are bad actors, or there are people at them who would be bad actors, and you would prefer that the rest of the world also has access to the tech.

What we can learn about AGI from the nuclear arms race [02:42:25]

Beth Barnes: But yeah, I think that the main thing is this slightly less specific to open weight question, and more some kind of general position on lab transparency versus sort of security and locked-downness.

Rob Wiblin: Can you elaborate on that a bit?

Beth Barnes: One of my side interests is the history of nuclear weapons. And I think there are a bunch of examples there, and in other places, of secrecy and personnel restrictions and things being used to silence people who had safety concerns or ethical concerns, or were too busybodies or interfering.

So both compartmentalisation — so that fewer people knew about a thing and could complain about it — and then trying to jail or otherwise get rid of or exclude individual people who had too many ethical concerns, basically. In particular Szilard.

And I think we actually have seen this specifically being used in the AI case already. I believe when Leopold Aschenbrenner was fired from OpenAI that there was some claim that this was about having leaked confidential information.

I don’t know the details here, but it’s very plausible to me that basically everyone at these companies is doing a bunch of stuff that’s technically leaking. And if you want to get rid of someone, you can go through all their documents and find something that they did. Or in general, it’s a reason to keep people out of things, like compartmentalisation. This person is going to be annoying about the thing. Let’s keep them out of it.

And I think doing the Trinity test, when they had the concern about whether it might be possible to ignite the atmosphere, there was basically no civilian oversight of that, downstream of this being a very secret project overall and not much government or congressional oversight. It just meant that just the people who made this decision were the people who —

Rob Wiblin: It’s just a bunch of scientists or engineers who happen to be working on it.

Beth Barnes: Yeah, and military. I think a lot of trying to get the scientists to shut up if they’re being too annoying about stuff.

And people who are very invested in that particular project — even people who would have otherwise been relatively pacifist or not super excited about weapons — once it’s the thing that you’ve been working on for ages, it’s really hard to be like, “Maybe this isn’t good.”

Rob Wiblin: “Maybe we should just stop completely.”

Beth Barnes: Right. I think there’s a bunch more anecdotes from that history that are relevant to current AI stuff. But companies that are developing AI seem like particularly bad decision makers in some sense here, because they have such conflicts of interest, so centralising all of the information power within them seems like maybe a bad idea.

Rob Wiblin: OK. The obvious benefits of having a security mindset are that dangerous information doesn’t leak out to other groups. But downsides would be, if you start siloing information within the company, then an even smaller number of people might have a sense of the general picture of what the frontier models are able to do. If they’re very cautious about letting anyone outside the organisation know, then it makes it harder for governance to occur, makes it harder for broader society to appreciate if risks are reaching a level that they wouldn’t regard as acceptable. It means that there’s just basically less oversight from anyone else.

Beth Barnes: Yeah. Which maybe means we missed the opportunity to react to a safety incident that should have been a warning — like, “We actually caught the model scheming or trying to do something bad.” And instead of being like, “Whoa, everyone should take this really seriously, and we should freak out and we should stop this kind of thing,” it’s just like, “Let’s just train it not to do this, let’s not tell anyone about that.”

Rob Wiblin: You’re saying something could go wrong internally, but they might say that it’s too dangerous to let anyone know that this even happens. So it gives them a pretence for sweeping it under the carpet. I guess even more problematically, it allows you to just clean house, to remove anyone who doesn’t support the current project, if you’re saying that anyone who’s ever told anyone about what’s going on in the company, even if everyone is doing so, then you can just fire people basically arbitrarily on that grounds.

You mentioned Leopold — that’s Leopold Aschenbrenner; I think his crime was telling people that information security within the company was not sufficient to stop the Chinese from stealing the model. Or that was his big hobby horse — which everyone agrees with, more or less, and everyone is kind of troubled by. Maybe he was making too much trouble over that and potentially bringing more regulatory scrutiny to the company?

Beth Barnes: Right. I don’t know exactly what happened in this particular case, but I’d be very surprised if this pattern doesn’t show up.

Rob Wiblin: Do you want to give any details of a particular case that occurred during the nuclear process? You mentioned Szilard, who was one of the first people to twig that a nuclear weapon might be possible. And he wrote to various people to say that it’s really important that America do this before the Nazis.

Beth Barnes: Right. I’m a Szilard fan. It’s a random thing. This is a historical figure who it feels like is thinking about so many of the same things that we’re thinking about in kind of similar ways. I think he was very good in terms of scope sensitivity. He was actually thinking about what are the most important things happening in the world.

And this is, I think, also some reason for being somewhat more optimistic about the ability to forecast technological progress and world events and things, and maybe some evidence that the reason why most people are bad at it is that they’re not actually trying.

So he foresaw nuclear weapons in the early 1930s, like, “Here’s how you would theoretically make a chain reaction if you can find a reaction that is started by neutrons and produces more neutrons” — and then was like, “I should keep this secret because it seems bad if everyone was doing it, bad if the Nazis got it.”

And he also was like, “I’ll move to America one year before the war” — and he did, in fact, move to America one year before the war. Like, if you’re actually trying, maybe geopolitical events were kind of obvious. And that you might use this for powering ships or submarines, and actually thinking through the implications of technology seemed quite possible.

He then, with Einstein, wrote the letter to the US government that sort of kicked the bomb project in the US into action. The original justification was that we don’t want the Nazis to have a nuclear monopoly. That seems like a very reasonable justification, and that seems very bad. But during the course of the project, it migrated from that to, actually, we’re going to just use it to shorten the war by bombing the Japanese. And now it’s an arms race with the Soviet Union. And I think very few of the people actually left the project after that momentum. Joseph Rotblat was one person.

Rob Wiblin: Yeah. So at some point it became clear that the Nazis weren’t working on this weapon, and eventually then they were defeated. That then obviated the explanation that most of these people had been given for being involved in the project at all, which was to beat the Nazis. But you’re saying basically 99% of them get stuck around with a new explanation that they were given.

Beth Barnes: Right, yeah. There were multiple people who said, “This is a terrible thing and I wouldn’t want to work on it, but in this particular case of only the Nazis having the bomb, that would be really bad. So I’ll work on it in the special circumstance.” Then that isn’t actually true anymore, and people are like, “Well, since I’m here now…” There’s investment, and these things get their own momentum.

I feel like I see parallels in AI in terms of people giving reasons for why it’s important for them to be first or push ahead or something. They’re like, “We’re the most safe. This other competitor is doing this thing.” And their actions don’t change as those other facts about the situation change.

One thing is also that, proportionally, very small amount of effort went into actually trying to figure out if the Nazis did have a nuclear programme. There was basically no American espionage. I think there was some British espionage, and the Americans maybe weren’t even on top of actually what the British had learned about this. So again, if they were actually trying to make sure we avoid Nazi nuclear monopoly, you would have been doing much more, like, “Where are the Nazis at? Is this actually a risk?”

And another example of Szilard being much better at forecasting the technological stuff: he was just like, “The Soviets will have the bomb really soon.” What are you talking about? And [director of the Manhattan Project Leslie] Groves was like, “There’s no uranium in Russia.” It’s like, what? The ores will be slightly lower grade. There’s no way there’s no uranium in the whole of Russia. You know how big Russia is? And like, “Our brilliant American boys: no one will match that.” It’s like, no, this is not actually that hard.

Rob Wiblin: Or they’ll just copy it. There were Russian spies.

Beth Barnes: Well, yeah. Only a few, but there was various information that they actually had like before the Americans did, because of spies in the British Foreign Office, I think. So there’s all this stuff about secrecy and locking down, and that secrecy was completely useless.

And yet people would not plan for this arms race. But Szilard was like, “The thing that is really important now is there’s going to be a nuclear arms race between the US and Soviet Union that is going to be potentially world ending, and we need to figure out how to try and make that less of the case and how we can reduce tensions with the Russians now.” And the people were like, “Whatever. They’ll never get it.”

Rob Wiblin: It’s hard to fathom.

Beth Barnes: Yeah. And the amount of effort that went into things like thinking about alternatives to bombing Hiroshima, to actually destroying a city — you know, could you explode the bomb in the air or in an unpopulated area to demonstrate the capability or something? And it was like people discussed it a bit over lunch on the day that they were making this decision. But there was so little effort put into some of the key decisions.

And similarly with potential world governance of the weapons. Like the Acheson–Lilienthal plan of inspections and things, the people who were working on it were in conflict of interest — they had various industrial things that they would benefit from, if I remember correctly.

And again, just generally it was a very half-hearted effort, and was kind of like they just wanted to do something so they could have been like, “We offered something, but the Russians said no.” But it feels like you have less evidence than you might think that something like that would have been completely infeasible, because there was very little effort into mechanism design of how you could have transparency about this in a way that was less unpalatable to the Soviets who didn’t want a bunch of stuff about how badly things were going for them in general to be known.

So I don’t know. Some kind of actually trying at the most important things seems like it would have been pretty leveraged if you’d had some detailed alternative proposal that was well thought through. Plausibly.

Rob Wiblin: Yeah. I guess the generalised thing to keep your eyes out for is if people say, “I don’t want to do this, but I have to do it because of X.” But they show almost no interest in finding out with confidence whether X is true or not, and they don’t stay up to date on whether it’s true or not. They also maybe show no interest in changing X, taking steps that would fix X and make their actions no longer required. And also when X stops being true, they don’t change their actions.

These are kind of giveaways that maybe X isn’t the real reason. And I think we may see a fair bit of that at the moment.

Beth Barnes: Right. The thing has acquired its own momentum. And when Szilard was trying to convince Truman’s incoming Secretary of War, I think, that bombing Japan was a bad idea, he was like, “Well, what are we going to tell Congress we spent all that money on if we don’t do anything with it?” It’s like, oh god.

Rob Wiblin: Yeah, it’s a little bit embarrassing to kill that many people in order to just have a better answer at a committee hearing.

Beth Barnes: Yeah, but that’s what things get decided on. One of the reasons I think pre-scaleup evals would be good.

Rob Wiblin: To explain: that’s a reason why we should figure out whether a model is safe before we spend all the money training it. Because once you’ve spent all the money, what are you going to tell the board?

Beth Barnes: Right. The investors.

I thought of another analogy to the nuclear stuff now that we’re talking about China: maybe something that was important was the compliance of individual scientists in different countries.

I think this is pretty unclear, but there’s maybe hints that some of why the German bomb programme didn’t go very far was partially that scientists were not very excited about nuclear-armed Nazis.

Rob Wiblin: I think it has been suggested that they deliberately made bad strategic decisions, basically, because they were trying to make the probability of success lower.

Beth Barnes: Right.

Infosec is so bad that no models are truly closed-weight models [02:57:24]

Rob Wiblin: Just to tie this back to the open weighting, open source question: if we just start open weighting all the best models, aren’t we kind of baked? It makes any governance virtually impossible. It means that any misuse is going to happen very quickly.

What’s the upside story? I guess it means that you get better warnings, maybe early. If there’s people who are saying we don’t really have to worry about the misuse and then something is open weighted and it’s immediately misused for some catastrophe or some horrific crime, then maybe that could lead to improvements in governance.

Beth Barnes: Yeah, I don’t advocate doing things to cause horrific crimes.

Rob Wiblin: So doing everything we can to stop that, but I suppose if we fail, and people who are against any kind of governance succeed, then a possible upside of that is that we all get to find out who was right sooner.

Beth Barnes: I think the case for certain types of transparency and openness is much stronger than the case for specifically open sourcing weights. But I do think the open sourcing thus far feels pretty clearly net positive, just because of the impact on safety research. I do think you would have thought that it ought to be harder to argue that we’re in an arms race if people are open sourcing the models. But somehow this doesn’t seem to have stopped anyone.

Rob Wiblin: Never let common sense get in the way of the narrative that you would like to spin for your commercial interest! Just to clarify, you’re talking about the release of DeepSeek R1, which is this super weapon that the Chinese developed and then immediately gave to us for free, in exchange for nothing. It really shows that we’re in a very intense arms race with the Chinese to I guess give one another our greatest technology.

Beth Barnes: [laughs] Yeah.

Rob Wiblin: It is extraordinary that that did not preclude the arms race narrative. I guess you could say that maybe this shows that they will be able to make something in future that they won’t share with us. But you’d really think literally giving away your weapons designs would be definitely an olive branch to your enemy.

Beth Barnes: You would have thought. Yeah, I think basically what you were saying about if there are incidents, they happen in public rather than just inside the lab. And especially if you have to have open sourced the weights before you’re allowed to scale up by this much. That would be one thing you get around. Like the lab can be doing this crazy thing internally that no one has any idea of, and at least outside you’d be able to demonstrate that this model is pretty close to being able to do this thing, and maybe we shouldn’t let them build even bigger ones.

I think also, unless you have very good security, the story of how you’re OK doesn’t depend on not open sourcing. It seems like if you have a sufficiently powerful model, then a bunch of people are probably going to be interested in it. And companies currently don’t have the security that would keep out state actors or the best private actors. So your story, your theory of victory has to be something like, “We’ll use our good models for something good, and that will prevent some harm from these other models.”

I think most of those stories actually also work if there are open models everywhere — because it’s more about how more of the people who control more of the compute want models to be doing these kinds of things than want models to be doing these nefarious kinds of things. And it does feel like either you have to be extremely locked down, and there won’t be any bad actors, or you’re making some argument about how we can use these models defensively in some way.

Rob Wiblin: Part of the story there is that the security of these companies is not so strong that the models are not going to leak to various nefarious actors anyway. We should at least be honest about it, and they may as well open weight it, and then a whole bunch of nice people who didn’t even have to steal it might be able to use it.

Beth Barnes: Right. I mean, maybe they’re currently open weight from the perspective of North Korea or whatever. Something I’ve sometimes said is there are no closed-weight models. Like, no one has that good of security.

Rob Wiblin: That is very interesting. I think most people know, but maybe not everyone, that North Korea is actually very strong on offensive cyber capabilities. It’s not only China that might be able to steal it. I guess you don’t think of North Korea as a technological powerhouse, but on this one thing of stealing information from people, they are quite good.

Beth Barnes: Right.

AI is more like bioweapons because it undermines the leading power [03:02:02]

Rob Wiblin: You’ve been using this analogy between AGI and nuclear weapons. That’s a comparison that people have definitely drawn a lot over the years, and people have both reasons to think that it’s interesting and reasons to think that it’s misleading.

Do you want to give a take on what you think the important disanalogies are? In what ways is the situation different today versus with Szilard and so on?

Beth Barnes: Obviously there are a huge number of disanalogies. AI is just much more general purpose. I do think it maybe ended up being a bit less disanalogous than we might have worried. Ten years ago maybe people were worried that you could build AI; it would just be coming up with the right algorithm, and then you could do it on your laptop — and that would have been much harder to control.

In some ways the story of nuclear proliferation is quite successful in that it was limited. It’s not like everyone has these. And that’s partly because detection is relatively easy: you can tell when someone is doing this because you need big facilities, and you need to have industrial capacity. I think AI has ended up being more like that than we might have feared, in that you just need a lot of chips and you need a lot of energy and a big data centre. And it’s not like anyone could be doing it in their basement at any time.

I think maybe an important strategic disanalogy is that nuclear by default advantages countries that have a large industrial base, because it’s so industrially intensive. It differentially elevates big developed states against terrorist groups and rogue nations and things. And it’s not very stealable. It’s like the amount of nuclear capacity you have is pretty proportional to your existing power.

Whereas I think there’s a bunch of ways that AI is more analogous to bio, in that it’s more in the direction of destabilising, or maybe it favours smaller actors more — in that at least in some of the threat models, it doesn’t matter that much how much resources you start with, because it’s relatively easy to steal. Whereas nuclear material is much harder to steal because it’s just stuff that’s there and it’s big. But this is software; this is data. It can leak relatively easily.

And if you’re talking about autonomous AI, it can make copies of itself and spread itself around — so it’s not like the amount that you can deliver is proportional to the amount that you had in your warehouse. It’s more like a bioweapon — where it spreads itself, and it’s less tied to the work that you had to do at the start, and it’s more easily leaked. And maybe this is the sort of thing that could be used as an agent of chaos more.

I’d like more people to be aware of this argument, because I think it’s a reason that big states like the US and China should be concerned about rapid AI progress — because it just is a destabilising thing. And they’re currently doing well and sort of on top, and making something that can suddenly spread out of control and can be stolen by a terrorist group or something is just actually really not in your interests.

Rob Wiblin: I see. So the way that it’s in the selfish interest of the US is that, while it maybe has a lead in the relevant technology right now, it seems like it’s going to cost at least billions of dollars probably to train the first frontier model that is really useful as a weapon, or useful offensively, or at least useful for maintaining industrial dominance over other countries. So that could allow it to kind of lock in its advantage.

The downside for the US is that if that period passes and it becomes a whole lot cheaper — and also these models are disseminated quite broadly, and they’re possible to steal both by states but by also private actors, potentially, because they’re just all over the place — then that’s actually a huge weakening of the US’s strategic position, of its security position, in the same way that dissemination of bioweapons would be.

Because it wasn’t the case that only the US and China and Russia could have designed bioweapons; actually, dozens of countries probably could have, and that would be absolutely devastating. In some ways, they would provide a stronger deterrent and would be even scarier than having nuclear weapons.

Which is one reason why the US was very keen to discourage their creation and to create a taboo on them, because it would ensure that the previous generation of weapons of mass destruction, nuclear weapons, would remain the dominant ones that countries cared about, which was something that the US had an oligopoly on.

Beth Barnes: Right. Yeah.

Rob Wiblin: So if we get to the point where basically nuclear weapons and bioweapons are now obsolete, and actually it’s AI that is the key thing that can cause damage if you want it to, and this is disseminated very broadly to all many different kinds of actors, then the US government is now basically just a weaker player in the world as a whole. And the Chinese government is also just a weaker player as a whole.

Beth Barnes: I think, yeah. There’s maybe some analogy with some of the NSA exploits that got leaked and then started being used by Russian hackers against US citizens. It’s like the version of a bioweapon where you’re going to develop this super advanced bioweapon — and also, by the way, it can be stolen if someone hacks your computer. It’s like, sorry, and this is good for us why?

Rob Wiblin: Especially because we’re already winning. Why do we want to create this enormous vulnerability to our position?

Beth Barnes: Yeah, right.

Rob Wiblin: Yeah, you were just referring to the NSA, one of the cyber offensive organisations within the US. They had a whole bunch of cutting-edge tools for breaking into people’s computers and phones and so on and stealing information from them, which I think they literally accidentally posted publicly on the internet, or there was at least like one leak of that basic type.

But one way or another the Russians got access to these tools that they had designed — which is the kind of thing you might expect, because the Russians are also very good at this — and then those weapons could just be immediately turned back on the United States. So even if it was narrowly, in the short term, good for the NSA to have those weapons, maybe it would have been best for the United States more broadly if they’d never been created.

Beth Barnes: Right. I just think this is a pretty good argument that the arms race framing doesn’t really make sense in terms of US and China national interests.

I mean, obviously AI is not just a weapon, and has a bunch of other benefits and things as well, but I think if you’re thinking about it in this arms race mindset, then there is this pretty good argument that what you want is less of it.

Rob Wiblin: I see. Or that’s what the national security folks should want. At least if they could come up with an agreement between themselves and China, and maybe a few other major players, to not develop this thing that is destabilising and that actually weakens their relative power, then that would be great. But there’s almost no discussion of that possibility.

Beth Barnes: Right.

Rob Wiblin: I think one reason for that is maybe that they are anticipating that they will be the first people to get to this incredibly powerful thing, and then they’ll be able to stop other people from developing it.

Maybe that isn’t always said out loud because it sounds a little bit hostile, but basically they’re imagining we’ll rush to this technology and then basically try to become a permanent hegemon who prevents anyone else from designing stuff that would be destabilising. Maybe it would be good if they were a bit more explicit about that, and then people could weigh in on whether they think that’s a good approach or not.

Beth Barnes: Yeah, it does feel a bit like one of the things where you could also work more directly on what would we do to cope with scary AI and that it’s not clear that the response is immediately, “Yes, build your scary AI as well.”

Just the stories I’ve heard for how you use your super capable model that you’ve built to keep the world safe are not that compelling. This is a thing I would love people to be thinking about more and thinking about more explicitly. Like all of the things where the lab’s theory of impact is like, “We’re going to build this thing first, and then we’re going to be responsible for it,” it’s like, OK, what exactly are you going to do with it, and what capabilities do you need for that?

Rob Wiblin: Almost all these stories involve at some point blocking other people from designing their own versions of it. Maybe one reason that that isn’t advertised is that it does sound quite hostile. It’s hard to sell that to people ahead of time.

Beth Barnes: Yeah, but also there’s a bunch of things that we could do that would achieve that that don’t require building super scary AI — like just have fewer chips.

Rob Wiblin: Have compute governance.

Beth Barnes: Yeah, a bunch of compute governance. Or just gathering up a bunch of the relevant scientists and putting them on working on something else. It just feels like there’s this response of, “In that case, we’ve got to build a bigger one.” OK, but what if you actually started from the thing that you’re trying to prevent?

Because nations are already pretty great at threatening each other with mutual destruction. Having a more destructive threat really doesn’t help you very much. Most nations are basically… destroying a few cities, they won’t prioritise that differently than the end of the world, so just escalating your maximum threat doesn’t seem that helpful.

So it’s got to be some kind of defensive tech thing. Maybe we should just think directly about what that is, and maybe there are more efficient ways to make progress on that.

What METR can do best that others can’t [03:12:09]

Rob Wiblin: All right, let’s switch on to something a little bit different. There’s lots of different organisations in this general tapestry: there’s METR, there’s the companies, there’s government agencies like the AISIs, there’s other independent nonprofits like Apollo. I can’t remember all of them.

What would you say is the comparative advantage of METR in this ecosystem, and maybe what are the strengths and weaknesses of the other alternative groups that are involved in evals and enforcing responsible scaling policies in general?

Beth Barnes: Man, I have a bunch of stuff I could say here, but I’ll start with nonprofits versus governments versus companies in terms of who should work at each.

We talked a bit already about model access for research and things, that that is one reason that people use for going to labs. I don’t think that is a huge deal. I think METR has some disadvantage of having less infra and direct access to models and things, but actually the compute availability: the labs give us free tokens, and then we can buy our own compute for doing our own experiments, and there are good open source models.

The advantages of METR are we can really just focus on the mission. We don’t have any other competing priorities. I think governments have various political things that affect what mandate they have, and they’ve got to impress various people. They have maybe less longevity; they more don’t know whether they’re going to be around in some number of years. Whereas METR, we expect to be able to keep fundraising; there’s no particular reason that METR would disappear with the next administration.

And unlike being at labs, it’s not like, “This safety thing would be better if it made our product more useful. Look, this technique is helpful here. Can you please help us do this thing? Or you better do those evals quickly because otherwise it’ll be blocking deploying this next model.” In that case, you’re just kind of always sprinting, trying to get the evals done — so that you’re either just not finished in time, or if they actually were to delay until the evals are ready, which I don’t know if they would actually do.

Whereas at METR we can just be like, what are the most important things for the field overall that are just the highest priority for us to work on? And basically do that. If we’re like, actually it’s maybe not that important to do evals at this specific point for the specific model, we’ll just focus on something else.

Rob Wiblin: What are the weaknesses for METR?

Beth Barnes: I think it’s just size, in that the total amount of technical talent is obviously much lower than at a huge company which has a bunch of people who are just doing capability stuff and are super expert in their own areas. So in some sense, there’s less mentorship and more of the normal startup property of you get thrown in and you grow because you’re given a bunch of responsibility, and you have to grow to live up to it — as opposed to you’re like this cog in this big machine, you’re being coached, and you’re incrementally moving up the ranks or something.

Rob Wiblin: In a way that’s very organised and predictable.

Beth Barnes: Right. So I think for people who very much want a quiet life, very predictable, and want to be told what to do more, it’s not a great fit. There’s a few patches of our work that are more like this, but more of it is fast moving.

And just being a smaller organisation I think is helpful for being able to pivot more rapidly. A bunch of things are changing all the time, and progress is fast. So I think that’s pretty useful, and it’s easier for us to prototype things that we want the company to do that would be harder to get them to prototype internally.

So we’ve been thinking about doing a control setup and control evals for our own usage of the models — where we do things like replacing all the models we’re using with one that’s been prompted to try and screw us over or something, and then see how this goes. It’s just way more likely that METR could do that than DeepMind is going to do that or something.

But I think similarly on the good at exploring or moving fast, compared to working with governments, it’s sort of lower stakes for labs to work with us. It just has less of you need all the government relations people to sign off, and legal stakes, and it’s less formal and more like, “I guess we can just do some research together.”

But I was supposed to be talking about disadvantages of METR.

Rob Wiblin: Well, I guess the usual thing with nonprofits is that it’s harder to scale really rapidly even if you’re succeeding. That doesn’t necessarily translate into much greater fundraising that would allow you to do your stuff 10 or 100 times as much. I guess that might be one issue, among others?

Beth Barnes: I think it’s more just like we’ve found it harder to hire because we’re not paying as much as the labs. Sometimes people think we pay normal nonprofit salaries. That’s not the case; we pay approximately the matching cash compensation that people would get at companies at AI companies — just the equity is impact equity, imaginary impact equity.

So it feels like that’s been the limit to growth, is hiring technical staff. It hasn’t actually been fundraising. I think if that started to become a bottleneck we could do more charging for our services, and I think could push more on moving things like the AI Safety Fund along — like a bunch of labs could put a bunch of money into a pot run by the FMF [Frontier Model Forum] and then that could be granted to third-party eval orgs or other nonprofit safety, independent safety things.

It’s good because it scales with the capacity of the labs: the labs are paying for it, but it doesn’t have this direct coupling with you making them happy. There are multiple instances of financial auditors signing off on various books that were extremely dicey — famously with Enron or Arthur Andersen, because they had the auditing and a consulting arm and it just was some very large fraction of their overall revenue. So it just would have been extremely bad for Arthur Andersen as a company to not just give Enron what they wanted.

So you want to avoid that kind of setup. But I do think if you have something where there’s an industry group, and to be a member you have to pay into this pot, then that’s distributed to evaluators. But there isn’t a specific relationship between will a particular lab keep working with a particular evaluator if that evaluator gives them an unfavourable rating.

Rob Wiblin: Right. What’s FMF?

Beth Barnes: Frontier Model Forum, a 501(c)(6). The point of it is a carveout from antitrust, where companies can talk to each other about things like safety and incident reporting, and RSP agreements and whatever, without being subject to the usual antitrust concerns. But it also makes sense as a hub to coordinate other things.

Rob Wiblin: Do we currently have an arrangement where there’s some pooling of funds and then auditors are assigned to companies, somewhat whether they like it or not, so that there isn’t this strong incentive for them to just get the answer that they want to hear?

Beth Barnes: No, there is a pot of money that labs have contributed to that’s set up by the FMF. That has like $10 million, I think. But this kind of thing seems pretty feasible, because this is just pocket change for the labs. It’s really not that much money for them. So the money side I think is fine.

So back to the point of this: METR has generally been more bottlenecked on talent than on fundraising.

Rob Wiblin: But it sounded like you were saying that one challenge with attracting the right talent is that you’re not able to pay them equity in the way that the companies can. Which, given how the valuation of these companies has exploded, might be a big factor in people’s choices; the dollar signs can be very large in people’s eyes.

Beth Barnes: Right.

Rob Wiblin: So I guess money maybe is an issue: that you need to pay people more extra cash. And if you just had absolutely unlimited financing, maybe you would pay people more.

Beth Barnes: Yeah. Obviously these things trade off. I think when you’re asking what are you bottlenecked on, it’s sort of like, what do you think marginal people should think is the right thing to contribute to?

Yeah, more funding would help spend less executive time on fundraising, and we could do things like pay higher salaries that would allow us to attract more talent. But it seems like we’re probably at the point where there’s not that many people who, if we just paid them a bit more… It’s like, we could 10x what we were paying people, and then maybe we’d be competitive with lab equity or something. Maybe not quite that high, but probably for some. Yeah, it’s almost in that region.

So it’s like, that would be a lot of fundraising. Or a few marginal people could —

Rob Wiblin: Could accept a merely very high salary in order to save the world.

Beth Barnes: Right, right. 80,000 Hours: how many people have you coached? Why can we not hire a senior software engineer? Not even an ML engineer, just a software engineer or DevOps. I’m like, where is everyone?

Rob Wiblin: Interesting. Do you want to go through the kind of roles that it will be useful for people to potentially apply for if they’re suitable?

Beth Barnes: We have senior research engineer / research scientist roles. That’s just the normal senior ML role. We have some combination of research and research engineering skills, so it can be anywhere on this continuum. You just need the sum to be good enough.

And then also senior software engineer and senior DevOps engineer. I don’t know what exactly is going to be up on our website at the time this podcast goes live, but currently maybe looking for head of operations or operations lead. And basically always looking for senior technical talent.

Rob Wiblin: For technical people who are still listening at this stage of the interview: this conversation hasn’t been maximally orientated at you, because you’re a relatively small audience. And if you would like to know more about what METR does and its most impressive work, you should go out and check out the website metr.org, where you can read the full research reports and see if you’re impressed and it’s something you would like to be involved with.

Beth Barnes: Yep. And apply, chat with us. Or if you want a taste of some of the activities, we’re probably still looking for baseliners — so you can compete with the models on the ML tasks, and we’ve been using these as work tests for the humans as well as for the models.

Rob Wiblin: Just to explain: baseliners are people who you use to measure how difficult the tasks that you’re giving to the AI are, so you’re establishing the baseline.

Beth Barnes: Right.

Rob Wiblin: Another reason why you might find it a little bit challenging to compete with the companies is that they have huge recruitment teams that are specialised in corporate recruitment and recruiting people from exactly the tech industry.

I’m sure you know all the things they do: take people out to dinner, give them exploding offers, really put the hard word on them about how important this is, maybe trash talk everyone else. And I guess those are things that as a smaller nonprofit, perhaps a little bit friendlier and a little bit less corporate, you’re less inclined to do perhaps.

Beth Barnes: Yes, we have in fact encountered all of the things that you named. It is a bit unfortunate. We just can’t compete with the level of recruiting intensity really. And also, maybe this is just a personal flaw, but I just really don’t want to do that. It just feels like such a waste of everyone’s time. There’s something which makes sense, which is like everyone should explain who should work at their company and why it might be good and what are the advantages. And then people should talk a bunch to each other and decide.

But there’s something that feels like it’s going on at the moment, where people spend a tonne of effort and are trash talking each other and trying to compete. It just kind of feels like a waste of everyone’s time and is unhelpful. I wish people just made decisions in a different way that was less affected by who took you out for dinner more times or something.

Rob Wiblin: Yeah, some of those things are a bit disreputable. I think exploding offers, probably no one would say that is the optimal way of allocating people across different organisations. These are things where you say you’ve got to join within 24 hours or we’re going to ditch you.

Other things, like talking to people at great length and really pitching them very strongly on the value of the work, I guess seems like it’s fair play. But you also really want to be doing your research. So it could be hard for people to carve out time to compete with people on the level of aggression that a for-profit company can potentially offer.

Beth Barnes: Yeah. And we’ve had people who I think of as relatively senior being like, “I want more senior mentorship, so I’m going to go to this company.” It’s like, “No, you were supposed to be the senior mentorship.” There was a bit of a bootstrapping thing here.

But I think also we actually have some staff who are young in age but are senior in ability level, and just are not kind of famous in some ways. And hopefully this is getting better over time as we publish more stuff.

What METR isn’t doing that other people have to step up and do [03:27:07]

Rob Wiblin: Are there any other organisations or work that you would like to see exist that doesn’t exist? Or any things that people might think that you’re doing that in fact you’re not actually able to do that other people need to take on?

Beth Barnes: Man, I have a very long list here. One thing that people seem to get confused about is the extent to which we’re really serving as the auditor for all these companies. And we’re not really doing that; we’re kind of trying to prototype these oversight arrangements and what arrangements work technically — and maybe if no one shows up and it really seems like crunch time, we’ll just try and scale and do it ourselves.

But we were imagining that someone else would do the more formalised version of this and we would continue to do the research exploratory version. Maybe that’s the AISIs. But if the AISIs get removed, then someone else would need to do that.

Rob Wiblin: I mean, who would that be? So the AI safety/security institutes that govern organisations, this would obviously be their natural role, auditing all of these frontier models to make sure that they’re not able to do things that the government regards as incredibly dangerous and illegal. If they don’t do it, who would do this?

Beth Barnes: There are a bunch of more specialised organisations that maybe make more sense to do at least some of this, like Pattern Labs for cyber. There’s just organisations who have more specialisation compared to METR with a customer-facing delivery.

Where our focus is more like: How much is this actually improving safety? Are we learning stuff? Are we derisking things? I think we might expand more and have more evaluations as a service arm, but that’s not something where it’s like we’ve got it covered, and no one else should bother. No: please come do that.

And we’ve only been doing dangerous capability evaluations thus far. We’ve been thinking about alignment and control a bunch the whole time, and this has been in some of the RSPs and stuff, to some extent, at least been in our thinking about it. But we’re only just starting some of that now. I think Apollo is doing good stuff and interesting stuff.

One thing is that we’re imagining trying to focus on doing good science and being sort of neutral experts, and less on advocacy and really holding labs to account or political campaigning or any of those things. So those are things that we very much don’t have covered.

We’re much more like technical advisors; someone might come to us and be like, “Is this bill good? Would this bill actually reduce risk?” That is the sort of thing we’d be very happy to answer a bunch of questions about. But we’re not doing, “How can we raise awareness of the fact that this lab isn’t doing this thing they said they would?” or something. It’s not a thing that we’re on top of. So we’re not providing accountability for the companies in general. We’re much more just trying to figure out, “How do you tell if the models are about to kill you?”

Rob Wiblin: Right, yeah. So fundamentally you’re a research institute. That’s your core culture, and you’re trying to figure out how to measure things that have never been measured before, how to quantify things that have never been quantified before about these models — which is an enormous terrain because so little is known and understood.

I guess there’s room for pressure groups, there’s room for policy advocacy, there’s room for the for-profit auditing thing that can scale enormously and is able to implement in a practical way the kinds of discoveries that you might have been making a year or two before.

Beth Barnes: Right. One paradigm I can imagine being in is that there’s someone else offering these more scaled services, and METR’s role is more like reviewing this for quality. You know, does the eval actually provide the evidence that it’s supposed to be providing, as opposed to just running them for all the companies that want them to? There’s just quite a lot of models that people want evaluated.

Another thing: we just happen to be that evals is our focus. I don’t think that’s because evals are more important than everything else; I think there’s just a tonne of work in improving the mitigations and actually implementing them at companies that is not being done, like the unlearning thing.

It just feels like there’s a tonne of incredibly basic things that just no one has tried. Or things like providing a filtered dataset: we think that maybe it would be good to remove a bunch of the stuff about scheming AI, and put a bunch of stuff about humanity and AI being friends in a dataset.

It just feels like there’s various dumb interventions that would be good, like more support for whistleblowers.

What research METR plans to do next [03:32:09]

Rob Wiblin: OK, you’ve been very generous. We’ve been talking for four and something hours at this point, so we should probably try to begin to wrap things up.

I feel like the conversation has maybe leaned a little bit towards the doom and gloom. Maybe I’m a little bit jetlagged. Maybe I’m leaning towards the negative this week.

In terms of something a little bit more hopeful, I’m curious to know what METR’s research agenda is going forward. And maybe also, it sounds like despite the fact that we’re in a somewhat negative or somewhat challenging situation, you might say, there are a lot of different useful threads that people could pull on, and really the problem is that all of the low-hanging fruit that’s available is not really being taken: there isn’t the resources or the focus or the number of people involved. So the positive side would be that there’s so much that can be done.

Beth Barnes: It feels like we’re just in some very high-stakes situation, where it’s kind of like there is everything to play for. It’s not like the outcome is overdetermined. It’s just really like there are all these extremely valuable things to do.

Rob Wiblin: All right, let’s take it bit by bit. So what are the priorities for METR over the next couple of years. Years? I should have said next couple of months. Who can plan over that kind of time horizon?

Beth Barnes: Yeah, next couple of months is pushing it. Who knows what models will be out by then. But we are thinking about what are the most important asks to make from companies now, given what we know overall, and what exactly are the biggest things that we want to tell the world that we think are most important on that front?

And it’ll probably be trying to get out of somewhat reactive mode, in terms of companies being like, “Please do these evals for us,” and get into what we think are the most important things that should be happening, and kind of make that clear to everyone. That’s one strand of things.

Another strand is moving beyond just dangerous capability evals into alignment and/or control evals, and maybe also communicating to people why we’re doing that and why we think the inability safety cases are not going to go that much further.

Rob Wiblin: OK. So control, we’re going to do other interviews about that. But that’s an approach where you say, maybe this model would like to screw me over, but we’re going to put in place constraints on it that would mean that even if it was trying to escape or it was trying to cause some harm, it wouldn’t be able to.

So a control eval would be evaluating whether such a setup is actually functional, so I guess you might actually make a model that would like to do something bad, and then see whether the constraints are actually binding on it. Is that the idea?

Beth Barnes: Right. Yeah. In some sense, you can think of the current dangerous capability evaluations as a special case of control evaluations, where your control strategy is nothing. So it’s like, what if we did some kind of monitoring? Or there’s just a bunch of interventions you could do that hopefully would make it harder for models to do extremely bad and scary stuff, and we can try some of those out and test them.

One thing I’ve been toying with is the idea of how you would evaluate a control setup. Maybe you want to red team: someone creates a model that is just nefarious and is trying to mess with you, and then you see what it’s able to achieve. Or you set some threshold of like, if the model is able to set up an agent that we don’t realise is running for this many days, then that counts as a catastrophe, and we’re trying to build a control system that can prevent that. And then you test that with your team. So we could potentially do that internally.

I think something we might work on a bit first is making sure we’re getting maximum use out of current models, and using them in our research to accelerate the research in general. And then the more we do that, the more relevant the control stuff will be because it’s like, could the model now be sabotaging your research or something, now that you’re using it more?

And then another thing that we’re working on is… If you’ve hidden chain of thought, you can’t tell if the model is maybe deliberately sandbagging or messing with your evals. You can’t trust the results of your evals if you can’t see all the thinking that the model’s doing before it answers.

But maybe labs are very unwilling to share the chain of thought. It seems like at the margin the thing that would most decrease our current fear about models would be that we know that something has been run on the chain of thought that looked for any kind of scheming or checked if it was totally illegible or whatever.

So let’s build a dataset and some kind of classifier setup based on open source models that then a lab can run. And they don’t have to release the whole chain of thought; they could just release the numbers, and maybe a smaller number of examples that wouldn’t be enough to distil on, or rephrased examples or something.

Rob Wiblin: Just so I understand: given all that we’ve been talking about with the chain of thought, a really obvious thing to do would be to have a model that can actually read basically all of the chain of thought that this model is outputting and say, “Is there some obvious secret code going on here? Is it producing neuralese? Is it producing this gibberish that I can’t make sense of? Or is it just literally talking about scheming?”

Then it definitely adds an extra constraint, an extra challenge for the model if it wants to do anything nefarious, if you’ve got another model basically that you do trust reading its mind all the time.

Beth Barnes: Right. It would just be nice to know that at least the model isn’t just blatantly doing this. I think the two things that we would need from a lab to establish this is: one, you ran this classifier and you didn’t find any examples, and two, you declare that there’s no reason to expect that the chain of thought would look normal but would be hiding this kind of thing — so you haven’t trained it specifically to not say this kind of thing out loud.

Rob Wiblin: To obfuscate its true thoughts and its true communication.

Beth Barnes: So then you train your classifier, or maybe this is just a prompted model, but you want to look for things like: Is this gobbledygook? Is it understandable? How far from reasonable English does it seem to be? And is there any reasoning about how to do on the evals or scheming or any underhanded behaviour?

This is not a control proposal, and you’re not imagining the model that you’re maybe scared of reasoning about this monitoring in its chain of thought and being like, “I have to do something that the other model won’t recognise,” or, “I have to suddenly do something that won’t get classified” — it’s more just like, if we’re in this situation where the model is just blatantly doing this misaligned thing, it would be nice to know that.

So if we assume that it’s not smart enough to be doing all this reasoning about how it might be monitored in the forward pass, but it can only do that in the chain of thought, then this should catch that. We just need to make the dataset diverse enough that we have a decent chance of catching a bunch of things if this is going on — as opposed to this needs to catch every single case, because it’s like a question about the overall model tendency rather than trying to prevent specific occurrences.

Rob Wiblin: Is there any other low-hanging fruit you want to shout out that METR might take on or you’d like to see other people have a go at?

Beth Barnes: I think there’s a lot more stuff to be done on mitigations and just improving the odds of good alignment.

We touched on this earlier, but making the supervision signal more accurate. I guess some question people might have is like, what is the story for these models becoming malicious when currently they seem so nice and friendly and helpful and polite and stuff?

Just intuitively, I think it’s helpful to imagine, kind of analogous to a human who becomes very cynical or sociopathic or something, particularly if the model is being trained to kind of give contradictory answers to please different users in different ways, or to say things that humans think is true that the model knows are false. It’s coming off as super friendly and happy and whatever because that’s what you’ve been rewarding it for, but actually it should be thinking about like, “What does this guy want? What have I got to say here? These ignorant humans don’t understand this thing, so I better say…”

Rob Wiblin: “Better tell them what they want to hear.”

Beth Barnes: Yeah, that’s the intuition pump for me for where this would come from. And I think this aligns with the theoretical claim of why doing this kind of thinking would actually improve your score: it’s a reason why some cynical, manipulative humans can do better than genuine ones.

Rob Wiblin: Yeah. Interesting,

Beth Barnes: Because you’re just directly reasoning about how do I push this button.

Rob Wiblin: “I’ve got to tell my boss how great he is.”

Beth Barnes: Yeah. So this is a phenomenon we’re already aware of, and the less we can push the models into that, the better. Just have a higher chance of ending up in some baseline of nice personality. And also it would be nice if we could measure this, and get a sense of how much is this going on, and how much is this training process really pushing the model to be playing the training game and trying to game things.

I think there’s a bunch more model organism work that would be interesting and useful to do.

Rob Wiblin: Yeah, I’ve heard people talk about model organism work. I guess from the name I have some idea what that is, but could you elaborate a little bit?

Beth Barnes: Yeah. So it sounds like you’ve read the alignment faking paper. That would be an example of a model organism: here is an actual instance of the kinds of cognition and behaviour and things that we’re concerned about happening in a model.

So there’s a bunch of ways people could push on that and make it more realistic, do it with smarter models, see if we can get more understanding of what exact kind of fine-tuning is needed to give it this information, and could we check whether it had sufficient information to be doing this, and can it do it without reasoning it out loud. Yeah, a bunch of things here.

Maybe one other thing that seems particularly good is actual applications of interpretability or otherwise developing more methodology for telling how useful your interpretability progress is.

I think a nice example of a test case for interpretability is, in the alignment faking paper, most of the experiments are with the model with the scratchpad where it has this hidden place to do all the reasoning. That’s where it is like, “If I want to be able to keep to my values, I should do this thing.”

They also found that some of the time, even if you don’t give it the chain of thought and it just has to answer straight away, it also appears to fake alignment, in that it does the thing in the conditions where it’s being overseen and then doesn’t do it when it can get away with it. But it’s hard to know, when you’re not seeing the chain of thought. It’s like, is that why it’s doing it, or is something else going on? Is there a confounder there somehow or just this is just not what we think it is somehow?

So I feel like that’s just a very good test case for interp. Can we produce evidence that what this model is doing is scheming or reasoning about the training process or something? That would be great.

Rob Wiblin: Yeah. All right. That’s a whole lot of examples of low-hanging fruit there that people might be able to pluck.

All things considered, I’d really rather live in a more boring time in history, but I guess we don’t exactly get to choose that, so we just have to go out and continue with all of this really useful work — and fingers crossed it’ll be sufficient.

Beth Barnes: Yep, KBO.

Rob Wiblin: Yeah. Is there anything else you want to say to listeners, before we wrap up? We’ve kind of run up against the timeline of our booking at this place.

Beth Barnes: Yeah. Like I said, everything to play for. Now is the time. If you’re like, “I was going to go and skill up and do some things and I’m working on capabilities until…” I’m like, now is the time.

Rob Wiblin: “…and sometime, some day in the future maybe I’ll do something useful.” Now is the time to do something useful.

Beth Barnes: Yes. Time to do something useful is running out.

Also, if you’re sitting on money and not donating it, I think the compounding returns of getting stuff started now, such that it can be more mature at the time it’s needed is just very high relative to other things — like learning or your money gaining interest or whatever. It just feels like we’re in the now is the time to act phase.

And METR is hiring. Come look at METR. We really need people.

Rob Wiblin: Yeah. I imagine if things do go badly, there’ll be a whole lot of people who are holding onto money and are still planning to later do something useful with their career at some later point, who have missed the window or missed the boat.

Beth Barnes: Yeah, it just ends up being way less useful than it could have been, because you’re like, “I guess I started my job one week before everything went down and I didn’t know how to do anything, so it wasn’t useful.”

Rob Wiblin: Inspiring words. People got to get on it. My guest today has been Beth Barnes. Thanks so much for coming on The 80,000 Hours Podcast, Beth.

Beth Barnes: Thank you.

Learn more

AI safety technical research

Risks from power-seeking AI systems

The case for taking your technical expertise to the field of AI policy

The 80,000 Hours Podcast on Artificial Intelligence and related topics

Related episodes

April 4, 2025

#214 – Buck Shlegeris on controlling AI that wants to take over – so we can use it anyway

Listen now

August 1, 2024

#195 – Sella Nevo on who’s trying to steal frontier AI models, and what they could do with them

Listen now

April 16, 2025

#215 – Tom Davidson on how AI-enabled coups could allow a tiny group to seize power

Listen now

August 22, 2024

#197 – Nick Joseph on whether Anthropic’s AI safety policy is up to the task

Listen now

May 12, 2023

#151 – Ajeya Cotra on accidentally teaching AI models to deceive us

Listen now

February 14, 2025

#212 – Allan Dafoe on why technology is unstoppable & how to shape AI development anyway

Listen now

August 29, 2024

#199 – Nathan Calvin on California’s AI bill SB 1047 and its potential to shape US AI policy

Listen now

March 11, 2025

#213 – Will MacAskill on AI causing a “century in a decade” — and how we’re completely unprepared

Listen now

About the show

The 80,000 Hours Podcast features unusually in-depth conversations about the world's most pressing problems and how you can use your career to solve them. We invite guests pursuing a wide range of career paths — from academics and activists to entrepreneurs and policymakers — to analyse the case for and against working on different issues and which approaches are best for solving them.

Get in touch with feedback or guest suggestions by emailing [email protected].

What should I listen to first?

We've carefully selected 10 episodes we think it could make sense to listen to first, on a separate podcast feed:

Check out 'Effective Altruism: An Introduction'

Subscribe here, or anywhere you get podcasts:

If you're new, see the podcast homepage for ideas on where to start, or browse our full episode archive.

On this page:

Highlights

Can we see AI scheming in the chain of thought?

We have to test model honesty even before they're used inside AI companies

It's essential to thoroughly test relevant real-world tasks

Recursively self-improving AI might even be here in two years — which is alarming

Do we need external auditors doing AI safety tests, not just the companies themselves?

A case against safety-focused people working at frontier AI companies

Open-weighting models is often good, and Beth has changed her attitude about it

Articles, books, and other media discussed in the show

Transcript

Cold open [00:00:00]

Who’s Beth Barnes? [00:01:19]

Can we see AI scheming in the chain of thought? [00:01:52]

The chain of thought is essential for safety checking [00:08:58]

Alignment faking in large language models [00:12:24]

We have to test model honesty even before they’re used inside AI companies [00:16:48]

We have to test models when unruly and unconstrained [00:25:57]

It’s essential to thoroughly test relevant real-world tasks [00:30:40]

METR’s research finds AIs are solid at AI research already [00:49:33]

AI may turn out to be strong at novel and creative research [00:55:53]

When can we expect an algorithmic ‘intelligence explosion’? [00:59:11]

Recursively self-improving AI might even be here in two years — which is alarming [01:05:02]

Could evaluations backfire by increasing AI hype and racing? [01:11:36]

Governments first ignore new risks, but can overreact once they arrive [01:26:38]

Do we need external auditors doing AI safety tests, not just the companies themselves? [01:35:10]

A case against safety-focused people working at frontier AI companies [01:48:44]

The new, more dire situation has forced changes to METR’s strategy [02:02:29]

AI companies are being locally reasonable, but globally reckless [02:10:31]

Overrated: Interpretability research [02:15:11]

Underrated: Developing more narrow AIs [02:17:01]

Underrated: Helping humans judge confusing model outputs [02:23:36]

Overrated: Major AI companies’ contributions to safety research [02:25:52]

Could we have a science of translating AI models’ nonhuman language or neuralese? [02:29:24]

Could we ban using AI to enhance AI, or is that just naive? [02:31:47]

Open-weighting models is often good, and Beth has changed her attitude to it [02:37:52]

What we can learn about AGI from the nuclear arms race [02:42:25]

Infosec is so bad that no models are truly closed-weight models [02:57:24]

AI is more like bioweapons because it undermines the leading power [03:02:02]

What METR can do best that others can’t [03:12:09]

What METR isn’t doing that other people have to step up and do [03:27:07]

What research METR plans to do next [03:32:09]

Learn more

AI safety technical research

Risks from power-seeking AI systems

The case for taking your technical expertise to the field of AI policy

The 80,000 Hours Podcast on Artificial Intelligence and related topics

Related episodes

About the show

What should I listen to first?