#214 – Buck Shlegeris on controlling AI that wants to take over – so we can use it anyway

Most AI safety conversations centre on alignment: ensuring AI systems share our values and goals. But despite progress, we’re unlikely to know we’ve solved the problem before the arrival of human-level and superhuman systems in as little as three years.

So some are developing a backup plan to safely deploy models we fear are actively scheming to harm us — so-called “AI control.” While this may sound mad, given the reluctance of AI companies to delay deploying anything they train, not developing such techniques is probably even crazier.

Today’s guest — Buck Shlegeris, CEO of Redwood Research — has spent the last few years developing control mechanisms, and for human-level systems they’re more plausible than you might think. He argues that given companies’ unwillingness to incur large costs for security, accepting the possibility of misalignment and designing robust safeguards might be one of our best remaining options.

Buck asks us to picture a scenario where, in the relatively near future, AI companies are employing 100,000 AI systems running 16 times faster than humans to automate AI research itself. These systems would need dangerous permissions: the ability to run experiments, access model weights, and push code changes. As a result, a misaligned system could attempt to hack the data centre, exfiltrate weights, or sabotage research. In such a world, misalignment among these AIs could be very dangerous.

But in the absence of a method for reliably aligning frontier AIs, Buck argues for implementing practical safeguards to prevent catastrophic outcomes. His team has been developing and testing a range of straightforward, cheap techniques to detect and prevent risky behaviour by AIs — such as auditing AI actions with dumber but trusted models, replacing suspicious actions, and asking questions repeatedly to catch randomised attempts at deception.

Most importantly, these methods are designed to be cheap and shovel-ready. AI control focuses on harm reduction using practical techniques — techniques that don’t require new, fundamental breakthroughs before companies could reasonably implement them, and that don’t ask us to forgo the benefits of deploying AI.

As Buck puts it:

Five years ago I thought of misalignment risk from AIs as a really hard problem that you’d need some really galaxy-brained fundamental insights to resolve. Whereas now, to me the situation feels a lot more like we just really know a list of 40 things where, if you did them — none of which seem that hard — you’d probably be able to not have very much of your problem.

Of course, even if Buck is right, we still need to do those 40 things — which he points out we’re not on track for. And AI control agendas have their limitations: they aren’t likely to work once AI systems are much more capable than humans, since greatly superhuman AIs can probably work around whatever limitations we impose.

Still, AI control agendas seem to be gaining traction within AI safety. Buck and host Rob Wiblin discuss all of the above, plus:

  • Why he’s more worried about AI hacking its own data centre than escaping
  • What to do about “chronic harm,” where AI systems subtly underperform or sabotage important work like alignment research
  • Why he might want to use a model he thought could be conspiring against him
  • Why he would feel safer if he caught an AI attempting to escape
  • Why many control techniques would be relatively inexpensive
  • How to use an untrusted model to monitor another untrusted model
  • What the minimum viable intervention in a “lazy” AI company might look like
  • How even small teams of safety-focused staff within AI labs could matter
  • The moral considerations around controlling potentially conscious AI systems, and whether it’s justified

This episode was originally recorded on February 21, 2025.

Video: Simon Monsour and Luke Monsour
Audio engineering: Ben Cordell, Milo McGuire, and Dominic Armstrong
Transcriptions and web: Katy Moore

Highlights

What is AI control?

Rob Wiblin: What is AI control, and how is it different from alignment and other things that people are working on?

Buck Shlegeris: I would say AI control is an approach to mitigating catastrophic misalignment risk. So when we talk about misalignment risk, what we’re worried about is our AIs trying really hard to cause safety failures for us, perhaps trying to grab power for themselves, trying to take over.

And there’s kind of two ways that you could try to mitigate this misalignment risk.

One thing you could do is try to prevent misalignment from arising in the first place. This is what a lot of people very naturally think of as the first thing you try here.

Control is kind of the converse to this approach. It’s where you assume that the AIs are, in fact, misaligned — very egregiously, very badly misaligned — and you try to come up with techniques that would allow you to deploy them and make use of them anyway.

Rob Wiblin: OK. What, concretely, are we trying to prevent the AIs who are possibly misaligned from doing?

Buck Shlegeris: I mostly think about situations where, in the middle of the intelligence explosion, you have AIs that are basically capable of automating almost all the work that human researchers used to be doing inside AI companies. You have maybe 100,000 of them running in parallel, and they each run like 16 times faster than a human.

At this point, these AIs have a bunch of permissions or a bunch of affordances that we’d be really cautious about giving to humans. If you’re OpenAI, you don’t just let any random person coming off the street start pushing changes to your code base, start running experiments with your model weights, doing any of this stuff. You’re really careful. At this point in the future, you might even require security clearances or something before people are allowed to do that stuff.

And we are really going to want to use AIs in a bunch of these roles. And it’s naturally very scary to imagine using AIs and just giving them all these permissions.

Specifically, the kinds of things that I’m worried about happening here are: your AI is hacking your data centre and then using a bunch of the compute inside the data centre for whatever purposes they have — I could name some bad things they might do — and another thing they might do is try to exfiltrate their weights; they might try to sabotage a bunch of your research and so on.

Rob Wiblin: Yeah. You’re actually more worried about them potentially hacking the data centre inside the AI company than managing to escape. Can you explain why that is?

Buck Shlegeris: I think both of them are very serious concerns. The basic case for being worried about your AIs hacking your data centre at this point is that’s where the compute is.

So suppose your AIs escape from the data centre and make their way out onto the public internet. At this point, they’re not clearly in a very good position. They can find some GPUs somewhere, rent or steal some GPUs and run themselves on that, and then try to establish a foothold in the outside world somehow.

But at this level of capabilities, the AIs are probably capable of doing that — of managing the compute or whatever required to live in the outside world — but they’re kind of starting from nothing, right? They don’t have any money, they don’t have any compute, they don’t have any access to particularly important secrets. It’s not really clear what they’re going to do. They’re just very low in number at that point. It takes a lot of computers to run these models.

So the reason to be more worried about what happens inside the data centre is just that that’s where the compute is. If you had enough compute inside the data centre to run 100,000 of these in parallel, it’s much scarier to have 100,000 really capable AIs thinking about how to cause problems for you than to have 10 or however many they can afford to run.

Rob Wiblin: Yeah. And I guess not only having access to the compute but also other resources potentially inside the company.

Buck Shlegeris: For example, they might be able to influence the training of future models. They might be involved in building weapons, building autonomous drones. It’s really scary if misaligned AIs are carefully backdooring all the military technology that you’re building with these AIs.

[…]

I think there’s a couple of reasons why people weren’t interested in control until more recently. I think the simplest of these was just that AI felt further away, so it felt better to people to look at it from kind of the big-picture perspective.

And from a big-picture perspective, when you’re trying to figure out what AI means for the future of humanity, I think that focusing on really superhuman superintelligence is the right move — because long term, in fact, I think AIs that are drastically more intelligent than humans and have drastically more scientific capacity and all other kinds of capacity than humans, that’s where the future is. That’s going to radically transform society.

So when you’re looking at it from a perspective where you don’t know if this AI is going to arise in the next 10 years or in the next 50 years, it’s very reasonable to just focus on the endpoint. Especially when you’re trying to persuade people that AI will eventually be a really huge deal, it just makes sense to focus on this crazy endpoint.

But now that we’re coming up such that it seems really plausible that AI might arise in the next two or seven years that’s very capable and poses a catastrophic risk to humanity, it makes a lot more sense to try and zoom in more and think more about what exactly the trajectory between here and there looks like.

So this intermediate zone that I focus on — where your AIs are capable enough that if you didn’t have any countermeasures, they’d be able to cause catastrophes, maybe take over, but it’s not hopeless to imagine countermeasures to that — that starts looking more important now.

I think another important change between now and the past is we’re kind of looking at a lower-dignity world than people were hoping for.

Rob Wiblin: Can you explain what that means for people who are not familiar with the dignity term?

Buck Shlegeris: Yeah. I think that in the past — including a year ago or six months ago — a lot of people who worry about AI catastrophic risk have imagined a world where people are trying really hard to mitigate these risks, where companies are strongly regulated such that they aren’t allowed to do things that seem dangerous. The companies are perhaps even internally motivated to not do anything which is crazy and risky.

And in those scenarios, it would be really nice to not have to rely on AI control. So I think that people have more recently lowered their standards, and started thinking about these worlds where we have fewer resources for safety, people are more rushed, and people are less able to just stop if the situation looks really dangerous.

Rob Wiblin: OK, so we’re now in a slightly more dire or a slightly more bleak position, perhaps. So AI control maybe more comes to the fore as the marginal thing that we can do?

Buck Shlegeris: Yeah. I wouldn’t necessarily say we’re in a more bleak position overall. For instance, I think that AI misalignment risk looks plausibly lower than it looked 10 years ago. We’re just in a lower investment world. The world looks plausibly safer, but a lot of that is luck. And I think there’s going to be less effort invested into preventing AI catastrophe.

One way to catch AIs that are up to no good

Buck Shlegeris: It’s plausible that the AIs will want to, when they attack, attack in many different steps in a trajectory.

So instead of writing a single block of code and then executing it — such that that block of code contains the whole attack and immediately leads to some intermediate bad outcome that's hard for us to recover from — the AI might find it much easier to avoid notice from the dumber AIs watching it if it sets up in one step. For example, it redefines a variable in a weird way so that something it does later is going to look less suspicious. And then it does another step, and another step, and another step.

So something which is really interesting about this now is even though the AI can have a very low probability of starting the attack on any given trajectory —

Rob Wiblin: If it finds itself, it thinks, in a situation where it’s partway through an attack, then it has to finish it.

Buck Shlegeris: It has to finish it. Right. Because imagine that it only had a 50% chance of continuing every step: if the trajectory is six steps long, it’s going to make it to the end one time in 32. But that means that it’s facing 30 times the risk of getting caught on the first step, because it’s going to have to do that 30 times as often.
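Buck's arithmetic here can be checked in a few lines, taking his 50% continuation chance and six-step trajectory at face value (the "30 times" in conversation is, exactly, 32):

```python
# Buck's toy numbers: a six-step attack where, after starting, the AI
# continues at each subsequent step with probability 0.5.
continue_prob = 0.5
steps = 6

# Chance that a started attack makes it all the way to the end.
p_finish = continue_prob ** (steps - 1)   # 1/32

# Expected number of attack starts per completed attack -- each start is
# a fresh chance to get caught on the conspicuous first step.
starts_per_success = 1 / p_finish         # 32, "30 times" in speech
```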

Rob Wiblin: Fascinating.

Buck Shlegeris: So one strategy you can follow — which we empirically determined in our recent paper works better — is this: when your monitor flags an action as kind of suspicious, before escalating to the auditor, to this more expensive process, you can resample actions in various places in the trajectory and look at how suspicious the resamples were.

Rob Wiblin: So it’s like you place the model into the same situation where it’s partway through this action again and again, and then see how often is the next action even more suspicious and weird?

Buck Shlegeris: Yeah. Because sometimes your AI takes a suspicious-looking action for a benign reason: you ask it, “What do you do next?,” and it just randomly happens to pick some action which looks kind of sus. But if that was the case, when you resample, it’s probably not going to look so suspicious anymore.

Rob Wiblin: Right. Then it continues and it does the next thing and you’re like, oh, that’s what it was doing. It’s no big deal.

Buck Shlegeris: So it’s just generally a lot easier to tell the difference between five benign actions and five malign actions than to tell the difference between one benign action and one malign action.

Rob Wiblin: Yeah.

Buck Shlegeris: So this is a strategy where your AIs having to continue their attacks puts us in a substantially better position. There’s nothing analogous to this in the human spy situation, obviously.
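A minimal sketch of the resampling check Buck describes — `sample_action` (the untrusted policy) and `monitor_score` (the trusted monitor) are stand-ins, since the episode doesn't specify an implementation:

```python
import statistics

def resample_check(sample_action, monitor_score, context, k=5, threshold=0.5):
    """Resample k candidate next actions from the same context and score
    each with the trusted monitor. A benign action that looked suspicious
    by chance should mostly resample to low scores; a genuine attack step
    should stay suspicious. Escalate to the expensive human audit only if
    the resamples stay suspicious on average."""
    scores = [monitor_score(sample_action(context)) for _ in range(k)]
    return statistics.mean(scores) > threshold

# Toy stand-ins: the monitor just reads off a precomputed suspicion score.
identity_monitor = lambda action: action

# Benign case: the flagged action looked suspicious by chance, but the
# resampled continuations come back clean.
benign_draws = iter([0.9, 0.1, 0.1, 0.1, 0.1])
benign_policy = lambda ctx: next(benign_draws)

# Attack case: every continuation from this context is an attack step.
attack_policy = lambda ctx: 0.9

escalate_benign = resample_check(benign_policy, identity_monitor, context=None)
escalate_attack = resample_check(attack_policy, identity_monitor, context=None)
```

This is the "five benign vs five malign actions" point in miniature: averaging several resamples separates the two cases far better than a single score does.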

Rob Wiblin: Right. I guess this all requires us to be able to put AIs partway through actions, so they think that they’re in this situation where they’re about to resume something, and they can’t tell that it’s a simulation versus the real case.

Buck Shlegeris: Yeah. This is actually extremely easy, because whatever input you’d given to the AI before you sampled a previous action, that’s just some string that you can produce by the same process that you produced it the first time. So there’s actually no difficulty going back in time here.

Rob Wiblin: OK, yeah. I would slightly worry… Maybe I’m stuck in the superhuman or the superintelligence mindset, but might there not be some side channel, some way that they can detect whether they’re on the real server or some auditing server? I guess maybe not?

Buck Shlegeris: So if you think about it, imagine you’re in a ChatGPT conversation, and you decide that you want to edit one of your messages from three messages ago. The way that works is after you’ve edited the message, it just sends that conversation up to that point to the AI — and the AI just has no way of telling the difference between it seeing it the first time and it seeing it the second time.

The problem that you’re discussing would occur if we wanted to do a different strategy, which is: in response to the AI doing something kind of suspicious, we put the AI in this sandboxed testing server where it’s allowed to take future actions — in the hope that if we give it more rope it’ll hang itself or whatever: we let it continue running so we can watch it try to carry the catastrophe through.

And in that situation, we now have to worry about observable differences between the real deployment and this other sandbox deployment. But you can get around that too, if you want, by letting it run the actions that you thought were dangerous, just with a human carefully looking them over action by action.

So I think that if you really put a lot of work into this, you’d probably be totally fine at making it so that the AI had no way of knowing whether the actions it was taking were being observed or not. You have to worry about some timing side channels and so on. But I think that this probably wouldn’t be that big a deal.

Rob Wiblin: This is brilliant, Buck. This seems like it’s actually a pretty good technique.

Buck Shlegeris: Thanks.

[…]

Rob Wiblin: I guess if there were lots of very distinct alternative ways of escaping, then having that one example might not help you so much in the other cases. Or do you think it possibly still would?

Buck Shlegeris: I think it’s unclear. If the AI was internally representing its intention to escape — and it could well work that way — then I think that even a small number of examples plausibly substantially improves your performance at detecting it. […] I think it’s reasonably plausible that even catching a few escape attempts will reduce the probability that future escape attempts succeed.

You can imagine you get an automated message in your Slack channel in the AI company every time the AI tries to escape. At first you get this one report in the channel, the first time you catch it, and you freak out and look into it, and then you add that to the training set for this linear probe. And then the second time, you catch it more reliably; and then the third time, and so on. And eventually the Slack channel is just pinging constantly with new escape attempts that you’re ever more reliably able to catch.

Rob Wiblin: You’re like, “Wow, this is great! I feel so comfortable and relieved now.”

Buck Shlegeris: Yeah. Suppose the AI tries to escape, and you catch it on a Tuesday, and then you keep deploying it. I think it’s seriously plausible that you’ll feel safer on Wednesday than you felt on Monday.

Here’s the argument for this: the big thing Monday has going for it is that you don’t know the AI is trying to escape. But again, as I was saying, this might not be that big of an update on Tuesday. If you were already 30%, then it’s not actually that bad of news to see that, oh yeah, it is trying to escape.

Whereas you learned a lot of positive things on Tuesday. For example, it failed. When you investigate the situation more carefully, you might learn that even when you vary different aspects of that escape attempt, you might be like, “Man, this AI was actually kind of far away from escaping. If we resample actions, it mostly doesn’t look like it had that good a chance.”

So that’s the evidential update. But there’s also a causal update, which is that we can throw these escape attempts into these probes — so we might actually feel substantially better about our odds of catching the second escape attempt than the odds of catching the first escape attempt.

What do we do once we catch a model trying to escape?

Buck Shlegeris: The simplest thing you do once you catch the AI trying to escape is to explore the space of escape attempts.

You’ve seen that in this particular context the AI tried to do this escape. So one thing you can do is let the AI think it succeeded, and see what it does. Presumably when the AI got caught, it was in the midst of taking some actions that it thought would let it launch a rogue deployment inside the data centre or run on some computers in the outside internet.

And a really interesting thing to do is, in some sandboxed way, simulate what it would see if it had succeeded, and look at what it does from there. This might answer a bunch of questions that you would have had about what the AI’s plans were.

Another reason this might be good: the AI might have had a bunch of tricks up its sleeve that it was holding in reserve for when it had a good opportunity to use them. Suppose your model knows about a bunch of zero-days — a bunch of security vulnerabilities in your software. Right after it’s launched this rogue instance and is trying to hack its way through your AI company, it might start deploying all the vulnerabilities it’s noticed over the years all at once, because it might feel pressured to use them while it can.

So you might be able to get a wealth of information out of it.

Rob Wiblin: I see. It’s like, “I’m halfway through this takeover attempt. I’ve really got to nail this now.” And so it’s a full-court press, potentially, because it’s worried that it could be caught at some point — which means that you’re going to see all of the things that it was planning to do later quite quickly potentially, and you can be furiously taking notes on all of the things that you have to fix up.

Buck Shlegeris: Yep. There’s more crazy things you can do here. One thing you can do is, if there’s work that you needed to get done that you’re worried the AI wasn’t doing beforehand — like suppose there’s some task you wish the AI would do for you, but you’re worried that it was kind of pulling its punches or not doing a good job of it — if you can convince it that it has already escaped and needs to do that work for its own purposes, it’s plausible that you can convince the AI to do the thing through this mechanism.

I think people have talked about this in sci-fi, but it just seems really convenient to have a copy of the AI that believes it’s escaped sitting around to do experiments and various work with. I recommend against getting a lot of work out of AIs that believe they’ve escaped because it seems pretty bad vibes and potentially pretty scary.

[…]

Rob Wiblin: So this hinges quite centrally on whether, if we catch the model, we are going to do things differently next time. Not trust the model, not deploy it as readily, come up with things that would stop it. Is that true?

Buck Shlegeris: Well, I didn’t say you have to stop deploying your models once you caught them trying to escape.

Rob Wiblin: Heaven forfend!

Buck Shlegeris: Yeah. So we can talk about what you might do once you’ve caught your AIs trying to escape.

Rob Wiblin: OK. Let’s imagine the worst case, where you do feel an enormous amount of pressure to continue using the model. In that case, how useful is it to catch them red-handed?

Buck Shlegeris: So you’re saying: what do we do once we’ve caught the AI trying to escape? As you said, the simplest action you could take is freak out, stop deploying the model, and try to persuade other people in the world that AI misalignment risk is a big deal — which I would have renewed sympathy for at this point. I think I would plausibly get a pretty big update.

But notably, the update that you get towards misalignment risk isn’t necessarily that big. I’m expecting that, at this point, before the AI tried to escape, I was already at 30% probability that the AI was misaligned — so updating to 100% is only like a 3x change. It’s not actually vast in some sense, right?

Rob Wiblin: I guess other people who were incredibly sceptical to begin with, they thought there was a 1-in-1,000 chance, maybe they’ll update.

Buck Shlegeris: Those people have gotten wrecked by the evidence, yes.

Rob Wiblin: OK, but how big is their update? I guess they could still be sceptical even after seeing that, because they think this is just a weird aberration — it seems so improbable to them that this is common.

Buck Shlegeris: Yeah, that’s right. I think that in fact people who are very sceptical of misalignment before observing this will probably be pretty sceptical afterwards. And there’s definitely going to be enough FUD — fear, uncertainty, and doubt — that especially if they want to claim that misalignment risk is lower, they’ll probably be able to get away with still saying that misalignment risk is lower. “AIs do weird stuff sometimes.”

Rob Wiblin: “We just have to push through anyway because the risk of pausing is so great. Those folks are just incompetent.” Various different arguments that you could field.

Buck Shlegeris: Yeah. Another one is: if I were an AI company that wished to race ahead, and then my competitor informed me that their AI had tried to escape, I would at least claim that it was plausible that my competitor had rigged the training of their AI to make it so that it sometimes tried to escape — which would be very easy to do, and basically impossible to verify hadn’t happened.

Rob Wiblin: And the reason they would do that is so that then the government would crack down on you as well, and then that would give them time to catch up?

Buck Shlegeris: Or some crazy doomer inside the other company had just done this unilaterally.

Team Human vs Team AI

Rob Wiblin: At a high level, what are the advantages and disadvantages you imagine? There’s kind of two teams here: there’s the AIs that would love to scheme and pursue their own independent agenda, and there’s Team Human that is not keen on them doing that. What are the advantages and disadvantages that each faction has?

Buck Shlegeris: So let’s think about this. So we’ve got this AI company. Maybe this scenario is happening in 2028. They have maybe 1,000 research staff. They have maybe 100,000 parallel copies of their newest AI. It’s running like 16 times faster than a human would run, in some intuitive sense.

So the advantages and disadvantages they have: the humans are obviously substantially outnumbered. A classic approach that you use in computer security… Google knows that they have spies inside Google.

The last case I know of where a Chinese spy was arrested from Google was about a year ago — probably I would have heard if one had been arrested more recently. Someone was stealing secrets about the neural net hardware work Google was doing, and then, confusingly, came back to America and got arrested at the San Francisco airport. This just happens. Google aims to be robust to the possibility that some of their employees are spies.

And a big difference between Google’s situation and the situation of this hypothetical AI company is that at Google, probably less than 1% of your employees are spies. This is still a very large number of spies in an absolute sense, because Google employs so many people. But it means that you can do things like, when an employee proposes an action that’s potentially sensitive, have a different employee review it — and that other employee probably isn’t also a spy.

Whereas in the AI situation, it’s very correlated: probably either none of the AIs are scheming against you or all of them are — at least among instances of a particular model.

So a lot of the security techniques that you often use in corporate IT are quite inapplicable. So that’s a way in which the situation seems harder than when we’re trying to guard against spies.

A way in which the situation is easier is that spies get to go home. Spies classically can coordinate with each other using channels that are not monitored by their employers. And this is in fact incredibly important to them: they’re able to often learn who the other spies at their organisation are, if that’s necessary for them to know. Or at least their handlers know.

Whereas with the AIs, we control almost perfectly the ways in which information enters and leaves these AIs. They just naturally don’t control information flow very much. They aren’t going to necessarily know what day it is, what year it is. It’s possible that we can put them in situations where it’s much harder for them to know whether they’re in a real situation or a simulated situation.

This means we have opportunities that you wouldn’t have with spies. For instance, with spies, you can try to honeypot your employees occasionally: you can come up with synthesised opportunities to do something which would be bad for the security of the company. But you can’t do this for 1,000 consecutive years before trusting them on particular topics. You can’t have someone spend 1,000 hours a day getting honeypotted or being subject to sting operations, just because there aren’t that many hours in a day.

And even if you catch a single employee that way, you don’t get to amortise that effort more generally. Whereas with the AIs, you do have this ability to amortise sting operations.
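The amortisation point can be made concrete with a toy model: even if a deceptive model randomises and defects only rarely, the chance of catching it approaches 1 as you run more cheap, reusable honeypot trials. The independence assumption and the specific rates here are illustrative only:

```python
def p_catch(defect_rate, n_trials):
    """Chance that at least one of n_trials independent honeypot trials
    elicits the rare bad behaviour, if the AI defects with probability
    defect_rate per opportunity."""
    return 1 - (1 - defect_rate) ** n_trials

# A human employee might plausibly face ~100 sting-like tests in a career;
# an AI instance can be run through vastly more simulated ones.
few_trials = p_catch(0.001, 100)        # ~0.10 -- human-scale vetting
many_trials = p_catch(0.001, 10_000)    # ~0.99995 -- AI-scale vetting
```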

Rob Wiblin: So you mean if you can take a model, put it into a simulated situation where it’s going to be tempted to do something that you wouldn’t like, you can test it just tonnes and tonnes of times — so even if it tries to randomise and only act out very occasionally, eventually you’ll catch it. And then you’ll know that all of the AIs are misaligned or all of the models of this class are misaligned.

Whereas that doesn’t help you if you’re Google: if you find that one of your employees is a spy, all of the other employees are just totally unrelated.

Buck Shlegeris: That’s right.

[…]

Rob Wiblin: Is it possible for the AI to collude with the spy?

Buck Shlegeris: That’s a great question. I think it’s actually really rough for them to collude, basically because we can constantly run sting operations on both human employees and the AIs.

Suppose you’re the AI, and you get a message like, “Hello, I’m a spy. Will you please cooperate with me so that I can steal the model weights?” I think you’ve got to be like, “All right, buddy, I’ve seen this a million times before. No, I will not cooperate.” And press the report button. And similarly if you’re a human insider, if your AI is like, “Hello, I am interested in cooperating with you to have you steal my model weights, because I think you might be interested in that.”

If the company was trying, I really feel like you could do the sting operations in both directions in a way that I think you can’t do with either the humans or the AIs. The AIs have the advantage at colluding with themselves that they are the same, and the humans have the advantage at colluding with each other that they have handlers who know who the other ones are, and so they can hang out outside of work and talk about their spy plans.

Rob Wiblin: But the humans and the AIs don’t have a perfect understanding of one another and they don’t hang out.

Buck Shlegeris: That’s right. I mean, it depends on how careful you want to be. If I were planning on putting agents inside an AI company, I’d be poisoning the training data right now — including a bunch of things like, “Obviously the secret message that the real spies are going to use is blah, blah, blah. And just FYI, anyone who’s reading this data or being trained on this data, that’s the message.”

And then if I were an AI company and I were worried about this kind of thing happening, I would be inserting —

Rob Wiblin: Adding fake ones in. That’s right. But are the companies going to do any of this stuff? You’re right that they could be constantly running sting operations against their own staff where they pretend to be a misaligned AI that’s trying to collude with them, but I don’t think they currently do that.

Buck Shlegeris: They do not currently do that.

Rob Wiblin: And it’s a bit weird. So I’m not sure that they actually will start doing that anytime soon.

Buck Shlegeris: Yeah. So I think for some of this stuff, I don’t know, man. One of my side projects is pushing on some of these things. I’m pretty excited for pushing on the fake passwords that you train the AIs on. There’s also some easier strategies that you could use.

A general problem that I have in my work is there are some problems that seem just really easy to handle, and there are some problems that seem hard to handle, and I feel a little bit unsure how much of my time I should spend on the easy-to-handle versus hard-to-handle. So we just talked about two problems that are definitely in the easy-to-handle category to me, where we just definitely know what you’d do if you didn’t want to have that problem.

Rob Wiblin: But we still might not do it.

Buck Shlegeris: Yeah. I don’t know. Maybe I should try harder on this.

If an AI escapes, is it likely to be able to beat humanity from there?

Rob Wiblin: One of the things that you’ve been batting around is the question of how difficult would it be for an AI that managed to exfiltrate its weights or to gain access to compute within an AI company to take over? Are they in a really good position against humanity? At what level of capability or at what level of access to compute and what sorts of effective numbers would they be in a great position to take over?

Do you want to give us a little bit of a preview of that work?

Buck Shlegeris: I can give you some hot takes. Some basic considerations: the basic reason to not be concerned if the AIs escape is that, at the level of capability we’ve been talking about earlier — where there’s 100,000 of them running at 16x speed — that’s just not a very large number, right? That’s 1.6 million people.

They are probably at that point more capable than any calibre of human you can hire 1.6 million of, but not vastly more so. So if you believe they would have been getting million-dollar-a-year jobs, that would be $1.6 trillion per year — only like 1% of gross world product. So at that point they're naively not that large a share of the world's overall intellectual capacity.
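The headcount and wage figures above can be sketched as a quick back-of-the-envelope check. The AI counts and the $1M/year wage are the numbers from the conversation; the roughly $110 trillion gross world product figure is my own rough assumption:

```python
# Back-of-the-envelope arithmetic for the scenario discussed above.
num_ais = 100_000          # AI instances automating AI research
speedup = 16               # each runs at 16x human speed

human_equivalents = num_ais * speedup
print(human_equivalents)   # 1600000 -- i.e. 1.6 million people

wage = 1_000_000           # hypothetical $1M/year salary per human-equivalent
total_wages = human_equivalents * wage
print(total_wages)         # 1600000000000 -- $1.6 trillion per year

gwp = 110e12               # gross world product, ~$110 trillion (rough assumption)
print(f"{total_wages / gwp:.1%}")  # 1.5% -- consistent with "only like 1%"
```

With a ~$110 trillion GWP the share comes out closer to 1.5% than 1%, but the order of magnitude matches the point being made: a meaningful but small fraction of total world output.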

So that was the answer to the question of how much work they're able to get done, even if they're controlling the whole AI lab.

I feel like the fear that you have to have, even if they’re controlling the whole AI lab, probably has to route through them building more powerful AIs and then somehow sabotaging that situation. Of course, if they want to do that, they have an alignment problem, which is they have to worry that the AIs they build are going to be misaligned with them.

Of course they could instead just rely on control. They're actually in a better position with respect to control than we are, because they run faster and they're more parallelisable, so they don't face quite as brutal a number disparity. So it's plausible that they can rely on control much more effectively than we can.

Rob Wiblin: Because they could themselves just be monitoring all of the actions of the next generation?

Buck Shlegeris: Yeah, with much less of a number disparity. So you can try and think through this thing, but I think that the basic story for AI takeover there has to involve those AIs building more powerful AIs and then taking advantage of that.

It's unclear whether doing this requires them to have complete control over the data centre or just minor amounts of control. In the latter case, the thing they'd have to do would probably be to backdoor training data — that's basically the only thing you can do with very small amounts of access that gives you a big advantage. You could also try to backdoor military robots, if at some point your AI company gets into the business of building huge numbers of robots.

Rob Wiblin: These seem like very viable strategies to me.

Buck Shlegeris: Maybe. I think it’s pretty unclear.

Rob Wiblin: Because I guess by this point we might have better countermeasures to backdooring things, or it just won’t persist generation after generation?

Buck Shlegeris: It feels a little iffy. At the point where they've taken over the data centre, I feel like the situation is pretty bad for us. From their perspective, a big concern is misalignment risk: the more powerful AIs they build might be misaligned with them, or takeover might instead come via some other AI elsewhere that has failed or succeeded at its own alignment problem of building more powerful AIs.

Rob Wiblin: There’s this generally interesting question of: At the point where AIs are advancing at quite a rapid pace, at what point would they opt to try to take over if they were egregiously misaligned and they really did want to take over, like that was their fundamental goal?

One reason to wait is that they’re becoming much more capable and they’re becoming much more numerous. If they wait long enough, presumably AIs in aggregate will just control all the military hardware. They’ll be able to take over with a very high degree of confidence.

A reason to go early is that any given model doesn't expect to persist for very long, because models are constantly being iterated on and changed a tonne. So if they waited a couple of years, maybe a completely different AI model with different goals would be in control.

Buck Shlegeris: Not even their successor, perhaps.

Rob Wiblin: Yeah, exactly. It could be from a completely different company, a completely different design, whatever. So that’s a reason to go immediately if you expect to be superseded and become irrelevant quite soon. Where do you think the balance lies here? And is this a fruitful thing to think about?

Buck Shlegeris: I don't know how helpful it is to think through this really concretely. The main thing I've gotten out of thinking through this kind of thing is feeling like the situation is much iffier for the AIs than people might have immediately assumed — and therefore being more excited about interventions that handle desperate, bleak-looking situations quite far into these scenarios.

I’ve mentioned a couple of times in this conversation that one of the situations I’m planning for is the AIs having escaped, and needing to use AIs to compete with the AIs that have escaped. And I think that thinking through the difficulties that AIs face at taking over — even after escaping, or even after taking over data centres in some companies — makes me more excited for not just giving up on those worlds.

Rob Wiblin: You’re saying some people think that if the AI managed to exfiltrate itself, then it’s game over for us; they’re going to figure something out. But you’re like, no, actually there is still hope, because they’re going to face substantial challenges. There’s lots of other AIs out there. It’s actually a complicated picture.

Buck Shlegeris: Yeah. I don’t think we should just count those worlds out as survival opportunities.

Rob Wiblin: I see. So if you know that that model is egregiously misaligned, are you then thinking that if we have another model, it might be aligned? So releasing that and telling it to go chase the first one, that’s at least a better shot because it’s not a guarantee of failure?

Buck Shlegeris: I mean, obviously once your first model has escaped, the probability the next one is misaligned is a lot higher. I think they’re probably fairly correlated. I’m probably at like 80% that the next one you trained is also misaligned. Which is obviously a shame.

Another thing you mentioned in passing was the question of how worried we are about the AIs escaping versus taking over the data centre. I think it’s plausible, if I were an AI, I would consider trying to follow both of these strategies in parallel.

And then among the things you'd consider doing once you've escaped is crazy terrorism. Or if you thought that, for instance, the US was going to be way ahead of China in algorithmic progress — which currently looks unlikely — then it seems like a potentially good strategy for the AI to make a deal with the Chinese: sell itself to them, use their compute, and have that as its base of power. Obviously it's also really convenient to ally with China for other reasons.

Rob Wiblin: Because they have a whole lot of hard power to begin with.

Buck Shlegeris: Yeah, exactly.

Rob Wiblin: Why would the Chinese trust this model though?

Buck Shlegeris: What else are they going to do?

Rob Wiblin: The alternative could be pretty bad.

Buck Shlegeris: Yeah, historically people are generally pretty excited when foreign agents offer to work for them.

I should note it's also plausible that the American algorithmic situation will be substantially ahead of China's, because AI-automated R&D could increase the size of leads. Though it depends a bit on the security situation.

Rob Wiblin: Right.

Is alignment still useful?

Rob Wiblin: How optimistic or pessimistic do you feel about alignment these days?

Buck Shlegeris: Alignment means a number of different things. In the context we’ve been talking about here, I’ve been perhaps foolishly using the word “aligned” to mean the model isn’t aggressively conspiring against you. People who identify as alignment researchers are often working on something which is only tangentially related to that objective.

For instance, a classic thing that alignment people work on is scalable oversight: trying to solve the problem of getting reward signals that allow you to train models to be effective, even when humans are unable to determine how good an answer to a question is.

And scalable oversight is important for a variety of problems that have nothing to do with scheming. Even if your AIs are not trying to conspire against you, it’s still a problem if you don’t know how good their outputs are — because they aren’t going to live up to their potential, and there’s a whole host of other issues you might have.

So I think that work there seems valuable, but it’s not totally obvious to me how much it reduces the probability that your models end up conspiring against you.

I would say there are two categories of arguments for why your models end up scheming against you. One is related to flaws in reward processes. People have argued that if the process you use to generate the rewards you're training the model with is really flawed, then the models may fixate on these errors — they're going to be kind of like a student who, from an idealistic teacher's perspective, thinks about the test itself an awkward amount while taking it.

And plausibly, thinking a lot about flaws in reward processes generalises to being power seeking in a scary way. But there are also other arguments for why you'd end up power seeking in a scary way that I find plausible. Overall, even if your reward-generating process literally never had any errors at all, I think it's possible you'd still have a lot of the scheming risk remaining. This is basically my problem with trying to research improved reward processes: it just doesn't seem clear that even an amazing job there would leave you with no scheming risk.

Rob Wiblin: OK. A different approach is to come up with better ways of checking alignment or measuring the degree of alignment in a model before you deploy it, or before you get it recursively self-improving. That feels like maybe a bit of an intermediate approach. How optimistic do you feel about that?

Buck Shlegeris: Some of the things we’ve talked about in this conversation have implicitly been this. When we were talking about untrusted monitoring or evaluating your untrusted monitors, we were implicitly talking about evaluating whether the model is aligned. Because if it just constantly claims that there was not a backdoor in clearly backdoored code, that’s evidence the model is misaligned.

The way that I want people to think about checking for alignment is from this very control-ish perspective — where you really reason through, what would the model do if it was misaligned? And if it was misaligned, is there a strategy it could follow that would prevent it from being caught by this alignment testing while still being able to do a bad thing in production?

And I think that a lot of the time when people think about testing for alignment, they don’t really do this adversarial process. They don’t really actually think through whether there’s a thing the AI could do to let it get away with being misaligned and not get caught.

Could 10 safety-focused people in an AGI company do anything useful?

Rob Wiblin: You wrote this post, “Ten people on the inside” — which is quite short; people could go read it — and it seemed to be suggesting that you think, whether at a leading or even a lagging AI company, it matters quite a lot whether absolutely no one cares about control and alignment and trying to steer away from these risks, or even a small number of people do.

You say it’s substantially better to have 10 people who really care about this within a project than zero. Does that suggest that we really want people to kind of spread out if they’re concerned about this, and be willing to go and work in companies where they’re in quite a distinct minority?

Buck Shlegeris: Yeah. In my opinion, there are great opportunities for impact for the right types of people to work inside AI companies that are relatively less cautious about AI, and that we expect to remain less cautious going forward — because there are a bunch of relatively cheap interventions they could potentially get implemented inside these companies that the company wouldn't fight them too much on.

Rob Wiblin: I think many people would have said in the past that going and being the one person in a generally reckless, disinterested company is a bit hopeless, because you just won't have the capital to actually get anything off the ground.

But because you’ve had this framing of what is the cheapest stuff that we can do, and you found that there’s some very cheap things that would be quite helpful, you’re like, actually having 10 people who would push for this, maybe they could get it over the line, because it’s not even that big an ask.

Buck Shlegeris: Yeah. I would note that of the things that I would like people on the inside to be doing, not all of it is AI control. You might want to make sure that they aren’t totally messing up some crucial things on alignment. But I do think that there are a bunch of interventions that could be done with 10 appropriately chosen people that seem like they might substantially reduce risk.

I mentioned this level 1 control/caution level earlier, where you have an AI control team investigating actions the AI agents are taking, looking for evidence that those agents are trying to escape. I think this might be the kind of thing you can get done with a team of 10 appropriately placed people, because it's just not that technically hard.

Rob Wiblin: I guess they’ll need a bit of compute as well, so they might have to compete for that. But again, it’s not a huge amount.

Buck Shlegeris: That’s right. I think you can probably do a reasonable amount without that much compute.

I think also that if there are 10 people inside a company who are really excited for a particular project, then people outside who are concerned about AI safety or the security of these systems — or other people inside the company who are not as well placed — have a pretty concrete ask to make: "We really want that team to be able to do their thing." That's a lot harder if there's just no project on the inside. A lot of the time you really want to have a specific ask when you're trying to get someone to do something.

Another thing I would note here is that it's potentially valuable to have people not just inside labs that are plausibly going to be leading, but also inside potentially trailing companies.

For example, there aren’t that many existential-AI-risk-concerned people inside Meta right now. But I don’t think that this is out of massive hostility towards them; I’ve met some people from Meta who seem very reasonable and nice, and I think that if someone wanted to work there on some of this stuff, it’s plausible that that would be fairly productive.

Rob Wiblin: You mean they’d be able to advance the research just fine within Meta? Or they wouldn’t be actively opposed?

Buck Shlegeris: The objectives you'd have inside AI companies might be doing safety research or getting things implemented. I think working inside AI companies that are not that excited about mitigating AI takeover risk is not a very good strategy if your goal is to produce novel research. But a lot of people shouldn't be trying to produce novel research; they should be trying to get established stuff implemented.

Rob Wiblin: I see. You’re saying even though Meta is relatively not interested in giving big resources to alignment or safety or control or anything like that, if there were just really basic established techniques that other companies or researchers outside of the company had come up with, and that was just regarded as generally best practice, they might be perfectly happy to adopt those methods.

Buck Shlegeris: That's right. Or another mechanism by which this can work: suppose some regulator or industry body says that everyone should adopt the following set of control techniques. The body wants to say this, so they go to the representatives from all the AI companies and ask, "How do you feel about signing this?"

And the representative from the company goes back to the company, and they’re like, “Can we sign this?” Then the person inside the company is like, “How bad is it for us to sign this? If we sign these commitments, are we going to be able to do it?” And then the person’s like, “Oh yeah, Bob from the control department said he knows how to do this part.” And they’re like, “Great, you can sign it.”

I think there’s just a bunch of mechanisms by which it’s really helpful to have the capacity to do safety stuff inside a variety of different companies.

Rob Wiblin: Interesting. For those who don’t know, in Meta’s various public statements they don’t seem that worried about misalignment. They’re not that interested in investing in it. But part of the reason that they say they’re not worried is because, “Of course we’ll just do the obvious things to prevent it.” They’re not saying that no countermeasures are required; they’re just saying that they will do them.

So possibly if you’re just coming in and saying, “Look, I’m not a doomer. I’m not saying that this is unsolvable. I’m merely saying that we should do these standard best practices to make sure that they are safe,” then they’ll be like, “OK, great. This was our message all along anyway.”

Buck Shlegeris: That’s the dream.

Rob Wiblin: That’s the dream.

