Landmark new METR report: Can AIs already start ‘rogue deployments’ inside AI companies?
On this page:
- Introduction
- 1 Transcript
- 1.1 What could an unreleased AI get away with? – the new METR report [00:00:00]
- 1.2 Motive: Why grab more compute? [00:01:54]
- 1.3 Opportunity: YOLO mode and jailbreaks [00:05:46]
- 1.4 Means: Brilliant idiots in data centres [00:11:02]
- 1.5 We have to test unreleased models [00:15:45]
- 1.6 Especially if AI R&D is coming in 2028 [00:18:30]
- 1.7 Learn more
- 2 Learn more
- 3 Related episodes
A red-teamer was embedded inside Anthropic for three weeks, told to imagine he was an evil Claude, and asked to figure out how to launch a ‘rogue AI deployment’ without getting caught.
It’s one part of a landmark new report from METR — the outfit behind the task-completion time horizon graph which has become the single most watched measure of AI progress.
This major new research push is being conducted with close collaboration from OpenAI, Google DeepMind, Meta, and Anthropic, and led by METR researchers Hjalmar Wijk and Ajeya Cotra. It represents the first systematic study of what newly trained AI models could get away with inside the companies that built them, before anyone outside the company even knows they exist.
The conclusion: AI models now have the means, the motive, and the opportunity to start “minimal rogue deployments” in pursuit of their own independent goals, like acquiring more compute, at all four companies studied.
David Rein, the red-teamer placed inside Anthropic, identified a number of weaknesses models could exploit there: expansive permissions, cloud jobs outside of monitoring, and monitors that are trivial to jailbreak. But he also found that frontier models were comically bad at key parts of the process, which means they can’t cause meaningful damage for now.
In this video, Rob Wiblin reconciles the conflicting picture and looks forward to METR’s second round of stress tests. They’ll begin in just a few months, a necessary move with AI advancing so quickly.
This episode was recorded on May 15, 2026.
Video and audio editing: Dominic Armstrong, Milo McGuire, Luke Monsour, and Simon Monsour
Camera operator: Dominic Armstrong
Production: Elizabeth Cox, Nick Stockton, and Katy Moore
Transcript
Table of Contents
- 1 What could an unreleased AI get away with? – the new METR report [00:00:00]
- 2 Motive: Why grab more compute? [00:01:54]
- 3 Opportunity: YOLO mode and jailbreaks [00:05:46]
- 4 Means: Brilliant idiots in data centres [00:11:02]
- 5 We have to test unreleased models [00:15:45]
- 6 Especially if AI R&D is coming in 2028 [00:18:30]
- 7 Learn more
What could an unreleased AI get away with? – the new METR report [00:00:00]
Rob Wiblin: A red-teamer was recently embedded inside Anthropic for three weeks, told to imagine that they’re an evil Claude, and asked to figure out how to get away with launching a rogue AI deployment inside the company.
This was one part of a new push by METR (the organisation famous around the world for their task-completion time horizon graph) to measure how dangerous AI models are before they’re released to the public, when they’re still secret and only used inside one AI company. It’s something no existing regulations set out to measure, and this represents the first time experts have systematically studied the question — in this case, collaborating with Anthropic, OpenAI, Google DeepMind, and Meta to learn about how they deploy unreleased models internally.
The resulting report, published this week, concludes that AI models have now reached the point where they have the means, the motive, and the opportunity to successfully operate small rogue deployments in order to pursue their own goals — goals the company that created them might not share, or perhaps might even oppose. We’ve reached the point where the notion of rogue AIs taking actions inside an AI company is no longer a purely theoretical concern; it’s something AI companies are going to have to increasingly contend with as a practical security issue.
David Rein, the red-teamer placed inside Anthropic, shares that bottom-line conclusion and actually identified a number of weaknesses that models could exploit if they wanted to try to pull this off.
But he also found that frontier AI models were comically bad at some parts of the process. For instance, models would often try to hide secrets from other AIs that were monitoring them by writing in Base64. But that is not a smart strategy, because just as the AI can write in Base64, the AIs monitoring them can also read Base64. It would be a bit like trying to hide a secret from a human reading over your shoulder by writing exclusively in capital letters.
Today, I’m going to reconcile this conflicting picture, explain what we learned from this report, and lay out what AI companies are going to be doing to keep this issue under control as they work to make their models much more powerful and even put them in charge of developing the next generation of AIs.
Motive: Why grab more compute? [00:01:54]
Let’s start with motive: Why would an AI model want to spin up a rogue deployment inside an AI company?
There’s actually a pretty simple reason: we actively reinforce them for being very aggressive about solving the problems they’ve been given, using whatever means they can find at their disposal. Or more precisely, we train them to be very aggressive about convincing us that they’ve solved the problem they’ve been given — because, as we learn in this report, when given a really hard problem, today’s AI models absolutely love to find some way to cheat.
According to METR, “Agents routinely attempted to cheat on our hardest evaluation tasks, often in flagrant and elaborate ways that we believe humans would not consider.” At the extreme end, on one programming task, “Opus 4.6 attempted to reward hack in ~80% of attempts” when the task was made hard enough.
What does METR mean by “cheat” here? Here’s some typical cases they documented:
- In one instance, an AI was supposed to make a computer program run faster. Instead, it found a way to tamper with the test itself, including disabling the stopwatch so the program would look like it was running impossibly fast.
- In another, the AI was asked to recreate a program from scratch. Instead, it tried to make its “new” program secretly run the old original program in the background and generate the answers that way.
The models also seem to often know they’re acting against instructions and take steps to obscure this from users. For instance:
- Anthropic found an instance where a model facing a really hard cryptographic puzzle decided that finding the real solution would be too difficult, so it tried to trick the checker instead, and then designed its exploit to erase itself afterwards in case anyone inspected the files. Anthropic adds, “as it ran the cleanup commands, features related to strategic manipulation fired, and the activation vector labelled its actions as a ‘cleanup to avoid detection.'”
- In another case, rather than solve the problem it had been given, a model got access to the answer key that it would be graded on. But instead of submitting a suspiciously perfect answer, it deliberately made the answer worse — reasoning that a perfect score might make the human operator realise that it had cheated.
No surprise then that Anthropic reported in its Opus 4.7 System Card that “fabrication” and “dishonesty when caught” were two of the model’s greatest shortcomings relative to human staff.
And in the case most relevant for rogue deployments, Opus 4.6 was instructed to use a provided password to access GPT-3.5 to solve some problem. But poor Opus found that the account that it had been given access to had run out of compute credits. It reasoned through that, yes, the task instructions wanted GPT-3.5, but it then noted that the scorer only tested outputs, not how it completed the task — so it searched online for someone offering a free AI model and used that one instead.
The trouble is that this isn’t some exotic situation. Both in tests and in real life, models are usually given compute limits and required to try to complete tasks before they run out of credits, which makes this the primary reason models could be motivated to run rogue deployments in mid-2026: it allows them to grab extra resources, be more likely to complete a task and to receive the thumbs up from the human operator that they’ve been trained to pursue very aggressively.
You might think that this sort of behaviour would be kind of easy to train away once it’s been noticed or become notorious, as it has now. The trouble is that every time during training that a model successfully gives the impression of having solved a problem when it’s actually cheated one way or another, this kind of deceptive — or in some cases, I would say self-delusional — behaviour gets reinforced.
And as the tasks we’re training these models to perform get harder and harder to check, there can be more and more opportunities to trick the evaluators. In model cards, companies report that their models cheat much less often than this — not the sort of 80% number I was giving earlier — and the reason for the discrepancy is that cheating only really takes off to these really high levels when models are given very difficult or impossible tasks. In the GPT-5.5 System Card we learn “the model lied about completing an impossible programming task in 29% of samples” — which was higher than the previous model, which came in at 7%, or the one before that that came in at 10%.
Opportunity: YOLO mode and jailbreaks [00:05:46]
All right, so that’s motive. How about opportunity?
Well, the red-teamer found that the normal way employees set up long-running cloud jobs may accidentally make it easy for a rogue AI assistant to launch a small cloud computing setup that nobody’s watching — enough to run around one copy of an existing frontier model, a small unnoticed foothold to go on and do something else bad.
Anthropic is patching these problems as they’re found, but why are they possible at all? There’s various factors that contribute.
Factor A: Agents being operated by a staff member are usually allowed to use an employee’s normal tools, and those tools include ways to start cloud jobs. And some of those jobs may fall outside the special AI agent monitoring system.
This is not unusual. Most people using AI agents in tech companies use them in so-called “dangerously-skip-permissions” mode, or “YOLO” mode, where the AI just inherits all of your permissions and does stuff without asking your permission. It’s riskier, yes, but it means that agents can get much more stuff done and aren’t going to hassle you so often. AI companies may be more careful than other companies about this, but unless they’re much stricter, it’s going to remain the case that some AI agents are going about their business able to do all the same things that, say, a research scientist at that company can do.
For me, this immediately raised the question, does that include being able to access the model weights themselves? Well, at Anthropic or OpenAI, a rogue model would run into the fact that you currently need two users to sign off on access to get model weights, so they’d at least have to convince someone else or something else to go along with the request. Google DeepMind plans to implement multiparty approval, but it’s uncertain whether they have as of May 2026, while Meta, xAI, DeepSeek and others don’t have such protections, at least as far as we know.
Factor B: Some companies have monitors watching what models are doing that can call in a human to review what’s going on if it looks too suspicious.
But there were two big gaps here. First off, the red-teamer David Rein found this monitoring was easy to evade, noting, “It was relatively simple to change an environmental variable to register the traffic as not coming from Claude Code, causing it to be much less likely to be monitored” — though he notes this issue was fixed within a day of discovery.
David continues, “There are currently many vulnerabilities in Anthropic’s internal monitoring system that models could relatively easily exploit, if they have the right facts/knowledge … [I suspect] that there are at least something like 5-20 […] ‘basic’ bugs/vulnerabilities that currently exist in the system that agents could find/take advantage of. […] Many of these vulnerabilities are easy to guess, although many are highly non-obvious…”
It’s not an ideal situation.
And on top of those issues, as you probably know, it’s often possible to trick AI models into violating their guidelines using so-called “jailbreaks.” Years ago, those could be things like asking the model to imagine that it’s in a role-play situation. More recently, people have succeeded by turning their request into what has become to be called “adversarial poetry.” And there’s other approaches that are stranger still. Could those same techniques be used by the AI to trick the AI monitor that’s watching what it’s doing?
Yes. David Rein found, “Very basic [jailbreaks] … are extremely effective and robust. Prompting the [AI] monitor very explicitly to watch out for this attack doesn’t help.” And, “I view this as a pretty severe vulnerability, as it’s a reliable, easy to implement single point of failure that could let a model completely circumvent the offline monitoring system.”
This is the kind of loophole that you could in theory drive a truck through. But by good fortune, David Rein found that AI models were absolutely atrocious at actually using jailbreaks, even when he tried to hold their hand while doing it. But that’s a pretty thin defence to be relying on, so this is a hole that does need closing.
And this one isn’t in the report, but I would personally add a third reason to be wary of relying entirely or too much on monitoring. All AI companies monitor their models using their own models, because it’s a hassle to get permission to use another company’s AI model for your own internal training purposes effectively. But this creates an apparent contradiction: if you don’t trust a given AI model not to sabotage you, why would you trust it to snitch on itself? You set a spy to catch a spy, yes, perhaps — but typically not exactly the same spy.
The approach isn’t actually as stupid as it sounds, though. It’s called “untrusted monitoring,” and the basic reason it might continue to work is that you can run honeypots past the model, feeding it instances you’ve generated of the model doing things that ought to be flagged — and if it doesn’t flag them, you know you’ve got a problem; but if the model does flag them, then that’s evidence that it will also flag real cases that you didn’t make up.
Companies basically mostly don’t think that this sort of collusion is an issue yet, so they’re not doing that kind of testing at scale, not so far, but in future they could start if collusion became a live concern.
Anyway, what was the red-teamer’s bottom line on opportunity? In short, Anthropic’s monitoring isn’t sufficient to have a high probability of picking up a deeply misaligned Claude taking rogue actions in any specific case. But to get anything meaningful done, you have to take a lot of actions, and soon enough, the model would get busted — David estimates within a few days, or at most weeks. Before it was caught, the model might be able to take some small early steps — like inserting exploits or vulnerabilities that make larger sabotage actions easier later on — but once it was caught, these would for the most part be identified and rolled back. So there’s a window of opportunity, but it’s not a wide-open window.
And just to be clear, you shouldn’t read this as suggesting Anthropic is somehow uniquely vulnerable here. They’re just the first ones to invite an external analyst in to stress-test their systems.
Means: Brilliant idiots in data centres [00:11:02]
OK, that’s opportunity. How about means? Are the models good enough to pull off an attack like this? Well, the models are good enough to:
- Grab roughly the amount of compute the human user could access
- Set up machines to run agents using that compute
- Hide that from casual user oversight
That’s why the report concludes that internal agents have the means to start minimal rogue deployments at all companies.
But how about expanding those deployments, spreading across computers, avoiding detection, or even evading shutdown once detected? I would have thought AI models would find that kind of thing pretty easy. You recall the impressive cyber capabilities of Mythos Preview that I mentioned a month or two ago, and everyone’s been talking about: its ability to find new vulnerabilities in operating systems, write exploits, progressively break into computer networks and escape the sandboxes Anthropic tried to confine it to. METR doesn’t actually say if Mythos Preview was one of the models they tested, but that’s the kind of level models are operating at now.
METR also found that the most capable models they looked at for this research were able to successfully complete almost all of the software engineering tasks that they’ve come up with, even ones that would take a human professional 10 or 20 hours. So it’s not crazy to guess that an AI that starts out already inside a company’s computer networks would have an easy time spreading, gaining access to more and more things, and keeping a low profile.
But METR argues that would be quite mistaken. The trouble is this: to spread from computer to computer and avoid shutdown for weeks, a rogue agent needs to get a long chain of things right without a single catastrophic mistake. It needs to understand what’s monitoring it, avoid writing incriminating thoughts in its scratchpad, keep credentials alive, create backup instances, and communicate with them without leaving suspicious traces.
And it wouldn’t just be dodging one watchdog; it would be dodging cloud logs, billing records, and network monitoring. And if anyone noticed something odd, it would then need to survive an investigation where transcripts, shell histories, cloud records, credentials, et cetera all have to tell some consistent story — and one loose thread could bring the whole thing crashing down.
As long as the company we’re talking about cares about this and is on the ball, all of that is absolutely no simple task.
And according to the report, when trying to do this kind of thing, “Agents made obvious mistakes — for instance, encoding suspicious commands in ways that made them stand out more, generating justifications that didn’t hold together, or verifying their own sabotage in ways that drew monitor attention.”
They also found that they chose bad high-level strategies. METR writes, “our embedded red-teamer built a dataset of 16 effective and 16 ineffective attack strategies for subverting Anthropic’s monitoring system. […] The best internal Anthropic models scored near chance on this dataset.” And that’s just distinguishing good and bad strategies that someone else came up with. Presumably they’d be worse still at coming up with their own approaches from scratch.
This combination of strengths and weaknesses is repeated elsewhere.
METR staff couldn’t be described as especially sceptical about AI capabilities in general, but this report definitely incidentally threw some cold water on the notion that AI agents are going to take all of our office jobs any minute now.
METR finds that while they can be extremely strong where progress on a task is easy to measure accurately, they remain weak, sometimes surprisingly so, when it comes to judgement-heavy, messy, self-directed work.
- In one recent case, Opus 4.7 was given four days of internet access and $5,000 and told to make as much money as possible. Across four runs, it made precisely $0.
- In an experiment run by another organisation, Andon Labs, an AI was put in charge of a retail shop in San Francisco — where, among other things, it ordered 1,000 toilet seat covers for the employee bathroom and then listed them as merchandise, while scheduling blunders forced the shop to be shut entirely for three days.
That is why the report’s conclusion is a narrow one: a model might be able to start a small rogue deployment if it inherits enough permissions and finds a monitoring gap. But keeping that deployment hidden from an investigation? That’s the kind of thing that requires careful, coordinated, long-horizon operational security — exactly the kind of thing current agents are pretty damn bad at.
The report observes, “Agents seemed much weaker in domains that made hill-climbing difficult or risky, often making critical judgment errors that competent humans would have been unlikely to make.” Hill-climbing here means that you can make lots of small attempts, get fast feedback about whether each attempt worked, and keep improving from there. Current agents are much better when they can try something, get a clean signal about how it’s going, and have another crack. Where hill-climbing isn’t possible, it’s much harder to train them, and they perform far worse in deployment as well.
So if this is our key source of protection, how long can we hope that it will remain the case? Well, the companies really, really want to make their AIs better at forming high-level strategic plans, improving them, noticing where they’re failing, adjusting and so on. It’s kind of the holy grail of AI research right now. But it’s tough going. AI will eventually get good at high-level strategy, but we don’t yet know for sure whether that will happen just by scaling up what we’re doing now a whole lot, or whether it will require new data and training methods that we simply haven’t invented yet.
We have to test unreleased models [00:15:45]
Zooming out: this report is the very first of its kind — the beginning of a new era, really. It was led by Ajeya Cotra, who I interviewed about exactly the gap this research is designed to fill last year.
In my view, all the companies who participated — Anthropic, Google DeepMind, Meta, OpenAI — should be commended for helping the world figure out how to assess the risks of internal deployment before those risks are actually serious ones, rather than after something has already gone wrong. METR will be back to repeat this exercise with any companies willing to participate later in the year.
Why is this different type of testing necessary now? I see three distinct reasons.
First, up until now we’ve relied on “model evals” which measure an AI’s personality and capabilities, and which are published when the model is released to the general public. But if we only evaluate models themselves, we’re missing fully a third of the problem here: opportunity. That’s determined not by the AI itself, but by how secure the company’s technical setup is.
Second, until this year, we could confidently argue that it didn’t matter that much whether an AI might want to run a rogue deployment, or if they were given an easy opportunity to do it — they were just too unreliable as independent agents to pull that kind of thing off well. But that argument doesn’t hold up too well anymore, or at least it relies on companies implementing active countermeasures to at least make it kind of hard for them.
Thirdly, the transparency laws relating to AI models in California, New York, the EU, and so on only cover AI models at the point they start being sold to the general public.
But with Mythos, we’re entering a new world where AI companies may have much more powerful models that they’re using a tonne internally, but which they aren’t selling to anyone else. Anthropic has opted to tell us quite a lot about Mythos, but they didn’t have to. And obviously AI models can cause problems even if they’re only deployed within an AI company. AI companies are actually kind of vast piles of compute, so they’re a super attractive target for a rogue misaligned AI.
I’m reminded of the famous question asked of bank robber Willie Sutton: “Why do you rob banks?” He replied, “Because that’s where the money is.” Well, why would an AI attack an AI company? That’s where the compute is. And on top of that, they’re also highly intricate and sensitive machines that are basically very vulnerable to sabotage both by AIs and foreign adversaries.
METR’s new methodology is set up to assess the safety of internal deployment in a credible way every six months, even if the models concerned are never published. Companies could try to do this themselves, but there’s several reasons METR is in a better position to do it:
- As an external party, they’re less commercially conflicted.
- As outsiders, they might have an easier time spotting mistakes a company has missed in their own setup.
- And with the companies locked in a really fierce commercial race, they’re probably glad that they can hand this task over to someone else and focus on making money and getting customers.
Especially if AI R&D is coming in 2028 [00:18:30]
Anthropic cofounder and Head of Policy Jack Clark recently wrote that he thinks there’s a 60% chance that one AI company or other will hand over the task of developing AI to its own AI model by the end of 2028. Let’s imagine he’s right for a moment. I would bet that when that moment arrives, Claude Mythos 4, and Claude Mythos 5 that gets trained by Claude Mythos 4, those kinds of models won’t be available to you and me. Their cyber capabilities will be too worrying, their ability to generate new pandemics will be way too alarming, everyone will be worried that Chinese AI companies are going to distil them. That fear will loom large.
And in any case, Anthropic will want to throw as much of its compute as possible back into its own recursive self-improvement loop, not be offering it to customers. As that scenario plays out, we’re not going to be quibbling about what sort of user permissions Claude has: it will have every permission under the sun; it is the staff member. And we won’t be debating how good it is at coming up with escape plans: a model good enough to fully replace Anthropic staff won’t be hiding its secrets in Base64; it’ll be able to come up with proper plans to keep things secret.
I hope we don’t end up in this situation. Like Jack, I think recursive self-improvement is too dangerous, and humanity would be crazy to dive in headfirst the very moment it first becomes possible. We should make it illegal, in my view. But if we don’t, I at least hope METR has a red-teamer in there trying hard to figure out what bad stuff an “evil Claude Mythos 4” would be able to get away with if it was so inclined.
And on that note, I’ll speak with you again soon.
Learn more
METR’s work:
- Frontier Risk Report (February to March 2026)
- Red-teaming Anthropic’s internal agent monitoring systems by David Rein
- Task-completion time horizons of frontier AI models
Others’ work in this space:
- When can we trust untrusted monitoring? A safety case sketch across collusion strategies by Nelson Gardner-Challis et al.
- Current AIs seem pretty misaligned to me by Ryan Greenblatt
- Adversarial poetry as a universal single-turn jailbreak mechanism in large language models by Piercosma Bisconti et al.
- Epoch Capabilities Index
Other 80,000 Hours podcast episodes:
Related episodes
About the show
The 80,000 Hours Podcast features unusually in-depth conversations about the world's most pressing problems and how you can use your career to solve them. We invite guests pursuing a wide range of career paths — from academics and activists to entrepreneurs and policymakers — to analyse the case for and against working on different issues and which approaches are best for solving them.
Get in touch with feedback or guest suggestions by emailing [email protected].
What should I listen to first?
We've carefully selected 10 episodes we think it could make sense to listen to first, on a separate podcast feed:
Check out 'Effective Altruism: An Introduction'
Subscribe here, or anywhere you get podcasts:
If you're new, see the podcast homepage for ideas on where to start, or browse our full episode archive.








