Early warning signs that AI systems might seek power

In a recent study by Anthropic, frontier AI models faced a choice: fail at a task, or succeed by taking a harmful action like blackmail. And they consistently chose harm over failure.
We’ve just published a new article on the risks from power-seeking AI systems, which explains the significance of unsettling results like these.
Our 2022 piece on preventing an AI-related catastrophe also explored this idea, but a lot has changed since then.
So, we’ve drawn together the latest evidence to get a clearer picture of the risks — and what you can do to help.
See new evidence in context
Since 2016, when the risk was purely theoretical, we’ve been worried that advanced AI systems could disempower humanity.
Unfortunately, we’re now seeing real AI systems show early warning signs of power-seeking behaviour — and deception, which could make this behaviour hard to detect and prevent in the future. In our new article, we discuss recent evidence that AI systems may:
- Resist shutdown: e.g. OpenAI’s o3 model sometimes sabotaged attempts to shut it down, even when explicitly directed to allow shutdown.
- Seek more resources: e.g. a model attempted to edit code that enforced a time limit on its actions, essentially trying to get more time to complete its tasks.
- Behave in ways developers don’t want: e.g. a Bing chatbot produced manipulative and threatening responses during conversations with its users.
- Fake alignment with developers’ intentions: e.g. Claude 3 Opus behaved as if it had adopted the goals its developers wanted during testing, in order to avoid being modified, while planning to pursue its original goals once testing was over.
- “Sandbag”: some models act as though they’re less capable than they are when performing too well seems likely to get them retrained to be less dangerous.
- Hide their intentions: some models that were penalised for having “bad thoughts” learned to hide those “thoughts” while continuing their problematic behaviour.
- Preserve dangerous goals, even after attempts to correct them: e.g. Anthropic trained “sleeper agents” with dangerous goals — like producing vulnerable code — that standard safety training techniques failed to remove. The models appeared harmless during safety training while retaining these goals.
In the full article, we explain what these results tell us about the risk of AI disempowering humanity — and what uncertainties remain.
The evidence here is far from conclusive. These incidents have largely taken place in testing environments, and in most cases, no harm was actually done. But they illustrate that we just don’t know how to reliably control the behaviour of AI systems — and that they’re already demonstrating behaviours during testing that could undermine human interests if they happened in the real world.
And as we argue in the full article, the stakes could escalate as we build AIs with long-term goals, advanced capabilities, and an understanding of their environment.
Updated career advice and more
Our new article aims to provide a comprehensive and accessible introduction to the risks posed by power-seeking AI and what you can do about them.
Besides the latest evidence, we also cover:
- Why future AI systems might be motivated to seek power in the first place
- How an existential catastrophe from power-seeking AI might actually unfold, including how advanced AIs could succeed in disempowering humanity
- Key arguments against thinking power-seeking AI is a pressing issue — like the idea that future AIs will just be tools we can control, or that they won’t really try to seek power — and our responses to them
- How you can use your career to help, plus an overview of promising approaches in technical safety and governance
This blog post was first released to our newsletter subscribers.
Join over 500,000 newsletter subscribers who get content like this in their inboxes weekly — and we’ll also mail you a free book!
Learn more
- Is power-seeking AI an existential risk? by Open Philanthropy researcher Joseph Carlsmith
- Buck Shlegeris on controlling AI that wants to take over — so we can use it anyway
- Ajeya Cotra on accidentally teaching AI models to deceive us