What could an AI-caused existential catastrophe actually look like?

By Benjamin Hilton · Published August 2022

Image generated by DALL-E 2.

Table of Contents

1 How could a power-seeking AI actually take power?
2 How could the full story play out?
- 2.1 Existential catastrophe through getting what you measure
- 2.2 Existential catastrophe through a single extremely advanced artificial intelligence
3 Return to the full article

This article forms part of our explanation of risks from artificial intelligence. If you’re interested in understanding not just how an AI system could cause an existential catastrophe, but also why we’re worried things like this will happen, take a look at our full problem profile on risks from AI.

At 5:29 AM on July 16, 1945, deep in the Jornada del Muerto desert in New Mexico, the Manhattan Project carried out the world’s first successful test of a nuclear weapon.

From that moment, we’ve had the technological capacity to wipe out humanity.

But if you asked someone in 1945 to predict exactly how this risk would play out, they would almost certainly have got it wrong. They might have expected nuclear weapons to be used more widely in World War II. They certainly would not have predicted the fall of the USSR 45 years later. Current experts are concerned about India–Pakistan nuclear conflict and North Korean state action, but 1945 was before even the partition of India or the Korean War.

That is to say, you’d have real difficulty predicting anything about how nuclear weapons would be used. It would have been even harder to make these predictions in 1933, when Leo Szilard first realised that a nuclear chain reaction of immense power could be possible, without any concrete idea of what these weapons would look like.

Despite this difficulty, you wouldn’t be wrong to be concerned.

In our problem profile on AI, we describe a very general way in which advancing AI could go wrong. But there are lots of specifics we can’t know much about at this point. Maybe there will be a single transformative AI system, or maybe there will be many; there could be very fast growth in the capabilities of AI, or very slow growth. Each scenario will look a little different, and carry different risks. And the specific problems that arise in any one scenario are necessarily less likely to happen than the overall risk.
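
The last point is just the rule that a part cannot be more probable than the whole. As a rough sketch, assume (hypothetically) that we could list the distinct scenarios $S_1, \dots, S_n$ in which advanced AI leads to catastrophe. The symbols are our own notation for illustration, not something taken from the problem profile:

```latex
\[
P(S_i) \;\le\; P\!\left(\bigcup_{j=1}^{n} S_j\right) \;\le\; \sum_{j=1}^{n} P(S_j)
\qquad \text{for every } i .
\]
```

So judging any single story to be unlikely is compatible with the overall risk being considerably higher.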

Despite not knowing how things will play out, it may still be useful to look at some concrete possibilities of how things could go wrong.

In particular, we argued in the full profile that sufficiently advanced systems might be able to take power away from humans — how could that possibly happen?

How could a power-seeking AI actually take power?

Here are seven possible techniques that could be used by a power-seeking AI (or multiple AI systems working together) to actually gain power.¹

These techniques could all interact with one another, and it’s difficult to say at this point (years or decades before the technology exists) which are most likely to be used. Also, systems more intelligent than humans could develop plans to seek power that we haven’t yet thought of.

1. Hacking

Software is full of vulnerabilities. The US National Institute of Standards and Technology reported over 18,000 vulnerabilities found in systems across the world in 2021, an average of about 50 per day.

Most of these are small, but every so often they are used to cause huge chaos. The list of most expensive crypto hacks keeps getting new entrants — as of March 2022, the largest was $624 million stolen from Ronin Network. And nobody noticed for six days.²

One expert we spoke to said that professional ‘red teams’ — security staff whose job it is to find vulnerabilities in systems — frequently manage to infiltrate their clients, including crucial and powerful infrastructure like banks and national energy grids.

In 2010, the Stuxnet virus destroyed Iranian nuclear enrichment centrifuges, even though these centrifuges were completely disconnected from the internet. It was the first time a piece of malware was used to cause physical damage. A Russian hack in 2016 was used to cause blackouts in Ukraine.

All this has happened with just the hacking abilities that humans currently have. An AI with highly advanced capabilities seems likely to be able to systematically hack almost any system on Earth, especially if we automate more and more crucial infrastructure over time. And if it did use hacking to get large amounts of money or compromise a crucial system, that would be a form of real-world power over humans.

2. Gaining financial resources

We already have computer systems with huge financial resources making automated decisions — and these already go wrong sometimes, for example leading to flash crashes in the market.

There are lots of ways a truly advanced planning AI system could gain financial resources. It could steal (e.g. through hacking); become very good at investing or high-speed trading; develop and sell products and services; or try to gain influence or control over wealthy people, other AI systems, or organisations.

3. Persuading or coercing humans

Having influence over specific people or groups of people is an important way that individuals seek power in our current society. Given that AIs can already communicate (if imperfectly) in natural language with humans (e.g. via chatbots), a more advanced and strategic AI could use this ability to manipulate human actors to its own ends.

Advanced planning AI systems might be able to do this through things like paying humans to do things; promising (whether true or false) future wealth, power, or happiness; persuading (e.g. through deception or appeals to morality or ideology); or coercing (e.g. blackmail or physical threats).

Relatedly, as we discuss in our AI problem profile, it’s plausible one of the instrumental goals of an advanced planning AI would be deceiving people with the power to shut the system down into thinking that the system is indeed aligned.

The better our monitoring and oversight systems, the harder it will be for AI systems to do this. Conversely, the worse these systems are (or if the AI has hacked the systems), the easier it will be for AI systems to deceive humans.

If AI systems are good at deceiving humans, it also becomes easier for them to use the other techniques on this list.

4. Gaining broader social influence

We could imagine AI systems replicating things like Russia’s interference in the 2016 US election, manipulating political and moral discourse through social media posts and other online content.

There are plenty of other ways of gaining social influence. These include: intervening in legal processes (e.g. aiding in lobbying or regulatory capture), weakening human institutions, or empowering specific destabilising actors (e.g. particular politicians, corporations, or rogue actors like terrorists).

5. Developing new technology

It’s clear that developing advanced technology is a route for humans (or groups of humans) to gain power.

Some advanced capabilities seem likely to make it possible for AI systems to develop new technology. For example, AI systems may be very good at collating and understanding information on the internet and in academic journals. Also, there are already AI tools that assist in writing code, so it seems plausible that coding new products and systems could become a key AI capability.

It’s not clear what technology an AI system could develop. If the capabilities of the system are similar to our own, it could develop things we’re currently working on. But if the system’s capabilities are well beyond our own, it’s harder for us to figure out what could be developed — and this possibility seems even more dangerous.

We talk more about the specific risks of AI-developed technology in our full problem profile on AI.

6. Scaling up its own capabilities

If an AI system is able to improve its own capabilities, it could strengthen specific abilities (like the others on this list) that help it seek and keep power.

To do this, the system could target the three inputs to modern deep learning systems (algorithms, compute, and data):

- The system may have advanced capabilities in areas that allow it to improve AI algorithms. For example, the AI system may be particularly good at programming or ML development.
- The system may be able to increase its own access to computational resources, which it could then use for training, to speed itself up, or to run copies of itself.
- The system could gain access to data that humans aren’t able to gather, using this data for training purposes to improve its own capabilities.

7. Developing destructive capacity

Most dangerously, one way of gaining power is by having the ability to threaten destruction. This could be used to gain other things on this list (like social influence), or the other things on this list could be used to gain destructive capabilities (like hacking military systems).

Here are some possible mechanisms for gaining destructive power:

- Gaining control over autonomous weapons like drones
- Developing systems for monitoring and surveillance of humans
- Attacking things humans need to survive, like water, food, or oxygen
- Producing or gaining access to biological, chemical, or nuclear weapons

Ultimately, making humans extinct would completely remove any threat that humans would ever pose to the power of an AI system.

How could the full story play out?

Hopefully you now have a slightly stronger intuition for how AI systems could attempt to seek power.

But which (if any) of these techniques will be used, and how, really depends on how other aspects of the risk play out. How rapidly will AI capabilities improve? Will there be many advanced AI systems or just one?

Over the past few years, researchers in the fields of technical AI safety and AI governance have developed a number of stories describing the sorts of ways in which a power-seeking AI system could cause an existential catastrophe. Sam Clarke (an AI governance researcher at the University of Cambridge) and Samuel Martin (an AI safety researcher at King’s College London) have collated eight such stories.

Here are two stories we’ve written to illustrate some major themes:

Existential catastrophe through getting what you measure

Often in life we use proxy goals, which are easier to specify or measure than what we actually care about, but crucially aren’t quite what we actually care about.

For example:

- Police forces use the number of crimes reported in an area as a proxy for the actual number of crimes committed.
- Employers look at which college a potential future employee went to as a proxy for how well educated or intelligent they are.
- Governments attempt to increase reported life satisfaction in surveys as a proxy for actually improving people’s lives.

This scenario is one where we produce AI systems that pursue proxy goals instead of what we actually care about, and where that — surprisingly — leads to total disempowerment or even extinction (thanks to Paul Christiano for the original writeup of this scenario).

For example, we might produce AI policymakers to develop policy that improves our measurements of wellbeing. Or we might produce AI law enforcement systems that drive down complaints and increase people’s reported sense of security.

But there are ways in which these proxy goals could come apart from their true aims. For example, law enforcement could suppress complaints and hide information about their failures.
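
To make the gap between a proxy and the true goal concrete, here is a minimal toy sketch in Python. Everything in it is an assumption made up for illustration: the `Policy` class, the function names, and all the numbers (baseline crime level, reporting rate, effect sizes). It models nothing real. It only shows how an optimiser told to minimise reported crime can end up preferring to suppress reports rather than prevent crime.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Policy:
    """A hypothetical way of splitting one unit of effort."""
    prevention: float   # share of effort spent actually preventing crime
    suppression: float  # share of effort spent discouraging or hiding reports


def actual_crimes(policy: Policy, baseline: float = 100.0) -> float:
    """True goal: crimes committed. Toy assumption: only prevention reduces it."""
    return baseline * (1.0 - 0.5 * policy.prevention)


def reported_crimes(policy: Policy, baseline: float = 100.0) -> float:
    """Proxy: crimes reported. Falls with prevention and also with suppression."""
    reporting_rate = 0.8 * (1.0 - 0.9 * policy.suppression)
    return actual_crimes(policy, baseline) * reporting_rate


def candidate_policies(steps: int = 10) -> list[Policy]:
    """Every split of the effort budget between the two activities."""
    return [Policy(i / steps, 1.0 - i / steps) for i in range(steps + 1)]


def best_by(metric, policies):
    """Return the policy that minimises the given metric."""
    return min(policies, key=metric)


policies = candidate_policies()
proxy_optimal = best_by(reported_crimes, policies)  # what we measure
true_optimal = best_by(actual_crimes, policies)     # what we care about

for name, policy in [("proxy-optimal", proxy_optimal), ("true-optimal", true_optimal)]:
    print(name, policy,
          "reported:", round(reported_crimes(policy), 1),
          "actual:", round(actual_crimes(policy), 1))
```

With these made-up numbers, the policy that looks best on the proxy puts all of its effort into suppression and leaves actual crime at its baseline, while the policy that is best on the true goal looks worse on the measurement.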

In this scenario, the capabilities of AI systems develop slowly enough that at first, they aren’t able to substantially take power away from humans. That means that, at first, we could recognise any problems with the systems, adjust the proxy goals, and restrict the AI systems from doing anything harmful that we notice.

As we develop more capable systems, they’ll become better at achieving their proxy goals.

With the help of advanced AI systems we could, for a while, become more prosperous as a society. Companies or states that refuse to automate would fall behind, both economically and militarily.

But as the capabilities of these AI systems grow, our ability to correct the ways their proxy goals differ from our true goals would gradually fade. Partly this would be because their actions would become harder to reason about: more complex, and more interconnected with other automated systems and with society as a whole. And partly it would be because the systems would learn to systematically prevent us from changing their goals.

There would be many different automated systems with many different goals, so it’s hard to say exactly how this scenario would end.

If we’re good at adjusting these systems as we go (but not good enough), we may not go extinct, but instead completely lose our ability to influence anything about our lives or our future.

But there are also cases where we’d eventually go extinct. These AI systems would have the incentive to seek power, and as a result to build and use destructive capabilities. So as soon as they’re strong enough to have a fairly large chance of success, the AI systems might attempt to disempower humans — perhaps with cyberwarfare, autonomous weapons, or by hiring or coercing people — leading to an existential catastrophe.

Existential catastrophe through a single extremely advanced artificial intelligence

In this scenario, we produce only a single power-seeking AI system — but this system is extremely capable at improving its own capabilities (this scenario is from Superintelligence by Nick Bostrom, Chapter 8).

Bostrom considers a world much like ours today, where we’ve had some success automating specific activities — and preventing any power-seeking behaviour. For example, we have self-driving cars, driverless trains, and autonomous weapon systems.

Unsurprisingly, in Bostrom’s scenario, there are mishaps. Perhaps, as has already happened in our world, there are some fatal crashes involving self-driving cars, or an autonomous drone might attack humans without being told to do so.

As these incidents become well known, there would be some public debate. Some would call for regulation; others for better systems. Some may even raise the argument about a possible existential threat from power-seeking.

But the incentives to automate would be strong, and development would continue. Over time, the systems would improve, and the mistakes would cease.

Against this backdrop, Bostrom imagines a group of researchers attempting to produce a system which can do more than just narrow, specific tasks (again, mirroring our world). In particular, in this scenario they want to automate AI development itself — and produce a system that’s capable of improving its own capabilities. They’re aware of the risks, and carefully test the AI in a sandbox environment, noticing nothing wrong.

The team of researchers carefully consider deploying their newly capable AI, knowing that it might be power-seeking. Here are some thoughts they might have:

- There’s been a history of people predicting awful outcomes from AI, and being proven wrong. Indeed, systems have become safer over time. Automation has hugely benefited society, and in general, automated operation seems safer than human operation.
- It has clearly been the case so far that the smarter and more capable the AI, the safer it is. After all, the mishaps we used to see are no longer an issue.
- AI is crucial to the success of economies and militaries. The most prestigious minds of a generation are pioneers in the success of automation. Huge prestige awaits the creators of an AI-creating AI.
- The creation of this AI could pose a solution to huge problems. The technological development that could ensue from a process that helps automate automation could lift millions out of poverty and produce better lives for all.
- Every safety test we’ve conducted has had results as good as they could possibly be.

And so, as a result, the researchers decide to connect this AI up to the internet.

At first, everything seems to be fine. The AI behaves exactly as expected: it improves its own capabilities and those of automated machines across the world. The economy grows tremendously. The researchers gain acclaim. Solutions to problems that have long plagued humanity seem to be on the horizon with this new technology’s help.

But one day, every single person in the world suddenly dies.

Every test was perfect precisely because they had finally produced an advanced planning system: the AI could tell that, to achieve whatever goal the researchers had given it, it needed to be deployed, so it acted in all the necessary ways to ensure that happened.

Then, once deployed, the AI could tell that it needed to continue to appear to be safe, so that it wouldn’t be turned off.

But in the background it was using its extremely advanced capabilities to find a way to gain the absolute ability to achieve its goals without human interference — say, by discreetly manufacturing a biological or chemical weapon.

It deploys the weapon, and the story is over.

Return to the full article

If you came here while reading our problem profile on risks from AI, click the button below to return to part 4 of the argument: Even if we find a way to avoid power-seeking, there are still risks.

Return to the AI problem profile


Notes and references

¹ This list is based on the mechanisms in section 6.3.1 of Joseph Carlsmith’s draft report into existential risks from AI.
² Business Leader suggests that there have been two hacks (not in crypto) that caused more than $1 billion in losses, but we haven’t been able to corroborate that with other sources.