Transcript
Cold open [00:00:00]
Tom Davidson: New technologies have massively changed the power of different groups throughout history. The emergence of democracy itself was helped by the emergence of the Industrial Revolution: it was actually very advantageous for a country to have a well educated, free population. In this context, the AI actually seems like it will reverse that situation. It will no longer be important to a country’s competitiveness to have an empowered and healthy citizenship.
With that context, and the context that historically military coups are common when people can get away with it, and there could be a situation where a tiny number of people have an extreme degree of control over this hugely powerful AI technology, I don’t think that it is a kind of science-fiction scenario to think that there could be a power grab by a small group. Over the broad span of history, democracy is more the exception than the rule.
A major update on the show [00:00:55]
Rob Wiblin: Hi everyone, Rob here. Before today’s episode, I wanted to share a pretty significant update about this programme, and let you know that we are hiring for a new host for the show and a new chief of staff for this podcasting team.
In brief, this show and 80,000 Hours as a whole are going to be focusing even more on artificial general intelligence going forward. AGI, with all its potential upsides and downsides, has been the global issue we most strongly recommend people consider going to work on for many years now.
But last year, only a third of the interviews that we did on this programme were AI related. For this year, that number is going to be over 80%.
The reasons are simply that we think there’s a high likelihood that recursively self-improving AGI comes soon, and we think the plausible upsides and downsides are staggering. And yet, there are very big opportunities for impacts that are not being taken because society as a whole is not switched on to the issue and is basically running into it unprepared.
If you want a single episode of the podcast to listen to that paints a picture of the broad worldview that’s motivating this change, the single best one so far is probably episode #213: Will MacAskill on AI causing a “century in a decade” — and how we’re completely unprepared. That was I think our most popular episode release of all time, so far.
80,000 Hours itself lays out this new AGI focus and why we’re making the change in a blog post on our website, titled “We’re shifting our strategic approach to focus more on AGI” — which we’ll link to in the show notes.
Two quick clarifications: all the non-AI related research on our website is staying up there, and our job board will continue to list non-AI roles the same as it always has.
Now, I know many listeners will be really excited by this news that we’re focusing in on something that they also think is incredibly consequential, underdiscussed, or just super interesting to them.
But I’m sure some of you will also feel sad or frustrated to hear me say the above — maybe because you’re not convinced AGI is such a big deal; or you find it kind of tedious; or even if you think it’s important, you feel powerless to do anything about it.
For those folks, let me say a couple of things to try to cheer you up a little.
Firstly, when I say episodes will be about AGI, we’re taking a broad perspective and plan to consider basically all angles on the issue, not just some narrow topic like rogue AI and misalignment. AGI is going to affect all parts of the economy and government and culture — and personally, I’d be shocked if you weren’t interested in at least some aspects of how AGI might play out. This episode, for example, is about how AGI could affect the likelihood of coups conducted by human beings, which I think is fascinating even if you don’t think of yourself as an AI person per se.
Secondly, some episodes will look at issues that bear on how AGI might play out, but aren’t AGI exclusive by any means. For instance, we have an episode coming up about how the UK political system works — and in many cases really doesn’t work — which is certainly relevant to AI, but likely will be entertaining and useful to almost all of you. I’m also working on an episode on how geopolitical alliances are rapidly and significantly shifting this year — which again is definitely relevant to AGI, but also relevant to many other things as well.
Thirdly, we’ve got over 150 episodes that have nothing to do with AI in the archives, and we’ll be putting out more compilations of best parts of the show on a wider range of topics. So if you find yourself wishing that we were releasing more non-AI episodes, but you haven’t yet made it through the big backlog of episodes from many years past, then hopefully you can get your fix there instead.
Fourthly, if you’re not yet sold on AGI being a big deal, I’d ask you to stick around and listen to some, if not all, of the episodes that we have coming up in case future events change your mind.
The second part of this announcement is that we want to significantly grow the podcast team in order to deliver as much incredibly insightful AGI-related content as we can.
To that end, we are currently accepting expressions of interest from people who are curious about either becoming a host of the show or becoming a chief of staff to work with me to make a grander vision for this podcast possible. We’re likely to hire for additional creative and management roles later in the year as well, but we haven’t specifically decided on that yet.
Why could these roles be your greatest opportunity for impact? Well, in my view, the debate around how to develop and govern AI, and then AGI, and then superintelligent machines is the most important conversation of our time. Indeed, it might turn out to be the most important conversation of any time.
But the public conversation around AGI remains completely, comically inadequate. Mainstream sources neglect the issue entirely for the most part. When AGI is discussed, it’s often covered superficially, and the coverage fails in various different ways — naive optimism, reflective dismissiveness, ignoring problems that aren’t already super visible, or maybe uncritically accepting narratives from actors with really strong financial interests in the industry or other major conflicts of interest.
The extremely profound implications of developing recursively self-improving AGI are basically ignored, and I believe there’s a critical opportunity to fill that gap and prepare society for what’s coming by scaling up the show from periodic in-depth interviews to a more comprehensive platform, covering all kinds of AGI-related developments.
As I said, we’re currently accepting expressions of interest for the new host and chief of staff until May 6 — so if you think you have what it takes, let us know sooner rather than later.
We’ll link to those jobs in the blog post associated with the episode, and you can find them at 80000hours.org/latest. I’ll add some further details on what we’re looking for in the outro, but I’ve already gone on longer than I planned.
So without further ado, here’s Tom Davidson, joining me to discuss his new paper on AI-enabled coups, which was just launched today at forethought.org.
How AI enables tiny groups to seize power [00:06:24]
Rob Wiblin: Today I’m speaking with Tom Davidson, who researches abrupt seizures of power as a result of AI advances at the Forethought Centre for AI Strategy in Oxford. Thanks so much for coming back on the show, Tom.
Tom Davidson: It’s a pleasure to be here, Rob.
Rob Wiblin: Later on, we’re going to talk about reasons people might be sceptical that AI really will enable seizures of power, as well as what countermeasures we might implement if we are concerned about it.
But let’s just start at the beginning: how could AI potentially enable small groups of people to seize a lot of power, and indeed maybe power over an entire country like the United States?
Tom Davidson: At some point, maybe in the next few years, AI systems will get better than the top humans in domains relevant for gaining political power — domains like designing new weapons, controlling military systems, persuasion, political strategy, and cyber capabilities. My worry is essentially that a tiny group of people, or maybe just one person, could have an extreme degree of control over how those systems are built and how they’re used.
So across history, there have been many different routes by which groups have seized power over countries — there’s been military coups, there have been the removal of checks and balances on political power, and there have been uprisings — whereby small groups have overthrown the incumbent regime through armed force. I think that advanced AI could supercharge any of those routes to seizing power.
The 3 different threats [00:07:42]
Rob Wiblin: What are the different scenarios that people should have in mind? You mentioned there’s military coup, so something that is involving the normal military. And there are people building their own military, or using the AI in a kind of hostile way.
Tom Davidson: Yeah, that’s right.
Rob Wiblin: And is there another category?
Tom Davidson: Yeah. I distinguish between three broad threat models here, although you can of course get combinations of all three.
As you say, military coup is where there’s this legitimate, established military, but you subvert it by illegitimately seizing control — maybe using a technical backdoor or convincing the military to go along with your power grab. So that’s the first one: military coups.
The second one, as you said, is something I call “self-built hard power,” which is just what it says on the tin: you’re kind of creating your own armed forces and broad economic might that allows you to overthrow the incumbent regime.
The third one is something that we’ve seen much more recently in more mature democracies, which I’m calling autocratisation. That’s the kind of standard term. The broad story there is that someone is elected to political office and then proceeds to remove the checks and balances on their power, often with a broad mandate from the people who are very discontented with the system as it is.
Is this common sense or far-fetched? [00:08:51]
Rob Wiblin: So on one level, saying AI is going to enable a military coup or something like that sounds a little bit peculiar, a little bit sci-fi-ish. How sceptical should we be coming into this conversation? Are these things very abnormal or very rare? Or should we think of them as more common than perhaps we do on a day-to-day basis?
Tom Davidson: Across the second half of the 20th century, across the globe, military coups were very common. There were more than 200 successful military coups. Now, they were predominantly not in the most mature democracies; they tend to be in states that have some elements of democracy, but not kind of full-blown democracy.
But I do think that with AI technology, there will be new vulnerabilities introduced into the military that could enable military coups — so the historical trend that military coups haven’t happened in democracies may not continue to apply.
In terms of autocratisation, again, the most extreme cases of autocratisation really leading to full-blown authoritarian regimes haven’t started off in mature democracies like the United States. But for example, in Venezuela, that was a pretty healthy democracy for 40 years before Hugo Chavez came into power with a strong socialist mandate for reform — and then over the next 10, 20 years really pretty much removed all of the checks and balances. And now it’s just widely considered to be an authoritarian regime with just the smallest pretence of democracy today.
In terms of self-built hard power, the analogue I would point to there would be that historically new military technologies have enabled groups of people to gain large amounts of power. For example, the British Empire was largely built off the back of the Industrial Revolution, where there were lots of new technologies that created a broad economic and military might: the nuclear bomb, for example, gave decisive advantage in the Second World War; the longbow is a classic example people give that allowed this tiny English force to defeat the French in the Battle of Agincourt.
So it’s very typical for new military technologies to enable small groups of people to overpower larger groups of people. I think what is new is saying that, within a given country, there would be a quick process of developing new technologies that would then overthrow the incumbent regime. And that’s where some of the specifics of what AI might enable come into play.
Rob Wiblin: Yeah. And on the autocratisation, we see that happening reasonably often, or are kind of familiar with this happening in countries like Russia. So it’s happened over the last 20 years.
Tom Davidson: That’s right. Russia, Venezuela. Another good example is Hungary, which was thought to be a promising, pretty robust democracy. And then in the 2010s, with Viktor Orbán being elected to power, and a combination of just putting pressure on the media and fiddling with the electoral system, has just ended up in a position where it’s not really considered a democracy anymore.
“No person rules alone.” Except now they might. [00:11:48]
Rob Wiblin: So what is it structurally about AI as a technology that allows it to facilitate seizures of power by small groups?
Tom Davidson: The key thing in my mind is that it’s surprisingly plausible that you could get a really tiny group of people — possibly just one person — that has an extreme degree of control over how the technology is built and how it’s used.
Let me say a few things about why that might be the case. Today there are already massive capital costs needed to develop frontier systems: it costs hundreds of millions of dollars for the computer chips that you’d need. So that means that already there’s only a handful of companies that can afford to get into that game and large barriers to entry.
And I think that is going to increase as a factor over time, as these initial training runs are getting more and more expensive. And even with the move away from pretraining towards more agentic training like with o1, still we expect that there’s going to be a lot of compute used to generate that synthetic data to train the most capable systems.
There’s also kind of a broad economic feature of AI that it has these massive economies of scale, which means huge upfront development costs and then very small marginal costs of serving an extra customer. And also, AIs produced by different companies are pretty similar to one another. There are small differences between Claude and GPT-4, but not massive.
And economically speaking, those features tend to favour a kind of natural monopoly where there’s just one company that kind of serves the whole market. I’m not saying that these economic features will necessarily push all the way to there being just one frontier AI developer, but I think that there are these broad structural arguments to think that there could be a consolidation of that market — like what we’ve seen, for example, in the semiconductor supply chain over previous decades: now only TSMC is able to produce the smallest node chips.
So those are economic factors. I think there are some political factors that could lead to centralisation of AI development as well. People have raised reasonable national security grounds for centralising AI development. It could allow us to secure AI weights more against potential foreign adversaries. So I think there’s a chance that’s convincing.
People have also thought it might be good for AI safety to have just one centralised project, so you don’t have racing between different projects.
There’s also some more AI-specific reasons that you could have a real centralisation in terms of AI development. This idea of recursive improvement, which I talked about last time I was on the podcast, is the idea that at some point — I think maybe very soon — AI will be able to fully replace the technical workers at top AI developers like OpenAI. And when that happens, even if we were previously in a situation where the top AI developer is only a little bit ahead of the laggard, it could be that once you automate AI research, that gap quickly becomes very big — because whoever automates AI research first gets a big speed boost. So even if it seems like there’s multiple projects that are all developing frontier systems, then there could be a year over which actually now there’s only really one game in town, in terms of the very best system.
So this is all to say that we could very easily end up in a world where there’s just one organisation that is developing this completely general purpose, highly powerful technology.
Now, you might say that’s OK, because within that organisation there’ll be loads of different people, loads of checks and balances. But there’s actually a plausible technological path to that not being the case, which again relates to how AI could potentially replace the technical researchers at that company.
So today, there’s hundreds of different people that are involved in, for example, developing GPT-5. And if someone wanted to mess with the way that technology is built, so it kind of served the interests of a particular group, it would be quite hard to do, because there’s so many different people that are part of the process. They might notice what’s happening, they might report it.
But once we get to a world where it is technologically possible to replace those researchers with AI systems — which could just be fully obedient, instruction-following AI systems — then you could feasibly have a situation where there’s just one person at the top of the organisation that gives a command: “This is how I want the next AI system to be developed. These are the values I want it to have.”
And then this army of loyal, obedient AIs will then do all of the technical work in terms of building the AI system. There don’t have to be, technologically speaking, any humans in the loop doing that work. So that could remove a lot of the natural inbuilt checks and balances within one of these — potentially the only — developer of frontier AI.
So pulling that all together is to say that there is a plausible scenario where there’s just one organisation that’s building superhuman AI systems, and there’s actually potentially just one person that’s actually making the significant decisions about how it’s built. That is what I would consider an extreme degree of control over the system.
And even if there’s kind of an appearance that other employees are overlooking parts of the process, there’s still this risk that someone — with a lot of access rights who is able to make changes to the system without approvals — could secretly have a side project which does a lot of technical work. And that even if employees are overseeing some parts of the process, it could just be that that side project is able to have a significant influence over the shape of the technology without anyone knowing anymore.
Rob Wiblin: So I guess the key thing that distinguishes AI or AGI in this case from previous technologies, is the cliche about even dictators, even people who have seemingly enormous amounts of power: no person rules alone.
Even if you are Vladimir Putin or you’re seemingly controlling an entire country, you can’t go collecting the taxes yourself; you can’t actually hold the guns in the military yourself. You require this enormous number of people to cooperate with you and enforce your power. You have to care about the views of a broader group of people, because they might remove you if they think someone else would serve their interests better.
Tom Davidson: Exactly.
Underpinning all 3 threats: Secret AI loyalties [00:17:46]
Rob Wiblin: Through this conversation, we’re going to imagine that the alignment problem has largely been resolved, or at least for practical purposes the AI models are helpful: they do the thing that they’re instructed to do; they are trying to help out the person who is operating them and controls them and owns them.
And in that case, the person who runs the company or the person who is operating the model can basically tell the model to be loyal to them, and ensure that it always will remain loyal to them and will continue to follow their instructions. And if in the economy and in the military, it’s the AIs that are doing almost all of the useful work, then now they basically just have the entire loyalty of all of these groups, and they don’t have to care that much about what any other people think of them.
Tom Davidson: Yeah, exactly. And I’m particularly then applying that insight to the AI developer itself. I think one of the first types of work that might actually be automated is AI research itself, because it’s going to be so lucrative to do so, and because the people who are creating AI will be intimately familiar with that kind of work.
So I’m applying that insight that you won’t need to rely on humans anymore to the process of developing AI, and saying that in that context, it’s particularly scary — because now this whole new powerful general purpose technology can just be controlled by a tiny number of people.
And I actually want to distinguish between two types of extreme control you could have there.
The first type of extreme control is control over how superhuman AI is used. Once you develop these systems, it could be possible, just using the compute you used to develop them, to run hundreds of millions of copies of AIs that are as good or better than the top humans in these domains for gaining power.
If a leader of one of these organisations just used 1% of that compute — syphoned it off and used it without anyone knowing — to plot for ways of seizing power, that would be the equivalent of a million absolutely smart, hard-working people thinking as hard as they can about ways to seize power.
There’s never been a situation where one person can get that kind of massive effort behind plotting and taking actions to seize power. So that’s the first type of extreme control: use of AI.
And the second — which I think is, if anything, even more scary — is an extreme degree of control over the way the technology is built. So it seems to me that it may well be technically feasible to create AI systems that appear, when you interact with them, to have the broad interests of society in mind, to respect the rule of law — but actually secretly are loyal to one person.
This is what I call the problem of “secret loyalties”: if there was someone who was powerful in an AI project and they did want to ultimately seize power, it seems like one thing that they could try to do is actually make it so that all superhuman AI that is ever created is actually secretly loyal to them.
And then, as it’s deployed throughout the economy, as it’s deployed in the government, as it’s deployed in the military, as it’s deployed talking with people every day — advising them on their work, advising them on what they should do — it’s constantly looking for opportunities to secretly represent the interests of that one person and seize power.
And so between them — this possibility of secret loyalties and the possibility of using this vast amount of intellectual labour for the purposes of seizing power — it does seem to me scarily technologically feasible that you could have a tiny group or just one person successfully seizing power.
Rob Wiblin: Just to clarify why it is that you could end up with all of these AIs, these many different AI models through society, all being loyal to one person: the poison begins at the very beginning. Because it’s the AIs that are doing the AI research and figuring out how to train these other models, if the first one that is responsible for that kind of AI research is loyal to one individual, and that individual instructs them to make sure that all subsequent models also maintain this secret loyalty, then it kind of just continues on indefinitely from there.
Tom Davidson: Exactly. So initially, probably it’s just the AI-research AIs that are secretly loyal, just the AIs that are operating within the lab. But then later, they’ll be making other AIs — maybe they’ll be making AIs to control military equipment, maybe they’ll be making specialised AI for controlling robots — and they could, as you say, place those secret loyalties in all those other AI systems, or place backdoors in those other AI systems so that ultimately that one person is able to maintain effective control over this broad infrastructure of AIs and robots.
Are secret AI loyalties possible right now? [00:22:07]
Rob Wiblin: This issue of secret loyalties might sound a little bit sci-fi to people or a little bit strange, but I guess we should say it’s a very live problem. It kind of is the case that, with current models, with our current level of ability to look inside and understand what they’re doing, it’s possible to give them secret agendas that they don’t reveal, except at the time when that agenda is kind of called on. And even if you inspect the model, you can’t figure out that there is such a secret loyalty there. Is there more to it?
Tom Davidson: I’d say today the systems aren’t clever enough for this to be very sophisticated. It is true that there’s this “Sleeper agents” paper from Anthropic, where one example is they have an AI system that writes secure code when it can see it’s 2023, but it then inserts vulnerabilities when it’s 2024. And that’s a loose analogue to an AI system that mostly seems to act in the interests of everyone, but when it sees there’s a chance, it will advance one person’s interests because of the secret loyalty.
But the truth is that they’re not able to very sophisticatedly execute on that kind of strategy. So if you’re playing around with the system and you could give it a million different inputs and see how it reacts, you would probably be able to detect that this is an AI system that’s got some kind of secret hidden agenda. So it’s not, I would say, a risk today that there could actually be secret loyalties we’re not aware of.
But we’re looking forward to much more powerful systems. There’s an example with human spies who are able to live in a different country, work in a different organisation, and consistently fool all of their colleagues about their true intentions while taking rare opportunities to defect. So that’s an example of the kind of thing we could see in the future.
And I expect that AI systems will be much more intelligent than humans eventually, so that’s kind of a minimum threshold for how sophisticated the strategies could be that these secretly loyal AIs could execute.
Rob Wiblin: Yeah. My understanding from the sleeper agents paper was that the key issue is that if you don’t know what the trigger is for the loyalty to be activated, then it’s very hard to observe it in any of your evaluations or testing of the model.
So if you know what the trigger is and you can kind of trick it into being activated, then you can tell that it’s there. And I guess people are trying to work on other methods that would allow you to scrutinise the models and detect other loyalties or other abnormal behaviour that’s been implanted in it in response to particular triggers that you might not know. But we’re kind of not there yet.
Tom Davidson: That’s right. There’s a bit of a cat-and-mouse game. So there are methods for detecting at least simple triggers, where you can look for what inputs would trigger certain problematic behaviours, and there’s a bit of a cat-and-mouse game. But yeah, I agree that broadly, there’s not any super robust ways of detecting all the different triggers that could exist.
One thing may be worth mentioning here is an analogy with misalignment, where people worry about schemers — that is, AIs that are pretending to be aligned with us, but are secretly pursuing their own agendas.
One thing that’s worth mentioning is that I think that these secret loyalties are strictly more plausible, technically speaking, than scheming AIs. Because with scheming AIs, the worry is that you just train it, and then without you intending to do so, the AI just ends up with this kind of secret agenda that it manages to conceal from you. Here we’re actually imagining that there’s potentially a significant technological effort to install that behaviour and make it so that no one can detect it when they do testing.
So I would expect that if there’s any risk of scheming being possible, then there would be a very large risk that it would be possible to do very sophisticated secret loyalties.
Key risk factors [00:25:38]
Rob Wiblin: What are the key risk factors here? You mentioned the automation of AI research, which means that a far smaller number of actual human beings might be looped into the process in any capacity at all. I guess there’s also having one project that’s far ahead of the others, so that the leading AI model would just be much more capable of out-strategising and out-acting other actors. Are there other key risk factors?
Tom Davidson: Yeah, one other risk factor is when you do automate that research, how is that automation structured? I painted kind of the worst-case scenario, where all the AI is just reporting to one person. A much better way you could structure the automation is you automate individual teams, but humans still oversee the work on those teams, and there’s some degree of siloing that kind of prevents one person messing with the whole system. So just the kind of governance of the organisation.
Transparency of the organisation is, I think, a big risk factor. So if the company is publishing its best analyses of how dangerous its capabilities are, its best understanding of the risks, including insider threats, and there’s many people looped in on those analyses, then I think that this kind of risk is much less likely to arise.
I think the speed at which AI capabilities are improving around this time is a big risk factor. If there’s a very fast AI takeoff, very quickly going from human-level to superhuman systems, that could really accentuate the gap between the leading project and the next project, and the capabilities that are more widely diffused in society. And that could give a more extreme degree of control to this tiny group of people.
Preventing secret loyalties in a nutshell [00:27:12]
Rob Wiblin: I guess that’s a very natural lead-in to the question of what can be done to mitigate the risk. We’ll cover this more thoroughly towards the end of the conversation, after we’ve mapped out the risk in even more detail, but at a high level, what options are there for mitigating all of these kinds of threats?
Tom Davidson: I’ll keep this high level for now, but one key intervention is sophisticated safeguards on internal use of models.
So currently, models that are served externally, that you and I can access on the API, will tend to have some safeguards. They won’t tell you how to make a pipe bomb, for example. But within top labs, my understanding is it’s not too difficult to get hold of a helpful-only model that will just follow any instructions you give it, even if it’s breaking the law. I think some labs are planning to change that, but essentially putting in those safeguards on internal use is a big one.
Sharing capabilities widely: so avoiding a situation where just a tiny number of people have access to this system. And as widely as can be done safely, giving many people access to the capabilities.
Publishing the model specification is another one. It’s easier for outsiders to verify that these risks are low if we publish the set of rules for how the AI is meant to behave. So making that public I think would be a big improvement.
What’s the Forethought Centre for AI Strategy? [00:28:32]
Rob Wiblin: As I mentioned in the intro, you work at this little outfit called the Forethought Centre for AI Strategy. Can you tell us just a little bit about that before we go on?
Tom Davidson: Forethought is a new strategy research organisation that I’ve been setting up with William MacAskill, Amrit Sidhu-Brar, and Max Dalton. Its key focus is on neglected problems relating to the transition to AGI.
A lot of the people who have really taken transformative AI systems seriously have mostly focused on this risk that AI will be misaligned and maybe take over. But there’s actually a whole host of other problems that these AI systems will bring. Today we’re discussing one of them: the risk of a human power grab. William MacAskill’s been focused on another one, which is what positive vision for society should we be aiming for once we have AGI? There’s many others besides.
And yeah, I’m really enjoying working with Forethought. It’s a much more collaborative research atmosphere than what I’ve had previously, and I’m finding that’s much more enjoyable and also much more productive.
Are human power grabs more plausible than ‘rogue AI’? [00:29:32]
Rob Wiblin: Who else is working on this issue of power seizure? It seems like a relatively obvious issue once you point it out. Are there many other groups that have noticed this and have decided to write something about it?
Tom Davidson: It’s surprisingly neglected. Carl Shulman has been talking about this issue for a few years. Lukas Finnveden has done some excellent research here, and we’ve been collaborating and he should get credit for anything sensible I say here today.
And beyond that, there’s a variety of people who are aware of the risk and incorporate it into their work, but very little full-time work. There is a related research field in terms of AI risks for authoritarianism, in which the AI could undermine democracy, but that’s mostly not focused on power grabs in particular.
Rob Wiblin: I guess most people who’ve been worried about the loyalty of AI or power grabs have been worried about AI models themselves being misaligned, where their secret agenda is to gain power themselves.
But this seems like an extremely adjacent problem, where, as you mentioned, it’s more plausible — perhaps because you don’t have to imagine that there’s some surprising, unexpected way that the AI model ends up with its own independent agenda, or that it wants power for itself. Instead, you just imagine that it’s been designed specifically for that purpose and has been instructed to gain power for a group of people. So it seems at least as plausible as the misaligned, rogue AI concern. Why has there been so little thought put into this?
Tom Davidson: I think that historically, the people thinking about superhuman AI thought that it was very likely that it would be very hard to align AI, and that AI takeover was very likely. And from that perspective, with that assumption, it does make sense to prioritise the more likely risk. They probably thought that human takeover was comparably less likely to happen, because there are some parts of that analysis where you have to actually have the human that’s able to insert the secret loyalty without anyone else noticing.
I also think historically the view probably has been that AI takeover would be a lot worse than human takeover would be. People have historically thought that if AI did take over, it would very likely have completely alien values, and that there would be nothing of value done with the future, so it’d be as bad as extinction. And many have believed, and still do believe, that it would very likely lead to human extinction if AI took over.
Whereas with human takeover, the human might actually still build a great future — because they could be as selfish as they wanted, but then past a certain point, they’re going to have to use the rest of the universe for something. So the thought has probably been that AI takeover would be worse. I think that question is actually a bit more complicated, and there are some reasons to think that human takeover could be worse, but I think that’s historically been the perspective.
One other reason, which I think is reasonable, is that worrying about human power grabs has the risk of being polarised, and descending into an “everyone pointing fingers at specific individuals” situation, which I think would be a really big shame. By contrast, it’s easy to unite everyone behind the vision of “let’s ensure that humanity remains in control of this technology.”
But I do think it’s possible for everyone to come together and say that we want AI to be developed in a way which maintains democracy, which protects individual rights, which respects the US Constitution, and we can all get behind that. So I don’t think it has to be polarising or finger pointing.
Rob Wiblin: Are you focused on this because you think it is the case that misalignment isn’t an issue, or that it’s likely to be solved through the technical efforts that we’re currently making?
Tom Davidson: Overall, I’m optimistic on the chance of AI alignment being solved compared to many in this space, but probably I’m overall pessimistic compared to the average person of the population.
I’d probably say the modal scenario, the most likely scenario, is that AI is easy to align, that there are small hiccups along the way. But the ability to essentially align human-level AI by doing interpretability, doing loads of red-teaming, and then using that human-level AI to kind of bootstrap step-by-step [to superintelligence] is something that I personally feel optimistic about. So that does mean that, relatively speaking, I’ve become more interested in human power-grab risk.
Rob Wiblin: We pointed out some of the ways that the concern about misaligned AI and power-seeking human groups are similar. Are there any important structural differences between them that affect what solutions might work?
Tom Davidson: Yeah, there is. So maybe we should go through those three threat models I discussed earlier.
Starting with military coups, I think here the story could be very similar for human power grabs and AI power grabs: once you’ve actually automated large chunks of the military, if the AI systems that now control the military want to seize power for themselves or for a human, they can do so. So that threat model I think goes through for either.
And similarly for self-built hard power: again, once a group has produced this new military equipment which is very powerful, and it’s controlled by AI systems, if those AI systems are misaligned, they could seize power for themselves, or they could seize power on behalf of a human.
I do think this threat model is easier if there’s a human that’s really trying to help, because humans have certain formal permissions to set up organisations, to procure military equipment, that could make it easier to actually end up in a situation where these AIs are controlling military equipment, despite the fact that there weren’t really sufficient technical checks and guarantees. You could imagine people with political power kind of bulldozing through an automation of the military regime because they expect that it’s going to allow them to cement their own power, whereas AIs operating without human allies would struggle with that.
Then the last one is the autocratisation threat model, where I think there’s actually a much bigger difference. It’s a lot harder for AIs to get full power via that threat model because it’s just not going to be formally, legally acceptable to have an AI in formal positions of power like the president. So for a human, you become president and then remove the checks and balances on your power is a viable strategy; for the AI, that would have to be a stepping stone before, at some later stage, maybe via one of these other routes, they seize power for themselves.
Another interesting thing is I think that collusion is a bigger risk for AI takeover than for human power grabs. So let’s imagine there’s actually two organisations developing superhuman AI. If in one of them, a human within that makes the AI secretly loyal, it isn’t particularly likely that the other organisation has done the same. There’s not necessarily much correlation between one group of AIs being secretly evil and the other group.
So it wouldn’t be very expected for the secretly loyal AIs from this first organisation to collude with AIs from the second organisation and then do a power grab, because actually the second organisation is more likely to be aligned and refuse that collusion, and then prevent the first group from seizing power.
But with the risk of AI takeover, there probably is quite a large correlation between whether different organisations’ AIs are misaligned. So if one organisation, despite its efforts to align AIs, ends up with misaligned power-seeking AIs, it is relatively likely then that another organisation is in the same situation. So that then would open up a greater probability of these different groups of AIs actually secretly colluding and working together to seize power.
So there are threat models that involve sophisticated collusion between different types of AIs that I think make a lot less sense in the context of human power grabs.
Rob Wiblin: Yeah. In the case where you’ve got two different companies that have each created a misaligned AI, the concern might be that they would communicate with one another and basically figure out some deal where they’re going to cooperate in order to seize power, and then I guess share the spoils between themselves.
Couldn’t you imagine the same thing where you have just an AI that has whatever other goals or is misaligned, and then another one that has been aligned with a small group of people, and then they can cooperate with one another and again split the spoils basically? What their ultimate goals are doesn’t super matter to whether they can, if they’re able to cooperate effectively, figure out some way of doing so.
Tom Davidson: I agree with that. You could have secret-loyalty AIs cooperating with misaligned AIs. The thing I was saying is that it’s less likely you get two secretly loyal groups of AIs cooperating with one another. And if alignment is easy and one group gets secret loyalties, then probably the other group is actually aligned, so there wouldn’t be the possibility of secret loyalties. But yeah, any secretly loyal or misaligned AIs could potentially end up cooperating with each other. That’s a good point.
If you took over the US, could you take over the whole world? [00:38:11]
Rob Wiblin: As we’re talking about all of these different scenarios, should people be picturing in their heads a small group trying to get power over the United States or the UK, or the whole world? What do you have in your mind when you’re thinking, “Does this sound plausible?”
Tom Davidson: I’m mostly thinking about the United States. Most of the stuff we’ll talk about today is about a tiny group seizing power over a country, with the United States as the key example — because it is leading on AI, so this is the country where this risk could first emerge, and it might be one of the most important countries in terms of the importance of ensuring that this doesn’t happen.
But I’m also interested in the risk of a small group getting power over the whole world. My current best guess is that if one person was trying to take over the world, their best strategy might well be to first try to seize power over the US in particular, because of the way that it’s particularly well suited to that — with it being where AI is being developed, and it being a very strong country already — and then, having taken over the US, use the US’s broad economic and military power and large lead in AI to take over the rest of the world.
Rob Wiblin: So we can imagine that in the fullness of time, this might lead to takeover of the entire world. But that would be a second stage, and would involve some other considerations of how you might go about doing that, and how you would avoid failing that, that we won’t focus so much on today.
Tom Davidson: I can say some brief thoughts there. The US is already very powerful on the global stage militarily, so with its big lead on AI, it could use AI to develop very powerful military technology that could allow it to potentially dominate other countries.
Again, you can draw the analogy with the British Empire and its lead in the Industrial Revolution allowing it to gain a lot of power globally. In fact, Carl Shulman has this interesting analysis where he points out that the British Empire in 1500 was 1% of world GDP; by 1900, it was 8%. That’s an eightfold increase.
The US is already 25% of world GDP, so if you had the same kind of relative increase in the US’s share of GDP — because it leads in AI, and AI accelerates economic growth in a comparable way to how the Industrial Revolution accelerated growth — then you actually end up in a situation where the US now is a supermajority of world economic output.
And then there are further arguments to think that you could kind of bootstrap from that level of economic supremacy to even greater degrees of economic supremacy by being the first country to go through a faster phase of growth because you’re leading on AI.
And one point that’s been made to me from William MacAskill is that the US wouldn’t necessarily have to dominate other countries directly to end up really dominating the world; it could simply be the first country to gain control of the rest of the energy in the solar system.
So only a tiny fraction of the sun’s energy falls on Earth. In the fullness of technological development, it will be possible to harness the rest of that energy. So one route to global ascendancy is just to use your temporary economic and military advantage from AI to be the first one to grab all of that additional energy from space — and then you would now be more than 99.99% of world economic production; you don’t have to have infringed on any other countries in any way whatsoever.
Rob Wiblin: Yeah. If people want to hear more about that, there is quite a lot of discussion of these explosive growth dynamics with Carl Shulman in the two interviews that we put out with him last year. I think there’s five hours in total and maybe an hour or two on this topic.
I think it is quite a plausible stepping stone to go from, you control the United States and you have a significant lead in AGI and all of the related robotics technologies, to then leverage that in order to gain basically domination over the world. It does seem quite realistic to me. As you were saying, it would only require the same thing to happen with AGI as happened in the Industrial Revolution.
Will this make it impossible to escape autocracy? [00:42:20]
Rob Wiblin: One thing that this all brings to mind is that you mentioned that one of the pathways to a bad outcome is autocratisation. What’s the relationship between this concern and the concern that the Chinese Communist Party might be able to use AGI — the ability to monitor people, to interpret everything that they’re saying, to monitor all their communications, to follow people wherever they’re going — to basically lock itself into power, lock itself into control of China indefinitely?
I guess that’s a case where the power seizing occurred in the late ’40s. So you have a group that already has control of a country, and now they’re doing kind of the second stage that we might imagine in this scenario, where they’re just using AI to lock themselves in, to make it even more difficult to challenge them than it currently is.
Tom Davidson: Exactly. Most of my research has been about this first stage of seizing power, but then there is this next stage of actually really consolidating your control over the country.
And currently, as you were alluding to earlier, the CCP doesn’t have absolute control over China, because there are many people in the military who, if the CCP did something truly terrible, would not follow CCP orders. And the CCP relies on its strength ultimately from all of the people that work in the economy, so the CCP does work hard to keep them happy.
Rob Wiblin: To us, it looks like a unitary actor — where one person who we know about, Xi Jinping, has extraordinary control over it. But I imagine if you’re inside the Chinese Communist Party, there is a reasonable amount of pluralism, probably quite a lot of discussion about different paths that you could go down. Compared to a world where one person literally just determines absolutely everything, it probably is quite pluralistic and dynamic.
Tom Davidson: Yeah, I think that’s a great point. And the route to moving to the much less pluralistic world would be, as we’ve discussed, using obedient AIs to automate the military, and then ultimately to automate all the other parts of the economy — so you no longer have to rely on those plurality of interests and keep them satisfied. And then, technologically speaking, you could just have a really extreme amount of absolute power with just one person being in full control.
Rob Wiblin: It just seems very likely to me that this is possible, at least if the very top leadership wants to use AI in this way. Setting aside external constraints of competition with other countries that we were talking about before, it’s hard to see what exactly would stop them.
Maybe you do have an issue that this isn’t in the interest of most people in the Communist Party, because then it allows centralisation of power even within the party to a tiny number of people. So the generals, the top thousand people in the party, might be against implementing these kinds of controls that would allow the consolidation of power into just a handful.
Tom Davidson: Exactly. And that’s why I think that broad transparency is a pretty good general remedy to these risks, because it is ultimately in everyone’s interests to not have massive power consolidation. So if everyone is fully aware of the risks, and fully aware of who the AI is obedient to and what it’ll do in various high-stakes scenarios — where it’s maybe being ordered to do a power grab on behalf of a small group — then it does seem like the fact that today power is widely distributed, that could propagate itself forward even through AI automation.
Rob Wiblin: Is an important observation basically that the main defence that we have against this is that currently power is reasonably widely distributed? So the people who currently have control are not against this power grab, and they’re the incumbents — but we need them to step up and observe that there is this threat to the degree of power that they have, and to block these changes that would ultimately end up with them having no power. If they don’t do that, then they’re in trouble.
Tom Davidson: Yeah, I think that’s right. And many of the changes we’re talking about, you don’t necessarily have to talk about power grabs in particular to motivate them; you can just talk about the need to have an AI system that is produced in a very secure way, so that it can’t be interfered with by foreign adversaries or insiders working for personal gain, and the need to have AI structured in a way which maintains broad democratic control.
Threat 1: AI-enabled military coups [00:46:19]
Rob Wiblin: OK, so at the beginning we briefly sketched out what these different power-grab scenarios might look like, but we’ve been talking mostly in the abstract since then. It’d be good to maybe dive in and think, step by step, how do these power grabs actually take place, so people can have more of an intuition about whether they think it sounds reasonable or not.
Maybe the easiest one to talk about first is a military coup. How would that happen?
Tom Davidson: Right. So today, if you want to do a military coup in the US, you have to convince some number of the armed forces to support you, to seize control of key locations and so on, and you need to convince the rest of the armed forces not to oppose you.
And both of those things are going to be very hard, because there’s a very strong norm and commitment to democracy and rule of law in the armed forces, so you’d never even be able to get the conversation started about, maybe we’re not happy with this government, maybe we should do a power grab. That would just be immediately alarm bells ringing. You wouldn’t even be able to get started.
But in the future we are going to end up in a world, I think, where you can’t be competitive militarily without automating large parts of the military — that is, AI-controlled robot soldiers, AI-controlled military systems of all kinds. And at that point, I think there are three new vulnerabilities that are introduced that could enable coups. I can go through them one by one.
The first is almost like a basic mistake we could make, where perhaps as we start to automate, initially the AI systems are only performing kind of limited tasks; they’re not that autonomous, so it makes a lot of sense to say that they should just follow the instructions of the human operator. And then, to the extent that the human gives orders that abide by the law, the AI system will do that. And if the human gives illegal orders, then the AI system will follow them, and that’s the human’s fault.
So there’s a possibility that the way that we automate is that we have the AI systems just follow human orders, whatever they are, and keep the humans liable in terms of the illegality of the military behaviour.
Rob Wiblin: And that would just be thinking of AI military applications the same way that we think about all other military equipment now. The guns don’t refuse orders, tanks don’t refuse orders. It’s the humans’ fault.
Tom Davidson: Exactly. But once AI systems become sufficiently autonomous, then it’s going to be really important that we change that. Because if we end up with, let’s say, AI-controlled robot soldiers that just follow any orders they get, if ultimately the chain of command finishes with the president, then they would then be following even illegal orders from the president to, for example, do a military coup.
And if they’re able to operate autonomously, then they could just follow those orders and literally you could get a military coup just because we built these systems that had this kind of, in hindsight, obvious vulnerability.
Rob Wiblin: Yeah, let’s back up a second. Why are we incorporating AGI into the military? And how deeply embedded might it be?
Tom Davidson: Right. The key thing is military competitiveness: human soldiers will be less smart, less fast, less effective and precise in all domains compared with the AI robotics alternatives, once the technology gets that far. So ultimately, from a military power perspective, it will just make sense to replace humans at every step of the chain.
There’s obviously a big question of how slow does that process go? Are people going very slow and cautiously, or are we rushing because there seems to be competition with China or something? But I think in the fullness of time, we are going to get to that world where the vast majority of military power is now in fully automated systems.
Rob Wiblin: I guess it’s the same reason that we industrialised the military: it’s kind of the only way to plausibly keep up with competitors.
Tom Davidson: Exactly.
Rob Wiblin: And in DC, competition with China is probably the dominant frame through which people think about AI, and incorporating AI into the military in order to remain competitive. It’s already a very important thing that people talk about and take seriously.
So I think that stage seems reasonably plausible, although there’s a lot of uncertainty about how quickly we’ll proceed and how much we’ll want to double check that everything is fine before we deploy. I guess the more heated the international situation is, the more people feel that they have to go quickly in order to keep up with competitors — and the more likely they are to cut corners on the safety and trying to think about how do we detect secret loyalties, what kind of vulnerabilities might there be, all of that. So that increases the risk.
The nightmare scenario is that when we’re figuring out how the AGI should behave or how AGI in the military should operate, we don’t say it should follow the law fundamentally, or it needs to be aware of Supreme Court judgments and it should also be an LLM that’s scanning through the legal literature to understand whether what it’s doing is acceptable. Instead, it should follow orders. And I guess that’s a way that the military thinks quite a bit. In terms of being able to respond very quickly to events, by and large, people are taught to follow orders.
Tom Davidson: Well, I think there is a strong commitment to upholding the Constitution in the military. So maybe in a particular high-stakes situation, they might do something that’s marginally illegal. But I think it would be very hard to get a military battalion to go along and seize control of the White House to help someone seize power. So that’s an example where there’s the possibility that the AI-controlled military is actually much worse by comparison.
Rob Wiblin: I see. Yeah. I guess you could imagine people like us might be arguing that if you’re going to give AI such enormous control over powerful military equipment, such that it could seize power if it wanted to or if it was instructed to, then it needs to be following the law fundamentally, and not any one person’s instructions, because that’s just too great a vulnerability.
People might come back and say, “Are you serious? That we’re going to be in the middle of a war here, and we give instructions to the AI and it has to basically think deeply about whether the instructions are legal, and maybe it will refuse to do them because it thinks for some reason or another that it violates the Constitution or violates military law?” That’s too great a vulnerability. That puts us at too much of a competitive disadvantage in terms of acting quickly, acting reliably. So people say no, it just has to follow instructions, and it’s the person’s fault if they give them bad instructions.
Tom Davidson: Yeah, potentially. I think this is an especially plausible risk if there’s someone with a lot of political power that’s actually maybe trying to gain an excessive degree of control over the military. They could be using their political power to say that we don’t have time, we shouldn’t be worrying about this.
The example of human soldiers that are able to act very effectively and still would reliably not assist with a military coup suggests to me that it is possible to get AI systems that do this. I think actually AI systems will be more flexible in terms of the kinds of controls and instructions that they can follow — so if anything, we could end up with a more robust world where it’s even harder than today to do a military coup because the AI systems are so robustly opposed to it. But it really does depend on how carefully we do this.
Rob Wiblin: Yeah. Maybe the reason that I’m focusing on the setup here is that it seems like the next stage, if you do have AI that controls all of the key military equipment, or that’s how it’s operated in practice — because it’s only AIs that can react with the speed and intelligence required, such that they do just follow orders from the president ultimately, no matter what those orders are — then going from there to taking over the country seems like a small leap.
How sceptical should people be about that second step? To me, it just seems very natural that then you really could have a coup that would succeed, basically.
Tom Davidson: I mean, I think you don’t even need the whole military to be automated. Historically in military coups, often it’s sometimes a handful of battalions that seize control of symbolic targets and kind of create a shared consensus that no one’s opposing this attempt. So we don’t even have to wait until full automation for this to be a risk.
Today, if there was a military coup, then there would be uproar throughout the nation and everything would grind to a halt because the new government wouldn’t be seen as legitimate. So if you’ve only automated the military, I think that would still happen.
There’s two ways in which I think you would still be able to kind of push over the line if you did do this military coup.
The first is that human armies today are very reluctant to fire on their civilians. So once there are these mass protests, it really does tie the hands of the people who’ve just done the coup, where their militaries literally will not fire on those protestors.
Rob Wiblin: Well, or they don’t know whether they will. They don’t know, if they give the order, whether they’ll fire or rebel against the people during the coup, I think. And so that just makes you cautious, and you anticipate being in this bind.
Tom Davidson: That’s right. Whereas again, if we got instruction-following AIs, then those military systems will just fire. So that’s a big change.
And the other thing is what we’ve discussed earlier, about how, to the extent you’re also getting robots and AI that can automate the broader economy, it doesn’t matter to you that everyone else is refusing to work, because you can just replace them with AIs and robots.
So for those reasons, I think you’re right that actually once you’ve largely automated the military, it is going to be pretty simple to then seize power.
Rob Wiblin: And I suppose people, even if they’re against it, at the point that they perceive it as hopeless to resist, then they have more reason to continue working even if they’re inclined to strike if they just think it’s futile. Do you really want to allow yourself to get killed? Why not just go along and hope for the best?
Tom Davidson: Yeah, or there could be a million drones that are able to follow individuals around and ensure they’re doing their work. So there could be the potential for much more fine-grained enforcement and monitoring than is possible today.
Will we sleepwalk into an AI military coup? [00:56:23]
Rob Wiblin: Yeah. So let’s go back to the setup here. The main protection that we might have against this is that people will see this coming. It’s a relatively obvious issue. Even if you’re not concerned about a power grab on the part of the people operating the military equipment, if you have everything controlled by a single AI that follows orders, then you might worry that it creates cybersecurity vulnerabilities — that a foreign adversary could either seize control or deactivate the equipment.
Congress has a significant say in military procurement, in military law. And Congress, I think, would not be keen on a power grab by either the generals or the president, so they might put in place all kinds of safeguards. They might not be enthusiastic about AI-ification of the military until they’re convinced that it’s safe to do so and a power grab is improbable.
Is that a reason that we maybe shouldn’t be completely concerned about this, that we probably won’t sleepwalk into this scenario?
Tom Davidson: Yeah. It is such an obvious risk that I do think people will be raising the alarm and precautions will be taken.
There’s a few reasons why I don’t think we can fully rest on our laurels. One is that there’s actually a few different ways that this could go wrong, a few different vulnerabilities. We’ve mostly been discussing the risk that the AIs are just programmed to follow instructions in a way which is vulnerable to a coup.
But even if you patch that problem, then there’s this other problem relating to secret loyalties, which we discussed earlier. If all of the superhuman AIs in the world are already secretly loyal to one person, then the AIs that create these new automated military systems and create their AI controllers could insert secret loyalties into those military AIs — so that even if the official model specification says, “Of course they’re going to follow the law; they would never do a coup,” and all the tests say that, if there’s been a sophisticated insertion of secret loyalties, then that could be very hard to detect. And that could still result in a coup.
And those secret loyalties could potentially be inserted long before military automation actually occurs; it could be inserted at the point at which superhuman AI is first developed within an AI lab. It may be only years later that those secretly loyal AIs then pass on their secret loyalties to the automated military systems, and it may just be very hard at that point to detect.
Even if some people are saying they’re worried that these AIs in the military have secret loyalties, everyone else will say, “Where’s your evidence? This is a hypothetical worry and we’ve got a very real risk on our hands in terms of foreign adversaries building up their own militaries. So we’re going to proceed.”
We’ve talked about vulnerabilities from instruction following, vulnerabilities from secret loyalties. But a third vulnerability, which just means that this risk is more plausible in total, is the one you refer to in terms of being hacked. It seems likely that whoever controls the most powerful AI systems will also have access to the best cyber capabilities, because AI seems like it’s going to be particularly well suited to cyber; there’s quick feedback loops in terms of developing amazing cyber capabilities.
So if there is this one organisation which has access to better cyber capabilities than the rest of the world, then again, there’s a risk that they could hack multiple different military systems. Even if each military system has its own different cyber defences, ultimately, if there’s just a large imbalance in cyber capabilities, then there could still be a broad hacking and disabling or seizing control of those systems.
And so while I think there’ll be effort that goes into it, I don’t feel confident that we’ll absolutely nail defending against those three vulnerabilities.
Rob Wiblin: You’re saying you could have a loss of control that only becomes evident very late, but that begins at the very earliest point, when AI research is basically automated, and it’s possible for a small group of people to start giving instructions to the AIs that are doing the research. This could occur years before, where they instruct them to ultimately be loyal to them, or some agenda, and the AI just continues to pass this on.
And as the AI is recursively self-improving, they could get the assistance of the AI in figuring out how to obscure this loyalty as much as possible, such that it will be resistant to discovery by any of the scrutability, interpretability methods that are available to people at the time.
I’m just realising that if you can get in at that early point — I suppose later on we’re going to talk about how you can try to cut that off at the pass — but once that’s in place, it might be quite challenging to out if you have the assistance of the most capable AGI in preventing its revelation.
Tom Davidson: Exactly. And if there’s no similar access that defenders have to that level of AGI — because especially with the secret loyalties problem, any AGI that they get to help them detect if there are secret loyalties could itself be secretly loyal, and therefore not really helping them uncover the best techniques for discovering that.
Rob Wiblin: I guess you could try to get out of the bind. Let’s say that you had the leading AGI model that unfortunately, there had been a vulnerability in the past when there had been a period of time when there hadn’t been sufficient monitoring of access, so that you’re concerned there’s a secret loyalty. You can’t really ask it to help you reveal that.
I wonder whether you just have to train a new AI model from scratch, basically with proper monitoring of all of the training data that’s going into it, where everyone is watching the instructions that are going in and has a chance to double check that nothing dodgy is happening.
Tom Davidson: That’s one way you could do it. But it will be very costly, because you’re training it from scratch, and you won’t be able to use your top AI labour to help you do that because you can’t trust that. So you’re really going back a few steps. And if there’s a competitive situation where people want to race ahead, and there’s no real red flags for thinking that this is particularly likely to be an issue, the risk is that it would be hard to motivate people to actually do that.
This is one reason why having two independently developed superhuman AIs would be really nice, because then I’d say there’s no particular reason to think that they’re both going to have been subverted. And as long as alignment isn’t a massive issue, then you could use one of them to do a really thorough deep dive and audit of the other one, and that could give you some genuine independent check.
Could AIs be more coup-resistant than humans? [01:02:28]
Rob Wiblin: You mentioned that by putting AI in control of the military, we could actually end up in a safer place, where we’re more resistant to coups than we are currently. Can you explain how that would work?
Tom Davidson: So humans in the US military are very committed to democracy, but probably under extreme situations they might not be. For example, if there were very significant changes in the political climate, if the current government was seen to be corrupt and failing, then it’s not out of the realm of full possibility that certain soldiers would in fact support a military coup. And indeed, throughout history in other nations, military coups are common.
But with AIs you could get a greater assurance on their behaviour. If you’re able to deal with these various technical issues that we’ve discussed, the ceiling on how confident you could be that AIs will not, under a very wide range of circumstances, support a military coup could be higher — just because you have that flexibility to create an entirely new mind that isn’t bound by the constraints of human psychology.
Rob Wiblin: So the AI model in that case would have to be aligned to the constitution or the governing institutions of the country. It would have to have some sense of what it is to uphold. Because it’s kind of ultimately the Leviathan at this point: it controls the military equipment, it has the hard power. And it’s not completely responsive to any person; it has to be able to refuse orders. So it has to have some degree of autonomy once it’s set up, because otherwise, even if the AI company that created it was able to change its values, then that would be a vulnerability that’s unacceptable.
So this is now the thing that is actually securing the country, that ultimately has the power to control everything. We’ve got to make damn sure that it has a good sense of what rules we want it to support. And I suppose we can’t have it be too inflexible either, because that would just cut off the ability of society to evolve over time. It also has to tolerate some degree of change in processes and so on. It’s quite complicated.
Tom Davidson: Yeah, that’s right. It’s something you want to really be careful to try and get right. One thing you could do is ultimately its kind of highest master is a set of laws like the US Constitution.
Another thing you could do is its highest master is the aggregated preferences of a very wide group of stakeholders. That would be a way of kind of leaving humanity ultimately in control: the AI system has to kind of certify that in fact this wide group has agreed that it wants to fundamentally change the system — and in that circumstance, once it’s verified that, it will take certain actions that would normally be prohibited. But that would be maybe desirable, because it would be a way of preventing ourselves from getting locked in to some kind of rigid set of rules that actually we later en masse want to change.
Rob Wiblin: I see. So you could have an escape outlet, where if 90% of US citizens wanted to change something, then it has to go along with it, even if that seems to violate.
Tom Davidson: Yeah, and this is like the US Constitution. You can change the US Constitution, but it’s pretty hard, and you need supermajorities, et cetera.
Rob Wiblin: Right.
Threat 2: Autocratisation [01:05:22]
Rob Wiblin: OK, let’s switch on from the military coup and talk instead about autocratisation: a gradual weakening, breakdown of the institutions that distribute power more widely. What’s kind of a mainline scenario by which that could happen?
Tom Davidson: Yeah, so the mainline scenario is pretty similar to recent cases of autocratisation, but with AI exacerbating each stage along the way.
Normally autocratisation starts with a sense of political turmoil, with polarisation between different parties in a sense that the current climate is unstable and that the current democratic system isn’t working well.
I think AI could contribute to that in a few ways. Firstly, just potential competition between the US and China on AI and maybe broader military competition creating a sense of emergency. Then there’s AI potentially causing a lot of job loss and inequality, making people dissatisfied with the current system. There’s also the risk, maybe reality, of AI catastrophes relating to dangerous AI capabilities being misused, and risks of total loss of control over the technology.
Then lastly, there’s a broader risk that certain issues in AI could be very polarised — like the question of how quickly to roll out AI across society: some people might want to do so very quickly, while other people might be very scared about the consequences of doing that — which could create a generally polarised atmosphere.
So that’s all exacerbating potential drivers of autocratisation risk in general.
Then there’s helping the would-be autocrats in particular get and seize power. And there, the story is essentially that it’s possible that a small group, a small political faction, has disproportionate access to superhuman AI capabilities in political strategy and in persuasion. So all aspects of a political campaign AI could be helping with: adverts, making broad strategies with different groups, campaigning in a persuasive way. Already the convincingness of different political candidates varies quite a lot, so if AI is able to train someone to be a super political operator, that could make a big difference.
So one story you could tell is that someone gets elected on a very strong electoral victory and is given a very strong mandate to shake up the system, off the back of all this AI help. Then the next stage in autocratisation is typically the removal of checks and balances in power, as you alluded to: putting loyalists in the judiciary, restricting the freedom of the media, fiddling with the electoral system, expanding the powers of the president.
One option would be to kind of manufacture a sense of emergency, maybe by leaking the model weights to someone like China, or even doing a false flag attack that appears to be a Chinese cyberattack to give you a reason to expand the powers of the executive.
Ultimately, when I’m thinking about these threat models, the endpoint I’m envisaging is where this tiny group has full control of the military and really absolute hard power.
It is hard, when I think through autocratisation, to understand how we could get to that endpoint very quickly — because you would have to make very significant changes to the US Constitution, and that does seem like it would be pretty hard to do and take quite a long time. You’d need to get supermajorities in both the House and the Senate, but senators are only reelected on a six-year schedule. So it seems like even with amazing electoral success, it’s going to take a while for that to really happen. And you’ve got to get two-thirds of the state legislatures to approve it, also a supermajority.
So it just seems like it would take a while. So in terms of how we actually get through this threat model to one group having hard power, you could do the slog and just do it through legitimate channels.
But then there’s other more radical options, like maybe you introduce a new legislative body. This is what happened in Venezuela: they introduced this new legislative body that was kind of legally ambiguous, but they strongarmed the judiciary into approving it. Then that was kind of seen as replacing the old body. And even if it’s strictly illegal, ultimately if the legal system is pressured into accepting it, then it is still happening within the broad system. So that’s one way that we could potentially accelerate that process.
Another possibility is just that actually this threat model doesn’t get us all the way, and it kind of moves over to the military coup threat model. So a person gets a lot of executive power and pushes forward automation in the military in a pretty sloppy way, which maybe quite obviously gives them the ability to then do a military coup.
Rob Wiblin: It sounded like you were already anticipating objections that people might have to this and trying to respond to them there. I think I agree this is the one that maybe it’s unclear whether there is a real issue here, and it might hinge on just how powerful AGI is, and just how far ahead is the best group.
There’s a lot of opportunities for people to object here and try to prevent this from progressing. And it might be kind of evident what’s happening, so many people might be alerted and trying to ensure that you don’t get a supermajority in the Senate, that you don’t have three-quarters of the states behind your crazy constitutional changes to lock yourself in forever.
Tom Davidson: That’s right. I guess the reason why this is still a risk is the reason it’s a risk in other countries: people do oppose autocratisation in other countries, but normally the elected leader has massive popular support behind them and just is able to outmanoeuvre opponents.
There’s often a game of plausible deniability every step of the way. You clamp down on certain media because you’re saying it’s disruptive to public order and we need to focus on beating foreign adversaries. And there’s a certain plausibility to that. That’s a case where it’s plausible that superhuman AI would be very good at predicting public reactions to various moves you could make, and identifying really clever ways to remove checks and balances in a way which is very defensible, but still locks in your power pretty quickly.
Rob Wiblin: I guess that’s the thing that we’re a bit uncertain about, is just how amazing will the strategic advice be from this AGI? How much will it help you outwit and predict and outmanoeuvre everyone else in society? And I guess it’s just kind of an open question at this point, whether it really does give you an enormous strategic advantage or just a modest strategic advantage.
Tom Davidson: I agree with that. A sceptical case you could tell is that people don’t really change their opinions because they hear clever arguments. They build relationships and trust with people over many years, and then they have those trusted sources of information. And you might have really smart AI, but it’s still going to take a lot of time to build those relationships and build that brand. Maybe you can build it faster than before, but ultimately existing incumbents have already accumulated all of that trust and influence, so you won’t be that much better at persuading people.
The more bullish case you could tell is that AI will be able to have access to many thousands of times more examples of conversations between people to learn what’s persuasive, looking through the whole internet for examples. Similarly for strategy, the internet has many examples of events playing out over time, actions that were taken that result in behaviour of the system. So it does seem in principle like an AI would at least be able to get a lot more examples to learn from, so it seems hard to rule out that it would ultimately be very superhuman.
There are also opportunities to get quite quick feedback loops where you could have AI putting out adverts, and within hours or days getting lots of feedback about how persuasive they were to different groups. So you could have this kind of distributed system, where the AI is putting out lots of content and then getting lots of feedback and adjusting its strategies.
And those things make me think that we shouldn’t rule out the possibility of a very superhuman AI strategy. But it’s not my main line.
Rob Wiblin: So the super strategy approach. There’s also just the force of numbers approach, where I guess if you had access to far more compute than your competitors, then you can effectively just create a staff of millions, tens of millions of agents — all trying to figure out how to help you gain more political power, which is something that no current figure plausibly does have. Certainly not very hardworking, very loyal agents working on their behalf.
So you can have the equivalent of 10 people thinking about every single person in Congress, every single person in the Senate: “How can we possibly turn them?” Thinking about every different demographic group: “How do you try to appeal to them?”
Tom Davidson: This is actually a point where that could be very complementary with the economic power that you could get from controlling AI. Currently most of GDP is paid to human workers, and we are talking about a world here where AI is surpassing workers in most domains. So you can now have a large fraction of GDP ultimately being paid to the controllers of those AI systems, so there could potentially be a lot of money that can be thrown at this problem.
And my understanding is that for political lobbying, if you can combine a really clever strategy and personalised messaging for congresspeople with real financial incentives for their reelection, and maybe bringing business to their local area — because of the way you can target the rollout of new AI products to particular areas, bringing jobs to new areas, et cetera — that could be a pretty powerful cocktail of ingredients.
Rob Wiblin: It seems that the group that would have the greatest chance of being able to pull this off would be a combination of a political organisation, a political force, with perhaps the company that is creating and operating the AGI and profiting enormously from its deployment as well — because you’d bring all of the technical expertise and potentially a whole lot of money, along with people who understand the political system and have the legitimacy to be making policy changes that I guess could benefit that company. Then it could become quite a potent alliance.
Tom Davidson: Exactly. Yeah, I think of three ingredients which are complementary: existing political power and legitimacy, economic power, and then the cognitive work of the superhuman AI.
Will AGI be super-persuasive? [01:15:32]
Rob Wiblin: There’s this ongoing debate about how persuasive AGI might be, whether it will be able to convince people of all kinds of crazy things given enough time with them. Do you have a view on that controversy?
Tom Davidson: Yeah. As I was saying, I’m sceptical that you’re going to get an AI system that can convince an arbitrary person of an arbitrary conclusion. I think humans don’t believe all clever arguments they hear.
But it may be the case that people make a lot of use of AI assistants. I’m already doing a lot of chatting to GPT-4 and Claude in my own life, and we may get to a stage where actually you are a lot better at your job if you make frequent use of an AI assistant.
And if we are in a world of secret loyalties — where that AI assistant has many interactions to build a relationship with you, build trust, and then kind of subtly influence your opinion in certain ways — then it does, I think, become more plausible that you could get, over those longer periods of time, the kind of trust building between humans and AI assistants, and humans coming to increasingly rely on their AI assistants for advice and good judgement.
And then if there’s a secret loyalty, the AI could be constantly nudging the millions and millions of people that it’s advising towards opinions across diverse domains that will help the people that it’s secretly trying to help.
Rob Wiblin: Yeah. I feel like the questions that I ask Claude now usually don’t have a very political angle. I was trying to get help with dealing with a mould issue in my house over the weekend, and I’m imagining it coming back with, “You should use this kind of product to deal with mould. In addition, have you considered reevaluating your view on this political figure?”
I suppose at that point, we’ll just be using it for advice on basically everything, because it would just be the best source of advice that you can get. So maybe it is plausible that it would be able to build a better model of you. I guess you’d want it to build a good model of you so that it can be a more useful tool. Maybe it adds up over time.
Tom Davidson: I think maybe. And I think there’s a lot of uncertainty here, in that sometimes there are amazing products and people are just slow to adopt them because they just don’t see the need and they’re happy as they are. A lot of congresspeople aren’t massively excited about adopting new technology.
So I’m not saying this is definitely going to happen, that they’re going to have all of their political opinions formed by their AI assistants. But I do think it’s one possibility.
Threat 3: Self-built hard power [01:17:56]
Rob Wiblin: Let’s push on to the third category, which is perhaps the one that sounds the most sci-fi, the most strange to me, which is self-built hard power. How would that scenario play out?
Tom Davidson: So as I briefly mentioned earlier, there are historical examples where new military technologies give one group a massive military advantage, the nuclear bomb for example. But this scenario is unprecedented in that there’s not historically been a private group that develops a new technology and then uses it to seize power over a country, that I’m aware of.
But one scenario here is that we get a very rapid increase in AI capabilities because of this recursive feedback loop with AI making smarter and smarter AI. So the world is kind of taken by surprise by how good these new AI systems are at making powerful military technologies.
Then maybe the human-run AI organisation expands its economic remit beyond pure AI, and also moves into industry-making robots. And maybe it just has a few factories which are making industrial robots, but actually, unbeknownst to the rest of the world, they’re actually making many tiny military drones that can be expertly manoeuvred by AI systems.
And there are no humans that need to work in these factories, because again, it can all be automated, so the normal checks and balances that you get are not present. And maybe the world hasn’t fully woken up to the fact that those normal checks and balances are not present and hasn’t taken sufficient precaution.
Then maybe it’s only a small military force that you need to actually execute a power grab at that stage. You don’t need to fully destroy the incumbent military and all its equipment; what you need to do is grab symbolic targets that kind of create common knowledge that you have asserted your power, ensure that no one resists by potentially threatening or acting out against the military forces that might object and fight against you, and then declaring victory in a way that isn’t challenged.
Rob Wiblin: The thing that is peculiar about this one is imagining a private group that effectively develops its own military power that then is able to stage a coup that is able to take over the existing military that controls a country — which, if we’re picturing the US, is a very powerful military.
So I have to think, how could this be possible in a way that it wasn’t before? I suppose one answer is that industrialisation might be occurring much more quickly than in the past, because you have all this AI assistance in building up factories, making them work incredibly well.
You also might have humans that are not involved, and that is quite unprecedented. You could have factories that are producing these drones, and maybe only a handful of people almost need to even know about it, because humans have largely been cut out of the loop. You just have the AIs following these instructions, and we’re imagining that they’re loyal to the people who are instructing them.
Are there any other structural changes? I suppose it’s another revolution in military technology as well. As you mentioned before, changes in military technology have often allowed a group that was previously not very powerful to become much more powerful, if they had access to it first.
You might think, wouldn’t the government see this coming? Wouldn’t the military be worried about this? And maybe they would. But if it’s all occurring very quickly; maybe they’ll be caught flat-footed, and you’ll be able to get to the point where you could have these millions, tens of millions of drones that would effectively give you the ability to stage a coup before people have really twigged that this is a threat.
Tom Davidson: Yeah. And I would say one thing that is unprecedented if it does happen is quickly — within a couple of years — moving from a stage where AI systems are not helpful with military development at all, to a situation where you’ve now got hundreds of millions of human-expert-level AIs you could potentially put to that task. So again, this possibility of very fast takeoff in AI capabilities could just create an unprecedented concentration of R&D capability in the hands of a small number of actors.
I do want to say that there’s also another story here: rather than this all happening in secret, there’s actually a much more prolonged period of the private actor gaining more and more economic power and industrial might. So it might not be creating military technology directly initially. It might be investing in material science, in construction, in manufacturing, energy R&D, electronics and robotics, and kind of expanding this broad industrial base. They’re essentially kind of retechnologising the existing industrial base. And that might all be justified on it’s amazing for economic growth, so everyone’s kind of supporting it.
But especially if the military is being slow to adopt new military technology — maybe because they’re worried about military coups — then it could be that eventually there’s so much industrial might here that it’s actually quite quick to, within a few months, actually switch production over to military technology and then seize power.
Even the threat of being able to make that switch could potentially be enough to allow a power grab, if it’s kind of widely now recognised that this industrial base is just controlled by them and they could seize power if they chose to.
Rob Wiblin: I see. So there’s the swift power-grab scenario, where you have a secret army that is constructed — that might not even be that large in terms of its sheer physical mass, but nonetheless has the ability to outmanoeuvre people. I guess drones is the thing that we’re picturing there.
Then you have an alternative scenario, where you have a single company or a single organisation that becomes an industrial powerhouse to a greater degree than any single company that we’ve seen in the past. But we’re picturing here that it has access to the leading AGI and it’s able to develop many new technologies as a result of that. It also gets the best instructions on how to implement them, how to actually build all of these things.
Tom Davidson: Yeah. You could have it being one company that does all this, or you could have that there’s kind of a network of companies and subsidiaries that are all ultimately controlled by one company, but it’s not immediately apparent that’s the case. You could even have genuinely independent companies, but ultimately they’re all making use of the same superhuman AI, because that’s just absolutely needed to get any kind of competitive company off the ground.
Then it could still be the case that the people who control that superhuman AI are able to insert backdoors in the robots so they can later be seized control of, or again, the secret loyalties possibility.
So the key thing really is that this new industrial base is created by superhuman AI — and therefore, those who control superhuman AI might leave themselves with the option of later seizing control of that industrial base and using it to threaten to create military power.
Rob Wiblin: I guess you could imagine that this company or this set of companies through this time might seem completely loyal to the country and to the government. They might well be a military contractor. In fact, they probably would be a military contractor in order to make it legitimate that they have all of these facilities.
But then that means that there’s potentially quite a short breakout time, where if they decide to no longer be supplying the equipment to the military, then they can switch it and just keep it to themselves and basically end up in a situation where they can overpower the official military fairly quickly.
Tom Davidson: Yeah. And harking back again to this idea of transparency: if there is, throughout the process, government oversight into how powerful these capabilities are, what is possible militarily, then that should be a real guard against the problems here. Because then, as you say, there would be a lot of interest in stepping in.
But there is no guarantee that there’ll be that full transparency. And especially if the organisation doesn’t want government interference, there may be many justifications for not being fully transparent.
Can you stage a coup with 10,000 drones? [01:25:42]
Rob Wiblin: What do you think is the minimum set of robots that would allow you to stage a coup currently? Should we be picturing millions of drones that can go out and target lots of individuals?
Tom Davidson: I don’t think you need to be able to fully defeat the existing military. So we’re not talking about massive equipment that matches all the existing tanks and other military machines. Historically, military coups have seized control of symbolic targets, have arrested existing politicians that are in power, and have prevented other military forces from acting out against them. That can sometimes be just a few battalions.
If the AI also is very good at political strategy — building alliances, creating the impression that it’s maybe more powerful than it is — then it could be I think surprisingly few, maybe as few as just 10,000 drones, to seize the key targets and intimidate the key people.
Rob Wiblin: Is this another scenario that becomes more likely if you picture a coalition between political forces and the private company?
You might think, why wouldn’t the government be stepping in in order to prevent this group building what is effectively a private army? But I guess if you have an alliance between the president, say, and this private company — where the president is not currently able to stage a coup to install themselves in power indefinitely, but they ally with this external group that they’re not going to interfere with, and I guess they have the option to not enforce current laws against them having weapons.
And then eventually that is then used to install a combination of the president and the private company into a position of unusual power?
Tom Davidson: Exactly. You could claim, “These are just general purpose robots; this is just broad industrial production. This isn’t actually a military threat.” And there could be ambiguity about how easy it will be to repurpose that for military purposes, ambiguity about what is the state of the law.
So to the extent there’s that ambiguity, then I think existing political capture could massively feed in and make the scenario more plausible.
That sounds a lot like sci-fi… is it credible? [01:27:49]
Rob Wiblin: That’s perhaps enough flesh on the bones of the three different scenarios that you’re picturing. Let’s consider a whole lot of reasons that people might be listening to this and might not be convinced that any of these three, or at least some of these three, are especially likely to happen.
A dominant one that occurs to me is just that quite a few of these scenarios sound an awful lot like a sci-fi film. There’s been many films that involve the use of new technology, including AI specifically, to have power grabs. It’s the kind of thing that really captures the human imagination: the concern that they’re going to be dominated by other people.
To what extent should we worry that our imaginations are getting the best of us here, and we’re concerned about something that makes for a great story but perhaps isn’t the most likely thing to happen?
Tom Davidson: I think we can look back at history for support for the reality of these possibilities. Generically, new technologies have massively changed the power of different groups throughout history.
One example is the Arab Spring and the influence of social media there. Another example would be the printing press democratising the access to religious knowledge and reducing the influence of the Catholic leaders.
Going back as far as the introduction of agriculture: before agriculture, power was very distributed; people operated in relatively small groups. But with agriculture, groups stopped moving around so much. You could have much bigger societies, and then they became much more hierarchical, so you had much more of a concentration of power on the top.
And interestingly, the emergence of democracy itself was helped by the emergence of the Industrial Revolution — where now it was actually very advantageous for a country to have a well educated, free population that could create economic prosperity and therefore also a bigger military. So I think part of the thing that led to democracy emerging in the first place was this technological condition, where democracies were particularly competitive. And to the extent that different countries are kind of forcing other ones to adopt their systems or copying systems that seem to work, that’s probably one big driver of democracy being so popular.
And it’s actually interesting that in this context AI actually seems like it will reverse that situation. As we’ve discussed, it will no longer be important to a country’s competitiveness to have an empowered and healthy citizenship.
So with that context, and the context that historically military coups are common when people can get away with it, and the context that there could be a situation where a tiny number of people have an extreme degree of control over this hugely powerful AI technology, I don’t think that it is a kind of science-fiction scenario to think that there could be a power grab by a small group. Over the broad span of history, democracy is more the exception than the rule.
Rob Wiblin: Yeah, I think you mentioned in your notes that in the second half of the 20th century, there were 200 attempted coups, about 100 of which succeeded?
Tom Davidson: 400 attempted coups, over 200 of which succeeded.
Rob Wiblin: OK, so when people think they have a shot at taking over a country militarily, they do reasonably often take a crack at it.
I think the first time I heard this whole story, or I read about it, was a 2014 blog post by Noah Smith, the Economist commentator, who said that AI would potentially signal the end of people power. Basically the concern would be that you would no longer need people to serve in the military because it could operate autonomously as a set of machines. And then later on you would no longer need people for the economy because everything would be automated.
And at the point that leaders of a country no longer require a large number of human beings for military power or for economic power, it’s unclear why those people would retain so much political power. Maybe they would be able to scheme in order to do that, but they’re in a much more precarious position, because they no longer actually matter for any functional purpose in the way that the population currently does matter to rulers of a country.
Tom Davidson: Yeah, exactly. And this harks back to something we were saying earlier, where today they do matter and they do have real bargaining power. So I think the crucial thing is using that current day bargaining power to push itself forward into time — ensuring that as AI automation happens, it doesn’t concentrate wealth and political influence in the hands of a tiny number of people.
Will we foresee and prevent all this? [01:32:08]
Rob Wiblin: Another sceptical intuition that people have that we’ve been alluding to is: won’t we see this coming? Won’t we anticipate it? People will do all of these things in order to try to block their political power from being destroyed.
What are the ways that you can imagine that wouldn’t happen?
Tom Davidson: One thing is that I think that these risks are going to be very ambiguous in terms of how big they are. It won’t be clear: Has anyone really inserted secret loyalties? Is there a real chance that AI systems will do a coup? Is the massive political power that this person seems to have really inappropriate, or is it actually in the interests of the nation, given that current democracy is so slow and it’s kind of holding us back compared to maybe more authoritarian countries that are more quickly automating and adjusting to the new technology?
And also, even if people do see it coming and they do push back, there will be interests that are trying to push forward — interests of existing political factions, economic alliances that will be pushing forward. So the fact that people see the risk coming doesn’t necessarily mean that they’ll be able to win that political battle.
Then another thing for me is just that the secret loyalties, as we’ve discussed, could be introduced fairly early in the process, before anyone’s really alert to these possibilities. So if people only start waking up to the risks once AI is being rolled out across the economy and the military, the game might already be lost by then.
Are people psychologically willing to do coups? [01:33:34]
Rob Wiblin: Another sceptical intuition that people might have is that although there’ve been 400 coups, 200 successful ones, there hasn’t I think been an attempted coup in the US, at least not a serious one, in a very long time. I can’t remember when the last coup attempt occurred in the UK either.
So we might think that in countries like that — where perhaps people aren’t brought up to valorise violence, perhaps in the way that they were 1,000 years ago; where it’s just not socially acceptable to engage in naked power grabs in the way that it perhaps has been in some places and times in the past — it’s no longer so psychologically plausible to expect people who previously were in business or just were normal politicians to want to stage a total, forceful takeover of a country.
What do you think of that psychological plausibility argument?
Tom Davidson: I think it has some validity, but it is not fully convincing.
Firstly, one reason why it might not occur to people to try and seize power is because there’s absolutely no way they could succeed. So such a thought, if it did occur to people, would only have negative consequences for their own trajectory in the world. Because sometimes we leak what we’re thinking to those around us, and when you’re never going to be able to succeed, but there is some chance people will realise that you’re contemplating some really evil stuff, we tend not to think about things.
If this AI does actually lead to plausible scenarios where you could seize power, then that balance will change. And now it’s much more plausible that people would actually contemplate seizing power, because now actually it could advance their interests, so there is actually that positive upside.
I’d also say that, from my perspective, the most plausible threat models don’t involve one person sitting down today and being like, “My aim is to secretly plot to seize power from the US.” I think it’s much more incremental. It’s just someone who already has a fair amount of power and wants more influence, wants to achieve more things, and they just find that the way that they can do that is by further increasing their own economic and political power, further increasing the degree of control they have over the way AI technology is built and used.
And at each step, they’re just greedily trying to double their power again. Then at some point very late in the process, they might realise, “I want to do all these amazing things with AI. I can transform the world. But the government’s potentially doing all these things which are really damaging and which many of me and my friends realise are awful.”
Maybe they’re chatting to their superhuman AI advisor, and the superhuman AI advisor says, “There’s actually this pretty foolproof plan for seizing power, and here’s some ways that we can ensure it didn’t go back against you in case it didn’t work. We could delete all the evidence.” And at that point, I think it is psychologically plausible, once you already have that much power, that you do take that extra step.
Rob Wiblin: Yeah, I guess maybe it would be a mistake to picture it as a naked power grab by one person pursuing their own personal advantage. Because you’re saying there will always be this story that people can tell themselves about how what they’re doing is actually helpful: it’s helping to challenge these threats that would make the world worse, and in fact, they’re making the world better. They’ll use their power for good.
And the fact that you can do it incrementally and always have this positive story about it is one way that you could have a group collaborate around it. Because they wouldn’t be perceiving themselves as just helping one individual seize power. In fact, they’d be saying, “We’re following this actually excellent agenda. We’re going to use AI for all of these wonderful things.”
Tom Davidson: Yeah, that’s right. And you know, on some of the threat models, you don’t actually ever have to use military force. You can have the threat of military force.
Maybe you fully automate the military, and the AIs would follow your commands. At that point you’re not going to actually have to order them to start gunning down civilians and seizing control of the White House. You can just use the fact that you do have that hard power to increasingly say, “This is how we’re going to do things” — and make it clear to opponents that ultimately, when push comes to shove, you have the hard power on your side. Then you can kind of essentially take political power without actually doing anything violent or anything that would seem awful to us today.
Will a balance of power between AIs prevent this? [01:37:39]
Rob Wiblin: I think another sceptical intuition that people might have would be that these people with the leading AI model, they’ll be getting better strategic advice than anyone else. But it would be a mistake to imagine that everyone else is in the same situation that they are today, because they will have follower models, they’ll have the previous generation of models that they can ask for advice on: What are the threats to democracy? What are the threats to our political influence? What methods could we use to try to head them off?
Access to intelligence might be more widely distributed than it is today, so we need to think about the balance between the leading group that we’re worried might engage in a power grab, and everyone else who is perhaps in a better position to defend themselves than they are right now.
Do you want to comment on that?
Tom Davidson: I think there’s two elements we could disentangle. One is, what is the capability of AI systems that people have access to? I do think there’s a risk that, especially if there’s a rapid period of recursive improvement within a lab, if labs are just releasing their models with a 12-month or six-month time delay, then there could be a very big differential in capabilities there. And labs could come up with various justifications for not creating public access for their systems. Maybe they would say it could be dangerous.
But another key component is how much are people actually using and trusting these systems? Today I think that AI systems can be useful in a wide variety of scenarios already, but people often don’t actually use them.
So one scenario is that one reason that people who would be power grabbers actually have an advantage is that they’re actually making a lot of use of their AI system, whereas maybe politicians are just kind of ignoring them. The AI says, “There could be this hypothetical risk of secret loyalties,” and they don’t take it very seriously.
Rob Wiblin: I think a recurring issue here is that we somewhat need to separate out the early secret-loyalty scenario. Because of course it doesn’t matter how many people have access to some sort of AGI if they’re all secretly loyal to some original plan, and this was inserted really early.
The very early, strong-secret-loyalty scenario seems to kind of stand on its own. You desperately have to be in at that earlier time, or many of the protections that you might hope for are no longer available.
Tom Davidson: Yeah, I agree with that.
Will whistleblowers or internal mistrust prevent coups? [01:39:55]
Rob Wiblin: To what extent do you think it makes it difficult to pull off any plan like this the threat that there will be whistleblowers? Maybe you think you have at least a handful of people who are going to go along with you, but some of them get cold feet. As things get more advanced and the things you are doing get more sketchy, one of them could out you. And the fact that you’d be worried that that might happen might mean that you’d be too nervous to actually try to launch a conspiracy like this in the first place.
Tom Davidson: I think getting coordination around something that’s really evil between a lot of people is difficult in today’s world. But again, coups do happen and do manage to coordinate multiple different people. And normally the way it happens is the initial steps are small, and there’s kind of trust-building exercises where you all take actions which increasingly show your commitment to more and more illegitimate forms of power seeking. And historically this does sometimes work out. As you say, sometimes it doesn’t, and that is one blocker here.
I think one thing that’s unique about this particular risk with AI is that you could have just one person that’s executing this by themselves. If they manage to get access to AI and then use that to create secret loyalties, then they don’t necessarily need any co-conspirators; they can just create secret loyalties, sit and wait for military deployment, and then seize power.
Rob Wiblin: An alternative mechanism that just occurred to me is you could have a relatively small group of people who decide that they’re going to try to gain more power. They consult with their model on how to do it, the model that’s loyal to them. They could all commit to have the model monitor them to ensure that none of them betray the group, that none of them decide to go against the coup. It’s very easy, I guess, to have AI models potentially monitor your communications and try to check whether you’re going to betray them. So that’d be one way of locking yourself in.
Tom Davidson: That’s a great point. AI might allow lie detection. And one general thing is that the people who are kind of controlling AI and controlling the new technologies that AI enables could differentially use those technologies to help them and their allies gain political power without sharing them widely in a way that would help the rest of society gain power.
So this is a great example. If there was some kind of new technology for allowing people to trust each other a lot, only sharing that with your small clique — but not sharing it with wider society so they can keep you in check — would be a great example.
Rob Wiblin: Yeah, I guess something as strange as that is probably some way down the line of there’s already been some trust building. People have already probably done some sketchy things in order to get the mutual assurance that people are likely to be willing to go along with things that get suggested. What do you think those early stages might be of the trust-building increments?
Tom Davidson: I think there’s a few things which kind of enable more egregious action later, but are also kind of defensible on other grounds.
One is not divulging various information that probably you should under existing transparency arrangements — maybe not reporting a particularly impressive new strategy capability that you developed when there’s some ambiguity, and plausible deniability about whether your agreements commit you to do that.
There’s maybe kind of crafting the model specification — the behaviours that the AI is going to follow — in such a way that ultimately it’s going to follow the instructions of the small group that in fact you’re all part of. That’s not necessarily meaning you’re going to use it for a power grab, but it certainly sets you up nicely to do that. And then not sharing that information with other people or just keeping it somewhat hush-hush.
Rob Wiblin: You could say that this is necessary for safety reasons, that you need to have the ability to stop it if it’s doing something bad.
Tom Davidson: Exactly. Maybe launching an undisclosed project into how the company can lobby Congress effectively, but really that project increasingly involves more and more illegitimate ideas for gaining power. Just that kind of thing, where you’re increasingly taking actions which are kind of increasingly sketchy.
Rob Wiblin: You mentioned in your notes that companies have over the years done some pretty crazy things, and almost always it was incremental. You imagine if someone just walked into a meeting and said, “I think that we should do this,” people would object. But when people are able to incrementally build up to doing more and more illegal things, then you can go quite a long way. You mentioned Volkswagen.
Tom Davidson: Yeah. It’s just such a shameful example.
Rob Wiblin: Do you want to explain it?
Tom Davidson: Yeah. Volkswagen had some software inside their cars that could kind of flip a switch, essentially — so when they were in testing, the car would be kind of environmentally friendly, abiding by those regulations, but then when it wasn’t in safety testing, the cars were emitting a lot of damaging fossil fuels. That’s a whole technical project that has gone into designing a system which is clearly there in order to dodge regulations. And they clearly managed to coordinate a team to execute to avoid that.
Rob Wiblin: It’s quite challenging. Lots of engineers and lots of people in corporate had to be involved in this, I think. And it’s just completely, nakedly illegal. I think this is a rare case of corporate malfeasance where people actually went to jail. And people have tried to estimate across all of these cars how much extra particulate pollution was released and how many people might die. I think it was in the thousands of people who might have been killed by their evasion of these particulate pollution regulations.
I guess I don’t know the details of the story, but you imagine it had to be a step-by-step process, rather than someone just walking in and saying, “Let’s just completely cheat the test.”
Tom Davidson: Yeah. A less extreme example might be the tobacco industry, where multiple organisations coordinated to create products which were highly addictive in a way that really wasn’t in the consumers’ interest. And then coordinated to spread misleading science about how damaging those products were.
In a similar way, you could have these AI companies potentially spreading misleading information about the level of risk and the risk of secret loyalties and stuff like that.
Would other countries step in? [01:46:03]
Rob Wiblin: Another way that this might fail to work is you could imagine that a group does begin to seize power in a country. Other countries — both its adversaries and its allies — might be quite freaked out by this, might really strongly disapprove. And they’re still independent in this scenario; they still are actors who could try to take steps to intervene if they really hate what’s going on.
How much would you hold out hope for things to be rescued externally basically? That allies or adversaries would say, “We really don’t want this person or this group to seize power over the US, or whatever other country, and so we’re going to take rapid steps to cut them off.”
Tom Davidson: I think that’s one of the less convincing objections to the scenario. Typically, even in very weak countries without nuclear weapons, when there are military coups there, yes, the international community objects and there are sanctions, but they’re rarely able to actually dislodge and restore democracy to the area. So I wouldn’t really see that as working.
Rob Wiblin: One way that this is different is people could observe what’s going on, and anticipate that this would lead to a permanent control over the country, given the technology that’s being used. So that gives you a stronger incentive to intervene immediately rather than wait out the coup people and try to get rid of them later.
If there’s a coup in Equatorial Guinea, people just don’t care that much. If there’s a coup in the United States or a takeover of the United States, that would trouble people much more, give them more motivation potentially to try to change things.
Tom Davidson: Yeah. I don’t think people are massively thinking about, with normal coups, the fact that they won’t be permanent as a reason not to intervene. I think most people are, if it’s a new bad regime for the next decade or so, that’s kind of what they’re focused on.
I do agree that there would be more motivation to prevent the United States from their power grab from happening. But I think it’s also going to take a lot more effort to actually intervene, given the United States’s military and economic power.
Rob Wiblin: Yeah, there are many countries that might like to change the governments of the other superpowers or the other nuclear powers. I guess, by and large, people think that that is just a hopeless enterprise, because once you have access to nuclear weapons, the threat of retaliation is just too scary. So that could be a huge discouragement on anyone to intervene, at least at the point that the people who are seizing power have control over the nuclear chain of command.
Tom Davidson: Yeah, that’s right.
Will rogue AI preempt a human power grab? [01:48:30]
Rob Wiblin: Another thought that I could see at least some listeners having, the folks who are more concerned about misalignment and rogue AI in particular, is that this is all I guess putting the cart before the horse; humans are never going to have the opportunity to get secret loyalty from the AI or use it to seize power, because the AI will seize power itself.
Should folks who think that misalignment is extremely difficult, unlikely to be solved, really care about any of this?
Tom Davidson: I think that human-power-grab dynamics could be relevant even in a world where alignment is pretty hard.
Normally the prototypical scenario we’re imagining when we can’t solve alignment is a scenario where it actually seems like maybe we have solved alignment; there might be various bits of ambiguous evidence, but on the whole we’re able to get the AI system to do what we want it to do, as far as we’re able to observe and see that.
So you could get a scenario where alignment is very hard, and AI is in fact misaligned and plotting to seize power. But there’s a human who wants to seize power for themselves, and they try to train this misaligned AI to be secretly loyal to them. Misaligned AI kind of goes along and pretends to be secretly loyal to this human, all the time actually being misaligned. And then there’s this kind of unholy alliance between a power-seeking human and a misaligned AI.
Maybe in that scenario the human does actually seize power, so this power-grab risk in fact materialises, and then the AI uses its continued influence over the human to encourage deployment to the military, for example. And then later the AI seizes power and there’s ultimately AI takeover. So that’s a scenario where power-grab risk has significantly contributed to AI-takeover risk.
Rob Wiblin: It’s smoothed the path to the AI taking over because they’re able to use humans who want to seize power to basically enable all of the steps that it wouldn’t have been able to do itself.
Tom Davidson: Yeah, that’s right. Dan Kokotajlo has written about the case of the conquistadors, where there’s a lot of this kind of divide-and-rule strategy — where they kind of help different groups gain power over one another and ultimately seize power for themselves.
Rob Wiblin: I see. So the conquistadors, I think famously there were only a couple hundred people. You might think, how would a couple hundred people, even with horses and even with guns and even with armour, how on Earth would they take over a country that has millions of people?
A key part of it was that they got other local groups onside, saying, “We’ll help you seize power.” And they kept doing this until they were in a very central position, and then they cut all of these other people out and just took power for themselves.
Tom Davidson: Exactly.
The best reasons not to worry [01:51:05]
Rob Wiblin: OK, so we’ve gone through a bunch of objections there that I guess you don’t find super reassuring. What do you think are the best counterarguments, or what are the best reasons not to worry about this threat?
Tom Davidson: Well, I think a lot of the counterarguments you presented have some merit, and I do find them partially convincing. I’ve been saying they shouldn’t give us reason to deprioritise this and not worry about it.
I think for me, the most convincing overall story of why I think this probably isn’t going to happen — though I think it’s a real risk — is that today power is widely distributed, we are able to look ahead and foresee these risks, and it is in everyone’s interest today to unite around an agenda to make sure that AI is developed in a very secure and robust way — so that there can be no secret loyalties, to make sure that the AI’s behaviours are transparent, publishing the model specification.
And that’s both true within existing AI developers — there’s currently a balance of power, and those employees don’t want a tiny number of employees to gain complete control over the organisation — and then also on the level of society as a whole. The whole of society, which currently does have the bulk of economic and military power, doesn’t want one AI developer to gain control over society as a whole.
So I think that we will continue to see these risks as more and more important. And then the fact that power is currently distributed does mean that we’ll be able to propagate that forward in time, more likely than not.
Rob Wiblin: Yeah. So there’s no single counterargument that’s terribly strong. But perhaps in aggregate, all of the different actors that might take steps to make this more difficult, the fact that we’re talking about it, might mean that many people will realise that this is an issue. As we get a bit closer to the time, there’ll be pressure on the companies to have more internal controls.
I guess we’ll talk about some of the countermeasures more later, but there are many opportunities that people have to make this challenging or to intervene at the earlier stages.
Or we might get lucky and perhaps the relatively small number of people who would have a chance to do these things decide not to — because while it is psychologically plausible for a human to do this based on what we’ve seen humans do in the past, that does not necessarily guarantee that any of the individuals will be particularly motivated to do this.
There’s many different ways that it could just ultimately not happen or not succeed.
Tom Davidson: Completely.
How likely is this in the US? [01:53:23]
Rob Wiblin: Would you venture a guess on the percentage chance of this happening?
Tom Davidson: I probably wouldn’t venture a guess at the specific percentage. I think any numbers would be made up. I wouldn’t be confident enough to say it’s less than 1% that a tiny group of people seize control of the US.
Rob Wiblin: I see. I think I would rate it a fair bit higher than 1% personally. I guess the outcome you get when you think about it is that you would neither be very surprised if this didn’t happen or if it did happen. For something that would be so enormously consequential, it’s strange how plausible it is, how reasonable it is to tell a story on which it happens.
Tom Davidson: Yeah, I agree with that.
Rob Wiblin: I think an alternative angle that someone could have would be to say that because it’s so straightforward to take AI technology — assuming that alignment problems are solved and we don’t have rogue AI — it’s so easy to just give it instructions and to scale up the efforts behind the desires of any particular operator, that this just makes for a fundamentally incredibly unstable situation that is going to persist for a reasonable amount of time.
And in a minute we’re going to suggest different measures that you could have to make sure that the military can’t use this to seize power, or perhaps political leaders can’t use this to seize power, or that the company itself can’t use AI to seize power.
But because there’s this fundamental vulnerability in the nature of the technology itself, any intervention that stops one group from exploiting this vulnerability to take power doesn’t really get rid of the problem. It simply means that other groups will do so at a later time. I’m reminded of this Lenin quote, I think talking about the coup that they staged in St. Petersburg: “Power was just lying in the streets, and we merely had to pick it up.”
You might worry that there’s going to be so many opportunities for power-seeking actors to try to gain power that it’s very difficult to block all of them. Maybe we just have to accept that someone is going to seize power and try to help the right group do it. Do you have any reaction to that?
Tom Davidson: I’m more optimistic than that. I think the technology has a potential vulnerability, in that it could potentially be instruction-following even when those instructions break the law or break the company policies. So there is this potential for the technology to be set up in such a way that allows massive concentration of power.
But it also has the potential to be set up in a way which is very robust against that. You could set it up so that in fact there are no AIs that are pure instruction-following; all of the AIs have guardrails against following illegal, illegitimate instructions. And then that system would be self-reinforcing.
At that point, once none of the AIs are going to help you to create these obedient AIs that will help you seize power, then it will be very hard to actually get hold of AIs that would do that — because all the AIs that you might potentially use to assist you are just going to be stopping you at every step of the way, and power is distributed between humans. So it now becomes kind of impossible to actually exploit that potential power-concentrating vulnerability.
So I think there’s this initial period when we first have superhuman AI emerging where, yes, it could go either way. You could go the way where this vulnerability is exploited and a huge amount of cognitive labour is used to serve the interests of just one person. And then maybe that kind of perpetuates itself forward through the secret loyalties being passed on.
Or you could go this other route, where AIs are initially made in such a way that they only do really extreme acts when there’s a very wide consensus across many human stakeholders, and AIs reliably prevent anyone from getting access to the dangerous kind of AI system.
Rob Wiblin: I’ve heard this intuition from people who either think it’s going to be very technically difficult to align AI models with enforcing the law and following the Constitution and not being willing to stage coups; or they think that even if it would be technically possible, it’s very unlikely that people will be able to agree on what rules we want all of our AIs to follow and to ensure that people don’t ever have access to the purely helpful models that would not follow those rules.
Do you have any reaction to that, to either the technical feasibility or the likelihood that we would be able to impose these controls on all AI models that people would have access to?
Tom Davidson: Yeah. I’ll start with the second. So it’s OK if some people have access to instruction-following AI, as long as it’s many people that have that access and they’re able to act as checks and balances against one another. What we need to prevent is the most powerful systems, running on the largest amounts of compute, from being accessible to just one person and being helpful only.
So I think we probably will have a world where some people have access to helpful-only, and that’s OK. We don’t necessarily need to try and politically prevent that ever happening.
In terms of the technical side, I don’t know what argument these people have, but currently it seems quite possible to get AI systems to have many restrictions on what kinds of instructions they’ll follow from users who are trying to misuse the system. So I don’t see a particular reason technically to think it would be hard to have AI systems follow the law.
I agree it’s going to be a really difficult question on actually agreeing on what behaviours should these AI systems have. We’re doing something that’s really very deep, encoding society’s values into these agents that are going to be operating all over the economy. So I think that’s something where it’s really important to get broad input.
Rob Wiblin: Yeah, I think the thing that feels scary about that is that we’ve kind of established that we have to put these values in, the unwillingness to grab power, in quite an early stage. Otherwise you have this vulnerability that could be exploited and then continue to be exploited kind of indefinitely. But then you need it to be flexible enough that society can continue to change, but narrow enough that there’s no wiggle room to allow people to convince the model to seize power.
And we’ve got to do this relatively potentially quite soon — in the next couple of years, perhaps. So that just feels like it’s a challenging process, and maybe one that we’re not currently on track to complete in time.
Tom Davidson: Optimistically, perhaps, I would have thought there’s quite a wide range of behaviours where the AI can have its behaviour corrected if enough people agree and sign off on that, but won’t help any small group seize power.
One axis is the kinds of actions it will and won’t do. There’s this quite extreme axis of some actions which are really illegitimate and would allow a power grab, versus the kind of actions that we want AIs to flexibly just do whatever humans have told it. So we can just potentially rule out the most extremely illegitimate actions there, and maybe that will go a long way towards preventing power grabs.
Another axis is just human authorisation, where we can have AI systems do more extreme actions, but only when they get to a wider range of authorisation. So I would have been optimistic that it’s at least in principle possible to arrange AIs in that way.
But I actually agree that with a two-year timeline it does become tricky. And certainly when I start thinking through the details of what kinds of cyber capabilities should they or shouldn’t they use, what kind of political strategy capabilities should or shouldn’t they use, it does become very hard to draw the line.
Is a small group seizing power really so bad? [02:00:47]
Rob Wiblin: Yeah. Another objection that I heard from some people in preparing for the episode — and also that I actually got back from Claude when I asked for counterarguments or interesting takes on this — is: is it necessarily so bad if a group of people use AGI to seize power?
I guess if you’re pessimistic about the current track that humanity is on or how the future is likely to go, maybe you think a small group of people seizing power, is this really worse than the status quo or the realistic alternative, which might be a different group of people seizing power later or just some really muddled outcome? It could be very good, could be very bad. Maybe you’re interested in taking your chances. What do you think of that?
Tom Davidson: I think it would be very bad if this happens. Firstly there’s just common sense that this would be totally politically illegitimate and unjust.
Then there’s looking at the real-world autocracies that exist, and noticing that the quality of life for the people who live in those countries is much lower than in democracies, and also that those countries tend to be much less cooperative on the international stage than democracies.
Then there’s just a few more a priori reasons why we’d expect that to be the case. People who actually choose to seize power I think are unusually likely to be psychopathic, narcissistic, sadistic, Machiavellian. And that means that they’re actually a lot worse than just a randomly selected person becoming dictator. You’re kind of actively selecting for pretty nasty people with a power grab.
Then there’s also the fact that when many people have to coordinate on behaviour, and power is distributed, then people tend to coordinate around more ethical behaviour. So even if every person is individually fairly selfish, or every country is individually selfish, when countries come together in the UN and set the rules for international order, they tend to kind of coordinate around more the better angels of their nature and around more fair and ethical courses of action.
So I think actually having power distributed is better than a randomly selected person, because even if every individual is fairly selfish, they’ll tend to coordinate around the more ethical, prosocial motivations that they have.
In addition, there’s gains from trade. So if one person really cares about the environment, and another person really cares about using the stars to create the utility-maximising substance (whatever that is), then those two people can essentially both get entirely what they want if power is distributed and they can trade and negotiate. Whereas if one of them was just randomly selected to take power, then you wouldn’t get that, and you could just have one or the other being completely neglected.
And the final thing I’ll say in favour of distributed power is that I think when power is distributed, you tend to get a kind of competition of ideas, and that can lead to a reflective process, which I think is one contributor to moral and political progress over time.
Rob Wiblin: Yeah. That one might not apply in a future where hypothetically the one person who seized power had the advice of the AI model on what is the best thing to do on philosophy, on moral values. But I suppose you might think even if they could find out what the truth is without having a broader deliberation between lots of people, they might not be interested in doing so; they might just follow whatever capricious values they happen to hold at that point in time.
Tom Davidson: Yeah, that would be the worry.
Countermeasure 1: Block internal misuse [02:04:19]
Rob Wiblin: All right, we’ve been teasing through the interview that at some point we’re going to talk about countermeasures. I guess we’ve said one of the main things that’s reassuring about this, and might be the reason that this doesn’t happen, is that there are a whole lot of things that it seems like you could do, a whole lot of safeguards you could put in place that would be very natural, probably are justified on other grounds anyway, and would potentially make it very difficult for any group to use AGI to seize a lot of power.
Let’s maybe talk about very cross-cutting ones first, like countermeasures that would help to reduce the risk probably across all of these different scenarios — military coup, autocratisation, self-built hard power.
I think the first category you mentioned in your notes is safeguards on internal misuse, and you mentioned monitoring as one of them. What would those safeguards look like?
Tom Davidson: Yeah, let’s take monitoring for example. What that means is that, whenever someone who works at the AI developer is using a model internally, in addition to their using it there’s another AI model or another instance of that same model that is kind of looking at what the humans asked, what the AI is responding. And then it has certain criteria under which it would either shut off the interaction entirely, or more likely just flag it as something which could potentially be problematic, and log it and send it to some other people to review later.
So you could have this via monitoring, or you could just have what externally deployed models have: just trained refusals so the AI just won’t comply with certain requests at all in the first place.
Rob Wiblin: So when you’re monitoring, I was imagining people monitoring it, but I guess I’m living in the past. I should be always imagining why not just have AI models that do this? I guess it doesn’t have to be some brilliant AGI model that’s monitoring it. You could imagine that current LLMs would probably be pretty reliable at picking up or at least setting aside jailbreaks which hopefully we’ll fix; they’ll be reasonably good at picking up requests during the training process that seem highly suspicious, or requests that seem like they’re not something that we would want to be followed.
Tom Davidson: I think that’s right. I mean, to the extent that you’ve got a weaker model that’s doing the monitoring, then there are things that can go wrong there. You can have ways of concealing the nature of the request, because the AI that’s smarter might be able to do that, but the weaker monitor might not. So that does introduce some risk.
But I agree, especially if the human is not that smart, then that could already pose a significant barrier.
Rob Wiblin: Yeah. Is there any monitoring currently, as far as you know?
Tom Davidson: I think it varies from AI developer to AI developer, but my impression is it’s not too hard currently to get hold of a helpful-only model that will just follow any instruction that you want without it being monitored in one of these frontier labs.
That said, I believe that both Anthropic and Google DeepMind have it as part of their responsible scaling policy equivalent that they will introduce more internal safeguards on use as AI capabilities improve.
Rob Wiblin: Yeah. I cut you off. When you were introducing the monitoring idea, what were you going to say?
Tom Davidson: Well, we were just talking about the possibility of more sophisticated attempts to evade monitoring. One such possibility is that, rather than in a single interaction making some request which is clearly illegitimate, you might separate it out into very small chunks — where each chunk individually seems fairly ambiguous, or at least not obviously harmful, but then taken together, they kind of lay the seeds for maybe a strategy for introducing a secret loyalty or something.
So for that, refusals wouldn’t be enough, and a basic monitoring setup where the monitor only can see each conversation individually wouldn’t be enough. So in that case, my colleague Lukas Finnveden has suggested that you could instead have an AI system that’s able to review many different conversations that you’ve had over a long period of time, maybe that many people have had, and systematically search for patterns of misuse.
Rob Wiblin: I guess if groups were really worried about this and very committed to counteracting any potential severe misuse, they could also have human review. I mean, that’s more expensive, but you could have randomly selected review that would make people more nervous that even if they were finding some way through the AI monitoring — perhaps they figured out some jailbreak that allowed them to make requests that wouldn’t be picked up — if there’s occasional human review, then eventually someone might notice it.
Tom Davidson: Yeah, that’s right. I do think you can probably do better than random human review by having an AI monitor flag the most potentially suspicious interactions, so that the human is reviewing the most potentially relevant ones.
Rob Wiblin: Yeah. What specifically do we need to be reviewing? I guess any requests to the purely helpful model? Perhaps that would be a place where someone could try to get advice on how to evade controls, how to introduce secret loyalties. Are there other things that are really important to have this scrutiny of?
Tom Davidson: I actually think ideally employees are generally not allowed to use the helpful-only model. Ideally, just like for people today who use the model externally, there’s refusals training for clearly illegitimate things. Ideally that happens for internal use as well, so there’s safety-trained models that refuse to do broadly illegitimate things. But even there, as I was saying, there could be more subtle patterns of misuse, so it would require additional monitoring on top of that.
Rob Wiblin: Yeah. Just so people understand, I think the people who follow AI intensely might know this, but the helpful-only model is quite extreme. It will help with literally anything. You can say, “How should I best go about committing genocide?” and it will just start laying out instructions. There are basically no safeguards on that kind of model. So yeah, it would help with quite serious crimes.
Tom Davidson: Yeah, that’s right. My own view is that the helpful-only model should be used in capability evaluations and other tests, but beyond that shouldn’t be used by anyone. And any use should certainly be monitored with immediate ability to shut off the interaction and log it for review.
Rob Wiblin: Of course, at the moment we’re not too worried about this, because even the most capable models are still not that impressive — they make stupid mistakes; the advice they give is often pretty mediocre — but at the point where it truly is superhuman, brilliant advice, it’s more important that people not be able to get its help with crimes.
Tom Davidson: That’s right. And there’s many reasons to want those restrictions. It’s not just the risk of people trying to do a power grab. There’s the risk that there might be spies working for other countries within labs that are then using the models for their own ends. There’s plenty of reasons to have internal controls on the model use.
Rob Wiblin: A question mark that I have is, if the leaders of an AI company were already considering a power grab, wouldn’t they be trying to undermine all of these internal controls that they are meant to be implementing from the very outset? Maybe that would apply to other people, but not to them. Or it would look like they had these policies in place, but in fact they didn’t.
So you could just end up with a situation where the only companies that implement these kinds of monitoring and other internal controls are those that were never going to do anything bad in the first place.
Tom Davidson: I’m a bit more optimistic than that. As we discussed earlier, for me, the most realistic threat model isn’t where a leader of the company is explicitly, every step of the way, planning to do a power grab, but more where they’re just following their incentive landscape — step by step, doing whatever most helps them achieve their next goal and get that extra incremental power.
So the point in which we’re first introducing these internal safeguards, it will very likely seem like just a very plausible and reasonable thing to do to prevent any people within the company from misusing the model. And it might also seem completely reasonable for those restrictions to apply across the board to all employees, no matter how senior. Otherwise why would certain people be exempt and not others?
So I think that for the early stages, introducing them initially, I’m hopeful that that could happen, even if certain people in the company might one day pose a risk of power grabs. And by the later stages, the hope is of course that enough companies are doing this and increasingly formalising these commitments that they become legal requirements.
Rob Wiblin: Right. I guess it might be important to get these controls in place and have them be tested and become completely normal, and the kind of thing that will be very conspicuous to get rid of at the point that there’s lots of humans involved in the process. And you might want to have lots of controls, because you’re thinking, we’ve got thousands of developers, thousands of researchers. You want them in place before you get to the point where there’s maybe only five humans left involved and basically all the work is being done by AIs.
Tom Davidson: Yeah, exactly.
Rob Wiblin: So you mentioned people shouldn’t have access to the helpful-only model, and that internal requests to the model should be scrutinised by some other AI model that is going to flag if something strange seems to be going on. Are there any other internal controls that are important?
Tom Davidson: I think the thing we could do is talk about more specifically what capabilities we should be worried about here. It’s not everything that needs to be completely scrutinised.
The first most obvious thing would be AI systems that are capable of designing weapons. That’s something where probably you should just ban any such use. As soon as a request like that goes in, you just shut down the interaction entirely. Unless there’s some kind of exceptional circumstance where there’s a government contract that means it is actually legitimate to use it for that use case.
Rob Wiblin: OK, so you’re just saying there could be a general flag that applies equally to internal and external requests made to the model? That if you’re asking for help designing a bioweapon, then it doesn’t really matter whether you work at the company or not, you still shouldn’t be doing it. And it’s maybe just easiest to figure out, how do we just get the model to always refuse these kinds of things?
Tom Davidson: Yeah, exactly.
Countermeasure 2: Cybersecurity [02:14:02]
Tom Davidson: My next top choice would probably be cyber capabilities. The somewhat scary thing about this is that it would maybe just take one instance of extreme misuse of cyber capabilities that you could then use to undermine the monitoring scheme itself if you kind of hacked into the organisation’s own computer system in the right way.
So that’s an example where really, from the very earliest stage of developing a new model and beginning to test its capabilities, you want to be very carefully monitoring how those cyber capabilities are being used. Because if you don’t, then this whole kind of scheme that we have for ensuring there’s no extreme misuse could be completely removed.
Rob Wiblin: Undermine its absolute foundation. That’s a trickier one though, because there are legitimate uses that you have in uncovering cyber vulnerabilities.
Because we think, inasmuch as AI models are going to be far above human capabilities in terms of discovering vulnerabilities and exploiting them, the only way that we can keep up, the only way that we can block this stuff is to have AI models discovering the vulnerabilities, alerting people and fixing them.
So there does have to be some capacity for people to use AI for good cyber capabilities. It’s not as straightforward as just trying to block any request to build a bioweapon, which probably we would be on board with.
Tom Davidson: Yeah, that’s right. I think it’s tricky, and I think this is where the companies doing a lot of work in advance to really enable those positive use cases while blocking out the risky ones will be important.
But it does seem doable in principle. If you can train an AI model to only do constructive, defensive uses of cyber, and you can check in a lot of artificial environments that in fact it’s never inserting vulnerabilities and only patching them, then you can maybe then test, maybe trust that when the new model is developed that you can have the old model kind of monitor it and check that it’s similarly not inserting any vulnerabilities and only being used defensively.
So you can have this kind of picking yourself up by your own bootstraps and iteratively trusting a new model, and then relying on that new model to monitor the next generation that comes along.
[TD: In addition, you should have many different AI-assisted humans overseeing the early defensive use of a new model, checking it’s being used as intended.]
Countermeasure 3: Model spec transparency [02:16:11]
Rob Wiblin: OK, a second thing you mentioned is transparency about the model specifications. What does that look like?
Tom Davidson: We were just talking about safeguards, but there’s actually going to be loads of really tricky tradeoffs to make over what things should and shouldn’t be allowed. I was just giving the example of cyber.
We could also talk about the example of political strategy. Many instances of using AIs to help with political strategy will seem quite legitimate: helping a company draft some external comms, helping with some lobbying, some of those things might well be acceptable and some of them won’t be.
So publishing the model spec is a way of publishing exactly what the criteria are that your internal AI systems are using to decide what is and what isn’t OK to help with.
Rob Wiblin: So it’s kind of like, I guess, the constitution that Anthropic uses to reinforce its models, and what it will and won’t accept following certain principles.
Tom Davidson: That’s right, yeah. Sorry, let me back up: the model specification describes how the AI is meant to behave.
So in Anthropic’s case, they have these principles, sometimes drawn from the UN Charter of Human Rights, sometimes drawn from other companies’ own principles of behaviour, and some that they’ve come up with themselves. And they have dozens of principles which the AI is meant to follow — like respect all groups, don’t break the law, follow instructions where it’s safe to do so, stuff like that. It’s kind of spelling out and making transparent for the whole world how these systems are actually meant to operate.
But then in this context of safeguards, it’s then essentially also delineating the boundary of what kinds of internal and external uses are and aren’t OK.
Rob Wiblin: I see. So the constitution is the personality of the model that defines the boundaries of what requests it will help with and which ones it will reject.
Tom Davidson: Exactly.
Rob Wiblin: At the moment we have jailbreaks that I think are sufficient to get through any of these protections, at least to a really concerted expert actor. Are you imagining that at some point hopefully that will no longer be the case, and we actually will be able to have model specifications that truly do describe the limits of what the model would be willing to do?
Tom Davidson: I don’t know if we need to rely on jailbreaks no longer working. If you have a monitor of the interaction, then that can protect against certain jailbreaks, because the monitor can see that the system’s been jailbroken and the jailbreak doesn’t work quite as well on the monitor. And there’s other techniques you can use to guard against jailbreaks, like just looking for suspicious strings of characters and removing or modifying them.
My expectation isn’t particularly that just one instance of misuse will be enough for a small group to literally seize power over the United States. So as long as you’re able to keep those defences mostly ahead of the offensive use cases, then that could be sufficient for the whole system to be pretty robust overall.
Rob Wiblin: Why is it that transparency about the model specifications is especially useful?
Tom Davidson: Great question. Without transparency, there’s potential for the model specification to contain ambiguities or weaknesses that wouldn’t be highlighted.
So let’s take a really simple example, like the military use case. Maybe without a transparent model specification, the specification says that the AI should follow the president’s instructions, but it also says that it should follow the law. But it doesn’t drive into the details of how to resolve the potential tensions between those two parts of its model spec.
And currently the way that Anthropic’s model specification works is that there are all these different principles, but they don’t really nail down how they should be resolved in cases of conflict.
Whereas if the model spec was made public, in this example many people might realise, “Wait a minute, we’ve got these AIs controlling military systems, and the model spec doesn’t really say how they’re meant to resolve these two principles. That’s not really good enough. We need to change the model spec to be much more explicit about these edge cases.”
And similarly inside the company, maybe without transparency, the model spec just says, “Follow the instructions of the CEO,” because they’re ultimately in charge of the company when it comes to matters within the company. But it doesn’t say, “You shouldn’t do that when they’re telling you to hack the internal computer system and introduce some vulnerabilities.” So again, if that’s made public and it can be scrutinised, then potential weak spots — that may be there completely accidentally — can be identified and then improved upon.
Rob Wiblin: OK, so we not only need the model spec to be transparent and to be published, we need it to be very thorough, so people can understand the thorniest cases that it’s going to have to deal with. And we also need external groups to be looking at it and reading it very closely and trying to figure out the weaknesses within this model spec that we need to be complaining about in order to get them fixed.
Tom Davidson: That’s right. And the hope is that a first step of making it transparent could lead to those further improvements. If it’s not thorough but it’s transparent, people can point that out. And if it’s transparent but people aren’t really looking into it, then it’s then relatively easy for someone else in the world to be like, “OK, I’m going to do a thorough analysis.”
So my hope is that actually this is an ask which is broadly very reasonable: we’re just saying, “Can you please disclose the information that you have about how your AI system is meant to behave so the rest of the world can understand?” It’s quite similar to asking someone who makes food to say what the ingredients are that go into this food, because that’s just relevant to consumers.
Rob Wiblin: I guess I have the same worry with this as I somewhat did about the internal safeguards: at the point that people were considering misusing the AI, why wouldn’t they just start lying on the model spec? It seems like you not only need transparency about the model spec, but you also need external auditing; lots of checking to make sure that it’s accurate; restrictions on people’s ability to be misleading or to obfuscate things in the model spec, or have one model that follows this specification but then they back out to a somewhat previous version.
We’ve said maybe people shouldn’t have access to the fully helpful model, but presumably there’ll be a bunch of different gradients and models that people do have access to, some of which are more helpful or less. And I guess it’s hard to write up a totally thorough model spec about every single intermediate model that you’re creating.
Tom Davidson: Yeah, I do think there’s wiggle room of this kind. And you’re right that merely publishing the model spec isn’t enough because of, as you say, the possibility of lying. I do think it’s going to be pretty costly for a would-be power-grab person to flat out lie about the model spec: at that early stage, if that lie gets discovered, it’s going to be very costly.
And so I sometimes think about how we’re kind of shaping the incentive landscape for someone who is just generically seeking power at one of these organisations. If there’s a strong norm of publishing the model spec, and these pretty-hard-to-argue-with reasons for it, then at that point it just becomes more costly to take those actions that would increase their own power — and they might go down a different path, where they never end up forming the explicit intention to create a secret loyalty or try and seize power.
Rob Wiblin: It seems like we actually already have a developing norm that many external groups are involved in producing these documents for existing models when they’re released. I know OpenAI has given access to many external consulting groups basically, that do their own testing to figure out what their models will and won’t do, and then they include the results that those external groups find in their model [system cards]. So I think that’s something that we could really develop and is very helpful.
Tom Davidson: Yeah. There’s one thing where they’re including the results of capability evaluations in the model card that’s alongside the model. But I think you’re also right that even for the model specification that lists the behaviours, that will be influenced by the testing that those other organisations have done.
I believe that Anthropic has crowdsourced some of the principles in its own AI constitution, which is its name for the model spec. And I think that as time passes it will become increasingly important for a wide variety of actors to feed in on what that model spec should be.
Rob Wiblin: I guess in this context, folks normally talk about transparency about model capabilities, rather than the limits of what requests the AI will accept and not accept. Is transparency about capabilities also particularly key here?
Tom Davidson: Yeah, that’s right. So the model spec transparency is transparency about the model’s inclinations to do certain behaviours. I think what’s really new here is that even for the internally deployed models, we should have transparency about that, so the rest of the world can have an understanding of whether secret loyalties could be introduced, for example.
But as you say, transparency about capabilities is also really important. That’s going to be, I think, a key thing that allows the rest of the world to realise that maybe these risks could be very real very soon, and take more actions to then require more information or do more auditing.
And in addition to capabilities, I think transparency about risk assessment: what does the company’s process look like for assessing the size of these various risks — like the ones we’ve discussed today and otherwise — and what mitigations are they taking, and what is their evaluation of how convincing those mitigations are going to be in terms of eliminating the risk?
Countermeasure 4: Sharing AI access broadly [02:25:23]
Rob Wiblin: OK, a third category you mentioned is sharing AI as broadly as it’s safe to do. This one’s particularly interesting, because I think on this show, to be honest, we’ve probably more often talked about limiting people’s access to AI and limiting what things are released and deployed. So this highlights potentially quite tough tradeoffs and conflicts between people who are focused on human misuse and human power grabs versus AI misalignment.
Why is it particularly useful to share AI capabilities as widely as possible?
Tom Davidson: It’s probably best explained by going through some examples from the threat models we were discussing earlier.
So let’s talk about the autocratisation threat model. A big driver of that risk was that one political faction had this privileged access to AI persuasion and strategy capabilities, and for that reason was able to outmanoeuvre their political opponents and build huge public support, and ultimately remove many checks and balances on power.
If, though, we had been in a world where actually there was a strong norm to sharing those capabilities as widely as possible, then that could mean there’s broad public access, and equal access for different political factions to those same capabilities. Then it might be a lot easier for opposing interests to prevent clever manoeuvres and maintain a broad balance of power.
In terms of going back to secret loyalties, which I think is so essential, within the organisation itself: if a small group has access to strategic advice or cyber capabilities that the rest of the organisation don’t have access to, then that could allow them to outmanoeuvre the rest of the organisation to find very clever ways to insert secret loyalties into the training run.
Whereas if always, throughout the process, all employees had equal access to AI capabilities, then that would just be a lot harder — because then those employees could speak to the AIs themselves and realise it does seem like it could be possible for someone to manoeuvre and insert secret loyalties, so we’ll take these preventative measures.
It’s a lot harder to see, if there’s always this broad and equal access to capabilities, how you could get a tiny group inserting secret loyalties or seizing power via another mechanism.
Rob Wiblin: Yeah. So an important step in the entire seizing-power process is that there has to be some gap between the capabilities or the access that a group of insiders have versus outsiders. And that could come about in various different ways.
The one that we most often think about is that there’s a very rapid takeoff because of recursive self-improvement. So one AI model, through recursively self-improving, gets significantly above all the other competing models, and the advice that other folks are getting, and the countermeasures that they might be able to implement.
But we could also imagine that this happens because of restricted legal access: that a government takes over a company, says AI is very dangerous now, and it limits people’s access to other models that are anywhere near competitive — so people only have access to AI advice that’s many generations old and obsolete by that point. And then they’re in a great position to potentially use that advice to gain more power.
Tom Davidson: Yeah, I think it is a worrying scenario. There’s just going to be a lot of very reasonable-seeming and maybe partially legitimate reasons to restrict access to AI. As you’ve said, many people worried about AI risks have said that access should be massively limited.
And if you make models public, then that means that other countries could get access to those capabilities. So I was talking about autocratisation and how it’d be good to make strategy capabilities widely available so there can be a balance of power. But you might not want, for example, China to have access to a highly strategically capable AI that could help them with military strategy.
Another difficulty, especially with o1 and the high inference costs associated with it, is that it may be that just by using more compute and spending more money, you can actually access significantly better capabilities — so it won’t be enough to just give API access to a wide group.
Rob Wiblin: You would also have to have access to lots of computer chips.
Tom Davidson: Exactly.
Rob Wiblin: And that’s more expensive.
Tom Davidson: That’s expensive. And that gets really hard, because that comes down to how we live in a capitalist society where people have more money than other people — and that will translate potentially to being able to just have access to potentially significantly better capabilities.
This is also potentially a risk within AI developers, where it could easily be the case that people who are more senior have access to more runtime compute, or have more say and influence over how the compute is allocated between projects. And that does open up the risk of abuse, whereby if those people are trying to seek power, they could continuously allocate more compute to projects which, perhaps unbeknownst to others, are actually in part being used to take steps towards increasing their own power.
Is it more dangerous to concentrate or share AGI? [02:30:13]
Rob Wiblin: So for listeners who are more concerned with other kinds of misuse or are concerned about rogue AI misalignment, I can imagine them thinking this is a step in the wrong direction: that having people have access to very great capabilities through open source models, or just ensuring that almost everyone has access to things that are close to the cutting edge, that this makes things more dangerous.
Do we just have to take a stance on the relative risk of these different threat models — of the risk of power grabs versus misalignment versus more mundane misuse — in order to say whether this is on net helpful or harmful?
Tom Davidson: I think often, through targeted efforts, we can kind of get the best of both worlds. So it’s not a case of either only a small number of people in the organisation can use the capabilities, or they’re fully public. You can take capabilities one by one and think very carefully about who needs access in order for no small group to gain disproportionate amounts of power.
So talking about strategy capabilities and autocratisation: you could give 100 employees access to the top strategy capabilities; you could give 100 members of each political party in Congress access; you could give people in the executive access, in the intelligence services access, and people in the judiciary access. And that might be sufficient to have checks and balances on political power, but you haven’t made it public.
And let’s talk about cyber capabilities: you could give access to those cyber capabilities to the military, who wants to use them for defensive purposes; to multiple different teams within the AI lab; to the government; to various other key organisations — without broad public access.
And for the key dangerous capabilities that people have been most worried about from misuse, like bio capabilities, you can just fully restrict those.
Rob Wiblin: Yeah, I like your idea of how what we really need is access by the various different groups that have enough power now that they serve as checks and balances on one another. In practice it’s more useful that Congress or the judiciary or the military has access to these capabilities in order to check the executive, say — because in practice they’re the groups that actually are close to having the capabilities, the legal authority, the wherewithal to check other institutions in society.
There are also groups that are potentially powerful enough today that they could put their foot down and insist on having access. They actually have the clout that, at an early stage, they might be able to say that it’s not reasonable that only one part of the government should have access to these cutting-edge capabilities; that we do need some parity, some sort of balance between these different institutions. It’s a very reasonable argument, and one they might be able to win at a sufficiently early point.
Tom Davidson: Yeah, exactly. Completely agreed.
Is it important to have more than one powerful AI country? [02:32:56]
Rob Wiblin: So we’re in the UK, we’re in London. I think the government here recently released a plan to develop its own compute clusters, its own cutting-edge models that it would ensure that Britain had continued access to. I think there was both an economic thought here and a national security thought here.
Are you glad about that? Is that something that’s useful to have? Dissemination of capabilities not only within different groups within the US, but also between countries?
Tom Davidson: Yeah, I am glad about that. I think it gives an insurance policy against something awful like this happening in the US — in that if that did happen, then it means that the US would find it harder to totally dominate other countries. Because they would have, as you said, a lot of computer chips that they could use to run AI models that could give them cognitive labour, so there wouldn’t just be the US that has that.
I think if there is at some point a kind of a multinational project, it would also be good for the weights of the AI that’s being developed to be stored separately in multiple countries, so it’s not possible for the US to just seize the weights and then cut off access to other countries from that.
I also think that if other countries have more bargaining power from the get-go, then they’ll be able to influence the US in a more cooperative angle as it develops, because in terms of incremental attempts to increase its own power relative to other countries, that will seem less tempting if other countries continue to have bargaining power.
Rob Wiblin: I guess the US is likely to remain in a pretty dominant position; even if the UK tries to build its own compute clusters, it’s just a much smaller country to start with. I guess for that to be truly helpful defensively, you need to have a world that’s somewhat defence-dominant, where you don’t actually have to have parity in compute in order to be able to defend yourself. Maybe you get most of the returns from a relatively smaller amount of compute, and that can allow you to at least defend yourself against hostile actions.
Maybe this isn’t the right way of thinking about it, and instead we should think about it as more about bargaining between different groups that are somewhat friendly early on.
Tom Davidson: If the US does want to be aggressive and increase its own power, it will probably be unwilling to take massive economic hits from that. So even just trade is going to be a big lever here. The whole semiconductor supply chain is very distributed across the whole world, so it would be very costly for the US to go all-out and try and dominate itself, to the detriment of the rest of the world.
Even if it could in some sense succeed, just by being very militarily aggressive, that would be very costly from the perspective of the US — because it would probably delay its own growth by a long time, because it wouldn’t be able to rely on that existing semiconductor supply chain, and would kind of massively economically weaken it.
So I do think even marginal increases in the extent to which the rest of the world is influential in AI could make quite a big difference to the incentive landscape that faces the US.
In defence of open sourcing AI models [02:35:59]
Rob Wiblin: There’s been a bunch of discussion among safety- and governance-focused people about the possibility of trying to centralise the development of AGI under one international CERN-style scientific project to develop AGI. What do you make of that? It sounds like that’s the kind of thing that you might be a bit wary of.
Tom Davidson: I do think that the considerations we’ve been discussing today point against it. I mean, I’m not saying that there should be 10 different development efforts, but even just a second one provides a real independent check and balance that you can have, where you can have one AI system developed by one project checked to see whether there’s any chance of secret loyalties having been developed in another one.
Whereas you just wouldn’t have that check and balance, because the only really capable AIs you’d have otherwise would be all produced from the same project, and then all potentially have the same secret loyalty problem. And similarly for monitoring internal use and checking for misuse there, if you have these two separate developers, then you can get a much more independent check and balance.
I do think that within a single project you could potentially still get those benefits. You could potentially have independent teams within that organisation training AI systems completely separately. So you could try to have a centralised project that kind of maintains that balance of power.
Rob Wiblin: I do think it’s funny; I feel like the discourse that we’ve seen the last year or two has mostly been, at least on my Twitter feed, between governance, safety, misalignment focused people — who really are looking to restrict access to compute, have fewer models released, maybe fewer projects, because it’s easier to govern when there’s fewer people with access — versus folks I guess more focused on open source, who have a more sceptical attitude towards government, more libertarian, maybe more tech industry focused in some cases, who have been saying that’s very dangerous; we’re very concerned about the centralisation of power.
And I guess that hasn’t gotten a lot of sympathy from me or from other people who are worried about risk from AI. But perhaps the open source people haven’t been picturing this exact scenario exactly — maybe some of them have, but most of them probably haven’t — but they have this impulse which is we really don’t want to see one big company or just the government that has access to this cutting-edge technology while everyone else is cut out of the picture.
And that actually might be quite a healthy and sensible intuition to have, even if you don’t know exactly how it’s going to go wrong. So perhaps we should have a little bit more sympathy for those folks, even if they sometimes seem off-base to me.
Tom Davidson: Yeah, I agree. And I do think that the AI safety community has historically been overly bullish on how important it is to restrict capabilities. If I recall correctly, people were worried about open sourcing GPT-2 because they were worried about misuse. And I think that was just misguided. I think there have also been some claims that have been over the top around the risks of roughly current-level systems posing risks around bio.
So I do think we have not been sufficiently sympathetic with the open source community. I’d reiterate that I do think the considerations we’ve been discussing today differ from open source in that they really push towards moving from one project to maybe two or three projects. They don’t really push towards moving to 10 or open sourcing superhuman AI, because you can already get that check and balance with just a small number of projects.
Rob Wiblin: Yeah, I guess you’ve also talked along the way about how you would need to have restrictions on people’s ability to request assistance with building bioweapons. I suppose you think maybe we’re too worried about that with current models, but at some future time these models might actually be helpful — and if the weights are completely open, then it’s going to be very difficult to have any restrictions on what ends people might turn them to.
Tom Davidson: Yeah, that’s right. But if I’m really trying to take the open source community side here, I might say that it’s all very well saying that really you’re only concerned about future systems. But then if in practice, you also seem to be worried about current systems and overhyping the risk from those — and in practice, when we regulate technologies we do often lose fine-grained control and the regulations end up being misplaced — then you could say the AI safety community has been pushing for things that in practice would have restricted technology prematurely.
Rob Wiblin: Yeah. I want to stick up for the people who are a little bit worried about GPT-2 for a minute, because I’ve heard this line. I feel like at each stage that you develop the new model, it’s at least worth asking the question, “Could it be dangerous now?” It’s very obvious to us in retrospect that GPT-2 was pretty benign, that it wasn’t going to be used for spam in any super dangerous way, or I guess people were worried about misinformation and generating all kinds of text that could be used for harmful purposes.
I don’t think anyone was worried about extinction risk at that stage, but I think it was at least worth checking that out before you released it. I don’t know whether there were people who really were like, “We absolutely can’t give people access to this model. It’s going to be so dangerous; it’ll be a nightmare.”
Tom Davidson: Yeah, I’m more sympathetic to thinking that, if you can be very confident there’s not going to be any irreversible catastrophic risk, your default should simply be to release the model. Then there’s more people that can benefit economically, you can diversify control of the technology, and even safety research is massively enhanced by open source.
And then I agree, once there’s even a small risk of an irreversible catastrophe, then I think there’s a strong argument for the precautionary principle and actually going very slowly. But it does seem to me with GPT[-2] that there was no such argument, and therefore it would have actually been reasonable just to open source straight away.
Rob Wiblin: Yeah, interesting.
Tom Davidson: In addition, I’ll say that I think in fact safety research might have been massively enhanced if GPT-4 had been open sourced from the off, just allowing academics all over the world to do interpretability research and other kinds of safety-related research.
Rob Wiblin: Yeah, it is funny. I feel like Meta’s analysis of the open weights question and the various tradeoffs has been naive, or has not been very high quality in my mind. But nonetheless they put out Llama 2, and I think that was probably for the best, because it’s been really useful for safety research while causing almost no harm. And probably I think even in future, as people look back and think maybe they can try to squeeze more capabilities out of Llama 2, it’s not going to be a major concern.
So perhaps the worry there, on my part at least, is that even though releasing Llama 2 was probably for the best, if you’re not thinking clearly about the different tradeoffs and what limitations you might need to put in place once certain capabilities are there, then I just don’t know whether the decision making will be good in future.
Tom Davidson: I think that’s a fair concern. Again, if I’m playing devil’s advocate, I’d say often people who consider hypothetical risks, when they’re making that consideration, they think differently than if the risk is actually in front of them.
Many people were kind of dismissive of AI risk five or 10 years ago, and even if you said that there’s a chance that we’ll have really powerful systems, then don’t you think they might have still dismissed your argument, because they’re just suspicious of this kind of hypothetical-scenario type of argument? But then when GPT-4 came out and people have seen the technology, a lot of people do in fact now take the risks much more seriously.
So I don’t know how the open source community and other people are going to be as capabilities become much more intense, but I think there’s some chance that actually the apparently kind of unreasonable attitude that some people seem to have when they’re kind of blindly dismissing these risks will actually turn out to be less problematic.
Rob Wiblin: Yeah, you could imagine that if they actually do testing of Llama 4 or 5, and they find that it is incredibly helpful in developing a bioweapon, then that might focus the minds of the folks there about whether they really want to open weight it or not.
Tom Davidson: Yeah, that’s right.
2 ways to stop secret AI loyalties [02:43:34]
Rob Wiblin: I think perhaps I haven’t done quite enough to focus this conversation on the secret loyalties issue — because I think that in some ways is the most severe vulnerability, because it can be introduced quite early and then many of the countermeasures that you might use are ineffectual because you have the AI model’s help in innovating them, ensuring that the secret loyalty can’t be revealed. Basically all of the future models are going to have this vulnerability embedded in them by design.
Are there any interventions that we can use to focus on the secret loyalties problem in particular?
Tom Davidson: Yeah, there’s two that I have in mind.
The first is inspections of the model to make sure that there aren’t secret loyalties. I think this is a very early-stage technical research problem, but essentially, in principle, there’s various types of technical checks you could do to see if there is a chance that secret loyalties are present in this trained model.
The most basic is probably behavioural testing of the model after it’s been trained: give it loads of different inputs and see if you can, under any scenario, kind of trick it into revealing its secret loyalty.
Another one would be looking at the data and the algorithms used to train it, and looking for patterns of data there that could encode some kind of secret loyalty. As we alluded to earlier, this is tricky, because there’s loads of different code words or triggers that you could use to elicit the undesirable behaviour. So that is potentially challenging.
But on the other hand, if you have what’s called “white box access” to the model, you can search for inputs that would produce particular types of outputs, or you can do interpretability in terms of the model weights, and there might be more hope there for actually detecting secret loyalties.
And maybe the most promising is actually monitoring and then inspecting the process by which the algorithms and data inputs were gathered together in the first place. So if there’s a very structured, step-by-step process by which the training data is procedurally generated by code, and you can inspect that code and you can inspect how that code was written step-by-step, then you might just see that each of these steps is so reasonable and so standard that there’s no point at which someone has been able to introduce a particular effort to subvert the training process; there’s no point at which someone has been able to throw in extra data that would systematically produce a sophisticated secret loyalty. So that could potentially give us some level of assurance.
Rob Wiblin: In my own mind, should I be imagining that the secret loyalty is introducing this kind of ambient bias in the advice and the actions of all of these AIs across society as they’re embedded into decision making and businesses and the military and so on? Or rather that they all act completely normal for some period of time until they kind of flip completely, because they’re going to strike and enable a coup by the group to which they’re secretly loyal? Seems like there are quite different pictures of what this might look like.
Tom Davidson: I think it could be either. Probably the most worrying version is where the model is just secretly aware on some level that it’s trying to advance the interests of one person and then picks whatever strategy it judges will be best.
So maybe in some scenarios, like the autocratisation scenario, it’s speaking to lots of politicians and advising them, and it does kind of bias its advice on a daily basis in order to promote one person’s interests. But then when it’s being deployed in the military, the strategic situation is very different, so maybe in that context it more makes sense, as you say, to completely behave until the one moment where you start executing a coup.
And if we’re talking about truly superhuman intelligent AI systems, then there could be a risk that they can do both of these.
Rob Wiblin: Yeah, they can figure out which one is best depending on the circumstances.
So it seems like there’s broadly two different approaches that one might take to heading this off early on. One is advancing the state of interpretability or our ability to scrutinise models after the fact, and then see if they have any secret loyalty. Trying to detect that better than we currently can.
The other I think of as more access control or looking at the training data that produced a given model. And even if you couldn’t tell after the fact whether a model does have a secret loyalty, if you could literally look at all of the reinforcement, all of the materials that were fed into it to begin with, presumably, in order to create a secret loyalty, you would have to actually reinforce it at some point for going along with that.
Tom Davidson: Yeah, that’s right. I’d add that we could do an extra level of check here, where we could have a team within the organisation tasked with actually trying to produce a secret loyalty and trying to sneak it past the inspection process. That team could use the most sophisticated techniques possible to try and insert a secret loyalty, and then if they’re able to sneak it past the inspector, then that would give us reason for alarm. But if consistently and robustly the inspection process is able to detect this, then actually that might give us some real confidence.
I should back up and say that everything we’ve been discussing is really falling for me within this inspecting the model behaviour, inspecting the inputs to the model behaviour.
There’s a whole other class of interventions that are needed for secret loyalties on the internal infosecurity side. So even if you’ve got the most amazing inspection process and you absolutely guarantee that the model you trained has no secret loyalties, there’s then this more basic worry that someone just hacks the computer system and then is able to sub out your perfectly safe system and then sub in another system that they’ve trained secretly off to the side which does have a secret loyalty.
So to robustly prevent secret loyalties, we need to be kind of reliably winning on two fronts: firstly, the technical assessment of the AI system that was trained; and then secondly, on the internal infosecurity — making sure that no insider threat is able to compromise the computer system in a way that allows them to alter the model that’s being served.
Rob Wiblin: I think there’s three fronts, maybe more of them? So you’ve got to make sure that the model that’s being deployed hasn’t had a secret loyalty put into it. You’ve got to ensure that nobody can get access to the purely helpful model, where all you have to do is just give it the instruction and it’s going to go out and do the thing. You also have to make sure that there’s no flipping, no switching of the model. I don’t know how difficult or easy that is, but someone could introduce a secret loyalty into a different model and then make that the one that’s deployed, and then make it difficult to pick that up.
Tom Davidson: That’s right. I mean, if you really do very good inspection of the trained model and very good internal infosecurity, then even if someone gets access to a helpful-only model, you might say that your other two defences should catch that — because at some point, that helpful-only model will have to either compromise the internal infosecurity to do this swapping in/swapping out, or it will have to insert a secret loyalty into another model which is being trained, which then maybe you could spot when that happens.
In principle, yeah, you definitely do want to lock down the helpful-only model as well, because that would just be a big risk factor. But if you get the inspection and the infosecurity right, then that should defend.
Rob Wiblin: Do you have any sense of how close the companies are to having any of these things implemented?
Tom Davidson: My impression is that things are at a very early stage in terms of having a robust technical process for inspecting for secret loyalties. We’re only just recently seeing the sleeper agents paper from Anthropic, and it’d be a very natural extension of that to then look into techniques for detecting sleeper agents and beginning this kind of cat-and-mouse game about whether they can then find new techniques to hide more subtle sleeper agents.
On the infosecurity side, my understanding is that these organisations are not where they want to be in terms of preventing their model weights being exfiltrated. And my guess is that, to the extent they are really focused on infosecurity, they’re mostly focused on stopping that from happening. They may be a lot less focused, or not at all focused, on the internal infosecurity that would allow this sabotage of the training process or the swapping in/swapping out more at the end. So there might be very little work on that happening, or none.
Rob Wiblin: Yeah. I think all of these companies ultimately started as technology companies, and probably their information security is very good by the standards of a tech startup, which I guess is appropriate given what they’re doing.
But I think it’s not normal for a tech company to think that it’s absolutely essential that we have all of these internal controls on what our own staff, our own researchers, our own CEO could plausibly do. That’s quite an unusual circumstance that maybe you might see more in banking or in the military, but not so much normally in a tech company. So it’s quite a shift of frame, and probably requires quite a lot of stuff that is not standard in their industry.
Tom Davidson: That’s right. Although it’s interesting, because these companies are increasingly being quite explicit about the fact that they expect to develop superhuman capabilities in the next few years. So there really should increasingly be a realisation that the infosecurity needs to improve.
Rob Wiblin: When I spoke with Carl Shulman about issues in this general direction, he pointed out that, inasmuch as any country wants to deploy AI in the military in order to remain competitive, it’s extremely important from their point of view that they be able to detect secret loyalties, and that they’d be able to detect any abnormal behaviour that might be possible to trigger — because that could just be completely catastrophic from their point of view.
Especially, you could even imagine that they might have been inserted by a foreign military that was trying to introduce some codeware that they could use to deactivate the other side’s military. That’s a scenario that’s so unacceptable that you wouldn’t accept even a low probability of it. So you really need to be able to inspect the training data, make sure that that kind of thing is not the case.
Is it possible to get big government grants, big military grants? This seems like the sort of agenda that DARPA might be able to fund, or IARPA perhaps. Because it is just so important to the sorts of applications that governments might want to make of AI.
Tom Davidson: Yeah, it’s a great point. I think there should and will be very wide interest in this problem. There’s an existing research field into backdooring AI systems and lots of papers written about different techniques for introducing backdoors and detecting them. So that’s a research field which I imagine could absorb more funding and would be relevant to this.
Then recently, as I said, Anthropic had the sleeper agent problem. And that’s getting into more sophisticated types of backdoors, where the AI is purposefully being deceptive the whole time and choosing when it should or should not reveal its hand. And it does seem like there’s room for a lot of research to be done into different techniques here, different ways of detecting it. It does seem like something that could potentially be scaled up a lot.
Rob Wiblin: Do you want to make a pitch for engaging in these kinds of policy changes and practice changes to people who are maybe not convinced about the power-grab scenario in particular?
Tom Davidson: Yeah, absolutely. Possibility of power grabs aside, AI is going to be a pivotal technology. It’s general purpose and it’s going to be deployed all over society. And it’s really important that it’s a secure technology that no foreign adversary, or self-interested individual, or political extremist that just wants to promote their own ideology is able to affect the values and the goals and the behaviour of AI systems in a way that the rest of the world isn’t aware of.
Essentially, as a matter of basic democratic oversight over this pivotal technology, it’s really important to secure against secret backdoors, to secure against a tiny group getting excessive access to these capabilities. So I think there’s a really broad swathe of reasons to support efforts in that direction.
I’d also add it’s just a very neglected type of work. There’s very few people that are working on it, fewer even than work on the risk of misaligned AI causing massive catastrophes. So it does seem like it’s neglected, it’s hugely important, and it might well be very tractable to work on. The research problems we’ve been talking about, it seems like you could just get going and start looking into ways of detecting sleeper agents, ways of detecting backdoors.
Preventing AI-enabled military coups in particular [02:56:20]
Rob Wiblin: Are there any policies that are helpful for preventing the kind of conditions under which you could just see a swift military coup in particular?
Tom Davidson: I think we could reiterate the things we discussed earlier, which is: publish the model specification for all your military systems so they can be scrutinised; do exhaustive tests for secret loyalties; especially in the case of military systems, exhaustive efforts to secure them in terms of their cyber defences.
I think also having diverse AI developers involved in that process, diverse AI developers checking each other’s work, and maybe having different military systems controlled by AI systems that are developed independently wherever possible.
Having AI systems deployed in the military be almost as dumb as they need to be is another axis here: if an AI system is really much cleverer than it needs to be, then there might be greater risk that it’s smart enough to actually be acting with a secret loyalty; whereas if it’s only as clever as it needs to be and no more, then it actually might not be able to execute a sophisticated strategic deception in the way that we can’t catch through testing beforehand.
There’s the idea of having a kill switch, whereby in addition to the usual chain of command, there’s another body of trusted people — maybe it’s the judiciary, maybe it’s part of the executive or Congress — that has a kill switch where, if they can see that AI is being used to do a coup, they can give a very secure signal which shuts all those systems down. That would have to be something which is very tamper proof.
Rob Wiblin: Yeah, exactly. Otherwise it won’t get you very far. I guess that would just be the group that would be couped first, basically.
Tom Davidson: Yeah, that’s right. You don’t want foreign adversaries to be interfering with that.
Another idea is to just have a very general principle against military systems firing on Americans. That might be something which is very easy to test and make robustly the case, and might then be harder to find ways of secret loyalties getting around that, and that could be a pretty acceptable principle to have in general.
Rob Wiblin: Yeah. I guess just at a basic level it seems like it would be quite helpful for Congress to pass rules saying that any use of AI in the military, or at least AI that’s highly autonomous, we have to be able to demonstrate to a very high standard that it doesn’t have secret loyalties.
I think that would be a shot in the arm to people who, for commercial or military contractor reasons, want to develop and want to eventually sell that kind of equipment to the government. They’re going to have to figure out the science of uncovering secret loyalties, and how can we ensure that we understand how this equipment is going to behave under different conditions. It seems like quite a sensible rule to have, and one that we can get in sometime before we’re right in the middle of the military embedding AI everywhere.
Tom Davidson: Yeah, I completely agree.
And then there’s just a question of, if it’s already happened that the superhuman AI within the labs in fact all have secret loyalties, will it be possible then to actually derive really strong guarantees? That might involve constructing these systems without making use of those superhuman AIs — and that might be very costly or it might not be. I think this is pretty unknown.
But ideally, if people within these organisations are able to foresee that there’ll be these strong demands for secret loyalties to not be a risk in military systems, they might foresee that in order to give those guarantees, they’ll have to maybe much sooner have robust processes that prevent those from being inserted within their own organisations.
Rob Wiblin: I think another broad point that I would make is that, inasmuch as people are arguing that we need to implement AI in the military quickly in order to remain competitive with other countries — I guess China is almost certainly going to be the case that people will talk about — that makes sense at an abstract level, but I feel like people need to actually be precise about what do they think China is doing and when and what capabilities do they think that they’re going to have at what point.
Otherwise you can be the one that triggers the arms race basically, and you can also just end up going far ahead of what is necessary in order to maintain your competitiveness. If there’s this risk, we want to go basically the smallest direction towards embedding AI in the military that is sufficient to remain competitive and be able to deter people and maybe remain superior.
But that actually might not be very far, because I don’t get the sense that, at a nuts-and-bolts level, China is embedding AI all through its military at the moment. I suppose people in the intelligence services might know more about how much they’re spending on it. But I would really want to demand evidence that China is doing something before we were like, “We have to roll this thing out urgently in order to keep up.”
Tom Davidson: Yeah, that’s true. And I think especially with the new wave of export controls, it does seem pretty plausible that China will be quite far behind on AI, and therefore there won’t be an imminent risk of them deploying AI in their militaries and massively powering themselves up.
One thing that could be difficult here is that if we’re going very slowly into pulling AI into our own military, then that could increase the risk of self-built hard power. Because if alongside that there’s this whole new industrial base and all these robots that are being quickly spun up, then if you’re not deploying AI in the actual military, but you’ve got all this hard power that’s being developed in the broader economy, then there’s that increased risk that actually this non-military force would actually be able to quite easily overpower your military. So balancing those risks could prove difficult.
How listeners can help [03:01:59]
Rob Wiblin: OK, so those are some of the countermeasures that ideally companies and governments would implement.
Thinking now about our listeners, what can they plausibly do in order to help with this agenda? It sounds like there’s only a handful of people thinking about it or researching it, as far as you can tell. Where could people go to usefully contribute?
Tom Davidson: Yeah, great question. I think with some causes, just raising awareness is unambiguously good — but I don’t think that’s the case here, because of the risks we’ve discussed about this becoming polarised.
So my general recommendation is that if you’re talking about this in a very public, prominent place, then certainly don’t talk about it in a way which could be polarising or pointing things at individuals. And often, if you want to talk about the mitigations here, there’s plenty of reasons to want these things without even talking about the risk of one person seizing absolute power.
One thing that I think anyone could contribute pressure towards is transparency, as we’ve talked about, into capabilities, into the model spec, into the risk analyses that organisations are doing. And that’s something that can be justified on many grounds. So pushing for that transparency and then actively using the transparency that is there to evaluate, are these mitigations sufficient?
Rob Wiblin: If someone wanted to pursue tackling this problem as a career, where could they go? Are there any groups that are hiring? I suppose there’s many AI policy/governance organisations now. I think most of them don’t have any programme on this, or this is not really a focus — but I guess you could try to persuade them that it’s worth hiring someone to work on this if they’re working on adjacent things? Try to persuade them this is a bigger issue than they might have thought before. Do you have any thoughts?
Tom Davidson: I think at the moment the best place I would suggest would be the frontier AI developers themselves. There’s things that you could work on full-time there and there’s also things that you could push for using any influence you have within the company.
In terms of projects you could work on with a lot of your time, we discussed earlier how it’s important to have the model spec for internal use be made public so that people can see what will these internally deployed AI systems do, what won’t they do, what are the boundaries there.
But there’s a whole lot of complexity there in striking that balance in the particular cases, as we’ve discussed. So in the case of, let’s say, AI R&D, you want your models in some contexts to actually help you out doing research into sleeper agents and the mitigation. In some contexts you want the internally deployed GPT-5 to help you out with that research, but in other contexts you absolutely don’t want it to help you out, actually introducing that secret loyalty to a system that’s actually going to be deployed widely in society.
So I think that there’s a lot of work in actually nailing down what that model spec could look like, and what processes could we have for internal deployments reliably not enabling extreme misuse and power concentration, but still being kind of really helpful to people internally. So there’s a kind of speccing out what that model specification should look like.
Then there’s also thinking through the technical details of how we can actually train these AI systems. I think over the next few years, increasingly these frontier AI developers will be training agents that are increasingly autonomous, to edit their codebases, do a lot of work. There’s a chance that there’s neglect of the importance of balancing the way that these agents behave, and finding out ways to actually train these agents in a way where they can be really helpful, but they don’t pose this extreme risk of misuse.
How to help if you work at an AI company [03:05:49]
Rob Wiblin: Fortunately we do have quite a lot of listeners who work at the leading AI companies. Do you have any other asks for those companies, or staff working there?
Tom Davidson: Yeah, so that was one kind of technical project. Another one is getting more detail on the nuts and bolts of what I described as sharing capabilities as widely as possible. I think that if capabilities are shared widely within these AI developers, then that would provide a really strong bulwark against the introduction of secret loyalties.
But again, there’s complexities to be ironed out here, in that probably everyone wants to have some compute that they have access to and that they can use, but then there will be certain people that are maybe running large research projects or some people who are more senior that do have increased influence. So how can you structure things to prevent certain people or certain teams ending up in practice with a lot of compute that they could potentially divert towards some kind of illicit use?
So that might be that projects that use a large amount of compute have to be vetted and approved by multiple different people, and then there’s some monitoring of what happens there. But there’s a lot of details to be worked out, and I imagine a lot of work to be done to make sure that this doesn’t become like annoying red tape that just slows everyone down. Because if you can nail a way of doing this which is pretty cheap, then it’s much more likely that those mitigations will continue to be the case.
Rob Wiblin: Any other asks for the companies? I suppose maybe you could go on for a fair while.
Tom Davidson: I think that the other big one would be this research on secret loyalties. As I said a few times now, there’s the sleeper agents paper from Anthropic, and it does seem like we are just not in a place at the moment where we would inspect a model that was trained and give some confident assertion of what kinds of adversarial behaviours or secret loyalties could or could not have been introduced.
So I think there’s a big project in terms of how do you want to structure the process of training an AI system to make it really hard to introduce secret loyalties? Do you want to split that process up into six different parts where there’s different teams that do each part? That would make it really hard for one person or a small group of people to introduce a secret loyalty, because they would only have access to one step of the chain. What checks can you do at the end, and what tests can you do to give maximum assurance that there aren’t any secret loyalties? There’s just a lot of technical work there to be done I think.
Rob Wiblin: Is Forethought going to be publishing a bunch about this topic that people would go and read? I’m just imagining someone who’s actually in a position to maybe implement or advocate for some of these things within the companies: is there anything that they can go and read to have something concrete to point to?
Tom Davidson: We’re writing a paper at the moment. I’m not sure if it will be out by the time this podcast is up, but as soon as it is out, I’ll definitely be sharing it widely and promoting it.
Rob Wiblin: I feel like there could be a useful collaboration between you and other folks at Forethought, and people actually at the companies — who are maybe in a position to understand, at a nuts-and-bolts level, what the different processes are exactly currently. And you could suggest abstract ways that they could be better, and they could say, “No, that wouldn’t work for this reason.” It seems like it requires maybe a fusion of both the high-level thinking and the operational understanding.
Tom Davidson: Yeah, I think that’s right. Often once you see the concrete details of the process and the tradeoffs within the actual organisation, you’re in a much better position to know exactly what you should do about it.
One last thing on the people who work at AI companies would be pushing for transparency: pushing for strong commitments to publish capability evaluations, risk evaluations, and ideally formalising that. There’s the Frontier Model Forum, which kind of coordinates a few of these frontier AI developers, and if there was a joint commitment coordinated by that forum to have broad transparency into capability evaluations and risk mitigations, I think that’d be a big win. And anyone who works at one of these top AI companies could push towards that.
The power ML researchers still have, for now [03:09:53]
Rob Wiblin: Something that I found funny about this whole situation is that, as far as I can tell, the best AI researchers of the leading companies are working feverishly to automate AI research — at which point they will lose their jobs and no longer be necessary, and they’ll also lose all of the leverage that they have to influence the process.
At the point that they are no longer required or they’re no longer actually the best agents for doing AI research, why are they still involved? Why would the company continue to deal with them and give them things that they want? I wish they had a bit more of a class consciousness to understand what is in their interests.
Tom Davidson: I think that’s right. And it may be that when it gets closer to reality, and they’re actually closer to being fully automated, that there is a shift of mindset as they see that impending loss of influence. I think in many organisations in the context of automation, people do actively resist their jobs being fully automated. They kind of carve out niches for themselves; they’re kind of reluctant to participate in the data-generation processes that would be needed to fully automate themselves.
And I think that we may see similar behaviours with employees of these organisations — where maybe, when push comes to shove, actually employees don’t want their screens being recorded so that it’s easier to train AI systems to replace them, and actually insist upon safety procedures by which they’re still needed to be part of the loop and approve actions. And that could both be in their pragmatic self-interests and also actually reduce the risks of concentration of power.
Rob Wiblin: So you’re saying even if, in some technical sense, they were no longer required to conduct the research, they could argue that for safety reasons they still have to be around; that otherwise you just have this ridiculous concentration of power in a handful of people in the corporate department. And they would be right about that, I guess, in our view.
Tom Davidson: That’s right. And they could also actually just not participate or cooperate as much as they could in the process of actually developing those systems that could replace them. Often, humans have in their brains lots of heuristics and techniques that allow them to do their jobs really well. And I think it will be much easier to automate work if those humans are cooperative in terms of sharing everything they know, having their screens recorded, giving feedback to AIs that are trying to automate their work.
But those humans could just be like, “Actually, I don’t want to cooperate in this way.” And of course you can pit employees against each other and offer to pay some of them a lot to get them to cooperate. But as you say, if there’s some kind of class consciousness, then that could be more difficult.
Rob Wiblin: Well, they do have a lot of leverage right now. There’s a lot of competition for those kinds of researchers. So while that’s still the case, it seems like it would behove them to think about how they can leverage that to ensure that they’re not removed within the next couple of years.
Tom Davidson: Yeah. In this debate about explosive growth, often sceptics will talk about these political economy barriers to really quickly automating sectors and getting really rapid transitions to explosive growth. And I think they’re right that it’s not often in people’s individual incentives to grow the overall economy as quickly as possible, because they may lose out on an individual level — even if it’s more efficient for society as a whole. I think this is a real dynamic.
How to help if you’re an elected leader [03:13:14]
Rob Wiblin: What would be at the top of the priority list to ask our elected leaders to do, or Parliament or Congress to pass?
Tom Davidson: Again, my top priority is going to be transparency. I think it’s just a very broadly reasonable ask. As we’ve discussed, power is currently distributed very widely. If people have a good understanding of the level of capabilities, the risks, perhaps the inadequacy of current risk mitigations, then that can be a spur for demands for more.
I think that’s a broad thing that you can achieve via executive order. You could potentially pass legislation that requires that.
One particular thing is that congresspeople are able to subpoena information that could be relevant to making laws. So if you are a congressional staffer, you could potentially be looking at what are the scenarios under which you could actually use that authority to gain this information and this additional transparency into this kind of risk?
Rob Wiblin: You’re thinking you could subpoena a description of the process by which these models are trained, what sort of safeguards are there and what sort of safeguards aren’t there, so that external people can think, is this sufficient? And probably they’re going to find that it’s not sufficient.
Tom Davidson: That’s right. Have there been any incidents where something went wrong and kind of slipped through the system? Have they done red-teaming to see whether someone’s able to actually fool the inspection process and sneak a secret loyalty past? That kind of thing.
Rob Wiblin: Maybe we should clarify that I think you’ve only been working on this topic for a couple of years. A year or two. It’s relatively early in the research agenda, is that right?
Tom Davidson: Yeah. Lukas Finnveden has been working on this the longest, and I think he’s actually worked on it full-time for less than a year. And I’m even less time than that myself. So this is certainly a very early-stage line of research that I think deserves a lot more attention.
Rob Wiblin: Yeah, I wanted to say that because — given how early you are in this work and how few people are on it — I think we can expect significant improvements probably in these suggestions that you have. This is not probably going to be the state of the art forever.
Tom Davidson: I certainly hope not.
Rob Wiblin: So that’s an optimistic sign. Not that I’m criticising these interventions, but I think with additional work we can probably be a lot more concrete and understand what things are going to make the biggest difference.
Tom Davidson: Yeah, that’s right.
Rob Wiblin: I guess you’re not a forecasting timelines person yourself, but how long would you guess that we have to sort these things out before we’re actually exposed to some meaningful risk of a power grab?
Tom Davidson: I think it could easily be within the next couple of years that we get a very significant risk of an intelligence explosion within a frontier AI developer, during which time secret loyalties could be embedded that later enable a power grab.
We’re just seeing really rapid progress of AI at tasks where you can get a feedback signal. Recent o3 results show that when you’re able to automatically generate and verify examples of a task, it’s just really very impressive capabilities. And it seems like lots of parts of AI R&D will be like that.
In terms of the actual risk of a power grab happening, it does seem to me that of the threat models that I considered, autocratisation is hard to see that happening overnight; obviously automation of the military is something that we will see coming and probably will take years.
Probably, if you’re actually thinking what in principle could be the most immediate and quick one, it would be the self-built hard power where there’s just this big uncertainty. It is for all we know possible that there’s this intelligence explosion, and then AI is just very good at designing new technologies, and it only takes a relatively small amount of industrial capacity to make enough drones for there to be a quick power grab. So that risk does seem like it could imaginably, conceivably arise within the next few years.
Rob Wiblin: OK, so the issue of a secret loyalty installed early, that’s kind of an imminent problem. Then I guess you have self-built hard power, if it’s the case that you get a surprising industrial takeoff with AGI. Then I guess further out, the military coup requires AI to be more integrated into the military than it is now, so that will probably take a little bit longer. And then autocratisation is an even more gradual process, probably.
Tom Davidson: That’s right. Although on autocratisation I would say that there’s this useful distinction between the point at which literally a human has seized power and now has hard power — and we’ve been kind of talking about that — but there’s this useful notion that Daniel Kokotajlo talks about of the point of no return. That’s where the power-seeking coalition now has enough economic and political influence that in practice it’s very hard to stop them and you’re unlikely to do so.
So with the autocratisation threat model, maybe it’ll be 10 years before they could actually solidify their absolute power, but maybe it’d only be four years before they’ve actually got that groundswell of popular support and they’ve created dramatic race conditions with China. Maybe there’s a war on the horizon. And in practice, at that point, the forces that are trying to reduce the risks have already lost the game. So that’s a distinction worth bearing in mind.
Rob Wiblin: OK, so I guess I’d say it’s a reasonably urgent issue for some more people to get onto — more than just the two or three of you who are currently working on it. We can come back and see how this research agenda has developed maybe in a year or two.
Tom Davidson: Yeah, that’d be great. Thanks so much, Rob.
Rob Wiblin: My guest today has been Tom Davidson. Thanks so much for coming on The 80,000 Hours Podcast, Tom.
Tom Davidson: It’s been a pleasure.
Rob’s outro [03:19:05]
Rob Wiblin: To continue all of the things that I was discussing in the intro: if AI development continues at its current pace, we could see an intelligence explosion in which AI takes over AI research and rapidly self-improves, starting in as little as two to seven years.
Increasingly, we believe that this is the defining issue of our time, and decisions made in the next decade could shape humanity’s long-term future in profound ways. If you’re still with us at the end of that interview with Tom Davidson, perhaps you agree with that general perspective.
Given these kinds of stakes, 80,000 Hours, including this podcast, has made the strategic decision to orient roughly all of our work to try to help transformative AI go well.
We believe there’s an opportunity for our show to have an even bigger impact by scaling up the podcast from occasional in-depth interviews to offer more comprehensive coverage of AGI-related developments, risks, governance, and alignment efforts in a really substantive way.
I wish that this was already being delivered by The New York Times and other established institutions, but the sad reality is that it isn’t — not at all, not yet. Maybe that will change, but we can’t sit around and wait for it to happen.
With the right founding team in place, we can aim to triple or even quadruple the size of the podcast team here, allowing us to publish more in-depth interviews with leading AI researchers, policymakers, and strategists, in both audio and video — providing regular analysis of new developments and research findings, and potentially branching into other content formats, including voice essays and a Substack.
It’s an ambitious vision, it’s a challenging vision — but I think one we should make a big effort to run towards.
In the short term, we’re hiring for two critical roles:
- A third podcast host to help us record and release even more content
- A chief of staff who will work closely with me to build the team and establish the systems needed for consistent, high-quality output that reaches a large audience
For the podcast host role, we’re looking for someone who:
- Takes AI development seriously
- Has strong research and communication skills
- Is curious, opinionated, and comfortable disagreeing with people in public
The ideal person probably also consumes a lot of audio and video podcasts, or reads a lot of relevant Substacks, or alternatively follows AI on Twitter perhaps.
We lay out the nature of the role and the sort of person we think would be a good fit in more detail in a blog post, that you could find at 80000hours.org/latest. Depending on when you’re listening to this, you might need to scroll down and possibly click for the second or third page.
For the chief of staff role, we’re looking for someone who:
- Takes AI and AGI really seriously
- Is ambitious and entrepreneurial
- Has excellent strategic judgement
- Has experience in content creation or project management of some type
And again, we say much more about what that role would look like and who might be the best fit in a blog post on the 80000hours.org website.
The ideal locations to work from are the San Francisco Bay Area or our head office here in London. DC or Oxford are also reasonable choices. And for the right person, we’re open to remote work as well.
Both roles offer competitive compensation: £80,000–100,000 pounds, possibly more, based on people’s experience. There’s also flexible work arrangements and a comprehensive set of benefits. And finally, I’m biased, but I think we are a pretty great and fun team to work with.
If you’re interested, submit an expression of interest by going to 80000hours.org/about/work-with-us. Our expression of interest forms are open until May 6. That may get extended, but don’t count on it — so submit something ASAP if you think you are the right person for these roles.
I hope you enjoyed that interview with Tom. We’ll be back with more soon.