#212 – Allan Dafoe on why technology is unstoppable & how to shape AI development anyway

Technology doesn’t force us to do anything — it merely opens doors. But military and economic competition pushes us through.

That’s how today’s guest Allan Dafoe — director of frontier safety and governance at Google DeepMind — explains one of the deepest patterns in technological history: once a powerful new capability becomes available, societies that adopt it tend to outcompete those that don’t. Those who resist too much can find themselves taken over or rendered irrelevant.

This dynamic played out dramatically in 1853 when US Commodore Perry sailed into Tokyo Bay with steam-powered warships that seemed magical to the Japanese, who had spent centuries deliberately limiting their technological development. With far greater military power, the US was able to force Japan to open itself to trade. Within 15 years, Japan had undergone the Meiji Restoration and transformed itself in a desperate scramble to catch up.

Today we see hints of similar pressure around artificial intelligence. Even companies, countries, and researchers deeply concerned about where AI could take us feel compelled to push ahead — worried that if they don’t, less careful actors will develop transformative AI capabilities at around the same time anyway.

But Allan argues this technological determinism isn’t absolute. While broad patterns may be inevitable, history shows we do have some ability to steer how technologies are developed, by whom, and what they’re used for first.

As part of that approach, Allan has been promoting efforts to make AI more capable of sophisticated cooperation, and improving the tests Google uses to measure how well its models could do things like mislead people, hack and take control of their own servers, or spread autonomously in the wild.

As of mid-2024 they didn’t seem dangerous at all, but we’ve since learned that our ability to measure these capabilities is good but imperfect: if we don’t find the right way to ‘elicit’ an ability, we can miss that it’s there.

Subsequent research from Anthropic and Redwood Research suggests there’s even a risk that future models may play dumb to avoid their goals being altered.

That has led DeepMind to a “defence in depth” approach: carefully staged deployment starting with internal testing, then trusted external testers, then limited release, then watching how models are used in the real world. By not releasing model weights, DeepMind is able to back up and add additional safeguards if experience shows they’re necessary.

But with much more powerful and general models on the way, individual company policies won’t be sufficient by themselves. Drawing on his academic research into how societies handle transformative technologies, Allan argues we need coordinated international governance that balances safety with our desire to get the massive potential benefits of AI in areas like healthcare and education as quickly as possible.

Host Rob and Allan also cover:

  • The most exciting beneficial applications of AI
  • Whether and how we can influence the development of technology
  • What DeepMind is doing to evaluate and mitigate risks from frontier AI systems
  • Why cooperative AI may be as important as aligned AI
  • The role of democratic input in AI governance
  • What kinds of experts are most needed in AI safety and governance
  • And much more

Video editing: Simon Monsour
Audio engineering: Ben Cordell, Milo McGuire, Simon Monsour, and Dominic Armstrong
Camera operator: Jeremy Chevillotte
Transcriptions: Katy Moore

Highlights

Astounding patterns in macrohistory

Allan Dafoe: History is not just the sum of all of our efforts. It’s not just that we all kind of push in different directions and then you take the sum and that’s what you get. Rather, there’s these general equilibrium effects that economists often talk about, where it may be that for every unit of effort you push in one direction, the system will kind of push back with an equal force, or sometimes a weaker force, sometimes a stronger force.

So when you’re in such a system, it’s very important to understand these structural dynamics. Why does the system sometimes resist efforts or sometimes amplify efforts? Why do you see these really astounding patterns in macrohistory?

For example, if you look at patterns of GDP growth, there’s these famous curves where, after the devastation of World War II, both Germany and Japan completely rebound within less than a decade and then kind of return to their prewar trajectory.

And we’ve seen Moore’s law, which is just an astounding trend. It’s not just that it continues, that transistor density is increasing exponentially; it’s very much a line — so you can predict where we’ll be quite precisely years in advance. We now have scaling laws which have given us sort of our generation’s Moore’s law, which again seems to allow us to predict years in advance how large the models will be, how capable — for example, on loss and so forth.
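To make the predictability Allan describes concrete, here’s a minimal sketch of how a scaling-law extrapolation works in practice. The power-law form is the standard one from the scaling-laws literature, but the coefficients and compute budgets below are made up purely for illustration — they aren’t DeepMind’s or anyone else’s real numbers.

```python
# Toy scaling-law extrapolation (illustrative coefficients only):
# loss falls as a power law in training compute, so a straight line on
# log-log axes lets you predict performance years in advance.
import numpy as np

def loss_from_compute(compute_flops, a=11.0, alpha=0.05, irreducible=1.7):
    """Toy scaling law: L(C) = irreducible + a * C^(-alpha)."""
    return irreducible + a * compute_flops ** (-alpha)

# "Observed" training runs at a few compute budgets (FLOP)
compute = np.array([1e21, 1e22, 1e23, 1e24])
losses = loss_from_compute(compute)

# Fit the reducible part of the loss in log-log space (assuming we know the
# irreducible loss), then extrapolate to a much larger future compute budget.
slope, intercept = np.polyfit(np.log10(compute), np.log10(losses - 1.7), 1)
future_compute = 1e26
predicted = 1.7 + 10 ** (intercept + slope * np.log10(future_compute))
print(f"Predicted loss at {future_compute:.0e} FLOP: {predicted:.3f}")
```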

There’s a number of other macro phenomena that seem quite persistent. The growth of civilisation: I’ve looked at trends in what you can call the maxima — the maximum energy processing of a civilisation, and also things like the height of buildings and the durability of materials. Really, most functional properties of technology have become more functional over time: the speed of transportation and so forth.

Robert Wright, in summarising the literature, writes that, “Archaeologists can’t help but notice that, as a rule, the deeper you dig, the simpler the society whose remains you find.” There’s more generally an observation, which is almost a truism, that certain kinds of technology are so complex or difficult that they come after other forms of technology. It’s hard to imagine nuclear power coming before coal power, for example. So there’s all these macro phenomena and trends in technology, and it’s important to explain them.

Now, the naive explanation would say that if history is just the sum of what people try to achieve, then it’s human will that’s produced all these trends, including the reliable tick-tock of Moore’s law.

But not all the trends are positive. I know you’ve reflected on the agricultural revolution, which evidence suggests was not great for a lot of people. The median human, probably their health and welfare went down during this long stretch from the agricultural revolution to the Industrial Revolution. Of course it gave rise to inequality, warfare, and various other things. And there’s other trends that different societal groups resisted.

So in short, I don’t think the answer is history is just the sum of what people try to do. It depends of course on things like power, on timing, on the ecosystem of what’s functional and what’s possible and what’s not, on what technology enables. So I wanted to make sense of this.

Are humans just along for the ride when it comes to technological progress?

Rob Wiblin: How do you reconcile this macro picture — where it seems like humans don’t have that much control over technology, at least historically — with the micro picture, where we feel like we do now? […]

Allan Dafoe: Some of these earlier theorists of technology and scholars of technology, this is in the ’60s to ’80s, even endowed “technology” — this abstraction — with a sense of autonomy and agency. Technology was this driving force, and often humans were along for the ride.

Langdon Winner was one of the most prominent scholars who talked about technology having autonomy. Lewis Mumford talks about “the Machine” as this capital-M abstraction that is driving where society is going, and humans just support the machine; we are cogs to support it. Jacques Ellul referred to “la technique,” which is sort of the functional taking over. And he had this metaphor that humans make a choice, but we do so under coercion, under pressure from what is demanded of us — and la technique is the answer.

So I think there were these scholars and others who really did endow technology with this kind of agency. Then a later generation criticised them, saying this abstract technology is an abstraction — a very high-level abstraction, almost poorly defined. And when you actually look at history in detail in the microscope, where’s technology? You don’t see the machine in the room. Is the machine with us right now? You see people: people with ideologies and ideas and interests and making decisions.

And I would say this led to a revolution in the study of technology towards what’s been called “social constructivism.” Methodologically, it’s more ethnographic or sociological. It looks at the details of how decisions were made, the idiosyncrasies of technological development, the many dead ends or detours. The fact that early on in the development of technology, people didn’t know what the end result was, and they had many visions that were competing — and so it wasn’t foreordained that the bicycle would look the way it does, or the plane would look the way it does.

And I would say, from my personal intellectual trajectory, the PhD programme I started in was one of the prominent departments working on this at Cornell. And for me, this was a surprise, because I really wanted to explain these macro phenomena. And the answer I got from this department was, “This is wrong. This is technological determinism: it is what scholars have since referred to as a critic’s term. Anyone who actually advocates technological determinism is… It’s a straw man position. No one is serious about this.”

So this whole generation of these sociologists and historians of technology really looked at the micro details of how technology developed and dismissed these abstractions that technology can have autonomy, can have an internal logic of how it develops, can have these profound impacts on society. You know, we name revolutions after technology: the agricultural revolution, the Industrial Revolution, and so forth.

Flavours of technological determinism

Allan Dafoe: Maybe first I’ll just talk a little bit more through the different flavours of technological determinism, because I think it’s a rich vocabulary for people to have.

Maybe in a way, the easiest one to accept is what we can call “technological politics.” This is the idea that individuals or groups can express their political goals through technology. So in the same way that you can express or achieve your political goals through voting or through other political actions, if you build infrastructure a certain way or design a technology a certain way, it shapes the behaviour of people. So design affects social dynamics.

Some of these famous examples are the Parisian boulevards — these linear boulevards that were built in many ways to suppress riots and rebellion because it made it easier for the cavalry to get to different parts of the city.

Latour is a famous sociologist who talked about the sociology of a door or the “missing masses” in sociology, which refers to the technology all around us that reinforces how we want society to be. You can think about gates or urban infrastructure as expressing a view of how people should interact and behave.

A famous, though contested, example concerns Robert Moses, the prominent New York City urban planner: it’s alleged that he built bridges too low for buses to pass under so as to deprive New Yorkers who didn’t have a car, namely African Americans, of the ability to go to the beach. So this was, it has been asserted, an expression of a certain racial politics.

In general, I think urban infrastructure has quite enduring effects, and so you can often think about what is the impact of different ways of designing our cities.

Rob Wiblin: I guess in recent times we’re familiar with the debate about how the design of social networks can greatly shift the tone of conversation and people’s behaviour. People have pointed to the prominence of quote tweeting on Twitter — where you can highlight something that someone else has said and then blast them on it. It potentially leads to more brigading by one political tribe against another one, and if you made that less prominent, then you would see less of that behaviour.

Allan Dafoe: Exactly. I think the nature of the recommendation algorithm, the way that people can express what they want, has profound impacts on the nature of the discourse and how we perceive the public conversation.

So this was technological politics. There’s a number of other strands of technological determinism.

Just very briefly, “technological momentum” is this idea that once a system gets built, gets going, it has inertia. This is from sunk costs. So you can think of maybe the dependence on cars in American urban infrastructure: once you build your cities in a spread-out manner, it becomes hard to have a pedestrian, dense urban core.

Or, as has been alleged, maybe electric cars would have been viable if we’d just invested more. Or maybe wind power and solar power could have succeeded earlier if we’d gone down a different path. We might come back to this. I think a lot of claims of path dependence in technology are probably overstated — that, again, coming back to the structure, some technologies and some technological paths were just much more viable. And even if we’d invested a lot early in a different path, I think it’s often the case that the path we went on was likely to be the path we’d have ended up on anyway, because of the costs and benefits of the technology more than because of these early choices that people made.

Rob Wiblin: Yeah, I guess the extreme view would be to say that we could have had electric cars, or I guess we did have electric cars in the ’20s, I think, but we could have gone down that path in a more full-throated way as early as that. I guess the moderate position would be to say, no, that wasn’t actually practical; there were too many technological constraints — but we could have done it maybe five or 10 years earlier if we’d foreseen that this would be a great benefit and decided to make some early costly investments in it.

Allan Dafoe: Yeah. And then to make the counterpoint that there’s a certain time when a breakthrough is ripe: I think in AI it’s often the case that insights such as applying gradient descent to neural networks occurred much earlier than when they had their impact, and it seems they needed to wait for the cost of compute, the cost of FLOPS, to go down sufficiently for them to be applicable.

And you could argue, what if we had the insight late? I think once FLOPS get so cheap, it becomes much more likely that someone invents these breakthroughs, because it becomes more accessible; any PhD student can experiment on what they can access. So there is a seeming inevitability to the time window when certain breakthroughs will happen. It can’t occur earlier because the compute wasn’t there, and it would be unlikely to occur much later because then the compute would be so cheap. Someone would have made the breakthrough and then realised how useful it is.

[…]

Another concept that’s emphasised is that of unintended consequences, something we know a lot about. But Langdon Winner points to this notion that as we’re inventing technology after technology, we run the risk of being adrift on a “sea of unintended consequences.” So the course of history is not determined by some structure, nor by our choices, but just by our being buffeted one way or another.

Rob Wiblin: By jumping from one blunder to another.

Allan Dafoe: Yeah. And sometimes it’s positive, sometimes it’s negative. I think there’s truth to that: that often a technology comes along and then it takes us some years to fully understand what its impacts are, and then to adapt and hopefully channel it in the most beneficial directions.

The super-cooperative AGI hypothesis and backdoors

Allan Dafoe: I think a lot of people think AI will be better at cooperating; I think that’s the prior. Maybe something we can talk about is what I’ve called the “super-cooperative AGI hypothesis”: that as AI scales to AGI, so will cooperative competence scale to sort of infinity — that AGIs will be able to cooperate with each other to such an extent that they can solve these global coordination problems.

Rob Wiblin: It’s almost magic.

Allan Dafoe: Yeah. And the reason that hypothesis matters is that if it’s true that AGI, as a byproduct, will be so good at cooperation that it will solve all these global coordination problems, these collective action problems, then in a way we don’t need to worry about those. We can just bet on AGI, push to solve safety and alignment, and then AGI will solve the rest. So it’s an important hypothesis to really think through: what does it mean, what is required for it to be valid and for us to empirically evaluate are we on track or not?

So here are some arguments against AIs being very cooperatively skilled with each other. And there’s different levels of this argument: we can say that maybe AI will be cooperatively skilled, but not at the super-cooperative AGI hypothesis level, or maybe they will even be less cooperatively skilled than human-to-human cooperation.

The strong version of the argument would say that humans, we’re pretty similar: we come from the same biological background and, to a large extent, cultural background. We can read each other’s facial expressions to a large extent. Especially if two groups of people are bargaining, we can often sort of read the thoughts of the other group. Especially if two democracies are bargaining, you can read the newspapers, you can read the press of the other side. So in a sense, human communities are transparent agents to each other, at least democracies.

Rob Wiblin: We also have enormous historical experience to kind of judge how people behave in different situations.

Allan Dafoe: Exactly. We can judge people from all these different examples. You know, we have folk psychology theories of childhood or the upbringing of people explaining their behaviour. And also the range of goals that humans can have is fairly limited. We roughly know what most humans are trying to achieve.

AI might be very different. Its goals could be vastly different from what humans’ goals could be. One example is most humans have diminishing returns in anything. They’re not willing to take gambles of all or nothing — like bet the entire company: 50% we go to 0, 50% we double valuation — whereas an AI could have this kind of linear utility in wealth or other things. AIs could have alien goals that are different than what humans typically anticipate. They may be harder to read.

Certainly if you have interpretability infrastructure, they could be easier to understand. But if there’s a bargaining setting, why would one bargainer allow the other to read its neurons? So in a sense, AIs could be much more of a black box to each other than humans are.

Rob Wiblin: I guess the extreme instance of this is that they can potentially be backdoored. I think under the current technology we can almost implant magic words into AIs that can cause them to behave completely differently than they would previously. And this is extremely hard to detect, at least with current levels of interpretability. So that could be something that could really trouble you: you could have an AI completely flip behaviour in response to slightly different conditions.

Allan Dafoe: Exactly. Humans can deceive — but it’s hard; it requires training. And to have this level of complete flip in the goals of a human and the persona is hard. Whereas, as you say, an AI in principle could have this kind of one neuron that completely flips the goal of the AI system so that its goals may be unpredictable, given its history.

Also, this is a challenge for interpretability solutions to cooperation. So people sometimes say that if we allow each other to observe each other’s neurons, then we can cooperate, given that transparency. But even that might not be possible because of this property that I can hide this backdoor in my own architecture that’s so subtle that it would be very hard for you to detect it. But that will completely flip the meaning of everything you’re reading off of my neurons.

Rob Wiblin: This is slightly out of place, but I want to mention this crazy idea that you could use the possibility of having backdoored a model to make it extremely undesirable to steal a model from someone else and apply it. You can imagine the US might say, “We’ve backdoored the models that we’re using in our national security infrastructure, so that if they detect that they’re being operated by a different country, then they’re going to completely flip out and behave incredibly differently, and there’ll be almost no way to detect that.” I think it’s a bad situation in general, but this could be one way that you could make it more difficult to hack and take advantage, or just steal at the last minute the model of a different group.

Allan Dafoe: Yeah. Arguably, just the very idea of backdooring one’s own models as an antitheft device could deter model theft. It makes a model a lot less useful once you’ve stolen it if you think it might have this “call home” feature or “behave contrary to the thief’s intentions” feature.

Another interesting property of this backdoor dynamic is it actually provides an incentive for a would-be thief to invest in alignment technology. Because if you’re going to steal a model, you want to make sure you can detect if it has this backdoor in it. And then for the antitheft purposes, if you want to build an antitheft backdoor, you again want to invest in alignment technology so you can make sure your antitheft backdoor will survive the current state of the art in alignment.

Rob Wiblin: A virtuous cycle.

Allan Dafoe: So maybe this is a good direction for the world to go, because it, as a byproduct, incentivises alignment research. I think there could be undesirable effects if it leads models to have these kinds of highly sensitive architectures to subtle aspects of the model. Or maybe it even makes models more prone to very subtle forms of deception. So yeah, more research is needed before investing.

Rob Wiblin: Sounds a little bit like dancing on a knife edge.

Could having more cooperative AIs backfire?

Allan Dafoe: Firstly, cooperation sounds good — but by definition, it’s about the agents in some system making themselves better off than they otherwise would be. So it could be two agents or 10, and the problem of cooperation is how do those agents get closer to their Pareto frontier? How do they avoid deadweight loss that they would otherwise experience?

It says nothing about how agents outside of that system do by the cooperation, so there may be this phenomenon of exclusion. So increasing the cooperative skill of AI will make those AI systems better off, but it may harm any agent who’s excluded from that cooperative dynamic. That could be other AI systems, or other groups of people whose AI systems are not part of that cooperative dynamic.
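Here’s a toy numerical illustration of this exclusion point — the payoffs are made up purely for exposition: cooperation moves the insiders toward their Pareto frontier while leaving an excluded party worse off.

```python
# Toy example with made-up payoffs: two sellers can compete on price or
# cooperate (collude). Cooperation is a clear win *inside* the system,
# but the excluded third party -- the buyer -- ends up worse off.

def outcome(cooperate: bool) -> dict:
    if cooperate:
        # Colluding sellers keep prices high: each earns 8, buyer surplus is 2.
        return {"seller_a": 8, "seller_b": 8, "buyer": 2}
    # Price competition: each seller earns only 5, but buyer surplus is 8.
    return {"seller_a": 5, "seller_b": 5, "buyer": 8}

for cooperate in (False, True):
    label = "cooperate" if cooperate else "compete"
    print(label, outcome(cooperate))

# The cooperating agents are strictly better off (8 > 5), so cooperation
# "worked" from inside the system, while the excluded agent lost (2 < 8).
```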

Rob Wiblin: On that exclusion point, there’s this famous quote: “Democracy is two wolves and a lamb deciding who to eat for lunch.” I guess once you have a majority, you can actually exclude others and then extract value from them.

Allan Dafoe: Exactly. So there’s lots of kinds of cooperation that are antisocial. We typically don’t want students cooperating during a test to improve their joint score: tests and sports very clearly have rules against cooperation. And in the marketplace, there’s rules for how companies should interact with each other. And all kinds of criminal activity: we do not want criminals cooperating more efficiently with each other.

Rob Wiblin: I guess one reason that the Mafia is such a potent organisation is that they’ve basically figured out how to sustain intense internal cooperation without the use of the legal system to enforce contracts and agreements and so on, I guess by having social codes and screening people extremely well and things like that.

Allan Dafoe: Exactly. So we can think of cooperative skill as a dual-use capability: one that is broadly beneficial in the hands of the good guys, and can be harmful in the hands of antisocial actors.

There’s this hypothesis behind the programme which says that broadly increasing cooperative skill is socially beneficial. We call it a hypothesis because it’s worth interrogating. I think it’s probably true. But here the bet is: if we can make the AI ecosystem, the frontier of AI, more cooperatively skilled than it otherwise would be, yes, it will advantage antisocial actors, but it will also advantage prosocial actors. The argument is, on net, that that will be beneficial.

Rob Wiblin: I guess the natural way to argue that is to say that we’ve become better at cooperation over time as a civilisation, as a species, and over that history wellbeing has generally been getting better. Are there other arguments as well?

Allan Dafoe: Yeah, I think the set of arguments I’m drawn to are along the lines of what you articulated: that even though cooperation can empower groups to cause harm and to be antisocial, it does seem like it’s a net-positive phenomenon — in the sense that if we increase everyone’s cooperative capability, then it means there’s all these positive prosocial benefits, and there’s more collective wins to be unlocked than there are antisocial harms that would be unlocked. In the end, cooperation wins out.

I guess it’s similar to the argument for why trade is net positive. Trade also can have this property, where you and I being able to trade may exclude others who we formerly did business with — but on net, global trade is beneficial, because every trading partner is looking for sort of the best role for them in the global economy, and that eventually becomes beneficial to virtually everyone.

Rob Wiblin: Yeah. I guess some people argue, and not for no reason, that actually maybe wellbeing on a global level has gotten worse over the industrial era — because although humans have gotten better off, the amount of suffering involved in factory farms is so large as to outweigh all of the gains that have been generated for human beings.

I guess in that case, the joke about two wolves and a lamb deciding who to eat for lunch is quite literal. I suppose it’s an abnormal case, because pigs and cows are not really able to negotiate; they’re not really able to engage in this kind of cooperation in the way that humans are. So maybe it’s not so surprising that better cooperation among a particular group might damage those who are not able to form agreements whatsoever anyway.

Allan Dafoe: Yeah, I think that’s a nice example of the exclusionary effects of enhanced cooperation. And this may be a cautionary tale for humans: if AI systems can cooperate with each other much better than they can cooperate with humans, then maybe we get left behind as trading partners and as decision makers in a world where the AI civilisation can more effectively cooperate within machine time speeds and using the AI vocabulary that’s emerged.

The offence-defence balance

Allan Dafoe: I think the proliferation of frontier model capabilities is a key parameter of how things will go and of the viability of different governance approaches and solutions. Which is why I think I would channel this question back to the question of what’s the appropriate approach to open weighting models? Because that is probably the most significant source of proliferation of frontier model capabilities.

Leaving aside the open-weight question, there’s this secondary point that you point to, which is the exponentially decreasing cost of training models. Epoch AI, which is a group that does very good estimates of trends in hardware and algorithmic efficiency and related phenomena, finds that algorithmic efficiency is improving roughly threefold per year.

So the cost of training a model decreases by roughly 10x every two years, and if that trend persists, then a model that costs $100 million to train today will cost $10 million in two years, and then $1 million two years after that, and so forth. So you can see how that leads to many more actors having a model of a given capability.
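To make the compounding explicit, here’s a minimal sketch; the starting cost and yearly efficiency gain are just the illustrative figures quoted above, mechanically extrapolated.

```python
# Compounding the trend described above (illustrative figures only):
# ~3x algorithmic efficiency gains per year come out to roughly a 10x
# drop in training cost every two years.
efficiency_gain_per_year = 3.0
cost_today = 100e6  # a $100 million training run, as in the example above

for years in (0, 2, 4, 6):
    cost = cost_today / efficiency_gain_per_year ** years
    print(f"Year {years}: ~${cost:,.0f} for a model of the same capability")
# Year 0: ~$100,000,000 | Year 2: ~$11,111,111 | Year 4: ~$1,234,568 | ...
```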

Some analysts of the situation perceive this to be very concerning, because it means whatever novel capabilities emerge will quickly diffuse at this rate, and be employed by bad actors or irresponsible actors.

I think it’s complicated. Certainly, in a sense, control of technology is easier when it doesn’t have this exponentially decreasing cost function. And there’s other sources of diffusion in the economy. However, two years is a long time: a two-year-old model today is a significantly inferior model to the best models. So we may be in a world where the best models are developed and controlled responsibly, and then can be used for defensive purposes against irresponsible uses of inferior models. I think that’s the win condition.

Rob Wiblin: Yeah, I guess that makes sense. I suppose it’s a picture that makes one nervous, because it means that you always have to have this kind of hegemon of the most recent model, ensuring that what is now an obsolete but still quite powerful model from two years ago isn’t causing any havoc. But I suppose it probably is almost the only way that we can make things work. I can’t think of an alternative, to be honest.

Allan Dafoe: Yeah. There’s other ways to advantage defenders. You can advantage them by resources. This was Mark Zuckerberg’s argument in favour of open-weight models: he would argue that even if the bad actors have access to the best models, there are more good actors, or the good actors have more resources than the bad actors, so they can outspend the bad actors. So for every bad-actor dollar there’s 100 good-actor dollars, which will build defences against misuse.

That could be the case. It depends on this concept of the offence-defence balance: how costly is it for an attacker to do damage relative to the cost the defender must spend to prevent or repair damage? If you have this 1:1 ratio, then sure, as long as the good guys have more resources than the bad guys, they can protect themselves or repair the damage in a sort of cost-effective way.

But it’s not always the case that the ratio is such. Often I think with bioweapons that the ratio is very skewed. A single bioweapon is quite hard to protect against: it requires rolling out a vaccine, and a lot of damage will occur before the vaccine is widely deployed. In general, I think this field of offence-defence balance is also something worth studying.
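Here’s a minimal sketch of how that ratio determines whether a resource advantage is enough; the budgets and ratios below are made up for illustration, not estimates of any real domain.

```python
# Offence-defence balance, made concrete with made-up numbers: whether
# "good actors can just outspend bad actors" depends on how many defender
# dollars are needed to neutralise each attacker dollar.

def defence_holds(attacker_budget: float, defender_budget: float,
                  offence_defence_ratio: float) -> bool:
    """offence_defence_ratio = defender dollars needed per attacker dollar."""
    return defender_budget >= attacker_budget * offence_defence_ratio

# Suppose defenders have a 100:1 resource advantage, as in the open-weights argument.
attacker, defender = 1e6, 1e8

print(defence_holds(attacker, defender, offence_defence_ratio=1))     # True: 1:1, easily covered
print(defence_holds(attacker, defender, offence_defence_ratio=10))    # True: still covered
print(defence_holds(attacker, defender, offence_defence_ratio=1000))  # False: a bio-like skew overwhelms the advantage
```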

And I would say the field of AI has too often drawn an analogy from computer security, where vulnerabilities are easily patched. So the correct response in computer security is typically to encourage vulnerability discovery, because then you can roll out a patch, and then the new operating system or software is resilient to the previously discovered vulnerability.

Not all systems have that property. Again, biological systems are hard to patch. Social systems also are hard to patch. Deepfakes do have a patch: we can develop this immune system where we know that just because someone does a video call with you doesn’t mean that’s sufficient authentication of the identity of the person for the instruction you’re getting to transfer some millions of dollars or so forth.

Humans have a lot of inertia in their systems, and it’s quite costly to build the new infrastructure and then train people to use it correctly.

Articles, books, and other media discussed in the show

Allan’s work:

Technological determinism:

Google DeepMind’s work and reasons to be optimistic about AI:

Google DeepMind is hiring for several positions!

AI safety and evaluations:

AI regulation:

Other 80,000 Hours podcast episodes:

Related episodes

About the show

The 80,000 Hours Podcast features unusually in-depth conversations about the world's most pressing problems and how you can use your career to solve them. We invite guests pursuing a wide range of career paths — from academics and activists to entrepreneurs and policymakers — to analyse the case for and against working on different issues and which approaches are best for solving them.

Get in touch with feedback or guest suggestions by emailing [email protected].

What should I listen to first?

We've carefully selected 10 episodes we think it could make sense to listen to first, on a separate podcast feed:

Check out 'Effective Altruism: An Introduction'

Subscribe here, or anywhere you get podcasts:

If you're new, see the podcast homepage for ideas on where to start, or browse our full episode archive.