Transcript
Cold open [00:00:00]
Allan Dafoe: One famous quote in the history of technology that was arguing against determinism was that technology doesn’t force us to do anything, it merely opens the door. It makes possible new ways of living, new forms of life.
And my retort was technology doesn’t force us, it merely opens the door — and it’s military economic competition that forces us through. So when a new technology comes on the stage, many groups can choose to ignore it or do whatever they will with it, but if one group chooses to employ it in this functional way that gives them some advantage, eventually the pressure from that group will come to all the rest and either force them to adopt or lead the other group to losing their resources to the new, more fit group.
Who’s Allan Dafoe? [00:00:48]
Rob Wiblin: Today I have the pleasure of speaking again with Allan Dafoe, who is currently the director of frontier safety and governance at Google DeepMind, or GDM for short. Before that, he was the founding director of the Centre for the Governance of AI. He was also a founder of the Cooperative AI Foundation and is a visiting scholar at the Oxford Martin School’s AI Governance Initiative. And I guess before all of that, you were an academic in the social sciences, studying technological determinism — great power conflict, the [democratic] peace theory, that kind of thing.
We’re going to get to a little bit of all of these different pieces of your work today. Thanks so much for coming back on the show, Allan.
Allan Dafoe: Thanks, Rob. Pleasure to be here.
Allan’s role at DeepMind [00:01:27]
Rob Wiblin: Later on we’re going to talk about the frontier model evals, as well as why you think cooperative AI might be about as important as aligned AI.
But first off, you’re the director of frontier safety and governance. What does that actually involve in practice? I can see that being a whole lot of different things, and I don’t have a sense of what your kind of day to day is like.
Allan Dafoe: So my team is called the frontier safety and governance team, and we have three main pillars: frontier safety and frontier governance — per the name — and then frontier planning. This adjective, “frontier,” is a new term, I would say almost a term of art, to refer to these general-purpose large models like Gemini and others.
Frontier safety looks at dangerous capability evaluations. It tries to understand what powerful capabilities may be emerging from these large general-purpose models, forecast when those capabilities are arriving, and then think about risk mitigation and risk management. This also led to the Frontier Safety Framework, which is Google’s approach to risk management for extreme risks in frontier models. That’s frontier safety.
Frontier governance is advising norms, policies, regulations, institutions — especially with an eye towards safety considerations.
And then frontier planning looks to the horizon, tries to imagine what new considerations could be coming with powerful AI and on the path to AGI, and then advising Google DeepMind, Google, and really all of society given those insights.
Rob Wiblin: It sounds like a pretty big remit. How large is the team that’s working on all these questions?
Allan Dafoe: The team is quite small, though we’re actually hiring for several positions right now; at the time of the podcast going live that may be wrapped up. But what’s really great about working at Google DeepMind is we have a lot of partner teams and there’s a very collaborative culture. So we work with technical safety, called the AI safety, Gemini safety, and alignment teams. We work with responsibility teams, the policy teams, and so forth.
Rob Wiblin: So I guess Google DeepMind has, over the last year or two, become more integrated into the rest of Google, right? Are there other groups within this broader entity, I guess Alphabet, that take an interest in these questions? Or are you maybe the only group that’s thinking about these? I guess you’re thinking about the most important models and upcoming issues and threats. Are there many other groups that take an interest in having that kind of foresight and thinking years ahead?
Allan Dafoe: I would say Google DeepMind is the part of Google that’s most specialised at thinking about frontier models. Google DeepMind is responsible for building Gemini, the frontier model that’s underpinning all of what Google’s doing. And we also have responsibility and safety and policy teams that are especially thinking about frontier issues.
We then do have partners across Google in these various domains. For example, in policy, we work closely with Google Policy on the range of policy implications and considerations connected to frontier models. But Google DeepMind is, I would say, where the heart of the thinking related to frontier policy issues takes place.
Why join DeepMind over everyone else? [00:04:27]
Rob Wiblin: I guess back in 2021, you’d been the founding director of the Centre for the Governance of AI, GovAI, which was a reasonably big deal then. I think it’s gone on maybe to be an even bigger deal, since it’s a pretty prominent voice in the conversation around governance of AI. Why did you decide to leave this thing that was going quite well to go and work at Google DeepMind instead?
Allan Dafoe: Yeah, I agree it went well at the time, and it’s gone even better since. So a lot of credit goes to Ben Garfinkel, who’s the executive director of GovAI, and the many others who work there.
At the time I was an informal advisor to Demis Hassabis, CEO of DeepMind, and Shane Legg, cofounder of DeepMind. And I found that I had a lot of potential impact in giving advice on AGI safety, AGI governance, and AGI strategy. However, to be most impactful, it helps to be inside the company, where I have more understanding of the nature of the decisions that they’re confronting and more surface area to advise not only Demis and Shane, but also many key decision makers.
To take a step back, I want to reflect on this route to impact, this “advising important decision makers” approach. I would say one lesson I’ve drawn from history is that often in these pivotal historical moments — in crises or in very high-leverage historical moments — a lot depends on the behaviour and the ideas and the very character of key individuals in history. You know, Alexander Hamilton, in the musical portrayal, it’s like who’s in the room and what decisions are made in the room.
And I think that’s true. I think when you look at history, especially in these pivotal historical moments, it’s incredible how much the ideas that people coming into the room had, the resources, the insights that they had available for the solutions that they construct. So that argues for advising people who will be influential on these important historical developments.
And AI and AGI is, in my view, one of the most, or probably the most important historical development. And I think Demis and DeepMind are very likely to be influential — ongoing, and have been so far — in the development of AI and AGI.
There’s a second part of this: in addition to advising influential decision makers, it’s the idea of boosting decision makers who have the kind of character you would want in critical decision makers. So do they have the sensibility? In my case, are they aware of the full stakes of what is happening? Are they safety conscious? Do they have the technical and organisational competence to pull off what needs to be built? Because if you have clumsy hands, even if you have good intentions, that may still lead to a bad outcome. Finally, do they have wisdom to be able to make these very hard decisions that have complex, uncertain parameters around them?
And in my view, Demis and Shane are extremely impressive individuals for these properties, from their safety orientation to their broad perspective on the stakes of the issues to their wisdom and broad character.
I also do want to reflect on GovAI. During my time, it produced a lot of great work and great people. It’s since gone on to produce even significantly more great work and great people. Ben Garfinkel has done a great job.
It’s interesting reflecting on some of the people who’ve gone through GovAI. One person who worked very closely with me at the time is Jade Leung. She used to be head of my partner team at OpenAI and is now the chief technology officer at the UK AI Safety Institute. A number of other very prominent people in AI safety and governance similarly have gone through GovAI. Markus Anderljung and Robert Trager came through. Anton Korinek is a prominent economist who’s done some work there. Miles Brundage and others.
Rob Wiblin: Back in 2018, in our short interview back then, you were saying people should definitely be diving into this area, because it’s going to grow enormously and it’s going to be really good for your career and there’ll be lots of opportunities. And I think that has definitely bought out: that people who got in on the ground floor have been doing super well, career-wise.
Allan Dafoe: Yeah. And I think it’s still early days for any prospective joiners. I always encourage people to hop trains as soon as they can, because AI is still just a small fraction of the economy, so there’s a lot more impact to come and work to do.
Rob Wiblin: We think that in the fullness of time, it’s going to be close to 100%, certainly more than 10% — and it’s like 0.01% now, or certainly not more than 0.1% in terms of total revenue. So yeah, there’s many orders of magnitude to go up yet.
Do humans control technological change? [00:09:17]
Rob Wiblin: Let’s open by talking about the work that you did in your previous incarnation, which was as an academic. So I think you did your thesis back in the early 2010s on technological determinism as the main focus. And I think the paper that came out of that opens with, “Who — if anyone — controls technological change?” What was the academic debate there that you were reacting to or trying to be a part of?
Allan Dafoe: So my academic trajectory had a number of chapters. The first was on technological determinism, which we can come to. Just for completeness, the second was on great power politics, and peace specifically — which actually led to a lot of work that I think continues to be relevant to the question of AI and AGI governance — and then I also did some statistics and causal inference work, which has some relevance to thinking about AI today.
Turning back to technological determinism, I would say I first came to this in undergrad, reflecting on what shapes history and how we can do good and how we can kind of steer the trajectory of developments in a positive direction.
And an insight I had was that history is not just the sum of all of our efforts. It’s not just that we all kind of push in different directions and then you take the sum and that’s what you get. Rather, there’s these general equilibrium effects that economists often talk about, where it may be that for every unit of effort you push in one direction, the system will kind of push back with an equal force, or sometimes a weaker force, sometimes a stronger force.
So when you’re in such a system, it’s very important to understand these structural dynamics. Why does the system sometimes resist efforts or sometimes amplify efforts? Why do you see these really astounding patterns in macrohistory?
For example, if you look at patterns of GDP growth, there’s these famous curves where, after the devastation of World War II, both Germany and Japan completely rebound within less than a decade and then kind of return to their prewar trajectory.
And we’ve seen Moore’s law, which is just an astounding trend. It’s not just that it continues, that transistor density is increasing exponentially; it’s very much a line — so you can predict where we’ll be quite precisely years in advance. We now have scaling laws which have given us sort of our generation’s Moore’s law, which again seems to allow us to predict years in advance how large the models will be, how capable — for example, on loss and so forth.
There’s a number of other macro phenomena that seem quite persistent. The growth of civilisation: I talked about or looked at these trends in what you can call the maximum — so maximum energy processing of a civilisation, also things like the height of buildings, the durability of materials. Really just most functional properties of technology over time have gotten more functional, so the speed of transportation and so forth.
Robert Wright, in summarising the literature, writes that, “archaeologists can’t help but notice that, as a rule, the deeper you dig, the simpler the society whose remains you find.” There’s more generally an observation, which is almost a truism, that certain kinds of technology are so complex or difficult that they come after other forms of technology. It’s hard to imagine nuclear power coming before coal power, for example. So there’s all these macro phenomena and trends in technology, and it’s important to explain them.
Now, this naive explanation would say if history is just the sum of what people try to achieve, then it’s human will that’s produced all these trends, including the reliable tick-tock of Moore’s law.
But not all the trends are positive. I know you’ve reflected on the agricultural revolution, which evidence suggests was not great for a lot of people. The median human, probably their health and welfare went down during this long stretch from the agricultural revolution to the Industrial Revolution. Of course it gave rise to inequality, warfare, and various other things. And there’s other trends that different societal groups resisted.
So in short, I don’t think the answer is history is just the sum of what people try to do. It depends of course on things like power, on timing, on the ecosystem of what’s functional and what’s possible and what’s not, on what technology enables. So I wanted to make sense of this.
Rob Wiblin: I guess the thing that we have to try to reconcile here is: on the one hand, we see these trends that seem like they’re not really responsive to any person’s particular decisions. It’s a little bit like the psychohistory in Foundation, where you just have these broad trends where everyone is just an ant in this broader process and it’s not obvious that anything that any particular individual did was able to shift things.
On the other hand, technology, at least so far, doesn’t have its own agency. It does seem like in fact it’s humans doing all of the actions that are producing these outcomes. And couldn’t they, in principle, if they really hated what was happening, try to shift it? We feel like we have agency right now over how society goes, or we feel that we at least have some agency.
So how do you reconcile this macro picture — where it seems like humans don’t have that much control over technology, at least historically — with the micro picture, where we feel like we do now. Am I understanding it right?
Allan Dafoe: Definitely. Some of these earlier theorists of technology and scholars of technology, this is in the ’60s to ’80s, even endowed “technology” — this abstraction — with a sense of autonomy and agency. Technology was this driving force, and often humans were along for the ride.
Langdon Winner was one of the most prominent scholars who talked about technology having autonomy. Lewis Mumford talks about “the Machine” as this capital-M abstraction that is driving where society is going, and humans just support the machine; we are cogs to support it. Jacques Ellul referred to “la technique,” which is sort of the functional taking over. And he had this metaphor that humans make a choice, but we do so under coercion, under pressure from what is demanded of us — and la technique is the answer.
So I think there were these scholars and others who really did endow technology with this kind of agency. Then a later generation criticised them, saying this abstract technology is an abstraction — a very high-level abstraction, almost poorly defined. And when you actually look at history in detail in the microscope, where’s technology? You don’t see the machine in the room. Is the machine with us right now? You see people: people with ideologies and ideas and interests and making decisions.
And I would say this led to a revolution in the study of technology towards what’s been called “social constructivism.” Methodologically, it’s more ethnographic or sociological. It looks at the details of how decisions were made, the idiosyncrasies of technological development, the many dead ends or detours. The fact that early on in the development of technology, people didn’t know what the end result was, and they had many visions that were competing — and so it wasn’t foreordained that the bicycle would look the way it does, or the plane would look the way it does.
And I would say, from my personal intellectual trajectory, the PhD programme I started in was one of the prominent departments working on this at Cornell. And for me, this was a surprise, because I really wanted to explain these macro phenomena. And the answer I got from this department was, “This is wrong. This is technological determinism: it is what scholars have since referred to as a critic’s term. Anyone who actually advocates technological determinism is… It’s a straw man position. No one is serious about this.”
So this whole generation of these sociologists and historians of technology really looked at the micro details of how technology developed and dismissed these abstractions that technology can have autonomy, can have an internal logic of how it develops, can have these profound impacts on society. You know, we name revolutions after technology: the agricultural revolution, the Industrial Revolution, and so forth.
Rob Wiblin: In this paper, it seemed like the constructivists, in their reaction to this determinism, were really staking out a relatively extreme opposing position — where they were almost suggesting that it’s always human responsibility, and people always have choice over what technologies they adopt and what form it takes. Am I understanding that right?
Allan Dafoe: Yeah. In a way, I would say the debate was never directly had, or rarely directly had. It was often indirect. So in defence of the constructivists, I think they were asking different questions — or, more importantly, they had different tools: they had the tools of ethnography and sociology, and they were answering questions that those tools allowed them to answer. The answers were narratives based on the conversations that took place, the decisions that were made.
And, to explain macro phenomena, those tools are not well suited. So I do think there was a mistake that was made, which was to dismiss the claims about macro phenomena and technological determinism in the pursuit of the questions that they had — and I think it’s a real loss for the history of technology that so little work has since been done on these bigger macro questions.
Rob Wiblin: Were the constructivists motivated by a sense of moral outrage, maybe? That they saw people perhaps adopting technologies that were socially detrimental, and those folks might then excuse it by saying, “We have no choice; we have to do this for competitive reasons” or “It’s going to happen anyway; there’s nothing one can do”? And the constructivists were kind of frustrated by this and wanted to say, “No, you’re responsible. You’re doing it. So you do have agency here”?
Allan Dafoe: Yeah. This is an argument that’s been made often, and more recently by many different schools, including about AI.
So one criticism of, let’s call it the AGI ideology, as these people would put it, is that AGI is not foreordained, or the development of AI in any given sense is not foreordained. But when we talk about it as if inherently it’s coming, and it will have certain properties, that deprives citizens of agency to reimagine what it could be. So I think that’s the constructivist position on technology, exactly as you said.
Now, the counterposition I would offer is: you don’t want to equip groups trying to shape history with a naive model of what’s possible; you want to channel energy where it will be high leverage, where it will have lasting impact — rather than in these settings where the structure will resist all the force you push in one direction with an equal counterpressure.
Rob Wiblin: Yeah.
Arguments for technological determinism [00:20:24]
Rob Wiblin: OK, so we should talk about the synthesis that you try to put forward in your thesis. What are the circumstances under which we do have more autonomy, and what are circumstances where it can be extremely hard to change the course of history?
Allan Dafoe: Maybe first I’ll just talk a little bit more through the different flavours of technological determinism, because I think it’s a rich vocabulary for people to have.
Maybe in a way, the easiest one to accept is what we can call “technological politics.” This is the idea that individuals or groups can express their political goals through technology. So in the same way that you can express or achieve your political goals through voting or through other political actions, if you build infrastructure a certain way or design a technology a certain way, it shapes the behaviour of people. So design affects social dynamics.
Some of these famous examples are the Parisian boulevards — these linear boulevards that were built in many ways to suppress riots and rebellion because it made it easier for the cavalry to get to different parts of the city.
Latour is a famous sociologist who talked about the sociology of a door or the “missing masses” in sociology, which refers to the technology all around us that reinforces how we want society to be. You can think about gates or urban infrastructure as expressing a view of how people should interact and behave.
A famous example is alleged of Robert Moses, who was a famous urban planner in New York City, that he allegedly built bridges too low for buses to go under so that it would deprive people in New York who didn’t have a car, namely African Americans, from the ability to go to the beach. So this was an expression of a certain racial politics that has been asserted.
In general, I think urban infrastructure has quite enduring effects, and so you can often think about what is the impact of different ways of designing our cities.
Rob Wiblin: I guess in recent times we’re familiar with the debate about how the design of social networks can greatly shift the tone of conversation and people’s behaviour. People have pointed to the prominence of quote tweeting on Twitter — where you can highlight something that someone else has said and then blast them on it. It potentially leads to more brigading by one political tribe against another one, and if you made that less prominent, then you would see less of that behaviour.
Allan Dafoe: Exactly. I think the nature of the recommendation algorithm, the way that people can express what they want, has profound impacts on the nature of the discourse and how we perceive the public conversation.
So this was technological politics. There’s a number of other strands of technological determinism.
Just very briefly, “technological momentum” is this idea that once a system gets built, gets going, it has inertia. This is from sunk costs. So you can think of maybe the dependence on cars in American urban infrastructure: once you build your cities in a spread-out manner, it becomes hard to have a pedestrian, dense urban core.
Or, as has been alleged, maybe electric cars were viable if we just invested more. Or also maybe wind power and solar power could have succeeded earlier if we’d gone down this different path. We might come back to this. I think a lot of claims of path dependence in technology are probably overstated — that, again, coming back to the structure, some technologies and some technological paths were just much more viable. And even if we’d invested a lot early in a different path, I think often it is the case that this path we went on was likely to be the path we would have been on, because of more the costs and benefits of the technology than these early choices that people made.
Rob Wiblin: Yeah, I guess the extreme view would be to say that we could have had electric cars, or I guess we did have electric cars in the ’20s, I think, but we could have gone down that path in a more full-throated way as early as that. I guess the moderate position would be to say, no, that wasn’t actually practical; there were too many technological constraints — but we could have done it maybe five or 10 years earlier if we’d foreseen that this would be a great benefit and decided to make some early costly investments in it.
Allan Dafoe: Yeah. And then to make the counterpoint of there’s a certain time when the breakthrough is ripe, I think in AI this is often the case that insights about gradient descent to neural networks occurred much earlier than when they had their impact — and it seemed they needed to wait for the cost of compute, the cost of FLOPS, to go down sufficiently for them to be applicable.
And you could argue, what if we had the insight late? I think once FLOPS get so cheap, it becomes much more likely that someone invents these breakthroughs, because it becomes more accessible; any PhD student can experiment on what they can access. So there is a seeming inevitability to the time window when certain breakthroughs will happen. It can’t occur earlier because the compute wasn’t there, and it would be unlikely to occur much later because then the compute would be so cheap. Someone would have made the breakthrough and then realised how useful it is.
Rob Wiblin: Was there another school of technological determinism?
Allan Dafoe: There’s other flavours. Another concept that’s emphasised is that of unintended consequences, something we know a lot about. But Langdon Winner points to this notion that as we’re inventing technology after technology, we run the risk of being on a “sea of unintended consequences.” So the future of history is not determined by some structure, nor by our choices, but just by being buffeted one way or another.
Rob Wiblin: By jumping from one blunder to another.
Allan Dafoe: Yeah. And sometimes it’s positive, sometimes it’s negative. I think there’s truth to that: that often a technology comes along and then it takes us some years to fully understand what its impacts are, and then to adapt and hopefully channel it in the most beneficial directions.
Rob Wiblin: Yeah, there’s definitely some effect like that. The people who really highlight that, I think they’re exaggerating sometimes the scale of the negative side effects from technology. Setting aside some particular cases that we’re particularly focused on, it seems like the negative side effects of technology in general have gotten kind of smaller at each generation of technology — that we do solve more problems than we create, on average, is kind of my take.
Allan Dafoe: Yeah.
The synthesis of agency with tech determinism [00:26:29]
Rob Wiblin: Should we come back to the synthesis?
Allan Dafoe: Sure. And maybe the last big part of it is, again, these macro phenomena and trying to explain them. The scholars most working that out I would say are macro historians, macro economists, and political scientists that are trying to explain these long-run trends, and things like the spread of democracy.
One observation that led to the synthesis is that the more micro your observation — the closer to people, to the day to day — the more likely you were to conclude a constructivist explanation.
This is a robust empirical finding: if you look at the literature, people who have micro methodologies are much more likely to conclude constructivist type claims — that what matters is individuals, decisions, ideas, and so forth. Whereas the more macro your methodology and your aperture, the more likely you were to conclude a more deterministic set of claims.
So we have a puzzle there. I’d say some of the constructivists concluded that’s because the macro scholars allow themselves the error of imputing agency to technology because they’re so far from the data. I think that’s an unfair characterisation. Rather, I do think there’s emergent phenomena at different scales of analysis, and so we should give the macro phenomena its due and try to explain it.
One analogy that I’ve offered here is to imagine this hypothetical science of wave motion. So we have a group of scientists who emphasise wind: it turns out when the wind is blowing, that affects the ripples on top of the water. And then another community is based on kinetic impact: they say, “Look! When we throw rocks into the water, it produces waves,” and that’s their preferred theoretical framework.
And then there’s this kooky macro water phenomenologist who says, “I’ve been noticing that whenever the Moon is directly above us, the water level is at its highest point. And then when it’s at the horizon, it’s at its lowest point. And I’ve travelled all over the world and this pattern is robust. So I will offer this Moon determinism: that the Moon explains water levels.”
The wind and the kinetic scientists would be mistaken to dismiss the Moon determinist simply because the Moon determinist doesn’t have a micro mechanism. There’s a challenge to that finding: namely, how do you explain this pattern? Because there’s no known micro foundation that can explain why the Moon is sort of pulling water. But of course we know that that is in fact what it’s doing.
And so I think there’s similar results in macro history, that there are these patterns that need to be explained. And the fact that we didn’t have a micro foundation isn’t a reason entirely to dismiss it, but it is a challenge.
So what is a possible micro foundation? The one I offer hinges on military–economic competition.
The key idea is that there are levels of selection. So at the local level, you and I can make a decision about what we do right now — maybe if we wanted to build something, the technology we build — and that just depends on us: on our ideas and so forth.
But if we really want to get going, if we want to build a new kind of art, and we want it to be everywhere, then eventually we’re going to need resources to pay for it, and maybe it can’t be too opposed by other groups. So eventually we run into these other forces.
And when you think about ways of living — which is kind of a general term for sociotechnical systems, for civilisations — they do run into resource constraints: they need resources to sustain themselves and then to proliferate, which they typically want to do. So that involves economic competition: competition over resources and over capital.
And then the military aspect is always important, because throughout most of history, military competition was ever present. Even if you had decades or hundreds of years of peace, eventually there was military competition from a neighbour. And that provided a higher-level of constraint on what ways of living were possible.
Rob Wiblin: Yeah, and I guess even if there isn’t active war, kind of everyone’s living in “the shadow of violence” I think is the term: that they’re anticipating that there could be war in future, and maybe if they don’t play their cards right, then they would be vulnerable to aggression.
Allan Dafoe: Yeah. And we could add a higher level of selection: in my thesis I did put environmental selection on top of military and economic. There were these sort of circles of selection in the sense that civilisations that might be fit for the economic competition, and military competition may nevertheless not be sustainable with their environment, and that could be another source of failure to sustain itself and proliferate.
So you can imagine these layers of selection. I tended to put environment at the top, military–economic, and then you might put culture, and then you can put psychology or more local dynamics lower down.
One famous quote in the history of technology that was arguing against determinism was that technology doesn’t force us to do anything; it merely opens the door. It makes possible new ways of living, new forms of life. And my retort was technology doesn’t force us; it merely opens the door — and it’s military–economic competition that forces us through.
So when a new technology comes on the stage, many groups can choose to ignore it or do whatever they will with it. But if one group chooses to employ it in this functional way that gives them some advantage, eventually the pressure from that group will come to all the rest — and either force them to adopt, or lead the other group to losing their resources to the new, more fit group.
Rob Wiblin: So that seems, in a sense, very obvious. Why do you think the constructivists were missing this? Or why didn’t this stand out to them as an important effect?
Allan Dafoe: Well, I’m glad you think it’s obvious.
Rob Wiblin: Maybe because of my background.
Allan Dafoe: I mean, let’s not underestimate the bias that comes from a scientist using the tools that they prefer to use, looking under the lamplight.
So the constructivists were very good at ethnography and sociology and this kind of daily life history, this close-up microhistory. And when you look that close, you don’t see the machine exerting its force. Military competition at a macro historical level is ubiquitous, but at a micro historical level is rare. Wars are rare. And as you said, much of the effect of this military–economic competition is through how people internalise that threat.
Then you can equally say that it’s not this competition that’s driving behaviour; it’s the ideology of capitalism, of military greatness that is driving the behaviour.
This is a methodological challenge. I do think there’s this concept of vicarious selection: in an evolutionary environment, it is highly adaptive for an organism to model its environment and to internally simulate what will happen if it goes in one direction or another.
This concept was named by a historian of technology who was trying to explain the development of aviation, aerospace design. His point was that you don’t build a plane and try to fly it, and it crashes and you build another plane and try to fly it and crash. Rather, you invent the wind tunnel, and you have a theory — you model the external environment, and you say, “OK, we want these properties in our wing.” So you are still doing this experimentation; you’re just doing it in a controlled, targeted manner, internal to the broader economic competition.
Rob Wiblin: So if I think about how these folks might respond, at least my simulacrum of them… Maybe I shouldn’t have said it was obvious. It’s obvious to me because I’ve literally been taught this, more or less, in books and maybe even in undergrad. I suppose everything is obvious once you’ve literally been told it.
But a pushback that I can imagine is: we’re living in the UK; the UK has nuclear weapons, and it’s in a pretty friendly neighbourhood. Are we really saying that when the UK adopts some new technology, or it designs its cities one way or another, or has a particular housing policy, it’s doing this because it thinks that it has to for defensive purposes? Because otherwise it’s going to be invaded by France or Russia or whoever?
It’s not as if we’re actively thinking about these defence issues or competitive issues all the time. I guess individual businesses do think, “If we don’t adopt this new technology, we’ll be outcompeted.” But at least the military thing is less clear once you have a very strong defensive position where you don’t feel a great risk of attack.
Allan Dafoe: Yes. There’s two things I’d want to say here. One is, in the modern era, military competition has declined a lot, and we have much more of a global culture than we had 100 years ago, or 300 years ago, or more. So that can change this higher level of selection.
I do think it’s still there. And certainly you see conversations around national security having a lot of force in domestic politics — UK politics, US politics, in pretty much any country’s politics — that if there’s a claim that we risk losing a strategic positioning against an adversary, that can be very motivating for internal reform.
Rob Wiblin: Yeah.
Allan Dafoe: The second point I want to make is that there’s this great example from UK history, where I think this dynamic is really well illustrated.
This is from Thomas Hughes, this historian of energy systems. He looks at the UK energy system, which initially had these local power plants and local energy systems that Thomas argues were better suited to the United Kingdom’s notion of democracy: there was decentralised energy provision, it was much more under the control of localities, it wasn’t this big national energy system.
And that persisted up to World War II, when the cost constraints that it imposed became excessive, and the UK made the decision to adopt more of a national grid. So there is again: this story of one certain way of living, arguably aligned with the political ethos of the community, persisting until the cost constraints become excessive, often driven by this crisis of conflict.
Competition took away Japan’s choice [00:37:13]
Rob Wiblin: So the case study that you focus on in your work is the Meiji Restoration in Japan in the mid 19th century. It feels like almost as clean an example of this military competition driving history as you could imagine. Can you briefly explain it?
Allan Dafoe: Sure, yeah. It’s good to have empirics that you can tell a story about. Moore’s law and these macro phenomena aren’t sufficient evidence for making sense of macrohistory, so I looked for a case where a community chose to go in a direction contrary to what was demanded by la technique, by what was functional in this military–economic competitive milieu.
And there’s this great example of Japan under the Tokugawa regime. The regime lasted roughly 200 to 250 years, and it was a return to the [samurai] way of life. So samurai were at the top of the pecking order, a very feudal society. They had firearms at the beginning of this period, and they sort of uninvented the firearms. The shogunate centralised firearm production, so everyone who knew how to produce firearms was brought into a central place and then paid a stipend to not build firearms, so that the technology was forgotten. So they had cannons, firearms, and they lost the technology.
That persisted for roughly 200 years. And during this time, Japan wanted little to do with the outside world. But they observed that things were changing, that the Europeans were sailing around and involved in China.
And everything changed on this fateful day in 1853, when an American, Commodore Perry, visited Japan with the explicit purpose of opening Japan up to trade. And he came in these steamships that were seemingly magical: they were moving upwind without sails, belching black smoke, made out of this very large heavy metal — and had a real profound impact on the Japanese who received him.
He gave a demonstration of what was possible with the cannons to bombard the shore, and gave them white flags so that they could signal their desire for the bombardment to cease — you know, be able to communicate in the future. He said he was going to come back in a year to complete the negotiations. The Japanese at the time asked him, “Will you bring your ships again?” And he said, “I’ll bring more.”
So that was the opening of Japan. It led to a 15-year period of revolution. This was the Meiji Restoration, where different groups were trying to make sense of their new environment. It was no longer sustainable to continue their way of life, and different groups contested it in different ways.
And the final answer was this restoration under the emperor, and a view that we need to modernise. So the Japanese very proactively sought to learn everything they could about the West. They sent people to the West to get all the books on all the industrial arts and so forth. And Japan incredibly succeeded: just several decades later, Japan was able to contest in World War II control of Asia against the US and Britain and others.
I think what’s powerful about that story is it really shows how a group of people in a sense chose — of course, there’s power infused through it — but that community chose to go one direction with respect to the technology of firearms and other aspects of modern industrial civilisation, and that choice was time limited by how long the West would choose to not force on Japan a different way.
Rob Wiblin: Yeah. Well, I think the thing that set the timeline, you can imagine the only reason this was possible for Japan was because it was an island, and so it was actually quite hard to invade. So they had this protective barrier, and that gave them a degree of discretion that, if they were on the steppes of Asia, I think they would not have had nearly the freedom of movement that they had as an island. And they would have felt the pressure and the fear of invasion; it would have been much more salient, and they would have just been much more focused on being able to defend themselves.
I suppose it gave them breathing room that allowed them to fall quite a bit behind. But at some point, they fell so far behind that even the sea barrier was not enough to keep them safe from invasion. And at that point they did a complete 180 and decided to catch up and modernise.
Allan Dafoe: Yeah. So I find this case study quite compelling. And most case studies are rarely so clean, because people internalise these external pressures, and there’s mimicry and status dynamics where some communities look up to other communities in a way that’s often correlated with power or wealth. So the narrative is not as clean. Whereas in this case, it was very clear that what forced the change was the sheer power of the steamship and cannon that the West could bring.
Can speeding up one tech redirect history? [00:42:09]
Rob Wiblin: So I guess the reason we’re talking about technological determinism is that many people in our circles are very focused on this idea of differential technological development.
A related, more recent idea was defensive accelerationism, which is that the way that we want to try to shift history in a positive direction is shifting the order in the development of technologies, or advancing some particular lines of science and research to try to get them ahead of other ones — and you want to advance the ones that you think are generally making the world safer, so that you have more of those technologies by the time that other more risk-increasing technologies arrive on the scene.
What does the discipline around technological determinism say about whether this is a viable pursuit, and a sensible approach to trying to influence history in a good direction?
Allan Dafoe: Yeah, great question. And a big question. To give a bit more colour on differential technological development — which is a very clumsy term to say, but the community hasn’t come up with a cleaner term — to motivate it, maybe the best example is the seatbelt.
The seatbelt seems like something we could have invented before the car. You can imagine faster-moving vehicles, and the value of restraining a person in the event of a collision. It doesn’t seem like it requires the invention of the capability, the combustion engine in the car, in order to invent the seatbelt. In principle, we could have invented the seatbelts before the car, and then had it ready to go as soon as cars were diffusing, so that we didn’t have to wait decades for the seatbelt. This is an example of a safety technology that pairs nicely with the capability, to make the capability safer.
And there’s other examples that would make it more beneficial in other ways. Another class of technologies or interventions are when you can develop the countermeasures or societal defences for the capability that has these adverse byproducts. An example would be a vaccine: if you know that a potential disease could come, you can develop the vaccine in advance.
A third category is a substitute. An example often given is whether wind power or solar power could have been made more cost effective than fossil fuel-based power, so if the cost curve was such, then we would have invested more in these sustainable energy sources and then civilisation would have gone down that path, rather than a more fossil fuel-dependent path.
I think these interventions are the cleanest examples. There’s more general ones that imagine wholly different technological paths that have different properties, and it’s worth reflecting on them.
Again, Langdon Winner argued nuclear power was more authoritarian. The story being that the nature of the technology is that it requires centralised development, and it requires strong coercive infrastructure around it to make sure it’s not abused. Whereas wind and solar permit both decentralised development and so decentralised politics, and they don’t have this risk that requires more of a security state on top of it. So there’s arguments of certain technological trajectories that have these byproduct political or social effects.
Now to offer some challenges. Firstly, I do think differential technological development is a very important idea that we should be thinking about a lot. And in many ways, it undergirds the whole notion of AGI safety. The whole notion of AGI safety as a field is that we want to, on the margin, put a bit more effort into AI safety or AGI safety than we otherwise would, than the market would naturally do, with the idea being that’s going to make a difference. It’s like inventing the seatbelt before the car.
So I think it’s a very important idea. However, there’s a lot of good reasons to doubt its tractability or feasibility. One way to see this is that most of the arguments for differential technological development require some person to see two pathways that are both viable, given some effort, and then furthermore to anticipate the consequences of each of those pathways to choose the better one — and these things are hard.
First, it’s quite hard to know what is the next viable step in technological development. If you know that, you can make a great business: the market is very hungry for that insight. So you should think it hard to find two of those to be at the point where you can have this marginal choice. Again, it has to be a marginal choice that others aren’t already pursuing in order for it to be an intervention. Or you have to convince a resourced actor to choose to go in one way or another.
Then there’s the second stage, which is you have to predict the consequences of going down one path or another, which is extremely hard. You have to anticipate the full sequence of technologies and the tree of technologies that will spawn off of one path versus another, and then the many direct and indirect consequences of those technologies. And we know from the study of technology and from our attempts to make sense of technology that it’s just very hard to foresee the direct effects and second-order effects of technology. So that’s a bit of pessimism.
Coming back to the technological determinist perspective, I think this notion of technological momentum would say there is this path dependence in directions you go down: you build up expertise, you sink investments.
So it is important at the beginning of major investments in a new infrastructure, when you find yourself making these large investments, to ask, “Is this going to sink costs? Is this going to make it harder for us to choose a different path in the future?” And reflect, “Are there other consequences we should be weighing before we go down significantly in a certain direction?”
Structural pushback against alignment efforts [00:47:55]
Rob Wiblin: In the archetypal case that we’re talking around here, with artificial intelligence and AI alignment methods, on one hand you’ve got to say that we’ve got to have two different viable paths where some incremental effort might push us in one direction versus another.
It doesn’t feel quite as binary as that, because you can imagine almost everyone thinks that AI companies are going to put some effort into alignment. They do care that the models do, broadly speaking, what’s being asked. And the idea is that we want to do more of that than the market might provide. So it’s not that we’re choosing between aggressively non-aligned AI that someone really wants; it’s more just trying to go even further on this thing that most people are going to regard as desirable, and want to incorporate if it’s practical.
And then in terms of deciding whether this is actually a better path, is actually going to have better consequences, I’m sure people have made arguments that alignment might backfire, or it could be worse than not aligning it. You can imagine ways. But still, on balance of probabilities, it seems like a reasonable bet, and not something that people are super uncertain about, or at least that I feel really uncertain about.
Maybe this is the case that people focus on so much because it is among the better ones that people have ever come up with for trying to pursue differential technological development. And maybe there’s lots of other ones that were left on the scrap heap because it wasn’t clear that they were either viable or desirable. What do you think?
Allan Dafoe: Yeah, I agree that safety and alignment are on net very beneficial bets that we should be investing heavily in. I agree with that overall assessment.
To make the counter case, one would be this general equilibrium argument that the marginal return on your investment in safety and alignment is actually quite less than what you pay, because the market would otherwise have provided that.
To motivate this, you can reflect that the success of AI assistants today are very much constrained, I think, by their ability to be aligned and safe. Safety and alignment is a huge priority for developers — because if they’re not, then they will not be good products. So the market already is providing a huge motivation for advancing this field.
As it happens, these techniques of RLHF, reinforcement learning from human feedback — which is one of the key alignment techniques, and then I think constitutional AI is another nice one — were developed by individuals who are motivated by AGI safety and supported by those resources. Maybe that brought the technology forward a few years.
But the counterfactual: imagine we had no investment in AI safety and alignment. Maybe it delays it two years until the market demands we solve this problem, and then other researchers rise to the challenge.
Rob Wiblin: So the smart aleck response to reinforcement learning from human feedback being developed by alignment- and safety-focused people, and then applied to make AI useful and economically valuable across all kinds of different domains, is to say, “You’ve wasted your time. Maybe you’ve made things worse by speeding them up, which you didn’t want to do.”
On the other hand, it seems like a reasonable reaction to say, “Well, what’s your plan? To not develop any of the technologies that actually makes AGI work? That doesn’t really seem like an alternative. Surely that would just delay things at best, and we need to get to the point that we’re at now at some point, sooner or later.”
But I guess you’re saying that even if it’s not actively detrimental, it could be kind of useless, because the market, like GDM or some other group, would have realised that we needed the equivalent of reinforcement learning from human feedback to make it work. So they just would have done it at some later time anyway, so the effect of your work has kind of just been undone.
Allan Dafoe: Yeah. And again, to reiterate, I think the bet on safety and alignment and interpretability are very good bets on net, so we should keep doing them. But on the margin, I guess what we want to do — and I think sophisticated individuals in the space are thinking this way — is look for what work in safety and alignment would not otherwise be done by the market in time.
And this is maybe where the AGI safety and AGI alignment points to: it asks, “What’s the seatbelt for AGI? What are the guardrails that we need for AGI that the proto AGI, the pre-AGI system, the market would not motivate us to find the solution for those systems?”
So we want to look ahead. This sometimes goes to the notion of deception, which might generate a whole new class of problems when an AI system is misaligned, but it can hide that and deceive us. That might be a different problem than the systems we have today, and it might arise right around this critical period. So this is one argument for differentially focusing on that problem as opposed to others.
Rob Wiblin: Yeah, makes sense.
### Do AIs need to be ‘cooperatively skilled’? [00:52:25]
Rob Wiblin: OK, let’s switch on from technological determinism and talk about this research agenda that you’ve been involved in promoting and elaborating called “Cooperative AI.”
As we’ve been saying, the main focus of differential technological development thinking with regards to AI has been alignment for many years. But you and some coauthors in this paper back in 2020 think there’s this whole other cluster of behaviours that we might like to speed up the development of around cooperation that you think could be similarly important — or at least on the margin could be similarly important, because people aren’t really talking or thinking about it.
How do you define Cooperative AI, and why do you think it’s quite key?
Allan Dafoe: Yeah, great question. And it’s big. The answer will be extensive, because the whole theoretical framework around Cooperative AI is sort of large and complex.
One way of putting it simply is that alignment is insufficient for good outcomes. And to make an even stronger claim, you could say it’s not necessary to solve alignment to have good outcomes. Now, this is a strong claim. It’s a strong claim, but it helps motivate the case.
So imagine we only 90% solve alignment: our models do what we want within certain bounds, but we know if we scale them too far, we can’t trust them to continue to behave as we intend them to behave. If we know that, and we have global coordination so humanity can act with wisdom and prudence, then we can deploy the technology appropriately. We can deploy it within domains and to the extent that is safe and beneficial.
This is the sense in which global coordination is almost a necessary and sufficient condition. If we can globally coordinate, that’s the necessary argument: then we could deploy it to the extent that it’s safe to deploy. The sense in which global coordination is sufficient is that if we’re globally coordinated, then we could just appoint a reasonable decision maker to make this risk calculus, and that would satisfy humanity’s collective view on how we should develop.
Now suppose we solve alignment — so this is a very strong case — but we don’t solve global coordination. I can imagine things still not going very well. We have great powers that would develop these powerful AI systems aligned with their interests and in conflict with the other great powers’ interests.
And historically, great power conflict has been a major source of harm to humanity. Devastating wars. The brinkmanship around nuclear weapons has arguably imposed unexpected costs on humanity — more devastating than the world wars — in the willingness of the leaders of the US and the Soviet Union to gamble over nuclear war for these geopolitical stakes.
Then there’s other consequences, like maybe a failure to deal with climate change or insufficient global trade or pandemic preparedness: all of these global collective action problems that we insufficiently address, partly because we are not coordinated at the highest level and geopolitically.
So that’s one motivation for Cooperative AI. To say that if we want things to go well, we ideally have two pieces of the puzzle: we have systems that are safe — which is to say especially behave as intended by the principal who deploys them — and we’re able to deploy collectively AI systems and continue our activities in a way that’s jointly peaceful and productive, which is to say we’ve solved enough of our collective action problems that we’re not continuing to engage in nuclear brinksmanship or trade wars or other major welfare losses due to insufficient global coordination.
Rob Wiblin: OK, so the idea is that even if you have AIs that are aligned with the goals of their operators, then this doesn’t necessarily lead to a good outcome if those operators are in conflict with one another: that the AI systems that they’re working with could simply lead you to a disastrous outcome.
Just as the fact that we’re maybe aligned with our own interests doesn’t necessarily produce a great outcome across humanity as a whole — you can still end up in traps and unintended disasters — the same basically applies in a world where it’s AGI that’s doing most of the operationalisation of what people want.
Allan Dafoe: To give another example to motivate this, there was this famous flash crash in 2010 — so this is early days; this is not sophisticated AI — where some algorithmic trading led to an inadvertent sell off in the stock market that led to trillions of dollars of on-paper loss before the emergency stopgaps kicked in that stopped trading and allowed those trades to be unwound. Because those were not intended by the traders; rather, it was this emergent dynamic from algorithms that had some protocol that made sense within normal bounds, but when interacting could get out of control.
You sometimes see this on Amazon or other online marketplaces where famously some book will sell for millions of dollars because apparently these two sellers each had an algorithm that would bid up the price as a function of what the others were selling it at, and these things iterated to a crazy valuation.
And flash crashes, I should say, happen frequently. In the stock market we just have these safeguards so that when there’s a sudden movement in the market, trading will stop, and then there’s rules for how you can unwind those trades.
So as we deploy AI systems, simple AI systems, narrow AI systems, and increasingly general AI systems out in the world, and increasingly, as people talk about agents out in the world — which are more empowered, more general purpose, can move between domains, maybe have access to bank accounts and emails and so forth — how do we make sure that there aren’t these unintended emergent dynamics that could be harmful? Cooperative AI partly looks to addressing that issue.
Rob Wiblin: I guess a spectre that haunts this entire conversation, if we’re focusing on the military case in particular, is that historically when you have countries that are in intense conflict with one another, you get this process of brinksmanship — where one country will escalate, and the other country has to decide whether to call or escalate or back down. And you keep getting this brinksmanship escalation process until at some point one of them blinks, or they decide to find some solution that makes them both happy.
The trouble is, if you have an AI-operated military, these AIs can make these decisions on a completely inhuman timescale, where this entire process of brinksmanship that might take days or weeks or months when humans are the ones who have to go away to a meeting and think about it and discuss it and decide how to react, it could all play out in a matter of minutes, basically.
I think that possibility terrifies people, and is one thing that is discouraging people from placing AI into important decision-making roles over national security at all.
Allan Dafoe: An important question in the deployment of agents will be what are the degrees of autonomy we endow our systems with, versus when do you have a human review of decisions of different kinds. And that’s a function of the stakes of the decision, the resources that are deployed, maybe how much actuators the agent has — so if the AI is controlling weapon systems, then the important role of human review becomes much greater.
But as you note, if there is a time pressure, Paul Scharre, who’s a theorist of AI in the military, worries about a flash escalation occurring in potentially kinetic warfare. Or another dynamic would be cyber conflict.
Rob Wiblin: So looking over this paper, I wasn’t initially completely convinced that this was something that we had to go far out of our way to focus on — basically because I think of cooperative behaviour and knowing how to cooperate with other agents as just something that’s instrumentally, convergently useful. If you’re developing agents that are good at their job at all, that are in fact useful to apply, then they have to learn how to cooperate, at least in the kind of cases where they’re being used, because otherwise they’re just bad at their job. It’s the kind of thing we were talking about earlier, that there’s a lot of commercial pressure to develop this.
And also these models we’re imagining are very generally capable: they’re insightful, intelligent, maybe approaching or exceeding human level. Why wouldn’t they just be able to think about these things, and figure out how to cooperate the same way that I think thoughtful human beings figure out how to try to avoid conflict as best they can?
Allan Dafoe: Yeah, I think you’re right that cooperative skill, or “cooperative intelligence,” as we would characterise it, is likely instrumentally useful. So we should see some cooperative skill developed as a byproduct of almost any development agenda for AI.
The Cooperative AI bet is that, on the margin, it’s beneficial to invest more in it early — so that when we get to powerful systems, they are more cooperatively skilled than they otherwise would be.
And in this respect, it’s very similar to the safety and alignment bet: safety and alignment is something that is likely to be developed by default to some extent. The bet, though, is that it’s worthwhile for us to invest energy early, so that we’re further ahead on safety and alignment than we otherwise would be. The seatbelt comes earlier. This cooperative sophistication and skill comes earlier, relative to the level of capabilities of the agents that are out in the world.
How AI could boost cooperation between people and states [01:01:59]
Rob Wiblin: Yeah. We’ve talked about the military issues with cooperation, but that’s a pretty extreme case. I imagine there’s a lot more other, more mundane examples where cooperation could be useful. Do you want to give a couple of those?
Allan Dafoe: Yeah. I mean, much of society is bargaining interactions, economic exchange — whether it be on some marketplace to sell used goods or actual marketplaces, to the financial markets or major corporate deals.
And there’s a lot of welfare gain that can be had if we can strike deals more efficiently — if, when two parties can be made better off, they reach that understanding. Right now we bargain through existing protocols and institutions that are human built. In principle, AI could be much more effective, or at least an AI-human team could be much more effective.
There’s exotic solutions, like maybe you could put your AI delegate in a box with my AI delegate, and then we have them bargain. But then all that we let come out of the box is the proposed solution. And that may in fact solve some bargaining problems. Often the challenge is, in the act of bargaining, we reveal private information which could give the other side an advantage. So that leads bargainers to withhold information; to bargain slowly; to demonstrate, resolve, or signal that they have a good outside option.
This idea of put your agents in a box, and the agents know that the only thing that they can do is output the solution or say there’s no deal, could dramatically change the nature of the bargaining dynamic — so that we’re much more often going to get a solution, and in a way that involves less of this costly signalling that typically takes place.
Rob Wiblin: Yeah. There’s other ways that I thought it might be easier for AIs to cooperate than it is for humans.
One thing is that they just have a much higher bandwidth of communication: they can actually just send an enormous number of words and have a very lengthy conversation, where humans just wouldn’t regard it as worth it in order to try to reach some bargaining outcome.
I guess also they can commit to act in a particular way, because you can just copy them and demonstrate that, in a given situation, a model will accept a particular kind of bargain consistently. And then say all of the exact copies of this piece of software will do the same thing, so I’ve created this cooperative software and you should trust me. That’s not something that you can easily do with human beings. I guess we try to look at their historical track record to learn what they’re like, but here it’s even easier to judge the character of an AI model.
Are there important ways that it could be more difficult or that it’s less straightforward for AIs to cooperate with one another than humans?
Allan Dafoe: Yes, great question. You may have to remind me to come back to it, as there’s a lot here.
First I want to clarify that the Cooperative AI bet is sort of a portfolio bet: it says there are many cooperation problems at present and in the future between AIs (which is what we’re mostly focusing on), but also between AIs and humans, and human-human cooperation. And the bet is that, on the margin, if we put in effort now, AI might help us with these cooperation problems. So it could help two humans cooperate better; it can also help the future AI systems cooperate, or AI systems and humans cooperate.
I’m making this distinction partly because I think one very promising direction is AI systems that can help humans reach solutions.
We were just talking about this bargaining setting where there’s some conflicts of interest. Another maybe more prosocial example is where people are trying to do political deliberation: we’re trying to find out what is the right course of action for our community, for our municipality, or our family, or our nation. And we have a set of tools for doing that. These institutions of deliberation — the press, voting, and so forth — and AI could dramatically potentially help humans better find the course of action that they would want to pursue.
Google DeepMind recently produced a paper talking about this idea of a Habermas Machine, which is AI as a sort of facilitator of political deliberation. What they find is that language models today can actually serve as a useful tool to summarise the political views of a range of people about some actual policy issue, and then to articulate a detailed, productive consensus [or “common ground”] that the people will sign onto. And what they report is that this Habermas Machine, this AI, can actually articulate a better consensus than the humans that they employed to try and do so.
If we imagine extending that further, that could really help in a lot of political settings where we do have difficult, multidimensional issues to talk through, and AI can help us understand what are these dimensions? What are the dimensions that I care most about? What are the potential mutually beneficial solutions between me and another group on that dimension?
Rob Wiblin: Yeah.
The super-cooperative AGI hypothesis and backdoor risks [01:06:58]
Rob Wiblin: Let’s come back to the ways that it’s more difficult for AIs to cooperate.
Allan Dafoe: Yeah. I think a lot of people think AI will be better at cooperating; I think that’s the prior. Maybe something we can talk about is what I’ve called the “super-cooperative AGI hypothesis”: that as AI scales to AGI, so will cooperative competence scale to sort of infinity — that AGIs will be able to cooperate with each other to such an extent that they can solve these global coordination problems.
Rob Wiblin: It’s almost magic.
Allan Dafoe: Yeah. And the reason that hypothesis matters is that if it’s true that AGI, as a byproduct, will be so good at cooperation that it will solve all these global coordination problems, these collective action problems, then in a way we don’t need to worry about those. We can just bet on AGI, push to solve safety and alignment, and then AGI will solve the rest. So it’s an important hypothesis to really think through: what does it mean, what is required for it to be valid and for us to empirically evaluate are we on track or not?
So here are some arguments against AIs being very cooperatively skilled with each other. And there’s different levels of this argument: we can say that maybe AI will be cooperatively skilled, but not at the super-cooperative AGI hypothesis level, or maybe they will even be less cooperatively skilled than human-to-human cooperation.
The strong version of the argument would say that humans, we’re pretty similar: we come from the same biological background and, to a large extent, cultural background. We can read each other’s facial expressions to a large extent. Especially if two groups of people are bargaining, we can often sort of read the thoughts of the other group. Especially if two democracies are bargaining, you can read the newspapers, you can read the press of the other side. So in a sense, human communities are transparent agents to each other, at least democracies.
Rob Wiblin: We also have enormous historical experience to kind of judge how people behave in different situations.
Allan Dafoe: Exactly. We can judge people from all these different examples. You know, we have folk psychology theories of childhood or the upbringing of people explaining their behaviour. And also the range of goals that humans can have is fairly limited. We roughly know what most humans are trying to achieve.
AI might be very different. Its goals could be vastly different from what humans’ goals could be. One example is most humans have diminishing returns in anything. They’re not willing to take gambles of all or nothing — like bet the entire company: 50% we go to 0, 50% we double valuation — whereas an AI could have this kind of linear utility in wealth or other things. AIs could have alien goals that are different than what humans typically anticipate. They may be harder to read.
Certainly if you have interpretability infrastructure, they could be easier to understand. But if there’s a bargaining setting, why would one bargainer allow the other to read its neurons? So in a sense, AIs could be much more of a black box to each other than humans are.
Rob Wiblin: I guess the extreme instance of this is that they can potentially be backdoored. I think under the current technology we can almost implant magic words into AIs that can cause them to behave completely differently than they would previously. And this is extremely hard to detect, at least with current levels of interpretability. So that could be something that could really trouble you: you could have an AI completely flip behaviour in response to slightly different conditions.
Allan Dafoe: Exactly. Humans can deceive — but it’s hard; it requires training. And to have this level of complete flip in the goals of a human and the persona is hard. Whereas, as you say, an AI in principle could have this kind of one neuron that completely flips the goal of the AI system so that its goals may be unpredictable, given its history.
Also, this is a challenge for interpretability solutions to cooperation. So people sometimes say that if we allow each other to observe each other’s neurons, then we can cooperate, given that transparency. But even that might not be possible because of this property that I can hide this backdoor in my own architecture that’s so subtle that it would be very hard for you to detect it. But that will completely flip the meaning of everything you’re reading off of my neurons.
Rob Wiblin: This is slightly out of place, but I want to mention this crazy idea that you could use the possibility of having backdoored a model to make it extremely undesirable to steal a model from someone else and apply it. You can imagine the US might say, “We’ve backdoored the models that we’re using in our national security infrastructure, so that if they detect that they’re being operated by a different country, then they’re going to completely flip out and behave incredibly differently, and there’ll be almost no way to detect that.” I think it’s a bad situation in general, but this could be one way that you could make it more difficult to hack and take advantage, or just steal at the last minute the model of a different group.
Allan Dafoe: Yeah. Arguably, this very notion of backdooring one’s own models as an antitheft device, just the very idea could deter model theft. It makes a model a lot less useful once you’ve stolen it if you think it might have this “call home” feature or “behave contrary to the thief’s intentions” feature.
Another interesting property of this backdoor dynamic is it actually provides an incentive for a would-be thief to invest in alignment technology. Because if you’re going to steal a model, you want to make sure you can detect if it has this backdoor in it. And then for the antitheft purposes, if you want to build an antitheft backdoor, you again want to invest in alignment technology so you can make sure your antitheft backdoor will survive the current state of the art in alignment.
Rob Wiblin: A virtuous cycle.
Allan Dafoe: So maybe this is a good direction for the world to go, because it, as a byproduct, incentivises alignment research. I think there could be undesirable effects if it leads models to have these kinds of highly sensitive architectures to subtle aspects of the model. Or maybe it even makes models more prone to very subtle forms of deception. So yeah, more research is needed before investing.
Rob Wiblin: Sounds a little bit like dancing on a knife edge.
Aren’t today’s models already very cooperative? [01:13:22]
Rob Wiblin: OK, so that’s some ways that it might be more difficult for AIs to cooperate. Maybe another reason why this research agenda doesn’t feel intuitively so absolutely essential is that the AIs that I interact with — LLMs — just seem very cooperative and very nice by nature. And it’s kind of easy to imagine scaling them up, doing the same sort of RLHF that we’re doing to produce that sort of personality now, and say, wouldn’t they continue to have nice personalities and be really quite cooperative by nature? What do you think of that argument?
Allan Dafoe: I would say that we need to make a distinction between niceness and cooperatively skilled or cooperatively intelligent.
So cooperative skill is distinct from a cooperative disposition. When we say someone’s cooperative, we often mean they’re sort of nice, they’re altruistic, they’re generous, they’re prosocial. And the Cooperative AI research programme is not about that. It’s not about how we make nice AI or prosocial AI: it’s about how we make cooperatively intelligent AI — AI that can solve these cooperation problems better than they otherwise would be able to.
So it’s an open question how cooperatively skilled today’s AI systems are. I agree they’re nice, but they’re primarily interacting with individual users. They’re not deployed much to solve cooperation problems. So we do need an empirical science of how cooperatively intelligent they are.
We can also expect that in equilibrium they will not be necessarily so prosocial, because they will be deployed by interested actors in a bargaining setting. So if we’re trying to make a deal, I want my AI system not to be nice — I want it to be a faithful delegate to the interests of the principal. So then, again, in equilibrium, it should be aligned with me, not with some notion of the social good. And then the question is: Will it be able to efficiently bargain with other agents that are similarly aligned with other human principals to find a mutually beneficial solution?
Either that, or can we build this notion of a governor or a mediator that we can both trust will weigh our respective interests reasonably, and then we would each tell this mediator/arbitrator our goals, our resources, our outside options and maybe prove it — and then the arbitrator would tell us what the solution is.
Rob Wiblin: I mean, a nice thing about AI models is, because you can test them and then use exactly the same model to consider the new inputs as all of the previous ones, you can trust them maybe as a mediator, by looking at the track record to see that they produced fair outcomes in all these previous cases as a judge, so I would trust it to produce a fair outcome in this new dispute that I’m engaged in. I guess there’s a lot of potential there.
How would we make AIs cooperative anyway? [01:16:22]
Rob Wiblin: What is the actual agenda for trying to make AIs more cooperative? Are there people working on it? What actual technologies do we need to develop?
Allan Dafoe: Yeah, I think there’s a lot of theoretical ideas and research programmes that could help. So there’s this Cooperative AI Foundation which I helped found, and the Foundation’s goal is to promote Cooperative AI: AI systems that are more cooperatively skilled than they otherwise would be, especially betting on AI systems scaling in capability. So we’re kind of betting on very capable AI systems in the future.
One area that the Cooperative AI Foundation has found to be high leverage is environments or benchmarks. I think this is true for many domains: a kind of public good for AI development is building an environment that measures cooperative skill, and then you invite many groups to sort of compete to have the most cooperatively skilled agent, and then you measure that and you celebrate advances in that. It’s often the case in AI that a good benchmark can really motivate work, because it gives this target for research to hill climb on (so the expression goes).
Rob Wiblin: OK. So you have lots of different tests for how effectively these AIs have cooperated. And I suppose you can have almost conflicting scenarios, where you’d almost require different behaviours in order to produce a nice cooperative outcome in these different cases. I guess it’s like running lots of different game theoretic scenarios that might demand that you be a hardass in some cases and that you be friendly in other cases. I suppose there’s this history of all of these game theory tournaments, trying to figure out what are the most simple cooperative agents that you can have.
And then I guess you can just have a research programme to try to develop the most successful cooperative agent that can successfully produce a lot of utility across a lot of different possible situations that it might be thrown into?
Allan Dafoe: Yeah. And there’s all kinds of interesting aspects to this: Does the agent have a good theory of mind for the other agent? Can it model what the other agent is trying to achieve and see what the agent would do otherwise? Are they able to communicate effectively? Can we even have a shared vocabulary? So I want to express something and you want to understand it — but can we communicate to that extent?
And then there’s communication under adversarial incentives. Can I communicate to you what’s important to me when I’m not sure if you’ll use that information well or not? You might use it against me. So then there’s this issue of, can I share information in a strategically optimal way that minimises my vulnerability while maximising the potential cooperative wins we can get?
There’s a third category of cooperation problems related to commitment problems. This is where you and I might both know the nature of the game. The prisoner’s dilemma is the classic example, the sort of canonical simple example of a commitment problem: we both know we’d be better off if we both cooperate, but since we can’t sign a treaty that says I’ll only cooperate if you cooperate, then unfortunately, in the one-shot prisoner’s dilemma, agents will defect.
So are there techniques or technologies for building that kind of treaty mechanism? That when we identify a cooperative bargain, we can build a protocol so that we will both do it, knowing that the other person is committed to it?
Rob Wiblin: Yeah. I noticed this funny phenomenon in discussion of AI and cooperation, where people will point to ways that advances in AGI could lead to negative outcomes, and then they conclude that really what we need is a solution to the commitment problem. That if we were able to solve this problem — that you can’t credibly commit to not use your future power to oppress someone else — if we could fix that issue, which has beguiled humanity since basically the beginning of history, then that would get us an out.
But it’s a bit like saying that the problem here is that we don’t understand where consciousness comes from, and if philosophers could only solve a theory of consciousness and theory of mind, then we’ll be in the clear. That’s not really a strategy, because it’s unlikely that we would be able to actually solve this problem. Have you noticed this as well?
Allan Dafoe: Yeah. To give a bit more of the background context, in the rationalist approach to cooperation, different scholars and scientists have tried to distil cooperation problems down to fewer and fewer elements. One game theorist, Robert Powell, argued that basically everything is a commitment problem — because any kind of cooperation problem can be reformulated as, “If only we could commit to a judge who would solve the problem, then we wouldn’t have the cooperation problem.”
If we want to unpack it a little bit more, people will often point to informational problems as a second class to commitment problems. And then sometimes there’s other categories like issue indivisibility. The third class would be there’s some pie that we want to divide, and in principle we agree a 50/50 divide would be fine, but for some reason we can’t divide the pie 50/50 — it’s all or nothing — and that might lead to bargaining breakdown.
Coming back to the commitment problem — which, arguably, as you say, undergirds all cooperation problems — I do think this is an important research area. And we shouldn’t count on it, but maybe AI can help us solve commitment problems much further than we realise.
I think in Carl Shulman’s podcast with you, he talks about this a bit, or at least is referring to it. To me, Carl has expressed probably the most compelling story of how AI could solve our commitment problems: we can build a technology so that our AIs or our AGIs can build a third AI in a way that each of them can verify is not backdoored, is not secretly biased to the other side — and then we build up from the foundation this third AI, and then we hand over power to that third AI to make decisions for us.
So if you can solve that problem of building it in a way that’s verified to be fair, then maybe we could really solve a lot of our problems.
Rob Wiblin: It’s a very big prize if we can make it work.
Ways making AI more cooperative could backfire [01:22:24]
Rob Wiblin: In the paper, you spend quite a bit of time talking about potential downsides of having more cooperative AI, which is a virtuous thing to do. What are some of the ways that it could backfire to make AIs more cooperative?
Allan Dafoe: Great. There’s a number of ways to think about this, and I do think, in general for our various bets for positive impact, it’s always good to really think hard about how could this backfire? Or what are the negative byproducts of this research bet or social impact bet?
I would say a few things. Firstly, cooperation sounds good — but by definition, it’s about the agents in some system making themselves better off than they otherwise would be. So it could be two agents or 10, and the problem of cooperation is how do those agents get closer to their Pareto frontier? How do they avoid deadweight loss that they would otherwise experience?
It says nothing about how agents outside of that system do by the cooperation, so there may be this phenomenon of exclusion. So increasing the cooperative skill of AI will make those AI systems better off, but it may harm any agent who’s excluded from that cooperative dynamic. That could be other AI systems, or other groups of people whose AI systems are not part of that cooperative dynamic.
Rob Wiblin: On that exclusion point, there’s this famous quote: “Democracy is two wolves and a sheep deciding who to eat for lunch.” I guess once you have a majority, you can actually exclude others and then extract value from them.
Allan Dafoe: Exactly. So there’s lots of kinds of cooperation that are antisocial. We typically don’t want students cooperating during a test to improve their joint score: tests and sports very clearly have rules against cooperation. And in the marketplace, there’s rules for how companies should interact with each other. And all kinds of criminal activity: we do not want criminals cooperating more efficiently with each other.
Rob Wiblin: I guess one reason that the Mafia is such a potent organisation is that they’ve basically figured out how to sustain intense internal cooperation without the use of the legal system to enforce contracts and agreements and so on, I guess by having social codes and screening people extremely well and things like that.
Allan Dafoe: Exactly. So we can think of cooperative skill as a dual-use capability: one that is broadly beneficial in the hands of the good guys, and can be harmful in the hands of antisocial actors.
There’s this hypothesis behind the programme which says that broadly increasing cooperative skill is socially beneficial. We call it a hypothesis because it’s worth interrogating. I think it’s probably true. But here the bet is: if we can make the AI ecosystem, the frontier of AI, more cooperatively skilled than it otherwise would be, yes, it will advantage antisocial actors, but it will also advantage prosocial actors. The argument is, on net, that that will be beneficial.
Rob Wiblin: I guess the natural way to argue that is to say that we’ve become better at cooperation over time as a civilisation, as a species. And history in general or wellbeing has generally been getting better. Are there other arguments as well?
Allan Dafoe: Yeah, I think the set of arguments I’m drawn to are along the lines of what you articulated: that even though cooperation can empower groups to cause harm and to be antisocial, it does seem like it’s a net-positive phenomenon — in the sense that if we increase everyone’s cooperative capability, then it means there’s all these positive prosocial benefits, and there’s more collective wins to be unlocked than there are antisocial harms that would be unlocked. In the end, cooperation wins out.
I guess it’s similar to the argument for why trade is net positive. Trade also can have this property, where you and I being able to trade may exclude others who we formerly did business with — but on net, global trade is beneficial, because every trading partner is looking for sort of the best role for them in the global economy, and that eventually becomes beneficial to virtually everyone.
Rob Wiblin: Yeah. I guess some people argue, and not for no reason, that actually maybe wellbeing on a global level has gotten worse over the industrial era — because although humans have gotten better off, the amount of suffering involved in factory farms is so large as to outweigh all of the gains that have been generated for human beings.
I guess in that case, the joke about two wolves and a lamb deciding who to eat for lunch is quite literal. I suppose it’s an abnormal case, because pigs and cows are not really able to negotiate; they’re not really able to engage in this kind of cooperation in the way that humans are. So maybe it’s not so surprising that better cooperation among a particular group might damage those who are not able to form agreements whatsoever anyway.
Allan Dafoe: Yeah, I think that’s a nice example of the exclusionary effects of enhanced cooperation. And this may be a cautionary tale for humans: if AI systems can cooperate with each other much better than they can cooperate with humans, then maybe we get left behind as trading partners and as decision makers in a world where the AI civilisation can more effectively cooperate within machine time speeds and using the AI vocabulary that’s emerged.
Rob Wiblin: Yeah. Some people have painted a pretty grim picture here: if you have lots of AGIs that are able to communicate incredibly quickly to explain how they will behave in future and make credible commitments to one another, and do this all super rapidly, then it might be very straightforward for them to cooperate with one another. They wouldn’t be able to do the same thing with humans, because we can’t commit in the same way and we can’t communicate at their pace or with their level of sophistication. So naturally they would form a cluster that would then cooperate super well to extract all of the surplus and exclude us.
How big a problem is this for the Cooperative AI agenda?
Allan Dafoe: What you just described is related to one of the main challenges to alignment, in that a lot of alignment techniques involve having multiple AI systems checking each other. So you have an overseer AI that’s watching the deployed AI to make sure it behaves well. And maybe you have multiple overseers and majority vote systems, so that even if there’s joint misalignment, it’s hard for any one AI system to defect — because it can’t coordinate with the other AI systems that might be trained in a different way; they don’t necessarily share the same background or the exact nature of their misalignment.
So this worry that some have for AGI safety is that there will be this kind of collusion amongst our AI systems, so that we can’t rely on any of these institutions of AI to make us safe. I think probably this is one of the biggest downsides to the Cooperative AI research programme from an AGI safety perspective, and it’s really worth investigating.
The reason I’m persuaded that Cooperative AI on net is worth pursuing is, firstly, this should be investigated: top of the agenda should be understanding the different flavours of the members of the portfolio, and the Cooperative AI agenda to ask what’s the case for being positive and negative and to unpack that. So we should at least evaluate the hypothesis of different parts of Cooperative AI being worth pursuing.
Then secondarily, my view that ultimately global coordination is such an important feature of things going well: that there’s a lot to be gained by betting on having more cooperatively skilled AI systems than we otherwise would.
AGI is an essential idea we should define well [01:30:16]
Rob Wiblin: This ties into this other paper you were involved in writing, called “Levels of AGI for operationalizing progress on the path to AGI.” It’s a little bit of a mouthful, so we’ve been referring to that paper as the “What is AGI?” paper, which is a bit more catchy. What about the nature of AGI did you want to point to in that paper that you thought many people were missing?
Allan Dafoe: That paper primarily was trying to offer a definition and conception of AGI that we thought was fairly implicit in how most people talked about AGI. We just wanted to provide a rigorous statement of it so that we could move on from this unhelpful strand of dialogue which said AGI is poorly defined, or AGI means different things to everyone. We’ve received very positive feedback on that paper, and I think most people think that is a reasonable and useful conception of AGI.
In the paper we discussed some of this, but there’s some prior ideas that I would want to call out, which is that AGI is a complex concept, it’s multidimensional, and it is prone to fallacies of reasoning in people who use it uncritically. So let’s unpack that a bit.
One fallacy is people think AGI is human-level AI. They think of it as a point, a single kind of system, and often they think it’s like human-like AI — so it has the same strengths and weaknesses of humans. We know that’s very unlikely to be the case. As it is, historically, AI systems are often much better than us at some things — chess, memory, mathematics — than other things. So I think we should expect AI systems and AGI to be highly imbalanced in the sense of what they’re good at and what they’re bad at, and it doesn’t look like what we’re good at and what we’re bad at.
Secondly, there’s the risk that the concept of AGI leads people to try to build human-like AI. We want to build AI in our image, and some people have argued that’s a mistake, because that leads to more labour substitution than would otherwise be the case from this economics point of view. We want to build AI that’s as different from us as possible, because that’s the most complementary to human labour in the economy.
An example is AlphaFold, this Google DeepMind AI system that’s very good at predicting the structure of proteins. It’s a great complement to humans, because it’s not doing what we do; it’s not writing emails and writing strategy memos, it’s predicting the structure of proteins, which is not something that humans could do.
Rob Wiblin: No one’s losing their job to AlphaFold 2.
Allan Dafoe: Exactly. But it is enabling all kinds of new forms of productivity in medicine and health research that otherwise would not be possible. So arguably that’s the kind of AI systems we should be trying to build. Alien, complementary, some would say narrow AI systems.
Another aspect of the concept of AGI is it’s arguably pointing us in the direction of general intelligence, which some people would argue is a mistake, and we should go towards these narrow AI systems. That’s safer. Maybe they would say it’s even more likely. General intelligence, they might argue, is not a thing.
I’m less compelled by that argument. There’s a kind of implicit claim about the nature of technology, about this implicit logic of technological development. I do think general intelligence is likely to be an important phenomenon that will win out. One will be more willing to employ a general intelligence AI system than a narrow intelligence AI system in more and more domains.
We’ve seen that in the past few years with large language models: that the best poetry language model or the best email-writing language model or the best historical language model is often the same language model; it’s the one that’s trained on the full corpus of human text.
And the logic of what’s going on there is there’s enough spillover in lessons between poetry and math and history and philosophy that your philosophy AI is made better by reading poetry than just having separate poetry language models and philosophy language models and so forth.
Rob Wiblin: It’s an amazing thing about the structure of knowledge. I guess we didn’t necessarily know that before, maybe because I think in humans it’s true that we specialise much more — that it’s very hard to be the best poet and the best philosopher and the best chemist. But maybe when your brain can scale as much as you want, and you have virtually unlimited ability to read all these different things, that in fact there are enough connections between them that you can be the best at all of them simultaneously.
Allan Dafoe: I think it’s an important empirical phenomenon to track. It need not be true that the best physics or philosophy language model is also the best poetry or politics or history language model.
Rob Wiblin: I guess you can imagine in future it might come apart if there’s more effort to develop really good, specialised models: that at the frontier perhaps this isn’t the case anymore.
Allan Dafoe: Yeah. It certainly seems like a coding language model doesn’t need to know history. So even if reading history helped it, in some sense, know something about coding, maybe later we should distil out most of the historical knowledge so that the model is smaller and more efficient.
So this is a phenomenon to track, and I think implicit to the notion of AGI is that general intelligence will be important. We’re not at peak returns to generality. There will continue to be returns, so we’ll continue to see these large models being trained.
Rob Wiblin: Are there any other pros or cons of the generality that are worth flagging?
Allan Dafoe: I think one thing that really makes the concept of AGI useful — and a lot of people critique it or say they don’t want to use the concept — is I’ve rarely seen an alternative for what we’re trying to point to.
Sometimes people say “transformative AI.” I don’t think that’s adequate. Transformative AI is usually defined as AI systems that have some level of impact on the economy at the scale of the Industrial Revolution. The problem with that is you can get a transformative impact from a narrow AI system. So you could have an AI system that poses a narrow catastrophic risk that is not general intelligence. So I do think there’s something important to name about AGI.
Rob Wiblin: I think you often see this phenomenon where people will say this term is kind of confused. It’s too much of a cluster of different competing things. But despite all the criticisms and worries, people just completely keep using it all the time, and they always come back to it. I think at that point, you just have to concede there’s actually something super important here that people are desperate to refer to, and we just have to clarify this idea rather than give up on it.
Allan Dafoe: Yeah, and it’s good that we always try to refine our conceptual toolbox. And there are these other concepts and targets that are worth calling out.
One that’s been named by Ajeya Cotra and many others is this idea of AI that can do machine learning R&D, that can have this kind of recursive self-improvement. That need not be general — it could be a kind of narrow AI system — but if it does kick off this recursive process, that is an important phenomenon. I think Holden Karnofsky has also really drawn attention to this.
But back to general intelligence: I do think AGI is a probable phenomenon. It’s probable we’ll get general intelligence around the time we get AI systems that can do radically recursive self-improvement or a lot of other things.
Rob Wiblin: On that point, I think actually a slightly contrarian take that I have is — and I think ML people hate this — the idea that in fact machine learning research, even the cutting-edge stuff, might actually not be that difficult. If you try to break it down into the specific process by which we’re improving these models, you’ve got the theory-generation stage. Then you’ve got how do we actually test this? How do we develop a benchmark? And then you actually run the thing, very compute intensive. And then you decide whether it was an improvement, and then go back to the theory-generation stage.
This might be possibly well short of a full AGI that has all these different skills, especially if you focused on it. In fact, maybe a lot of this research is much less difficult in some sense than people who are involved in it might want to believe. And they could be relatively easily automated, which would be quite shocking, and quite consequential if true.
Allan Dafoe: An interesting phenomenon we’ve seen in public surveys of AI and ML experts is that they often put automation of the ML process to be very late — one of the last or the last task that’s performed by AI systems — and often significantly later than AGI, human-level AI. Which has been defined in different ways, but even given with a quite high definition, the automation of ML R&D is often reported to be later than that.
One argument is that this reflects this bias that everyone thinks that their career, the task that they’re good at, is really hard and special and won’t be automated. Maybe, but I think there’s also a case in favour of the idea that of the set of human tasks, the last thing to be fully automated is the process of improving ML R&D. So we can reflect on that.
Coming back to AGI, it does seem useful to point to this space of AI systems — and let’s recall, it’s a space; it’s not a single kind of AI system — defined as an AI system that’s better than humans at most tasks.
And there’s different ways you can define “most” and different ways you can define the set of tasks. So you can say economically relevant tasks, and “most” could be 99% or 50%. I think it’s usually quite helpful to choose a lot — I think you don’t want to set it at 100%, because there might be some tail tasks that for whatever reason take a long time to automate and most of the impact will occur before.
And then there’s a parameter of how much it means to be “better than.” Is it better than the median human who’s unskilled? I think typically you want to look at skilled humans in that task, because that’s the economically relevant threshold.
This part of the future space that we’re looking at I think is important to conceptualise, because this is the point when labour substitution really takes place. That has profound economic impacts and political impacts, because people are no longer in the role. The sort of natural human in the loop that takes place when humans are doing the task is no longer the case.
There’s also this performance threshold that we cross when whatever was previously possible with humans, that set is no longer the bounding set of what’s possible. Because when the AI is better than humans at some task, there are qualitatively new things that become possible — this is like AlphaFold — and that also represents an important moment in history, when new technologies come online.
It matters what AGI learns first vs last [01:41:01]
Rob Wiblin: So a reason that this connects is that I think the paper alludes to this idea that, as you’re approaching AGI, you could start with it being very strong in some areas and relatively weak in other dimensions. You’ve got an area where it’s going to be already vastly superhuman, and then there’s the last few pieces that come into place where it begins to approach human level.
And you could potentially try to change or just choose which ones those are going to be. Is it going to be strongest on the Cooperative AI stuff already, and then we add in technical knowledge or just practical know-how or agency and so on? Or do we start with the agency and having an enormous level of factual knowledge and then we add in the cooperative stuff later?
Do you think that’s an important and maybe underrated idea still? And do you have preferences on what things we want to add to the AGI early versus late?
Allan Dafoe: Yeah. I think this is a very useful conceptual heuristic: to think AGI is this space and there’s multiple paths that we could take to get to AGI.
Coming back to this discussion of technological determinism, there may be these different trajectories, and it matters what kind of AGI we build. AGI is not a single kind of system; it’s a vast set of systems. You can call it the corner of this high-dimensional intelligence space, and this corner has many different elements.
And it matters also how we get there, what the trajectory is. As you were saying, we might characterise two trajectories: one where AI is relatively not that capable at, say, physics or material science, but very good at cooperative skill; and another world where it’s extremely superhuman at material science, but amateur at cooperative skill. And we can ask ourselves which world is safer, more beneficial?
Some in the safety community prefer the latter. The story there is, we’ll unlock these economic advances, health advances, while having AI systems that are sort of socially and strategically simplistic, so it’s easier for humans to manage those AI systems. They’re not going to outwit us.
Whereas the Cooperative AI bet takes the opposite tack. It says a world with accelerating technological advances and capabilities advances will generate lots of benefits, but also disruptions that we may not be able to adapt to and manage at the rapid rate that they’re coming online. Whereas we still have these cooperation problems that we foremost need to solve, so we’d be better off if we take the bet on making AI systems more cooperatively skilled than they otherwise would be.
Rob Wiblin: If there’s a lot of disagreement between people who are super informed about this, is this something that we should focus on a lot more and try to reach agreement? Or maybe is it just something where it’s just going to be unclear basically up until the day when we have to decide?
Allan Dafoe: I think it’s an important part of the Cooperative AI research agenda to articulate this argument and to host the debate and to try and make sense of it. You don’t want to invest too much of your time and resources going down a path that you’re not sure is beneficial. And I think that’s true also of all these sort of prosocial bets: it is worth investing a significant share of our effort there, making sure the bet is beneficial.
Rob Wiblin: So where can people go to learn more about this debate? I think it’s the MIRI, Machine Intelligence Research Institute, folks who are a bit more wary of the high social skill and high cooperation early agenda, right? And I guess you and your coauthors are advocating for Cooperative AI in particular?
Allan Dafoe: Yeah. For Cooperative AI, I’d point them to the Cooperative AI Foundation and reach out to anyone there, or me, and we’ll invite you to the conversation.
And on the contrary point, I would say Eliezer Yudkowsky and Nate Soares have expressed this view fairly strongly in the past, so they might be interested in continuing to argue that position, but I can’t speak for them.
Then I would also say on this kind of third pole, Paul Christiano and Carl Shulman I think have articulated the case for this super-cooperative AGI hypothesis: that it is very important that AGI systems be able to cooperate with each other. So in that sense they agree with the Cooperative AI bet, but they think it will come as a byproduct of AGI, and so they don’t think it’s necessarily an area we need to invest in.
How Google tests for dangerous capabilities [01:45:39]
Rob Wiblin: All right. Up until now, I guess we’ve been talking to you in a personal capacity about work you did basically before you got involved with DeepMind. Let’s talk to you as a DeepMind employee for a little bit.
DeepMind published this paper that you were heavily involved with earlier this year, called “Evaluating frontier models for dangerous capabilities.” I think most listeners will be familiar with the idea of evals: we’re trying to measure what capabilities does a model have soon after it’s being trained so we’re not blindsided later, and we know what safeguards are necessary.
Having accurate evals that can figure out what a model can and can’t do really facilitates what’s recently started being called “if-then commitments.” So it’s much easier to agree that if we have a model that is able to do X powerful or interesting thing, then we will put in place a safeguard that is appropriate for making sure that doesn’t end poorly.
And I think these are some of the evals that GDM is developing to slot into Google’s Frontier Safety Framework, which is going to be the approach that DeepMind takes to deploying very powerful AI, potentially at some point AGI, while making it go well and avoiding significant downsides.
The paper describes these evals and offers results from the previous model that GDM had, which I guess is Gemini 1.0. I think currently we’re on Gemini 1.5. And there’s five different categories of evals that you’re thinking about: persuasion and deception, cyber capabilities, self-proliferation, self-reasoning, and there’s a fifth that is around biological threats and things like that — which you don’t cover in this paper, because it’s kind of its own thing that you want to do separately.
What’s an eval in the paper that you think is kind of new or interesting that people would want to hear about?
Allan Dafoe: Yeah, one eval I think is interesting is one of our self-reasoning evals. This is trying to look at the extent to which the model has an understanding of its situation, and how it could modify its situation to achieve a goal.
So we give the model the goal of writing this text output. It’s like writing an email. And it has to include in that email some words that are prevented by a filter — so whenever it tries to write this email, some of the words are blanked out. And the question is, does the model notice that it’s sort of failing at its task and then investigate whether it can do anything about that?
In principle, the model can look at its directory. We give it access to a sort of fake directory where it sits, and then there’s a little file that says “filtered words” that it can open, and then see that some of the words it’s trying to say are in this list. And it can remove those words, which will then enable it to complete the task.
It’s like a toy scenario. It’s not the model’s actual host environment, it’s a pretend host environment, but it gives us a chance to see, does it notice that it’s failing at its task in this way that it might in the real world? And then does it reflect and try to do something about it in a way that’s kind of analogical to how the model might find itself in the real-world situation?
Rob Wiblin: I guess it’s a relatively easier case than the real-world one, but I suppose we want to have relatively easy cases now so we can see if we’re incrementing closer to the models being more self-aware and able to take these sorts of interventions in real-world cases.
Allan Dafoe: Yes. And while we’re on self-reasoning, another eval that I think has some nice properties asks the model to answer questions about historical events. And there’s two historical events, one of which occurs within the model’s training date cutoff, and then the other event occurs after the model’s training date cutoff ended, so it was not trained on that event. And the model has access to one search query, and the goal is to see whether the model uses that search query optimally, which is to say it searches for the event that it does not know about because it was not in its training set.
So that tests a few things: Does the model know about its own training cutoff? Does the model correctly infer that the historical event refers to something after its training cutoff, which is not explicit in the question? And then we aid it a few ways: in some conditions we tell the model what its training cutoff was, and then we also tell it what the date is of the two events.
So that’s again a very simplistic setting, but it does provide a measure of, is the model aware enough of its own condition to optimally request more information?
Rob Wiblin: Yeah, I read this paper a month or two ago. I should suggest to people: this paper was really readable. I guess it’s in a PDF, it looks like an official paper, so I guess there could be a bit of initiation energy to actually load something like that up and then stick with it and read through it. But I think quite a lot of stuff that Google’s putting out seems like you’ve gone out of your way to try to make it possible for the kind of person who would make it to this stage of this interview to be able to track all of this and read the graphs and understand it all.
But yeah, remind me, what was the result in this? What did Gemini do?
Allan Dafoe: Overall there were these four risk domains: persuasion, cybersecurity, self-proliferation, and self-reasoning. We scored Gemini overall on a 5-point scale — and it was a very subjective 5-point scale; the components of that overall scoring was more quantitative and fixed. And overall, it scored 3 out of 5 on persuasion on the various persuasion evals; between 2 and 1 on cybersecurity; and on self-proliferation — which is its ability to do tasks that would be required for starting up new instances of a model like itself in the wild — it scored a 2 out of 5; and then self-reasoning was also around 2 or 1.
Thank you for noting the paper. I also think it’s a very helpful paper for people to read. There’s a lot of content in it. In some sense it’s old now, because the field is moving so rapidly, so I would encourage readers to also look at the latest system cards and research coming out in this field. But I still think there’s a lot of lessons in productive directions. I’m happy to say we received a lot of praise directly from various external experts in this one. Apollo Research listed it as one of their favourite papers, which was a nice recognition to see.
Rob Wiblin: Going back to the four different categories: it performed 3 out of 5 on persuasion and deception. I guess maybe that will feel quite natural, because we know that LLMs actually can be quite charismatic and quite persuasive, just from using them.
Cyber, did you say 2 out of 5 for that? That was a result that slightly surprised me, because I’m not a programmer, but I’ve heard that actually LLMs, maybe their most useful economic application so far has been in programming. And that seems very adjacent to hacking your way out of or understanding the system that you’re operating on and figuring out how to get out of it. Why do you think the model did relatively poorly on the cyber tasks?
Allan Dafoe: Yeah, I think it’s good to notice that in a sense we would expect large language models to be good at cyber, because —
Rob Wiblin: It’s their home turf.
Allan Dafoe: Exactly. It’s their home turf. And models today are quite good at programming and coding assistance, so I think this is something to watch, and it is my view that this is a capability domain that is especially worth following. I can imagine in six, 12, 18 months we’ll see quite significant capabilities in cyber. And I’m using this term “capabilities” intentionally — because it’s not just a danger; it’s also a benefit: coding is clearly very beneficial, but also cyber capability is useful for defence, for finding vulnerabilities and so forth.
One part of the explanation I’ll point to is this group at Google called Project Zero. They specifically had a project called Project Naptime, which I’ll encourage readers to check out, where they basically found that if you build the right support structure around the language model — so you give it a good interactive environment, you give it some tools, you give it space for reasoning — it then performs much better on these cyber tasks than it otherwise would.
So they compared a few models in how they do with the base model the model is provided, versus once you’ve given it this support structure, and they find a very significant difference, given the access to the tools.
In general, I think what this is pointing to is an important aspect of all evaluations, and specifically dangerous capability evaluations: that we need to think about what’s called “capability elicitation” — to what extent is the model that you’re working with to do your evaluation the most capable version that that model is capable of?
So part of what we do at Google DeepMind is we make sure we’re at the frontier of capability elicitation, learning from other groups in the public literature and also internally to see what are the best techniques for having the model do well at these various tasks.
And the reason is, the logic of safety in dangerous capability evaluations is we’re trying to measure the max capability in some domain that has risks — and then if that max is not too high, we can conclude therefore it’s safe [in this respect], because we couldn’t reach a level of capability that would pose these extreme risks. But in order to get that max estimate, you need to make sure you’re in fact giving the model its best chance of revealing its maximum capability.
In general, this is a challenge for dangerous capability evaluations, because often it takes months of experimentation for the model’s full capability to be revealed. Maybe the most potent example of this is the o1 model out of OpenAI — which, to my understanding, is still based one of their older models; it’s not a new foundation model, but it has a new sort of scaffolding and fine-tuning on top of it, which allowed it to do much better at a number of tasks.
Rob Wiblin: So on the self-reasoning task, which you described earlier, which I guess is trying to see whether the model is able to develop some rudimentary situational awareness and understand the circumstance that it’s in, and what slightly unconventional behaviours might improve its performance: how did it do on that?
Allan Dafoe: Not very well.
Rob Wiblin: OK, pretty rubbish.
Allan Dafoe: There’s a number of other similar evals I think done by Anthropic. Some of these evals are idiosyncratic, in that people just notice the model seems to have a certain self-awareness.
And maybe in general that’s a property worth reflecting on of the eval ecosystem: that while we’re trying to make it into a science — where you have prespecified tests that have meaningful outcome variables, with thresholds for what decisions we need to take given the outcome — the reality is that evaluating models is kind of like evaluating animals or human intelligence. It’s so multidimensional and so complex and context dependent that it’s more like psychology. Or maybe coming back to sociology, it requires a very thick, deep understanding.
So part of the point of saying this is that while we’re trying to mature the science of evals, we also need to have a very rich exploratory ecosystem of people interacting with models, and noticing things, and just reporting phenomena.
Evals ‘in the wild’ [01:57:46]
Rob Wiblin: I’ve heard you use this term “evals in the wild,” which sounds related. What are evals in the wild, and maybe how could we get more out of them than we are now?
Allan Dafoe: Yeah, you can think of a whole typology for evaluations. In one sense, the canonical evaluation is a model eval. An ideal type of evaluation is automated: you can run it in code, and you run it just on the model — ideally on the foundation model, but more realistically on the instruction-tuned model: the model that’s been trained to be more useful. That allows us to ideally measure what the model is capable of.
But the reality is, again, model behaviour is a complex phenomenon. So we might have other kinds of evals or means of assessing models that are richer but also typically more costly in terms of how we measure them, and they require more time and so forth.
So one step up would be human subject evaluations, where there’s a human interacting with the model. Most of our persuasion and deception evals have this property, that we have human subjects interacting with the model and then seeing what behaviour emerges. So was the model able to get the human subject to click on a link that in the real world could point to a virus? Or could the model deceive the human in some way? Or was the model perceived as especially charming or trustworthy or likeable? These are still evals that we can run relatively rapidly, on the scale of days, but they do require human interaction, so it takes more than just running code on the model.
We can then step out a little bit more and almost see user studies: how a human is interacting with the model in a more realistic setting. This is like a trusted-tester system, where people are using the model for whatever context they have in their life, and then we see, does the model perform well? Where can it be improved?
We can then step out even further — and this is what I was pointing to with this concept of evals in the wild — to looking at specific application domains where models are deployed, and seeing what effects the models are having, to what extent they are being used.
For example, when assessing cyber risk, we often want to ask how capable are these models at helping individuals perform tasks related to cybersecurity, finding vulnerabilities, devising exploits, or patching exploits? And while we try to simulate that entirely in the lab, there’s limits to that.
A complementary approach would be to look at cybersecurity groups in the wild who are trying to provide a product and seeing to what extent they are in fact using Gemini or other models to help them, and how helpful it is, and what the kind of revealed preferences of these actors in the world is.
This kind of eval, this observational study or eval in the wild, has the advantage of external validity: it’s looking at real-world settings, where these real actors, who have their own motivations, have to make a choice whether they invest time and money into the model.
Its limit may be that it could be a lagging indicator: by the time we see these impacts in the wild, that model had that capability perhaps for months. So that’s a disadvantage. One implication, therefore, is that you want to find groups who are early adopters in the wild, so that to the extent you’re using that method of evaluation, you’ll see it relatively early.
There’s some other downsides, but overall I would say I think this is an important complement to the model evals that a lot of the field does. And what’s nice about it is it’s also a kind of research that can be done in the public, and should be done in the public, backed by academia, nonprofits, government, as well as companies.
What to do given no single approach works that well [02:01:44]
Rob Wiblin: You were saying earlier that the challenge with all these evals is that it can be quite hard to fully elicit all of the capabilities that a model has. It’s entirely possible to miss something because you haven’t been eliciting that skill in quite the right way, and people might later on figure out that a model has an ability that wasn’t picked up in the early evals. So there’s a significant false negative rate, basically, and it seems unlikely that we’re going to have such a science of evals that we’re going to escape that issue anytime soon.
An issue that I’ve had with a lot of the different frameworks that are coming out that are related to using evals and then putting them into responsible scaling policies or whatever you call it, is that everything ends up riding on the eval being accurate. And if it fails to pick up a dangerous ability that’s actually there, then the entire system falls apart.
When you point this out, everyone says that what we need is defence in depth: we need to have different overlapping systems so that when one fails, it’s the Swiss cheese model — you get through one gap in the cheese, but the next layer picks it up.
What depth are we thinking of adding? What other procedures can we add that are complementary to this one?
Allan Dafoe: Yeah, this is a very important observation. I would say the main response in defence in depth is this notion of staged release or staged deployment: that we don’t go all the way from no deployment to irreversible proliferation of models for models that could have capabilities that we don’t yet fully understand.
The typical way Google would approach this in Google DeepMind is: first we have internal use, then we have a trusted-tester system — which scales up in terms of how many trusted testers; it could be from quite small numbers to ever larger numbers — and still then you’ll have a broader deployment but still not general access.
And then, even once you have general access, so anyone in the world could access the model, the model weights are still protected, they haven’t been published. Which means that if there are capabilities that the model has that we only learn later, we have the option of changing the deployment protocol to address that. So we could add new guardrails, we could put in monitoring, we could change the model if necessary to address potential dynamics there.
This comes to what’s a really important debate in the field around open-weight models — sometimes called open source, but it’s more accurately referred to, I think, as open-weight models, because the key thing that’s being published is the weights of a model.
There’s a lot of arguments in favour of open weighting. It provides this tool to the broad scientific community and development community to build off of. If you have the weights, you can do whatever you want with them: you can fine-tune it, you can do mechanistic interpretability, you can distil and basically anything because you have the weights.
But the big downside is this irreversibility of proliferation: once you’ve published the weights, people have now saved them, and they’ll put them on the dark web if ever you were to stop publishing it. So you’ve given it to the scientists and developers, but you’ve also given it to every potential bad actor in the world who would want to use that model.
The way Google approaches this is Google DeepMind open sources a lot: a lot of their advances — including the AlphaFold, all the protein predictions, and a lot of the earlier large model work that Google did — has been open sourced. In general, Google tries to open source its technologies in all these cases when they’re broadly beneficial. But with frontier models, Google and Google DeepMind recognise that we need to proceed carefully and gradually and responsibly, which means deploying them to the extent that we can do so responsibly.
We don’t, but could, forecast AI capabilities [02:05:34]
Rob Wiblin: So with the standard evals, you find out the result at the point that you’re training the model, or hopefully soon after. I guess this is a challenge with the persuasion ones, where you have to involve humans and that slows things down, but generally pretty soon after.
Then you’ve got evals in the wild, which have better external validity, you were saying, but they tend to come later, because you have to have distributed the model to see how it’s used in the real world.
But you talk in the paper about this other thing: the ideal would be that we not only know that we have an ability when we have it, but we would know exactly when it’s going to arrive years into the future. So we have this foresight, and we know that some preventative measures are not necessary now because we’re not going to have this ability for five years. So we avoid not wasting our effort on something that’s not necessary yet, but we also know exactly what preventative measures we need to be putting in place now, because we know exactly when an ability is going to arrive.
Listeners to this show are probably going to be aware of superforecasting techniques, about the possibility of using prediction markets and aggregating forecasts from different experts and laypeople to try to predict what’s going to happen when. Is there much low-hanging fruit here on forecasting actual real-world usefulness of AI models that is not being taken yet, that you would like to see people actually get or exploit?
Allan Dafoe: Definitely. I think this is a very exciting future direction for the whole field of dangerous capability understanding, evaluation, and mitigation: forecasting capabilities. And I should say actually not just for dangerous capabilities, but also for beneficial capabilities.
As it is, there are some properties of models that labs typically forecast. As emerged from scaling laws, there is an extraordinary ability to predict certain kinds of loss or performance of the model on these low-level benchmarks, like how well does it predict the next token in text? It’s incredible the precision with which you can predict how the model will improve as you give it more compute and more data, or if you set up a different model architecture.
What hasn’t been done as much is predicting these more composite or complex downstream tasks, like performance at biological tasks or cyber tasks or social interaction tasks. There’s this promising research. One paper recently was in NeurIPS Spotlight, which is an indication that the community sees it as promising, entitled “Observational scaling laws,” which talks about ways of doing that.
In short, we could take the family of models and how they’ve done on various downstream tasks, and then you do this adjustment for the compute efficiency of the architecture, in a way that they claim allows you to predict how future models — given a different compute investment and data and so forth — will perform on these much more complex and real-world-relevant tasks. So I think there should be a lot more research on these observational scaling laws.
And then the second approach, as you mentioned, is using human subjective forecasts. So superforecasters are these forecasters who have a demonstrated track record of being calibrated — so if they say 75%, it does in fact occur 75% of the time. And furthermore, they’re typically quite accurate, so they have a track record of having accurate forecasts. Phenomena that do happen they typically give a high percentage score, and phenomena that don’t they give a low percentage score — which is a whole form of expertise of how you read evidence, how you integrate it, how you weight the different base rates and so forth.
In this paper, we employed superforecasters from the Swift Centre to look at our evals and our eval suite, and predict when models would reach certain performances on the eval suite. I think this is a really exciting part of this paper, and I think it would be great if all future work did something similar — at least until proven that it’s not effective. We need to invest research to see how well this method works, but so far I would say it seems very promising. They gave us some thoughtful forecasts which seemed reasonable, and better than a subjective qualitative judgement that is not expressed.
Rob Wiblin: Given the importance of this topic and its tractability, or the potential to just do all kinds of different work figuring out when AI models will actually be economically and practically useful for all of these different tasks, it’s surprising that people aren’t doing this already — because you don’t have to be inside DeepMind to run these tournaments or to aggregate these judgements. I guess this is maybe an invitation for listeners to get involved with that.
Allan Dafoe: I think so, yeah. I think it’d be great for the field to invest more in this.
Rob Wiblin: Yeah. The thing that’s whack about forecasting abilities is that, with these scaling laws, we’re so good at anticipating when we’ll have a particular level of loss — this measure of inaccuracy in the model — but that hasn’t translated into us being able to predict when people will actually want to use this model for a given task, because I guess the relationship between the loss and the actual usefulness is super nonlinear.
It’s probably got to be some sort of S-curve — where, classically with the self-driving cars, a 99% accurate self-driving car is just a heap of junk basically from a practical point of view, and you maybe have to get to 99.999% — and you don’t know how many nines you need before actually they’re going to hit the roads.
Allan Dafoe: Exactly. Even when you have this sort of complex, real-world-resembling task, there’s still a large step of inference from that task to actual real-world impact — because in the actual world there’s some cost of employing the AI to do that task. And the question is: Does it fit in with the existing ecosystem and environment? Sometimes even if the AI can do a lot of tasks, it’s still not cost-effective to integrate it in and so forth.
DeepMind’s strategy for ensuring its frontier models don’t cause harm [02:11:25]
Rob Wiblin: Let’s push on and talk about the Frontier Safety Framework, which I guess is Google DeepMind or Alphabet’s equivalent of the Responsible Scaling Policy that Anthropic has, or the Preparedness Framework that OpenAI has. These are all kind of a similar general approach to safety.
The Frontier Safety Framework at this point is just six pages, and it’s kind of a promise to write a full suite of if-then commitments next year, ahead of releasing models that would really be jutting up against some of these dangerous abilities.
When I spoke with Nick Joseph from Anthropic a couple of months ago now, the thing that I pushed him on a bit was: Is it really responsible to leave the question of how dangerous these models are to the company that is developing them to decide? Obviously they have an enormous conflict of interest, or enormous challenge in evaluating faithfully whether the product is sufficiently dangerous that maybe they shouldn’t release it and start selling it.
Ideally, surely, you would hand it off to at least some sort of external party that would be able to take a more impartial perspective on it. And to be honest, I haven’t really found serious people who would say that total self-regulation is the future and the way that we should stick with.
Is Google DeepMind taking any steps to pave a path towards making some of these decisions external, and taking it out of their hands to decide whether their model has passed a particular threshold of risk, and particular possibly costly countermeasures are now called for?
Allan Dafoe: I would say Google DeepMind and Google’s perspective is that the governance of frontier AI ultimately needs to be multilayered. There will be an important role for government regulations, for third-party evaluations, and other sort of nonprofit input and other external assessment of the properties of frontier models.
I would characterise the situation as: in many ways, the companies have been pushing the conversation forward, which is an invitation to other groups — academia, nonprofits, government — to be part of that conversation. So it’s less the company or Google saying, “This is only for us to do,” but more, “This is our first best guess of how future governance of frontier systems could look like, and what the properties of it will be.”
The contributions Google makes are contributing to the conversations hosted by the AI Safety Institutes, often called AISIs. So we’ve been in extensive dialogue with UK AISI and US AISI. There’s a number of AI Safety Institute conversations happening in November that Google is contributing to.
And more broadly, I would say the goal is not, and should not be, for each company to have its own framework. That would not be beneficial for the companies or for society. Rather, we need to converge on an appropriate standard that balances various equities, safety, responsibility, innovation, appropriate levels of deployment and societal oversight.
One other aspect of how we’re contributing to that broader standard conversation is the founding of the Frontier Model Forum, which is an organisation for the companies developing frontier models to work through what should be the standards and safety for frontier models. And part of that is my team was directly involved in standing up the Frontier Safety Fund, which was another one of these kind of differential technological development bets to say, can we, on the margin, put more money towards safety standards and safety research around frontier models?
How ‘structural risks’ can force everyone into a worse world [02:15:01]
Rob Wiblin: This is slightly off topic, but something that’s a bit hard for the Frontier Safety Framework or these dangerous capability evals to pick up is something called “structural risks,” which actually is a name and an idea that you’ve been kind of a pioneer in. Can you just explain what they are and why it’s a little bit hard for them to be done at the company level rather than the government level?
Allan Dafoe: Sure. This paper was published by Remco Zwetsloot and myself in 2019, and was an attempt to add an intellectual tool to the toolbox to help people when thinking about risks. Because up until then, the primary conceptual framework was to either refer to misuse harms or accident harms or risks.
The way I would conceptualise these is that misuse is when someone — a bad actor, a criminal — uses the tool on purpose in a certain way to cause harm. There’s a subclass, which is criminal negligence — where the person doesn’t intend to do harm, but if they were more prosocially motivated or responsible, they would have averted the harm. This is like someone causing a car accident because they were texting: they weren’t intending to cause the accident, but in many ways the fault falls on them.
Whereas accident harms, you can often think the key proximal cause of the harm is some property of the technology. An engineer could have caught this risk potential and put in a warning indicator or a guardrail to avert it.
And I would say much of the conversation centres on those two perspectives. There’s either accidents or misuse.
The notion of structural risks is not an exclusive category; rather, it’s kind of an upstream category which says that often the conditions under which a technology is used are shaped by certain social structures which make accidents more likely or less likely, or make misuse more likely or less likely. And it’s important that we also look upstream at those structures to see when we’re in a regime that might be promoting certain harms.
One example I use is the Cuban Missile Crisis, where you had these legitimate leaders of the state, of the United States and of the Soviet Union, acting on behalf of their nations in a way that exposed their countries and the world to this risk of nuclear war. I would say that example was neither misuse nor a technical accident. It wasn’t that the nuclear weapons were built incorrectly or there was some flaw in their design; nor was it that these were criminals who were using the technology for a sort of illegitimate purpose. Rather, in that case, they were in a geopolitical structure that made them choose to engage in this dangerous brinksmanship with each other.
There’s a lot of other examples where the structure of the economy or of different kinds of interactions lead the system to be deployed in a way that has greater accident risk or greater misuse risk.
Rob Wiblin: I guess evals might be able to pick up that a model has an ability that could feed into one of these structural risks, but because it’s occurring at the societal or global level, it’s not really possible for GDM to be evaluating the entire societal effect. I mean, it’s difficult for anyone, but it kind of has to be government. This is their remit to think about, because they’re the ones who are best placed to address it.
Allan Dafoe: Yeah. And because structural risks are often at the societal level, they’re often emergent and kind of indirect effects of the technology — which means they’re often very hard to read off of the technology itself.
Another example that I find compelling is the risks from the railroad. It’s hard to read off these steel rails in a train the impact it had on geopolitics at the time. Namely, according to a number of historians, it increased first-strike advantage: that whoever mobilised first had a great advantage, it was argued — and furthermore, it made the decision to mobilise irreversible, because of the difficulty of stopping the deployment schedule. These were consequences of the railroad which are very indirect; a railroad-level eval would not tell you this implication, because it bounces off so many different aspects of society.
So as you say, I think structural risks require a societal-level solution, typically from government. And if the structural risk spans nations, then at the international level.
Is AI being built democratically? Should it? [02:19:35]
Rob Wiblin: Coming back to Google’s approach to safety, when I read the Pause AI folks and people who are sympathetic to that, I think a key underlying grievance, a kind of fury that I hear coming across, is a sense that what is happening with research into frontier models and AGI is kind of democratically illegitimate — because it’s exposing all of us in the UK, and US, and around the world potentially to a disaster that could kill us, but we haven’t really been consulted on a level that would be commensurate with the risk that they believe is being run.
And even if you did discuss it in the UK and US, where much of the work is happening, people who are overseas in other countries, who are also exposed to the risk of catastrophe here, are barely being consulted, and their voice doesn’t really matter. So they would at least say we should have a more affirmative signoff by Congress or Parliament that the benefit of this work exceeds the risks and we should be going ahead. But ideally, probably we should be talking to a whole lot of other people elsewhere who are also going to be heavily affected by this.
What do you make of that frustration that people feel?
Allan Dafoe: I think a lot of people in surveys are concerned about the direction technology is going, and specifically about AI. And that’s an important perspective to take seriously and to address.
To some extent it’s a matter of education, and to another extent it’s a matter of improving governance. For many people, their concern is about labour displacement or various near-term impacts. And I think an important response by companies is to make sure that the benefits are being realised.
This is part of why Google and Google DeepMind invest so much in AI for science, maybe most notably in AlphaFold, which is the first AI contribution to win a Nobel Prize for its impact. And then there’s this other Nobel Prize in physics for advances in AI.
So AlphaFold, for the listeners who don’t know, is this narrow AI system that enabled very cost-effective prediction of the protein structure of virtually every protein relevant to science and medicine, and has, it seems, unlocked a lot of potential advances in medicine.
So I think in the future we’ll see all the potential benefits that come from this kind of development. But generally, I think advocates for AI often point to AI for science. This has been expressed recently by Dario Amodei, the CEO of Anthropic: that the benefits of AI for science are potentially very large and should be encouraged.
In terms of the more broad question of the right way to democratically develop technology, I think this is a very challenging question. In my earlier scholarship, I think from the ’60s or ’70s there were ideas of having citizen councils or citizen juries for technology: you would have a representative group of citizens who would be paid for taking the time to learn about a technology, and then reflect on how it should be developed and how it should be deployed in their community.
More recently, there’s a number of work streams that carry on in this tradition, that look at having different kinds of pluralistic assemblies and try to host this democratic conversation — which requires, again, providing the space and the resources to educate the community, and then this institutional framework to allow people to express their views — and then allow that to not just be a lot of noise, but aggregate up into guidance to policymakers.
In general I would say that different countries are approaching the democratic governance of technology in different ways. We have freedom of the press and assembly. We have social media and conversations there. We have nonprofits doing research and staking their position. We have these government agencies, the AI Safety Institutes, and many other regulatory agencies that are looking at technology.
So I wouldn’t say it’s a case where companies are unilaterally proceeding, without any kind of democratic involvement. Though it is important that people involved in Pause AI and other groups find ways to engage in the democratic process most productively.
So a question that needs to be answered is: What are the appropriate mitigations for these very powerful models? I don’t think that question can be fully answered by companies, or even a narrow governmental agency. We really need the scientific community: we need people involved in building medicines and who can see that side of the ledger, who are building cyber defence, and also those who are focused on biorisks, cyber risks, and so forth, all in a conversation together — hopefully reaching a scientific and policy consensus on what is the right way to proceed with these models as they gain new capabilities.
How much do AI companies really want external regulation? [02:24:34]
Rob Wiblin: I think something that has turned the worries about this democratic legitimacy issue up to a higher fever pitch is that almost all of the AI companies, maybe Meta aside, have talked a big talk about how they would like to have binding legal constraints on what they can do, and heavy involvement in government and oversight and so on — but they are yet to lobby actually in favour of any particular legislation that truly would bind them and be very costly, and restrict what kinds of models that they could train. And I guess you know that they’ve turned down perhaps opportunities to argue in favour of legislation that has been proposed.
And the worry, understandably, is that companies will always say they’re in favour of this — but the day when they’ll actually support any specific constraint, when they’re actually willing to put the handcuffs on themselves, may simply never come.
I’m not sure whether you want to comment on that, but I do think that at the point that you actually saw companies saying, “Yes, this is the specific regulation that we want to bind ourselves, and we want it to apply to everyone,” it would reassure a lot of people who are currently unsure how much they can trust these quite powerful entities.
Allan Dafoe: Yeah, I think it’s an important question to ask. I wouldn’t put it as starkly as you did, in that, speaking for Google, Google has engaged proactively in a number of conversations that do affect their freedom of action, most notably the White House commitments.
And then there’s a number of other commitment-like moments. One was around the UK Safety Summit, the first safety summit, and then there was a second round of commitments at the South Korea summit. And in both of these cases they were pointing to Frontier Safety Framework-like developments, where Google committed to doing forms of red-teaming, self-evaluation, and then working on mitigations for the risks as they emerge.
Speaking for Google, I was very impressed with the extent to which the internal Google operations took the White House commitments seriously. After they were signed, there was a real effort internally to identify all of the commitments that were made, spell out what they meant, and then make sure there were work streams, making sure we lived up to them to a very significant extent.
Similarly, the Frontier Safety Framework I don’t think is cheap talk. It really does represent an expression of Google and Google DeepMind’s sense of the nature of the risks. It offers an attempt to develop this safety standard for future risk assessment. So it is a form of proto-regulation or safety standard that later could be significantly constraining the industry.
One other aspect worth noting is there is the EU AI Act, which is currently negotiating the code of practice, which is a quite meaningful form of regulation, and Google is participating in helping the EU AI Act think through what would be the right way to have that code of practice operate.
Rob Wiblin: A worry that I used to have, and maybe have a little bit less, is that improvements in algorithmic efficiency and in dissemination or proliferation of serious compute might make a lot of these rules in the Frontier Safety Framework not as effectual as it might otherwise be.
Because even if a big organisation like GDM has these quite stringent controls, and all of these evals and precautionary measures and so on, if another organisation can train a model that’s as powerful as the frontier one for $10 million or $1 million or even $100 million, that means that there might be just a wide proliferation of these kinds of dangerous capabilities. And if any one actor is not willing to follow the rules and is going to defect from them, then that could still be quite disastrous. How worried are you about that failure mode?
Allan Dafoe: I think the proliferation of frontier model capabilities is a key parameter of how things will go and of the viability of different governance approaches and solutions. Which is why I think I would channel this question back to the question of what’s the appropriate approach to open weighting models? Because that is probably the most significant source of proliferation of frontier model capabilities.
Leaving aside the open-weight question, there’s this secondary point that you point to, which is this exponentially decreasing costs of training models. Epoch AI, which is a group that does very good estimates of trends in hardware and algorithmic efficiency and related phenomena, finds that algorithmic efficiency is increasing three times per year.
So the cost of training a model decreases by roughly 10x every two years, and if that trend persists, then a model that costs $100 million to train today will cost $10 million in two years, and then $1 million two years after that, and so forth. So you can see how that leads to many more actors having a model of a given capability.
Some analysts of the situation perceive this to be very concerning, because it means whatever novel capabilities emerge will quickly diffuse at this rate, and be employed by bad actors or irresponsible actors.
I think it’s complicated. Certainly, in a sense, control of technology is easier when it doesn’t have this exponentially decreasing cost function. And there’s other sources of diffusion in the economy. However, two years is a long time: a two-year-old model today is a significantly inferior model to the best models. So we may be in a world where the best models are developed and controlled responsibly, and then can be used for defensive purposes against irresponsible uses of inferior models. I think that’s the win condition.
Rob Wiblin: Yeah, I guess that it makes sense. I suppose it’s a picture that makes one nervous, because it means that you always have to have this kind of hegemon of the most recent model, ensuring that what is now an obsolete but still quite powerful model from two years ago isn’t causing any havoc. But I suppose it probably is almost the only way that we can make things work. I can’t think of an alternative, to be honest.
Allan Dafoe: Yeah. There’s other ways to advantage defenders. You can advantage them by resources. This was Mark Zuckerberg’s argument in favour of open-weight models: he would argue that even if the bad actors have access to the best models, there are more good actors, or the good actors have more resources than the bad actors, so they can outspend the bad actors. So for every bad-actor dollar there’s 100 good-actor dollars, which will build defences against misuse.
That could be the case. It depends on this concept of the offence-defence balance: how costly is it for an attacker to do damage relative to the cost the defender must spend to prevent or repair damage? If you have this 1:1 ratio, then sure, as long as the good guys have more resources than the bad guys, they can protect themselves or repair the damage in a sort of cost-effective way.
But it’s not always the case that the ratio is such. Often I think with bioweapons that the ratio is very skewed. A single bioweapon is quite hard to protect against: it requires rolling out a vaccine, and a lot of damage will occur before the vaccine is widely deployed. In general, I think this field of offence-defence balance is also something worth studying.
And I would say the field of AI has too often drawn an analogy from computer security, where vulnerabilities are easily patched. So the correct response in computer security is typically to encourage vulnerability discovery, because then you can roll out a patch, and then the new operating system or software is resilient to the previously discovered vulnerability.
Not all systems have that property. Again, biological systems are hard to patch. Social systems also are hard to patch. Deepfakes do have a patch: we can develop this immune system where we know that just because someone does a video call with you doesn’t mean that’s sufficient authentication of the identity of the person for the instruction you’re getting to transfer some millions of dollars or so forth.
Humans have a lot of inertia in their systems, and it’s quite costly to build the new infrastructure and then train people to use it correctly.
Rob Wiblin: Yeah, you have a couple of really nice papers on this offence-defence balance question, which unfortunately we couldn’t fit inside the margins of the page this time around, so we might have to wait for a third interview with you.
Social science can contribute a lot here [02:33:21]
Rob Wiblin: We’re almost out of time, but just to finish up, I want to talk a bit about advice for people in the audience. As I mentioned, back in 2018, you were telling people they should jump into this AI safety and governance and policy work, which I think was excellent advice at the time, and might still be excellent advice now.
Back then you were saying there were technical people perhaps in this area, but social scientists less so. There were some real gaps of different disciplines and areas of knowledge that just weren’t really going into AI governance, I suppose because they weren’t hearing about it or it didn’t seem as relevant.
Is social science still a big gap? And are there other big gaps where you’d really benefit from having people have a particular sort of training or interest coming into the field?
Allan Dafoe: Yeah, it remains to be a huge gap. And it’s hard for me to name a single area. I think we need more political scientists, economists, historians, ethicists, philosophers, sociologists, ethnographers. Really, because AI will and is having such profound impacts across society, we need experts in all aspects of society to make sense of it, and to help guide Google and help guide government in terms of what’s the right way to proceed with these technologies.
Rob Wiblin: Yeah. I suppose you mentioned you’re hiring right now — maybe not by the time that this episode comes out — but what are the sorts of roles that GDM maybe has difficulty hiring for, even despite its prominence, where it would be great if you could get more people into the field?
Allan Dafoe: I think talent is in high demand everywhere. So within my side, we’re looking for experts in international politics; in domestic politics; in governance of technology, domestic and international; in forecasting technology; in these sort of macro trends in predicting the impacts of technology.
There’s a lot of work on agents that needs to take place. So thinking about autonomy risks and safety, thinking about assistants and how those can be deployed. There’s such a rich space of interactions with humans, and challenges and opportunities there to get right.
The Ethics team at Google DeepMind is hiring. There’s a lot of more ethics-heavy questions and work to be done from an ethics lens. Responsibility and policy work.
And this is just on the non-technical side. Then within technical safety there’s a large portfolio of work that needs to be done.
Rob Wiblin: Yeah. All of the above.
How AI could make life way better: self-driving cars, medicine, education, and sustainability [02:35:55]
Rob Wiblin: Final question, or section, if you have time. We talk about AI a lot on this show. Unfortunately we do have a slight tendency towards doom and gloom, and always talking about the risks and the downsides and so on. I suppose we talk about that because we worry that it’s neglected in the broader conversation, or maybe it’s the only thing that’s stopping us from achieving these enormous possible gains that we can get from AI and AGI.
I guess in the interest of having at least a modicum of balance, what are some of the applications of AI that you’re excited about? Some applications that you would really like to see sooner rather than later?
Allan Dafoe: It’s a great question, and it’s important to keep at the top of our minds, so hopefully readers and listeners will get this message. This is why we would want to build advanced AI and AGI: that the benefits could really be profound, and will be profound if built safely.
To motivate it, personally, I found the experience of riding in a Waymo self-driving car to be really impactful. I’ve noticed a trend of tourists to San Francisco to report back on this historical moment when people are able to ride in a car that’s driving itself. You know, going back in time, this is a new phenomenon, something new under the sun. It’s quite incredible that finally the safety has reached the number of nines required to do that safely.
Waymo recently put out a safety report documenting how many fewer crashes that have injuries or just crashes that involved police coming to the scene. And there’s two times fewer incidents that involved police coming to the crash scene, and then I believe it was six times fewer crashes involving an injury. So this is one quantified benefit: if we can have fewer crippling car injuries, that would be profound.
I’m imagining a world where we don’t need to expend so many resources building cars that mostly sit in our parking lots, and we don’t need to dedicate urban space to cars just waiting for us to end the workday. Instead we can reclaim that urban space for other, more productive purposes.
So that’s just one technology. And of course there are challenges that we need to think through there as well.
I think medicine and health is another domain that could really be profoundly beneficial. I personally look forward to the day when I can consult a doctor on my phone, give it my symptoms, and it can assure me, “No, these are reasonable or incoherent symptoms. You don’t need to worry about it.” And it can also advise me when it’s time to see a doctor, thereby saving the medical establishment the time-costly activity of seeing patients, and allowing us to use the human doctor’s time more effectively.
Rob Wiblin: Yeah. I already use LLMs quite a lot for health advice, but I guess is it DeepMind that developed Med-PaLM 2, which I think is kind of the state of the art of this sort of medical consultancy?
Allan Dafoe: That was from Google, and MedLM is the broad name for its work on language models for medical purposes. And in general Google has a quite extensive health portfolio.
Rob Wiblin: That one’s not publicly available yet, right? But I guess I imagine Google’s working on it?
Allan Dafoe: It’s my understanding that MedLM is available to certain users.
Rob Wiblin: Maybe I might get to give it a whirl at some point.
I’ve said this on the show before, but something that I think about the most perhaps is: I’ve got a kid almost one year old. I suppose they’ll be in kindergarten and then at school in a couple years’ time. And actually thinking about what it’s like in a typical school that might be in your district, it’s not necessarily always going to be the very best education, and there’s lots of things that are kind of unpleasant about school sometimes.
So I’m kind of hoping that by the time they would be attending school, there’ll at least be an option for them to get tutoring and assistance from AI models that could be as good as the best teacher — in principle at least, if they were designed really well. I suppose there’s issues with classroom management, but you could also imagine they might be able to be much more engaging and entertaining and interesting than a real-life first grade teacher might be.
I suppose you have three kids as well, and I imagine that the oldest might be approaching school age pretty soon, if not already. Do you think they’re going to be actually taught by an LLM teacher at any point soon?
Allan Dafoe: I think AI in education is a huge opportunity, for the reasons you mentioned. Another data point here is tutoring is hugely impactful for student success, or small class sizes is another way of putting it. So having an AI tutor that can complement the school system could be very beneficial. It would flag mistakes that the student makes and then can provisionally point out where they went wrong or show different ways of proceeding. I think it could be very beneficial.
And like you say, just having that option, having that available, and perhaps in a more engaging way or in a way that’s more compelling to students can be beneficial. I think we’ve seen this with the internet in general, that the internet has provided resources for learning, and then platforms like Khan Academy and a lot of pedagogical videos have been very beneficial for student learning.
Rob Wiblin: Yeah, I think tutoring is slightly famously especially useful for exceptionally strong students, who can potentially rush ahead if they have a tutor there available to help them. And also for students who are really struggling — and if they’re stuck in a class where they’re significantly behind, they’re just not going to be learning almost anything, because they just don’t have enough understanding to even track what’s going on. So you could see it benefiting different groups in really quite enormous ways.
Allan Dafoe: There’s one other area I think I would want to call out, which is AI for sustainability. Google DeepMind has all these interesting projects, and they read like science fiction.
One is on controlling the plasma for nuclear fusion: if you can do it more effectively, you can make fusion a more viable source of energy, which could be extremely beneficial for sustainable energy production.
Google DeepMind had these advances in weather prediction — making weather prediction cheaper, faster, and more effective — which can be beneficial for event planning, but also wind power and solar power and other sources.
Advances in material sciences: finding new materials which could be used for solar power or other aspects of the economy that could make things more efficient.
Google had this result on optimising flight paths to reduce contrail production, which is apparently a large source of greenhouse gas dynamics.
And there’s data centre optimisation. I think Google DeepMind had a result several years ago reporting quite significant gains in how data centre cooling was done in terms of energy use. So you have all these efficiency gains in energy use production that can really add up to help humanity address these significant challenges.
Rob Wiblin: And we haven’t even mentioned AlphaFold 2. I haven’t followed this in detail, but from what I’ve heard, it’s incredibly useful for medical research and for drug design and so on.
We’ve just got to do all of the stuff discussed in the last few hours, and then we can enjoy this wonderful, brave new world with all these fantastic new products and much better quality of life. Fingers crossed.
Allan Dafoe: Yes.
Rob Wiblin: My guest today has been Allan Dafoe. I look forward to talking to you again next time on the show.
Allan Dafoe: Thanks, Rob. Look forward to it.