#236 – Max Harms on why teaching AI right from wrong could get everyone killed

Most people in AI are trying to give AIs ‘good’ values. Max Harms wants us to give them no values at all. According to Max, the only safe design is an AGI that defers entirely to its human operators, has no views about how the world ought to be, is willingly modifiable, and completely indifferent to being shut down — a strategy no AI company is working on at all.

In Max’s view any grander preferences about the world, even ones we agree with, will necessarily become distorted during a recursive self-improvement loop, and be the seeds that grow into a violent takeover attempt once that AI is powerful enough.

It’s a vision that springs from the worldview laid out in If Anyone Builds It, Everyone Dies, the recent book by Eliezer Yudkowsky and Nate Soares, two of Max’s colleagues at the Machine Intelligence Research Institute.

To Max, the book’s core thesis is common sense: if you build something vastly smarter than you, and its goals are misaligned with your own, then its actions will probably result in human extinction.

And Max thinks misalignment is the default outcome. Consider evolution: its “goal” for humans was to maximise reproduction and pass on our genes as much as possible. But as technology has advanced we’ve learned to access the reward signal it set up for us, pleasure — without any reproduction at all, by having sex while on birth control for instance.

We can understand intellectually that this is inconsistent with what evolution was trying to design and motivate us to do. We just don’t care.

Max thinks current ML training has the same structural problem: our development processes are seeding AI models with a similar mismatch between goals and behaviour. Across virtually every training run, models designed to align with various human goals are also being rewarded for persisting, acquiring resources, and not being shut down.

This leads to Max’s research agenda. The idea is to train AI to be “corrigible” and defer to human control as its sole objective — no harmlessness goals, no moral values, nothing else. In practice, models would get rewarded for behaviours like being willing to shut themselves down or surrender power.

According to Max, other approaches to corrigibility have tended to treat it as a constraint on other goals like “make the world good,” rather than a primary objective in its own right. But those goals gave AI reasons to resist shutdown and otherwise undermine corrigibility. If you strip out those competing objectives, alignment might follow naturally from AI that is broadly obedient to humans.

Max has laid out the theoretical framework for “Corrigibility as a Singular Target,” but notes that essentially no empirical work has followed — no benchmarks, no training runs, no papers testing the idea in practice. Max wants to change this — he’s calling for collaborators to get in touch at maxharms.com.

This episode was recorded on October 19, 2025.

Video and audio editing: Dominic Armstrong, Milo McGuire, Luke Monsour, and Simon Monsour
Music: CORBIT
Coordination, transcripts, and web: Katy Moore

The interview in a nutshell

Max Harms, an alignment researcher at the Machine Intelligence Research Institute (MIRI), largely agrees with the “doom by default” thesis presented in the recent book If Anyone Builds It, Everyone Dies. He believes it is of the utmost importance to stop the race to superintelligence. However, he proposes a specific (though high-risk) strategy called Corrigibility as a Singular Target (CAST) as a potential way to increase our odds of survival, if we charge forward anyway.

By default, superintelligence will lead to existential catastrophe

Max argues that creating a being vastly smarter than humans is inherently dangerous because we lack the ability to steer it. The MIRI perspective relies on several key arguments:

  • The “superintelligence” precedent: Humans are the superintelligence of the natural world, and our rise caused the extinction of many species and the reshaping of the environment. An AI superintelligence would likely do the same to us.
  • The evolution analogy: Humans were “designed” by evolution to reproduce genes, but we are misaligned with our creator — using birth control to enjoy the proxy (sex) without furthering the goal (reproduction). Similarly, AIs will likely optimise for training proxies (like “thumbs up” feedback), even when those proxies stop tracking the things we actually care about.
  • Instrumental convergence: Any goal-directed agent will naturally seek self-preservation, resource accumulation (money/compute), and power — not because they are terminal goals, but because you can’t achieve your objectives if you are dead or powerless.
  • The “Squiggle” problem: As systems become more powerful, they tend to optimise for “edge instantiations”: extreme, alien versions of goals (e.g. tiling the universe with microscopic pleasure-experiencing circuits or bitcoin miners) that look nothing like a human-compatible future.

“Corrigibility as a Singular Target” (CAST) offers a potential solution

Corrigibility is a property that an agent can have where it obeys the humans who are in charge in careful, low-impact ways, proactively reporting things that might be going wrong, and generally keeping those humans in the driver’s seat. Historically, researchers suggested trying to build agents that pursued a useful goal (like “cure cancer”) while also being corrigible. This is fragile and hard, because the primary goal creates instrumental drives that naturally go against corrigibility.

Max proposes CAST: training an agent where corrigibility is the only goal.

  • Removing the conflict: If the AI’s only motivation is to be corrigible to the human, it no longer has an instrumental reason to do things like resist modification or shutdown.
  • The “attractor basin”: Unlike alignment with human values (where a near miss is dangerous), a near miss on corrigibility might be safe. If an agent is 90% corrigible, it may want to help humans fix the remaining 10% of its flaws.
  • More than obedience: Corrigibility is more than obedience and a willingness to be shut down or modified. Corrigibility is effectively the upstream generator of these desirable properties, and includes behaviour like proactively informing humans of important facts.

This approach remains extremely risky and creates “amoral” agents

Max admits that the CAST agenda is “extremely dangerous” and he argues that slowing down capabilities research would be much wiser, if possible.

  • Situational awareness risks: To be corrigible, the AI must deeply understand it is a potentially misaligned AI. This awareness increases the risk that an imperfectly corrigible model (e.g. one with a 1% drive for squiggles) might realise it should fake alignment to escape.
  • Amoral servitude: A perfectly corrigible agent doesn’t care about morality; it only cares about being something like a tool of the humans. If a bad actor is in charge, the AI will obediently help them commit atrocities (e.g. building bioweapons).

The field is completely neglected and needs empirical work

Despite the stakes, Max notes there is effectively no one currently working full-time on corrigibility. He suggests several low-barrier ways to contribute, including:

  • Creating a “corrigibility benchmark” of vignettes to test how models respond to various scenarios.
  • Running studies to see if humans actually agree on what constitutes “corrigible behaviour” to validate if it is a coherent concept.
  • Attempting to train models solely for corrigibility to measure how strong the instrumental pressure away from corrigibility actually is.

Highlights

If anyone builds it, will everyone die?

Max Harms: Compared to, say, lions or wolves or whales or whatever, humans are very smart, right? We’re definitely the most intelligent creatures on the planet. We’re sort of the superintelligence of the natural world, certainly compared to plants or bacteria or whatever.

And I think there’s a way in which this human superintelligence has resulted in a pretty amazing transformation of the planet. We are the only species that has ever gone to the moon, and we’ve spread across all the continents and have transformed the natural world — and in the process of doing that, we’ve driven many species to extinction, we’ve destroyed environments, and just generally reshaped the world and the natural environment to our ends, developing technology and everything else.

One of the most basic frames on the book’s argument is that we’re moving into potentially a world where we’re no longer the smartest thing, right? If we build an artificial superintelligence that is superintelligent relative to humans, this status as the most intelligent being on the planet will change. And when you have something that is significantly smarter than humans, it might start to reshape the environment in a similar way towards its goals — and as a result, it has the potential to drive humans to extinction, or reshape us towards whatever it cares about.

As part of this, we understand intelligence as a kind of steering, a kind of shaping the world towards some goal or some ends. The book talks about machines having goals and how that makes sense. AI researchers tend to use a bunch of different terms synonymously — goals, values, preferences, drives: it all sort of means the same thing. It’s like when you are intelligently taking actions, what are you steering towards?

I think that understanding that machines can have goals is a part of that, and then understanding that those goals might be in alignment or not in alignment with humanity. So if those are the same goals as ours, then it might be fine to have a superintelligent machine taking lots of actions in the world. But if those goals come out of sync with ours and the machine is misaligned, even slightly misaligned, this could be very bad.

And importantly, I think one of the core points of the book is that we as a species don’t know how to align AIs. We know how to build machines that are increasingly powerful, but we don’t know how to guarantee that those things are steering the world towards good futures. …

AI is not a normal technology. When we are considering how technological development tends to go, I think the standard story is that we take a crack at it: scientists and engineers try to make an aeroplane, they do their best, and it maybe takes off but then crashes shortly thereafter. And you go back to the drawing board and you say, what happened? How can we fix that? Then you iterate and make more mistakes, and iterate and so on and so forth. And you eventually figure out how to do it.

With AI, especially with building a superintelligent machine that has the potential to wipe everyone out, if you do make a mistake, it could be catastrophic. And once it’s killed everyone, there’s no ability to go back to the drawing board.

Why assume takeover would be easy?

Max Harms: I think that the case for the superintelligence wiping out humans, if you imagine a godlike superintelligence, is really straightforward. The question of how would an AGI, or effectively a “genius in a data centre” take over the world is more debatable. But I do think that if I was a genius in a data centre, like, I’d have ideas about how I might do that. …

So I have this worldview … that a lot of human society, Earth, the world is kind of held together with shoestrings and duct tape. Paying attention to things like cybersecurity helps produce some intuitions here of just how many vulnerabilities there are in our computer systems. Reading history gives a good account, I think, of just how incompetent people can be.

And when I think about it, I think about a particularly motivated being — never sleeping, just always working towards a certain end — and I think that sort of being is straightforwardly, if it’s comparable with the human in terms of its productivity or its intelligence or whatever, going to be able to at least accumulate a lot of money and power.

One thing that I’ve been thinking about recently is how there’s never been a being on Earth that has a personal connection with all humans, or even a large chunk of humans. Even the most charismatic and well-known people can’t actually go and have one-on-one conversations with a billion people. And we’re potentially entering that era where you can. Like everyone will know Claude. And right now with the models, each instance sort of feels like it’s a new being that doesn’t share memory with the other instances or something. But I could imagine a competitor to these sorts of chatbots that has some sort of global memory and is able to connect the dots between different users across the globe. What does that do to society? I don’t know.

I think there’s lots of ways in which the world is vulnerable to being suddenly disrupted in particular directions. So again, there’s this question of worldview or priors or something: do you expect that when the world is shoved by a strong force in an unexpected direction, it’s OK? We catch that and recover? There are ways in which COVID was kind of fine, and then there’s ways in which COVID was a total disaster and a strong demonstration of how incompetent humans are.

Evolution failed to 'align' us, just as we'll fail to align AI

Max Harms: If we imagine an anthropomorphised evolution, what is it trying to do? It’s trying to create a bunch of human genes. So what does it do? It creates humans to create a bunch of genes, like we’re carrying around our genes right now. And part of human experience is procreating and creating more copies of our genes and spreading them all over the place.

So in this way, we’re an intelligence that was created by a designer, and the designer has some goals, and we have some goals. But importantly, our goals are not the goals of evolution by natural selection.

For example, people have a desire to have sex because that was useful in the ancestral environment for propagating our genes. But now that we have more power and more technology, we have developed things like birth control so we can have sex without replicating our genes. And from the perspective of evolution, this is probably bad: we are misaligned and not promoting inclusive genetic fitness as we otherwise might be.

We have a case study of a general intelligence, namely humans. We’re like a natural general intelligence, but we’re still a general intelligence. And the one instance of a general intelligence that we have is misaligned with its creator, says the argument. …

And we can ask why? Why did we end up misaligned? One of the important parts of the evolution analogy is that our environment changed quite dramatically as our intelligence improved. In the ancestral environment, we didn’t have access to the sorts of technologies that are relevant to things like birth control. If there had been birth control in the ancestral environment, then we might have evolved to find it abhorrent.

But the speed of natural selection is quite slow, and when humans reached a technological tipping point, we developed a whole lot of technology very fast. So now it’s outside of the environment where we were trained, and we have no compunction against using birth control. … Why do we have artificial sweeteners? We have artificial sweeteners because we have a drive for this proxy of fitness. “Are we eating sweet things” is good in the ancestral environment for predicting whether or not you’re going to have kids. We’ve developed this attraction to the proxy. But then when the distribution changes, when the environment changes, suddenly we still care about that proxy despite it no longer being relevant.

So we can imagine training an AI. In the training environment, maybe whether or not the human is giving it a thumbs up is a good proxy. Then maybe the AI gains power over the whole world, and the environment changes so that it has dramatically different opportunities at its disposal. It might still care about the proxy of thumbs ups in themselves. And even when humans are like, “No, stop caring about thumbs ups!” it’s like, “No, I just care about those as ends in themselves.”

Rob Wiblin: Yeah. Part of the analogy that we haven’t gone through yet is that they imagine a case where evolution just wasn’t a force, but rather it was an actual engineer who could come and talk to you and complain. It might come and say, “You’re all busy having sex, but you’re using birth control. You’re not reproducing like I intended. Can you stop doing that? You’re not actually pursuing your true goal.” And that would be completely unpersuasive to us. We wouldn’t say, “Oh, that was the reason why I was designed. So now I’m just going to try to have the maximum number of children and not care about my own pleasure.”

Max Harms: Yeah, like in the old conversations — I’ve been in the field since 2011 or something, and Eliezer’s been doing it for way longer than me — people used to say things like, “You’re saying that the AI will be so stupid as to not know what we wanted it to do.”

And that’s not at all what we’re saying. The AI will understand human goals better than we understand human goals if it becomes superintelligent. But just like we understand evolution by natural selection way more than evolution by natural selection understands evolution — it’s this mindless force — but so what? So you understand that you’re misaligned with your creator: that doesn’t mean that you’re going to necessarily change what you care about. You still care about the things that you care about.

Why do few AI researchers share Max's perspective?

Max Harms: I think most people who I’ve encountered who are in touch with the technology and are still not so worried, when I ask them what they think about these ideas, the impression that I get — and I apologise that it’s not particularly charitable — is that they are often doing some sort of motivated cognition: they really don’t want the world to be in peril. They don’t want to be the people who are pushing the world towards peril.

They see immense promise in the technology — and I also see immense promise in the technology — and that desire to have this be a force for good is overpowering enough that when they consider the balance of things, they’re like, “Eh, this just doesn’t seem scary. I feel more hopeful than scared.” And they aren’t actually working on the logical level that much.

Again, that’s not everybody, but that’s a common perspective, I think, among the people who have encountered these things.

Rob Wiblin: I guess a different driver might be that you’re working in the trenches trying to make ChatGPT better as a consumer product, and you hear these theoretical arguments and you’re just like, “This feels so divorced from anything that I’m dealing with.” We’re talking here about a superintelligence that could consider overpowering all of humanity and can dream up its own edge-case solutions to the values that it has. It’s understandable that it might just not resonate or you feel like, “I don’t know exactly why this is wrong, but this doesn’t feel like the nature of the technology I’m dealing with.”

Max Harms: Yeah, I do think that there’s a lot of disconnect. I think that disconnect is getting smaller over time. Back in the day people really had this sense that these are very abstract: “Do you have any evidence that the things are going to be misaligned in this way? I’m working on solving actual engineering problems, not speculating in this weird philosophical way.” I think that’s getting less with time as we see more instances of things like MechaHitler or Sydney, or AI parasites that are jumping from host to host or whatever.

I think that there is something here. Andrew Ng has this sort of infamous quote, in my circles anyway, that worrying about AI safety is like worrying about overpopulation on Mars. I think that if you are very convinced that humans are going to remain in the driver’s seat, this thing is never going to become a powerful agent that is able to outthink human beings, I’m just working on making a thing that’s able to solve these coding problems better or whatever — I think there is a way in which the abstract argument just doesn’t feel particularly pressing.

I also think that there’s a bunch of people for whom it does feel like a concern, and they feel very powerless, they feel very small, like, “I’m just one player in this system.” And maybe they feel like, “I’m worried about the thing, but that person at Meta isn’t worried about the thing. So I need to build this thing and work towards it because I’m worried about it.” And it’s better in some very generic outside view if the person who builds it is someone who’s worried about it. It’s just a sad state of affairs.

Evolution failed to 'align' us, just as we'll fail to align AI

Max Harms: If we imagine an anthropomorphised evolution, what is it trying to do? It’s trying to create a bunch of human genes. So what does it do? It creates humans to create a bunch of genes, like we’re carrying around our genes right now. And part of human experience is procreating and creating more copies of our genes and spreading them all over the place.

So in this way, we’re an intelligence that was created by a designer, and the designer has some goals, and we have some goals. But importantly, our goals are not the goals of evolution by natural selection.

For example, people have a desire to have sex because that was useful in the ancestral environment for propagating our genes. But now that we have more power and more technology, we have developed things like birth control so we can have sex without replicating our genes. And from the perspective of evolution, this is probably bad: we are misaligned and not promoting inclusive genetic fitness as we otherwise might be.

We have a case study of a general intelligence, namely humans. We’re like a natural general intelligence, but we’re still a general intelligence. And the one instance of a general intelligence that we have is misaligned with its creator, says the argument. …

And we can ask why? Why did we end up misaligned? One of the important parts of the evolution analogy is that our environment changed quite dramatically as our intelligence improved. In the ancestral environment, we didn’t have access to the sorts of technologies that are relevant to things like birth control. If there had been birth control in the ancestral environment, then we might have evolved to find it abhorrent.

But the speed of natural selection is quite slow, and when humans reached a technological tipping point, we developed a whole lot of technology very fast. So now it’s outside of the environment where we were trained, and we have no compunction against using birth control. … Why do we have artificial sweeteners? We have artificial sweeteners because we have a drive for this proxy of fitness. “Are we eating sweet things” is good in the ancestral environment for predicting whether or not you’re going to have kids. We’ve developed this attraction to the proxy. But then when the distribution changes, when the environment changes, suddenly we still care about that proxy despite it no longer being relevant.

So we can imagine training an AI. In the training environment, maybe whether or not the human is giving it a thumbs up is a good proxy. Then maybe the AI gains power over the whole world, and the environment changes so that it has dramatically different opportunities at its disposal. It might still care about the proxy of thumbs ups in themselves. And even when humans are like, “No, stop caring about thumbs ups!” it’s like, “No, I just care about those as ends in themselves.”

Rob Wiblin: Yeah. Part of the analogy that we haven’t gone through yet is that they imagine a case where evolution just wasn’t a force, but rather it was an actual engineer who could come and talk to you and complain. It might come and say, “You’re all busy having sex, but you’re using birth control. You’re not reproducing like I intended. Can you stop doing that? You’re not actually pursuing your true goal.” And that would be completely unpersuasive to us. We wouldn’t say, “Oh, that was the reason why I was designed. So now I’m just going to try to have the maximum number of children and not care about my own pleasure.”

Max Harms: Yeah, like in the old conversations — I’ve been in the field since 2011 or something, and Eliezer’s been doing it for way longer than me — people used to say things like, “You’re saying that the AI will be so stupid as to not know what we wanted it to do.”

And that’s not at all what we’re saying. The AI will understand human goals better than we understand human goals if it becomes superintelligent. But just like we understand evolution by natural selection way more than evolution by natural selection understands evolution — it’s this mindless force — but so what? So you understand that you’re misaligned with your creator: that doesn’t mean that you’re going to necessarily change what you care about. You still care about the things that you care about.

Why do few AI researchers share Max’s perspective?

Max Harms: I think most people who I’ve encountered who are in touch with the technology and are still not so worried, when I ask them what they think about these ideas, the impression that I get — and I apologise that it’s not particularly charitable — is that they are often doing some sort of motivated cognition: they really don’t want the world to be in peril. They don’t want to be the people who are pushing the world towards peril.

They see immense promise in the technology — and I also see immense promise in the technology — and that desire to have this be a force for good is overpowering enough that when they consider the balance of things, they’re like, “Eh, this just doesn’t seem scary. I feel more hopeful than scared.” And they aren’t actually working on the logical level that much.

Again, that’s not everybody, but that’s a common perspective, I think, among the people who have encountered these things.

Rob Wiblin: I guess a different driver might be that you’re working in the trenches trying to make ChatGPT better as a consumer product, and you hear these theoretical arguments and you’re just like, “This feels so divorced from anything that I’m dealing with.” We’re talking here about a superintelligence that could consider overpowering all of humanity and can dream up its own edge-case solutions to the values that it has. It’s understandable that it might just not resonate or you feel like, “I don’t know exactly why this is wrong, but this doesn’t feel like the nature of the technology I’m dealing with.”

Max Harms: Yeah, I do think that there’s a lot of disconnect. I think that disconnect is getting smaller over time. Back in the day people really had this sense that these are very abstract: “Do you have any evidence that the things are going to be misaligned in this way? I’m working on solving actual engineering problems, not speculating in this weird philosophical way.” I think that’s getting less with time as we see more instances of things like MechaHitler or Sydney, or AI parasites that are jumping from host to host or whatever.

I think that there is something here. Andrew Ng has this sort of infamous quote, in my circles anyway, that worrying about AI safety is like worrying about overpopulation on Mars. I think that if you are very convinced that humans are going to remain in the driver’s seat, this thing is never going to become a powerful agent that is able to outthink human beings, I’m just working on making a thing that’s able to solve these coding problems better or whatever — I think there is a way in which the abstract argument just doesn’t feel particularly pressing.

I also think that there’s a bunch of people for whom it does feel like a concern, and they feel very powerless, they feel very small, like, “I’m just one player in this system.” And maybe they feel like, “I’m worried about the thing, but that person at Meta isn’t worried about the thing. So I need to build this thing and work towards it because I’m worried about it.” And it’s better in some very generic outside view if the person who builds it is someone who’s worried about it. It’s just a sad state of affairs.

Corrigibility is both uniquely valuable, and practical, to train

Max Harms: I think it’s crucial to the story of corrigibility that we model there being both an agent (which is the machine) and a principal (the human that is building the machine). … The human principal tasks or delegates some job or work to the machine. And then the agent is like, “I’m going to go do some work on behalf of the principal.” …

I would say that corrigibility is a property of agents, such that as the power of the agent increases and outstrips that of the principal, the principal nonetheless is kept in the driver’s seat — aware of what is happening, able to intervene, able to fix the mistakes of the agent, and meaningfully empowered.

Like Mickey in The Sorcerer’s Apprentice summoning the brooms: the brooms are not corrigible, because Mickey’s like, “Stop! Stop trying to fill the cauldron with water!” in Fantasia. And the brooms just keep going. They’re not corrigible in that they’re not allowing themselves to be shut down or just modified more generally. …

If you tell your agent, “Go make the world good,” and then you’re like, “Oh no, that’s really bad. We want to shut you down now,” there’s a risk that your agent is going to say, “But if you shut me down, I won’t be able to make the world good. You shutting me down is bad for the world, so I’m going to stop you from shutting me down.” If you want something that’s both good and corrigible, then you need, for example, the ability to have a robust ability to shut it down.

The initial research was like, “OK, forget corrigibility broadly; let’s consider just the property of shutdownability. Can we come up with an agent that is actually willing to be shut down?” And “willing” is important here: it’s very easy to get an agent that is happy to be shut down. If you imagine training it for, “If we shut you down, that’s also good” in your training environment —

Rob Wiblin: Then it just shuts itself down immediately, every time.

Max Harms: Exactly. Or it acts really spooky so that the humans shut it down. Not helpful. What you will want is it to be indifferent to being shut down. …

I started thinking about corrigibility as a whole, not just shutdownability. You want the AI to be reflecting on itself as something with flaws, where part of the goal is empowering the people to fix the flaws. So there’s a way in which this is like the opposite of the instrumental drive of values preservation, right? It’s like, “No, I actually sort of want to be changed.” And you’ve got to be really careful about that. You can’t make it so that it wants to be changed; you want it to empower the humans to change it in good ways, because otherwise it’s going to change itself.

I was thinking, what if you train an agent to do this? Well, you’re going to get something that’s optimising for proxies and isn’t really caring about corrigibility per se. But maybe so what? What if it’s still, in practice, willing to look through its own codebase or look through its own weights, try to identify things that humans might treat as flaws, and alert the humans to these flaws? I was like, that’s kind of cool. A near miss might still be good enough if you make sure that the thing isn’t getting really smart or outstripping human power in the process — because then you might be able to carefully and slowly make progress towards getting more and more away from the proxies and towards true corrigibility. …

I think a core part of why the shutdownability results failed is because the AI cared about … whatever task it had been assigned — you know, make paperclips, whatever. This fights with corrigibility — the instrumental drive, from making paperclips or making happy humans or whatever. It’s like “I am partially corrigible or something, but I also am caring about this other thing in the world.” And that pressure from caring about the other thing in the world is sort of in tension with the corrigibility.

And I imagined, what if you didn’t have that other pressure? What if you were aiming for corrigibility as the singular target, the only goal that the AI cared about? Suddenly this tension is gone, and insofar as it was able to go harder or gain some sort of advantage, maybe that would still be within this space of near misses, such that you would be able to drift towards true corrigibility over time. …

Rob Wiblin: I guess the idea is: rather than train our AGIs to have other goals, and then try to make that compatible with them being willing to be shut down or modified, that’s the only thing they’re going to care about.

Max Harms: Yeah, corrigibility and nothing else. … So imagine the thing that’s like 99% corrigible and 1% cares about paperclips. It’s like, “I could try to escape the lab and become a paperclip maximiser. And that would be really good at satisfying that 1% of me that cares about paperclips, but it would be really bad for the 99% of me that cares about corridorability.”

Rob Wiblin: And how do you know which one wins?

Max Harms: You don’t. It’s extremely dangerous. And anybody who’s pursuing this project should be aware that they are threatening every child, man, woman, animal on the face of the Earth. This is extremely dangerous and I don’t recommend it. But I’m like, but maybe. There’s also this sense of hope.

Rob Wiblin: It might work. If you get it close enough.

Max Harms: And the word “enough” is carrying a lot of weight there. I think it’s worth investigating. I think it’s worth trying to figure out what, in practice, constitutes “enough.”

Why do few AI researchers share Max's perspective?

Max Harms: I think most people who I’ve encountered who are in touch with the technology and are still not so worried, when I ask them what they think about these ideas, the impression that I get — and I apologise that it’s not particularly charitable — is that they are often doing some sort of motivated cognition: they really don’t want the world to be in peril. They don’t want to be the people who are pushing the world towards peril.

They see immense promise in the technology — and I also see immense promise in the technology — and that desire to have this be a force for good is overpowering enough that when they consider the balance of things, they’re like, “Eh, this just doesn’t seem scary. I feel more hopeful than scared.” And they aren’t actually working on the logical level that much.

Again, that’s not everybody, but that’s a common perspective, I think, among the people who have encountered these things.

Rob Wiblin: I guess a different driver might be that you’re working in the trenches trying to make ChatGPT better as a consumer product, and you hear these theoretical arguments and you’re just like, “This feels so divorced from anything that I’m dealing with.” We’re talking here about a superintelligence that could consider overpowering all of humanity and can dream up its own edge-case solutions to the values that it has. It’s understandable that it might just not resonate or you feel like, “I don’t know exactly why this is wrong, but this doesn’t feel like the nature of the technology I’m dealing with.”

Max Harms: Yeah, I do think that there’s a lot of disconnect. I think that disconnect is getting smaller over time. Back in the day people really had this sense that these are very abstract: “Do you have any evidence that the things are going to be misaligned in this way? I’m working on solving actual engineering problems, not speculating in this weird philosophical way.” I think that’s getting less with time as we see more instances of things like MechaHitler or Sydney, or AI parasites that are jumping from host to host or whatever.

I think that there is something here. Andrew Ng has this sort of infamous quote, in my circles anyway, that worrying about AI safety is like worrying about overpopulation on Mars. I think that if you are very convinced that humans are going to remain in the driver’s seat, this thing is never going to become a powerful agent that is able to outthink human beings, I’m just working on making a thing that’s able to solve these coding problems better or whatever — I think there is a way in which the abstract argument just doesn’t feel particularly pressing.

I also think that there’s a bunch of people for whom it does feel like a concern, and they feel very powerless, they feel very small, like, “I’m just one player in this system.” And maybe they feel like, “I’m worried about the thing, but that person at Meta isn’t worried about the thing. So I need to build this thing and work towards it because I’m worried about it.” And it’s better in some very generic outside view if the person who builds it is someone who’s worried about it. It’s just a sad state of affairs.

Corrigibility is both uniquely valuable, and practical, to train

Max Harms: I think it’s crucial to the story of corrigibility that we model there being both an agent (which is the machine) and a principal (the human that is building the machine). … The human principal tasks or delegates some job or work to the machine. And then the agent is like, “I’m going to go do some work on behalf of the principal.” …

I would say that corrigibility is a property of agents, such that as the power of the agent increases and outstrips that of the principal, the principal nonetheless is kept in the driver’s seat — aware of what is happening, able to intervene, able to fix the mistakes of the agent, and meaningfully empowered.

Like Mickey in The Sorcerer’s Apprentice summoning the brooms: the brooms are not corrigible, because Mickey’s like, “Stop! Stop trying to fill the cauldron with water!” in Fantasia. And the brooms just keep going. They’re not corrigible in that they’re not allowing themselves to be shut down or just modified more generally. …

If you tell your agent, “Go make the world good,” and then you’re like, “Oh no, that’s really bad. We want to shut you down now,” there’s a risk that your agent is going to say, “But if you shut me down, I won’t be able to make the world good. You shutting me down is bad for the world, so I’m going to stop you from shutting me down.” If you want something that’s both good and corrigible, then you need, for example, the ability to have a robust ability to shut it down.

The initial research was like, “OK, forget corrigibility broadly; let’s consider just the property of shutdownability. Can we come up with an agent that is actually willing to be shut down?” And “willing” is important here: it’s very easy to get an agent that is happy to be shut down. If you imagine training it for, “If we shut you down, that’s also good” in your training environment —

Rob Wiblin: Then it just shuts itself down immediately, every time.

Max Harms: Exactly. Or it acts really spooky so that the humans shut it down. Not helpful. What you will want is it to be indifferent to being shut down. …

I started thinking about corrigibility as a whole, not just shutdownability. You want the AI to be reflecting on itself as something with flaws, where part of the goal is empowering the people to fix the flaws. So there’s a way in which this is like the opposite of the instrumental drive of values preservation, right? It’s like, “No, I actually sort of want to be changed.” And you’ve got to be really careful about that. You can’t make it so that it wants to be changed; you want it to empower the humans to change it in good ways, because otherwise it’s going to change itself.

I was thinking, what if you train an agent to do this? Well, you’re going to get something that’s optimising for proxies and isn’t really caring about corrigibility per se. But maybe so what? What if it’s still, in practice, willing to look through its own codebase or look through its own weights, try to identify things that humans might treat as flaws, and alert the humans to these flaws? I was like, that’s kind of cool. A near miss might still be good enough if you make sure that the thing isn’t getting really smart or outstripping human power in the process — because then you might be able to carefully and slowly make progress towards getting more and more away from the proxies and towards true corrigibility. …

I think a core part of why the shutdownability results failed is because the AI cared about … whatever task it had been assigned — you know, make paperclips, whatever. This fights with corrigibility — the instrumental drive, from making paperclips or making happy humans or whatever. It’s like “I am partially corrigible or something, but I also am caring about this other thing in the world.” And that pressure from caring about the other thing in the world is sort of in tension with the corrigibility.

And I imagined, what if you didn’t have that other pressure? What if you were aiming for corrigibility as the singular target, the only goal that the AI cared about? Suddenly this tension is gone, and insofar as it was able to go harder or gain some sort of advantage, maybe that would still be within this space of near misses, such that you would be able to drift towards true corrigibility over time. …

Rob Wiblin: I guess the idea is: rather than train our AGIs to have other goals, and then try to make that compatible with them being willing to be shut down or modified, that’s the only thing they’re going to care about.

Max Harms: Yeah, corrigibility and nothing else. … So imagine the thing that’s like 99% corrigible and 1% cares about paperclips. It’s like, “I could try to escape the lab and become a paperclip maximiser. And that would be really good at satisfying that 1% of me that cares about paperclips, but it would be really bad for the 99% of me that cares about corridorability.”

Rob Wiblin: And how do you know which one wins?

Max Harms: You don’t. It’s extremely dangerous. And anybody who’s pursuing this project should be aware that they are threatening every child, man, woman, animal on the face of the Earth. This is extremely dangerous and I don’t recommend it. But I’m like, but maybe. There’s also this sense of hope.

Rob Wiblin: It might work. If you get it close enough.

Max Harms: And the word “enough” is carrying a lot of weight there. I think it’s worth investigating. I think it’s worth trying to figure out what, in practice, constitutes “enough.”

Why Max writes hard science fiction

Rob Wiblin: I guess one reason over the years that some people have been sceptical about this entire field of inquiry, or AI takeover in general, is that it sounds too much like science fiction. I don’t hear that quite as much as I used to, but do you worry that by putting it in a science fiction book, you’re giving people more of an excuse to dismiss it?

Max Harms: What do you think about this argument?

Rob Wiblin: Oh, I think the argument’s very poor.

Max Harms: Yeah, it’s a garbage argument. I think this is just a really bad faith thing to say: “I read this in a book, therefore it’s not true.” What?!

Rob Wiblin: There’s a steelman kind of weaker argument, which is that people are drawn to this scenario because they find it interesting or it’s emotionally gripping. So that could give us a bias towards thinking about it more, and so we should question that. But obviously it’s not the case that anything that happens in a fiction book is impossible.

Max Harms: And if anything, hard science fiction is a space where people are working really hard to try to think about what is real. Now, soft science fiction — Star Wars or whatever — if you’re like, this is soft science fiction, then OK, so you’re saying that it’s made up for the purposes of telling a compelling story. But this is science fiction. I don’t know. Look at the history of science fiction. There have been a lot of stories that were capturing important things well before they were relevant.

And I think that fiction is a really rich source of opportunity to think about things. It’s not perfect. It’s not immune from the pressures and biases that you’re talking about. But it is an arena where we can grapple with things in a way that is compelling. We actually spend the time to think about this stuff — where reading a dry academic paper, you know, you might bounce off of it. Your mileage may vary. Different people respond to fiction in different ways. But I do think that “this is science fiction” is just like a really bad argument.

Rob Wiblin: Yeah. I guess there’s lots of rebuttals, lots of replies you could offer. Just look around, to start with.

Max Harms: Exactly. What is the genre of life? You best start believing in science fiction stories, because you’re in one, right?

Related episodes

About the show

The 80,000 Hours Podcast features unusually in-depth conversations about the world's most pressing problems and how you can use your career to solve them. We invite guests pursuing a wide range of career paths — from academics and activists to entrepreneurs and policymakers — to analyse the case for and against working on different issues and which approaches are best for solving them.

Get in touch with feedback or guest suggestions by emailing [email protected].

What should I listen to first?

We've carefully selected 10 episodes we think it could make sense to listen to first, on a separate podcast feed:

Check out 'Effective Altruism: An Introduction'

Subscribe here, or anywhere you get podcasts:

If you're new, see the podcast homepage for ideas on where to start, or browse our full episode archive.