Paul Christiano is one of the smartest people I know, and this episode has one of the best explanations of why AI alignment matters and how we might solve it. After our first session produced such great material, we decided to do a second recording, resulting in our longest interview so far. While it is challenging at times, I can strongly recommend listening – Paul works on AI himself and has an unusually well-thought-through view of how it will change the world. Even though I’m familiar with Paul’s writing, I felt I was learning a great deal and am now in a better position to make a difference to the world.

A few of the topics we cover are:

  • Why Paul expects AI to transform the world gradually rather than explosively and what that would look like
  • Several concrete methods OpenAI is trying to develop to ensure AI systems do what we want even if they become more competent than us
  • Why AI systems will probably be granted legal and property rights
  • How an advanced AI that doesn’t share human goals could still have moral value
  • Why machine learning might take over science research from humans before it can do most other tasks
  • In which decade we should expect human labour to become obsolete, and how this should affect your savings plan

If an AI says, “I would like to design the particle accelerator this way because,” and then makes an inscrutable argument about physics, you’re faced with this tough choice. You can either sign off on that decision and see if it has good consequences, or you [say] “no, don’t do that ’cause I don’t understand it”. But then you’re going to be permanently foreclosing some large space of possible things your AI could do.

Paul Christiano

Here’s a situation we all regularly confront: you want to answer a difficult question, but aren’t quite smart or informed enough to figure it out for yourself. The good news is you have access to experts who are smart enough to figure it out. The bad news is that they disagree.

If given plenty of time – and enough arguments, counterarguments and counter-counter-arguments between all the experts – should you eventually be able to figure out which is correct? What if one expert were deliberately trying to mislead you? And should the expert with the correct view just tell the whole truth, or will competition force them to throw in persuasive lies in order to have a chance of winning you over?

In other words: does ‘debate’, in principle, lead to truth?

According to Paul Christiano – researcher at the machine learning research lab OpenAI and legendary thinker in the effective altruism and rationality communities – this question is of more than mere philosophical interest. That’s because ‘debate’ is a promising method of keeping artificial intelligence aligned with human goals, even if it becomes much more intelligent and sophisticated than we are.

It’s a method OpenAI is actively trying to develop, because in the long term it wants to train AI systems to make decisions that are too complex for any human to grasp, but without the risks that arise from a complete loss of human oversight.

If AI-1 is free to choose any line of argument in order to attack the ideas of AI-2, and AI-2 always seems to successfully defend them, it suggests that every possible line of argument would have been unsuccessful.

But does that mean that the ideas of AI-2 were actually right? It would be nice if the optimal strategy in debate were to be completely honest, provide good arguments, and respond to counterarguments in a valid way. But we don’t know that’s the case.

According to Paul, it’s clear that if the judge is weak enough, there’s no reason that an honest debater would be at an advantage. But the hope is that there is some threshold of competence above which debates tend to converge on more accurate claims the longer they continue.

Most real world debates are set up under highly suboptimal conditions; judges usually don’t have a lot of time to think about how to best get to the truth, and often have bad incentives themselves. But for AI safety via debate, researchers are free to set things up in the way that gives them the best shot. And if we could understand how to construct systems that converge to truth, we would have a plausible way of training powerful AI systems to stay aligned with our goals.

This is our longest interview so far for good reason — we cover a fascinating range of topics:

  • What could people do to shield themselves financially from potentially losing their jobs to AI?
  • How important is it that the best AI safety team ends up in the company with the best ML team?
  • What might the world look like if several states or actors developed AI at the same time (aligned or otherwise)?
  • Would artificial general intelligence grow in capability quickly or slowly?
  • How likely is it that transformative AI is an issue worth worrying about?
  • What are the best arguments against being concerned?
  • What would cause people to take AI alignment more seriously?
  • Concrete ideas for making machine learning safer, such as iterated amplification.
  • What does it mean to say that a crow-like intelligence could be much better at science than humans?
  • What is ‘prosaic AI’?
  • How do Paul’s views differ from those of the Machine Intelligence Research Institute?
  • The importance of honesty for people and organisations
  • What are the most important ways that people in the effective altruism community are approaching AI issues incorrectly?
  • When would an ‘unaligned’ AI nonetheless be morally valuable?
  • What’s wrong with current sci-fi?

Get this episode by subscribing to our podcast on the world’s most pressing problems and how to solve them: type 80,000 Hours into your podcasting app. Or read the transcript below.

The 80,000 Hours podcast is produced by Keiran Harris.


So I think the competitive pressure to develop AI, in some sense, is the only reason there’s a problem. I think describing it as an arms race feels somewhat narrow, potentially. That is, the problem’s not restricted to conflicts among states, say. It’s not restricted even to conflict, per se. If we have really secure property, so if everyone owns some stuff and the stuff they owned was just theirs, then it would be very easy to ignore … if individuals could just opt out of AI risk being a thing because they’d just say, “Great, I have some land and some resources and space, I’m just going to chill. I’m going to take things really slow and careful and understand.” Given that’s not the case, then in addition to violent conflict, there’s … just faster technological progress tends to give you a larger share of the stuff.

Most resources are just sitting around unclaimed, so if you go faster you get more of them. If there are two countries and one of them is 10 years ahead in technology, that country will, everyone expects, expand first into space and, over the very long run, claim more resources in space. In addition to violent conflict, de facto, they’ll claim more resources on Earth, et cetera.

I think the problem comes from the fact that you can’t take it slow because other people aren’t taking it slow. That is, we’re all forced to develop technology as fast as we can. I don’t think of it as restricted to arms races or conflict among states. I think there would probably still be some problem, just because people … Even if people weren’t forced to go quickly, I think everyone wants to go quickly in the current world. That is, most people care a lot about having nicer things next year, and so even if there were no competitive dynamic, I think that many people would be deploying AI the first time it was practical, to become much richer or to advance technology more rapidly. So I think we would still have some problem. Maybe it would be a third as large or something like that.

The largest source of variance is just how hard is the problem? What is the character of the problem? So after that, I think the biggest uncertainty, though not necessarily the highest place to push, is about how people behave. It’s how much investment do they make? How well are they able to reach agreements? How motivated are they in general to change what they’re doing in order to make things go well? So I think that’s a larger source of variance than technical research that we do in advance. I think it’s potentially a harder thing to push on in advance. Pushing on how much technical research we do in advance is very easy. If we want to increase that amount by 10%, that’s incredibly cheap, whereas having a similarly big change on how people behave would be a kind of epic project. But I think that more of the variance comes from how people behave.

I’m very, very uncertain about the institutional context in which that will be developed. Very uncertain about how much each particular actor really cares about these issues, or when push came to shove, how far out of their way they would go to avoid catastrophic risk. I’m very uncertain about how feasible it will be to make agreements to avoid a race to the bottom on safety.

We’re very uncertain about how hard doing science is. As an example, I think back in the day we would have said that playing board games designed to tax human intelligence, like chess or Go, is really quite hard, and it feels to humans like they’re really able to leverage all their intelligence doing it.

It turns out that playing chess, from the perspective of actually designing a computation to play chess, is incredibly easy: it takes a brain very much smaller than an insect brain to play chess much better than a human. I think it’s pretty clear at this point that science makes better use of human brains than chess does, but it’s actually not clear how much better. It’s totally conceivable from our current perspective, I think, that an intelligence that was as smart as a crow, but was actually designed for doing science, for doing engineering, for advancing technology as rapidly as possible, would outcompete humans pretty badly at those tasks.

Some people have a model in which early developers of AI will be at a huge advantage. They can take their time or they can be very picky about how they want to deploy their AI, and nevertheless radically reshape the world. I think that’s conceivable, but it’s much more likely that the early developers of AI will be developing AI in a world that already contains quite a lot of AI that’s almost as good, and they really won’t have that much breathing room. They won’t be able to reap a tremendous windfall profit. They won’t be able to be really picky about how they use their AI. You won’t be able to take your human-level AI and send it out on the internet to take over every computer, because this will occur in a world where all the computers that were easy to take over have already been taken over by much dumber AIs. It’s more like you’re existing in this soup of a bunch of very powerful systems.

The idea in iterated amplification is to start from a weak AI. At the beginning of training you can use a human. A human is smarter than your AI, so they can train the system. As the AI acquires capabilities that are comparable to those of a human, then the human can use the AI that they’re currently training as an assistant, to help them act as a more competent overseer.

Over the course of training, you have this AI that’s getting more and more competent, and the human at every point in time uses several copies of the current AI as assistants to help them make smarter decisions. And the hope is that that process both preserves alignment and allows this overseer to always be smarter than the AI they’re trying to train. And so the key steps of the analysis there are both solving this problem, the first problem I mentioned of training an AI when you have a smarter overseer, and then actually analyzing the behavior of the system consisting of a human plus several copies of the current AI acting as assistants to the human to help them make good decisions.

In particular, as you move along the training, by the end of training, the human’s role becomes kind of minimal, like if we imagine training superintelligence. In that regime, we’re just saying, can you somehow put together several copies of your current AI to act as the overseer? You have this AI trying to … Hopefully at each step it remains aligned. You put together a few copies of the AI to act as an overseer for itself.
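To make the loop Paul describes more concrete, here is a deliberately tiny sketch in Python. It is not OpenAI's implementation and every name in it (`human`, `amplify`, `distill`, `iterated_amplification`) is hypothetical. It assumes a toy task (summing a tuple of numbers), a "human" overseer who can only answer one-number questions directly and otherwise splits the question in half and delegates to assistants, and a "model" that is just a lookup table standing in for a trained network. Each round, the overseer plus copies of the current model answers what it can, and the next model is trained to imitate those answers.

```python
from typing import Callable, Dict, List, Optional, Tuple

Question = Tuple[int, ...]      # toy question: "what is the sum of these numbers?"
Model = Dict[Question, int]     # stand-in for a learned model: a lookup table
Assistant = Callable[[Question], Optional[int]]


def human(question: Question, assistant: Assistant) -> Optional[int]:
    """Weak overseer: answers trivial questions directly, otherwise splits
    the question in two and asks an assistant about each half."""
    if len(question) <= 1:
        return sum(question)
    mid = len(question) // 2
    left = assistant(question[:mid])
    right = assistant(question[mid:])
    if left is None or right is None:
        return None  # the assistants are not capable enough yet
    return left + right


def amplify(model: Model) -> Assistant:
    """Amplification: the human working with copies of the current model."""
    return lambda question: human(question, model.get)


def distill(overseer: Assistant, questions: List[Question]) -> Model:
    """Distillation (assumed here to be plain memorisation): the new model
    imitates every answer the amplified overseer was able to give."""
    new_model: Model = {}
    for question in questions:
        answer = overseer(question)
        if answer is not None:
            new_model[question] = answer
    return new_model


def iterated_amplification(questions: List[Question], rounds: int) -> Model:
    model: Model = {}  # start from a model that knows nothing
    for _ in range(rounds):
        model = distill(amplify(model), questions)
    return model


data = (3, 1, 4, 1, 5, 9, 2, 6)
# Training questions: every contiguous slice of the data.
questions = [data[i:j] for i in range(len(data)) for j in range(i + 1, len(data) + 1)]
model = iterated_amplification(questions, rounds=4)
print(model[data])  # prints 31
```

In this toy, the longest question the model can answer roughly doubles each round, even though the human never does more than split a question and add two numbers. The sketch says nothing about whether alignment is preserved along the way, which is the open question in the passage above.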

Articles, books, and other media discussed in the show

Mentioned at the start of the episode: The world’s highest impact career paths according to our research

About the show

The 80,000 Hours Podcast features unusually in-depth conversations about the world's most pressing problems and how you can use your career to solve them. We invite guests pursuing a wide range of career paths — from academics and activists to entrepreneurs and policymakers — to analyse the case for and against working on different issues and which approaches are best for solving them.

The 80,000 Hours Podcast is produced and edited by Keiran Harris.
