It’s funny, if you track a lot of the nay-saying that existed circa 2017 or 2018 around AGI, a lot of people would be like, “Well, call me when AI can do that. Call me when AI can tell me what the word ‘it’ means in such and such a sentence.” And then it’s like, “Okay, well we’re there, so, can we call you now?”

Brian Christian

Brian Christian is a bestselling author with a particular knack for accurately communicating difficult or technical ideas from both mathematics and computer science.

Listeners loved our episode about his book Algorithms to Live By — so when the team read his new book, The Alignment Problem, and found it to be an insightful and comprehensive review of the state of the research into making advanced AI useful and reliably safe, getting him back on the show was a no-brainer.

Brian has so much of substance to say that this episode will likely be of interest to people who know a lot about AI as well as those who know only a little, and to people who are nervous about where AI is going as well as those who aren’t nervous at all.

Here’s a tease of 10 Hollywood-worthy stories from the episode:

  • The Riddle of Dopamine: The development of reinforcement learning solves a long-standing mystery of how humans are able to learn from their experience.
  • ALVINN: A student teaches a military vehicle to drive between Pittsburgh and Lake Erie, without intervention, in the early nineties, using a computer with a tenth the processing capacity of an Apple Watch.
  • Couch Potato: An agent trained to be curious is stopped in its quest to navigate a maze by a paralysing TV screen.
  • Pitts & McCulloch: A homeless teenager and his foster father figure invent the idea of the neural net.
  • Tree Senility: Agents become so good at living in trees to escape predators that they forget how to leave, starve, and die.
  • The Danish Bicycle: A reinforcement learning agent figures out that it can better achieve its goal by riding in circles as quickly as possible than by actually reaching its purported destination.
  • Montezuma’s Revenge: By 2015 a reinforcement learner can play 60 different Atari games — the majority impossibly well — but can’t score a single point on one game humans find tediously simple.
  • Curious Pong: Two novelty-seeking agents, forced to play Pong against one another, create increasingly extreme rallies.
  • AlphaGo Zero: A computer program becomes superhuman at Chess and Go in under a day by attempting to imitate itself.
  • Robot Gymnasts: Over the course of an hour, humans teach robots to do perfect backflips just by telling them which of two random actions looks more like a backflip.

We also cover:

  • How reinforcement learning actually works, and some of its key achievements and failures
  • How a lack of curiosity can leave AIs unable to do basic things
  • The pitfalls of getting AI to imitate how we ourselves behave
  • The benefits of getting AI to infer what we must be trying to achieve
  • Why it’s good for agents to be uncertain about what they’re doing
  • Why Brian isn’t that worried about explicit deception
  • The interviewees Brian most agrees with, and most disagrees with
  • Developments since Brian finished the manuscript
  • The effective altruism and AI safety communities
  • And much more

Get this episode by subscribing to our podcast on the world’s most pressing problems and how to solve them: Type 80,000 Hours into your podcasting app. Or read the transcript below.

Producer: Keiran Harris
Audio mastering: Ben Cordell
Transcriptions: Sofia Davis-Fogel

Highlights

Reinforcement learning

Brian Christian: There’s an idea called ‘temporal difference learning,’ which says rather than waiting until you actually get the reward, you can learn from your own estimate changing. For example, if on Monday I predict that there’s going to be an 80% chance of rain on Friday, and then on Tuesday I think there’s only going to be a 60% chance of rain on Friday, the idea of temporal difference learning is that you learn from that delta in your guess. You don’t have to actually wait until Friday to see what happens. You can already learn something from the fact that your later estimate is probably more accurate.

It’s the same thing in a chess game: if you make a move and then your opponent replies and then you think, oh crap, I’m probably going to lose now, you don’t have to wait until you actually lose. Your opponent may make an even worse blunder and you ultimately win, but you can learn something from the change in your prediction. That’s an example of some of the theoretical stuff, some of the foundational stuff. But in terms of what we’ve actually been able to do with it, reinforcement learning in its modern incarnation is behind everything from AlphaGo to the systems that play Atari games, to robotics, to self-driving cars.
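
To make the ‘learning from the delta in your guess’ idea concrete, here is a minimal sketch of tabular TD(0) on a toy random walk (our own illustrative example, not something from the episode). The key line is the update rule: each state’s value estimate is nudged toward the immediate reward plus the estimate of the next state, so learning happens from the change between successive predictions rather than from waiting for the final outcome.

```python
import random

N_STATES = 7              # states 0..6; 0 and 6 are terminal
ALPHA = 0.1               # learning rate
GAMMA = 1.0               # no discounting in this short episodic task

values = [0.5] * N_STATES
values[0] = values[6] = 0.0   # terminal states have no future value

for episode in range(5000):
    s = 3                                  # start in the middle
    while s not in (0, 6):
        s_next = s + random.choice((-1, 1))
        reward = 1.0 if s_next == 6 else 0.0
        # Temporal-difference update: learn from the change in your own estimate,
        # without waiting to see how the episode finally ends.
        td_error = reward + GAMMA * values[s_next] - values[s]
        values[s] += ALPHA * td_error
        s = s_next

# True values are 1/6, 2/6, ..., 5/6 for states 1..5; the estimates settle near them.
print([round(v, 2) for v in values[1:6]])
```

In the weather example, the Tuesday forecast plays the role of values[s_next]: Monday’s estimate gets updated toward it without waiting to see what Friday actually brings.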

Shaping

Brian Christian: I think part of the reminder to people who work on reinforcement learning in more conventional settings is that you want to think about not only the ultimate policy that this agent develops at the end, but the actual trajectory by which it learns that policy. Often in research settings, we just kind of throw away everything that the agent did until it became maximally expert. But in the real world, kids exist. They share the world with us and they’re still figuring things out, but they can make real mistakes. They can hurt themselves, they can hurt others, et cetera. So you want to think about the interaction between the reward and not just the final behaviour that comes out of it, but the actual, entire learning trajectory.

Knowledge-seeking agents

Brian Christian: If you had an ostensible basketball-playing agent that you rewarded for getting a high score, a reasonably intelligent basketball agent would learn to get good at basketball, but a superintelligent basketball-playing agent would learn that the score is just some electrical current flowing through the scoreboard, and so it would learn to re-solder its scoreboard and just make it say ‘9,999,’ or whatever. And so there were a lot of ways in which an agent could sort of deceive itself or optimise for some proxy that wasn’t the real thing that the designers intended. And in almost every case, they found it hard to imagine that the agent would avoid one of these kinds of degenerate scenarios.

But the one case where things seemed to go according to plan was the idea of what they called a knowledge-seeking agent. This is the idea of an agent motivated to learn as much about the universe as possible. The beautiful thing about the knowledge-seeking agent is that it can’t self-deceive. Because if it manipulates its own inputs in any way, well, it’s just cutting off its access to the real world, and that’s where all the surprise and information comes from. So the idea is that the knowledge-seeking agent might be uniquely immune to forms of self-deception; it might be immune to the sort of escapism or retreating into virtual fantasy that other types of agents might fall into. Now, that doesn’t necessarily mean that it would be safe to build a superintelligent knowledge-seeking agent. “What’s in the core of the earth? Let’s find out.” Or “Let’s build the world’s largest telescope by harvesting the entire solar system.” So it’s not necessarily safe per se, but it conforms to this idea of wanting to keep one’s eyes open. Not wanting to self-deceive, not wanting to pull this escapist veil over one’s eyes.
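
One common way to make ‘reward for learning about the world’ concrete is to pay the agent for how much each observation shifts its beliefs. The toy below is our own illustration of that idea, not the specific formalism discussed in the book: the agent holds a Bayesian belief about an unknown coin’s bias and is rewarded with the information gain from each flip. An input the agent generates itself is already perfectly predicted, so it moves the beliefs by exactly zero, which is the intuition for why such an agent has no incentive to tamper with its own inputs.

```python
import math
import random

HYPOTHESES = [i / 10 for i in range(1, 10)]   # candidate coin biases 0.1..0.9
TRUE_BIAS = 0.7                               # unknown to the agent

def bayes_update(belief, heads):
    """Update the discrete belief over the coin's bias after seeing one flip."""
    likelihoods = [p if heads else (1 - p) for p in HYPOTHESES]
    unnorm = [b * l for b, l in zip(belief, likelihoods)]
    z = sum(unnorm)
    return [u / z for u in unnorm]

def info_gain(old, new):
    """KL(new || old): how far the observation moved the agent's beliefs (in nats)."""
    return sum(n * math.log(n / o) for n, o in zip(new, old) if n > 0)

belief = [1 / len(HYPOTHESES)] * len(HYPOTHESES)
total_reward = 0.0
for flip in range(100):
    heads = random.random() < TRUE_BIAS        # a genuine observation of the world
    new_belief = bayes_update(belief, heads)
    total_reward += info_gain(belief, new_belief)
    belief = new_belief

print(f"reward from 100 real coin flips: {total_reward:.3f} nats")

# If the agent instead "wireheads" by showing itself an input it chose (and can
# therefore predict with certainty), that input has likelihood 1 under every
# hypothesis, the belief does not move, and the information-gain reward is zero.
```

Note that the reward also dwindles as the agent’s uncertainty about the coin runs out, which fits the picture of an agent that keeps seeking out genuinely new things to learn.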

Inverse reinforcement learning

Brian Christian: If you attempt to do a physical action — say you’re reaching for a clothespin, but it’s out of reach, or you’re trying to move through a doorway, but it’s too narrow, or you’re trying to put something away, but you don’t have a free hand to open the cabinet, or whatever — children as young as, I want to say, 18 months can figure out what you’re trying to do based on your behaviour and will spontaneously walk over and help you. They’ll pick the item up off the floor and hand it to you. They’ll open the cabinet door for you. And I think this is very remarkable because this capacity develops multiple years before theory of mind.

So a child can’t even understand that you’re sensorily perceiving something that’s different to what it’s perceiving. It doesn’t know that you believe things that are different to what it believes. But it can still figure out that you want something, and it can try to help you. So I think that’s very remarkable. It’s a very deeply rooted capacity. So this is another one of these areas where we’re trying to do something like that to get human normativity into machines.

This is broadly known as inverse reinforcement learning. The reason it’s inverse is that reinforcement learning is about, given some reward, some scheme for assigning points in an environment, how do you develop a behaviour to maximise those points? Inverse reinforcement learning goes the other way. It says, given some behaviour that you’re observing — which is presumably maximising some points — what are the points that it’s maximising?
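
To make the ‘inverse’ direction concrete, here is a deliberately tiny sketch of our own (not any particular algorithm discussed in the book): we watch an agent repeatedly choose between three snacks, assume its choices are softmax-rational with respect to some hidden per-snack reward, and then recover those rewards by maximum likelihood.

```python
import math
import random

SNACKS = ["apple", "cookie", "kale"]
TRUE_REWARD = {"apple": 1.0, "cookie": 2.0, "kale": -1.0}   # hidden from the observer

def softmax_probs(reward):
    exps = {s: math.exp(reward[s]) for s in SNACKS}
    z = sum(exps.values())
    return {s: e / z for s, e in exps.items()}

# 1. "Forward" direction: behaviour is generated by the (hidden) true reward.
probs = softmax_probs(TRUE_REWARD)
demos = random.choices(SNACKS, weights=[probs[s] for s in SNACKS], k=2000)
counts = {s: demos.count(s) for s in SNACKS}

# 2. "Inverse" direction: find a reward that makes the observed behaviour likely,
#    by gradient ascent on the log-likelihood under the softmax-rational model.
est = {s: 0.0 for s in SNACKS}
lr = 0.5
for _ in range(2000):
    p = softmax_probs(est)
    for s in SNACKS:
        grad = counts[s] / len(demos) - p[s]   # d(avg log-likelihood)/d(est[s])
        est[s] += lr * grad

print({s: round(est[s], 2) for s in SNACKS})
# Note: rewards are only recovered up to an additive constant, and only because we
# assumed a model of how behaviour relates to reward (classic IRL caveats).
```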

Inverse reward design

Brian Christian: The basic idea here is that even in a situation where you explicitly give the system a reward function that says “Doing this is worth 100 points, doing this is worth -5 points, have at it,” even in that case, the system should still take that reward function as mere evidence about what you really want, rather than as the word from on high chiseled into stone tablets of what it should really do.

So there should be some kind of inferential process of saying, okay, this is the set of scenarios that the designer of the system had in mind when they created this reward function, but here’s a set of scenarios outside of that distribution in which this reward function might do some weird stuff. Or there might be some kind of implicit ambiguity that wasn’t important to resolve in that one set of environments, but now we really need to get clear on what exactly you want me to do. So I wouldn’t be surprised to see something like that end up getting baked into the models of the future: even when presented with an explicit operational objective, they’ll still say to you, “Now wait a minute, just to be clear, here’s a situation where I’m not sure what to do, let’s go back to the drawing board for a second.”

And this can get surprisingly delicate, because the system has to model your, as it were, irrationality. There may be cases where the system overrides you for your own good, and that’s the right thing to do. One example would be you accidentally bumping the autopilot disengage button with your elbow when you’re reaching to get a drink out of the cup holder. It’s probably good that the car has some model of, okay, you’re not holding the steering wheel, so I’m pretty sure that this is not really what you want me to do. I think that’s interesting because a lot of the pre-existing horror stories that we have about AGI have that Kubrick aspect of, “Open the pod bay doors, HAL,” “I’m sorry, I can’t do that” — that sort of disobedience. But in this case you bump the autopilot disengage, and not disengaging probably is correct. So there’s a bit of a tightrope act to be done in figuring out how to adjudicate when the AI’s model of your preferences diverges from your behaviour. It’s not totally simple.
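
A stripped-down sketch of the inferential move Brian describes, treating the stated reward as evidence rather than as stone tablets, might look like the toy below. This is our own simplification, much cruder than the actual inverse reward design formulation: candidate true rewards are weighted by how well behaviour that is optimal under the proxy in the training environment would score under them, and features that never appeared in training (lava, in this toy) receive no evidence either way, so the agent remains uncertain about them and a cautious planner steers clear.

```python
import math
from itertools import product

FEATURES = ["dirt", "grass", "lava"]
PROXY = {"dirt": 1.0, "grass": 5.0, "lava": 0.0}   # the designer never thought about lava

# In the training environment only dirt and grass exist.  Behaviour that is
# optimal under the proxy spends all 10 steps on grass.
phi_train = {"dirt": 0.0, "grass": 10.0, "lava": 0.0}

# Candidate "true" rewards: grass weight in {0, 5}, lava weight in {-10, 0}.
candidates = [{"dirt": 1.0, "grass": g, "lava": l}
              for g, l in product([0.0, 5.0], [-10.0, 0.0])]

# Weight each candidate by how well the proxy-optimal training behaviour scores
# under it (a simplified likelihood; the real method also normalises over the
# proxies the designer could have chosen), starting from a uniform prior.
BETA = 1.0
unnorm = [math.exp(BETA * sum(w[f] * phi_train[f] for f in FEATURES)) for w in candidates]
z = sum(unnorm)
posterior = [u / z for u in unnorm]

for w, p in zip(candidates, posterior):
    print(f"grass={w['grass']:+.0f}  lava={w['lava']:+.0f}  posterior={p:.3f}")

# Output: the grass weight is pinned down by the evidence, but the two lava values
# end up with equal posterior mass.  A planner that respects that uncertainty
# (e.g. maximising worst-case posterior reward) will avoid lava at test time.
```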

Why Brian aligns so closely with Dario Amodei

Brian Christian: When I started working on this book, I had this hunch that the technical AI safety agenda and the sort of fairness/accountability/transparency agenda were really part of the same project: that it’s this question of, we want to make our ML systems do what we want. That affects the six-parameter linear classifiers that do parole recommendations, and it also affects the 100 billion-parameter language models, but it’s kind of the same project. That view was not widely held at the time. In fact, it was pretty polarising, I would say. About half the people that I talked to agreed and half disagreed. So Dario was one person who, when I floated that hypothesis very early on, was like, “Yeah, it’s underappreciated that even at a technical level, these are really intimately related problems.”

I think that view has aged well, not to be too immodest about it; more people have come to think that than have gone the other way. And I think his Concrete Problems paper — again, this is very early, seminal stuff — showed that what the people thinking more abstractly about AI safety were worried about could all be cashed out in the language of actual ML problems that were essentially shovel-ready for the ML community. We can work on robustness to distributional shift. We can work on transparency and explainability. That was an intuition that I also shared. And I think there’s also… Within the community, opinions differ on whether AGI is coming by way of the standard kind of deep learning ML regime, or whether there’s going to be some paradigm shift. And I think of him and others, obviously, as part of the camp that’s saying, “No, I think what’s coming is coming in familiar terms. It’s not going to be some unimaginable other thing.”

About the show

The 80,000 Hours Podcast features unusually in-depth conversations about the world's most pressing problems and how you can use your career to solve them. We invite guests pursuing a wide range of career paths — from academics and activists to entrepreneurs and policymakers — to analyse the case for and against working on different issues and which approaches are best for solving them.

The 80,000 Hours Podcast is produced and edited by Keiran Harris. Get in touch with feedback or guest suggestions by emailing [email protected].

What should I listen to first?

We've carefully selected 10 episodes we think it could make sense to listen to first, on a separate podcast feed:

Check out 'Effective Altruism: An Introduction'

If you're new, see the podcast homepage for ideas on where to start, or browse our full episode archive.