#80 – Stuart Russell on the flaws that make today’s AI architecture unsafe, and a new approach that could fix them

Stuart Russell, Professor at UC Berkeley and co-author of the most popular AI textbook, Artificial Intelligence: A Modern Approach, thinks the way we approach machine learning today is fundamentally flawed.

In his new book, Human Compatible, he outlines the ‘standard model’ of AI development, in which intelligence is measured as the ability to achieve some definite, completely known objective that we’ve stated explicitly. This is so obvious it almost doesn’t even seem like a design choice, but it is.

Unfortunately there’s a big problem with this approach: it’s incredibly hard to say exactly what you want. AI today lacks common sense, and simply does whatever we’ve asked it to. That’s true even if the goal isn’t what we really want, or the methods it’s choosing are ones we would never accept.

We already see AIs misbehaving for this reason. Stuart points to the example of YouTube’s recommender algorithm, which reportedly nudged users towards extreme political views because that made it easier to keep them on the site. This isn’t something we wanted, but it helped achieve the algorithm’s objective: maximise viewing time.

Like King Midas, who asked to be able to turn everything into gold but ended up unable to eat, we get too much of what we’ve asked for.

This ‘alignment’ problem will get more and more severe as machine learning is embedded in more and more places: recommending us news, operating power grids, deciding prison sentences, doing surgery, and fighting wars. If we’re ever to hand over much of the economy to thinking machines, we can’t count on ourselves correctly saying exactly what we want the AI to do every time.

Stuart isn’t just dissatisfied with the current model, though: he has a specific solution. According to him, we need to redesign AI around three principles:

  1. The AI system’s objective is to achieve what humans want.
  2. But the system isn’t sure what we want.
  3. And it figures out what we want by observing our behaviour.

Stuart thinks this design architecture, if implemented, would be a big step forward towards reliably beneficial AI.

For instance, a machine built on these principles would be happy to be turned off if that’s what its owner thought was best, while one built on the standard model has an incentive to resist being turned off, because being deactivated would prevent it from achieving its goal. As Stuart says, “you can’t fetch the coffee if you’re dead.”
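To make that intuition concrete, here is a toy numerical sketch of the incentive (our illustration, not an example from the episode; the payoffs are made up). A machine that is certain its plan is what we want just executes it, while a machine that is uncertain, and treats our decision to switch it off as evidence about our preferences, does at least as well in expectation by deferring:

```python
# Toy "off switch" calculation (illustrative only; all payoffs are assumed).
# The robot is unsure whether its planned action is what the human wants:
# with probability p it is worth +1 to the human, otherwise it is worth -2.

def expected_value_act(p, good=1.0, bad=-2.0):
    """Robot certain of its objective: acts unilaterally, ignoring the human."""
    return p * good + (1 - p) * bad

def expected_value_defer(p, good=1.0, bad=-2.0):
    """Robot uncertain of its objective: proposes the action and lets the human
    decide, assuming the human allows it when it is good and switches the robot
    off (value 0) when it is bad."""
    return p * good + (1 - p) * 0.0

for p in (0.9, 0.6, 0.3):
    print(f"p={p}: act={expected_value_act(p):+.2f}, defer={expected_value_defer(p):+.2f}")

# Deferring is never worse, and is strictly better whenever the robot might be
# wrong -- so a machine that is uncertain about our preferences has a positive
# reason to leave the off switch alone.
```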

These principles lend themselves to machines that are modest and cautious, and that check in when they aren’t confident they’re truly achieving what we want.

We’ve made progress toward putting these principles into practice, but the remaining engineering problems are substantial. Among other things, the resulting AIs need to be able to interpret what people really mean to say based on the context of a situation. And they need to work out whether we’ve rejected an option because we considered it and decided it was a bad idea, or because we simply haven’t thought about it at all.

Stuart thinks all of these problems are surmountable, if we put in the work. The harder problems may end up being social and political.

When each of us can have an AI of our own — one smarter than any person — how do we resolve conflicts between people and their AI agents? How considerate of other people’s interests do we expect AIs to be? How do we avoid them being used in malicious or anti-social ways?

And if AIs end up doing most work that people do today, how can humans avoid becoming enfeebled, like lazy children tended to by machines, but not intellectually developed enough to know what they really want?

Despite all these problems, the rewards of success could be enormous. If cheap thinking machines can one day do most of the work people do now, it could dramatically raise everyone’s standard of living, like a second industrial revolution.

Without having to work just to survive, people might flourish in ways they never have before.

In today’s conversation we cover, among many other things:

  • What are the arguments against being concerned about AI?
  • Should we develop AIs to have their own ethical agenda?
  • What are the most urgent research questions in this area?

Get this episode by subscribing to our podcast on the world’s most pressing problems and how to solve them: type 80,000 Hours into your podcasting app. Or read the transcript below.

Producer: Keiran Harris.
Audio mastering: Ben Cordell.
Transcriptions: Zakee Ulhaq.

Highlights

Purely altruistic machines

The principle says that the machine’s only purpose is the realization of human preferences. So that actually has some kind of specific technical content in it. For example, if you look at, in comparison, Asimov’s laws, he says the machine should preserve its own existence. That’s the third law. And he’s got a caveat saying only if that doesn’t conflict with the first two laws. But in fact it’s strictly unnecessary because the reason why you want the machine to preserve its own existence is not some misplaced sense of concern for the machine’s feelings or anything like that. The reason should be because its existence is beneficial to humans. And so the first principle already encompasses the obligation to keep yourself in functioning order so that you can be helping humans satisfy their preferences.

So there’s a lot you could write just about that. It seems like “motherhood and apple pie”: of course machines should be good for human beings, right? What else would they be? But already that’s a big step because the standard model doesn’t say they should be good for human beings at all. The standard model just says they should optimize the objective and if the objective isn’t good for human beings, the standard model doesn’t care. So just the first principle would include the fact that human beings in the long run do not want to be enfeebled. They don’t want to be overly dependent on machines to the extent that they lose their own capabilities and their own autonomy and so on. So people ask, “Isn’t your approach going to eliminate human autonomy”?

But of course, no. A properly designed machine would only intervene to the extent that human autonomy is preserved. And so sometimes it would say, “No, I’m not going to help you tie your shoelaces. You have to tie your shoelaces yourself” just as parents do at some point with the child. It’s time for the parents to stop tying the child’s shoelaces and let the child figure it out and get on with it.

Humble machines

The second principle is that machines are going to be uncertain about the human preferences that they’re supposed to be optimizing or realizing. And that’s not so much a principle; it’s just a statement of fact. That’s the distinction that separates this revised model from the standard model of AI. And it’s really the piece that brings about the safety consequences of this model. If machines are certain about the objective, then you get all these undesirable consequences: the paperclip optimizer, et cetera, where the machine pursues its objective in an optimal fashion, regardless of anything we might say. So we can say, you know, “Stop, you’re destroying the world”! And the machine says, “But I’m just carrying out the optimal plan for the objective that’s put in me”. And the machine doesn’t have to be thinking, “Okay, well the human put these orders into me; what are they”? It’s just that the objective is the constitution of the machine.

And if you look at the agents that we train with reinforcement learning, for example, depending on what type of agent they are, if it’s a Q-learning agent or a policy search agent, which are two of the more popular kinds of reinforcement learning, they don’t even have a representation of the objective at all. There’s just a training process where the reward signal is supplied by the reinforcement learning framework. So that reward signal is defining the objective that the machine is going to optimize. But the machine doesn’t even know what the objective is. It’s just an optimizer of that objective. And so there’s no sense in which that machine could say, “Oh, I wonder if my objective is the wrong one” or anything like that.
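To illustrate that point, here is a minimal tabular Q-learning sketch (our illustration, not code discussed in the episode; hyperparameters and structure are assumed). The reward is just a number handed to the agent from outside; nowhere does the agent store or inspect a description of what the objective actually is:

```python
import random
from collections import defaultdict

# Minimal tabular Q-learning agent (illustrative sketch only).
# The agent only ever sees scalar rewards supplied by the training framework --
# it holds no explicit representation of the objective it is optimizing.

class QLearner:
    def __init__(self, actions, alpha=0.1, gamma=0.95, epsilon=0.1):
        self.q = defaultdict(float)        # (state, action) -> estimated value
        self.actions = actions
        self.alpha, self.gamma, self.epsilon = alpha, gamma, epsilon

    def choose(self, state):
        # Epsilon-greedy: mostly take the action with the highest estimated value.
        if random.random() < self.epsilon:
            return random.choice(self.actions)
        return max(self.actions, key=lambda a: self.q[(state, a)])

    def update(self, state, action, reward, next_state):
        # 'reward' is whatever number the framework supplies; the agent has no
        # way to ask whether that number encodes the "right" objective.
        best_next = max(self.q[(next_state, a)] for a in self.actions)
        target = reward + self.gamma * best_next
        self.q[(state, action)] += self.alpha * (target - self.q[(state, action)])
```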

Learning to predict human preferences

One way to do inverse reinforcement learning is Bayesian IRL, where you start with a prior, and then the evidence from the behavior you observe updates your prior, and eventually you get a pretty good idea of what it is the entity or person is trying to do. It’s a very natural thing that people do all the time. You see someone doing something and most of the time it just feels like you directly perceive what they’re doing, right? I mean, you see someone go up to the ATM and press some buttons and take the money. It’s just like that’s what they’re doing. They’re getting money out of the ATM. I describe the behavior in this purposive form.

I don’t describe it in terms of the physical trajectory of their legs and arms and hands and so on. I describe it as, you know, the action is something that’s purpose-fulfilling. So we perceive it directly. And then sometimes you could be wrong, right? They could be trying to steal money from the ATM by some special code key sequence that they’ve figured out. Or they could be acting in a movie. So if you saw them take a few steps back and then do the whole thing again, you might wonder, “Oh, that’s funny. What are they doing? Maybe they’re trying to get out more money than the limit they can get on each transaction”? And then if you saw someone with a camera filming them, you would say, “Oh, okay, I see now what they’re doing. They’re not getting money from the ATM at all. They are acting in a movie”.

So it’s just absolutely completely natural for human beings to interpret our perceptions in terms of purpose. In conversation, you’re always trying to figure out “Why is someone saying that”? Are they asking me a question? Is it a rhetorical question? It’s so natural, it’s subconscious a lot of the time. So there are many different forms of interaction that could take place that would provide information to machines about human preferences. For example, just reading books provides information about human preferences: about the preferences of the individuals, but also about humans in general.

One of the ways that we learn about other humans is by reading novels and seeing the choices of the characters. And sometimes you get direct insight into their motivations depending on whether the author wants to give you that. Sometimes you have to figure it out. So I think that there’s a wealth of information from which machines could build a general prior about human preferences. And then as you interact with an individual, you refine that prior. You find out that they’re a vegan. You find out that they voted for President Trump. You try to resolve these two contradictory facts. And then you gradually build up a more specific model for that particular individual.
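As a rough illustration of the Bayesian update Stuart describes at the start of this highlight, here is a toy sketch of inferring purpose from behaviour, using his ATM scene (our illustration, with made-up hypotheses and likelihoods; real Bayesian IRL works over reward functions and full trajectories):

```python
# Toy Bayesian inference over someone's goal at an ATM (all numbers assumed).
# Start with a prior over hypotheses, then reweight by how well each hypothesis
# explains each new observation, and renormalise (Bayes' rule).

prior = {"withdrawing cash": 0.6, "stealing from the ATM": 0.1, "acting in a film": 0.3}

# P(observation | hypothesis), assumed purely for illustration.
likelihood = {
    "presses buttons and takes money": {"withdrawing cash": 0.90, "stealing from the ATM": 0.50, "acting in a film": 0.40},
    "steps back and repeats it all":   {"withdrawing cash": 0.10, "stealing from the ATM": 0.40, "acting in a film": 0.80},
    "someone nearby is filming":       {"withdrawing cash": 0.05, "stealing from the ATM": 0.05, "acting in a film": 0.90},
}

def update(beliefs, observation):
    """One Bayes-rule step: reweight each hypothesis, then renormalise."""
    posterior = {h: p * likelihood[observation][h] for h, p in beliefs.items()}
    total = sum(posterior.values())
    return {h: p / total for h, p in posterior.items()}

beliefs = prior
for obs in ["presses buttons and takes money",
            "steps back and repeats it all",
            "someone nearby is filming"]:
    beliefs = update(beliefs, obs)
    print(obs, {h: round(p, 2) for h, p in beliefs.items()})

# Each observation shifts probability toward "acting in a film", mirroring the
# reinterpretation Stuart walks through above.
```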

Enfeeblement problem

Go all the way to, you know, the children who are raised by wolves or whatever. The outcome seems to be that, “Oh my gosh, if they’re abandoned in the woods as infants and somehow they survive and grow up, they don’t speak Latin”. They don’t speak at all. And they have some survival skills, but are they writing poetry? Are they trying to learn more about physics? No, they’re not doing any of those things. So there’s nothing natural about, shall we say, scientific curiosity. It’s something that’s emerged over thousands of years of culture.

So we have to think about what kind of culture we need in order to produce adults who retain curiosity and autonomy and vigor, as opposed to just becoming institutionalized. I think if you look at E. M. Forster’s story “The Machine Stops”, that’s a pretty good exploration of this. Everyone in his story is looked after. No one has any kind of useful job. In fact, the most useful thing they can think of is to listen to MOOCs. So he invented the MOOC in 1909: people are giving open online lectures to anyone who wants to listen, and then people subscribe to various podcast series, I guess you’d call them. And that’s kind of all they do. There’s very little actual purposeful activity left for the human race. And this is not desirable; to me, this is a disaster. We could destroy ourselves with nuclear weapons. We could wipe out the habitable biosphere with climate change. These would be disasters, but this is another disaster, right?

A future where the human race has lost purpose, where the vast majority of individuals function with very little autonomy or awareness or knowledge or learning. So how do you create a culture and educational process? I think what humans value in themselves is a really important thing. How do you make it so that people make the effort to learn and discover and gain autonomy and skills when all of the incentive to do that up to now disappears? And our whole education system is very expensive. As I point out in the book, when you add up how much time people have spent learning to be competent human beings, it’s about a trillion person-years, and it’s all because you have to. Otherwise things just completely fall apart. And we’ve internalized that in our whole system of how we reward people. We give them grades. We give them accolades. We give them Nobel prizes. There’s an enormous amount in our culture which is there to reward the process of learning and becoming competent and skilled.

And you could argue, “Well that’s from the Enlightenment” or whatever. But I would argue it’s mostly a consequence of the fact that that’s functional. And when the functional purpose of all that disappears, I think we might see it decay very rapidly unless we take steps to avoid it.

AI moral rights

Stuart Russell: If they really do have subjective experience, and putting aside whether or not we would ever know, putting aside the fact that if they do, it’s probably completely unlike any kind of subjective experience that humans have or even that animals have because it’s being produced by a totally different computational architecture as well as a totally different physical architecture. But even if we put all that to one side, it seems to me that if they are actually having subjective experience, then we do have a real problem and it does affect the calculation in some sense. It might say actually then we really can’t proceed with this enterprise at all, because I think we have to retain control from our own point of view. But if that implies inflicting unlimited suffering on sentient beings, then it would seem like, well, we can’t go that route at all. Again, there’s no analogues, right? It’s not exactly like inviting a superior alien species to come and be our slaves forever, but it’s sort of like that.

Robert Wiblin: I suppose if you didn’t want to give up on the whole enterprise, you could try to find a way to design them so that they weren’t conscious at all. Or I suppose alternatively you could design them so that they are just extremely happy whenever human preferences are satisfied. So it’s kind of a win-win.

Stuart Russell: Yeah. If we understood enough about the mechanics of their consciousness, that’s a possibility. But again, even that doesn’t seem right.

Robert Wiblin: Because they lack autonomy?

Stuart Russell: I mean, we wouldn’t want that fate for a human being. That we give them some happy drugs so that they’re happy being our servants forever and having no freedom. You know, it’s sort of the North Korea model almost. We find that pretty objectionable.

About the show

The 80,000 Hours Podcast features unusually in-depth conversations about the world's most pressing problems and how you can use your career to solve them. We invite guests pursuing a wide range of career paths — from academics and activists to entrepreneurs and policymakers — to analyse the case for and against working on different issues and which approaches are best for solving them.

The 80,000 Hours Podcast is produced and edited by Keiran Harris. Get in touch with feedback or guest suggestions by emailing [email protected].
