#80 – Stuart Russell on the flaws that make today’s AI architecture unsafe, and a new approach that could fix them

Stuart Russell, Professor at UC Berkeley and co-author of the most popular AI textbook, Artificial Intelligence: A Modern Approach, thinks the way we approach machine learning today is fundamentally flawed.

In his new book, Human Compatible, he outlines the ‘standard model’ of AI development, in which intelligence is measured as the ability to achieve some definite, completely known objective that we’ve stated explicitly. This is so obvious it almost doesn’t even seem like a design choice, but it is.

Unfortunately there’s a big problem with this approach: it’s incredibly hard to say exactly what you want. AI today lacks common sense, and simply does whatever we’ve asked it to. That’s true even if the goal isn’t what we really want, or the methods it’s choosing are ones we would never accept.

We already see AIs misbehaving for this reason. Stuart points to the example of YouTube’s recommender algorithm, which reportedly nudged users towards extreme political views because that made it easier to keep them on the site. This isn’t something we wanted, but it helped achieve the algorithm’s objective: maximise viewing time.

Like King Midas, who asked to be able to turn everything into gold but ended up unable to eat, we get too much of what we’ve asked for.

This ‘alignment’ problem will get more and more severe as machine learning is embedded in more and more places: recommending us news, operating power grids, deciding prison sentences, doing surgery, and fighting wars. If we’re ever to hand over much of the economy to thinking machines, we can’t count on ourselves correctly saying exactly what we want the AI to do every time.

Stuart isn’t just dissatisfied with the current model, though: he has a specific solution. According to him, we need to redesign AI around three principles:

  1. The AI system’s objective is to achieve what humans want.
  2. But the system isn’t sure what we want.
  3. And it figures out what we want by observing our behaviour.

Stuart thinks this design architecture, if implemented, would be a big step forward towards reliably beneficial AI.

For instance, a machine built on these principles would be happy to be turned off if that’s what its owner thought was best, while one built on the standard model has an incentive to resist being turned off, because being deactivated would prevent it from achieving its goal. As Stuart says, “you can’t fetch the coffee if you’re dead.”
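To make that intuition concrete, here is a toy numerical sketch of the incentive (our illustration, not an example from the episode; the payoffs are made up). A machine that is certain its plan is what we want just executes it, while a machine that is uncertain, and treats our decision to switch it off as evidence about our preferences, does at least as well in expectation by deferring:

```python
# Toy "off switch" calculation (illustrative only; all payoffs are assumed).
# The robot is unsure whether its planned action is what the human wants:
# with probability p it is worth +1 to the human, otherwise it is worth -2.

def expected_value_act(p, good=1.0, bad=-2.0):
    """Robot certain of its objective: acts unilaterally, ignoring the human."""
    return p * good + (1 - p) * bad

def expected_value_defer(p, good=1.0, bad=-2.0):
    """Robot uncertain of its objective: proposes the action and lets the human
    decide, assuming the human allows it when it is good and switches the robot
    off (value 0) when it is bad."""
    return p * good + (1 - p) * 0.0

for p in (0.9, 0.6, 0.3):
    print(f"p={p}: act={expected_value_act(p):+.2f}, defer={expected_value_defer(p):+.2f}")

# Deferring is never worse, and is strictly better whenever the robot might be
# wrong -- so a machine that is uncertain about our preferences has a positive
# reason to leave the off switch alone.
```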

These principles lend themselves to machines that are modest and cautious, and that check in when they aren’t confident they’re truly achieving what we want.

We’ve made progress toward putting these principles into practice, but the remaining engineering problems are substantial. Among other things, the resulting AIs need to be able to interpret what people really mean to say based on the context of a situation. And they need to work out whether we’ve rejected an option because we considered it and decided it was a bad idea, or because we simply haven’t thought about it at all.

Stuart thinks all of these problems are surmountable, if we put in the work. The harder problems may end up being social and political.

When each of us can have an AI of our own — one smarter than any person — how do we resolve conflicts between people and their AI agents? How considerate of other people’s interests do we expect AIs to be? How do we avoid them being used in malicious or anti-social ways?

And if AIs end up doing most work that people do today, how can humans avoid becoming enfeebled, like lazy children tended to by machines, but not intellectually developed enough to know what they really want?

Despite all these problems, the rewards of success could be enormous. If cheap thinking machines can one day do most of the work people do now, it could dramatically raise everyone’s standard of living, like a second industrial revolution.

Without having to work just to survive, people might flourish in ways they never have before.

In today’s conversation we cover, among many other things:

  • What are the arguments against being concerned about AI?
  • Should we develop AIs to have their own ethical agenda?
  • What are the most urgent research questions in this area?

Get this episode by subscribing to our podcast on the world’s most pressing problems and how to solve them: type 80,000 Hours into your podcasting app. Or read the transcript below.

Producer: Keiran Harris.
Audio mastering: Ben Cordell.
Transcriptions: Zakee Ulhaq.

Highlights

Purely altruistic machines

The principle says that the machine’s only purpose is the realization of human preferences. So that actually has some kind of specific technical content in it. For example, if you look at, in comparison, Asimov’s laws, he says the machine should preserve its own existence. That’s the third law. And he’s got a caveat saying only if that doesn’t conflict with the first two laws. But in fact it’s strictly unnecessary because the reason why you want the machine to preserve its own existence is not some misplaced sense of concern for the machine’s feelings or anything like that. The reason should be because its existence is beneficial to humans. And so the first principle already encompasses the obligation to keep yourself in functioning order so that you can be helping humans satisfy their preferences.

So there’s a lot you could write just about that. It seems like “motherhood and apple pie”: of course machines should be good for human beings, right? What else would they be? But already that’s a big step because the standard model doesn’t say they should be good for human beings at all. The standard model just says they should optimize the objective and if the objective isn’t good for human beings, the standard model doesn’t care. So just the first principle would include the fact that human beings in the long run do not want to be enfeebled. They don’t want to be overly dependent on machines to the extent that they lose their own capabilities and their own autonomy and so on. So people ask, “Isn’t your approach going to eliminate human autonomy”?

But of course, no. A properly designed machine would only intervene to the extent that human autonomy is preserved. And so sometimes it would say, “No, I’m not going to help you tie your shoelaces. You have to tie your shoelaces yourself” just as parents do at some point with the child. It’s time for the parents to stop tying the child’s shoelaces and let the child figure it out and get on with it.

Humble machines

The second principle is that machines are going to be uncertain about the human preferences that they’re supposed to be optimizing or realizing. And that’s not so much a principle; it’s just a statement of fact. That’s the distinction that separates this revised model from the standard model of AI. And it’s really the piece that brings about the safety consequences of this model. If machines are certain about the objective, then you get all these undesirable consequences: the paperclip optimizer, et cetera, where the machine pursues its objective in an optimal fashion, regardless of anything we might say. So we can say, you know, “Stop, you’re destroying the world”! And the machine says, “But I’m just carrying out the optimal plan for the objective that’s put in me”. And the machine doesn’t have to be thinking, “Okay, well the human put these orders into me; what are they”? It’s just that the objective is the constitution of the machine.

And if you look at the agents that we train with reinforcement learning, for example, depending on what type of agent they are, if it’s a Q-learning agent or a policy search agent, which are two of the more popular kinds of reinforcement learning, they don’t even have a representation of the objective at all. There’s just a training process where the reward signal is supplied by the reinforcement learning framework. So that reward signal is defining the objective that the machine is going to optimize. But the machine doesn’t even know what the objective is. It’s just an optimizer of that objective. And so there’s no sense in which that machine could say, “Oh, I wonder if my objective is the wrong one” or anything like that.
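To illustrate that point, here is a minimal tabular Q-learning sketch (our illustration, not code discussed in the episode; hyperparameters and structure are assumed). The reward is just a number handed to the agent from outside; nowhere does the agent store or inspect a description of what the objective actually is:

```python
import random
from collections import defaultdict

# Minimal tabular Q-learning agent (illustrative sketch only).
# The agent only ever sees scalar rewards supplied by the training framework --
# it holds no explicit representation of the objective it is optimizing.

class QLearner:
    def __init__(self, actions, alpha=0.1, gamma=0.95, epsilon=0.1):
        self.q = defaultdict(float)        # (state, action) -> estimated value
        self.actions = actions
        self.alpha, self.gamma, self.epsilon = alpha, gamma, epsilon

    def choose(self, state):
        # Epsilon-greedy: mostly take the action with the highest estimated value.
        if random.random() < self.epsilon:
            return random.choice(self.actions)
        return max(self.actions, key=lambda a: self.q[(state, a)])

    def update(self, state, action, reward, next_state):
        # 'reward' is whatever number the framework supplies; the agent has no
        # way to ask whether that number encodes the "right" objective.
        best_next = max(self.q[(next_state, a)] for a in self.actions)
        target = reward + self.gamma * best_next
        self.q[(state, action)] += self.alpha * (target - self.q[(state, action)])
```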

Learning to predict human preferences

One way to do inverse reinforcement learning is Bayesian IRL, where you start with a prior, and then the evidence from the behavior you observe updates your prior, and eventually you get a pretty good idea of what it is the entity or person is trying to do. It’s a very natural thing that people do all the time. You see someone doing something and most of the time it just feels like you directly perceive what they’re doing, right? I mean, you see someone go up to the ATM and press some buttons and take the money. It’s just like that’s what they’re doing. They’re getting money out of the ATM. I describe the behavior in this purposive form.

I don’t describe it in terms of the physical trajectory of their legs and arms and hands and so on. I describe it as, you know, the action is something that’s purpose-fulfilling. So we perceive it directly. And then sometimes you could be wrong, right? They could be trying to steal money from the ATM by some special code key sequence that they’ve figured out. Or they could be acting in a movie. So if you saw them take a few steps back and then do the whole thing again, you might wonder, “Oh, that’s funny. What are they doing? Maybe they’re trying to get out more money than the limit they can get on each transaction”? And then if you saw someone with a camera filming them, you would say, “Oh, okay, I see now what they’re doing. They’re not getting money from the ATM at all. They are acting in a movie”.

So it’s just absolutely completely natural for human beings to interpret our perceptions in terms of purpose. In conversation, you’re always trying to figure out “Why is someone saying that”? Are they asking me a question? Is it a rhetorical question? It’s so natural, it’s subconscious a lot of the time. So there are many different forms of interaction that could take place that would provide information to machines about human preferences. For example, just reading books provides information about human preferences: about the preferences of the individuals, but also about humans in general.

One of the ways that we learn about other humans is by reading novels and seeing the choices of the characters. And sometimes you get direct insight into their motivations depending on whether the author wants to give you that. Sometimes you have to figure it out. So I think that there’s a wealth of information from which machines could build a general prior about human preferences. And then as you interact with an individual, you refine that prior. You find out that they’re a vegan. You find out that they voted for President Trump. You try to resolve these two contradictory facts. And then you gradually build up a more specific model for that particular individual.
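As a rough illustration of the Bayesian update Stuart describes at the start of this highlight, here is a toy sketch of inferring purpose from behaviour, using his ATM scene (our illustration, with made-up hypotheses and likelihoods; real Bayesian IRL works over reward functions and full trajectories):

```python
# Toy Bayesian inference over someone's goal at an ATM (all numbers assumed).
# Start with a prior over hypotheses, then reweight by how well each hypothesis
# explains each new observation, and renormalise (Bayes' rule).

prior = {"withdrawing cash": 0.6, "stealing from the ATM": 0.1, "acting in a film": 0.3}

# P(observation | hypothesis), assumed purely for illustration.
likelihood = {
    "presses buttons and takes money": {"withdrawing cash": 0.90, "stealing from the ATM": 0.50, "acting in a film": 0.40},
    "steps back and repeats it all":   {"withdrawing cash": 0.10, "stealing from the ATM": 0.40, "acting in a film": 0.80},
    "someone nearby is filming":       {"withdrawing cash": 0.05, "stealing from the ATM": 0.05, "acting in a film": 0.90},
}

def update(beliefs, observation):
    """One Bayes-rule step: reweight each hypothesis, then renormalise."""
    posterior = {h: p * likelihood[observation][h] for h, p in beliefs.items()}
    total = sum(posterior.values())
    return {h: p / total for h, p in posterior.items()}

beliefs = prior
for obs in ["presses buttons and takes money",
            "steps back and repeats it all",
            "someone nearby is filming"]:
    beliefs = update(beliefs, obs)
    print(obs, {h: round(p, 2) for h, p in beliefs.items()})

# Each observation shifts probability toward "acting in a film", mirroring the
# reinterpretation Stuart walks through above.
```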

Enfeeblement problem

Go all the way to, you know, the children who are raised by wolves or whatever. The outcome seems to be that, “Oh my gosh, if they’re abandoned in the woods as infants and somehow they survive and grow up, they don’t speak Latin”. They don’t speak at all. And they have some survival skills, but are they writing poetry? Are they trying to learn more about physics? No, they’re not doing any of those things. So there’s nothing natural about, shall we say, scientific curiosity. It’s something that’s emerged over thousands of years of culture.

So we have to think about what kind of culture we need in order to produce adults who retain curiosity and autonomy and vigor, as opposed to just becoming institutionalized. I think if you look at E. M. Forster’s story “The Machine Stops”, that’s a pretty good exploration of this. Everyone in his story is looked after. No one has any kind of useful job. In fact, the most useful thing they can think of is to listen to MOOCs. So he invented the MOOC in 1909: people are giving open online lectures to anyone who wants to listen, and then people subscribe to various podcast series, I guess you’d call them. And that’s kind of all they do. There’s very little actual purposeful activity left for the human race. And this is not desirable; to me, this is a disaster. We could destroy ourselves with nuclear weapons. We could wipe out the habitable biosphere with climate change. These would be disasters, but this is another disaster, right?

A future where the human race has lost purpose, where the vast majority of individuals function with very little autonomy or awareness or knowledge or learning. So how do you create a culture and educational process? I think what humans value in themselves is a really important thing. How do you make it so that people make the effort to learn and discover and gain autonomy and skills when all of the incentive to do that up to now disappears? And our whole education system is very expensive. As I point out in the book, when you add up how much time people have spent learning to be competent human beings, it’s about a trillion person-years, and it’s all because you have to. Otherwise things just completely fall apart. And we’ve internalized that in our whole system of how we reward people. We give them grades. We give them accolades. We give them Nobel prizes. There’s an enormous amount in our culture which is there to reward the process of learning and becoming competent and skilled.

And you could argue, “Well that’s from the Enlightenment” or whatever. But I would argue it’s mostly a consequence of the fact that that’s functional. And when the functional purpose of all that disappears, I think we might see it decay very rapidly unless we take steps to avoid it.

AI moral rights

Stuart Russell: If they really do have subjective experience, and putting aside whether or not we would ever know, putting aside the fact that if they do, it’s probably completely unlike any kind of subjective experience that humans have or even that animals have because it’s being produced by a totally different computational architecture as well as a totally different physical architecture. But even if we put all that to one side, it seems to me that if they are actually having subjective experience, then we do have a real problem and it does affect the calculation in some sense. It might say actually then we really can’t proceed with this enterprise at all, because I think we have to retain control from our own point of view. But if that implies inflicting unlimited suffering on sentient beings, then it would seem like, well, we can’t go that route at all. Again, there’s no analogues, right? It’s not exactly like inviting a superior alien species to come and be our slaves forever, but it’s sort of like that.

Robert Wiblin: I suppose if you didn’t want to give up on the whole enterprise, you could try to find a way to design them so that they weren’t conscious at all. Or I suppose alternatively you could design them so that they are just extremely happy whenever human preferences are satisfied. So it’s kind of a win-win.

Stuart Russell: Yeah. If we understood enough about the mechanics of their consciousness, that’s a possibility. But again, even that doesn’t seem right.

Robert Wiblin: Because they lack autonomy?

Stuart Russell: I mean, we wouldn’t want that fate for a human being. That we give them some happy drugs so that they’re happy being our servants forever and having no freedom. You know, it’s sort of the North Korea model almost. We find that pretty objectionable.

About the show

The 80,000 Hours Podcast features unusually in-depth conversations about the world's most pressing problems and how you can use your career to solve them. We invite guests pursuing a wide range of career paths — from academics and activists to entrepreneurs and policymakers — to analyse the case for and against working on different issues and which approaches are best for solving them.

The 80,000 Hours Podcast is produced and edited by Keiran Harris. Get in touch with feedback or guest suggestions by emailing [email protected].
