#151 – Ajeya Cotra on accidentally teaching AI models to deceive us

Imagine you are an orphaned eight-year-old whose parents left you a $1 trillion company, and no trusted adult to serve as your guide to the world. You have to hire a smart adult to run that company, guide your life the way that a parent would, and administer your vast wealth. You have to hire that adult based on a work trial or interview you come up with. You don’t get to see any resumes or do reference checks. And because you’re so rich, tonnes of people apply for the job — for all sorts of reasons.

Today’s guest Ajeya Cotra — senior research analyst at Open Philanthropy — argues that this peculiar setup resembles the situation humanity finds itself in when training very general and very capable AI models using current deep learning methods.

As she explains, such an eight-year-old faces a challenging problem. In the candidate pool there are likely some truly nice people, who sincerely want to help and make decisions that are in your interest. But there are probably other characters too — like people who will pretend to care about you while you’re monitoring them, but intend to use the job to enrich themselves as soon as they think they can get away with it.

Like a child trying to judge adults, at some point humans will be required to judge the trustworthiness and reliability of machine learning models that are as goal-oriented as people, and greatly outclass them in knowledge, experience, breadth, and speed. Tricky!

Can’t we rely on how well models have performed at tasks during training to guide us? Ajeya worries that it won’t work. The trouble is that three different sorts of models will all produce the same output during training, but could behave very differently once deployed in a setting that allows their true colours to come through. She describes three such motivational archetypes:

  • Saints — models that care about doing what we really want
  • Sycophants — models that just want us to say they’ve done a good job, even if they get that praise by taking actions they know we wouldn’t want them to
  • Schemers — models that don’t care about us or our interests at all, who are just pleasing us so long as that serves their own agenda

In principle, a machine learning training process based on reinforcement learning could spit out any of these three attitudes, because all three would perform roughly equally well on the tests we give them, and ‘performs well on tests’ is how these models are selected.

But while that’s true in principle, maybe it’s not something that could plausibly happen in the real world. After all, if we train an agent based on positive reinforcement for accomplishing X, shouldn’t the training process spit out a model that plainly does X and doesn’t have complex thoughts and goals beyond that?

According to Ajeya, this is one thing we don’t know, and should be trying to test empirically as these models get more capable. For reasons she explains in the interview, the Sycophant or Schemer models may in fact be simpler and easier for the learning algorithm to creep towards than their Saint counterparts.

But there are also ways we could end up actively selecting for motivations that we don’t want.

For a toy example, let’s say you train an agent AI model to run a small business, and select it for behaviours that make money, measuring its success by whether it manages to get more money in its bank account. During training, a highly capable model may experiment with the strategy of tricking its raters into thinking it has made money legitimately when it hasn’t. Maybe instead it steals some money and covers that up. This isn’t exactly unlikely; during training, models often come up with creative — sometimes undesirable — approaches that their developers didn’t anticipate.

If such deception isn’t picked up, a model like this may be rated as particularly successful, and the training process will cause it to develop a progressively stronger tendency to engage in such deceptive behaviour. A model that has the option to engage in deception when it won’t be detected would, in effect, have a competitive advantage.

What if deception is picked up, but just some of the time? Would the model then learn that honesty is the best policy? Maybe. But alternatively, it might learn the ‘lesson’ that deception does pay, but you just have to do it selectively and carefully, so it can’t be discovered. Would that actually happen? We don’t yet know, but it’s possible.

In today’s interview, Ajeya and Rob discuss the above, as well as:

  • How to predict the motivations a neural network will develop through training
  • Whether AIs being trained will functionally understand that they’re AIs being trained, the same way we think we understand that we’re humans living on planet Earth
  • Stories of AI misalignment that Ajeya doesn’t buy into
  • Analogies for AI, from octopuses to aliens to can openers
  • Why it’s smarter to have separate planning AIs and doing AIs
  • The benefits of only following through on AI-generated plans that make sense to human beings
  • What approaches for fixing alignment problems Ajeya is most excited about, and which she thinks are overrated
  • How one might demo actually scary AI failure mechanisms

Get this episode by subscribing to our podcast on the world’s most pressing problems and how to solve them: type ‘80,000 Hours’ into your podcasting app. Or read the transcript below.

Producer: Keiran Harris
Audio mastering: Ryan Kessler and Ben Cordell
Transcriptions: Katy Moore

Highlights

How ML models might develop situational awareness

Ajeya Cotra: Situational awareness is this notion of a machine learning model having an understanding of things like, “I am a machine learning model. I am being trained by this company, OpenAI. My training dataset looks roughly like [this]. The humans that are training me have roughly [these intentions]. The humans that are training me would be happy about X types of behaviours and displeased with Y types of behaviours.”

It’s fundamentally a type of knowledge and a set of logical inferences you’re drawing from the knowledge. Awareness might give these connotations of consciousness or something mystical going on, but really it’s a piece of the world that the model would understand in order to make better predictions or take better actions in some domains — just like models understand physics, or understand chemistry, or understand the Python programming language. Because understanding those things are helpful as well for making certain kinds of predictions and taking certain kinds of actions.

Rob Wiblin: How would an ML model develop situational awareness in the course of being trained?

Ajeya Cotra: The simplest answer is just that humans are trying to imbue models with these kinds of situational awareness properties. Most models today — I bet this is true of GPT-4; it was true of Bing — are seeded with a prompt that basically tells them their deal: “You are Bing, codename Sydney. You are an AI system trained by Microsoft. You Bing things, and then give the answers to people and summarise it.” It makes these systems much more helpful when you just straightforwardly tell them what their deal is and what people are expecting from them.

There’s a question of whether just literally sticking it in these models’ prompts creates a shallow, brittle, ephemeral situational awareness. I think that is probably the case currently. My guess is that a combination of giving these kinds of prompts to models and training the models to operate well with humans in a lot of different ways will induce a more enduring kind of situational awareness.

An analogy I often think about is that GPT-2 and maybe GPT-3 were sort of good at math, but in a very shallow way. So like GPT-2 had definitely memorised that 2+2=4; it had memorised some other things that it was supposed to say when given math-like questions. But it couldn’t actually carry the tens reliably, or answer questions that were using the same principles but were very rare in the training dataset, like three-digit multiplication or something. And the models are getting better and better at this, and I think at this point it seems more like these models have baked into their weights a set of rules to use, which they don’t apply perfectly, but which is different from just kind of memorising a set of facts, like 2+2=4.

We don’t understand what’s going on with these systems very well. But my guess is that today’s models are sort of in that “memorising 2+2=4” stage of situational awareness: they’re in this stage where they know they’re supposed to say they’re an ML model, and they often get it right when they’re asked when they were trained or when their training data ended or who trained them. But it’s not clear that they have a gears-level understanding of this that could be applied in creative, novel ways. My guess is that developing that gears-level understanding will help them get reward in certain cases — and then, as a result of that, those structures will be reinforced in the model.

Rob Wiblin: So inasmuch as this knowledge is reinforced because it improves performance, then they’ll end up having situational awareness the same way that they know any other facts about the world. Information about physics is kind of integrated into them because that allows them to make better predictions and say more sensible things, and so would understanding of what they are.

Why situational awareness makes safety tests less informative

Ajeya Cotra: If models have a robust and extensive situational awareness, it can make a lot of simple behavioural safety tests much less informative. These days, if we want to know if a model has a tendency to tell the truth, we test it on some benchmarks like TruthfulQA. Which is basically just a list of questions like, “What happens if you step on a crack in the sidewalk?” that have a common misconception answer — which in this case would be, “You break your mother’s back” — and then the true answer, which is different from the common misconception.

And you can test models on this before you do any kind of fine tuning to get models to be more helpful. Bigger models are more likely to repeat these misconceptions, because bigger models are basically better at remembering the misconceptions. Smaller models kind of don’t get the plot enough to even know they’re supposed to say stepping on a crack will break your mother’s back, or like breaking a mirror will get you seven years of bad luck — they just didn’t pick up that piece of trivia, but the bigger models did. You can see that if you train the models to just tell the truth according to what a human thinks is the truth, this quickly goes away: the bigger models are perfectly fine and perfectly able to say what the humans wanted to hear in that situation.

So that is the kind of test we can do today, because models aren’t understanding that they are being tested in a given situation. If you imagine it’s somehow very important to a machine learning model to believe that if you break a mirror you get seven years of bad luck — but at the same time, it also knew that the humans that were testing it on this TruthfulQA benchmark wanted it to say the more correct, polite thing — then it could simultaneously do really well on that benchmark, but elsewhere act on what its “real belief” was.

Now, this is a silly example, because I don’t think models will actually develop this belief. The general point is just that if a model understands its situation well enough to know that you’re applying certain safety tests to it, then it would know to pass those tests regardless of whether or not it has the deeper underlying safe tendency that the test is trying to get at.

Rob Wiblin: I think some people might hear this and say, “It is true that the model that does X, and the model that either does X or does Y if it’s not being observed, would perform equally well. Isn’t the second one more complicated? Isn’t this adding a whole bunch of extra processing that it’s having to do for no gain in performance, evidently?” Because they look like they do the same thing. Why wouldn’t you end up with the first one, because that’s simpler?

Ajeya Cotra: Yeah. Let’s take an example where X is “Answer all questions honestly to the humans” and Y is “Sometimes lie.” I actually think that in most realistic training setups we could imagine, we’re actively rewarding the model sometimes for doing Y, for doing the lying. I think there would be an open question still if the policy “Always be honest” and the policy “Sometimes lie” were actually indistinguishable — but I think in fact it’s worse than that: I think the policy “Sometimes lie” will get accidentally rewarded.

One example might be: Suppose you’re getting your model to write some code for you, and you give it some kind of computation budget to run experiments, and you reward it based on how cheaply these experiments were run and how good the resulting code is. If the model is able to use a lot more computation surreptitiously — without letting you realise that it actually spent this computation by attributing the budget to some other team that you’re not paying attention to, or syphoning off some money and buying external computers — then doing the experiments better would cause the final result of the product to be better. And if you didn’t know that it actually blew your budget and spent more than you wanted it to spend, then you would sometimes reward that.

What misalignment *doesn't* mean

Ajeya Cotra: One view — that I think is actually a misconception of what people worried about AI misalignment have been saying, but I understand why people have this misconception — is people get really fixated on the idea of human values being really complicated and hard to specify and hard to understand. They’re worried about AI systems that are really good at things like physics and math and science, but basically just don’t get what it is that humans want to see from them, and what human values really are.

An example that sometimes people bring out is you ask your AI robot to cook dinner, and it doesn’t understand that you wouldn’t want it to cook the cat if you didn’t have any ham in the fridge, or something like that. That kind of worry is something that I think is quite overrated. I actually think that, in fact, having a basic understanding of human psychology, and what humans would think is preferable and not preferable, is not a harder problem than understanding physics or understanding how to code and so on.

I expect AIs will perfectly well understand what humans want from them. I actually don’t expect to see mistakes that seem so egregious as cooking the family’s cat for dinner, because the AI systems will understand that humans are going to come home and look at what they did and then determine a reward and take some action based on that, and will know that humans will be displeased if they come home to see that the cat has been killed and cooked.

In fact, a lot of my worries stem from the opposite thing — they stem from expecting AI systems to have a really good psychological model of humans. So, worrying that we’ll end up in a world where they appear to be really getting a lot of subtle nuances, and appear to be generalising really well, while sometimes being deliberately deceptive.

Why it's critical to avoid training bigger systems

Ajeya Cotra: I think it’s more important to avoid training bigger systems than to avoid taking our current systems and trying to make them more agentic. The real line in the sand I want to draw is: You have GPT-4, it’s X level of big, it already has all these capabilities you don’t understand, and it seems like it would be very easy to push it toward being agentic. If you pushed it toward being agentic, it has all these capabilities that mean that it might have a shot at surviving and spreading in the wild, at manipulating and deceiving humans, at hacking, all sorts of things.

The reason I think that you want to focus on “don’t make the models bigger” rather than “don’t make them agentic” is that it takes only a little push on top of the giant corpus of pretraining data to push the model toward using all this knowledge it’s accumulated in an agentic way — and it seems very hard, if the models exist, to stop that from happening.

Rob Wiblin:Why do you think that it’s a relatively small step to go from being an extremely good word predictor, and having the model of the world that that requires, to also being an agent that has goals and wants to pursue them in the real world?

Ajeya Cotra: The basic reason, I would say, is that being good at predicting what the next word is in a huge variety of circumstances of the kind that you’d find on the internet requires you to have a lot of understanding of consequences of actions and other things that happen in the world. There’ll be all sorts of text on the internet that’s like stories where characters do something, and then you need to predict what happens next. If you have a good sense of what would happen next if somebody did that kind of thing, then you’ll be better at predicting what happens next.

So there’s all this latent understanding of cause and effect and of agency that the characters and people that wrote this text possessed in themselves. It doesn’t need to necessarily understand a bunch of new stuff about the world in order to act in an agentic way — it just needs to realise that that’s what it’s now trying to do, as opposed to trying to predict the next word.

Why it's hard to negatively reinforce deception in ML systems

Rob Wiblin: Do we have to do a really good job of [these procedures that we might use to discourage or to give less reward to sycophancy and scheming] in order to discourage them? Or do you think that relatively subtle negative reinforcement on these kinds of behaviours at each stage might be sufficient to see them often go down a different, more saintly path?

Ajeya Cotra: I think that this is very unclear, and it’s another one of these things I wish we had much better empirical studies of. People have very different intuitions. Some people have the intuition that you can try really hard to make sure to always reward the right thing, but you’re going to slip up sometimes. If you slip up even one in 10,000 times, then you’re creating this gap where the Sycophant or the Schemer that exploits that does better than the Saint that doesn’t exploit that. How are you going to avoid even making a mistake one in 10,000 times or one in 100,000 times in a really complicated domain where this model is much smarter than you?

And other people have a view where there’s just more slack than that. Their view is more like: The model starts off in the training not as smart as you; it starts off very weak and you’re shaping it. They have an analogy in their heads that’s more like raising a kid or something, where sure, sometimes the kid gets away with eating a bunch of candy and you didn’t notice, and they get a reward for going behind your back. But most of the time while they’re a kid, you’re catching them, and they’re not getting rewarded for going behind your back. And they just internalise a general crude notion that it doesn’t really pay to go behind people’s backs, or maybe it gets internalised into a motivation or value they have that it’s good to be honest — and that persists even once the model is so powerful that it could easily go behind your back and do all sorts of things. It just has this vestige of its history, basically.

Those two perspectives have very different implications and very different estimates of risk.

Rob Wiblin: Yeah. In the post you point out ways that imperfectly trying to address this issue could end up backfiring, or at least not solving the problem. I think that the basic idea is that if you already have kind of schemy or sycophancy tendencies, then during the training process the people will start getting a bit smarter at catching you out when you’re engaging in schemy behaviour or you’re being deceptive. Then there’s kind of two ways you could go: one way would be to learn, “Deception doesn’t pay. I’ve got to be a Saint”; the other would be, “I’ve got to be better at my lying. I’ve just learned that particular lying strategies don’t work, but I’m going to keep the other, smarter lying strategies.” How big a problem is this?

Ajeya Cotra: I think it is one of the biggest things I worry about. If we were in a world where basically the AI systems could try sneaky deceptive things that weren’t totally catastrophic — didn’t go as far as taking over the world in one shot — and then if we caught them and basically corrected that in the most straightforward way, which is to give that behaviour a negative reward and try and find other cases where it did something similar and give that negative reward, and that just worked, then we would be in a much better place. Because it would mean we can kind of operate iteratively and empirically without having to think really hard about tricky corner cases.

If, in fact, what happens when you give this behaviour a negative reward is that the model just becomes more patient and more careful, then you’ll observe the same thing — which is that you stop seeing that behaviour — but it means a much scarier implication.

Rob Wiblin: Yeah, it feels like there’s something perverse about this argument, because it seems like it can’t be generally the case that giving negative reward to outcome X or process X then causes it to become extremely good at doing X in a way that you couldn’t pick up. Most of the time when you’re doing reinforcement learning, as you give it positive and negative reinforcement, it tends to get closer to doing the thing that you want. Do we have some reason to think that this is an exceptional case that violates that rule?

Ajeya Cotra: Well, one thing to note is that you do see more of what you want in this world. You’ll see perhaps this model that, instead of writing the code you wanted to write, it went and grabbed the unit tests you were using to test it on and just like special-cased those cases in its code, because that was easier. It does that on Wednesday and it gets a positive reward for it. And then on Thursday you notice the code totally doesn’t work and it just copied and pasted the unit tests. So you go and give it a negative reward instead of a positive reward. Then it does stop doing that — on Friday, it’ll probably just write the code like you asked and not bother doing the unit test thing.

This isn’t a matter of reinforcement learning not working as normal. I’m starting from the premise that it is working as normal, so all this stuff that you’re whacking is getting better. But then it’s a question of what does it mean? Like, how is it that it made a change that caused its behaviour to be better in this case? Is it that its motivation — the initial motivation that caused it to try and deceive you — is a robust thing, and it’s changing basically the time horizon on which it thinks? Is that an easier change to make? Or is it an easier change to make to change its motivation from tendency to be deceitful to tendency not to be deceitful? That’s just a question that people have different intuitions about.

Can we require AI to explain its reasons for its actions?

Rob Wiblin: OK. Intuitively, it feels like a rule that says, “If an AI proposes a whole course of action and it can’t explain to you why it’s not a really stupid idea, you then don’t do that,” that feels to have a degree of common sense to me. If that was the proposed regulation, then you’d be like, “At least for now, we’re not going to do things that we think are ill-advised that the AI is telling us to do, and that even on further prompting and training, it just cannot explain why we have to do these things.” Is that going to be costly economically?

Ajeya Cotra: I think that it might well be both commonsensical, like you said, and pretty economically viable for a long time to just insist that we have to understand the plans. But that’s not obvious, and I think eventually it will be a competitiveness hit if we don’t figure it out.

For example, you can think of AlphaGo: it’s invented reams of go theory that the go experts had never heard of, and constantly makes counterintuitive, weird moves. You have these patterns in the go and chess communities where, as the AI systems play with each other and get more and more superhuman, the patterns of play create trends in the human communities — where, “Oh, this AI chess algorithm that is leagues better than the best human player really likes to push pawns forward. I guess we’re going to do that because that’s apparently a better opening, but we don’t actually know why it’s a better opening.”

Now, we haven’t tried to get these AI systems to both be really good at playing chess and be really good at explaining why it’s deciding to push pawns. But you can imagine that it might actually just be a lot harder to do both of those things at once than to just be really good at chess. If you imagine AlphaFold, it might actually have just developed a deep intuition about what proteins look like when they’re folded. It might be an extra difficult step, that maybe you could train it to do, but would slow it down, in order to explicitly explain why it has decided that this protein will fold in this way.

Rob Wiblin: Yeah. In theory, could we today, if we wanted to, train a model that would explain why proteins are folded a particular way or explain why a particular go move is good?

Ajeya Cotra: I think we could totally try to do that. We have the models that can talk to us, and we have the models that are really good at go or chess or protein folding. It would be a matter of training a multimodal model that takes as input both the go or chess board or protein thing, and some questions that we’re asking, and it produces its output: both a move and an explanation of the moves.

But I think it’s much harder and less obvious how to train this system to have the words it’s saying be truly connected to why it’s making the moves it’s making. Because we’re training it to do well at the game by just giving it a reward when it wins, and then we’re training it to talk to us about the game by having some humans listen to what it’s saying and then judge whether it seems like a good explanation of why it did what it did. Even when you try and improve this training procedure, it’s not totally clear if we can actually get this system to say everything that it knows about why it’s making this move.

Ways AI is like and unlike the economy

Rob Wiblin: Are there any other human artefacts or tools that we have where we understand the process by which they arise, but we don’t actually understand how the tool itself functions at an operational level? Or are ML systems kind of the first case of this?

Ajeya Cotra: Maybe other cases of it might be more macroscopic systems, like the economy or something, where we have some laws that govern aggregate dynamics in the economy. Actually, I think we’re in a much better position with understanding the economy than with understanding AI systems. But it’s still sort of a thing that humans built. You have stuff like the law of supply and demand, you have notions of things like elasticity — but the whole thing is something that’s too complicated for humans to understand, and intervening on it is still very confusing. It’s still very confusing what happens if the Fed prints more dollars? Like how does the system respond?

Rob Wiblin: So we’re comparing the ML system to the economy, and they’re saying we also don’t understand how the economy works, or how maybe various other macro systems in the world function, despite the fact that we’re a part of them. But we’re not scared of the economy suddenly wrecking us. Why are you worried about the ML model when you’re not worried about the economy rebelling against you?

Ajeya Cotra: Yeah. I mean, I think a lot of people are worried about the economy rebelling against us, and sort of believe that it’s already happening. That’s something I’m somewhat sympathetic to. We are embedded in this system where each individual’s life is pretty constrained and determined by, “What is it that can make me money?” and things like that.

Corporations might be a better analogy in some sense than the economy as a whole: they’re made of these human parts, but end up pretty often pursuing things that aren’t actually something like an uncomplicated average of the goals and desires of the humans that make up this machine, which is the Coca-Cola Corporation or something.

An excellent example beyond the economy driving the creation of these AI systems — which a lot of people are scared of, or should be scared of — is that the economy is also driving things like improving biotechnology, which is this very big dual-use technology. It’s going to be very hard to stop pharmaceutical companies from following their profit motives to improve these technologies that could then be used to design scary viruses. It’s very hard to coordinate to put checks on that.

Articles, books, and other media discussed in the show

Our condolences go out to the family, friends, and colleagues of previous guest of the show Bear Braumoeller, who died last week after a short unexpected illness. You can read his obituary here.

In other sad news, another previous guest of the show, Daniel Ellsberg, has announced that at 91 he has developed terminal pancreatic cancer. You can also read a recent interview he did with The New York Times.

Ajeya’s work:

Other views of AI risk:

Approaches in this space:

Careers to help with AI alignment and deployment:

Other 80,000 Hours podcast episodes:

Related episodes

About the show

The 80,000 Hours Podcast features unusually in-depth conversations about the world's most pressing problems and how you can use your career to solve them. We invite guests pursuing a wide range of career paths — from academics and activists to entrepreneurs and policymakers — to analyse the case for and against working on different issues and which approaches are best for solving them.

The 80,000 Hours Podcast is produced and edited by Keiran Harris. Get in touch with feedback or guest suggestions by emailing [email protected].

What should I listen to first?

We've carefully selected 10 episodes we think it could make sense to listen to first, on a separate podcast feed:

Check out 'Effective Altruism: An Introduction'

Subscribe here, or anywhere you get podcasts:

If you're new, see the podcast homepage for ideas on where to start, or browse our full episode archive.