What we've learned from recent AI advances
Richard Ngo: It does feel like the progress of the last couple years in particular has been very compelling and quite visceral for me, just watching what’s going on. I think partly because the demos are so striking — the images, the poetry that GPT-3 creates, things like that — and then partly because it’s just so hard to see what’s coming. People are really struggling to try and figure out what things can these language models not actually do? Like, what benchmarks can we design, what tasks can we give them that are not just going to fall in a year or maybe a year and a half? It feels like the whole field is in a state of suspense in some ways. It’s just really hard to know what’s coming. And it might just often be the things that we totally don’t expect — like AI art, for example.
Rob Wiblin: Yeah. I guess there are two things going on here. One is just that the capabilities are ahead of where people forecast them to be. And they’re also ahead of where people forecast them to be in a kind of strange direction: the progress is occurring on tasks that people didn’t really anticipate would be the first things that AI would be able to do. So it’s just massively blown open our credences or our expectations about what might be possible next, because it seems like we don’t have a very good intuitive grasp of which things ML is able to improve at really rapidly and which ones it’s not.
Richard Ngo: Right. I often hear this argument that says something like, “Oh, look, AI’s not going to be dangerous at all. It can’t even load a dishwasher. It can’t even fold laundry” or something like that. Actually it turns out that a lot of really advanced and sophisticated capabilities (including some kinds of reasoning, like advanced language capabilities, and coding as well actually) can come significantly before a bunch of other capabilities that are more closely related to physical real-world tasks. So I do think there’s a pretty open question as to what the ordering of all these different tasks is, but we really just can’t rule out a bunch of pretty vital and impressive and dangerous capabilities coming even earlier than things that seem much more prosaic or much more mundane.
Richard Ngo: I’ve already alluded a little bit to the unpredictability of capability advances, and how things like reasoning, strategic thinking, and so on might just come much earlier than we expect.
Another thing that has felt pretty important is that we don’t really know what the capabilities of our systems are directly after we’ve built them anymore. So once you train a large language model or a large image model, it may be the case that there are all sorts of things that you can get it to do — given the right input, the right prompt, the right setup — that we just haven’t figured out yet, because they’re emergent properties of what the model has learned during training.
Probably the best example of this is that people figured out that if you prompt large language models to think through their reasoning step by step, they can answer significantly more difficult questions than they could if you just give them the question by itself. And this makes sense, because this is what humans do, right? If you tell somebody to answer an arithmetic problem by calculating all the intermediate values, they’re probably much more likely to get it correct than if you say, “You have to give me the final answer directly.”
But this is something that it took ages for people to figure out. Because this was applicable to GPT-3, and I think also to a lesser extent to GPT-2, but papers about this were only coming out last year. So these are the types of things where actually just figuring out what the models can even do is a pretty difficult task, and probably just going to get increasingly difficult.
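As a rough illustration of the step-by-step prompting described above, here is a minimal sketch of how the two kinds of prompt differ. The function names and the example question are hypothetical, the "Let's think step by step" wording is just one commonly used phrasing, and no model is actually called: the code only builds the text that would be sent.

```python
def direct_prompt(question: str) -> str:
    """Ask for the final answer only, with no intermediate working."""
    return f"Q: {question}\nA: The answer is"


def step_by_step_prompt(question: str) -> str:
    """Ask the model to write out its intermediate reasoning first."""
    return f"Q: {question}\nA: Let's think step by step."


if __name__ == "__main__":
    # Hypothetical arithmetic question, chosen only for illustration.
    question = "Pens come in packs of 12. How many pens are in 7 packs plus 5 loose pens?"
    print(direct_prompt(question))
    print()
    print(step_by_step_prompt(question))
```

The only difference is the cue at the end of the prompt, which is part of why this was easy to overlook: the model is unchanged, and only the input invites it to produce the intermediate values before the final answer.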
Key arguments for why this matters
Richard Ngo: I think one argument that feels pretty compelling to me is just that we really have no idea what’s going on inside the systems that we’re training. So you can get a system that will write you dozens of lines of code, that implements a certain function, that leads to certain outcomes on the screen, and we have no way of knowing what it’s doing internally that leads it to produce that output. Why did it choose to implement the function this way instead of that way? Why did it decide to actually follow our instructions as opposed to doing something quite different? We just don’t know mechanistically what’s happening there.
So it feels to me like if we were on course to solving this in the normal run of things, then we would have a better understanding of what’s going on inside our systems. But as it is, without that core ability, it feels hard to rule out or to be confident that we are going to be able to address these things as they come up, because as these systems get more intelligent, anything could be going on.
And there has been some progress towards this. But it feels still very far away, or the progress on this is not clearly advancing faster than the capabilities are advancing.
Rob Wiblin: I suppose this is the first really complicated machine that we’ve ever produced where we don’t know how it works. We know how it learns, but we don’t know what that learning leads it to do internally with the information, or at least we don’t know it very well.
Richard Ngo: Right. In some sense, raising a child is like this. But we have many more guarantees and much more information about what children look like, how they learn, and what sort of inbuilt biases they have, such that they’re going to mostly grow up to be moral, law-abiding people. So maybe a better analogy is raising an alien, and just not having any idea how it’s thinking or when it’s trying to reason about your reactions to it or anything like that.
And right now, I don’t think we’re seeing very clear examples of deception or manipulation or models that are aware of the context of their behaviour. But again, this seems like something where there doesn’t seem to be any clear barrier standing between us and building systems that have those properties.
Rob Wiblin: Yeah. Even with humans, not all of them turn out to be quite that benevolent.
Richard Ngo: Absolutely.
Rob Wiblin: What’s another important reason you think advances in AI may not necessarily go super well without a conscious effort to reduce the risks?
Richard Ngo: I think that a lot of other problems that we’ve faced as a species have been on human timeframes, so you just have a relatively long time to react and a relatively long time to build consensus. And even if you have a few smaller incidents, then things don’t accelerate out of control.
I think the closest thing we’ve seen to real exponential progress that people have needed to wrap their heads around on a societal level has been COVID, where people just had a lot of difficulty grasping how rapidly the virus could ramp up and how rapidly people needed to respond in order to have meaningful precautions.
And in AI, it feels like it’s not just one system that’s developing exponentially: you’ve got this whole underlying trend of things getting more and more powerful. So we should expect that people are just going to underestimate what’s happening, and the scale and scope of what’s happening, consistently — just because our brains are not built for visualising the actual effects of fast technological progress or anything near exponential growth in terms of the effects on the world.
What AI could teach us about ourselves
Rob Wiblin: Another thing that I’ve heard some people speculate that we might be learning from these language models, for example, is that we might be learning something about how humans operate. So these language models are kind of predictive models, where you’ve got some text before and then it’s trying to predict the next word. It seems like using that method, you can at least reasonably often produce surprisingly reasonable speech, and perhaps surprisingly reasonable articles and chat and so on.
Now, some people would say this looks like what people are doing, but it isn’t what they’re doing. We actually have all of these ideas in our minds and then we put them together in a coherent way, because we deeply understand the actual underlying ideas and what we’re trying to communicate. Whereas this thing is just chucking word after word after word in a way that produces a simulation of what a person is like.
But I suppose when people aren’t thinking that deeply, maybe we operate this way as well. Maybe I’m producing speech extemporaneously now, but my brain can do a lot of the work automatically, because it just knows what words are likely to come after other words. Do you have any view on whether these language models are doing something very different than what humans do? Or are we perhaps learning that humans use text prediction as a way of producing speech themselves to some degree?
Richard Ngo: That’s a great example actually, where a lot of the time human behaviour is pretty automatic and instinctive — including things like speech, where, as you say, the words that are coming out of my mouth right now are not really planned in advance by me. I’m just sort of nudging myself towards producing the types of sentences that I intend.
If we think about Kahneman’s System 1 / System 2 distinction, I think that’s actually not a bad way of thinking about our current language models: that they actually do most of the things that humans’ System 1 can do, which is instinctive, quick judgements of things. And then, so far at least, they’re a little bit further away from the sort of metacognition and reflection, and noticing when you’re going down a blind alley or you’ve lost the thread of conversation.
But over time, I think that’s going to change. Maybe another way of thinking about them is that it seems hard to find a task that a human can do in one second or three seconds that our language models can’t currently do. It’s easier to find tasks that humans can do in one minute or five minutes that the models can’t do. But I expect that number to go up and up over time — that the models are going to be able to match humans given increasingly long time periods to think about the task, until eventually they can just beat humans full stop, no matter how much time you give those humans.
Rob Wiblin: So is the idea there that if you only give a human one second to just blurt out something in reaction to something else, then it has to operate on this System 1, this very instinctive thing, where it’s just got to string a sentence together and it doesn’t really get to reflect on it. And the language models can do that: they can blurt something out. But the thing that they’re not so good at is what humans would do, which is look at the sentence that is about to come out of their mouth and then see that that’s actually not what they want to communicate, and then edit it and make it make a whole lot more sense conceptually.
Richard Ngo: Right. And sometimes you even see language models making the same types of mistakes that humans make. So as an example, if you ask it for a biography of somebody, sometimes it’ll give you a description of their career and achievements that’s not quite right, but in the right direction. Maybe they’ll say that the person went to Oxford when actually that person went to Cambridge, or something like that — where it’s like it clearly remembers something about that person, but it just hasn’t memorised the detail. It’s more like it’s learned some kind of broader association. Maybe it’ll say that they studied biology when actually they studied chemistry — but it won’t say that they studied film studies when they actually studied chemistry.
So it seems like there’s these mistakes where it’s not actually recalling the precise details, but kind of remembering the broad outline of the thing and then just blurting that out, which is what humans often do.
Bottlenecks to learning for humans
Richard Ngo: I think the biggest one is just that we don’t get much chance to experience the world. We just don’t get that much input data, because our lives are so short and we die so quickly.
There’s actually a bunch of work on scaling laws for large language models, which basically say: If you have a certain amount of compute to spend, should you spend it on making the model bigger or should you spend it on training it for longer on more data? What’s the optimal way to make that tradeoff?
And it turns out that from this perspective, if you had as much compute as the human brain uses throughout a human lifetime, then the optimal way to spend that is not having a network the size of the human brain — it’s actually having a significantly smaller network and training it on much more data. So intuitively speaking, at least, human brains are solving this tradeoff in a different way from how we are doing it in machine learning, where we are taking relatively smaller networks and training them on way more data than a human ever sees in order to get a certain level of capabilities.
Rob Wiblin: I see. OK, hold on. The notion here is you’ve got a particular amount of compute, and there’s two different ways you could spend it. One would be to have tonnes of parameters in your model. I guess this is the equivalent of having lots of neurons and lots of connections between them. So you’ve got tonnes of parameters; this is equivalent to brain size. But another way you could use the compute is, instead of having lots and lots of parameters that you’re constantly estimating and adjusting, you’d have a smaller brain in a sense, but train it for longer: have it read way more stuff, have it look at way more images.
And you’re saying humans are off on this crazy extreme, where our brains are massive — so many parameters, so many connections between all the neurons — but we only live for so little time. We read so little, we hear so little speech, relative to what is possible. And we’d do better if somehow nature could have shrunk our brains, but then got us to spend twice as long in school in a sense. I suppose there’s all kinds of problems to getting beings to live quite that long, but you might also get killed while you’re in your stupid early-infant phase for so long.
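The tradeoff being described can be made concrete with a small sketch. It assumes the widely used rule of thumb that training compute is roughly C ≈ 6 × N × D (N parameters, D training tokens), which is not stated in the conversation. The budget of 1e24 FLOPs and the ratio of 20 tokens per parameter are illustrative values, not figures from the discussion, and the function names are hypothetical.

```python
import math

# Assumption: rule-of-thumb training cost of about 6 FLOPs per parameter per token.
FLOPS_PER_PARAM_PER_TOKEN = 6.0


def tokens_affordable(compute_flops: float, n_params: float) -> float:
    """Training tokens D that a fixed budget C buys for a model with N parameters."""
    return compute_flops / (FLOPS_PER_PARAM_PER_TOKEN * n_params)


def split_budget(compute_flops: float, tokens_per_param: float) -> tuple[float, float]:
    """Choose N and D so that D = tokens_per_param * N and 6 * N * D = C."""
    n_params = math.sqrt(
        compute_flops / (FLOPS_PER_PARAM_PER_TOKEN * tokens_per_param)
    )
    return n_params, tokens_per_param * n_params


if __name__ == "__main__":
    budget = 1e24  # hypothetical compute budget in FLOPs
    # Same budget, different "brain sizes": bigger models see less data.
    for n in (1e9, 1e10, 1e11, 1e12):
        print(f"N = {n:.0e} parameters -> D = {tokens_affordable(budget, n):.2e} tokens")
    # Illustrative data-to-parameter ratio; the right value is an empirical question.
    n_opt, d_opt = split_budget(budget, tokens_per_param=20.0)
    print(f"at 20 tokens per parameter: N = {n_opt:.2e}, D = {d_opt:.2e}")
```

Under this approximation, every tenfold increase in parameter count costs a tenfold reduction in the data the same budget can pay for, which is the sense in which a very large brain with a short lifetime sits at one extreme of the tradeoff.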
Richard Ngo: Exactly. Humans faced all these tradeoffs from the environment, which probably neural networks are just not going to face. So by the time we are training neural networks that are as large as the human brain, we should be expecting that they won’t have as much experience as humans do, but they’re actually just going to be training on a huge amount more experience compared with humans. So that’s one way in which humans are systematically disadvantaged. We just haven’t been built to absorb the huge amounts of information that are being generated by the internet or videos, YouTube, Wikipedia, across the entire world.
And that’s closely related to the idea of AIs being copyable. If you have a neural network that’s trained on one piece of data, you can then make many copies of that network and deploy it in all sorts of situations. And then you can feed back the experience that it gets from all those situations into the base model, so effectively you can have a system that’s learning from a very wide array of data, and then taking that knowledge and applying it to a very wide range of new situations.
These are all ways in which I think in the short term, humans are disadvantaged just by virtue of the fact that we’re running on biological brains in physical bodies, rather than virtual brains in the cloud.
And then in the longer term, I think the key thing here is what algorithmic improvements can you make? How much can you scale these things up? Because over the last decade or two, we’ve seen pretty dramatic increases in the sizes of neural networks that we’ve been using, and the algorithms that we are using have also been getting significantly more efficient. So we can think about artificial agents investing in doing more machine learning research and improving themselves in a way that humans just simply can’t match — because our brain sizes, our brain algorithms, and so on are pretty hard coded; there’s not really that much we can do to change this. So in the long term, it seems like these factors really should lead us to expect AI to dramatically outstrip human capabilities.
The most common and important misconception around ML
Richard Ngo: I think the most common and important misconception has to do with the way that the training setup relates to the model that’s actually produced. So for example, with large language models, we train them by getting them to predict the next word on a very wide variety of text. And so some people say, “Well, look, the only thing that they’re trying to do is to predict the next word. It’s meaningless to talk about the model trying to achieve things or trying to produce answers with certain properties, because it’s only been trained to predict the next word.”
The important point here is that the process of training the model in a certain way may then lead the model to actually itself have properties that can’t just be described as predicting the next word. It may be the case that the way the model predicts the next word is by doing some kind of internal planning process, or it may be the case that the way it predicts the next word is by reasoning a bunch about, “How would a human respond in this situation?” I’m not saying our current models do, but that’s the sort of thing that I don’t think we can currently rule out.
And in the future, as we get more sophisticated models, the link between the explicit thing that we’re training them to do — which in this case is predict the next word or the next frame of a video, or things like that — and the internal algorithms that they actually learn for doing that is going to be less and less obvious.
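A toy sketch may help separate the training objective from the algorithm that ends up satisfying it. Neither class below is a language model, and all names and data are invented for illustration. Both predictors are scored on exactly the same task, predicting the next token after a context like "2 + 3 =", but one memorises its training contexts while the other carries out the addition.

```python
def make_examples(pairs):
    """Turn (a, b) pairs into (context, next_token) examples such as ('2 + 3 =', '5')."""
    return [(f"{a} + {b} =", str(a + b)) for a, b in pairs]


class LookupPredictor:
    """Memorises which token followed each training context."""

    def __init__(self, examples):
        self.table = dict(examples)

    def predict(self, context):
        # Unseen contexts get a placeholder guess.
        return self.table.get(context, "?")


class AdderPredictor:
    """Predicts the next token by actually carrying out the addition."""

    def predict(self, context):
        left, _, right, _ = context.split()
        return str(int(left) + int(right))


def accuracy(model, examples):
    """Fraction of examples where the predicted next token is correct."""
    return sum(model.predict(c) == t for c, t in examples) / len(examples)


if __name__ == "__main__":
    train = make_examples([(1, 2), (2, 3), (4, 4), (7, 1)])
    held_out = make_examples([(5, 6), (9, 9)])
    for model in (LookupPredictor(train), AdderPredictor()):
        name = type(model).__name__
        print(name, "train:", accuracy(model, train), "held out:", accuracy(model, held_out))
```

On the training examples the two are indistinguishable, so the objective alone does not say which kind of procedure was learned. The difference only shows up on contexts that were never seen.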
Rob Wiblin: OK, so the idea here is: let’s say that I was set the task of predicting the next word that you are going to say. It seems like one way that I could do that is maybe I should go away and study a whole lot of ML. Maybe I need to understand all of the things that you’re talking about, and then I’ll be able to predict what you’re likely to say next. Then someone could come back and say, “Rob, you don’t understand any of the stuff. You’re just trying to predict the next word that Richard’s saying.” And I’m like, “Well, these things aren’t mutually exclusive. Maybe I’m predicting what you’re saying by understanding it.” And we can’t rule out that there could be elements of embodied understanding inside these language models.
Richard Ngo: Exactly. And in fact, we have some pretty reasonable evidence that suggests that they are understanding things on a meaningful level.
My favourite piece of evidence here is from a paper that used to be called “Moving the Eiffel Tower to ROME” — I think they’ve changed the name since then. But the thing that happens in that paper is that they do a small modification of the weights of a neural network. They identify the neurons corresponding to the Eiffel Tower and Rome and Paris, and then just swap things around. So now the network believes that the Eiffel Tower is in Rome. And you might think that if this was just a bunch of memorised heuristics and no real understanding, then if you ask the model a question — “Where is the Eiffel Tower?” — sure, it’ll say Rome, but it’ll screw up a whole bunch of other questions. It won’t be able to integrate that change into its world model.
But actually what we see is that when you ask a bunch of downstream questions — like, “What can you see from the Eiffel Tower? What type of food is good near the Eiffel Tower? How do I get to the Eiffel Tower?” — it actually integrates that single change of “the Eiffel Tower is now in Rome” into answers like, “From the Eiffel Tower, you can see the Coliseum. You should eat pizza near the Eiffel Tower. You should get there by taking the train from Berlin to Rome via Switzerland,” or things like that.
Rob Wiblin: That’s incredible.
Richard Ngo: Exactly. And it seems like almost a definition of what it means to understand something is that you can take that isolated fact and translate it into a variety of different ideas and situations and circumstances.
And this is still pretty preliminary work. There’s so much more to do here in understanding how these models are actually internally thinking and reasoning. But just saying that they don’t understand what’s going on, that they’re just predicting the next word — as if that’s mutually exclusive with understanding the world — I think that’s basically not very credible at this point.
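The contrast drawn in the Eiffel Tower example can be sketched as a toy analogy. This is not the paper's method, which modifies the weights of a neural network; the classes, questions, and stored facts below are invented purely to show the difference between a change that propagates through a composed world model and one that stays isolated in a table of memorised answers.

```python
class ComposedWorldModel:
    """Answers questions by chaining landmark -> city -> facts about the city."""

    def __init__(self):
        self.city_of = {"Eiffel Tower": "Paris"}
        # Illustrative facts only.
        self.city_facts = {
            "Paris": {"nearby sight": "the Louvre", "local food": "crepes"},
            "Rome": {"nearby sight": "the Colosseum", "local food": "pizza"},
        }

    def answer(self, landmark, question):
        if question == "city":
            return self.city_of[landmark]
        return self.city_facts[self.city_of[landmark]][question]


class MemorisedAnswers:
    """Stores each (landmark, question) answer as a separate memorised entry."""

    def __init__(self):
        self.table = {
            ("Eiffel Tower", "city"): "Paris",
            ("Eiffel Tower", "nearby sight"): "the Louvre",
            ("Eiffel Tower", "local food"): "crepes",
        }

    def answer(self, landmark, question):
        return self.table[(landmark, question)]


if __name__ == "__main__":
    composed, memorised = ComposedWorldModel(), MemorisedAnswers()
    # Apply the same single edit to both: "the Eiffel Tower is in Rome".
    composed.city_of["Eiffel Tower"] = "Rome"
    memorised.table[("Eiffel Tower", "city")] = "Rome"
    for question in ("city", "nearby sight", "local food"):
        print(question, "|", composed.answer("Eiffel Tower", question),
              "|", memorised.answer("Eiffel Tower", question))
```

After the edit, both versions say the tower is in Rome, but only the composed one changes its downstream answers. The table of memorised answers still recommends Parisian food, which is the failure one would expect if there were no integrated world model behind the answers.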
Rob Wiblin: So the second point was neural networks trained via reinforcement learning will gain more reward by deceptively pursuing misaligned goals. Yeah, can you elaborate on that?
Richard Ngo: Right. So the key idea here is this concept called situational awareness, which was introduced by Ajeya Cotra in a report on the alignment problem, and which I’ve picked up and am using in my report. The way I’m thinking about situational awareness is just being able to apply abstract knowledge to one’s own situation, in order to choose actions in the context that the agent finds itself in.
In some sense this is a very basic phenomenon. Take going down to the grocery store to buy some matches, for example. Maybe I’ve never bought matches at the grocery store before, but I have this abstract knowledge of, like, “Matches are the type of thing that tend to be found in these types of stores, and I can buy them, and I’m in a situation where I can walk down to the store.” So in some sense this is just a very basic skill for humans.
In the context of AI, we don’t really have systems that have particularly strong situational awareness right now. We have agents that play StarCraft, for example, but they don’t understand that they are an AI playing StarCraft. They’re just within the game. They don’t understand the wider context. And then if you look at language models, I think they come a bit closer, because they do have this abstract knowledge. If you ask them, “What is a language model? How is it trained?”, things like that, they can give you pretty good answers, but they still don’t really apply that knowledge to their own context. They don’t really use that knowledge in order to choose their own answers.
But that changes as we train systems that are useful for a wide range of tasks. For example, if you train an assistant AI, then it’s going to have to know a bunch of stuff about, “What are my capabilities? How can I use those capabilities to help humans? Who’s giving me instructions and what do they expect from me?” Basically the only way to have really helpful AIs in these situations is for them to have some kind of awareness of the context that they’re in, the expectations that we have for them, the ways that they operate, the limitations that they act under, and things like that. And that’s this broad thing that I’m calling situational awareness.
Rob Wiblin: So there’s this concept of situational awareness, which is the water that we swim in, such that it is almost hard to imagine that it’s a thing. But it is a thing that humans have that lots of other minds might not have. But in order to get these systems to do lots of tasks that we’re going to ultimately want them to do, they’re going to need situational awareness as well, for the same reason that humans do. So that’s kind of a next stage. And then what?
Richard Ngo: And then, when you’ve got systems that have situational awareness, then I claim that you start to get the problematic types of deception. So in the earlier phases of training, you might have systems that learn the concept of “hide mistakes where possible,” but they don’t really understand who they’re hiding the mistakes from, exactly what types of scrutiny humans are going to apply, things like that. They start off pretty dumb. Maybe they just have this reflex to hide their mistakes, kind of the same way that young kids have an instinctive heuristic of hiding mistakes where possible.
But then, as you start to get systems that actually understand their wider context — like, “I’m being trained in this way,” or “Machine learning systems tend to be scrutinised in this particular way using these types of tools” — then you’re going to end up in an adversarial situation. Where, if the model tries to hide its mistakes, for example, then it will both know how to do so in an effective way, and then also apply the knowledge that it gains from the humans around it, in order to become increasingly effective at that.
So that’s a concern: that these types of deception are just going to be increasingly hard to catch as models get more and more situational awareness.
Reinforcement learning undermining obedience
Richard Ngo: So the things that I’ve talked about so far, none of those are the core things that I’m worried about causing large-scale problems. So if you have a deceptive model, maybe it’s doing insider trading on the stock market. Even if we can’t catch that directly, over time, we’re eventually going to figure out, “Oh, something is going off here.” Maybe we’re in a bit of a cat-and-mouse game, where we’re trying to come up with the correct rewards, and the model’s trying to come up with deceptive strategies, but as long as they’re roughly within the human range of intelligence, it feels like we can hopefully constrain a bunch of the worst behaviour that they perform.
Then I’m thinking, what happens when these models become significantly superintelligent? And, in particular, intelligent enough that we just can’t effectively supervise them? What might that look like? It might look like them just operating way too fast for us to understand. If you’ve got an automated CEO who’s sending hundreds of emails every minute, you’re just not going to be able to get many humans to scan all these emails and make sure that there’s not some sophisticated deceptive strategy implemented by them. Or you’ve got AI scientists, who are coming up with novel discoveries that are just well beyond the current state of human science. These are the types of systems where I’m thinking we just can’t really supervise them very well.
So what are they going to do? That basically depends on how they generalise the goals that they previously learned from the period when we were able to supervise them, into these novel domains or these novel regimes.
There are a few different arguments that make me worried that the generalisation is going to go poorly. Because you can imagine, for example, that in the regime where we could supervise, we always penalise deception, and they learn very strong anti-deception goals. Maybe we think that is going to hold up into even quite novel regimes, where deception might look very different from what it previously looked like. Instead of deception being lying to your human supervisor, deception could mean hiding information in the emails you send or something like that.
And I think there are a couple of core problems. The first one is just that the field of machine learning has very few ways to reason about systems that are generalising and transferring their knowledge from one domain to another. This is just not a regime that has been very extensively studied, basically because it’s just so difficult to say, “You’ve got a model that’s learned one thing. How well can it do this other task? What’s its performance in this wildly different regime?” Because you can’t quantify the difference between Dota and StarCraft or the difference between speaking English and speaking French. These are just very difficult things to understand. So that’s one problem there. Just by default, the way that these systems generalise is in many ways totally obscure to us, and will become more so, as they generalise further and further into more and more novel regimes.
Then there are a few more specific arguments as to why I’m worried that the goals are going to generalise in bad ways. Maybe one way of making these arguments is to distinguish between two types of goals. I’m going to call one type “outcomes” and I’m going to call the other type “constraints.”
Outcomes are just achieving things in the world — like ending up with a lot of money, or ending up having people respect you, or building a really tall building, just things like that. And then constraints I’m going to say are goals that are related to how you get to that point. So what do you need to do? Do you need to be deceptive in order to make a lot of money? Do you need to hurt people? Do you need to go into a specific industry or take this specific job? You might imagine a system has goals related to those constraints as well. And so the concern here is something like: for goals that are in the form of outcomes, systems are going to benefit from applying deceptive or manipulative strategies there.
This is, broadly speaking, a version of Bostrom’s instrumental convergence argument, which basically states that there are just a lot of things that are really useful for achieving large-scale goals. Or another way of saying this is from Stuart Russell, who says, “You can’t fetch coffee if you’re dead.” Even if you only have the goal of fetching coffee, you want to avoid dying, because then you can’t fetch the coffee anymore.
So you can imagine systems that have goals related to outcomes in the world are going to generalise towards, “It seems reasonable for me not to want to die. It seems reasonable for me to want to acquire more resources, or develop better technology, or get in a better position,” and so on, and these are all things that we just don’t want our systems to be doing autonomously. We don’t want them to be consolidating power or gaining more resources if we haven’t told them to, or things like that.
And then the second problem is that if these goals are going to generalise — if the goal “make a lot of money” is going to generalise to motivate systems to take large-scale actions — what about a goal like, “never harm humans,” or “never lie to humans,” or things like that? And I think there the problem is that, as you get more and more capable, there are just more ways of working around any given constraint.