#159 – Jan Leike on OpenAI’s massive push to make superintelligence safe in 4 years or less

In July, OpenAI announced a new team and project: Superalignment. The goal is to figure out how to make superintelligent AI systems aligned and safe to use within four years, and the lab is putting a massive 20% of its computational resources behind the effort.

Today’s guest, Jan Leike, is Head of Alignment at OpenAI and will be co-leading the project. As OpenAI puts it, “…the vast power of superintelligence could be very dangerous, and lead to the disempowerment of humanity or even human extinction. … Currently, we don’t have a solution for steering or controlling a potentially superintelligent AI, and preventing it from going rogue.”

Given that OpenAI is in the business of developing superintelligent AI, it sees that as a scary problem that urgently has to be fixed. So it’s not just throwing compute at the problem — it’s also hiring dozens of scientists and engineers to build out the Superalignment team.

Plenty of people are pessimistic that this can be done at all, let alone in four years. But Jan is guardedly optimistic. As he explains:

Honestly, it really feels like we have a real angle of attack on the problem that we can actually iterate on… and I think it’s pretty likely going to work, actually. And that’s really, really wild, and it’s really exciting. It’s like we have this hard problem that we’ve been talking about for years and years and years, and now we have a real shot at actually solving it. And that’d be so good if we did.

Jan thinks that this work is actually the most scientifically interesting part of machine learning. Rather than just throwing more chips and more data at a training run, this work requires actually understanding how these models work and how they think. The answers are likely to be breakthroughs on the level of solving the mysteries of the human brain.

The plan, in a nutshell, is to get AI to help us solve alignment. That might sound a bit crazy — as one person described it, “like using one fire to put out another fire.”

But Jan’s thinking is this: the core problem is that AI capabilities will keep getting better and the challenge of monitoring cutting-edge models will keep getting harder, while human intelligence stays more or less the same. To have any hope of ensuring safety, we need our ability to monitor, understand, and design ML models to advance at the same pace as the complexity of the models themselves.

And there’s an obvious way to do that: get AI to do most of the work, such that the sophistication of the AIs that need aligning, and the sophistication of the AIs doing the aligning, advance in lockstep.

Jan doesn’t want to produce machine learning models capable of doing ML research. But such models are coming, whether we like it or not. And at that point Jan wants to make sure we turn them towards useful alignment and safety work, as much or more than we use them to advance AI capabilities.

Jan thinks it’s so crazy it just might work. But some critics think it’s simply crazy. They ask a wide range of difficult questions, including:

  • If you don’t know how to solve alignment, how can you tell that your alignment assistant AIs are actually acting in your interest rather than working against you? Especially as they could just be pretending to care about what you care about.
  • How do you know that these technical problems can be solved at all, even in principle?
  • At the point that models are able to help with alignment, won’t they also be so good at improving capabilities that we’re in the middle of an explosion in what AI can do?

In today’s interview host Rob Wiblin puts these doubts to Jan to hear how he responds to each, and they also cover:

  • OpenAI’s current plans to achieve ‘superalignment’ and the reasoning behind them
  • Why alignment work is the most fundamental and scientifically interesting research in ML
  • The kinds of people he’s excited to hire to join his team and maybe save the world
  • What most readers misunderstood about the OpenAI announcement
  • The three ways Jan expects AI to help solve alignment: mechanistic interpretability, generalization, and scalable oversight
  • What the standard should be for confirming whether Jan’s team has succeeded
  • Whether OpenAI should (or will) commit to stop training more powerful general models if they don’t think the alignment problem has been solved
  • Whether Jan thinks OpenAI has deployed models too quickly or too slowly
  • The many other actors who also have to do their jobs really well if we’re going to have a good AI future
  • Plenty more

Get this episode by subscribing to our podcast on the world’s most pressing problems and how to solve them: type ‘80,000 Hours’ into your podcasting app. Or read the transcript below.

Producer and editor: Keiran Harris
Audio Engineering Lead: Ben Cordell
Technical editing: Simon Monsour and Milo McGuire
Additional content editing: Katy Moore and Luisa Rodriguez
Transcriptions: Katy Moore

Highlights

Why Jan is optimistic

Jan Leike: I think actually a lot of things, a lot of development over the last few years, have been pretty favourable to alignment. Large language models are actually super helpful because they can understand natural language. They know so much about humans. Like, you can ask them what would be a moral action under this and this philosophy, and they can give you a really good explanation of it. By being able to talk to them and express your views, it makes a lot of things easier. At the same time, they’re in some sense a blank slate, where you can fine-tune them with fairly little data to be so effective.

If you compare this to how the path to AGI or how the development of AI looked a few years ago, it seemed like we were going to train some deep RL agents in an environment like Universe, which is just like a collection of different games and other environments. So they might get really smart trying to solve all of these games, but they wouldn’t necessarily have a deep understanding of language, or how humans think about morality, or what humans care about, or how the world works.

The other thing that I think has been really favourable is what we’ve seen from the alignment techniques we’ve tried so far. So I already mentioned InstructGPT worked so much better than I ever had hoped for. Even when we did the deep RL from human preferences paper, I came into it thinking there was a more than even chance we wouldn’t be able to make it work that well in the time that we had. But it did work, and InstructGPT worked really well. And to some extent, you could argue that these are not techniques that align superintelligence, so why are you so optimistic? But I think it still provides evidence that this is working — because if we couldn’t even get today’s systems to align, I think we should be more pessimistic. And so the converse also holds.


Jan Leike: I want to give you a bunch more reasons, because I think there’s a lot of reasons. And also, fundamentally, the most important thing is that I think alignment is tractable. I think we can actually make a lot of progress if we focus on it and put effort into it. And I think there’s a lot of research progress to be made that we can actually make with a small dedicated team over the course of a year or four.

Honestly, it really feels like we have a real angle of attack on the problem that we can actually iterate on, we can actually build towards. And I think it’s pretty likely going to work, actually. And that’s really, really wild, and it’s really exciting. It’s like we have this hard problem that we’ve been talking about for years and years and years, and now we have a real shot at actually solving it. And that’d be so good if we did.

But some of the other reasons why I’m optimistic are that, I think fundamentally, evaluation is easier than generation for a lot of tasks that we care about, including alignment research. Which is why I think we can get a lot of leverage by using AI to automate parts of all of alignment research. And in particular, if you think about classical computer science problems like P versus NP, you have these kinds of problems where we believe it’s fundamentally easier to evaluate a solution than to find one. And it’s true for a lot of consumer products: if you’re buying a smartphone, it’s so much easier to pick a good smartphone than it is to build a smartphone. Or in organisations, if you’re hiring someone, it has to be easier to figure out whether they’re doing a good job than to do their job. Otherwise you don’t know who to hire, right? And it wouldn’t work.

Or if you think about sports and games, where sports wouldn’t be fun to watch if you didn’t know who won the game, and it can be hard to figure out was the current move a good move, but you’ll find out later. And that’s what makes it exciting, right? You have this tension of, “This was an interesting move. What’s going to happen?” But at the end of the game, you look at the chessboard, you look at the go board, you know who won. At the end of the day, everyone knows. Or if you’re watching a soccer game, and the ball goes in the goal, it’s a goal. That’s it. Everyone knows.

And I think it is also true for scientific research. There’s certain research results that people are excited about, even though they didn’t know how to produce them. And sometimes we’re wrong about this, but it doesn’t mean that we can do this task perfectly — it’s just that it’s easier.
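The gap between evaluating and generating that Jan describes can be illustrated with a toy problem. The following is a minimal sketch, assuming subset sum as a stand-in example (it is not mentioned in the interview); the function names `verify` and `search` are hypothetical. Checking a proposed answer takes a handful of steps, while the naive way of producing one tries every subset.

```python
from itertools import combinations


def verify(numbers, target, candidate):
    """Evaluation: check a proposed subset in polynomial time."""
    pool = list(numbers)
    for x in candidate:
        if x not in pool:
            return False  # candidate uses a number that is not available
        pool.remove(x)  # each number may be used only as often as it appears
    return sum(candidate) == target


def search(numbers, target):
    """Generation: brute-force search, exponential in len(numbers)."""
    for size in range(len(numbers) + 1):
        for combo in combinations(numbers, size):
            if sum(combo) == target:
                return list(combo)
    return None  # no subset adds up to the target


if __name__ == "__main__":
    numbers = [3, 34, 4, 12, 5, 2]
    answer = search(numbers, 9)
    print(answer, verify(numbers, 9, answer))
```

This only illustrates the asymmetry for one toy problem. Whether alignment research has the same property is the claim Jan is arguing for, not something the sketch establishes.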

The Superalignment team isn't trying to train a really good ML researcher

Jan Leike: Our overall goal is to get to a point where we can automate alignment research. And what this doesn’t mean is we’re not trying to train a system that’s really good at ML research, or that is really smart or something. That’s not Superalignment’s job.

Rob Wiblin: I think a lot of people have been thinking that. I think they’ve read your announcement as saying that you’re trying to train a really good ML researcher, basically.

Jan Leike: I don’t think this would particularly differentially help alignment. I think it would be good to clarify. Basically, how I understand our job is that once there’s models that can do ML research, or things that are close to it — and I think that is something that’s going to happen anyway, and that will happen whether OpenAI does it or not — our job is to figure out how to make it sufficiently aligned that we can trust the alignment research or the alignment research assistance that it is producing. Because, essentially, if you’re asking this system to help you in your alignment research, there’s a big opportunity for the system to influence or try to nudge us into believing certain techniques are really good that actually aren’t. And thus, that system, or future systems, gain power over humans in a way that we actually don’t want, and that isn’t aligned with us. And so what we ultimately need to do is figure out how to make that system sufficiently aligned that we can actually trust it.

So that means, for example, let’s say this for simplicity: The system writes an alignment paper. Now you can read the paper, but just off the bat, you might not actually be able to find all the flaws in the paper. Or in general, the way scientific peer review is not perfect, and there’s lots of examples where people would go for decades with fake research before they’re being found out. So this is something that we have to really figure out how to avoid. So because alignment research or scientific research in general is a difficult task that humans aren’t that good at evaluating, at least not if you don’t have a lot of time to do it, the question then becomes: What kind of alignment techniques do we need in order to be sufficiently confident that this is the case?


Jan Leike: If you’re thinking about how do you align the superintelligence — how do you align the system that’s vastly smarter than humans? — I don’t know. I don’t have an answer. I don’t think anyone really has an answer. But it’s also not the problem that we fundamentally need to solve. Maybe this problem isn’t even solvable by humans who live today. But there’s this easier problem, which is how do you align the system that is the next generation? How do you align GPT-N+1? And that is a substantially easier problem.

And then even more, if humans can solve that problem, then so should a virtual system that is as smart as the humans working on the problem. And so if you get that virtual system to be aligned, it can then solve the alignment problem for GPT-N+1, and then you can iteratively bootstrap yourself until you’re at superintelligence level and you’ve figured out how to align that. And, of course, what’s important when you’re doing this is, at each step, you have to make enough progress on the problem that you’re confident that GPT-N+1 is aligned enough that you can use it for alignment research.

Did the release of ChatGPT increase or reduce AI extinction risk?

Rob Wiblin: OK, here’s another question: “OpenAI’s decision to create and launch ChatGPT has probably sped up AI research because there’s now a rush into the field as people were really impressed with it. But it has also prompted a flurry of concerns about safety and new efforts to do preparation ahead of time to see off possible threats. With the benefit of hindsight, do you think the move to release ChatGPT increased or reduced AI extinction risk, all things considered?”

Jan Leike: I think that’s a really hard question. I don’t know if we can really definitively answer this now. What do I think? I think, fundamentally, it probably would have been better to wait with ChatGPT and release it a little bit later. I think also, to some extent, this whole thing was inevitable, and at some point, the public will have realised how good language models have gotten. You could also say it’s been surprising that it went this long before that was the case. And I was honestly really happy how much it has shifted the conversation, or advanced the conversations — around risks from AI, but also the real alignment work that has been happening on how we can actually make things so much better, and we should do more of that. And I think both of these are really good. And you can now argue over what the timing should have been and whether it would have happened anyways. I think it would have happened anyways.

On a high level, people are asking these questions, which are really good questions to ask, like: Can we all just stop doing AI if we wanted to? It feels so easy. Just stop. Just don’t do it. Like, wouldn’t that be a good thing? But then also in practice, there’s just so many forces in the world that keep this going, right? Like, let’s say OpenAI just decides we’re not going to train a more capable model. Just not do it. OpenAI could do that. And then there’s a bunch of OpenAI competitors who might still do it, and then you still have AI. OK, let’s get them on board: let’s get the top five AGI labs, or the five tech companies that will train the biggest models, and get them to promise it. OK, now they promised. Well, now there’s going to be a new startup. There’s going to be tonnes of new startups.

And then you get into how people are still making transistors smaller, so you’ll just get more capable GPUs — which means the cost to train a model that is more capable than any other model that has been trained so far still goes down exponentially year over year. So now you’re going to semiconductor companies, and you’re like, “Can you guys chill out?” And fine, you could get them on board. And now there’s upstream companies who work on UV lithography or something, and they’re working on making the next generation of chips, and have been working on this since the ’90s. And then you get them to chill out.

It’s a really complicated coordination problem, and it’s not even that easy to figure out who else is involved. Personally, I think humanity can do a lot of things if it really wants to. And if things actually get really scary, there’s a lot of things that can happen. But also, fundamentally, I think it’s not an easy problem to solve, and I don’t want to assume it’s being solved. What I want to do is I want to ensure we can make as much alignment progress as possible in the time that we have. And then if we get more time, great. Then maybe we’ll need more time, and then we’ll figure out how to do that. But what if we don’t? I still want to be able to solve alignment. I still want to win in the worlds where we don’t get extra time — where, for whatever reason, things just move ahead. And so however it goes, you could still come back to the question of, “How do we solve these technical questions as quickly as possible?” And I think that’s what we really need to do.

There's something interesting going on with RLHF

Jan Leike: I think in general, people are really excited about the research problems that we are trying to solve. And in a lot of ways, I think they’re really interesting from a machine learning perspective. I think also, I don’t know, I think the announcement kind of showed that we are serious about working on this and that we are trying to get a really high-calibre team on this problem, and that we are trying to make a lot of progress quickly and tackling ambitious ideas. Especially in the last six months or so, there’s been a lot more interest from the machine learning community in these kinds of problems.

And I also think the success of ChatGPT and similar systems has made it really clear that there’s something interesting going on with RLHF. And there’s something interesting about this; there’s something real about this alignment problem, right? Like, if you compare ChatGPT to the original base model, they’re actually quite different, and there’s something important that’s happening here.

Rob Wiblin: Yeah. I listened back to our interview from five years ago, and we talked a lot about reinforcement learning from human feedback, because that was new and that was the hot thing back then. Was OpenAI or you involved in coming up with that method?

Jan Leike: Yes. That’s right. I think more accurately, probably a lot of different people in the world invented it. And before we did the “Deep reinforcement learning from human preferences” paper, there was other previous research that had done RL from human feedback in various forms. But it wasn’t using deep learning systems, and it was mostly just proof-of-concept style things. And then the deep RL from human preferences paper was joint work with Paul Christiano and Dario Amodei and me. I think we kind of all independently came to the conclusion that this is the way to go, and then we collaborated.

Rob Wiblin: And that’s turned out to be really key to getting ChatGPT to work as well as it does, right?

Jan Leike: That’s right. It’s kind of been wild to me how well it actually worked. If you look at the original InstructGPT paper, one of the headline results that we had was that actually, the GPT-2 sized system — which is two orders of magnitude smaller than GPT-3 in terms of parameter count — the InstructGPT version of that was preferred over the GPT-3 base model. And so this vastly cheaper, simpler, smaller system, actually, once you made it aligned, it’s so much better than the big system. And to some extent, it’s not surprising, because if you train it on human preferences, of course, it’s going to be better for human preferences.

Rob Wiblin: But it packs a huge punch.

Jan Leike: Yeah. But also, why the hell haven’t you been training on human preferences? Obviously, that’s what you should do, because that’s what you want: you want a system that humans prefer. In hindsight, it’s so obvious. You know?
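To make "training on human preferences" concrete, here is a minimal sketch, assuming a Bradley-Terry style comparison model of the kind commonly used for reward modelling in RLHF. It is an illustration under that assumption, not OpenAI's implementation, and the function names are hypothetical. A reward model scores two responses, and the loss is small when the response the human preferred gets the higher score.

```python
import math


def preference_probability(reward_preferred, reward_rejected):
    """Modelled probability that the human prefers the first response."""
    diff = reward_preferred - reward_rejected
    if diff >= 0:
        return 1.0 / (1.0 + math.exp(-diff))
    # Rearranged form avoids overflow when diff is very negative.
    e = math.exp(diff)
    return e / (1.0 + e)


def preference_loss(reward_preferred, reward_rejected):
    """Negative log-likelihood of the human's choice, computed stably."""
    x = reward_rejected - reward_preferred
    # softplus(x) = log(1 + exp(x)), written to avoid overflow for large |x|
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


if __name__ == "__main__":
    # The reward model agrees with the human: low loss.
    print(preference_loss(2.0, 0.0))
    # The reward model disagrees with the human: high loss.
    print(preference_loss(0.0, 2.0))
```

In a full pipeline this loss would be minimised over many human comparisons to fit the reward model, which is then used as the training signal for the policy. Those steps are omitted here.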

Backup plans

Rob Wiblin: So you and your team are going to do your absolute best with this, but it might not work out. I suppose if you don’t manage to solve this problem, and we just barrel ahead with capabilities, then the end result could conceivably be that everyone dies. So in that situation, it seems like humanity should have a backup plan, hopefully several backup plans, if only so that the whole weight of the world isn’t resting on your shoulders, so that you can get some sleep at night.

What sort of backup plan would you prefer us to have? Do you have any ideas there?

Jan Leike: I mean, there’s a lot of other kinds of plans that are already in motion. This is not the world’s only bet. There’s alignment teams at Anthropic and DeepMind; they’re trying to solve a similar problem. There’s various ways you could try to buy more time or various other governance structures that you want to put in place to govern AI and make sure it’s used beneficially. I think solving the core technical challenges of alignment is going to be critically important, but they won’t be the only ones. We still have to make sure that AI is aligned with some kind of notion of democratic values, or not something that tech companies decide unilaterally. And we still have to do something about misuse from AI. And aligned systems wouldn’t let themselves be misused if they can help it.

But, you know, there’s still a question of how it fits into the larger context of what’s going on in society, right? As a human, you can be working for an organisation that you don’t really understand what it does, and it’s actually net negative without you being able to see that. Or you know, just because we can align OpenAI’s models, doesn’t mean that somebody else doesn’t build unaligned AI. How do you solve that problem? That seems really important. How do you make sure that AI doesn’t differentially empower people who are already powerful, but also helps marginalised groups? That seems really important.

And then, ultimately, you also want to be able to avoid these structural risks. Let’s say we solve alignment, and everyone makes systems really aligned with them. But then what ends up happening is that you kind of just turbo-charged the existing capitalist system. Essentially, corporations get really good at maximising their shareholder returns because that’s what they aligned AIs to. But then humans fall by the wayside where that doesn’t necessarily encompass all the other things you value — clean air or something. And we have seen early indications of this. Global warming is happening even though we know the fundamental problem, but progress and all the economic activity that we do still drives it forward. And so even though we do all of these things right, we might still get into a system that ends up being bad for humans, even though nobody actually who participates in the system wants it that way.

Rob Wiblin: So you’re going to do your job, but a lot of other people have also got to do their jobs. It’s a broad ecosystem.

Jan Leike: That’s right. There’s a lot to do. We need to make the future go well, and that requires many parts, and this is just one of them.

Should we be worried about connecting models to everything?

Rob Wiblin: Back in March, you tweeted:

Before we scramble to deeply integrate large language models everywhere in the economy, can we pause and think about whether it is wise to do so? This is quite immature technology and we don’t understand how it works. If we’re not careful we’re setting ourselves up for a lot of correlated failures.

A couple of days after that, OpenAI opened up GPT-4 to be connected to various plugins through its API. And one listener was curious to hear more about what you meant by that, and whether there might be a disagreement within OpenAI about how soon GPT-4 should be hooked up to the internet and integrated into other services.

Jan Leike: Yeah. I realised that tweet was somewhat ambiguous, and it was read in lots of different ways. Fundamentally, what plugins allow you to do is nothing on top of what you could do with the API, right? Plugins don’t really add anything fundamentally new that people couldn’t already do. And I think OpenAI is very aware of what can go wrong when you hook up plugins to the system — you know, you have to have the sandbox, you have to be careful when you let people spend money, and all of these questions. But they’re also like sitting right next to us, and we talk to them about it, and they’ve been thinking about it.

But given how much excitement there was to just try GPT-4 on all the things, what I really wanted to do also is say: look, this is not quite mature. The system will fail. Don’t connect it to all of the things yet. Make sure there’s a fallback system. Make sure you’ve really played with the model to understand its limitations. If you have the model write code, make sure you’re reading the code and understanding it, or executing it in the sandbox, because otherwise, wherever you’re writing the code, it might break that system. And just be careful. Be wise. Make sure you understand what you’re doing here, and not just hook it up to everything. Like, see how it goes.


Rob Wiblin: On this topic of just plugging things into the internet, many years ago, people talked a lot about how they kind of had this assumption that if we had the intelligence system that was as capable as GPT-4, that probably we would keep it in a lead-contained box and wouldn’t plug it up to the internet, because we’d be worried about it. But it seems like the current culture is just that as soon as a model is made, it just gets deployed onto the internet right away.

Jan Leike: That’s not quite right. We had GPT-4 for eight months before it was publicly available. And we did a lot of safety tests; we did a lot of red teaming. We made a lot of progress on its alignment, and we didn’t just connect it to everything immediately. But I think what you’re actually trying to say is, many years ago, people were arguing over, “If you make AGI, can’t you just keep it in the box? And then it’ll never break out and will never do anything bad.” And you’re like, well, it seems like that ship has sailed. We’re connecting it to everything. And that’s partially what I’m trying to allude to here: we should be mindful when we do connect it.

And just because GPT-4 is on the API, it doesn’t mean that every future model will be immediately available for everything and everyone in every case. This is the difficult line that you have to walk, where you want to empower everyone with AI, or as many people as possible, but at the same time, you have to also be mindful of misuse, and you have to be mindful of all the other things that could go wrong with the model, misalignment being one of them. So how do you balance that tradeoff? That’s one of the key questions.

Rob Wiblin: It seems like one way of breaking it up would be connected to the internet versus not. But I feel that often people — I’m guilty of this as well — we’re just thinking that either it’s deployed on the internet and consumers are using it, or it’s safely in the lab, and there’s no problem. But there’s an intermediate stage where —

Jan Leike: There could also be problems if you have it in a lab.

Rob Wiblin: That’s what I’m saying. That’s exactly what I’m saying. And I feel like sometimes people lose track of that. You know, misuse is kind of an issue if it reaches the broader public, but misalignment can be an issue if something is merely trained and is just being used inside a company — because it will be figuring out how it could end up having broader impacts. And I think because we tend to cluster all of these risks, or tend to speak very broadly, the fact that a model could be dangerous if it’s simply trained — even if it’s never hooked up to the internet — is something that we really need to keep in mind. I guess it sounds like, at OpenAI, people will keep that in mind.

Jan Leike: That’s right. And safety reviews really need to start before you even start the training run, right?

Jobs with the Superalignment team

Jan Leike: We are primarily hiring for research engineers, research scientists, and research managers, and I expect we’ll be continuing to hire a lot of people. It’ll probably be at least 10 before the end of the year, is my guess. And then maybe even more in the years after that.

So what do these research engineers, research scientists, and research managers roles look like? In a way, we don’t actually make a strong distinction between research engineer and research scientist at OpenAI. In each of these roles, you’re expected to write code, and you’re expected to run your own experiments. And in fact, I think it’s really important to always be running lots of experiments, small experiments, testing your ideas quickly, and then iterating and trying to learn more about the world.

In general there’s no PhD required, also for the research scientist roles. And really, you don’t even have to have worked in alignment before. And in fact, it might be good if you didn’t, because you’ll have a new perspective on the problems that we’re trying to solve. What we generally love for people to bring, though, is a good understanding of how the technology works. Do you understand language models? You understand reinforcement learning, for example. You can build and implement ML experiments and debug them.

On the more research scientist end of the spectrum, I think you would be expected a lot more to think about what experiments to do next, or come up with ideas of how can we address the problems that we are trying to solve, or what are some other problems that we aren’t thinking about that maybe we should be thinking about, or how should we design the experiments that will let us learn more?

And then on the research engineering [end of the] spectrum, there’s a lot of just actually build the things that let us run these things. And let’s make the progress. We already know if we have a bunch of good ideas, that will not be enough, right? We actually have to then test them, and build them, and actually ship something that other people can use. And that involves writing a lot of code. And that involves debugging ML, and running lots of experiments, getting big training runs on GPT-4 and other big models set up.

I think in practice, actually, most people on the team kind of move somewhere on the spectrum. Sometimes there’s more coding because we kind of know what to do. Sometimes it’s more researchy because we don’t yet know what to do, and we’re kind of starting a new project. But yeah, in general, you need a lot of critical thinking, and asking important questions, and being very curious about the world and the technology that we’re building.

And for the research manager, basically that’s a role where you’re managing a small- or medium-sized or even a large team of research engineers and research scientists towards a specific goal. So there, you should be setting the direction of: What are the next milestones? Where should we go? How can we make this vague question of we want to understand this type of generalisation, or we want to make a dataset for automated alignment, or something like that. You have to break it down and make it more concrete, and then figure out what people can be doing. But also, there’s a lot of just day-to-day management of how can we make people motivated and productive, but also make sure they can work together, and just traditional management stuff.

About the show

The 80,000 Hours Podcast features unusually in-depth conversations about the world's most pressing problems and how you can use your career to solve them. We invite guests pursuing a wide range of career paths — from academics and activists to entrepreneurs and policymakers — to analyse the case for and against working on different issues and which approaches are best for solving them.