Can there be a more exciting and strange place to work today than a leading AI lab? Your CEO has said they’re worried your research could cause human extinction. The government is setting up meetings to discuss how this outcome can be avoided. Some of your colleagues think this is all overblown; others are more anxious still.

Today’s guest — machine learning researcher Rohin Shah — goes into the Google DeepMind offices each day with that peculiar backdrop to his work.

He’s on the team dedicated to maintaining ‘technical AI safety’ as these models approach and exceed human capabilities: basically that the models help humanity accomplish its goals without flipping out in some dangerous way. This work has never seemed more important.

In the short term it could be the key bottleneck to deploying ML models in high-stakes real-life situations. In the long term, it could be the difference between humanity thriving and disappearing entirely.

For years Rohin has been on a mission to fairly hear out people across the full spectrum of opinion about risks from artificial intelligence — from doomers to doubters — and properly understand their point of view. That makes him unusually well placed to give an overview of what we do and don’t understand. He has landed somewhere in the middle — troubled by ways things could go wrong, but not convinced there are very strong reasons to expect a terrible outcome.

Today’s conversation is wide-ranging and Rohin lays out many of his personal opinions to host Rob Wiblin, including:

  • What he sees as the strongest case both for and against slowing down the rate of progress in AI research.
  • Why he disagrees with most other ML researchers that training a model on a sensible ‘reward function’ is enough to get a good outcome.
  • Why he disagrees with many on LessWrong that the bar for whether a safety technique is helpful is “could this contain a superintelligence.”
  • That he thinks nobody has very compelling arguments that AI created via machine learning will be dangerous by default, or that it will be safe by default. He believes we just don’t know.
  • That he understands that analogies and visualisations are necessary for public communication, but is sceptical that they really help us understand what’s going on with ML models, because they’re different in important ways from every other case we might compare them to.
  • Why he’s optimistic about DeepMind’s work on scalable oversight, mechanistic interpretability, and dangerous capabilities evaluations, and what each of those projects involves.
  • Why he isn’t inherently worried about a future where we’re surrounded by beings far more capable than us, so long as they share our goals to a reasonable degree.
  • Why it’s not enough for humanity to know how to align AI models — it’s essential that management at AI labs correctly pick which methods they’re going to use and have the practical know-how to apply them properly.
  • Three observations that make him a little more optimistic: humans are a bit muddle-headed and not super goal-orientated; planes don’t crash; and universities have specific majors in particular subjects.
  • Plenty more besides.

Get this episode by subscribing to our podcast on the world’s most pressing problems and how to solve them: type ‘80,000 Hours’ into your podcasting app. Or read the transcript below.

Producer: Keiran Harris
Audio mastering: Milo McGuire, Dominic Armstrong, and Ben Cordell
Transcriptions: Katy Moore


The mood at DeepMind

Rohin Shah: So I think there has been quite a lot of stuff happening, and it’s definitely affected the mood at DeepMind. But DeepMind is such a big place; there’s such a wide diversity of people and opinions and so on that I don’t really feel like I actually have a good picture of all the ways that things have changed.

To take some examples, there’s obviously a bunch of people who are concerned about existential risk from AI. I think many of us have expected things to heat up in the future and for the pace of AI advances to accelerate. But seeing that actually play out is really leading to an increased sense of urgency, and like, “Man, it’s important for us to do the work that we’re doing.”

There’s a contingent of people who I think view DeepMind more as a place to do nice, cool, interesting work in machine learning, but weren’t thinking about AGI that much in the past. Now I think it feels a lot more visceral to them that, no, actually, maybe we will build AGI in the nearish future. I think this has caused people to change in a variety of ways, but one of the ways is that they tend to be a little bit more receptive to arguments about risk and so on — which has been fairly vindicating or something for those of us who’ve been thinking about this for years.

Then there’s also a group of people who kind of look around at everybody else and are just a little bit confused as to why everyone is reacting so strongly to all of these things, when it was so obviously predictable from the things we saw a year or two ago. I also feel some of that myself, but there are definitely other people who lean more into that than I do.

Rob Wiblin: On the potentially greater interest in your work, I would expect that people would be more interested in what the safety and alignment folks are doing. Because I guess one reason to not take a big interest in the past was just thinking that that’s all good and well and might be useful one day, but the models that we have currently can’t really do very much other than play Go or imagine how a protein might fold. But now that you have models that seem like they could be put towards bad uses (or indeed, might semi-autonomously start doing things that weren’t desirable), push has come to shove: it seems like maybe DeepMind and other labs need the kind of work that you folks have been doing in order to make these products safe to deploy.

Rohin Shah: Yeah, that’s exactly right. At this point, something like alignment work (or just generally making sure that your model does good things and not bad things, which is perhaps a bit broader than alignment) is sort of the bottleneck to actually getting useful products out into the world. We’ve definitely got the capabilities needed at this point to build a lot of useful products, but can we actually leverage those capabilities in the way that we want them to be leveraged? Not obviously yes. It’s kind of unclear right now, and there’s just been an increasing recognition of that across the board. And this is not just about alignment, but also the folks who have been working on ethics, fairness, bias, privacy, disinformation, et cetera. I think there’s a lot of interest in all of the work they’re doing as well.

Scepticism around 'one distinct point where everything goes crazy'

Rohin Shah: I think there is a common meme floating around that once you develop AGI, that’s the last thing that humanity does — after that, our values are locked in and either it’s going to be really great or it’s going to be terrible.

I’m not really sold on this. I guess I have a view that AI systems will become more powerful relatively continuously, so there won’t be one specific point where you’re like, “This particular thing is the AGI with the locked-in values.” This doesn’t mean that it won’t be fast, to be clear — I do actually think that it will feel crazy fast by our normal human intuitions — but I do think it will be like, capabilities improve continuously and there’s not one distinct point where everything goes crazy.

That’s part of the reason for not believing this lock-in story. The other part of the reason is I expect that AI systems will be doing things in ways similar to humans. So probably it will not be like, “This is the one thing that the universe should look like, and now we’re going to ensure that that happens.” Especially if we succeed at alignment, instead it will be the case that the AI systems are helping us figure out what exactly it is that we want — through things like philosophical reflection, ideally, or maybe the world continues to get technologies at a breakneck speed and we just frantically throw laws around and regulations and so on, and that’s the way that we make progress on figuring out what we want. Who knows? Probably it will be similar to what we’ve done in the past as opposed to some sort of value lock-in.

Rob Wiblin: I see. So if things go reasonably well, there probably will be an extended period of collaboration between these models and humans. So humans aren’t going to go from making decisions and being in the decision loop one day to being completely cut out of it the next. It’s maybe more of a gradual process of delegation and collaboration, where we trust the models more and give them kind of more authority, perhaps?

Rohin Shah: That’s right. That’s definitely one part of it. Another part that I would say is that we can delegate a lot of things that we know that we want to do to the AI systems (such as acquiring resources, inventing new technology, things like that) without also delegating “…and now you must optimise the universe to be in this perfect state that we’re going to program in by default.” We can still leave the question of what exactly we are going to do with this cosmic endowment in the hands of humans, or in the hands of humans assisted by AIs, or to some process of philosophical reflection. Or I’m sure the future will come up with better suggestions than I can today.

Rob Wiblin: Yeah. What would you say to people in the audience who do have this alternative view that humans could end up without much decision-making power extremely quickly? Why don’t you believe that?

Rohin Shah: I guess I do think that’s plausible via misalignment. This is all conditional on us actually succeeding at alignment. If people are also saying this even conditional on succeeding at alignment, my guess is that this is because they’re thinking that success at alignment involves instilling all of human values into the AI system and then saying “go.” I would just question why that’s their vision of what alignment should be. It doesn’t seem to me like alignment requires you to go down that route, as opposed to the AI systems are just doing the things that humans want. In cases where humans are uncertain about what they want, the AI systems just don’t do stuff, take some cautious baseline.

Could we end up in a terrifying world even if we mostly succeed?

Rob Wiblin: Even in a case where all of this goes super well, it feels like the endgame that we’re envisaging is a world where there are millions or billions of beings on the Earth that are way smarter and more capable than any human being. Lately, I have kind of begun to envisage these creatures as demigods. I think maybe just because I’ve been reading this recently released book narrated by Stephen Fry of all of the stories from Greek mythology.

I guess in practice, these beings would be much more physically and mentally powerful than any individual person, and these minds would be distributed around the world. So in a sense, they can be in many places at once, and they can successfully resist being turned off if they don’t want to be. I guess they could in theory, like the gods in these Greek myths regularly do, just go and kind of kill someone as part of accomplishing some other random goal that has nothing in particular to do with them, just because they don’t particularly concern themselves with human affairs.

Why isn’t that sort of vision of the future just pretty terrifying by default?

Rohin Shah: I think that vision should be pretty terrifying, because in this vision, you’ve got these godlike creatures that just go around killing humans. Seems pretty bad. I don’t think you want the humans to be killed.

But the thing I would say is ultimately this really feels like it turns on how well you succeeded at alignment. If you instead say basically everything you said, but you remove the part about killing humans — just like there are millions or billions of beings that are way smarter, more capable, et cetera — then this is actually kind of the situation that children are in today. There are lots of adults around. The adults are way more capable, way more physically powerful, way more intelligent. They definitely could kill the children if they wanted to, but they don’t — because, in fact, the adults are at least somewhat aligned with the interests of children, at least to the point of not killing them.

The children aren’t particularly worried about the adults going around and killing them, because they’ve just existed in a world where the adults are in fact not going to kill them. All that empirical experience has really just trained them — this isn’t true for all children, but at least for some children — to believe that the world is mostly safe. And so they can be pretty happy and function in this world where, in principle, somebody could just make their life pretty bad, but it doesn’t in fact actually happen.

Similarly, I think that if we succeed at alignment, probably that sort of thing is going to happen with us as well: we’ll go through this rollercoaster of a ride as the future gets increasingly more crazy, but then we’ll get — pretty quickly, I would guess — acclimated to this sense that most of the things are being done by AI systems. They generally just make your life better; things are just going better than they used to be; it’s all pretty fine.

I mostly think that once the experience actually happens — again, assuming that we succeed at alignment — then people will probably be pretty OK with it. But I think it’s still in some sense kind of terrifying from the perspective now, because we’re just not that used to being able to update on experiences that we expect to have in the future before we’ve actually had them.

Is it time to slow down?

Rohin Shah: I think I would be generally in favour of the entire world slowing down on AI progress if we could somehow enforce that that was the thing that would actually happen. It’s less clear whether any individual actor should slow down their AI progress, but I’m broadly in favour of the entire world slowing down.

Rob Wiblin: Is that something that you find plenty of your colleagues are sympathetic to as well?

Rohin Shah: I would say that DeepMind isn’t a unified whole. There’s a bunch of diversity in opinion, but there are definitely a lot of colleagues (including ones who are not working on specifically x-risk-focused teams) who believe the same thing. I think it’s just that you see the AI world in particular getting kind of crazy over the last few months, and it’s not hard to imagine that maybe we should slow down a bit, try to take stock, and understand things a little bit better before we advance even further.


Rob Wiblin: What’s the most powerful argument against adopting the proposal in [the FLI Pause Giant AI Experiments] letter?

Rohin Shah: So one of the things that’s particularly important, for at least some kinds of safety research, is to be working with the most capable models that you have. For example, if you’re using the AI models to provide critiques of each other’s outputs, they’ll give better critiques if they’re more capable, and that enables your research to go faster. Or you could try making proofs of concept, where you try to actually make one of your most powerful AI systems misaligned in a simulated world, so that you have an example of what misalignment looks like that you can study. That, again, gets easier the more you have more capable systems.

There’s this resource of capabilities-adjusted safety time that you care about, and it’s plausible that the effect of this open letter — again, I’m only saying plausible; I’m not saying that this will be the effect — but it’s plausible that the effect of a pause would be to decrease the amount of time that we have with some things one step above GPT-4, without actually increasing the amount of time until AGI, or powerful AI systems that pose an x-risk. Because all of the things that drive progress towards that — hardware progress, algorithmic efficiency, willingness for people to pay money, and so on — those might all just keep marching along during the pause, and those are the things that determine when powerful x-risky systems come.

So on this view, you haven’t actually changed the time to powerful AI systems, but you have gotten rid of some of the time that safety researchers could have had with the next thing after GPT-4.

Why solving the technical side might not be enough

Rohin Shah: I think there are two ways in which this could fail to be enough.

One is just, again, the misuse and structural risks that we talked about before: you know, great power war, authoritarianism, value lock-in, lots of other things. And I’m mostly going to set that to the side.

Another thing that maybe isn’t enough, another way that you could fail if you had done this, is that maybe the people who are actually building the powerful AI systems don’t end up using the solution you’ve come up with. Partly I want to push back against this notion of a “solution” — I don’t really expect to see a clear technique, backed by a proof-level guarantee that if you just use this technique, then your AI systems will do what their designers intended them to do, let alone produce beneficial outcomes for the world.

Even just that. If I expected to get a technique that really got that sort of proof-level guarantee, I’d feel pretty good about being able to get everyone to use that technique, assuming it was not incredibly costly.

Rob Wiblin: But that probably won’t be how it goes.

Rohin Shah: Yeah. Mostly I just expect it will be pretty messy. There’ll be a lot of public discourse. A lot of it will be terrible. Even amongst the experts who spend all of their time on this, there’ll be a lot of disagreement on what exactly is the right thing to do, what even are the problems to be working on, which techniques work the best, and so on and so forth.

I think it’s just actually reasonably plausible that some AI lab that builds a really powerful, potentially x-risky system ends up using some technique or strategy for aligning AI — where if they had just asked the right people, those people could have said “No, actually, that’s not the way you should do it. You should use this other technique: it’s Pareto better, it’s cheaper for you to run, it will do a better job of finding the problems, it will be more likely to produce an aligned AI system,” et cetera. It seems plausible to me. When I tell that story, I’m like, yeah, that does sound like a sort of thing that could happen.

As a result, I often think of this separate problem of like how do you ensure that the people who are building the most powerful AI systems are also getting the best technical advice on how they should be aligning their systems? Or are the people who know the most about how to align their systems? In fact, this is where I expect most of my impact to come from, is advising some AGI lab, probably DeepMind, on what the best way is to align their AI systems.

The value of raising awareness of the risks among non-experts

Rohin Shah: I think there’s definitely a decent bit of value in having non-experts think that the risks are real. One is that it can build a political will for things like this FLI open letter, but also maybe in the future, things like government regulation, should that be a good idea. So that’s one thing.

I think also people’s beliefs depend on the environment that they’re in and what other people around them are saying. I think this will be also true for ML researchers or people who could just actually work on the technical problem. To the extent that a mainstream position in the broader world is that ML could in fact be risky and this is for non-crazy reasons — I don’t know, maybe it will be for crazy reasons — but to the extent it’s for non-crazy reasons, I think plausibly that could also just end up leading to a bunch of other people being convinced who can more directly contribute to the problem.

That’s about just talking about it. Of course, people can take more direct action as well.

Rob Wiblin: Yeah. If you’re someone who’s not super well informed but you’re generally troubled by all of this, is there anything that you would suggest that someone in that position do or not do?

Rohin Shah: I unfortunately don’t have great concrete advice here. There’s various stuff like advocacy and activism, which I am a little bit sceptical of. Mostly I’m worried that the issues are actually pretty nuanced and subtle, and it’s very easy for people to be wrong, including the people who are working full-time on it. This just seems like a bad fit, for activism at least, and probably advocacy too. I tend to be a little more bearish on that from people who are not spending that much time on thinking about it.

There’s, of course, the obvious normal classic of just donating money to nonprofits that are trying to do work on this problem, which is something that maybe not everyone can do, but a lot of listeners, I expect, will be able to do. I do wish I had more things to suggest.

How much you can trust some conceptual arguments

Rohin Shah: I think the field of ML in general is fairly allergic to conceptual arguments — because, for example, people have done a lot of theory work trying to explain how neural networks work, and then these just don’t really predict the results of experiments all that well. So there’s much more of a culture of like, “Do the experiment and show it to me that way. I’m not just going to buy your conceptual argument.”

And I kind of agree with this; I definitely think that a lot of the conceptual work on the Alignment Forum feels like it’s not going to move me that much. Not necessarily that it’s wrong; just that it’s not that much evidence, because I expect there are lots of counterarguments that I haven’t thought about. But I do think that there are occasionally some conceptual arguments that do in fact feel really strong and that you can update on. I think many ML researchers, even if you present them with these arguments, will still mostly be like, “Well, you’ve got to show me the empirical results.”

Rob Wiblin: What’s an example of one of those?

Rohin Shah: One example is the argument that even if you have a correct reward function, that doesn’t mean that the trained neural network you get out at the end is going to be optimising that reward function. Let me spell that out a bit more. What happens when you’re training neural networks is that even in reinforcement learning — which we can take for simplicity, because probably people are a bit more familiar with that — you have a reward function. But the way that your neural network gets trained is that it gets some episodes or experience in the environment and then those are scored by the reward function and used to compute gradients, which are then used to change the weights of the neural network. If you condition on what the gradients are, that fully screens off all of the information about the reward — and also the data, but the important point is the reward.

And so if there were multiple reward functions that, given the same data that the agent actually experienced, would have produced the same gradients, then the neural network that you learn would be the same regardless of which of those multiple reward functions was actually the real reward function that you were using. So even if you were like, “The model is totally going to optimise one of these reward functions,” you should not, without some additional argument, be confident in which of the many reward functions that are consistent with the gradients the model might be optimising.

Rob Wiblin: So to see if I’ve understood that right, is this the issue that if there are multiple different goals or multiple different reasons why a model might engage in a particular set of behaviours, in theory, merely from observing the outputs, you can’t be confident which one you’ve actually created? Because they would all look the same?

Rohin Shah: Yeah, that’s right.
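
The argument above can be illustrated with a small sketch. This is not code from the episode; it is a toy setup made up for illustration. The states, the logistic policy, the fixed batch of experience, and the names `reward_a`, `reward_b`, and `train` are all assumptions. It also simplifies real reinforcement learning by reusing one fixed batch of experience instead of resampling from the policy. The two reward functions agree on every state the agent actually visited and disagree only on states it never saw, so they produce identical gradients and identical trained weights.

```python
import math

# Hypothetical fixed experience: (state, action) pairs the agent actually saw.
# Only states 0 and 1 are ever visited; actions are 0 or 1.
EXPERIENCE = [(0, 1), (1, 0), (0, 1), (1, 1), (0, 0)]


def reward_a(state, action):
    """Rewards action 1 in every state."""
    return 1.0 if action == 1 else 0.0


def reward_b(state, action):
    """Matches reward_a on visited states 0 and 1, but prefers action 0 elsewhere."""
    if state in (0, 1):
        return 1.0 if action == 1 else 0.0
    return 1.0 if action == 0 else 0.0


def prob_action_1(weights, state):
    """Logistic policy: P(action = 1 | state), using features (1, state)."""
    z = weights[0] + weights[1] * state
    return 1.0 / (1.0 + math.exp(-z))


def policy_gradient(weights, reward_fn):
    """REINFORCE-style gradient estimate over the fixed batch of experience."""
    grad = [0.0, 0.0]
    for state, action in EXPERIENCE:
        p1 = prob_action_1(weights, state)
        # For a logistic policy, d/dz log pi(action | state) = action - p1.
        dlogp = action - p1
        r = reward_fn(state, action)
        grad[0] += r * dlogp
        grad[1] += r * dlogp * state
    return grad


def train(reward_fn, steps=50, lr=0.1):
    """Gradient ascent from zero weights; the reward enters only via gradients."""
    weights = [0.0, 0.0]
    for _ in range(steps):
        grad = policy_gradient(weights, reward_fn)
        weights = [w + lr * g for w, g in zip(weights, grad)]
    return weights


weights_a = train(reward_a)
weights_b = train(reward_b)

# Same gradients at every step, so the learned weights are identical.
print(weights_a == weights_b)
# Yet the two reward functions disagree on the unvisited state 3.
print(reward_a(3, 1), reward_b(3, 1))
```

In this sketch, nothing in the trained weights can tell you which of the two reward functions was the real one, which is the sense in which conditioning on the gradients screens off the reward. What the policy does in state 3 depends on how it happens to generalise, not on what either reward function says about that state.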

Articles, books, and other media discussed in the show

Technical AI safety work Rohin is excited about:

  • Scalable oversight — trying to improve the quality of human feedback that you use to train an AI system — such as DeepMind’s work with Sparrow, a dialogue agent that’s useful and reduces the risk of unsafe and inappropriate answers
  • Mechanistic interpretability — trying to understand how a model is making the decisions that it’s making
  • Dangerous capability evaluations — such as the work done by ARC Evals on GPT-4 and Claude and Owain Evans at the Stanford Existential Risks Initiative
  • Redwood Research’s focus on neglected empirical alignment research directions

About the show

The 80,000 Hours Podcast features unusually in-depth conversations about the world's most pressing problems and how you can use your career to solve them. We invite guests pursuing a wide range of career paths — from academics and activists to entrepreneurs and policymakers — to analyse the case for and against working on different issues and which approaches are best for solving them.

The 80,000 Hours Podcast is produced and edited by Keiran Harris. Get in touch with feedback or guest suggestions by email.
