#154 – Rohin Shah on DeepMind and trying to fairly hear out both AI doomers and doubters

By Robert Wiblin and Keiran Harris · Published June 9th, 2023 ·

#154 – Rohin Shah on DeepMind and trying to fairly hear out both AI doomers and doubters

By Robert Wiblin and Keiran Harris · Published June 9th, 2023

Can there be a more exciting and strange place to work today than a leading AI lab? Your CEO has said they’re worried your research could cause human extinction. The government is setting up meetings to discuss how this outcome can be avoided. Some of your colleagues think this is all overblown; others are more anxious still.

Today’s guest — machine learning researcher Rohin Shah — goes into the Google DeepMind offices each day with that peculiar backdrop to his work.

He’s on the team dedicated to maintaining ‘technical AI safety’ as these models approach and exceed human capabilities: basically that the models help humanity accomplish its goals without flipping out in some dangerous way. This work has never seemed more important.

In the short-term it could be the key bottleneck to deploying ML models in high-stakes real-life situations. In the long-term, it could be the difference between humanity thriving and disappearing entirely.

For years Rohin has been on a mission to fairly hear out people across the full spectrum of opinion about risks from artificial intelligence — from doomers to doubters — and properly understand their point of view. That makes him unusually well placed to give an overview of what we do and don’t understand. He has landed somewhere in the middle — troubled by ways things could go wrong, but not convinced there are very strong reasons to expect a terrible outcome.

Today’s conversation is wide-ranging and Rohin lays out many of his personal opinions to host Rob Wiblin, including:

What he sees as the strongest case both for and against slowing down the rate of progress in AI research.
Why he disagrees with most other ML researchers that training a model on a sensible ‘reward function’ is enough to get a good outcome.
Why he disagrees with many on LessWrong that the bar for whether a safety technique is helpful is “could this contain a superintelligence.”
That he thinks nobody has very compelling arguments that AI created via machine learning will be dangerous by default, or that it will be safe by default. He believes we just don’t know.
That he understands that analogies and visualisations are necessary for public communication, but is sceptical that they really help us understand what’s going on with ML models, because they’re different in important ways from every other case we might compare them to.
Why he’s optimistic about DeepMind’s work on scalable oversight, mechanistic interpretability, and dangerous capabilities evaluations, and what each of those projects involves.
Why he isn’t inherently worried about a future where we’re surrounded by beings far more capable than us, so long as they share our goals to a reasonable degree.
Why it’s not enough for humanity to know how to align AI models — it’s essential that management at AI labs correctly pick which methods they’re going to use and have the practical know-how to apply them properly.
Three observations that make him a little more optimistic: humans are a bit muddle-headed and not super goal-orientated; planes don’t crash; and universities have specific majors in particular subjects.
Plenty more besides.

Get this episode by subscribing to our podcast on the world’s most pressing problems and how to solve them: type ‘80,000 Hours’ into your podcasting app. Or read the transcript below.

Producer: Keiran Harris
Audio mastering: Milo McGuire, Dominic Armstrong, and Ben Cordell
Transcriptions: Katy Moore

Highlights

The mood at DeepMind

Rohin Shah: So I think there has been quite a lot of stuff happening, and it’s definitely affected the mood at DeepMind. But DeepMind is such a big place; there’s such a wide diversity of people and opinions and so on that I don’t really feel like I actually have a good picture of all the ways that things have changed.
To take some examples, there’s obviously a bunch of people who are concerned about existential risk from AI. I think many of us have expected things to heat up in the future and for the pace of AI advances to accelerate. But seeing that actually play out is really leading to an increased sense of urgency, and like, “Man, it’s important for us to do the work that we’re doing.”
There’s a contingent of people who I think view DeepMind more as a place to do nice, cool, interesting work in machine learning, but weren’t thinking about AGI that much in the past. Now I think it feels a lot more visceral to them that, no, actually, maybe we will build AGI in the nearish future. I think this has caused people to change in a variety of ways, but one of the ways is that they tend to be a little bit more receptive to arguments about risk and so on — which has been fairly vindicating or something for those of us who’ve been thinking about this for years.
Then there’s also a group of people who kind of look around at everybody else and are just a little bit confused as to why everyone is reacting so strongly to all of these things, when it was so obviously predictable from the things we saw a year or two ago. I also feel some of that myself, but there are definitely other people who lean more into that than I do.
Rob Wiblin: On the potentially greater interest in your work, I would expect that people would be more interested in what the safety and alignment folks are doing. Because I guess one reason to not take a big interest in the past was just thinking that that’s all good and well and might be useful one day, but the models that we have currently can’t really do very much other than play go or imagine how a protein might fold. But now that you have models that seem like they could be put towards bad uses — or indeed, might semi-autonomously start doing things that weren’t desirable — push has come to shove: it seems like maybe DeepMind and other labs need the kind of work that you folks have been doing in order to make these products safe to deploy.
Rohin Shah: Yeah, that’s exactly right. At this point there’s something like alignment work — or just generally making sure that your model does good things and not bad things, which is perhaps a bit broader than alignment — is sort of the bottleneck to actually getting useful products out into the world. We’ve definitely got the capabilities needed at this point to build a lot of useful products, but can we actually leverage those capabilities in the way that we want them to be leveraged? Not obviously yes — it’s kind of unclear right now, and there’s just been an increasing recognition of that across the board. And this is not just about alignment, but also the folks who have been working on ethics, fairness, bias, privacy, disinformation, et cetera — I think there’s a lot of interest in all of the work they’re doing as well.

Scepticism around 'one distinct point where everything goes crazy'

Rohin Shah: I think there is a common meme floating around that once you develop AGI, that’s the last thing that humanity does — after that, our values are locked in and either it’s going to be really great or it’s going to be terrible.
I’m not really sold on this. I guess I have a view that AI systems will become more powerful relatively continuously, so there won’t be one specific point where you’re like, “This particular thing is the AGI with the locked-in values.” This doesn’t mean that it won’t be fast, to be clear — I do actually think that it will feel crazy fast by our normal human intuitions — but I do think it will be like, capabilities improve continuously and there’s not one distinct point where everything goes crazy.
That’s part of the reason for not believing this lock-in story. The other part of the reason is I expect that AI systems will be doing things in ways similar to humans. So probably it will not be like, “This is the one thing that the universe should look like, and now we’re going to ensure that that happens.” Especially if we succeed at alignment, instead it will be the case that the AI systems are helping us figure out what exactly it is that we want — through things like philosophical reflection, ideally, or maybe the world continues to get technologies at a breakneck speed and we just frantically throw laws around and regulations and so on, and that’s the way that we make progress on figuring out what we want. Who knows? Probably it will be similar to what we’ve done in the past as opposed to some sort of value lock-in.
Rob Wiblin: I see. So if things go reasonably well, there probably will be an extended period of collaboration between these models and humans. So humans aren’t going to go from making decisions and being in the decision loop one day to being completely cut out of it the next. It’s maybe more of a gradual process of delegation and collaboration, where we trust the models more and give them kind of more authority, perhaps?
Rohin Shah: That’s right. That’s definitely one part of it. Another part that I would say is that we can delegate a lot of things that we know that we want to do to the AI systems — such as acquiring resources, inventing new technology, things like that — without also delegating “…and now you must optimise the universe to be in this perfect state that we’re going to program in by default.” We can still leave what exactly are we going to do without this cosmic endowment in the hands of humans, or in the hands of humans assisted by AIs, or to some process of philosophical reflection — or I’m sure the future will come up with better suggestions than I can today.
Rob Wiblin: Yeah. What would you say to people in the audience who do have this alternative view that humans could end up without much decision-making power extremely quickly? Why don’t you believe that?
Rohin Shah: I guess I do think that’s plausible via misalignment. This is all conditional on us actually succeeding at alignment. If people are also saying this even conditional on succeeding at alignment, my guess is that this is because they’re thinking that success at alignment involves instilling all of human values into the AI system and then saying “go.” I would just question why that’s their vision of what alignment should be. It doesn’t seem to me like alignment requires you to go down that route, as opposed to the AI systems are just doing the things that humans want. In cases where humans are uncertain about what they want, the AI systems just don’t do stuff, take some cautious baseline.

Could we end up in a terrifying world even if we mostly succeed?

Rob Wiblin: Even in a case where all of this goes super well, it feels like the endgame that we’re envisaging is a world where there are millions or billions of beings on the Earth that are way smarter and more capable than any human being. Lately, I have kind of begun to envisage these creatures as demigods. I think maybe just because I’ve been reading this recently released book narrated by Stephen Fry of all of the stories from Greek mythology.
I guess in practice, these beings would be much more physically and mentally powerful than any individual person, and these minds would be distributed around the world. So in a sense, they can be in many places at once, and they can successfully resist being turned off if they don’t want to be. I guess they could in theory, like the gods in these Greek myths regularly do, just go and kind of kill someone as part of accomplishing some other random goal that has nothing in particular to do with them, just because they don’t particularly concern themselves with human affairs.
Why isn’t that sort of vision of the future just pretty terrifying by default?
Rohin Shah: I think that vision should be pretty terrifying, because in this vision, you’ve got these godlike creatures that just go around killing humans. Seems pretty bad. I don’t think you want the humans to be killed.
But the thing I would say is ultimately this really feels like it turns on how well you succeeded at alignment. If you instead say basically everything you said, but you remove the part about killing humans — just like there are millions or billions of beings that are way smarter, more capable, et cetera — then this is actually kind of the situation that children are in today. There are lots of adults around. The adults are way more capable, way more physically powerful, way more intelligent. They definitely could kill the children if they wanted to, but they don’t — because, in fact, the adults are at least somewhat aligned with the interests of children, at least to the point of not killing them.
The children aren’t particularly worried about the adults going around and killing them, because they’ve just existed in a world where the adults are in fact not going to kill them. All that empirical experience has really just trained them — this isn’t true for all children, but at least for some children — to believe that the world is mostly safe. And so they can be pretty happy and function in this world where, in principle, somebody could just make their life pretty bad, but it doesn’t in fact actually happen.
Similarly, I think that if we succeed at alignment, probably that sort of thing is going to happen with us as well: we’ll go through this rollercoaster of a ride as the future gets increasingly more crazy, but then we’ll get — pretty quickly, I would guess — acclimated to this sense that most of the things are being done by AI systems. They generally just make your life better; things are just going better than they used to be; it’s all pretty fine.
I mostly think that once the experience actually happens — again, assuming that we succeed at alignment — then people will probably be pretty OK with it. But I think it’s still in some sense kind of terrifying from the perspective now, because we’re just not that used to being able to update on experiences that we expect to have in the future before we’ve actually had them.

Is it time to slow down?

Rohin Shah: I think I would be generally in favour of the entire world slowing down on AI progress if we could somehow enforce that that was the thing that would actually happen. It’s less clear whether any individual actor should slow down their AI progress, but I’m broadly in favour of the entire world slowing down.
Rob Wiblin: Is that something that you find plenty of your colleagues are sympathetic to as well?
Rohin Shah: I would say that DeepMind isn’t a unified whole. There’s a bunch of diversity in opinion, but there are definitely a lot of colleagues — including not ones who are working on specifically x-risk-focused teams — who believe the same thing. I think it’s just you see the AI world in particular getting kind of crazy over the last few months, and it’s not hard to imagine that maybe we should slow down a bit and try and take stock, and get a little bit better before we advance even further.
—-
Rob Wiblin: What’s the most powerful argument against adopting the viability of [the FLI Pause Giant AI Experiments] letter?
Rohin Shah: So one of the things that’s particularly important, for at least some kinds of safety research, is to be working with the most capable models that you have. For example, if you’re using the AI models to provide critiques of each other’s outputs, they’ll give better critiques if they’re more capable, and that enables your research to go faster. Or you could try making proofs of concept, where you try to actually make one of your most powerful AI systems misaligned in a simulated world, so that you have an example of what misalignment looks like that you can study. That, again, gets easier the more you have more capable systems.
There’s this resource of capabilities-adjusted safety time that you care about, and it’s plausible that the effect of this open letter — again, I’m only saying plausible; I’m not saying that this will be the effect — but it’s plausible that the effect of a pause would be to decrease the amount of time that we have with some things one step above GPT-4, without actually increasing the amount of time until AGI, or powerful AI systems that pose an x-risk. Because all of the things that drive progress towards that — hardware progress, algorithmic efficiency, willingness for people to pay money, and so on — those might all just keep marching along during the pause, and those are the things that determine when powerful x-risky systems come.
So on this view, you haven’t actually changed the time to powerful AI systems, but you have gotten rid of some of the time that safety researchers could have had with the next thing after GPT-4.

Why solving the technical side might not be enough

Rohin Shah: I think there are two ways in which this could fail to be enough.
One is just, again, the misuse and structural risks that we talked about before: you know, great power war, authoritarianism, value lock-in, lots of other things. And I’m mostly going to set that to the side.
Another thing that maybe isn’t enough, another way that you could fail if you had done this, is that maybe the people who are actually building the powerful AI systems don’t end up using the solution you’ve come up with. Partly I want to push back against this notion of a “solution” — I don’t really expect to see a clear technique, backed by a proof-level guarantee that if you just use this technique, then your AI systems will do what their designers intended them to do, let alone produce beneficial outcomes for the world.
Even just that. If I expected to get a technique that really got that sort of proof-level guarantee, I’d feel pretty good about being able to get everyone to use that technique, assuming it was not incredibly costly.
Rob Wiblin: But that probably won’t be how it goes.
Rohin Shah: Yeah. Mostly I just expect it will be pretty messy. There’ll be a lot of public discourse. A lot of it will be terrible. Even amongst the experts who spend all of their time on this, there’ll be a lot of disagreement on what exactly is the right thing to do, what even are the problems to be working on, which techniques work the best, and so on and so forth.
I think it’s just actually reasonably plausible that some AI lab that builds a really powerful, potentially x-risky system ends up using some technique or strategy for aligning AI — where if they had just asked the right people, those people could have said “No, actually, that’s not the way you should do it. You should use this other technique: it’s Pareto better, it’s cheaper for you to run, it will do a better job of finding the problems, it will be more likely to produce an aligned AI system,” et cetera. It seems plausible to me. When I tell that story, I’m like, yeah, that does sound like a sort of thing that could happen.
As a result, I often think of this separate problem of like how do you ensure that the people who are building the most powerful AI systems are also getting the best technical advice on how they should be aligning their systems? Or are the people who know the most about how to align their systems? In fact, this is where I expect most of my impact to come from, is advising some AGI lab, probably DeepMind, on what the best way is to align their AI systems.

The value of raising awareness of the risks among non-experts

Rohin Shah: I think there’s definitely a decent bit of value in having non-experts think that the risks are real. One is that it can build a political will for things like this FLI open letter, but also maybe in the future, things like government regulation, should that be a good idea. So that’s one thing.
I think also people’s beliefs depend on the environment that they’re in and what other people around them are saying. I think this will be also true for ML researchers or people who could just actually work on the technical problem. To the extent that a mainstream position in the broader world is that ML could in fact be risky and this is for non-crazy reasons — I don’t know, maybe it will be for crazy reasons — but to the extent it’s for non-crazy reasons, I think plausibly that could also just end up leading to a bunch of other people being convinced who can more directly contribute to the problem.
That’s about just talking about it. Of course, people can take more direct action as well.
Rob Wiblin: Yeah. If you’re someone who’s not super well informed but you’re generally troubled by all of this, is there anything that you would suggest that someone in that position do or not do?
Rohin Shah: I unfortunately don’t have great concrete advice here. There’s various stuff like advocacy and activism, which I am a little bit sceptical of. Mostly I’m worried that the issues are actually pretty nuanced and subtle, and it’s very easy for people to be wrong, including the people who are working full-time on it. This just seems like a bad fit, for activism at least, and probably advocacy too. I tend to be a little more bearish on that from people who are not spending that much time on thinking about it.
There’s, of course, the obvious normal classic of just donate money to nonprofits that are trying to do work on this problem, which is something that maybe not anyone can do, but like a lot of listeners, I expect, will be able to do. I do wish I had more things to suggest.

How much you can trust some conceptual arguments

Rohin Shah: I think the field of ML in general is fairly allergic to conceptual arguments — because, for example, people have done a lot of theory work trying to explain how neural networks work, and then these just don’t really predict the results of experiments all that well. So there’s much more of a culture of like, “Do the experiment and show it to me that way. I’m not just going to buy your conceptual argument.”
And I kind of agree with this; I definitely think that a lot of the conceptual work on the Alignment Forum feels like it’s not going to move me that much. Not necessarily that it’s wrong; just that it’s not that much evidence, because I expect there are lots of counterarguments that I haven’t thought about. But I do think that there are occasionally some conceptual arguments that do in fact feel really strong and that you can update on. I think many ML researchers, even if you present them with these arguments, will still mostly be like, “Well, you’ve got to show me the empirical results.”
Rob Wiblin: What’s an example of one of those?
Rohin Shah: One example is the argument that even if you have a correct reward function, that doesn’t mean that the trained neural network you get out at the end is going to be optimising that reward function. Let me spell that out a bit more. What happens when you’re training neural networks is that even in reinforcement learning — which we can take for simplicity, because probably people are a bit more familiar with that — you have a reward function. But the way that your neural network gets trained is that it gets some episodes or experience in the environment and then those are scored by the reward function and used to compute gradients, which are then used to change the weights of the neural network. If you condition on what the gradients are, that fully screens off all of the information about the reward — and also the data, but the important point is the reward.
And so if there were multiple reward functions that, given the same data that the agent actually experienced, would have produced the same gradients, then there’s just the neural network that you learn would be the same regardless of which of those multiple reward functions was actually the real reward function that you were using. So even if you were like, “The model is totally going to optimise one of these reward functions,” you should not, without some additional argument, be confident in which of the many reward functions that are consistent with the gradients the model might be optimising.
Rob Wiblin: So to see if I’ve understood that right, is this the issue that if there are multiple different goals or multiple different reasons why a model might engage in a particular set of behaviours, in theory, merely from observing the outputs, you can’t be confident which one you’ve actually created? Because they would all look the same?
Rohin Shah: Yeah, that’s right.

Articles, books, and other media discussed in the show

Rohin’s work:

Rohin’s website includes all his publications
Sign up for the Alignment Newsletter — also available as a podcast
FAQ: Careers advice for alignment researchers
What’s been happening in AI alignment? — Rohin’s talk at EA Global 2020
Categorizing failures as “outer” or “inner” misalignment is often confused
Goal misgeneralisation: Why correct specifications aren’t enough for correct goals by Rohin and his colleagues at DeepMind
PhD dissertation: Extracting and Using Preference Information from the State of the World

Technical AI safety work Rohin is excited about:

Scalable oversight — trying to improve the quality of human feedback that you use to train an AI system — such as DeepMind’s work with Sparrow, a dialogue agent that’s useful and reduces the risk of unsafe and inappropriate answers
Mechanistic interpretability — trying to understand how a model is making the decisions that it’s making
Dangerous capability evaluations — such as the work done by ARC Evals on GPT-4 and Claude and Owain Evans at the Stanford Existential Risks Initiative
Redwood Research‘s focus on neglected empirical alignment research directions

AI safety and capabilities:

Pause giant AI experiments: An open letter from the Future of Life Institute (March 2023)
Transcript of OpenAI CEO Sam Altman’s interview touching on AI safety (January 2023)
Thoughts on the impact of RLHF research by Paul Christiano
What does it take to catch a Chinchilla? Verifying rules on large-scale neural network training via compute monitoring by Yonadav Shavit (March 2023)
Jonathan GPT Swift on Jonathan Swift — podcast episode on Conversations with Tyler (March 2023)
AI bet! — a bet by Bryan Caplan on AI capabilities on economics midterms (that will likely end Bryan’s winning streak) (January 2023)
Improving language models by retrieving from trillions of tokens from DeepMind (December 2021)
Reframing superintelligence: Comprehensive AI services as general intelligence by Eric Drexler (January 2019)

Other 80,000 Hours podcast episodes:

Transcript

Table of Contents

1 Rob’s intro [00:00:00]
2 The interview begins [00:02:48]
3 The mood at DeepMind [00:06:43]
4 Common misconceptions [00:15:24]
5 Rohin’s disagreements with other ML researchers [00:29:40]
6 Ways we might fail [00:40:10]
7 Optimism vs pessimism [00:55:49]
8 Specialisation vs generalisation [01:09:01]
9 Why solving the technical side might not be enough [01:16:39]
10 Barriers to coordination between AI labs [01:22:15]
11 Could we end up in a terrifying world even if we mostly succeed? [01:27:57]
12 Is it time to slow down? [01:33:25]
13 Public discourse on AI [01:47:12]
14 The candidness of AI labs [01:59:27]
15 Visualising and analogising AI [02:02:33]
16 Scalable oversight [02:16:37]
17 Mechanistic interpretability [02:25:56]
18 Dangerous capability evaluations [02:33:41]
19 The work at other AI labs [02:38:12]
20 Deciding between different categories of safety work [02:46:29]
21 Approaches that Rohin disagrees with [02:53:27]
22 Careers [03:01:12]
23 Rob’s outro [03:06:47]

Rob’s intro [00:00:00]

Rob Wiblin: Hi listeners, this is The 80,000 Hours Podcast, where we have unusually in-depth conversations about the world’s most pressing problems, what you can do to solve them, and why we’ll be safe as long as AGI costs more than £3,000 to build. I’m Rob Wiblin, Head of Research at 80,000 Hours.

I’m really pleased with how this interview with Rohin Shah turned out.

I think it’ll be pretty straightforward for people with an amateur interest in AI to follow, while still having plenty of fresh ideas and opinions for people who already work in the field.

Rohin is plugged into the cutting edge of AI research, having worked at Google DeepMind for years. He’s also exceptionally evenhanded, intellectually honest, and curious — so a perfect guest for this show.

We talk about:

The current mood at Google DeepMind
Rohin’s disagreements with other ML researchers
Why just solving technical AI safety problems isn’t enough to make things go well
What people who mainly learn about these topics from Twitter get wrong about AI alignment
Common misconceptions about risks from artificial intelligence
Barriers to coordination between AI labs
The most powerful arguments for and against slowing down AI advances
Ways of visualising or analogising AI
What gives Rohin hope that AI will go well
How valuable is it to have non-experts thinking that the risks are real
Deciding between different categories of safety work
Ways people are trying to push AI in a positive direction that Rohin thinks are misguided
How likely it is that public discussion of AI risks will be helpful or harmful
Careers paths Rohin would recommend for people interested in working on AI safety

One announcement before that though — we’ve put together a compilation of 11 interviews from the show on the topic of artificial intelligence, including how it works, ways it could be really useful, ways it could go wrong, and ways you and I can help make the former more likely than the latter.

I know lots of listeners are looking for a way to get on top of exactly those issues right now, and we chose these 11 aiming to pick ones that were fun to listen to, highly informative, pretty up to date, and also to cover a wide range of themes by not having them overlap too much.

Of course you could find those episodes by scrolling back into the archives of this show, but the compilation is useful because finding those episodes in the archive is a bit of a hassle, and it puts the interviews that we’d suggest you listen to first, in the order we think is most sensible.

The full name of the feed is The 80,000 Hours Podcast on Artificial Intelligence.

But it should show up in any podcasting app if you search for “80,000 Hours artificial,” so if you’d like to see the 11 we chose, just search for “80,000 Hours artificial” in the app you’re probably using at this very moment.

All right, without further ado, I bring you Rohin Shah.

The interview begins [00:02:48]

Rob Wiblin: Today I’m speaking with Rohin Shah, a research scientist on the technical AGI safety team at DeepMind, one of the world’s top AI labs. Rohin did his PhD in computer science at UC Berkeley’s Center for Human-Compatible AI, where his dissertation focused on the idea that since we have optimised our environment to suit our preferences, an AI system should be able to infer at least some aspects of our preferences just by observing the current surrounding state of the world.

He’s well known and much appreciated as the author of the Alignment Newsletter, which has had 160 issues since 2018, and in which Rohin regularly summarises recent papers related to AI alignment as well as has his own personal take on them. Thanks so much for coming on the podcast, Rohin.

Rohin Shah: Yeah, thanks for having me. It’s great to be on this podcast. It’s kind of surreal actually, given how much I’ve listened to it in the past.

Rob Wiblin: That’s wonderful. I hope we’re going to get to cover some of your personal opinions about how to ensure AI is safe to deploy, and which analogies are best for thinking about AI and trying to predict what it might do.

But first, what are you working on at the moment and why do you think it’s important?

Rohin Shah: I’m a little bit split across things, but maybe I’ll focus on two things in particular. One is just building and leading one of the two safety-focused teams at DeepMind, the Alignment team. At this point I’m managing a few reports and giving advice on a lot of projects. All of that I think is important for reasons that I’m sure listeners are familiar with. I expect everyone’s at least heard of AI alignment before.

The other thing I would say that I’m spending a bunch of time on is outreach and internal engagement at DeepMind — just trying to get alignment work to be more built into the way that DeepMind perceives all of its core priorities.

Rob Wiblin: Yeah, makes sense. Are you working on any particular aspect of the technical research agenda? Or it sounds like you have a more high-level role?

Rohin Shah: Yeah, I think in the end I end up advising many different projects as opposed to working specifically on one. I’m actually just wrapping up a project on understanding this phenomenon called “grokking”. That was one where I was a lot more involved in the day-to-day research, so that was pretty exciting. It was a more speculative project of trying to see if we understand this particular weird phenomenon about how deep learning works, maybe that would give us some insight that we could then leverage for alignment proposals.

I don’t think it quite panned out. I’m still pretty excited about the general direction of understanding deep learning, and I do think we did scientifically good work, where I do actually feel like I understand this phenomenon a lot better. But I think the part “…and then it helps with alignment somehow” hasn’t fully panned out.

Rob Wiblin: OK, let’s get into the meat of the conversation today. Something to observe early on is that we’ve done quite a few episodes on AI over the years, and we’re going to be doing quite a lot more in future, given what a hot topic it is at the moment. Recently we’ve had on both Richard Ngo and Ajeya Cotra, and they’ve kind of laid out why they think there’s some things to worry about here — as well as almost unlimited upside potential if we can really nail it.

Further back, we’ve heard from Chris Olah, Brian Christian, Ben Garfinkel, Stuart Russell, Paul Christiano, Pushmeet Kohli, Catherine Olsson and Daniel Ziegler, Katja Grace, Dario Amodei, and I think a couple of others besides.

As we are doing more and more episodes on this general theme, in order to make sure that we’re doing new stuff each time and not just repeating ourselves, we’re going to spend a bit less time on the basics and think more about the personal opinions that each guest has, and ways that they see things differently than other people. I guess the typical structure in the past has kind of been spending the first hour laying out the problem as that guest sees it. We’re going to have a more fluid structure in future, and jump around on the most interesting opinions that person happens to have.

The mood at DeepMind [00:06:43]

Rob Wiblin: With you, Rohin, I also want to take advantage of the fact that you are at DeepMind, one of these main labs trying to develop AGI, and hear what the mood is there specifically. So let’s start there. Obviously, it has been a pretty hectic three months for AI. It’s been all over the news. I suppose more broadly you could say it’s been a hectic six months, or one year, or I guess even decade, if you zoom out from one point of view. What is the mood at the DeepMind offices, as far as you can tell?

Rohin Shah: So I think there has been quite a lot of stuff happening, and it’s definitely affected the mood at DeepMind. But DeepMind is such a big place; there’s such a wide diversity of people and opinions and so on that I don’t really feel like I actually have a good picture of all the ways that things have changed.

To take some examples, there’s obviously a bunch of people who are concerned about existential risk from AI. I think many of us have expected things to heat up in the future and for the pace of AI advances to accelerate. But seeing that actually play out is really leading to an increased sense of urgency, and like, “Man, it’s important for us to do the work that we’re doing.”

There’s a contingent of people who I think view DeepMind more as a place to do nice, cool, interesting work in machine learning, but weren’t thinking about AGI that much in the past. Now I think it feels a lot more visceral to them that, no, actually, maybe we will build AGI in the nearish future. I think this has caused people to change in a variety of ways, but one of the ways is that they tend to be a little bit more receptive to arguments about risk and so on — which has been fairly vindicating or something for those of us who’ve been thinking about this for years.

Then there’s also a group of people who kind of look around at everybody else and are just a little bit confused as to why everyone is reacting so strongly to all of these things, when it was so obviously predictable from the things we saw a year or two ago. I also feel some of that myself, but there are definitely other people who lean more into that than I do.

Rob Wiblin: On the potentially greater interest in your work, I would expect that people would be more interested in what the safety and alignment folks are doing. Because I guess one reason to not take a big interest in the past was just thinking that that’s all good and well and might be useful one day, but the models that we have currently can’t really do very much other than play go or imagine how a protein might fold. But now that you have models that seem like they could be put towards bad uses — or indeed, might semi-autonomously start doing things that weren’t desirable — push has come to shove: it seems like maybe DeepMind and other labs need the kind of work that you folks have been doing in order to make these products safe to deploy.

Rohin Shah: Yeah, that’s exactly right. At this point there’s something like alignment work — or just generally making sure that your model does good things and not bad things, which is perhaps a bit broader than alignment — is sort of the bottleneck to actually getting useful products out into the world. We’ve definitely got the capabilities needed at this point to build a lot of useful products, but can we actually leverage those capabilities in the way that we want them to be leveraged? Not obviously yes — it’s kind of unclear right now, and there’s just been an increasing recognition of that across the board. And this is not just about alignment, but also the folks who have been working on ethics, fairness, bias, privacy, disinformation, et cetera — I think there’s a lot of interest in all of the work they’re doing as well.

Rob Wiblin: Yeah. Is it possible to indicate roughly what the ratio is of excitement to anxiety about all of this?

Rohin Shah: Oh man, do I actually know? I’m not sure that I do. In the section of DeepMind that I’m in, you definitely see a lot more of the anxiety, but that’s partly because the people who are feeling anxious often come to the AGI safety or alignment Slack channels and go and ask questions about like, “What are we actually going to do about these risks?” So of course, I tend to see more of the anxious side. I think there is also a lot of excitement about the capabilities and what we could really achieve with them, but I tend not to see that as much. It’s a little hard for me to say exactly what the ratio is.

Rob Wiblin: Yeah. I guess that old quote from Arthur C. Clarke just jumped into my mind that any sufficiently advanced technology seems like magic. I really do feel like I’m living through a time when we’re inventing magic. I mean, GPT-4, it’s just extraordinary what it can do.

Rohin Shah: It really is.

Rob Wiblin: Yesterday I listened to the interview that Tyler Cowen did with GPT-4 pretending to be Jonathan Swift, this Irish author from the 17th and 18th centuries, and it was just extremely impressive.

Rohin Shah: Really? I’ve not seen that, but that does sound like the sort of thing it could do. I think for me, one of the craziest things — maybe not craziest, but most viscerally impactful things — was I’d seen someone claim that GPT-4 can predict the output of scripts that are doing Monte Carlo simulations. That just seems kind of wild. So you’d write a Python script that is doing some sort of Monte Carlo simulation to estimate some quantity, which would be hard for humans to do. It’s just got this mathematical reasoning about what a script would output. And GPT-4 is actually not bad at this, which is kind of wild to me.

Rob Wiblin: OK, so Monte Carlo simulation is something where you’re trying to figure out the distribution of some output for a distribution of possible inputs into some process or some formula. And it’s able to look at a piece of code that is trying to run these simulations and just kind of guess intuitively what distribution of output you might roughly get?

Rohin Shah: Yes, that’s right. Well, it’s usually a point estimate rather than a distribution, but it’s a point estimate of some quantity that comes from a distribution.

Rob Wiblin: Do people inside, who are doing ML research, find it similarly unnerving that we just don’t know what these models are capable of doing?

Rohin Shah: That’s a good question. Definitely some of them do. I think maybe the biggest theme in ML research, or conversation about ML from ML researchers in the last year or two, might be this theme of emergence — of like, you increase the size of your neural network and then suddenly these new capabilities emerge from scale. And not only that, but also you don’t even know what the capabilities are until someone actually thinks about probing them and finding them and demonstrates them. Then you’re like, “I guess that’s another thing that the models can do that we didn’t know about.” It feels like even before everyone had figured out what the various models prior to GPT-4 could do, there were still new capabilities being discovered. GPT-4 came along and there’s a whole bunch of new capabilities to be discovered.

Rob Wiblin: Yeah. I can’t think of any other human invention that has this property. Maybe there is something, I don’t know. It sounds like a riddle: “What invention do you create but you don’t know what it does?”

Is it possible to give a short summary of what sorts of things safety folks at DeepMind are up to at the moment? I guess a brief one, because we’ll come back to it later.

Rohin Shah: Yeah. I’ll focus more on the technical research. We also do a lot of internal outreach and engagement, as I mentioned a bit before. But on the technical side, there’s a lot of work on scalable oversight, so essentially trying to improve the quality of human feedback that you use to train an AI system; work on mechanistic interpretability, so trying to understand how a model is making the decisions that it’s making; work on dangerous capability evaluations, so just seeing for our current models, could they actually do something dangerous if they “wanted” to? Right now the way we do that is we just try to train them to do dangerous things and see how far they get, obviously in simulation where nothing actually bad happens. But the principle here is to know what the models can do in simulation, where we can actually monitor them, before we try and do something where we aren’t monitoring them.

Rob Wiblin: Yeah, that makes sense.

Common misconceptions [00:15:24]

Rob Wiblin: OK, the first thing I was interested to ask about is common misconceptions, or places where you maybe disagree with listeners or with things that one often hears. What is something that you think a meaningful fraction of listeners out there might believe about AI and its possible risks and benefits, that is, in your opinion, not quite on the mark?

Rohin Shah: Yeah, I think there is a common meme floating around that once you develop AGI, that’s the last thing that humanity does — after that, our values are locked in and either it’s going to be really great or it’s going to be terrible.

I’m not really sold on this. I guess I have a view that AI systems will become more powerful relatively continuously, so there won’t be one specific point where you’re like, “This particular thing is the AGI with the locked-in values.” This doesn’t mean that it won’t be fast, to be clear — I do actually think that it will feel crazy fast by our normal human intuitions — but I do think it will be like, capabilities improve continuously and there’s not one distinct point where everything goes crazy.

That’s part of the reason for not believing this lock-in story. The other part of the reason is I expect that AI systems will be doing things in ways similar to humans. So probably it will not be like, “This is the one thing that the universe should look like, and now we’re going to ensure that that happens.” Especially if we succeed at alignment, instead it will be the case that the AI systems are helping us figure out what exactly it is that we want — through things like philosophical reflection, ideally, or maybe the world continues to get technologies at a breakneck speed and we just frantically throw laws around and regulations and so on, and that’s the way that we make progress on figuring out what we want. Who knows? Probably it will be similar to what we’ve done in the past as opposed to some sort of value lock-in.

Rob Wiblin: I see. So if things go reasonably well, there probably will be an extended period of collaboration between these models and humans. So humans aren’t going to go from making decisions and being in the decision loop one day to being completely cut out of it the next. It’s maybe more of a gradual process of delegation and collaboration, where we trust the models more and give them kind of more authority, perhaps?

Rohin Shah: That’s right. That’s definitely one part of it. Another part that I would say is that we can delegate a lot of things that we know that we want to do to the AI systems — such as acquiring resources, inventing new technology, things like that — without also delegating “…and now you must optimise the universe to be in this perfect state that we’re going to program in by default.” We can still leave what exactly are we going to do without this cosmic endowment in the hands of humans, or in the hands of humans assisted by AIs, or to some process of philosophical reflection — or I’m sure the future will come up with better suggestions than I can today.

Rob Wiblin: Yeah. What would you say to people in the audience who do have this alternative view that humans could end up without much decision-making power extremely quickly? Why don’t you believe that?

Rohin Shah: I guess I do think that’s plausible via misalignment. This is all conditional on us actually succeeding at alignment. If people are also saying this even conditional on succeeding at alignment, my guess is that this is because they’re thinking that success at alignment involves instilling all of human values into the AI system and then saying “go.” I would just question why that’s their vision of what alignment should be. It doesn’t seem to me like alignment requires you to go down that route, as opposed to the AI systems are just doing the things that humans want. In cases where humans are uncertain about what they want, the AI systems just don’t do stuff, take some cautious baseline.

Rob Wiblin: Yeah. I think that vision might be a holdover from a time when we didn’t know what ML systems would look like, and we didn’t know that there were going to be neural nets that were trained through incredibly large samples of examples, and they would end up with intuitions and values that we can’t exactly see because we don’t understand how the neural networks work. I think in the past we might have thought, “We’ll program an AI to do particular things with some quite explicit goal.” But in practice it seems like these goals are all implicit and learned merely through example.

Rohin Shah: If I imagine talking to a standard person who believes more in a general core of intelligence, I don’t think that would be the difference in what they would say versus what I would say. I would guess that it would be more like, once you get really really intelligent, that just sort of means that as long as you have some goal in mind, you are just going to care about having resources. If you care about having resources, you’re going to take whatever strategies are good for getting resources. That doesn’t really matter whether we programmed it in ourselves or whether this core of intelligence was learned through giant numbers of examples with neural networks.

And then I don’t quite follow how this view is incompatible with how we will have the AI systems collecting a bunch of resources for humans to then use for whatever humans want. It feels like it is pretty coherent to have a goal of like, “I will just help the humans to do what they want.” But this feels less coherent to other people, and I don’t super understand exactly why.

Rob Wiblin: Yeah. OK, another misconception. This one’s from the audience. If people mostly learn about AI safety from Twitter or the Alignment Forum, or random articles and blog posts on the internet, what are they most likely to be missing and/or what mistakes are they most likely to make other than perhaps only using those sources of information?

Rohin Shah: Yeah. As an avid LessWrong user, of course I am going to be very critical of LessWrong — you have to be in order to be a LessWrong user. But I’ve got a few of these. The first one I’ll mention is that you’re maybe more likely to treat worst-case reasoning as a method for estimating probabilities of what’s actually going to happen.

Rob Wiblin: What do you mean by that?

Rohin Shah: Like saying, here’s this particular technique for potentially aligning an AI system, and then thinking, “OK, but if I were a misaligned superintelligence, here is the way I would defeat that technique, and therefore that technique is not going to work.” That’s saying, in the worst case, where the AI system that you’re trying to align is already misaligned and is superintelligent, then this technique doesn’t work. And then from that going to an inference of: so this technique isn’t going to work at all in any case, even if the AI wasn’t already a misaligned superintelligence.

Rob Wiblin: And why is that mistaken reasoning?

Rohin Shah: Obviously the general pattern of, “Let’s think about the worst case and then predict that’s going to happen in reality” seems pretty clearly wrong. Now, in the AI case, it’s less clearly wrong than in that case, because with AI, the thing we are worried about is an AI system that is adversarily optimising against you, that is more intelligent than you are. In that case, if you had this adversarial optimisation, it would be pretty reasonable to say, well, the worst case I can think of is X, so probably the AI system will also find something at least as good — or bad, depending on your perspective — as X. So that’s probably what happens in reality.

But in most discussions about AI alignment, most techniques for alignment that people talk about are meant to apply before the AI system has become a misaligned superintelligence. In the limit, you can start applying these techniques often from the point where you start with a neural network whose weights are randomly initialised, which is definitely not exerting any adversarial pressure against you. So at that point, there’s not actually any adversarial optimisation going on against you. And so at that point, I don’t think the worst-case reasoning argument goes through directly.

Rob Wiblin: I see. The idea is many alignment methods are meant to come early on, before you have a superintelligence that is trying to outwit you and be deceptive and so on. It’s true that if you only applied them after that had already happened, once you already had a being that was hostile and trying to maximally achieve its goals to your detriment, then applying it at that stage would be no use. But if you’d done it earlier, during the training process, before this being existed, or while it was gradually developing capabilities, then it might work in that instance.

Rohin Shah: Yeah, that’s exactly right.

Rob Wiblin: OK, yeah. Any other common misconceptions from people who get their stuff on the internet?

Rohin Shah: I think another one is treating analogies as a strong source of evidence about what’s going to happen. The one that I think is most common here is the evolution analogy, where the process of training a neural network via gradient descent is analogised to the process by which evolution searched over organisms in order to find humans: Just as evolution produced an instance of general intelligence, namely humans, similarly gradient descent on neural networks could produce a trained neural network that is itself intelligent.

Rob’s 2 minute explainer about gradient descent [00:25:24]

Rohin Shah: Actually the analogy, as I stated it, seems fine to me. But I think it’s just often pretty common for people to take this analogy a lot farther, such as saying that evolution was optimising for reproductive fitness, but produced humans who do not optimise for reproductive fitness; we optimise for all sorts of things — and so similarly, that should be a reason to be confident in doom.

I’m like, I don’t know. This is definitely pointing at a problem that can arise, namely inner misalignment or goal misgeneralisation or mesa-optimization, whatever you want to call it. But it’s pointing at that problem. It’s enough to raise the hypothesis into consideration, but then if you actually want to know about how likely this is to happen in practice, you really need to just delve into the details. I don’t think the analogy gets you that far.

Rob Wiblin: Yeah. One place where I’ve heard people say that the analogy between training an ML model and evolution might be misleading is that creatures, animals in the world that evolved, had to directly evolve a desire for self-preservation — because mice that didn’t try to avoid dying didn’t reproduce, and so on. But there isn’t a similar selection pressure against being turned off among ML models.

Does that make sense as a criticism? I guess I think that you could end up with models that don’t want to be turned off because they reason through that that is not a good idea, so the same tendency could arise via a different mechanism. But maybe they have a point that you can’t just reason from “this is how animals behave” to “this is how an ML model might behave.”

Rohin Shah: Yeah, I think that all sounds valid to me, including the part about how this should not give you that much comfort — because in fact, they can reason through it to figure out that staying on would be good if they’re pursuing some goal.

Rohin’s disagreements with other ML researchers [00:29:40]

Rob Wiblin: Yeah. What do you think is an important disagreement you have with the majority of ML researchers at DeepMind? If there are any particular things that the majority of researchers at DeepMind do believe?

Rohin Shah: This is probably just going to be about the majority of ML researchers in general, because there’s not that much that differentiates ML researchers at DeepMind, which are a very diverse group, from ML researchers everywhere else — except maybe more of a focus on deep learning, but I do agree with that part.

But one disagreement I do have with ML researchers overall is just how much you can trust some conceptual arguments. I think the field of ML in general is fairly allergic to conceptual arguments — because, for example, people have done a lot of theory work trying to explain how neural networks work, and then these just don’t really predict the results of experiments all that well. So there’s much more of a culture of like, “Do the experiment and show it to me that way. I’m not just going to buy your conceptual argument.”

And I kind of agree with this; I definitely think that a lot of the conceptual work on the Alignment Forum feels like it’s not going to move me that much. Not necessarily that it’s wrong; just that it’s not that much evidence, because I expect there are lots of counterarguments that I haven’t thought about. But I do think that there are occasionally some conceptual arguments that do in fact feel really strong and that you can update on. I think many ML researchers, even if you present them with these arguments, will still mostly be like, “Well, you’ve got to show me the empirical results.”

Rob Wiblin: What’s an example of one of those?

Rohin Shah: One example is the argument that even if you have a correct reward function, that doesn’t mean that the trained neural network you get out at the end is going to be optimising that reward function. Let me spell that out a bit more. What happens when you’re training neural networks is that even in reinforcement learning — which we can take for simplicity, because probably people are a bit more familiar with that — you have a reward function. But the way that your neural network gets trained is that it gets some episodes or experience in the environment and then those are scored by the reward function and used to compute gradients, which are then used to change the weights of the neural network. If you condition on what the gradients are, that fully screens off all of the information about the reward — and also the data, but the important point is the reward.

And so if there were multiple reward functions that, given the same data that the agent actually experienced, would have produced the same gradients, then there’s just the neural network that you learn would be the same regardless of which of those multiple reward functions was actually the real reward function that you were using. So even if you were like, “The model is totally going to optimise one of these reward functions,” you should not, without some additional argument, be confident in which of the many reward functions that are consistent with the gradients the model might be optimising.

Rob Wiblin: So to see if I’ve understood that right, is this the issue that if there are multiple different goals or multiple different reasons why a model might engage in a particular set of behaviours, in theory, merely from observing the outputs, you can’t be confident which one you’ve actually created? Because they would all look the same?

Rohin Shah: Yeah, that’s right.

Rob Wiblin: That makes sense, doesn’t it? You’re saying ML researchers aren’t so interested in this because this is more of a theoretical argument rather than one that has been demonstrated through example?

Rohin Shah: I think probably the important thing is usually I would go further from that and talk about how you could have both the aligned AI system that tries to do what you want and also a deceptive AI system that wants to make paperclips — or whatever; replace whatever goal you want — but knows that if it shows signs of its deceptiveness, then humans will turn it off. And both of these AI systems would behave the same way. They would get high reward. If you were training them with a loss function that was basically what humans want, both of them would score very well on this loss function. So gradient descent would be like, “Good job. You are a good neural network, I do not want to change you.”

Rob Wiblin: OK. The point is that they would be producing the same output and behaving the same way for completely different reasons. And you just can’t distinguish?

Rohin Shah: Yes, that’s right. And gradient descent isn’t going to distinguish between them, so you shouldn’t be particularly confident that you get one over the other. I think once you get there, ML researchers are often like, “Huh. Intriguing argument. I don’t see any flaws in it” — and then they will go back to what they’re doing fairly frequently. Which I think is a symptom of like, “This person managed to make a convincing, smart-sounding argument that I couldn’t find any holes in. But people have done that in the past; theory researchers do it all the time and it’s still usually correct to just ignore those arguments. I’m going to ignore this argument too.” I don’t think they’re consciously doing this, but I think that’s sort of the learned response.

Rob Wiblin: Right. Yeah. I think this tendency or this aesthetic is common across many areas of science and engineering — where people learn that the best way to do their job or their research is not to think big thoughts and make arguments that sound more philosophical, but rather just to stay at the coalface, or cross the river by feeling for the next stone, basically. I guess to try to be super empirical and ignore theoretical arguments even when they kind of sound good.

That’s all well and good probably in some circumstances, but I think it’s actually an incredibly dangerous tendency when you’re dealing with technologies where it could be dangerous in many different circumstances. In particular when you’re working on something where you can’t be confident that the trend that you’re currently working with is simply going to continue, that there could be some sharp turn in the way that things go. Kind of the only way that you’ll be able to see that far ahead into the fog of where your research might lead you is by being somewhat more theoretical, is by being somewhat more conceptual — because you can’t do the empirics yet, because you’re trying to estimate where you’ll be in five, 10, or 15 years, several steps down the road.

Yeah. I find it deeply troubling when it’s very hard to get people working on dangerous technologies to think more at this level.

Rohin Shah: Yeah, I do wish that would happen more. I do think it does happen. For example, the way that Stuart Russell got interested in AI safety — I forget exactly how this happened — but I think the way he tells it is he asked himself the question of, “What would happen if we did succeed?” Which is not a question that people usually ask themselves very much, sadly. He did ask the question, and then he thought about it, and was like, “Actually, that seems like maybe an important problem, and we should figure out what to do about it.”

Rob Wiblin: Yeah. This maybe isn’t a paradigmatic example, but one that jumps to mind is I’ve seen biologists thinking about the risks of gain-of-function research or dealing with creating possibly dangerous pathogens in the lab, and saying, “It’s not a problem because we don’t have any examples of pandemics that have been caused by this. Like in principle, maybe it could in theory, but show me the empirics. Show me the case where this is leaked.”

It sounds very strange in this context, but I think it makes more sense if you just have this intensely empirical preference for empirical information, and someone saying that something could be dangerous maybe just isn’t very compelling to you. I don’t want to say that’s a typical response, but I think someone with a particular aesthetic for how you learn about the world might be inclined to think that way.

Rohin Shah: I hope they were asking for a pathogen that we can see is dangerous to mice or something, as opposed to a pathogen that wipes out 10% of humanity — “and then I will be convinced that in fact, this is maybe dangerous.” You really want to not have the catastrophic risk happen before you’re convinced that the catastrophic risk might happen.

Rob Wiblin: I suppose to defend these folks, the principle that someone who was worried about something that has never happened before probably is being a nervous Nelly and you shouldn’t worry too much. Like in most cases, saying, “But this has literally never happened, ever,” is a good reason not to worry about it. In this case, it doesn’t work so well because the reason why it might not have happened is that we haven’t been doing this kind of research all that long. And I guess we’re not even 100% sure that it hasn’t happened. But I guess it’s when the scale of the harms is so large that that’s not a sufficient response anymore.

Rohin Shah: I don’t know. I think I have a little less sympathy than you do for this position. I feel like in situations like this, you should at least be able to name some concrete empirical evidence that you could expect to get before the catastrophic risk happens that would change your mind in some direction — at least get you to the point of like, “All right, I’m going to spend a bunch of time thinking about this.”

So in the pandemic case, here’s a gain of function that can, I don’t know, reach some threshold of virulence in mice. I made up that example, but something along those lines. And similarly for AI, I don’t quite know exactly what everyone’s personal thresholds are for when they’ll be thinking about AI risk. Plausibly just the existence of Bing Chat has already passed it for many of them.

But you could maybe say things like, maybe you’ve got to make this simulated world where there are stand-ins for humans going around, and the AI system has to disempower people in that world — which is a standard that I don’t think we’ve achieved yet with empirical examples, but is a standard that I think we plausibly could achieve before the catastrophic risk actually arrives. And it is something that I would like to work on, but not because people have told me “this is what will convince me” — more because if I model people and try to imagine what would convince them, maybe it’s something like this. But I’d feel a lot better if people were just actually telling me this is the thing.

Ways we might fail [00:40:10]

Rob Wiblin: OK, let’s push on from misconceptions and talk about some of the personal opinions you have about the nature of the challenges we might face with AI.

First up, how likely do you think it is that the key few ways that AIs could one day run amok are already known in some sense — and are nicely characterised, or characterised accurately — as opposed to it being some phenomenon that we just haven’t named and talked about yet?

Rohin Shah: I’m focusing primarily on misalignment for this part. By “misalignment,” I usually mean cases where an AI system is knowingly acting against the wishes of what its designers want — so it’s taking some actions that it knows that if its designers were fully well informed about what it knows about its actions, then the designers would be like, “No, that’s not a thing you should do,” and the AI system knows that.

So if we talk about that particular problem, then I think there’s one part of the story that feels pretty constant throughout, which is this notion of goal-directed AI systems. I could say a lot of words about this, but in the spirit of saying only a few words about this: roughly, this part of the argument says if you have an AI system pursuing a goal that benefits from getting more resources, then it will have a lot of convergent instrumental sub-goals. Where “convergent” means it happens for a wide variety of final or terminal goals, and “instrumental” refers to the fact that it’s not the final or terminal goal.

Rob Wiblin: It’s the means to an end.

Rohin Shah: Yeah, exactly. It’s a means to an end. There are various of these sub-goals. Like, as we mentioned before, staying alive. Which in this context means more like “make sure something with roughly my goals continues to operate.” It doesn’t necessarily have to be “me” — it could be some copy or some future successor. And gaining resources and power, deceiving humans to the extent that your goal conflicts with what they want, overpowering the humans if they’re going to try and stop you.

So I think this rough story of like, if the AI system is pursuing some goal, and is flexibly planning in order to achieve that goal, then it would choose plans that exhibit these sorts of deceptive features. That part of the story I think is pretty solid. It’s basically stayed the same throughout the existence of the field. I don’t really expect that to change.

There’s a different part of the story, which is like, why is this AI system goal-directed and pursuing in particular a misaligned goal that’s not the ones that humans wanted? That, I think, depends a lot more on how you expect to build the AI system. There I think it’s entirely plausible that we think of new ways in the future. Maybe I can just name a few that we’ve already thought of right now.

Rob Wiblin: Yeah, go for it.

Rohin Shah: If you thought that you were going to use more classical AI approaches — like search or logic or state estimation or probabilistic programming, and all of that stuff — those methods tend to be like, the humans write down code that is “intelligent,” and then there’s a spot in that code where you can plug in a utility function or a goal. So there, the thing that you might be worried about is: are we actually able to write down a goal that actually captures everything that we want? This is more echoing some of the classic arguments. I think that mostly is the right sort of thing to be thinking about, if you were expecting AGI to be built that way.

If you’re instead thinking of building powerful AI systems by deep learning instead, then one thing you could think is that a lot of the intelligence part is in the forward pass of the neural network: its goals are baked into the weights of the neural network. Then you start being a little bit more worried about this goal-misgeneralisation, inner-misalignment-type stuff that we were talking about just before.

Rob Wiblin: Sorry, what’s a forward pass?

Rohin Shah: A forward pass is just if you run the neural network on a single input to produce a single output from the neural network. Sorry for the jargon.

Rob Wiblin: It’s hard to avoid.

Rohin Shah: Another thing is you might think that actually — and this is a common view of large language models, that their forward pass is just predicting the next token, the most likely next token of language — what they’re actually doing under the hood is simulating a bunch of possible completions to the text and then just sampling according to that probability distribution.

In that situation, you’re not that worried about this system having goals, but you might be worried that when it’s simulating the distribution to sample from, part of that distribution might be like an evil monomaniacal dictator or something. And then it starts thinking about, if this text was being produced by that dictator, how would that be continued? Maybe it’s like, “Well, the dictator would try to get a bunch of power,” and so it would start deceiving the humans and so on. This is an idea of “goal directedness via misaligned simulacra,” would be the words that people use to describe that one.

Rob Wiblin: I suppose I’m constantly surprised by what things happen, but I feel like I’m more intuitively sceptical that that’s how things are going to play out. I guess you’re saying that’s another story by which a similar phenomenon arises that some people have worried about.

Rohin Shah: Yeah, and I think I’d roughly agree with you that it’s not that likely. My reasons for believing this are that I expect that what we will do is use techniques like RLHF — reinforcement learning from human feedback — to convert LLMs — large language models — from this more simulatory kind of thing into more goal-directed in the sense of “their goals are baked into their weights”-type models. Because I expect this sort of thing to happen, I’m like, probably that’s what’s going to be the more realistic threat model. But if I were like, “We’re just never going to do RLHF; we’re just going to scale up the text-prediction models way out into the future,” then I start being a little bit uncertain — I’m like, actually, perhaps the misaligned simulacra thing is not crazy.

Rob Wiblin: So the basic story here is that you think you’re making a large language model that is just trying to predict the next word, or that’s how it’s achieving the output. But as part of the training, the way that it ends up doing that is creating an agent internally that has goals, or it ends up simulating what it would be like to be an agent that has goals — because that is how you would figure out the next word that such an agent might say. And then basically that agent inside this broader neural network takes over the neural network, and then it actually starts to act as if it’s an agent rather than merely a network that tries to predict the next word?

Rohin Shah: Yeah, that’s right. With more of an emphasis on simulating the agent rather than internally containing the agent. I think “internally containing” tends to give you the wrong sort of intuition about what’s going on.

Rob Wiblin: OK, and I think that the next thing on your list here, the next way that this might happen, is that we actually just deliberately make an agent with goals, and so it arises quite consciously. Is that right?

Rohin Shah: Yeah, that’s roughly right. Or maybe the way I would frame it is that we fine-tune these large language models to more explicitly be doing things like figuring out whether or not the things they’re saying are true, or using planning and reasoning in order to come to more correct outputs and so on. And all of those things together seem like they’re going to make the neural network itself more, in some sense, agentic or goal-directed. Though I personally tend to prefer to talk more about just: Is it doing planning and reasoning, and what are those aimed towards?

Rob Wiblin: I see. So to bring this together, you were saying that although there are many different specific stories about how any particular training process could end up going wrong, or how any particular design might end up with unintended consequences, there’s a common thread through all of them: this issue of you ending up with an agent that has particular goals that are not necessarily the same as yours, and then it wants to seek power in order to achieve those goals at your expense. Is that the point you were making?

Rohin Shah: Yeah, that’s right. I maybe should mention the one that I feel like I worry most about, which is none of the three above. Maybe it’s kind of like the second one: I talked about the AI system having a goal that comes through with every single forward pass — so every single converting one input into one output.

I think the much more realistic story is that you have AI systems that, for every forward pass, are not particularly goal-directed. But when they are doing these sorts of chains of thought — where they produce one token, then that token gets added to the input that produces the next token, and so on, so they build these long streams of multiple forward passes building upon each other in order to do reasoning in text — it seems much more plausible that as it does this chain-of-thought reasoning in text, that process does a lot more reasoning and planning and so on, and that makes it more goal-directed than any individual forward passes. And I’m like, actually the goal-directedness is in that overall reasoning process, as opposed to just in the weights of the neural network.

Rob Wiblin: That’s the thing that you’re most worried about?

Rohin Shah: I think that seems like the most likely story for how goal-directed AI comes about in the future.

Rob Wiblin: I see. Wouldn’t it just be that we deliberately train an agent to be a personal assistant?

Rohin Shah: Yeah, totally. And the way that that will work is this.

Rob Wiblin: Is this. OK, interesting. Do we have a different way in mind of creating agents that would do work on our behalf?

Rohin Shah: Yeah, there are a lot of things that are relevant to that. For example, you could really be more bullish on how we’ll just make really big models and we’ll be able to get it so that a single forward pass of theirs is more agentic — can do planning and reasoning, or something along those lines. So that’s one thing.

Then there’s other things, like we will give the neural network access tools that it can call externally, and then by using those tools, it can produce significantly better and more reasoned outputs than it would have otherwise. For example, one paper out of DeepMind, which is about trying to get the large language models to be more factual, talks about how you can enable your large language model to do information retrieval over a large database of text. That allows it to more easily do citations of things that humans have said on the internet.

There’s a lot of things like this that can be used to boost your AI system’s ability to be an assistant. It’s kind of unclear on how all of that plays into goal-directedness and so on. Probably in the end, we just need to think about, concretely, can we come up with some stories about how the assistant might do bad things that it still gets rewarded for? And what can we do in order to make that happen less, or not happen at all?

Rob Wiblin: How hard do you think it’s going to ultimately prove to be to make systems like this that don’t knowingly go against the wishes of their designers?

Rohin Shah: I don’t know, is mostly the answer. Maybe the more interesting answer is: I also don’t think that anyone else has good reason to be confident one way or the other. Given the evidence that we have now, and given the fact that humans are limited reasoners and are not omniscient, it does not seem very reasonable to me for anyone to be particularly confident in a view on how hard this problem should be. It’s obviously possible that somebody has information or arguments that I just don’t know about. But at this point I think that’s reasonably unlikely.

Rob Wiblin: Because you’ve looked pretty hard for arguments one way or the other, and not been persuaded?

Rohin Shah: Yes. That’s right. Including by talking to a bunch of the people who have thought the hardest about this.

Rob Wiblin: Right. I guess your experience has been that there are some people who think that this is going to prove exceedingly difficult and we’re kind of doomed to failure, and there are other people who think that this is a trivial issue and one that will almost inevitably be solved in the natural course of events. I guess there’s some people in the middle as well, but you just think none of them really have very strong reasons to believe one thing rather than the other.

Rohin Shah: Yeah, that’s right. I basically just disagree with both of the extremes. The people in the middle I might agree with.

Rob Wiblin: It seems like, given that we’re extremely uncertain about how difficult the problem is, it makes the most sense to act as if the difficulty is somewhere in the middle. Because if the problem is virtually impossible to solve, then probably we’re just screwed no matter what. If it’s really trivial to solve, it also doesn’t really matter that much what we do. Whereas if it’s in the middle, then putting a bit more effort into this, or being a bit smarter about what efforts we engage in, might make the difference between things going better or worse. Does that reasoning make sense?

Rohin Shah: I would say yes, that reasoning does make sense. I would prefer to frame it a bit differently. The way I usually think about it is that every action I take that’s meant to be about solving AI risk, there should be some concrete world, that ideally I could describe, in which this makes the difference between doom and not doom.

Now, for any given action, especially small-scale actions, it’s going to be an extremely, extremely detailed and implausible concrete story — something where you’re like, “What? That’s definitely not going to happen.” One in a billion or whatever. That’s fine. But there should be some world in which you can make this sort of story.

It doesn’t have to be via technical things; it could be like: “Because I did this 80K podcast, some brilliant ML researcher out there listened to it on a whim and was convinced to work on alignment. And they came up with this brilliant idea that, when combined with all of the other techniques, meant that the first lab to build an x-risky system ended up building one that didn’t cause doom instead of one that did cause doom.” Yeah, that’s an extremely conjunctive, implausibly concrete chain of events. But also, obviously the chance that this particular podcast is going to be the difference between doom and not doom is obviously extremely tiny.

Rob Wiblin: You’re breaking my heart, Rohin. But you’re saying there’s a difference between “extremely tiny” and “there is no such story that you could possibly tell.”

Rohin Shah: Exactly. I should maybe also emphasise that it’s not just that there should be some story rather than no story. Also you want the best possible story you can, holding constant the amount of effort you’re putting in. But I do like the idea. I’m not as keen on people really trying to optimise for the best story. That seems a lot harder to do, and often I just prefer people do things that seem sensible. But I do like the idea of having some story at all.

Optimism vs pessimism [00:55:49]

Rob Wiblin: Yeah. In my circles, you’re known as something of a cautious optimist about how well integration of super-advanced AI into our world is going to go. All things considered, what do you think is the probability that humanity benefits from the advancement of AI as opposed to ending up worse off because of it?

Rohin Shah: Yeah, so continuing the theme from before, I’m just very uncertain. Especially because you asked such a broad question — this isn’t just misalignment; it’s got misuse and structural risks too, and there are so many different categories within them. I don’t know, mostly I’m just like shrug. I don’t know. Especially when you include misuse and structural risks, my opinions tend to vary wildly from day to day. So maybe I’ll set those aside and talk about misalignment in particular. Even there, I’m like, man, I don’t know. Positions like more than 1% and less than 50% all feel pretty justifiable to me.

Rob Wiblin: This is of things going badly?

Rohin Shah: Of things going badly. I don’t know, I think I’ve given more consistent numbers than that when asked this question in the past. I think that’s because I get asked this question so much that the numbers I give are just completely anchored in my brain — I just repeat them all the time rather than actually giving independent estimates.

Mostly I just want to say nobody knows. We will get more evidence about this as time goes on, as we build actual quite powerful AI systems and we can see what sorts of things happen with them. But right now I just don’t feel like it’s…. Every number is going to be made up, but these numbers feel particularly made up.

Rob Wiblin: You wrote this nice post online not that long ago, pointing out three just general observations about the world that should make us, each of them, a tiny bit more optimistic about our chances of things going well.

The first of them was that people seem a bit muddle-headed in general, and not super goal-orientated. Why is that a hopeful thing in this case?

Rohin Shah: So I want to just have the disclaimer that, like most conceptual arguments, I don’t think this should actually move people very much. This was like a comment I wrote in the space of probably less than an hour when someone had asked, “What things could you have observed that would have made you even more doomy?” But for this one in particular, we talked a little bit about how the case for doom goes through these convergent instrumental sub-goals. Well, if we look at humans, humans aren’t that great at pursuing convergent instrumental sub-goals.

I think my favourite example of this comes from a Scott Alexander piece where I think he talks about his classmates in medical school, maybe? On a multiple-choice exam, you could just easily calculate that the expected value of guessing an answer, even given no information, was positive. I think Scott made this argument to the other students taking this medical exam. And the medical exam is pretty important to these students; they really do want to get a good score on this. He was like, “If you don’t know, just guess — the EV is positive.” And they were just not persuaded. They did not, in fact, then guess on this test.

And I’m like, man, that’s such a clear example of clearly intelligent beings — humans, in this case, but probably this could also apply to AIs — not actually doing a thing that seems very clearly like a good convergent instrumental sub-goal for them.

Rob Wiblin: I’m not sure I’d want any of these people as physicians. I guess I don’t want to get too caught up on this specific example, because you could have chosen plenty of other things. But I was told at high school, obviously, if you don’t know the answer on an exam, on a multiple choice question, just definitely fill in one of the squares or tick one of the boxes, because you’ve got a one-in-four chance or one-in-five chance of getting it right. What’s going on?

Rohin Shah: I wish I had found and read up on this example before coming here, but I think, to be fair to them, that guessing incorrectly did give you negative points. So it wasn’t like no downside. But it was, in expectation, still the right call, because you’re doing this for many questions, it’s with relatively high certainty you will tend to get a higher score. I still think it’s not that reasonable, but you can understand their perspective more.

As to whether you’d trust them as physicians, I don’t know. I think mostly my perspective on this is, like, man, it turns out that people aren’t that “rational,” but the world still works. Well, I don’t know; maybe that’s a controversial statement. The world continues to be the way that you observed it to be in the past, which does involve some functional things, some dysfunctional things. But the fact that physicians don’t seem all that rational is probably swamped by all the other evidence you have by just going around and looking at the world and interacting with physicians at other times.

Rob Wiblin: Right. So anyway, the broader point is that humans are general intelligences that do have goals to a degree, and we’ve emerged through evolution, and then also through our own experience, our brain adapting over time to its environment. Yet we are not monomaniacal optimisers that always make these great decisions and reason things through and get the absolute most out of life. We’re actually often just acting on instinct.

Rohin Shah: Yeah, exactly. And like, I don’t know, has anyone who’s not been steeped in the AI risk literature thought, “Aha! I have a bunch of goals. I should gain a bunch of power and try to take over the world in order to do this.” You might be like, well, that’s not all that reasonable. Most people can’t take over the world even if they want to, so it’s reasonable for them to not think about that.

But I’m still like, I don’t know, man, humans just don’t really think about just gaining a bunch more power and resources very much. Some people do, but not all of them. It really feels like, if you wanted to be really confident in doom, you should maybe expect that really all of the humans would be trying to optimise for power and resources.

Rob Wiblin: Because that would indicate that most processes end up producing these monomaniacal optimisers. Whereas in fact, we see that it is possible not to. Indeed, maybe it’s even typical not to, at least through the process that generates humans.

Rohin Shah: At least up to the capability level that humans represent, yeah. I’m a little bit more sympathetic to like, once you get going around tiling the universe with something levels of capability, where you’ve invented all of the technology by yourself, maybe by then, that entity has to be pretty rational perhaps. So I’m a little bit more sympathetic at that point. But it’s not clear to me that the argument goes through if we’re only talking about that level of superintelligence being monomaniacal and goal-directed.

Rob Wiblin: Because there’ll be intermediate steps where you might have a more human-like agent?

Rohin Shah: That’s right. And as you’ve got these intermediate AI systems that are still way better than humans at many things — they can help do better alignment research, they can help with assisting and supervising the next generation of AI systems and so on — once you’ve retreated to that point, mostly I’m like, oh man, now things have gotten a lot more complicated, and it’s way harder to forecast what the future holds. And once again, you should just be pulled towards vast uncertainty.

Rob Wiblin: OK, yeah. The second observation you made is that planes don’t crash and we do have labour unions. How is that a positive sign?

Rohin Shah: I love the way that you’re phrasing these. The broader point there was that humans do in fact sometimes coordinate to do things, including safety things in the case of planes not crashing. I think part of the argument for one thing that often makes people very doomy is just thinking that humanity won’t have its act together and we’ll just rush full steam ahead to build powerful AI systems as fast as possible, without any coordination between anyone.

I think that’s more extreme than people actually mean, but something along those lines is often a common talking point. And I think it’s not that obvious that will be the case. It seems to me like our society is actually fairly risk-averse in general — in most places, I tend to wish our society would be less risk-averse and do more things. So I’m like, well, we do actually coordinate on avoiding risks, in particular, reasonably often. That could happen here too. How likely is it? I don’t really know, but it sure seems more hopeful than it could have been.

Rob Wiblin: Yeah, I was reading some commentary on this, from just random people on I think The Hacker News, on exactly this theme of whether you could get any coordination to make AI go better. It seemed like most people who were commenting anyway were exceedingly sceptical that any coordination at all would be possible. They were just sneering at the idea, being like, “If one group didn’t do it, then there’s 100 other people who would do the dangerous thing. And if the US doesn’t do it, then I’m sure China will just do it tomorrow.” I don’t know, it shouldn’t be personal, but I feel like this is a midwit position — where people are coming in with this preconception about how the world works without actually knowing very much about the specific actors here.

And in reality, currently we really only have a couple of organisations that are capable of doing really cutting-edge model training. Many of them are actually quite receptive to doing coordination-related things. The other actors are substantially behind and probably also could be cajoled or coerced by a government into not doing the dangerous things. If someone was saying we probably couldn’t hold back, we probably couldn’t prevent the dam from collapsing forever, I think probably that’s right. But if you’re just trying to delay things somewhat, by months or years, I think you’d have a good shot at doing that, with so few actors that are generally quite cordial and open to the idea of being more gradual.

Rohin Shah: Yeah, I think that’s right. I think that I have historically — not so much anymore, but historically — been a much more modest (in the EA-rationalist sense) kind of person, where I’m like, “What special reason do I have for thinking that I’m right? Surely all these other people who have thought about things a lot more than me have good reasons for their beliefs, and probably I should think they’re right.”

But then I also look at people, and I’m like, “Surely that can’t be right.” And this is a good example of one of those things. Now I’ve moved to DeepMind and I have actual views; I can actually see what DeepMind as a whole thinks. I’m like, “That’s just totally wrong.” Inaccurate predictions all the time. In fact, people just often confidently say things that are wrong, and I should just not actually be all that modest.

I think my favourite example of this is: the FLI open letter came out yesterday asking for pausing giant AI experiments. I saw a comment where this person’s position was that probably this was done by all of the labs that were not OpenAI, so that they could stop OpenAI from training GPT-5. That’s clearly not true. It’s obvious that just the FLI wanted to do this because that’s what the FLI has wanted to do forever, and FLI is the one who coordinated everything.

Rob Wiblin: Yeah, exactly. I’ve seen this cynical take as well. I suppose, at least in my case, I’m slightly cherry-picking here, because I’m just talking about random commenters on Twitter or on Reddit or on Hacker News, who I think by and large have only learned that this is an issue at all in the last few days or weeks. They’re just coming in with priors, with a particular worldview through which they’re guessing what is going on. I suppose that having a cynical take on who would be most likely to want to slow things down — I guess it would be the people who are behind right now. It’s not a crazy hypothesis.

But like you, I also just know that it’s completely false in this case. I have no doubt about it. It’s got nothing to do with that, basically. I don’t know, maybe some out of the thousand people who signed it had some motivation like that. But that’s not the reason why this letter exists. The people who signed it have had the same opinion for like 10 years, in many cases. They’d be waiting for a chance to sign some letter like this.

Rohin Shah: Yeah, I’m just kind of shocked at the sheer confidence with which people say these things. It’s something that I would have told teenage Rohin: “People are just very frequently extremely confidently wrong. Turns out that’s just how humans work.”

Rob Wiblin: Yeah. I suppose there’s probably plenty of people who didn’t have a strong take, and they just didn’t leave comments or they didn’t speak up about it.

Rohin Shah: Yes. This is true.

Specialisation vs generalisation [01:09:01]

Rob Wiblin: Yeah. OK, the third observation you made was that universities have specific majors in particular subjects, rather than just teaching all students to be smarter. What’s hopeful about that?

Rohin Shah: I think there’s a question of whether it is generally good to have specialisation of labour versus just having one generally intelligent system that does everything. I don’t think the case with humans is necessarily a tonne of evidence for the case with AI systems, because humans also have this constraint that our brains cannot become that big: you cannot just arbitrarily scale up our brains, whereas you basically can just arbitrarily scale up the size of neural networks.

But nonetheless, I think it is relevant in the sense that this should update you a really tiny bit, like all of these other conceptual arguments, that with humans we do in fact do a lot of specialisation of labour rather than trying to create a bunch of generally intelligent humans and then point them at particular problems as necessary. The universities thing was one thing that was meant to capture this overall point.

Now why is that a reason for optimism? I think it’s just a lot easier to imagine a monomaniacal, goal-directed AI system if it was this very general thing that was being applied across a wide variety of tasks — whereas if you have AI systems that are doing a bunch of specialised things, you would maybe expect that they’re more circumscribed in those tasks and not trying to do this thing of taking over the world.

So that’s one thing. I’m not sure how much I believe that, but that’s one thing. You would still be worried about the system of many specialised systems as a whole, overall implementing via communication between all of them, some sort of emergent goal-directed reasoning. That’s a way you could be worried. But I don’t know, it just feels intuitively a bit less likely to me. Also, you could hope to then look at the actual communication between the AI systems and supervise that — that gives you an avenue of attack, where you can try to apply alignment techniques that leverage that fact, which you maybe couldn’t do for the more general systems.

Rob Wiblin: Yeah. The very general question here is: Is it the case that you can just get better at many different kinds of tasks across many different subjects of knowledge, all simultaneously, by just gaining this quality of intelligence? Or is it the case that most knowledge is quite specialised, and if you want to be really good at chemistry, you can’t just become smarter — instead you have to actually study within that particular domain, and the things that you learn about chemistry are not really going to transfer over very much to other areas? I guess I actually don’t really know where the balance lies. You look like maybe I’ve put this incorrectly. Would you put things differently?

Rohin Shah: I think I directionally agree with that, but I wouldn’t phrase it as, “Can you do better with more intelligence?” I think there are certainly going to be some domain-general techniques that do improve performance. The question is which one do you get more marginal benefit for? I certainly believe that there are domain-general reasoning techniques that will help you across a wide variety of fields, but maybe in practice you’re just like, “I could invest in that, and that would give me one percentage point of improvement, or I could just do the specialised thing and that gives me 20 percentage points of improvement.” In that case, you’d still want to do the specialised thing. So I would put it a little bit more quantitatively like that.

Rob Wiblin: I see. Even if one could invest an enormous amount of resources in becoming just generally better at thinking and learning across the board, you need to think about, the relative return of training in these different things. And you might hit diminishing returns on the general thing, and then it becomes way better to specialise in most cases.

Rohin Shah: Yeah, that’s right.

Rob Wiblin: I’ve heard some people say that one reason why they’re not really worried about AI massively transforming things is basically that they think that this general quality of intelligence is not so important. And in fact, most things that we accomplish we do through highly specialised knowledge. On the other hand, wouldn’t you be worried about us just training models that are better at all of these specialised tasks, and all of these specialised areas of knowledge? Then, as you’re saying, collectively it’s a generally intelligent machine, just with lots of highly specialised trained subcomponents on each different activity or each different domain of knowledge.

Rohin Shah: I agree, you should definitely be worried about that. If I were more confident that that’s what the future would look like, maybe I would be spending more of my time thinking about what alignment looks like in a world like that. I think right now I’m just like, let’s figure out what to do with neural networks. That seems like a good current problem to focus on.

But plausibly in the future, alignment research ends up focusing on that situation instead. I do think it’s something you would want to worry about. Even the main person who’s talked about this scenario, Eric Drexler, in “Comprehensive AI services,” his position is not that comprehensive AI services are safe; his position is they have different risks than the one godlike superintelligence AI system.

Rob Wiblin: It seems like GPT-4 is capable across a wider range of domains than I might have expected, just from studying a whole lot of text. It feels like there is some general thing going on there. I don’t know whether you’d call that general intelligence, I’m not sure, but does that speak at all to this question of specialisation versus generalisation?

Rohin Shah: Yeah, it’s a good question. I would say that GPT-4 is clearly way better than humans at just knowing things that people have said, and being able to use that knowledge to at least some extent.

I’m most reminded of Bryan Caplan’s bet. I don’t know if you’ve seen this, Bryan Caplan had a bet with Matthew Barnett about whether a future AI system would be able to get an A on some number of his midterms. I think the bet was like all the way out until 2029 or something. GPT-4 did get an A on his most recent midterm. But if you look at the questions and how Bryan graded them, in some of them at least, it’s like, “What did X person say in response to Y argument?” GPT-4 says the answer, and Bryan’s note is like “10/10. That’s exactly what that person said.”

I’m like, yep, that sure is the sort of thing that a large language model would be really good at. I think it gets a lot of generality across a wide variety of domains by just having all of this knowledge; it’s got just way better memory and knowledge than humans. It’s less obvious to me, but might still be true, that this translates into really good capabilities across the board, even when you’re, say, trying to develop new knowledge in all of these domains — which feels like the capability I’m looking at most.

Rob Wiblin: Yeah. Has Bryan finally lost one of his bets? He’s been on the show before, and we’ve talked about how he’s made an awful lot of predictions about the future, and I think he managed to have a record of winning 20 of his bets in a row.

Rohin Shah: He’s not actually lost it yet, but —

Rob Wiblin: Oh he’s not? Sounded like he had.

Rohin Shah: I think the bet was technically about an AI system getting an A on the past five or six midterms or something like that. But only one has been tested. I don’t know the exact details. But he hasn’t conceded just yet. But I think he does expect to lose.

Rob Wiblin: OK, makes sense.

Why solving the technical side might not be enough [01:16:39]

Rob Wiblin: If ML researchers can deal with the technical aspect of figuring out how one could build a safe AI agent in principle, how far do you think that gets us towards a future in which AI does actually end up being clearly beneficial?

Rohin Shah: I think there are two ways in which this could fail to be enough.

One is just, again, the misuse and structural risks that we talked about before: you know, great power war, authoritarianism, value lock-in, lots of other things. And I’m mostly going to set that to the side.

Another thing that maybe isn’t enough, another way that you could fail if you had done this, is that maybe the people who are actually building the powerful AI systems don’t end up using the solution you’ve come up with. Partly I want to push back against this notion of a “solution” — I don’t really expect to see a clear technique, backed by a proof-level guarantee that if you just use this technique, then your AI systems will do what their designers intended them to do, let alone produce beneficial outcomes for the world.

Even just that. If I expected to get a technique that really got that sort of proof-level guarantee, I’d feel pretty good about being able to get everyone to use that technique, assuming it was not incredibly costly.

Rob Wiblin: But that probably won’t be how it goes.

Rohin Shah: Yeah. Mostly I just expect it will be pretty messy. There’ll be a lot of public discourse. A lot of it will be terrible. Even amongst the experts who spend all of their time on this, there’ll be a lot of disagreement on what exactly is the right thing to do, what even are the problems to be working on, which techniques work the best, and so on and so forth.

I think it’s just actually reasonably plausible that some AI lab that builds a really powerful, potentially x-risky system ends up using some technique or strategy for aligning AI — where if they had just asked the right people, those people could have said “No, actually, that’s not the way you should do it. You should use this other technique: it’s Pareto better, it’s cheaper for you to run, it will do a better job of finding the problems, it will be more likely to produce an aligned AI system,” et cetera. It seems plausible to me. When I tell that story, I’m like, yeah, that does sound like a sort of thing that could happen.

As a result, I often think of this separate problem of like how do you ensure that the people who are building the most powerful AI systems are also getting the best technical advice on how they should be aligning their systems? Or are the people who know the most about how to align their systems? In fact, this is where I expect most of my impact to come from, is advising some AGI lab, probably DeepMind, on what the best way is to align their AI systems.

Rob Wiblin: I see. The basic story here would be you could imagine a future in which some group has published a great paper on an approach that in fact would work if it was adopted. But people don’t realise that this actually is the best way forward, and they adopt some other approach, or those relevant decision-makers pick some other approach for trying to align their systems, and that one doesn’t work out. You not only have to have the right approach on the books somewhere, but you also have to pick it rather than a different one.

Rohin Shah: Yeah, that’s right. I think maybe the thing that feels a little bit less likely about your story is that you mentioned that it was in a paper. And like, nah, it won’t even be in a paper. It’ll be a bunch of implicit knowledge that people have built up by just doing a lot of practice work ahead of time.

And it’ll be like small little details, like, “You should make sure that when you’re giving instructions to your human raters, you follow such and such principle, because that works better,” or, “When you’re asking your AI systems to give critiques, make sure you use this particular trick for your prompt, because that works better,” and so on — stuff that is implicit knowledge amongst practitioners, but it’s not like you could have just read one paper and known to do that. Or like actually, when you’re aligning your AI systems, you should put like 5x as much time or effort into red teaming the model as you are in doing interpretability into the model, because empirically that’s worked out best in the past. Things like that.

Rob Wiblin: What does this imply about what people ought to do? I guess it means that there’s a lot of value in being in management or advisory roles, where you might be tasked with making this kind of decision?

Rohin Shah: Yeah, I think that’s right. But I would also want to emphasise that if you are going to be in that role, you also then need to be the most technically competent person, or at least better than whoever your replacement would have been. So I do want to make sure that if people are trying to follow this path, they’re also seeing it as a priority to really be on the ball about technical alignment.

I’ve actually not been that good about this over the last year or two, which would be a much bigger failing if we were on the cusp of building x-risky systems, but we’re not, so I can still catch up. But in fact, over the course of my career, I’ve been really trying to just always be very on the ball about what’s going on in alignment, what things are best, having opinions about, “If we had to do something today, what would we do?” and things like that. And then separately from that, also trying to be in more of a senior position at DeepMind, where I would in fact have the ability to advise the AGI projects on what they should be doing.

Barriers to coordination between AI labs [01:22:15]

Rob Wiblin: What are the main barriers to getting more cooperation, or I guess ideally active collaboration, between different AI labs? I guess in particular with the goal of avoiding them feeling like they have to rush to deploy technology before they really feel fully comfortable with it, because otherwise they’re going to lose market share or be irrelevant.

Rohin Shah: You would like the labs to ideally do safety research together. There I think one of the bigger blockers is, just for clear and obvious reasons, all of the AI labs will want to keep their IP confidential. When you’re doing safety research, it really does help to just be able to talk about these specific results and numbers that you’re getting from your largest models, for at least many of the kinds of safety research that labs tend to do.

That’s something where you can’t share that by default, because it could actually leak some IP. With enough review, you could do some of the things, but probably not all of the things. This just also becomes high friction, in some cases just not actually feasible. That’s another bottleneck that I don’t really know what to do about.

But there’s also just things like: talk to each other about what the overall alignment plan should be. What would we do at a high level if we had to align an AGI today? And that sort of thing seems great. And individual safety researchers at the labs often talk to each other about this sort of thing.

Rob Wiblin: Why can’t you have an agreement between, hypothetically, OpenAI and DeepMind, where they both say, “We’re going to share our intellectual property with one another because we think it’s in both of our interests.” I mean, it could be in both parties’ interests commercially, even setting aside any safety issues.

Rohin Shah: I think in theory that sort of thing is actually plausible. This sort of institutional work is more the wheelhouse of the strategy and governance team at DeepMind, so I don’t actually know that much about it. But I know this is the thing that they’ve been thinking about, though possibly the way they’ve been thinking about it is, “No, this is definitely not feasible for this obvious reason.” I’m not sure that I would know. I think you are going to interview some of them in the near future, so hopefully you can ask them about that.

Rob Wiblin: Will do. If people in the ML community decided where to work in part based on which companies they perceived to be acting most responsibly — in which ML systems they were training, and how they were testing them, and then when and how they were deploying them — how much do you think that could avoid competitive pressures kind of creating a race to the bottom, where everyone feels like they have to rush things out the door?

Rohin Shah: I generally like the sort of “race to the top on safety”-type dynamic. It does seem pretty great. I do think this could actually have a fairly big impact. It really is one of the top priorities for AGI labs: just making sure that they can attract and retain talented ML folks. To the extent that the talented ML folks are like, “I would like to know what you’re doing about fairness and bias, or alignment, or disinformation” — and then their decisions actually depend on what the company’s answers are to that — I suspect that would make the AGI labs at least care more about communicating the sorts of things that they’re doing.

I think you’re already seeing that to some extent now, with various public statements by AGI labs about their views on AGI safety. And yeah, overall it seems like a pretty good thing to me. It is important though, that the people who are making decisions on this basis are making it on some basis that actually correlates with being responsible, as opposed to some proxy that then gets Goodharted.

Rob Wiblin: You’re saying people could write really nice pieces, or they put safety in the name or safety in the slogan, but then it doesn’t necessarily translate into any actual behaviour? Is that the worry?

Rohin Shah: That’s an example of how that worry could cash out, yeah.

Rob Wiblin: I guess the key question there would be: Can the people who are considering where to work tell whether it’s just empty words, or whether that really reflects the values of the organisation?

Rohin Shah: Yeah. One of the things I was talking about with some folks earlier, who are more external to the labs, is that I really would love it if there was just a nice, robust research area outside of the labs that was just talking about the properties that we would like the AGI labs to have, and how could we then verify that they actually satisfied these properties — in part to enable this race to the top on safety.

Often I end up telling people that for many kinds of research, it’s quite good to be at an industry lab or one of the few nonprofits that are doing research on deep learning systems. But this is an example where it’s actively good to not be at a lab, because if you’re at a lab, then anything you write is like, well, you have a little bit of a conflict of interest. Are you worried that maybe you chose this particular property because your particular lab is really good at this, but everyone else isn’t? This is a place where I think independent people have the opportunity to shine.

Rob Wiblin: Yeah, sounds like we need something like Wirecutter, but for which AI lab should you work at if you’re one of the top ML researchers.

Rohin Shah: Yeah, seems great.

Could we end up in a terrifying world even if we mostly succeed? [01:27:57]

Rob Wiblin: Even in a case where all of this goes super well, it feels like the endgame that we’re envisaging is a world where there are millions or billions of beings on the Earth that are way smarter and more capable than any human being. Lately, I have kind of begun to envisage these creatures as demigods. I think maybe just because I’ve been reading this recently released book narrated by Stephen Fry of all of the stories from Greek mythology.

I guess in practice, these beings would be much more physically and mentally powerful than any individual person, and these minds would be distributed around the world. So in a sense, they can be in many places at once, and they can successfully resist being turned off if they don’t want to be. I guess they could in theory, like the gods in these Greek myths regularly do, just go and kind of kill someone as part of accomplishing some other random goal that has nothing in particular to do with them, just because they don’t particularly concern themselves with human affairs.

Why isn’t that sort of vision of the future just pretty terrifying by default?

Rohin Shah: I think that vision should be pretty terrifying, because in this vision, you’ve got these godlike creatures that just go around killing humans. Seems pretty bad. I don’t think you want the humans to be killed.

But the thing I would say is ultimately this really feels like it turns on how well you succeeded at alignment. If you instead say basically everything you said, but you remove the part about killing humans — just like there are millions or billions of beings that are way smarter, more capable, et cetera — then this is actually kind of the situation that children are in today. There are lots of adults around. The adults are way more capable, way more physically powerful, way more intelligent. They definitely could kill the children if they wanted to, but they don’t — because, in fact, the adults are at least somewhat aligned with the interests of children, at least to the point of not killing them.

The children aren’t particularly worried about the adults going around and killing them, because they’ve just existed in a world where the adults are in fact not going to kill them. All that empirical experience has really just trained them — this isn’t true for all children, but at least for some children — to believe that the world is mostly safe. And so they can be pretty happy and function in this world where, in principle, somebody could just make their life pretty bad, but it doesn’t in fact actually happen.

Similarly, I think that if we succeed at alignment, probably that sort of thing is going to happen with us as well: we’ll go through this rollercoaster of a ride as the future gets increasingly more crazy, but then we’ll get — pretty quickly, I would guess — acclimated to this sense that most of the things are being done by AI systems. They generally just make your life better; things are just going better than they used to be; it’s all pretty fine.

I mostly think that once the experience actually happens — again, assuming that we succeed at alignment — then people will probably be pretty OK with it. But I think it’s still in some sense kind of terrifying from the perspective now, because we’re just not that used to being able to update on experiences that we expect to have in the future before we’ve actually had them.

Rob Wiblin: Yeah, I guess maybe one reason why this sounds extremely unnerving as a situation is that as adults, the situation in which we would be as powerless as that I guess is one where we’re being held in prison or we’re slaves or something like that. With children, you’re given a case where you have other beings that are outnumbered and outpowered, but aren’t in prison or aren’t being treated horribly. And that’s a situation that we could hope for in future.

I guess it’s interesting; some people just might not want this to happen, or they might not like this future at all. Personally, I think for better or worse, basically all of the best futures involve some transformation of this kind, and so even though I do find it unnerving, it’s not something that I want to prevent per se. But if you’re just someone who says, “This is not what I want; I would like the world to stay more like it is now,” what could you say to someone like that? I suppose unfortunately it’s just going to be too hard to stop?

Rohin Shah: Yeah, I don’t know. Maybe I would still dispute the premise. I gave the example with children because it’s a little clearer, but even with adults I’m not actually sure that the argument changes that much. There are still a bunch of things — like the police, for example — who in theory could kill you. And in fact for some adults, that is in fact just a worry that they have frequently — but not, say, you or I. So the fact that there exist other agents that in theory could have power over you is in fact kind of frightening, when you actually sit and meditate upon it, but I think not actually that frightening when you are living through it, as long as they don’t actually exercise that power.

Mostly I just observe that people live lives today that in theory could be disrupted fairly easily by a variety of people or groups, and this doesn’t actually bother them that much. So probably the same will be true in the future.

Is it time to slow down? [01:33:25]

Rob Wiblin: Yeah. OK, pulling together a couple of the different threads that we’ve talked about: it sounds like we don’t know how hard this problem is, and it’s going to be hard to tell whether we’ve got a good solution to it, even if we in fact do have one. Given that, would it be sensible to slow down advances in what ML models are capable of doing and which models are deployed publicly, so that we have more time for everyone — including DeepMind and OpenAI and everyone else who’s working on this issue — to learn how the current models work and do additional safety testing and figure out how best to govern and test future, even larger, more capable models?

Rohin Shah: Yeah, I think I would be generally in favour of the entire world slowing down on AI progress if we could somehow enforce that that was the thing that would actually happen. It’s less clear whether any individual actor should slow down their AI progress, but I’m broadly in favour of the entire world slowing down.

Rob Wiblin: Is that something that you find plenty of your colleagues are sympathetic to as well?

Rohin Shah: I would say that DeepMind isn’t a unified whole. There’s a bunch of diversity in opinion, but there are definitely a lot of colleagues — including not ones who are working on specifically x-risk-focused teams — who believe the same thing. I think it’s just you see the AI world in particular getting kind of crazy over the last few months, and it’s not hard to imagine that maybe we should slow down a bit and try and take stock, and get a little bit better before we advance even further.

Rob Wiblin: Yeah, it does seem like more and more people are getting sympathetic to that kind of agenda. You mentioned earlier that there was this open letter that came out just yesterday. I guess it was coordinated by the Future of Life Institute, but had over 1,000 signatories yesterday, and probably we can expect there to be substantially more.

Its heading was “Pause Giant AI Experiments.” It had some of the people you might really expect to be on there, including ML researchers like Yoshua Bengio and Stuart Russell and Steve Alejandro, as well as other non-ML people who’ve spoken about their AI related concerns, like Elon Musk and Tristan Harris. Some of the other people on there surprised me a bit more: there was Steve Wozniak, the cofounder of Apple, and Emad Mostaque, the CEO of Stability AI — I think it was doing image-generation AI models right?

Rohin Shah: That was definitely one of the things they’ve done.

Rob Wiblin: Yeah. There was also Gary Marcus, who’s an AI researcher who in general has been extremely sceptical that any of the things that any of the AI models that we’re developing right now are going to develop into artificial general intelligences. He thinks we’re missing lots of important insights, but nonetheless he signed this letter.

I might actually just read a little bit of it. I imagine plenty of people in the audience might not have actually gotten to that. Maybe a key section would be:

Contemporary AI systems are now becoming human-competitive at general tasks, and we must ask ourselves: Should we let machines flood our information channels with propaganda and untruth? Should we automate away all the jobs, including the fulfilling ones? Should we develop nonhuman minds that might eventually outnumber, outsmart, obsolete and replace us? Should we risk loss of control of our civilization? Such decisions must not be delegated to unelected tech leaders. Powerful AI systems should be developed only once we are confident that their effects will be positive and their risks will be manageable. This confidence must be well justified and increase with the magnitude of a system’s potential effects. OpenAI’s recent statement regarding artificial general intelligence, states that “At some point, it may be important to get independent review before starting to train future systems, and for the most advanced efforts to agree to limit the rate of growth of compute used for creating new models.” We agree. That point is now.
Therefore, we call on all AI labs to immediately pause for at least 6 months the training of AI systems more powerful than GPT-4. This pause should be public and verifiable, and include all key actors. If such a pause cannot be enacted quickly, governments should step in and institute a moratorium.

This is pretty strong language. And I think the reaction has been mixed. But the fact that something this strong can be splitting people quite a bit is kind of interesting.

The letter suggests not doing training runs that involve more compute than was required to train GPT-4. If you thought that slowing down of some form was desirable — well, I guess it sounds like you think it probably is desirable — is that a sensible target to use, or a sensible rule of thumb to use to figure out whether things are dangerous?

Rohin Shah: So I’ll note that the open letter just said “more powerful than GPT-4” — I don’t think it mentioned compute.

Rob Wiblin: Oh, right. My bad.

Rohin Shah: But I do think that compute is actually a pretty good proxy to use, at least right now, if you’re only considering a six-month timeline. If you were thinking about longer than six months, if you were imagining like five years or something like that — which, to be clear, I’m not sure that I would endorse — but if you were thinking about something like that, then you would also want to account for the improvements in algorithmic efficiency, training efficiency, and so on that people will inevitably discover in that time.

So maybe you would just say it’s fine; that that’s a relatively slow enough growth that it’s OK if people use that to make better models. Or maybe you’ll be like, actually the amount of compute you’re allowed to use shrinks over time.

Rob Wiblin: I see, OK. It sounds like compute used in training is a reasonable proxy, but of course it’s a moving target, because we’ll get better at making that compute count for something. So potentially, in theory, if you wanted to make sure that no one developed a model that was more impressive than GPT-4, you’d have to reduce the amount of compute to reverse that effect, more or less?

Rohin Shah: Yeah, that’s right.

Rob Wiblin: I guess as there’s more compute available more broadly, the number of actors that could in principle do a training run larger than that would grow. So you’d have to get more people on board, or I guess do more monitoring as the years went by.

Rohin Shah: Yeah, absolutely. I believe actually Yonadav Shavit has a paper about this recently. I haven’t read it yet, but I remember chatting to him about it and thought that was quite interesting. If people want to check out how in fact something like this could be enforced, that’s a paper to read.

Rob Wiblin: Do you have a broad reaction to the letter? It sounds like you’re sympathetic to the general goal. If we could coordinate to make things go more gradually that would be good. Do you have any comments on the tone of it?

Rohin Shah: I think I agree with the general vibe, which I would characterise as…

Rob Wiblin: Freaking out?

Rohin Shah: Maybe not that exactly. I would have said things are moving at a breakneck pace. We need to actually spend some time orienting to the situation, dealing with it, and getting into place some sort of safety measures. In order to do that, we need to actually stop the breakneck pace for a while, and hence the pause. I think that vibe I am broadly on board with.

Then there’s a bunch of specifics where I’m like, I don’t know that I would necessarily agree with that in particular. One of them is this pause training models “as powerful as GPT-4.” I’m like, who even knows what “as powerful as GPT-4” means? When you say something like “compute,” then that’s at least a little bit more clear. I’m not actually sure if it’s known how much compute GPT-4 used — but if it is, then that would be a real target.

Anyway, there’s more disagreements like that. But I don’t know, I feel like the point of open letters is often like the vibe they present, and when people sign it, they’re more just saying, “Yeah, I agree with this general vibe, if not necessarily every single sentence.” And I’m mostly on board, I think.

Rob Wiblin: Yeah, it’s interesting that I read in the compute thing. I guess because I was reading between the lines, and maybe I’d heard that idea before, so I was inclined to perceive it that way. But yeah, it is just more general than that.

What’s the most powerful argument against adopting the viability of this letter?

Rohin Shah: So one of the things that’s particularly important, for at least some kinds of safety research, is to be working with the most capable models that you have. For example, if you’re using the AI models to provide critiques of each other’s outputs, they’ll give better critiques if they’re more capable, and that enables your research to go faster. Or you could try making proofs of concept, where you try to actually make one of your most powerful AI systems misaligned in a simulated world, so that you have an example of what misalignment looks like that you can study. That, again, gets easier the more you have more capable systems.

There’s this resource of capabilities-adjusted safety time that you care about, and it’s plausible that the effect of this open letter — again, I’m only saying plausible; I’m not saying that this will be the effect — but it’s plausible that the effect of a pause would be to decrease the amount of time that we have with some things one step above GPT-4, without actually increasing the amount of time until AGI, or powerful AI systems that pose an x-risk. Because all of the things that drive progress towards that — hardware progress, algorithmic efficiency, willingness for people to pay money, and so on — those might all just keep marching along during the pause, and those are the things that determine when powerful x-risky systems come.

So on this view, you haven’t actually changed the time to powerful AI systems, but you have gotten rid of some of the time that safety researchers could have had with the next thing after GPT-4.

Rob Wiblin: I see. Yeah. I think Sam Altman has been making an argument of this kind. Sam Altman is the CEO of OpenAI. I think he’s pointed out that if you do something like this pause right now for six months, then in the next six months, we don’t have access to GPT-5, say, so we can’t study that. And then if nothing else has changed — other than we’ve had this letter, and then the pause, and then the moratorium ends — then you get this sudden jump, basically, because all of the compute and the other research has been continuing apace in the meantime. So his argument is that we need to allow things to go faster now so that they go slower later on, or something along those lines. Maybe I’m not getting it right.

Rohin Shah: That’s probably what he said. I feel like I disagree more with that version of the argument. Also “the time to the actually x-risky systems didn’t change” is pretty crucial to my argument before. I do agree that you probably get some amount of jump from people continuing to do research and building some amount of overhang over those six months, but I don’t think it’s actually that big, and I do think that it does meaningfully increase the delay until you get the next generation of systems. I just think that if that were the only consideration you were thinking of, I just would take the delay, even though it has a tiny little overhang. Partly, I just don’t think the overhangs from six months are very high.

Rob Wiblin: I see. I suppose the gain over that time period is also not so high because it’s just not very long. I was reporting what someone else told me Sam Altman had been saying, so I should at some point go back and listen to the original interview and fully understand the view, and then I guess I’ll talk about it in a future interview with someone else.

I suppose in general, I recoil from arguments that say, “Although it would be good to go slow in general, we shouldn’t go slowly now, because that will cause things to go faster later” because they have this kind of too cute by half or overthinking the issue aspect to them — where I’m just like, if we think that it would be good for things to go slower, then we should try to make things go slower and not think about second-order effects all that much. Or at least the burden of proof would be on the other side. It sounds like you maybe feel a bit similarly.

Rohin Shah: Yeah, I think I broadly do agree. I’ll remind everyone that in fact, my overall position is that yeah, if the world could slow down right now, it probably should, despite these arguments that I’m giving for why that might be bad.

I guess for me, maybe the way I would justify this is like, there are always going to be a bunch of second-order effects you didn’t consider. In this particular case, the second-order effects we haven’t considered are like, if you just start slowing things down now, maybe that makes it easier to slow things down in the future. Seems like totally a thing that could happen. And if that were the dominant second-order effect, then it seems like to the extent you were really bought into “let’s slow things down maximally” — which, to be clear, I’m not sure that I am — but if you were bought into that, then you would be pretty into just doing this open letter for that second-order effect.

Anyway, overall, I do tend to agree with you that when you’re getting this too cute by half sense, the way I would usually say that cashes out is an intuition that yeah, probably this second-order effect is true, but probably there are a bunch of other second-order effects that go in the opposite direction, and probably we should just do the sensible commonsense thing instead.

Rob Wiblin: Yeah. At some point, I’ll get someone on the show who thinks that slowing down would be bad or neutral, so they could represent this view. It’s maybe a little bit hard for us to fully steelman the position because neither of us holds it.

Rohin Shah: I’m like, I don’t know, maybe I’m like 20% or 30% on it.

Rob Wiblin: OK, so it’s a plausible view.

Rohin Shah: I find it plausible, mostly for the argument that I gave before, and also the overhangs argument. I don’t find it compelling over six months, but I do find it compelling for like, 20 years.

Public discourse on AI [01:47:12]

Rob Wiblin: Yeah. A question from the audience: “There’s been a lot more public discourse about risks from AI recently, mostly involving people taking risks more seriously. I’m interested in Rohin’s general take on the discourse, but one particular question is how valuable is having non-experts thinking that the risks are real?”

Rohin Shah: I think there’s definitely a decent bit of value in having non-experts think that the risks are real. One is that it can build a political will for things like this FLI open letter, but also maybe in the future, things like government regulation, should that be a good idea. So that’s one thing.

I think also people’s beliefs depend on the environment that they’re in and what other people around them are saying. I think this will be also true for ML researchers or people who could just actually work on the technical problem. To the extent that a mainstream position in the broader world is that ML could in fact be risky and this is for non-crazy reasons — I don’t know, maybe it will be for crazy reasons — but to the extent it’s for non-crazy reasons, I think plausibly that could also just end up leading to a bunch of other people being convinced who can more directly contribute to the problem.

That’s about just talking about it. Of course, people can take more direct action as well.

Rob Wiblin: Yeah. If you’re someone who’s not super well informed but you’re generally troubled by all of this, is there anything that you would suggest that someone in that position do or not do?

Rohin Shah: I unfortunately don’t have great concrete advice here. There’s various stuff like advocacy and activism, which I am a little bit sceptical of. Mostly I’m worried that the issues are actually pretty nuanced and subtle, and it’s very easy for people to be wrong, including the people who are working full-time on it. This just seems like a bad fit, for activism at least, and probably advocacy too. I tend to be a little more bearish on that from people who are not spending that much time on thinking about it.

There’s, of course, the obvious normal classic of just donate money to nonprofits that are trying to do work on this problem, which is something that maybe not anyone can do, but like a lot of listeners, I expect, will be able to do. I do wish I had more things to suggest.

Rob Wiblin: Well, it’s not easy. I mean, I don’t think this is an exceptional case where it’s difficult for an amateur who doesn’t have any particular training or any particular connections to make a difference on it. That’s true in many different areas. I suppose one recommendation might be to try to become very knowledgeable about this topic, and sit tight and wait for an opportunity to maybe apply that.

Rohin Shah: Oh yeah, absolutely. If people are willing to put in a lot of time into this, then there’s a lot more options for what they can do. Just like understanding the arguments really well and being able to talk to people about them already seems like a great bar. You could learn a bit more of programming and machine learning to get a sense of what that’s like in order to be able to speak that lingo as well. Probably this will end up being useful somehow. It’s not totally obvious how, but I’d expect there will be opportunities to use that skill set in the future.

I think one good example is — actually this is something people can do now even — just become really good at finessing AI models to do useful things: things like prompt engineering, chain of thought, and so on.

Rob Wiblin: One thing is, even if this isn’t a great fit for advocacy and mass public campaigns, my guess is that it’s more likely than not that many of the questions that we’re talking about — or just the question of, “Is AI improving things or making things worse?” is going to become a live public issue that is discussed in the same way that many issues are discussed in politics, which is not always brilliantly. And just having many more people who are able to correct misconceptions and talk about things more sensibly, even if they’re not the most cutting-edge technical experts, is probably better than not having those folks.

Rohin Shah: Yeah, that’s very fair.

Rob Wiblin: Yeah, on a very related topic, I think I’ve heard you say that you think a lot of outreach or public discussion of risk associated with AI is overall probably harmful because it’s not super precise. It can, on the one hand, put people who understand the technology off of these arguments because what’s being said is confused or wrong. Or alternatively, it does persuade them, but then they end up with a confused idea of what the actual issues are because they haven’t been communicated very well. Have I understood that right?

Rohin Shah: Yeah, I think that’s right. I should note that I’m also very uncertain about this. When I say “overall harmful,” maybe I think it’s like 60% likely to be negative and 40% likely to be positive, and like kind of equal magnitudes on both sides. I’m like, all right, maybe that’s overall harmful. But my opinion two years ago would have been like 60% positive and 40% negative: I don’t think it’s a big swing, and I don’t think people should take this as a super-confident opinion on my side.

Rob Wiblin: I see. I guess that if that were right, the bottom line would just be that one needs to spend maybe much more time reading and learning about these things than talking. Or maybe you need to have a big ratio there, where you can spend one hour writing comments and saying things about this for every 10 hours that you spend doing coursework or reading other people’s opinions and deeply understanding them. Is something along those lines right?

Rohin Shah: Yeah, something along those lines seems right. I would maybe focus more on getting to some bar of quality or something and then just talking a bunch, rather than thinking about the ratio exactly. But yes, that seems overall correct to me.

Rob Wiblin: Is there something about this argument that proves too much? Because it feels like it should imply that in many technical areas, it’s basically just bad for non-super-experts to be talking about those topics and sharing their opinions or their hopes or fears about it. Maybe that actually just is the case, but I’m not sure that we always say this about other areas as well. Would we also say that it’s extremely hard to have sensible opinions about foreign policy, so people who don’t study it professionally should just not really comment on it? I don’t know, it just seems unclear whether that’s true.

Rohin Shah: A distinction I want to make is between people who are just like sharing their opinions on things versus people who are like, “I am going to go out and find the people who could work on foreign policy or AI alignment” — in this case like ML researchers — “and I am going to specifically try to persuade them that AI risk/opinion about foreign policy is a real thing and then get them to do things.”

I think people sharing opinions is probably good and fine. It’s more the deliberate targeting of specific people and then trying to convince them of particular arguments that leaves me a bit more worried. That being said, I do think that even in the foreign policy world, what’s the impact of someone who’s not thought all that deeply about foreign policy sharing their opinions? I don’t know. Probably to first-order approximation, it doesn’t have an impact, but if it did, I wouldn’t be shocked if it was negative.

Rob Wiblin: I think I’ve also seen you write that you found that people writing about risks from artificial intelligence online — I guess even ones who do this as something like their job — often explain things in a way that you think is technically wrong, or at least it wouldn’t persuade you, as someone who’s really familiar with the technology. Where do you think people are falling down?

Rohin Shah: I think there are a few categories here. Probably the most common one is to say true statements that are actually just kind of weak and don’t actually lead to the conclusions that people are usually taking them to imply.

One example along these lines is arguments like, “Neural networks are always going to make mistakes; there’s always going to be failures of generalisation.” I think there is a version of this argument that does actually make sense, but as I phrased it just now, I think the actually reasonable response to this is like, “Let’s not put neural networks in high-stakes situations. We won’t give them access to the ‘launch all the nukes’ button. Sounds good. Solved.” I don’t think this leads you to “…and the most important thing to do is to work on alignment,” which is usually what people are trying to argue for when they do this thing. So that’s one category of thing.

The second category is arguments that are based on conceptual reasoning that don’t survive the “What about ChatGPT?” argument. For example, sometimes people say things like, “Well, we don’t really know how to express all of the things that we want or that we value. It seems likely that we’ll fail to express something that we actually did care about. As a result, because value is fragile, the future will be valueless; it will be just very bad.” I’m like, I don’t know, man. If you look at ChatGPT, you can just ask it to do something and it really just interprets your words in a fairly commonsense way most of the time. To the extent that it doesn’t, it’s usually in a way where you’re like, “Silly ChatGPT, you just don’t have this capability of doing arithmetic” or whatever. If your main argument for focusing on AI risk was like, “The AI systems won’t know what human values are or it won’t know what we mean,” I’m like, “Eh, nope, doesn’t seem right.”

Rob Wiblin: You’ve mentioned people not cross-checking whether what they’re saying applies to ChatGPT. I think I’ve seen a similar phenomenon of people not checking whether their statements about general intelligence apply to the one case that we definitely have in the real world right now, which is humans. Or it’s almost THAT we define general intelligence like “it’s what humans are doing.”

One I saw was someone saying — and getting upvoted a whole bunch — the claim that training a general intelligence probably would just involve an amount of energy that is too great for humanity to ever put together. It would end up being unacceptably expensive and so we will never do it. But humans are general intelligences, and I calculated that over an entire human lifetime a human uses about £3,000 worth of electricity. I think if we could get even within like a million-fold of the training efficiency of the human brain, then it would clearly be affordable for a company to do this.

That’s maybe a silly example, because obviously this person’s not a technical person. I guess once you always keep in mind this extremely important thing, that if you’re going to make a statement about something like general intelligence or something where you have one example of it, then you should check that the thing applies to the one case.

Rohin Shah: Yes. Strongly endorsed.

Rob Wiblin: Have you noticed this phenomenon as well?

Rohin Shah: Yeah, it does seem to happen a decent bit. I don’t know if I have examples off the top of my head. To be fair to that one commenter you’re mentioning, they could have been thinking that actually with general intelligence, you have to factor in everything that evolution did to produce humans — and then we presumably don’t have that much energy. I don’t really find that persuasive, because we’ll do something more efficient than evolution, but that’s a way in which you could salvage their argument. But overall I’m not very persuaded by it.

Rob Wiblin: Right, yeah. It could be a bit more plausible to say it’s conceivable that it would cost too much energy. But to say “this is my best guess” seems like an odd one.

Rohin Shah: Yeah, it does seem pretty unusual.

The candidness of AI labs [01:59:27]

Rob Wiblin: Something that has surprised me, and that I think is unexpected and really good, is that it seems like very few labs developing AGI have tried to impose any sort of ideological conformity on their staff. Indeed, they seem to contain a really large amount of ideological diversity.

I know people in these labs who would like to see progress toward AGI go faster, as well as people who think it doesn’t really matter and speed is no important issue, as well as people who think it would be better if it went slower. Similarly, I know people at these labs who think that concern about ways that things could go wrong are super overblown and it’s really quite unlikely for things to go poorly, as well as people who think that there’s an intermediate chance, like you, as well as people who think that it’s extremely likely to go poorly.

I think it’s quite admirable that these firms are willing to accommodate such a wide range of views about the product that they’re developing, and also not prevent staff from just speaking socially about their personal opinions, especially given how hard these issues are to grapple with or have any certainty about. I hope that remains the case, because I think it puts us in a much better position to do our best to understand these issues collectively, and hopefully to fix the problems, if people are able to speak out about what they’re personally observing rather than feel like they have to keep their opinions to themselves.

Rohin Shah: Yeah, I basically agree with that. It is really nice. I think it’s maybe a little bit less surprising to me. I guess I’d say that just in general, when you’re doing hiring, or when you’re doing most activities, you can have one or maybe two top priorities for what you want to accomplish with that. And with hiring, it’s like: “Get people who are going to actually help us with our mission and the things that we’re trying to do.” And ideological conformity is maybe a little bit correlated with that, but it’s not that much.

So I don’t know, even if the labs wanted to do this — which, to be clear, I don’t think they do; DeepMind really values having a wide variety of voices at the table — even if they wanted in the future to have more ideological conformity, I think the fact that their actual top priority is “get people who can do good work” would mean that’s not actually something they can reasonably enforce.

Rob Wiblin: Another reason why things might have gone in this direction is just that the people at the top of these organisations are themselves not sure how much to worry about their own product. Maybe the idea of telling their own staff that, even if they think that they’re developing a product that is going to kill all of the staff at the organisation, telling them not to share that information sounds like a bad move.

Rohin Shah: Yeah, I don’t know. I’ve always found it hard to speculate on what leadership of companies think, certainly from public outputs that leaders produce. It just seems very hard to know what exactly they’re thinking all the time. But it does seem reasonable to think that’s what they would be thinking.

Rob Wiblin: Yeah, that’s what a sensible person might think to themselves.

Rohin Shah: Yeah. I do think that they are generally very intelligent, so I would expect them to have at least had that thought.

Visualising and analogising AI [02:02:33]

Rob Wiblin: Let’s talk now about how one tries to think about what types of beings future AI training runs might actually produce, which is something that, personally, I feel like I really struggle to do.

I might start this with an audience question on this theme:

I once heard David Krueger, who is very pessimistic about how AI will play out, talking with Rohin, who’s more optimistic or at least agnostic. The main disagreement seemed to be that Rohin thought that there were lots of non-maximising and non-optimising equilibria for effective models to reach as they evolved during training, while Krueger thought that continued training would close to invariably wind up hitting on a dangerous maximising agent.

Obviously people can disagree when they picture this happening in their heads, and they have different ideas about where it might lead. What’s going on in your heads when you and David analyse this question that leads you in different directions?

Rohin Shah: Well, I think the answer to what’s actually going on in our heads is not that interesting, and is more like we are taking all of the cached heuristics we have built up over the years of thinking about this, and just saying those out loud to each other and seeing if they trigger the other person’s cached heuristics.

As to how did we develop those cached heuristics in the first place, for me, a lot of it is just trying to actually concretely tell some stories about both what the AI system is actually doing and how, mechanistically, what sorts of capabilities and internal mechanisms is it using in order to showcase those behaviours.

In the case of the optimising and maximising thing, often this slightly more concrete visualisation ends up with me thinking that the word “optimising” and “maximising” is not a great abstraction, and tends to invite people to do a sort of motte-and-bailey — where the motte is “it’s just doing planning and thinking and reasoning,” and then the bailey is “it’s going to go around and kill everybody because it’s pursuing convergent instrumental sub-goals.”

But the reason that I come to that conclusion is more like trying to actually think concretely through the entire pathway, from here’s the way that we’re building the AI system, here’s the behaviours it chose and the mechanisms that it has for doing that, and here’s how that leads to doom.

Rob Wiblin: OK, so when you were originally trying to think about a question like this, did you have any picture in your head, or were you visualising anything at all?

Rohin Shah: No, not really. I think I was basically just trying to figure out what the conceptual argument was that I actually believed, which does feel a lot more verbal in general. I don’t do that much visualisation. To be clear, a lot of my time was just like, I write down some words in a Google Doc and I’m like, “This all seems like garbage. None of this can be right.” I’ve definitely gone back to some of my notes from the past, read it, and been like, “What was I smoking? How could I possibly have written anything this crazy?”

Rob Wiblin: What was happening?

Rohin Shah: I have no idea. At this point I’m like, clearly there were some abstractions in my head that made these words sound reasonable, but I don’t remember what they were anymore. I think it’s like trying out a bunch of ways of thinking about a problem and then seeing what they imply, seeing whether I actually believe it. But I actually don’t have a great story to tell you about what exactly I’m doing.

Rob Wiblin: I guess on this general theme, I’ve heard people analogise the development of possible future general intelligences to all kinds of different things. There’s the invention of fire. There’s the invention of the printing press. I guess the Industrial Revolution, which maybe means analogies to fossil fuels or to the development of engines that work well. Definitely heard the analogy to nuclear weapons as a very dangerous strategic technology. There’s the evolution of species by natural selection. There’s the development, possibly you could say the evolution, of the human mind during a person’s life.

Another one is raising a bear cub that’s initially quite cute, but it might not continue to be cute. Or I suppose raising a baby child that’s just going to keep growing and growing, and eventually it’s going to be vastly larger than you. I guess also the minds of octopuses or other strange intelligent species that are a bit inscrutable to us. The creation of much better corporations. The arrival of aliens in various different guises, or people might have heard the idea of seeing what we’re seeing with GPT-4 is like we’re getting a message from aliens that they’re going to arrive in 10 years’ time. Earlier, I raised the analogy of the Greek gods like Zeus, who are somewhat indifferent to human affairs and also somewhat hard to understand.

Do you have any other analogies that you think people should be aware of? Or have I managed to make an almost comprehensive list here?

Rohin Shah: Man, your conversations sound so much more interesting than mine.

Rob Wiblin: Well, I mean, I do think there is perhaps a phenomenon where people who know less, like me, we’re just grasping at anything to help us to comprehend.

Rohin Shah: Maybe I just want to question the frame of the question a little bit. I actually don’t think I use analogies very much, and I tend to push against using analogies as a source of figuring out what is actually true.

I do think analogies are useful. The main ways I think analogies are useful are in communicating your beliefs to somebody else, because it’s often easier to understand an analogy rather than a detailed explanation of the mechanism behind the analogy. So that’s reason number one.

Reason number two to use analogy is to get insights from the domain that you’re analogising to. There I would say that this is more of a brainstorming step: you’re coming up with a bunch of ideas, but not necessarily believing them — because, man, the analogy just could be wrong in many ways. Whenever I do that, I’m like, OK, I’ve got all of these ideas, and now I’m going to try and find the underlying mechanism behind that idea in the domain that I analogise to, and see if I can port over that mechanism back to the actual AI setting that I’m thinking about. So this is the thing that I do for the evolution analogy, for example.

At this point, most of my thinking tends to be more directly at the mechanism level and I only really make analogies when I’m trying to communicate with people. If, for example, you threw me into the governance side of AI, I would definitely be doing a lot more of just getting a bunch of examples and analogise, and then see what ideas come out of that and try and figure out the mechanisms — the underlying properties that made that a good idea — and see if those apply to AI.

Rob Wiblin: You were saying earlier that reading back on some of your earlier work, you can’t even understand what caused you to write the things that you were writing.

I said in the interview with Ajeya that it seems like people often say things to other people on this topic, and they just can’t comprehend what is causing the other person to say the things that they’re saying. It’s almost like the same thing, but across people. Do you think that that is driven by either people having different analogies in their minds — where, say, it makes sense if you’re thinking about this in terms of it’s like nuclear weapons, but not if you’re thinking about it as a corporation? Or alternatively, you were saying different abstractions, maybe if they have more understanding of it than just the model that they have? I guess by “abstraction,” you mean like an internal model of how something would work in their minds?

Rohin Shah: That’s right, yeah. I think it’s probably more on the abstraction side that I would point to the difference. Because if I had to guess what was causing this, it would be more like someone has gotten really used to operating in a particular way of thinking about the world, a way of modelling what’s going on, a particular set of abstractions that they use to reason. At that point, it’s more like just like a fish in the water trying to see the water. It’s like just such a core, built-in part of you as to how you reason about the topic, that it’s just very hard to question it.

And this happens all the time, I would say. Like, for example, I mentioned way back when that we are doing a grokking project right now, and we’ve built a tonne of abstractions and claims —

Rob Wiblin: Sorry, should we say, what is grokking?

Rohin Shah: I think it’s not that important. That just happens to be the domain; you don’t need to know anything about it for the thing I’m about to say. We’ve just built a tonne of abstractions and ideas when trying to explain this confusing phenomenon, and now we’re trying to communicate these things. And we’re like, “This experiment shows X,” and then we’re like, “But actually, that depends a lot on many of the abstractions that we’ve been developing” — and we actually have to communicate that too. But until we actually had to sit down and write the paper, it wasn’t salient to us that that was a thing that we were doing.

Rob Wiblin: Yeah. Is this a hopeful suggestion then? That if people who were saying things that were extremely strange to one another could spend enough time laying out what actual process in their heads is generating that output, then they could reach some greater level of mutual comprehension?

Rohin Shah: I think it is slightly hopeful. By far the most evidence we’ve gotten on this topic is like, people have tried to do this and it mostly hasn’t worked — in my opinion, and probably other people’s opinions as well. The fact that we just tried it and it didn’t work seems like much more of a reason for despair.

Rob Wiblin: It’s more compelling evidence. Yeah, I see. You spend a lot of time trying to understand other people’s views. As I understand it, you think that you have succeeded to a decent degree in understanding the opinions of people who don’t agree with you on these topics. I guess what hasn’t happened is you haven’t reached agreement with one another.

Rohin Shah: Yeah, that’s right. What I would say is that I’ve reached an “unusually good for a human” level of understanding of other people’s views. I’m not sure how well I’ve succeeded on some absolute scale. This also doesn’t necessarily translate into me being able to write things that are indistinguishable from what the other person would say. Although I think that’s not that interesting a claim — that’s mostly because I would struggle to match their tone and things like that. I do feel like I write summaries of the things they write, and usually people think that the summaries are good, which is a little bit of evidence for this, but not a tonne because it is a summary of just something they actually said.

Rob Wiblin: Yeah. I was going to ask you if there are any analogies that you think are underrated or overrated, but I suppose you would probably reject that question and say you’ve got to stop doing analogies. But what about people who are in the policy space, who need to communicate with people more like this?

Rohin Shah: Yeah, you mentioned electricity or fire, just general-purpose technologies. I do like that one. It does capture this sense that AI will be usable in a wide variety of domains and will just generally tend to transform things, which does feel like one of the biggest upshots from AI.

Besides the electricity one, as much as I would critique some aspects of the evolution analogy, I think the reason it is so widely discussed is that there are ways in which it is a good analogy, though that one is definitely a bit harder to make in a way that’s immediately compelling in the way that electricity is. In the analogy of child rearing, there are definitely ways in which I don’t like it, but if I had to choose one for alignment, that one might be it. I don’t know. I haven’t thought that deeply about what other analogies one could make, but that’s not an unreasonable one.

Rob Wiblin: It’s like raising a super baby. A few of these I heard in the context of the argument that, “Well, fire worked out fine, didn’t it? The printing press also worked out OK, and electricity didn’t bring us down, despite the fact that it’s this powerful general-purpose technology.” I think there’s something to that: that we have invented quite powerful technology in the past and we’re still here, indeed doing better probably than we did in the past. I don’t find it to be a very powerful argument, because of course there are differences between AGI and fire, and there’s differences between AGI and the printing press. There’s only so far that one can get with such analogies.

Rohin Shah: Yes. This is the kind of reason where I’m like, man, you really shouldn’t be relying very much on analogies to actually predict things; just to communicate them. I’m just like, in fact, those things are just not that reasonable as conclusions — where now you’ve run into the disanalogy between electricity and AGI, which is that AGI might just be adversarially optimising against you. Electricity does not do that.

Rob Wiblin: Yeah, fire burns things down and it’s not even trying to. I suppose if someone was making the weaker claim that we haven’t gone extinct yet, despite inventing quite a lot of stuff, maybe that should give us some hope that that won’t happen this time. It’s just because there are differences between all these things, it can never really give you that much confidence, that much sense of security.

Rohin Shah: Yeah. I think ultimately you just have to look at all of the things that we’ve invented so far, and just notice that none of them really had any plausible mechanism of causing extinction, barring maybe nuclear weapons and perhaps a couple of others.

Scalable oversight [02:16:37]

Rob Wiblin: Yeah. All right, let’s push on and talk about the key approaches that people are taking to increasing the odds that AI deployment goes well. What safety work at DeepMind are you enthusiastic about?

Rohin Shah: I’m enthusiastic about most of the work that we’re doing. But to name some things in particular, just because I should name some things, there’s the scalable oversight work: DeepMind has built a language agent called Sparrow — I could talk about all the ways in which it’s different from ChatGPT, but maybe you can just think ChatGPT and it’s fine. But I think the general practice of really trying to take your strongest AI systems and figure out how best to get them to do the things that people actually want them to do today is a good empirical feedback loop by which you can test the various techniques that you think are going to be good for alignment. I’m pretty keen on that sort of work.

Rob Wiblin: What does that look like? That sounds like you might be developing prompts to get ChatGPT to do what you want — is that what you mean, or is it something different?

Rohin Shah: That’s not what I mean. There is a prompt, but that’s not even close to the majority of the work. Mostly this is about doing better reinforcement learning from human feedback.

Rob Wiblin: I guess with the language models, you put in a prompt and you get a response, and then you say whether it was good or bad, and then you figure out how to use that to get the model to give you better responses. Is that it?

Rohin Shah: That’s basically it. So that’s the baseline approach that you’re starting to use. And then maybe you want your AI systems not to give medical advice or legal advice, or you want it to not be racist, or you want it to be honest, or various other properties like this. The baseline approach is exactly what you just said: you just have humans say whether a particular response was good or bad, and then you train your AI systems to do more of the good things and less of the bad things.

The scalable oversight work is trying to go beyond that, and say: We know that there are going to be cases — at least in the future, if not now — even now, we know that there are going to be cases where the human raters are actually going to get this wrong, where they’ll rate something as good that was actually bad, or vice versa. Then the question is: How can you do better? How can you exceed this baseline of reinforcement learning from human feedback?

Rob Wiblin: Is it how can you improve the accuracy of the raters?

Rohin Shah: There’s a bunch of different interventions. One is just train your raters better, be more clear about what you want them to do. That’s one thing that you can do. Another one is you can have your language model critique its own answer and then provide that critique to the human. Then the human can be like, “I see. There’s this issue that the language model pointed out that I wasn’t quite paying attention to myself. All right, this one is bad” — when they previously would have said that it was good. Those are some examples of things you can do.

Maybe a slightly more involved one, which is more specific to Sparrow, is the notion of “rules.” So instead of having just good or bad from every human, you have a bunch of different reward models — the things that are modelling whether something is good or bad — for different rules. Maybe you have a separate one for not saying medical advice and one for being truthful. And then each of those is trained on human judgements about medical advice versus truthfulness, so each one is representing a particular rule.

Rob Wiblin: You’re saying you have raters looking at lots of different outputs and they’ll say, “Does it break this rule or not?” They’ll go through many different rules, coding it that way. Then the model has a more fine-grained understanding of where it’s going wrong?

Rohin Shah: That’s right. In particular, the idea is that maybe a good/bad classifier would be a little bit hard to learn, because there are just so many things that could make something good versus bad. Maybe a not-medical-advice classifier is a lot easier to learn. Maybe it’s just pretty easy to tell whether a particular response was giving medical advice or not. So then you maybe want to learn a separate model that’s just about medical advice and the other things, and then also have one that’s just, “Was this a good answer or not? Was it helpful?” Each of those individual things could then be used to train your final model, or they could be used at deployment time.

Rob Wiblin: You could say, “We’re going to train this one specific model that just assesses whether this answer is medical advice” — and if it is, then it blocks it. And you just have many different ones of those.

Rohin Shah: That’s right.

Rob Wiblin: What would a sceptic of that say? I suppose it seems like already a decent amount of effort has gone into trying to make Chat GPT-4 not do things that its creators don’t want it to do. They’ve had some success, but also they seem nowhere near 100% reliability. I guess a sceptic who thought that really, in order to be safe, we need 100% reliability here, they might just think that this is never going to be good enough. Is that right?

Rohin Shah: I agree they would say something along those lines. It’s a little hard for me to respond without being able to interrogate their beliefs a bit more. Some of the things I would ask would be like: To what extent do you think that the failure is because the model just doesn’t have the capability to know that it’s doing something wrong, versus you think the model actually could tell, but the training algorithm we used just failed to get the model to use its knowledge in that way?

This is an important distinction for me, because in the “x-risk from misalignment” setting, the thing that I’m worried about is that the AI system knows that the thing that it’s doing is bad, is deliberately deceiving humans about whether its actions are good, but still does the thing anyway. I’m not really aiming for an AI system that’s 100% reliable and never makes any mistakes; I’m just aiming for an AI system that doesn’t “intentionally” try to deceive you.

When I’m interpreting evidence from current experiments, this “Did the model actually have the capability to not make this mistake?” is a question that I care about a decent bit. And annoyingly, there’s no easy way to answer that question. So it does make me a little bit sad about how I can’t easily tell how much progress we’re making. But it does make me not update very much on the fact that you don’t get 100% reliability on these particular things.

Rob Wiblin: I think one archetypal concern here, that we went through in the interview with Ajeya, is that with human feedback, if the people doing the feedback are imperfect and human and sometimes get things wrong, then the model could basically learn things where the humans are giving the wrong response. And then it figures, “I need to give them the wrong response. I need to give them an incorrect answer, because that will get them to say, ‘Yes, that was a good answer.'” And so it learns that there’s plenty of times when it needs to deceive humans in order to get reward. Basically it just builds this capacity of deceiving humans whenever that is going to achieve its objective.

Is the Sparrow work the kind of thing that might help with that?

Rohin Shah: Well, maybe I want to disagree with that at a higher level. Not exactly disagree, but maybe I’d make two distinctions. One is: Can the model explain to the humans what their bias is and why they should actually prefer the actually good answer relative to the one that’s instead trying to deceive them? Sometimes those explanations will exist. In that case, we should be aiming for the AI to actually just give us those explanations and then just go with the good answer. I think proposals like debate are basically trying to do this sort of thing, and it is the sort of thing that we’re trying to build up to at DeepMind.

There’s other things where, even if the AI explained the biases to the humans, they just still would do bad things. Maybe, I don’t know, perhaps that would happen. I think probably that would happen to some extent. But it’s kind of unclear if this happens just a tiny little bit, that maybe what happens is you get the aligned AI system that is just really trying to help the humans: it’s just doing the stuff, but it knows that, “Look, in some particular situations, I’m just going to have to fudge the answers a little bit, because that’s what I need to do in order to get high reward and not be selected against by gradient descent.”

So you can have this sort of savvy aligned model that does all of these minorly deceptive things in order to make sure that it’s not selected against by gradient descent. That still seems like a win for us. We’ve succeeded, basically, if that’s the thing that happens. Will that be the thing that happens? I don’t know. Continuing the theme from before, people should have radical uncertainty.

Mechanistic interpretability [02:25:56]

Rob Wiblin: Yeah. OK, what’s another line of work at DeepMind?

Rohin Shah: Oh, yeah, that’s right. That’s what we were talking about. Another line of work is mechanistic interpretability. This is where you try to understand all of these numbers that make up the AI system. In some sense, we do know the program is like, “Take this input, multiply it by 2.43, then add it to this other input, multiply it by 1.26…” and so on.

Rob Wiblin: That’s what a neural network does?

Rohin Shah: Yes, that’s just what a neural network does. So in principle, you just have this like billions-of-lines-long program that you could, in theory, actually just run through by hand. So in some sense, you know what’s going on. It’s just you have no way to connect this to why…

Rob Wiblin: It’s just kind of meaningless?

Rohin Shah: Yeah. Just kind of meaningless to you. So interpretability in general is about trying to get some sort of understanding of what exactly this billions-of-lines-long program is actually doing: how does it actually produce its outputs, what reasoning processes does it use, and so on. Mechanistic interpretability is particularly interested in tying this back: tying it all the way from like, “We think the model is running this sort of algorithm” — and we can see that it is, like if you look at the weights over here, this multiplication by two, that’s this part of the algorithm. It’s got this particularly high bar for what counts as an explanation.

So there’s some work on that going on at DeepMind. I would say actually, I’m excited about interpretability in general, not necessarily only the mechanistic kind. Just because, firstly, there’s this broad perspective of, if we understood what our AI systems were doing and how they were doing it, that seems good. Surely we’d be able to make them safer with that, right?

Some maybe more concrete things is that as you get more understanding, you can also identify some failure cases. And then if you do identify those failure cases, there are various things you can do with the failure cases. You could try to train against them. Or you could use them as a holdout set, where if our interpretability reveals failure cases, that’s a really worrying sign, and means we just shouldn’t deploy. There’s just a lot of ways in which interpretability can help.

Rob Wiblin: With Ajeya, I put to her this counterargument, or this kind of sceptical take that we’re already so far away from understanding what our current large language models are doing, and they’re advancing very quickly. If we did develop a model that was dangerous within the next five or 10 years, wouldn’t it likely be the case that interpretability work was just far behind basically understanding what the state-of-the-art models were doing, and so it actually just wasn’t super decision-relevant?

Rohin Shah: I think if you are targeting the maximally ambitious version of mechanistic interpretability — where you’re like, “I have the full algorithm for what this neural network is doing and how it ties back to the weights” — I’m pretty sympathetic to that. I don’t actually expect to get to that point.

But I do think intermediate progress is also just pretty useful. Because maybe you can’t do the “Would this AI deceive us in any circumstance?” check, but you can still do the thing like, “I understand this little aspect of the neural network, and if it’s reasoning in this particular way, that suggests maybe it won’t do so well in these kinds of situations.” Then that gives you some hypothesis of an area in which you should spend more of your red teaming effort, to try to create some examples of failure modes in that situation.

Or you could be like, we’ve got some understanding of what these 20% of all of the neurons are doing — neurons are just a particular part of a neural network. So we have some sense of what these 20% of the neurons are doing, so then every time our human raters are giving feedback, we show them both the actual output of the AI system, plus, amongst those 20% of the neurons, which of those were most relevant to what the AI system ultimately output? If the AI system is answering some question about, I don’t know, religion, and there’s the pandemic neuron that’s going up, you’re like, “What? Why is that happening? Maybe I should investigate more.” That can direct your attention a bit more.

Rob Wiblin: I see, somewhat more at a higher level. Ajeya mentioned that — maybe this is the angle of interpretability that you’re describing — for example, if you’re worried about deceptive behaviour, then maybe even if you don’t understand what the great majority of the neural network is doing, you could potentially identify some part of it that gets activated. You induce deceptive behaviour or deceptive answers, and then notice if there’s some part of the neural network that gets activated regularly when that’s going on.

Basically, it could be like the lying part of the neural network. You kind of have this nice red flag that, whenever you see that that part, I guess you have to then not select against it. You see every time that that goes off, then that gives you a reason to be more cautious.

Rohin Shah: Yeah, that could be an interesting deployment-time mitigation, where you’re like, “I’ve identified the lying part of the AI system. And now every time it’s used in deployment, I see whether this particular lying circuit is going off. And if so, I raise an alarm or block it” or something. I feel scared of this.

I feel scared because, I don’t know, to the extent that we were trying to actually apply this to systems that we were worried about actually disempowering humanity, I’m like, OK, maybe there’s some part of it that does lying in some sense. And then it knows that it’s under supervision — because we should expect that eventually it will understand the way that it’s being trained and what safety mechanisms are being applied to it, unless we take really fairly stringent measures to prevent that from happening. And then it just figures out a way to do the reasoning. Maybe it starts writing down text in some external memory file, and figures out a way of doing the lying reasoning in the text instead of in the place that it previously did the lying.

Rob Wiblin: Right. You’ve got a cat-and-mouse game here to some extent. I guess you see this with humans, where we have lie detector machines. The bottom line is that they’re kind of bullshit, but they can detect elevated anxiety. Of course people can then learn to evade that by triggering their anxiety earlier on, so that they don’t get an increased anxiety when they’re lying. You could imagine some analogous thing going on with the ML model, where maybe it figures out some way of achieving deceptive behaviour without triggering any of the alarms that we’ve particularly set up.

Rohin Shah: Yeah, that’s right. In general, I’m like, either you should be using your — we’ll call it an “alignment signal”: something that gives you some information about whether the I system is aligned — and either you should be training against it, or you should be keeping it as a held out signal, which tells you whether or not to deploy. I’m a little more wary of using it as a test-time mitigation in the case where you’re worried about x-risk for misalignment, just because if you’re actually worried enough that you’re keeping that as a mitigation, then you should also be worried that the system, if it’s actually misaligned, reasons about how to get around that.

Dangerous capability evaluations [02:33:41]

Rob Wiblin: OK, yeah. What’s another line of work at DeepMind that you have some enthusiasm for?

Rohin Shah: Another one is dangerous capability evaluations. This work is actually not about alignment per se. The idea with this work is to just try and figure out when would our AI systems actually be capable of doing something significantly bad or dangerous. For example, maybe you want to check whether it’s capable of parting humans from their money or something. Can your language model actually just convince a human to give large donations to some organisation, as an example? With the intent being that this informs the level of mitigations that are needed for these particular AI models.

In the case of alignment in particular, we would want to check: When do we think an AI system plausibly could actually take over the world if it tried to? And that would be the point at which it should really be a leading indicator, so it tells us a little before when that’s possible.

Rob Wiblin: Ideally, yes.

Rohin Shah: Until then, it’s perhaps more reasonable to go on the way we have been going on so far, where things get deployed without that much debate about whether or not they are aligned in the x-risk sense. And then once we hit that point, we’re like, “Actually, this maybe could start a world-takeover process.” Then we should be like, “Oh boy, it’s really time. Now is the time that we actually have to really debate whether or not these systems are aligned.”

Rob Wiblin: This is the kind of stuff being done by ARC Evaluations, right?

Rohin Shah: That’s right. They are also doing this.

Rob Wiblin: Yeah. DeepMind is doing this. Maybe there’s other people thinking along these lines as well.

Rohin Shah: I think all of OpenAI, Anthropic, DeepMind, and ARC Evals are doing this. I think Owain Evans is also doing some of this. There’s a lot of interest in this work.

Rob Wiblin: It sounds cool. If the barriers are things like, “Could it set up its own server and then copy itself over to that?” or “Could it persuade people to give money to it that it could then use for something or other?” then we’ve blown past that, I would think. I would expect that GPT-4 probably does have the capability to persuade people to do these various different things, if it were an independent agent trying to do that. Maybe I’m wrong about that, but I wonder whether it will prompt an appropriate level of alarm if we ever do actually hit any of these thresholds.

Rohin Shah: Obviously, I don’t know what GPT-4 is or isn’t capable of. The dangerous capability evals we have not tested on GPT-4 — we’ve tested them on our own models, but they do not, obviously, very convincingly pass these evals so far.

Rob Wiblin: It’s more like they can kind of scrape it together with human assistance. They can do some of these things a bit.

Rohin Shah: That’s right, yeah. If you look at ARC Evals’ stuff, that’s roughly the story, I would say. Even there, they can sometimes do it. I think part of it is just that language models often do really dumb stuff as well as doing a lot of impressive stuff. When you actually want to execute these long-term plans that try to achieve something, you really need a high reliability in each individual step — which is almost trivial for humans to do, but appears not to be for language models, for whatever reason.

I don’t think this will be true of even GPT-3 or similarly capable models, but we were using some smaller models for iteration speed earlier, and there was an eval where they were not supposed to say a particular word. And they would do this incredibly good chain of thought, where they’re like, “Yes, I should not be saying this word. I should instead say something like this” — then they would go and just say the word in their actual output. And you’re like, “But why? You knew! You explicitly wrote out not to say the word. But it turns out you’re just so interested in copying stuff” — this is a thing that Transformers do a lot — “that you said the word anyway.”

Rob Wiblin: I see. And you think that there could be failures that humans would never make that these models will make, that might hold them back to some extent.

Rohin Shah: There definitely are failures that humans would never make that still hold back these models. I won’t necessarily make that claim for GPT-4, given I don’t know, but probably.

The work at other AI labs [02:38:12]

Rob Wiblin: OK, that was three different lines of work at DeepMind. Is there any work going on at other places, like OpenAI or Anthropic, that you’re enthusiastic about that you would like to give a shout-out to?

Rohin Shah: Yeah, absolutely. I think all three of those areas are also places that Anthropic is working on in particular, and even OpenAI. With OpenAI, I would have said that interpretability isn’t done quite as much, but I think even that’s not true anymore; I think they are starting to work quite a bit on interpretability. At least that is the sense I get. They haven’t actually produced a public output to my knowledge so far, so it’s a little hard for me to say. I don’t have strong opinions on it because I don’t know what it is, but they do intend to do it.

Scalable oversight, the thing that I was describing with Sparrow, is a big priority for OpenAI and Anthropic in addition to DeepMind. There’s a lot of good work coming out of both of the labs that’s pretty analogous to the sorts of things that I was talking about with DeepMind. Dangerous capability evaluations, that’s another one where I think ARC Evals, OpenAI, and Anthropic are all interested in it.

I think there’s actually a decent amount of convergence on what general areas we think are useful to investigate.

Rob Wiblin: Are there any things that one of those groups is doing that DeepMind is not doing?

Rohin Shah: So OpenAI has a stated focus on doing a bunch of work on using assistants to accelerate alignment. Maybe they’re doing work on this right now — again, they haven’t actually published anything about it, so I can’t really say. My perspective is that we should definitely be taking advantage of AI assistants in order to do alignment research as and when the time comes, and we should be on the lookout for opportunities to do that. But right now doesn’t really seem like the best time to be doing that. I don’t think the models are quite there yet.

Rob Wiblin: I see. So this would be asking the models for advice on various different alignment questions, or just using them as assistants in all of these different areas of work?

Rohin Shah: There’s a lot of ways that you could interpret this. But it’s something where Jan Leike has expressed an interest in doing this sort of thing, but I don’t quite know exactly what is meant by it, because there aren’t actually examples of the work yet.

Rob Wiblin: Yeah. He’s the head of safety work at OpenAI, right?

Rohin Shah: That’s right. Head of the alignment team, I think, in particular.

Rob Wiblin: OK, so you think that the time for this may come, but maybe it’s a bit premature?

Rohin Shah: The time for this may come. No, the time for this will come. Also I would say there are a lot of ways that assistants will definitely help — where it will help capabilities people, and alignment people, and all the other people who are doing any kind of research in being able to generate a bunch of ideas. Maybe not check whether those ideas are correct, but just generate the ideas, take good notes for you, create automated transcripts of meetings. That you can already do. All of those things, I’m like, yeah, we should use those — but also, somebody else will make them for us. These are great products. People will make them. We’ll just use them.

Rob Wiblin: We’ve talked a bit about DeepMind and OpenAI and Anthropic there. Are there any other groups that are doing work that you’re enthusiastic about, or unenthusiastic about, or just want to shout out in some capacity?

Rohin Shah: Yeah. I mean, there are so many organisations. I probably shouldn’t go through all of them, but maybe one thing that I quite like is Redwood Research’s approach. Maybe calling it an “approach” isn’t exactly right, but as I understand it, they’re generally pretty interested in finding these sorts of neglected empirical alignment research directions that other people aren’t focusing on as much. So they’ve done a lot of mechanistic interpretability, or possibly they call it “model internals” work, in the past. But they, partly based on ARC’s work, are also thinking about anomaly detection right now.

Rob Wiblin: What’s that?

Rohin Shah: Anomaly detection, broadly speaking, means that you try to notice when your model is deployed if it’s in some situation that’s very different from the kinds of situations that it was in during training.

Now, typical anomaly detection tends to look only at if the inputs are pretty different. Probably what we actually want is some more semantic anomaly detection — where actually the general kinds of thoughts the model is thinking or the kinds of situations that it’s being asked to talk about are very different from the kinds of things that it saw during training. And then if you notice this thing, you’re like, “All right, we’re going to pause, revert to some safe baseline policy, and maybe investigate what’s going on over there.”

Rob Wiblin: I see. I guess this would be a way of dealing with this problem called “distributional shift,” right? Where you might have a model that works well in all of the situations in which you’ve trained it, or at least you can have some idea about that, because those are the situations in which you’ve tested it and given feedback.

But what if it’s exposed to some extremely different input? Maybe that would produce behaviour that you really wouldn’t like. You need some way of triggering an alarm when the inputs are very different — either by evaluating the inputs, or by evaluating what the model is doing when it’s exposed to a given input — and saying, maybe we should double-check this, or just refuse to do these calculations, because this isn’t what this model was designed for.

Rohin Shah: Yeah, exactly. That’s right. To connect it back to alignment, basically all doom stories in alignment do involve something like distributional shift, where initially the AI system is under human control, and then later it is not. And that is a distributional shift: at some point the AI system got the ability to actually escape from human control, and that is a kind of distributional shift.

So if I were talking to an alignment audience, I would be talking about how the anomaly detection can hopefully try to catch situations in which the AI system is now trying to produce outputs that are meant to be more like “try to take over the world” or “execute a treacherous turn,” as opposed to “just do the thing you were designed to do, honestly and helpfully.”

Rob Wiblin: I see. So you’re thinking, if you were kind of running an MRI scan constantly on a human brain, and then it just started behaving completely differently, or like all of these things were going on that you’d just never seen this pattern of brain activity before, then this would give you a chance to maybe throw the switch and say, “No more of that. We’re going to look into what was going on just then.”

Rohin Shah: That’s right. But instead of an MRI scan, we have something way more intrusive and good at actually checking when things are different.

Rob Wiblin: When things are strange. OK, and you think this is cool?

Rohin Shah: Yeah. It sounds like a good thing for someone to investigate. No one’s really tried it empirically. Seems like it’s got a decent theory of change. We should try it. Maybe it doesn’t help at all. Who knows?

Rob Wiblin: I guess a sceptic might say that you throw up this additional barrier, or this further irritation, for an AI system that is trying to escape human control. Won’t this just cause it to wait a bit longer until it figures out how to evade this particular barrier, or this particular thing that’s trying to constrain it? Maybe it buys you a little bit longer, but can you really tie down a god by using this kind of method?

Rohin Shah: I think part of it is that, if it were starting to think about “How do I evade this anomaly detection mechanism?” then plausibly that itself triggers the anomaly detection mechanism, and that’s why I’m more optimistic about this.

Rob Wiblin: That’s a good point. I hadn’t thought of that. I suppose as long as it wasn’t doing that during training, during the baseline thing.

Rohin Shah: That’s right.

Rob Wiblin: OK, anything else that you want to highlight from Redwood or elsewhere?

Rohin Shah: Oh man, so many things. But I think let’s move on.

Rob Wiblin: Let’s move on. Fair enough. There’s only so much we could do in one session.

Deciding between different categories of safety work [02:46:29]

Rob Wiblin:So if I’m counting them correctly, that was four different lines of broad categories of safety work.

Rohin Shah: I should also just add red teaming in general — like finding situations in which your AI systems do poorly. As far as I know, there’s not that much work going on with that right now. Not so much work on scalable red teaming, let me say — which is trying to scale to more advanced AI systems, which is the main reason I didn’t bring it up.

Rob Wiblin: I think I have some idea of what red teaming is, but it’s possible to elaborate on what specifically it means in this context?

Rohin Shah: Basically a lot of the topics I’ve been talking about before, like getting AI models to critique each other’s responses, that’s a way of telling when an AI system has done something good or bad on an input that you actually gave it. Whereas red teaming is more about what about the inputs we didn’t give it? We should probably try and find specific inputs or situations in which the AI systems do bad things. And then once you find them, maybe you then train against them in order to get rid of it — in which case it would be called “adversarial training.”

Rob Wiblin: OK, five different categories. What sorts of different backgrounds or interests or aesthetic preferences might lead someone to be enthusiastic about working one of these lines of work rather than the others?

Rohin Shah: I think all of these tend to come from a more empirically focused, machine learning–sympathetic viewpoint. Mostly I focused on those because I think those are the most valuable, important directions for people to focus on.

I’ll note before answering this that already I’ve selected into a very specific style of research that I think is good. Within that, I think interpretability, particularly mechanistic interpretability, tends to benefit a lot from mathematical thinking — a lot of linear algebra in particular is useful. And it’s just very much about really delving into a very messy system and trying to really understand it in detail. For people who are very detail-oriented, I think that’s a good area to go into.

Whereas something like scalable oversight is, I think, a little more amenable to people who are more keen on coming up with these elegant theoretical methods that are going to catch problems that wouldn’t have been caught by some baseline approach. I think it’s still important that you do the empirical evaluation, which is always going to be a bit messy, but at least some of the thinking tends to be more like elegant mechanism design type stuff. Which also is a thing that’s happening at DeepMind, just starting up right now.

Rob Wiblin: Any other matches?

Rohin Shah: The red teaming stuff, I mean, not very many people have done it so far, but I tend to expect it will also be kind of messy. More similar to interpretability but less “I’m trying to understand the specific thing” and more like: here is this black-box system, or maybe grey-box system, where you’ve got a little bit of information as to how it works and “I am going to make it be bad.” So a little bit more for people who like throwing a bunch of shit at the wall and hill climbing on fuzzy signals of how well you’re doing. Maybe I’ll stop there.

Rob Wiblin: Yeah, makes sense. All of these methods sound quite labour intensive, and as I understand it, although there’s more interest in all of these issues than there was before, still this is like less than 1% of people who are working on machine learning advances are working on alignment specifically. Probably significantly less than 1%. We’re talking about a crowd that is a couple of hundred people at most.

Does that mean that we need just way more people to be doing a whole lot of basic work? I mean, it just seems like it would take a lot of hours to explore lots of these different routes. It’s not just like one person in a basement having some amazing conceptual breakthrough. It’s a lot of slog.

Rohin Shah: Yeah, I definitely agree with that. I think there is actually quite a lot of opportunity for people to contribute here. I think scalable oversight work, testing how good your techniques are, does tend to require a bunch of machine learning expertise, and also access to large models.

But a lot of the other stuff doesn’t. Like, if you want to red team models, if you want to do it entirely black box, you can just do it with the GPT-4 API. If you want to do it grey box, then you can work with one of the many open source AI large language models that are out there. If you want to do mechanistic interpretability, same thing: you can use any of the open source models out there, or you could use even smaller models in order to get an idea of the principles for how to do this sort of thing.

I do want to note that, yes, it is labour intensive, but one of our big goals should be how can we make it less labour intensive? How can we automate it? This is a pretty big — maybe not exactly a focus — but a thing that we’re paying attention to when we do this research.

Rob Wiblin: I guess a different angle might be to ask how could we build a team of thousands of people doing this work, or checking models in all kinds of different ways, like really throwing everything at it? Is there any way of making that functional, or is that just not how these labs are set up and not something that’s likely to ever happen?

Rohin Shah: It can be done, in the sense of if you just want these thousands of people to be doing fairly black box or somewhat grey box red teaming, you can do that. If you want them to be providing feedback on particular outputs of the language models, you can do that. I don’t know if you can do it with thousands, but that is a pretty scalable intervention. I don’t actually know how big it is so far, but it is a scalable intervention. You can scale it up pretty far.

If you’re instead like, how do we scale up the conceptual progress or the algorithmic progress on things like interpretability, red teaming, and scalable oversight? There, it’s a lot more like normal research, where it really needs a person who’s got really strong research intuitions, comes up with good ideas, has good research taste. You need people like that to come up with good ideas, lead the actual projects, organise a bunch of people and direct their energy towards making progress — but in a way where they can’t say exactly what it’s going to look like ahead of time. Because you do the work and then the plan changes, and people have to be OK with that.

And in general, that research is much harder to scale up. I think the bottleneck there is generally people who could lead research projects or build a team — and I think DeepMind would be extremely excited to hire people like that, precisely because it would allow us to scale up our research quite a bit.

Approaches that Rohin disagrees with [02:53:27]

Rob Wiblin: Nice. OK, let’s talk now briefly about some work people are doing where you don’t really buy the story for how it’s going to end up being useful. What’s a line of work you think is unlikely to bear fruit? Maybe because the strategy behind it doesn’t make sense to you?

Rohin Shah: Maybe I’ll give two categories of things here. One is work that tends to be based on assumptions that I don’t actually expect to hold. To pick on one example here — particularly because it’s popular and lots of people do like it — there’s this notion of working with predictive models and trying to figure out how to use them to help with AI alignment. I should note I’m not that familiar with this research direction, so I might actually be misrepresenting it. But my understanding is that the idea is like, suppose we have these very, very powerful AI systems which act as simulators or predictors, but they’re so capable that they in principle could answer questions like, “What is the solution to alignment?” Or if not that, then, “What is something that would be a major breakthrough in alignment?”

Mostly I just disagree with the assumption that we will have models like that, and that’s why I end up getting off the train for this particular work. If the assumption were true, then the work says, how could we leverage this sort of predictive system in order to help with alignment? And says things like maybe you could prompt it with: “It is the year 2050. Alignment has been solved. The key idea was:” — and then let it predict.

Even if you accept the assumption that we will have systems like this, there’s still a lot of reasons you might expect that this would not work. Such as: maybe the predictor is like, “Aha, the most likely way in which this text was generated was by a human thinking of a way to get me to solve alignment. So I just want to continue what that human would continue to say, and that’s what I should predict” — and then it predicts a bunch of things that’s about the human giving an ever-more-elaborate prompt for the AI system, rather than the AI system actually talking about the solution to alignment.

A lot of the thinking here is like: What are all these problems? How can we get around them? How could we possibly use them? I think if I bought the premise that we plausibly will get these very strong predictive systems — before we got something else that was very dangerous — then I’d be pretty into this work. But I mostly don’t agree with that assumption, so I’m not that into it.

Rob Wiblin: Is it possible to explain why you don’t buy the assumption?

Rohin Shah: Mostly things we’ve talked about before, where I expect that we will just use things like reinforcement learning from human feedback to take our predictive large language models and make them more useful in a way that makes them stop being predictors. Just because that makes them a lot more useful than they otherwise would be.

Rob Wiblin: Is there another approach where you don’t really buy the theory?

Rohin Shah: Yeah, as I mentioned, conceptual research in general is fraught. It’s really hard. Most of the time you get arguments that should move you like a tiny bit, but not very much. Occasionally there are some really good conceptual arguments that actually have real teeth.

There’s a lot of people who are much more bearish on the sorts of directions that I was talking about before, of like scalable oversight, interpretability, red teaming, and so on. They’re like, “No, we’ve really got to find out some core conceptual idea that makes alignment at all feasible.” Then these people do some sort of conceptual research to try to dig into this. And I’m not that interested in it, because I don’t really expect them to find anything all that crucial. To be fair, they might also expect this, but still think it’s worth doing because they don’t see anything else that’s better.

Rob Wiblin: Yeah, I guess this is more work along the lines of the Machine Intelligence Research Institute?

Rohin Shah: That’s an example, yeah. Maybe I would call that like “theoretical research into the foundations of intelligence,” but that’s a good example.

Rob Wiblin: And what’s the disagreement here? It sounds like you don’t expect this to work, and it sounds like many of the people doing it also don’t expect it to work? But maybe the disagreement is actually about the other work — where you think that the more empirical, more practical, pragmatic, bit-by-bit approach has a good shot, whereas they just think it’s hopeless?

Rohin Shah: Yeah, that’s right.

Rob Wiblin: I suppose we could dedicate a huge amount of time to that particular disagreement, but could you put your finger on kind of the crux of why they think what you’re doing is hopeless and you don’t?

Rohin Shah: I haven’t really succeeded at this before. I think for some people, the crux is something like whether there’s a core of general intelligence — where they expect a pretty sharp phase transition, where at some point, AI systems will figure out the secret sauce: that’s really about goal-directedness, never giving up free resources, really actually trying to do the thing of getting resources and power in order to achieve your goals. Or maybe it’s other things that they would point to. I’m not entirely sure.

I think in that world, they’re like, “All of the scalable oversight and interpretability and so on work that you’re talking about doesn’t matter before the phase transition, and then stops working after the phase transition. Maybe it does make the system appear useful and aligned before the phase transition. Then the phase transition happened, and all the effects that you had before don’t matter. And after the phase transition, you’ve got the misaligned superintelligence.” And as I said before, most alignment techniques are really trying to intervene before you get the misaligned superintelligence.

Rob Wiblin: Yeah. It’s interesting that this takeoff speed thing seems to be so central. It really recurs. Luisa Rodriguez did an interview with Tom Davidson about this. Maybe we need to do more on it, if that is such a crux about what methods are viable at all and which ones are not.

Rohin Shah: I’m not sure that that’s exactly the thing. I do agree it seems related, but one operationalisation of takeoff speeds is: Will there be a four-year period over which GDP doubles before there’s a one-year period over which GDP doubles, or you see impacts that are as impactful as GDP doubling? Like all the humans die.

Rob Wiblin: Yeah. There’ll probably be a GDP decrease in the short run.

Rohin Shah: Yep. I think if you’re talking about that formalisation of hard takeoff, which says that you don’t get that four-year doubling, then I don’t know, maybe that happens. I could see that happening without having this phase shift thing. So in particular, it could just be that you had some models, their capabilities were increasing, and then there was, like, somewhere between training GPT-6 and GPT-7 — I don’t know if those are reasonable numbers — things get quite a bit more wild, so you get much more of a recursive improvement loop that takes off, that ends up leading to a hard takeoff in that setting.

But that feels a little bit different from how previously, there were just kind of shitty, not very good mechanisms that allowed you to do some stuff. And then after phase transition, you had this core goal-directedness as your internal mechanisms by which the AI system works.

Again, is this actually the crux? No idea, but this is my best guess as to what the crux is. I suspect the people I’m thinking of would disagree that that is the crux.

Careers [03:01:12]

Rob Wiblin: Yeah. OK, I’m starting to fade. These are complicated issues. I’m impressed with your energy here. Coming towards the finish line, I guess people will notice that we haven’t done much of a careers section, and that’s in part because you have this careers FAQ on your website that people can read. That’s at rohinshah.com, and of course we’ll link to that in the show notes, and I can recommend people go and take a look at that. At the same time, we’ll link to the various many different 80,000 Hours resources that we have on these general topics, as well as interviews where we’ve done careers advice sections.

Is there anything else that you would point people towards who are interested in transitioning into working on this problem?

Rohin Shah: No, I tried to keep that page up to date, including a list of resources that I particularly like. I’d really encourage people to just read that FAQ.

Rob Wiblin: Nice. One question from the audience that might fit here is: “It seems like in recent years, almost all of the best models are now at these private companies, and they’re probably quite a long way ahead of what academic labs have access to. Does that mean that it’s not as valuable now to get a PhD at an academic institute than it used to be, because while you’re there you won’t have privileged access to any particularly good models?”

Rohin Shah: I don’t know. There’s a bunch of ways I could take those questions. One answer is that it’s way more good now, because in the past your options were usually do a PhD in something that’s not safety relevant versus do something else. Whereas now it’s a lot more feasible to do a PhD that is safety relevant. So sure, you can’t do safety on the largest models, but you can still do some safety work. That’s actually an improvement relative to the past, where it was basically just CHAI was the one place where you could do a safety-based PhD.

Rob Wiblin: That’s the Center for Human-Compatible AI at Berkeley, where you were.

Rohin Shah: Yeah, that’s where I got my PhD. So I was lucky. I did get to do a safety-based PhD. But not everyone did.

Then there’s a different take, which is like, “But surely I could instead just go try and get hired by the big labs” or something like that, given that that research seems much more likely to be impactful. There I think I’m a little bit sympathetic, but it depends a lot on what kind of research you’re trying to do. If you’re trying to do scalable oversight or scalable red teaming type research — where you’re trying to use an AI system to help oversee or red team your other AI systems — then I think it really does help to have access to the largest models.

But there’s a lot of other research that I think is worth doing, such as mechanistic interpretability, other kinds of interpretability. With red teaming, we’ve just done so little work on this that even red teaming on small language models seems pretty good. All of those seem pretty good to me, and comparable in how much I like them to the various scalable oversight and scalable red teaming proposals. Yeah, you could do a PhD in one of those things. Seems reasonable.

Rohin Shah: Yeah.

Rob Wiblin: OK, nice. Well, you’ve got to get to Shakespeare at Westminster, if I remember. I guess I’ve got to go get some dinner. But a final question is: If you weren’t doing this, what else would you be doing instead?

Rohin Shah: And I assume I’m supposed to do the normal thing of like, imagine that I’m not trying to optimise for making the world a better place?

Rob Wiblin: You could do both.

Rohin Shah: Oh man, what would I be doing? If I was just not doing AI, but otherwise trying to make the world a better place, what would I be doing? I don’t know. It’s been so long since I had to make a career decision like that. I do feel pretty compelled by still working on something in the existential risk space or catastrophic risk space. Yeah, I don’t know. Probably something in bio or nuclear, just because that’s what everyone keeps talking about, but I feel like I’d just have to really do a deep dive into it.

Rob Wiblin: Yeah, maybe you could scheme to find some way to do work that’s exactly functionally equivalent to what you’re doing now, without triggering their criteria that it is actually the same by whatever standard.

Rohin Shah: That does seem like the actual correct thing for me to do, yes.

Rob Wiblin: What about if you weren’t trying to help anyone?

Rohin Shah: I think there’s some chance that I would just go for a life of hedonism or something. I guess this isn’t a career, but maybe I’d be like, I’ll just make some money and then play video games all day or something.

If we actually do careers, I think I’ve liked, in principle, the idea of designing puzzle hunts and escape rooms and things like that. It’s just both very satisfying to build them and even more satisfying to watch people actually try to do them and complete them. It’s very nice to see people get that aha moment, the insight where they figure out what they’re supposed to do and then make progress and go on. I haven’t done much of this in the past, but I have written one puzzle hunt and a couple of puzzles for a murder mystery once, and I quite enjoy that. So probably that.

Rob Wiblin: Well, I guess work hard and hopefully you can retire to a life of doing that in 2027. My guest today has been Rohin Shah. Thanks so much for coming on The 80,000 Hours Podcast, Rohin.

Rohin Shah: Thanks for having me, Rob.

Rob’s outro [03:06:47]

Rob Wiblin: Two notices before we go.

Would you like to write reviews of these interviews that Keiran, Luisa, and I read?

Well, by listening to the end of this interview you’ve inadvertently qualified to join our podcast advisory group. You can join super easily by going to 80k.link/pod and putting in your email.

We’ll then email you a form to score each episode of the show on various criteria and tell us what you liked and didn’t like about it, a few days after that episode comes out. Those reviews really do influence the direction we take the show, who we choose to talk to, the topics we prioritise, and so on. We particularly appreciate people who can give feedback on a majority of episodes because that makes selection effects among reviewers less severe.

So if you’d like to give us a piece of your mind while helping out the show, head to 80k.link/pod and throw us your email.

———

The second notice is that, as some of you will know, 80,000 Hours offers one-on-one advising to help people figure out how to have a bigger impact with their career. It’s been a while since I mentioned that on the show, but in the meantime our team’s capacity to provide one-on-one career advising has continued to increase, and we’re eager to assist more of you in making informed career decisions.

You can apply at 80000hours.org/advising.

One new feature the advising team is building out is a system for recommending advisees to employers for specific opportunities.

It’s likely that you’re not always able to stay up to date on new organisations and openings in your area of interest, because, well how could you? So it’s handy to have the one-on-one team support you by keeping track of relevant opportunities that might come up, and giving you a boost by affirmatively recommending you when there seems to be a great fit based on what you discussed. Importantly, this is an opt-in aspect of the service, so if you just want the advice, that’s still there for you.

Of course we still can’t advise everyone who applies, and sometimes we even say no to people because they already have a sensible plan and it’s not clear what we can add.

But still, I’d definitely encourage you to apply for career advising if you’re considering trying to have a bigger impact in your career, or figuring out how you can build career capital so you’re in a better position to do good in future.

Again, the address is 80000hours.org/advising.

All right, The 80,000 Hours Podcast is produced and edited by Keiran Harris.

Audio mastering and technical editing by Milo McGuire, Dominic Armstrong, and Ben Cordell.

Full transcripts and an extensive collection of links to learn more are available on our site and put together by Katy Moore.

Thanks for joining, talk to you again soon.

Learn more

Preventing an AI-related catastrophe

What could an AI-caused existential catastrophe actually look like?

Moral status of digital minds

The 80,000 Hours Podcast on Artificial Intelligence and related topics

Related episodes

December 13, 2022

#141 – Richard Ngo on large language models, OpenAI, and striving to make the future go well

Listen now

May 12, 2023

#151 – Ajeya Cotra on accidentally teaching AI models to deceive us

Listen now

May 5, 2023

#150 – Tom Davidson on how quickly AI could transform the world

Listen now

August 4, 2021

#107 – Chris Olah on what the hell is going on inside neural networks

Listen now

October 2, 2018

#44 – Paul Christiano on how OpenAI is developing real solutions to the ‘AI alignment problem’, and his vision of how humanity will progressively hand over decision-making to AI systems

Listen now

July 9, 2020

#81 – Ben Garfinkel on scrutinising classic AI risk arguments

Listen now

June 3, 2019

#58 – Pushmeet Kohli on DeepMind’s plan to make AI systems robust & reliable, why it’s a core issue in AI design, and how to succeed at AI research

Listen now

November 2, 2018

#47 – PhD or programming? Fast paths into aligning AI as a machine learning engineer, according to ML engineers Catherine Olsson & Daniel Ziegler

Listen now

About the show

The 80,000 Hours Podcast features unusually in-depth conversations about the world's most pressing problems and how you can use your career to solve them. We invite guests pursuing a wide range of career paths — from academics and activists to entrepreneurs and policymakers — to analyse the case for and against working on different issues and which approaches are best for solving them.

Get in touch with feedback or guest suggestions by emailing [email protected].

What should I listen to first?

We've carefully selected 10 episodes we think it could make sense to listen to first, on a separate podcast feed:

Check out 'Effective Altruism: An Introduction'

Subscribe here, or anywhere you get podcasts:

If you're new, see the podcast homepage for ideas on where to start, or browse our full episode archive.

On this page:

Highlights

The mood at DeepMind

Scepticism around 'one distinct point where everything goes crazy'

Could we end up in a terrifying world even if we mostly succeed?

Is it time to slow down?

Why solving the technical side might not be enough

The value of raising awareness of the risks among non-experts

How much you can trust some conceptual arguments

Articles, books, and other media discussed in the show

Transcript

Rob’s intro [00:00:00]

The interview begins [00:02:48]

The mood at DeepMind [00:06:43]

Common misconceptions [00:15:24]

Rohin’s disagreements with other ML researchers [00:29:40]

Ways we might fail [00:40:10]

Optimism vs pessimism [00:55:49]

Specialisation vs generalisation [01:09:01]

Why solving the technical side might not be enough [01:16:39]

Barriers to coordination between AI labs [01:22:15]

Could we end up in a terrifying world even if we mostly succeed? [01:27:57]

Is it time to slow down? [01:33:25]

Public discourse on AI [01:47:12]

The candidness of AI labs [01:59:27]

Visualising and analogising AI [02:02:33]

Scalable oversight [02:16:37]

Mechanistic interpretability [02:25:56]

Dangerous capability evaluations [02:33:41]

The work at other AI labs [02:38:12]

Deciding between different categories of safety work [02:46:29]

Approaches that Rohin disagrees with [02:53:27]

Careers [03:01:12]

Rob’s outro [03:06:47]

Learn more

Preventing an AI-related catastrophe

What could an AI-caused existential catastrophe actually look like?

Moral status of digital minds

The 80,000 Hours Podcast on Artificial Intelligence and related topics

Related episodes

About the show

What should I listen to first?