#81 – Ben Garfinkel on scrutinising classic AI risk arguments

By Howie Lempel, Robert Wiblin and Keiran Harris · Published July 9th, 2020

Output from Google's DeepDream AI.

80,000 Hours, along with many other members of the effective altruism movement, has argued that helping to positively shape the development of artificial intelligence may be one of the best ways to have a lasting, positive impact on the long-term future. Millions of dollars in philanthropic spending, as well as lots of career changes, have been motivated by these arguments.

Today’s guest, Ben Garfinkel, Research Fellow at Oxford’s Future of Humanity Institute, supports the continued expansion of AI safety as a field and believes working on AI is among the very best ways to have a positive impact on the long-term future. But he also believes the classic AI risk arguments have been subject to insufficient scrutiny given this level of investment.

In particular, the case for working on AI if you care about the long-term future has often been made on the basis of concern about AI accidents; it’s actually quite difficult to design systems that you can feel confident will behave the way you want them to in all circumstances.

Nick Bostrom wrote the most fleshed out version of the argument in his book, Superintelligence. But Ben reminds us that, apart from Bostrom’s book and essays by Eliezer Yudkowsky, there’s very little existing writing on existential accidents. Some more recent AI risk arguments do seem plausible to Ben, but they’re fragile and difficult to evaluate since they haven’t yet been expounded at length.

There have also been very few skeptical experts who have actually sat down and fully engaged with these arguments, writing down point by point where they disagree or where they think the mistakes are. This means that Ben has probably scrutinised classic AI risk arguments as carefully as almost anyone else in the world.

He thinks that most of the arguments for existential accidents rely on fuzzy, abstract concepts like optimisation power or general intelligence or goals, and on toy thought experiments. And he doesn’t think it’s clear we should take these as a strong source of evidence.

Ben’s also concerned that these scenarios often involve massive jumps in the capabilities of a single system, but it’s really not clear that we should expect such jumps or find them plausible.

These toy examples also focus on the idea that because human preferences are so nuanced and so hard to state precisely, it should be quite difficult to get a machine that can understand how to obey them.

But Ben points out that it’s also the case in machine learning that we can train lots of systems to engage in behaviours that are actually quite nuanced and that we can’t specify precisely. If AI systems can recognise faces from images, and fly helicopters, why don’t we think they’ll be able to understand human preferences?

Despite these concerns, Ben is still fairly optimistic about the value of working on AI safety or governance.

He doesn’t think that there are any slam-dunks for improving the future, and so the fact that there are at least plausible pathways for impact by working on AI safety and AI governance, in addition to it still being a very neglected area, puts it head and shoulders above most areas you might choose to work in.

This is the second episode hosted by our Strategy Advisor Howie Lempel, and he and Ben cover, among many other things:

The threat of AI systems increasing the risk of permanently damaging conflict or collapse
The possibility of permanently locking in a positive or negative future
Contenders for types of advanced systems
What role AI should play in the effective altruism portfolio

Get this episode by subscribing to our podcast on the world’s most pressing problems and how to solve them: type 80,000 Hours into your podcasting app. Or read the transcript below.

Producer: Keiran Harris.
Audio mastering: Ben Cordell.
Transcriptions: Zakee Ulhaq.

Highlights

AI development scenarios

You might think that the way AI progress looks at the moment, to some extent, is that year by year AI systems, at least in aggregate, become capable of performing some subset of the tasks that people can perform that previously AI systems couldn’t. And so there will be a year where we have the first AI system that can beat the best human at chess, or a year where we have the first AI system that can beat a typical human at recognizing certain forms of images. And this thing happens year by year, where there’s this gradual increase in the portion of relevant tasks that AI systems can perform.
And you might also think that, at the same time, maybe there’d be a trend in terms of the generality of individual systems. So this is one thing that people work on in AI research, is trying to create individual AI systems which are able to perform a wider range of tasks, as opposed to relying on lots of specialized systems. It seems like generality is more of a variable than binary, at least in principle. So you could imagine that the breadth of tasks an AI system can perform will become wider and wider. And you might think that there are other things that are fairly gradual, like the time horizons that systems act on, or the level of independence that AI systems exhibit might also increase smoothly. And so then maybe you end up in a world where there comes a day where we have the first AI system that can, in principle, do anything a person can do.
But at that point, maybe that AI system already radically outperforms humans at most tasks. Maybe the first point where we have AI systems that can do all this stuff that people can do, they’ve already been able to do most things better than people can before that point. Maybe this first system that can do all this stuff that an individual person can do also exists in a world with a bunch of other extremely competent systems of different levels of generality and different levels of competence in different areas. And maybe it’s also been preceded by lots of extremely transformative systems that in lots of different ways are superintelligent.

Contenders for types of advanced systems

I think it’s probably pretty difficult to paint any really specific picture. So I think there’s some high-level things you could say about it. So one high-level thing you might say about it is that, take a list of economically relevant tasks that people perform today. Take the Bureau of Labor Statistics database, and then just cross off a bunch and assume that AI systems can do them, or that they can do certain things that make those tasks not very economically relevant. We can also think that there’s some stuff that people just can’t do today, that’s just not really on people’s radar as an economically or politically or militarily relevant task, that maybe AI systems will be able to perform.
So one present day example of something that people can’t do, that AI systems can do, is generating convincing fake images or convincing fake videos. These things called Deepfakes. That’s not really something the human brain is capable of doing on its own. AI systems can. So just at a very general abstract level, imagine lots of stuff that’s done today by humans is either done by AI systems or made irrelevant by AI systems and then lots of capabilities that we just wouldn’t even necessarily consider might also exist as well.
It’s probably lots of individual domains. You can maybe imagine research looks very different. Maybe really large portions of scientific research are automated, or maybe just the processes are maybe just in some way AI-driven in the way that they’re not today. Maybe political decision-making is much more heavily informed by outputs of AI systems than it is today, or maybe certain aspects of things like law enforcement or arbitration or things like this are to some extent automated and just the world could be just quite different in lots of different ways.

Implications of smoothness

One implication is that people are more likely to do useful work ahead of time or more likely to have institutions in place at that time to deal with stuff that will arise. This definitely doesn’t completely resolve the issue. There are some issues that people know will happen ahead of time and are not really sufficiently handling. Climate change is a prominent example. But, even in the case of climate change, I think we’re much better off knowing, “Oh okay, the climate’s going to be changing this much over this period of time”, as opposed to just waking up one day and it’s like, “Oh man, the climate is just really, really different”. Definitely, the issue is not in any way resolved by this, but it is quite helpful that people see what’s coming when the changes are relatively gradual. Like institutions, to some extent, are being put in place.
I think there’s also some other things that you get out of imagining this more gradual scenario as opposed to the brain in the box scenario. So another one that I think is also quite connected is people are more likely to also know the specific safety issues as they arise, insofar as there are any unresolved issues about making AI systems behave the way you want them to, or, let’s say, not unintentionally or in some sense deceiving the people who are designing them. You’re likely to probably see low-level versions of these before you see very extreme versions of these. You’ll have relatively competent or relatively important AI systems mess up in ways that aren’t, let’s say, world-destroying before you even get to the position where you have AI systems whose behaviors are that consequential.
If you imagine that you live in a world where you have lots and lots of really advanced AI systems deployed throughout all the different sectors of the economy and throughout political systems, insofar as there are these really fundamental safety issues, you should probably have noticed them by that point. There will have probably been smaller failures. People should also be less caught off guard or less blindly oblivious to certain issues that might arise with the design of really advanced systems.

Instrumental convergence

If you’re trying to predict what a future technology will look like, it’s not necessarily a good methodology to try and think, “Here are all of the possible ways we might make this technology. Most of the ways involve property P so therefore we’ll probably create a technology with property P”. Just as some simple silly illustrations: most possible ways of building an airplane involve at least one of the windows on the airplane being open. There’s a bunch of windows. There’s a bunch of different combinations of open and closed windows. Only one involves them all being closed. It’d be bad to predict that we’d build airplanes with open windows. Most possible ways of building cars don’t involve functional steering wheels that a driver can reach. Most possible ways of building buildings involve giant holes in the floor. There’s only one possible way to have the floor not have a hole in it.
It seems to often be the case that this argument schema doesn’t necessarily work that well. Another case as well is that if you think about human evolution, there’s, for example, a lot of different preference rankings I could have over the arrangement of matter in this room. There’s a lot of different, in some sense, goals I could have about how I’d like this stuff in the room to be. If you really think about it, most different preferences I could have for the arrangement of matter in the room involve me wanting to tear up all the objects and put them in very specific places. There’s only a small subset of the preferences I could have that involve me keeping the objects intact because there’s a lot fewer ways for things to be intact than to be split apart and spread throughout the room.
It’s not really that surprising that I don’t have this wild destructive preference about how they’re arranged, let’s say the atoms in this room. The general principle here is that if you want to try and predict what some future technology will look like, maybe there is some predictive power you get from thinking about X percent of the ways of doing this involve property P. But it’s important to think about where there’s a process by which this technology or artifact will emerge. Is that the sort of process that will be differentially attracted to things which are let’s say benign? If so, then maybe that outweighs the fact that most possible designs are not benign.

Ben's overall perspective

There’s this big constellation of arguments that people have put forward. I think my overall perspective, at least on the safety side of things, is that I basically, at this point, though I found them quite convincing when I first encountered them, now obviously have a number of qualms about the presentation of the classic arguments.
I think there’s a number of things that don’t really go through, or that really need to be maybe adjusted or elaborated upon more. And on the other hand, there are these other arguments that have emerged more recently, but I think they haven’t really been described in a lot of detail. A lot of them really do come down to, it’s a couple of blog posts written in the past year. And I think, for example, if the entire case for treating AI as an existential risk hung on these blog posts, I wouldn’t really feel comfortable, let’s say, advocating for millions of dollars to be going into the field, or loads and loads of people to be changing their careers, which seems to be happening at the moment.
I think that basically we’re in a state of affairs where there’s a lot of plausible concerns about AI. There’s a lot of different angles you can come from. Some say this is something that we ought to be working on from a long-term perspective, but at least I’m somewhat uncomfortable about the state of arguments that have been published. I think things that are more rigorous or more fleshed out, I don’t really agree with that much. And the things that I may be more sympathetic to, just haven’t been fleshed out that much.

Articles, books, and other media discussed in the show

Ben’s work

Everything else

What failure looks like by Paul Christiano, and a response from Robin Hanson
What can the principal-agent literature tell us about AI risk? by Alexis Carlier and Tom Davidson
Why will we be extra wrong about AI values? and How far can AI jump? by Katja Grace
Reply to Bostrom’s arguments for a hard takeoff by Brian Tomasik
The Hanson-Yudkowsky AI-Foom Debate
I Still Don’t Get Foom by Robin Hanson
Artificial Intelligence and Economic Growth by Philippe Aghion, Benjamin F. Jones, and Charles I. Jones
Population growth and technological change: One million BC to 1990 by Michael Kremer
Disentangling arguments for the importance of AI safety by Richard Ngo
Superintelligence: Paths, Dangers, Strategies by Nick Bostrom
Essays by Eliezer Yudkowsky
MIRI Publications
Enlightenment Now: The Case for Reason, Science, Humanism, and Progress by Steven Pinker
Reframing Superintelligence by Eric Drexler
Risks from learned optimization by Evan Hubinger, Chris van Merwijk, Vladimir Mikulik, Joar Skalse, and Scott Garrabrant
Wei Dai’s posts on LessWrong
Zeno’s paradoxes on Wikipedia
The Boss Baby (2017)

Transcript

Table of Contents

1 Rob’s intro [00:00:00]
2 The interview begins [00:03:20]
3 Long-run implications of AI [00:06:48]
4 Political instability [00:15:25]
5 Lock-in scenarios [00:23:01]
6 Alignment problem [00:33:31]
7 High level overview of Bostrom [00:39:34]
8 Brain in a box [00:50:11]
9 Contenders for types of advanced systems [00:58:42]
10 Implications of smoothness [01:04:07]
11 Groups of people who disagree [01:18:28]
12 Orthogonality thesis [01:25:04]
13 Entanglement and capabilities [01:37:26]
14 Instrumental convergence [01:43:11]
15 Treacherous turn [02:07:55]
16 Where does this leave us? [02:15:19]
17 What role should AI play in the EA portfolio? [02:22:30]
18 Rob’s outro [02:37:51]

Rob’s intro [00:00:00]

Robert Wiblin: Hi listeners, this is the 80,000 Hours Podcast, where each week we have an unusually in-depth conversation about one of the world’s most pressing problems and how you can use your career to solve it. I’m Rob Wiblin, Director of Research at 80,000 Hours.

I’ve known Ben Garfinkel personally for a number of years now and can attest that he’s an incredibly bright and funny guy.

At EA Global 2018 he gave a talk called ‘How sure are we about this AI stuff?’, which created a stir. It threw a bit of cold water on some popular arguments for why artificial intelligence should be a particularly valuable thing to work on.

It’s great to see someone going back over classic arguments that are often repeated but seldom scrutinised, to see whether they really hold up to counterarguments.

Ben supports the continued expansion of AI safety as a field and believes working on AI is among the very best ways to have a positive impact on the long-term future. But he also thinks that many of these familiar arguments for focusing on artificial intelligence and how safely it’s developed are weaker than they look, and at least need further development or clarification to patch them against counterarguments.

All of us were keen to interview Ben to see where his thinking on this topic had gotten to.

We ended up sending my colleague Howie Lempel to Oxford to do the interview and he did a great job. A few of my colleagues said even though it’s only Howie’s second interview they thought it might be their favourite episode yet, which I guess I’ll try not to take personally.

Ben didn’t feel right jumping to objections before properly outlining the case for the importance artificial intelligence might play in shaping the future, and giving that case its due.

So the first 45 minutes are a careful outline of different aspects of that case, some of which regular listeners might be familiar with.

There’s important new ideas in there, but if you feel like you’ve heard the case for working on AI enough times already and could recite it back at this point, you might like to skip to the discussion of ‘How sure are we about all this AI stuff?’.

You can find that by selecting the chapter titled ‘Brain in a box’ in your podcast software, or just navigating 50 minutes into the episode.

We’ll link to some articles and presentations Ben has made on this topic in the show notes, and if you care about this topic I recommend checking them out, even more than I normally encourage people to look at those extra resources.

Before that, let me give you a quick reminder about our job board, which you can find at 80000hours.org/jobs.

It’s a shortlist of jobs we’re especially interested to see people consider applying for. It might be able to help you find the best opportunity for you to have a positive social impact, or save you dozens of hours of searching on your own.

Every few weeks María Gutierrez does the enormous work of checking hundreds of organisations’ vacancies pages and a range of other clever sources so you don’t have to.

As I record this there’s 496 vacancies listed on there across a wide range of problems including AI but also biosecurity, COVID-19, ending factory farming, global poverty, priorities research, improving governance and more.

There’s also many roles that are there less so you can immediately do good, and more because they’ll leave you in a better position to get higher impact senior roles later on.

You can use the filters to see just vacancies relevant to the problems you want to work on, or positions suitable for your field of expertise and level of experience.

That’s at 80000hours.org/jobs. If you’d like to get an email each few weeks when we update the jobs board with 100 or so new positions, you can sign up either on that page or at 80000hours.org/newsletter.

Alright, without further ado, here’s Howie Lempel interviewing Ben Garfinkel.

The interview begins [00:03:20]

Howie Lempel: Today, I’m speaking with Ben Garfinkel. Ben graduated from Yale University in 2016, where he majored in physics and in math and philosophy. After University, Ben became a researcher at the Centre for Effective Altruism and then moved to the Center for the Governance of AI at the University of Oxford’s Future of Humanity Institute. He’s now a Research Fellow there. In addition, he recently started a DPhil in International Relations also at Oxford.

Howie Lempel: Ben’s research is cross-disciplinary including work on AI governance, moral philosophy, and security. He’s published research on the security implications of emerging technology with Allan Dafoe, who Rob interviewed back in episode 31. And he was a co-author on an important group paper about potential malicious uses of AI. Thanks for coming on the podcast, Ben.

Ben Garfinkel: Thanks for having me.

Howie Lempel: Okay, so we’re going to focus today on talking to Ben about AI risk, but just to get started off, what are you working on right now?

Ben Garfinkel: Yeah, so I sort of have the bad habit of sometimes working on a couple of different projects that aren’t very related. So just recently this term I started a PhD program in international relations where I’m potentially doing some work related to the [inaudible 1:06] hypothesis or questions around how likely another large war is in the future. And for the past couple of years, I’ve done a lot of work around basically trying to think about the long run implications of progress in artificial intelligence and different risks or opportunities that might emerge from that.

Ben Garfinkel: And then I’ve also done… Just problematically, very unrelated things on the side as well. Just an example of something that I recently spent a couple of weeks on, not necessarily for extremely good reasons, is I ended up getting accidentally sucked into questions around what long run economic growth has looked like. By long run, I mean like since the neolithic revolution. Like 10,000 years ago long run. Because there’s this pretty influential paper from the mid nineties by Michael Kremer, who’s this Nobel Prize winning economist, that argues that a large part of the reason as to why growth is way faster than it used to be is that there’s this interesting effect where when you have a larger population of people, since there’s more people, there’s more potential opportunities for people to generate ideas that then increase productivity. You have this interesting feedback loop where the higher productivity is, the larger a population you can support. And the larger a population you can support, the quicker people can come up with ideas that increase productivity.

Ben Garfinkel: And so for most of human history, supposedly, there’s this interesting feedback loop where growth kept getting faster and faster as you had this “More people, faster productivity growth, higher productivity, more people” feedback loop. And that’s why growth is so much faster today than it used to be. I ended up just for several reasons getting very interested in whether this is actually true and spending a bunch of time reading papers by economic historians, many of whom were skeptical of it, and then ended up going into these sort of recently generated archaeological data sets that are meant to try and estimate populations in different regions at different times on the basis of charcoal remains and things like this and trying to see like, “Oh, if you try and run the analysis on more recent, more accurate data, do you still get the same output out and like what’s going on with statistical whatever”. And yeah, I don’t have a really great explanation of why I was doing this, but this is in fact something that I recently spent a couple of weeks on.
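
[Editorial illustration, not part of the conversation: the feedback loop Ben describes can be sketched as a toy simulation. This is a minimal sketch under made-up assumptions, not the model or the data from Kremer’s paper. The function name simulate_feedback_loop, the idea_rate parameter and the starting values are all hypothetical. The only point it is meant to show is that when productivity growth scales with population, and population tracks productivity, the growth rate itself keeps rising.]

```python
def simulate_feedback_loop(population, idea_rate, steps):
    """Toy population/productivity feedback loop (illustrative assumptions only).

    Returns a list of (population, productivity_growth_rate) pairs, one per step.
    """
    # Normalise units so that productivity 1.0 supports a population of 1.0.
    productivity = population
    history = []
    for _ in range(steps):
        # Assumption 1: more people means more ideas, so productivity grows
        # at a rate proportional to the current population.
        growth_rate = idea_rate * population
        productivity *= 1.0 + growth_rate
        # Assumption 2: population rises to whatever level the current
        # productivity can support.
        population = productivity
        history.append((population, growth_rate))
    return history


if __name__ == "__main__":
    results = simulate_feedback_loop(population=1.0, idea_rate=0.01, steps=50)
    for step, (population, growth_rate) in enumerate(results, start=1):
        if step % 10 == 0:
            print(f"step {step:2d}: population {population:.3f}, growth rate {growth_rate:.4f}")
```

[In this sketch the growth rate is higher at every step than at the one before, which is the “faster and faster” pattern described above. Whether real historical data fits that pattern is exactly the question Ben says economic historians dispute.]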

Howie Lempel: So I assume that you now know what the causes of economic growth are?

Ben Garfinkel: Oh yeah. I’m planning on publishing a blog post soon that will clear it up.

Howie Lempel: Okay. Well, I wouldn’t want to preempt your blog post. I won’t force you to solve economics for us today on this podcast.

Ben Garfinkel: Yeah, but definitely forthcoming soon though.

Howie Lempel: Well, all of the listeners can be really excited for that.

Long-run implications of AI [00:06:48]

Howie Lempel: You’ve spent most of the last couple of years working on some research on long run implications of AI, and so, do you want to give us a sense, pretend I’m someone who’s never really thought about AI before. Why would I think that this is one of the biggest priorities?

Ben Garfinkel: Right. So I think, just to start, I think there’s this pretty high-level abstract argument you might make. And the argument I think has a few steps. So the first one is, if you’re trying to decide an area to work on, or an area to focus policy on that might just really matter, then a good thing to look at is “Does this issue plausibly have implications that matter, not just for the present generation, but also for future generations”?

Ben Garfinkel: And this is obviously something that not everyone always takes into account, but for example, various issues like climate change, this is one of the main motivations like, “Oh, if that could carry forward into the future”. And then, if you’re trying to think, “Okay, maybe I’m interested in the issues that maybe matter for the long run future, maybe matter for future generations”, then it seems like one natural place to look at is technological progress.

Ben Garfinkel: Because you’re trying to pick out things that have mattered. That have had implications to carry forward into the future. It seems like historically, technology really pretty naturally falls into this category. The world is super different today because we have industrial technology and electricity and agricultural processes and things like that. There’s lots of technologies you can point to and say, “Oh, because that was introduced into the world, the lives of basically everyone who lived after that technology was introduced are very different”.

Ben Garfinkel: And then you can sort of look within technologies and ask, “Okay, what technologies that are emerging today might really matter? What seems really salient”? I think there’s a decent number of candidates here, but one that seems to stand out is AI. And the basic idea here is that it seems plausible that, for any given thing that humans can cognitively do using their brains, you can, in principle, create some sort of AI system that can accomplish the same tasks. If you survey AI researchers and ask them, “Do you think that we’ll eventually have AI systems, at least in aggregate, that can do pretty much all of the work that people can do”, they typically say yes. Some of them even say they think there’s a 50/50 chance within a century, or I think even 50 years in one survey. And it seems like if you’re imagining a world where AI systems can do, at least in the limit, pretty much all the stuff that people can do or a really large portion of it, then whatever that looks like, that’s a really substantially different world.

Ben Garfinkel: And it seems worth looking at that and saying, “What’s going on there? What might those changes look like? Are they good? Are they bad? Is there anything we might do to shape them”? And I think that’s a reason to pay attention to AI, at least to begin with.

Howie Lempel: Got it. So I guess that sounds like a reason to think that the development of AI might really change the world, and that the world post-AI might look really different from how it looks today. For it to be a priority, there also have to be things that we can do about it. So is there a reason to think that somebody who is working on AI today has a shot at affecting how this goes?

Ben Garfinkel: Right. So I think that’s basically a really important point. I think that’s one pretty significant objection you can raise, at least to this sort of high-level, somewhat vague pitch for working on AI. So just because something has some long run significance doesn’t necessarily mean that you can do much to affect the trajectory. So just a quick analogy; it’s quite clear that the industrial revolution was very important. The world is very different today because we have fossil fuels and electricity and large-scale industry.

Ben Garfinkel: At the same time though, if you’re living in 1750 or something and you’re trying to think, “How do I make the industrial revolution go well? How do I make the world better… let’s say in the year 2000, or even after that”, knowing that industrialization is going to be very important… it’s not really clear what you do to make things go different in a foreseeably positive way.

Ben Garfinkel: Or an even more extreme case. It’s really clear that agriculture has been a really important development in human history, but if you’re living in 10,000 BC, starting the neolithic revolution, it’s not really clear what you do to try and make a difference. To make things go one way versus another in terms of the long run impact.

Howie Lempel: Do you have any ideas at all for what our industrial revolution effective altruist would have tried to do?

Ben Garfinkel: Yeah, I think this is pretty unclear. So I think something that’s pretty significant about the industrial revolution is it really led to a divergence in terms of what countries were globally dominant. So England had this period of extreme dominance on the basis of industrializing first compared to the rest of Europe and especially compared to Asia. And you might have certain views about maybe what country would have been best to be in a leadership role in that position.

Ben Garfinkel: And maybe you could have tried to, let’s say, help technology diffuse quicker in certain ways. Another approach you may have taken is… So there’s some arguments that people make that speeding up economic growth has a positive long run impact. And in that case, you maybe would have tried to help things along, tried to help spread innovations quicker or yourself participating in it.

Ben Garfinkel: I’m not really sure though. I haven’t thought too much about this scenario, but I suppose those are some tacks you could have taken. But I think even then it’s a little bit unclear. So England was really dominant in the industrial revolution, but already obviously 200 years later its position has faded back quite a bit. And it’s also not really that clear that, let’s say, any individual, even large organization, really could have made that much of a difference. Maybe just the institutions that were in place would have been very hard to change and just, it was a little bit determined that it would go that way.

Howie Lempel: So there’s just this open question about even if we agree that AI might be one of these potentially history-changing events, similar to the industrial revolution, it’s a question of “Can we do anything to affect those”? And there’s a specific question about, in this instance, is this a good thing to work on? And so does the case for trying to affect the future by working on AI today depend on believing that the types of AI that might end up being industrial revolution level of impact are potentially going to be developed very soon? Or even if you think it will be a longer time until we see those types of systems, you still think that this is a good time to be working on AI?

Ben Garfinkel: So I think thinking that interesting stuff will happen soon definitely helps the case for working on AI. I think there’s a couple of reasons for that. So one reason is just that there’s this clear intuition that for lots of technologies, there’s a period of time where it would have been probably just too early to have a large impact. So let’s say that you are back in ancient Greece or something, and you’ve noticed that certain eels are electrified and you have this sense of, “Oh, electricity is this interesting thing”. Or you have some sense that there are magnetic materials. You’re probably not going to have an impact on electrification in the 1800s and 1900s.

Ben Garfinkel: Or, in the case of AI, if you wanted to have an impact on, let’s say, policy decisions that are made today or policy decisions that are made by governments or companies like Google, and you’re in the really, really early days of computing, you’re in the fifties or sixties, again it’s a little bit unintuitive to me that you could really perceivably influence things that much. So definitely there’s some intuition there.

Ben Garfinkel: Another issue is that the further things are away, there’s probably less of a sense we have of what the shape of AI progress will look like. So if there are really large changes over the next two decades, then we can probably imagine that a lot of the technology probably looks not that different than contemporary machine learning. Whereas if the changes that are really substantial mostly happen decades in the future, not that much interesting happens in the next few decades, then maybe the technology could just look quite different. The institutions could look quite different and we could just really not have a clear sense of what we even want to happen or even really have the right concepts for talking about AI.

Ben Garfinkel: I should say though that I think there’s also some case to be made for it being more useful to work on it early. So one potential reason is just if you’re thinking about impacts that probably won’t happen for a very long time, there are probably fewer people thinking about them. And so maybe you can have more influence. Maybe it’s more neglected to talk about stuff that won’t happen for a long time, and maybe there’s especially a good opportunity to sort of set the narrative or sort of frame the issues before they become maybe politicized or something that’s much closer to people’s lives and much more connected to present day issues.

Howie Lempel: Do we have any examples of people affecting the development of technologies in this kind of way, a decade or more before the technology is developed?

Ben Garfinkel: So I’m not very familiar… I think you’re probably more familiar with this case, but my impression is that in the world of genetic engineering and biosecurity, people started thinking about ethical issues or managing risks quite early on in the field, before they were really very present. And this has maybe been helpful for establishing norms in the community and in institutions that would be useful for new risks as they emerge. And so I’m not very familiar with that case, but it seems like plausibly one where it was good for people to talk about something that wasn’t really going to present itself for several decades.

Howie Lempel: Yep, that seems like a good example to me. I am not as informed as I’d like to be about this case, but the quick summary is in the 1970s a bunch of scientists got together to discuss the ethics of genetic engineering. Sort of made some early on ethical codes that they put together before there were going to be very big impacts from the technology. And I think some of the norms from those have managed to stick around.

Howie Lempel: My sense is that some of them have become a bit outdated, but are still seen as the go-to ethics. And you do definitely see some experiments that I think a lot of people find pretty dangerous that aren’t covered by those codes. And so it’s a little hard to say how helpful it was actually, but it seems at least like a good attempt.

Political instability [00:15:25]

Howie Lempel: Okay. So one reason you might think to work on AI is by this analogy to the industrial revolution. We’ve only had so many events in human history that seem as though they really changed humanity’s trajectory for a decent period of time. It seems like AI has the potential to be one of those technologies, one of those transformations. That might be a reason to think that affecting it could really change how other work goes. Are there arguments that engage more with AI as a specific technology, for why we should think that AI might be one of these points where you have a lot of influence?

Ben Garfinkel: Yeah. So I actually think there are a number of different, more specific arguments, that go beyond just the point that AI can be really transformative. So I think the first one is that there are a number of arguments that progress in AI could be politically destabilizing in a number of ways. One class of concerns people raise, is the concern that perhaps certain military or non-military applications of AI could, in some way, raise the risk of war and especially a war between great powers.

Ben Garfinkel: And the basic idea here is it seems, in general, the character of military technology might have some level of effect on the likelihood of war. And one concrete example of this, in the case of AI, is a lot of people have this view or a conventional view that it’s stabilizing if in the context of two nuclear powers, both powers feel secure in their ability to carry out a nuclear strike on the other power if they feel like a nuclear strike is going to target them. So if each power has the ability to retaliate credibly, then this creates a strong disincentive to initiate a war because both nuclear powers know it is going to end badly for both sides if a war happens, because they can both strike the other. And there’s certain applications of AI that some people suggest might make it harder to credibly deter attacks of nuclear weapons. It might make it easier, for example, to locate the other side’s nuclear weapons, to be able to target them more effectively or, for example, certain underwater drones that are autonomous might be able to more easily track nuclear submarines that belong to the other side.

Ben Garfinkel: And so that’s one way in which, for example, AI could conceivably increase the risk of war by making it harder for states to deter attacks from one another. Or another specific version of this is people point out concerns that maybe if you introduce autonomous weapon systems, systems that could plausibly, autonomously on their own, choose to initiate force in certain circumstances, maybe that increases the odds of accidental use of force that could spark crises.

Ben Garfinkel: And maybe it means crises could more quickly spin up into something much more extreme. If systems can respond to other autonomous systems, any crisis gets really out of hand before any humans can actually intervene or really see what’s going on. Or just maybe certain applications make certain forms of offense easier in a way that makes it more tempting. And so sort of various concerns along these lines.

Ben Garfinkel: And even just aside from the fact that obviously war is really something that you’d like to avoid if you can; if you’re taking again this longtermist perspective, given the existence of nuclear weapons, it’s relatively plausible that if war, let’s say, were to occur between great powers and nuclear weapons were used, it’s more plausible than it was in the past that this could actually be something that’s really permanently damaging as opposed to something that is horrific, but isn’t necessarily something that carries forward many generations into the future.

Howie Lempel: Got it. And so if the mechanism that you’re going for is reducing war or the likelihood or impact of war or maybe in particular, great power wars. Is it clear that AI is the thing to work on?

Ben Garfinkel: Yeah, I think that is a really good question. So if let’s say you decide that you’re really just focused on reducing the risk of war and especially war between great powers. Then one point you can make is that there are multiple salient emerging forms of military technology. And another example that’s not AI is hypersonic missiles, which are basically missiles that are believed to be much harder to defend against. That can essentially move faster and more easily evade defenses.

Ben Garfinkel: And there’s some suggestion that if you make missile defense much harder in certain conventional contexts, then this can increase the risk of war. If, for example, the US comes to believe that it wouldn’t be able to survive certain attacks by missiles in potential future conflicts with let’s say China or Russia. And I know extremely little about hypersonics, but it’s not necessarily that clear to me that they’re less potentially destabilizing than at least the applications of AI we’ll see over the next couple of decades.

Ben Garfinkel: And I think it’s a little bit of an open question I’m uncertain about, of which of the suite of emerging military technologies matter most. You might also just think as well that maybe obviously changes in military technology are only one thing that accounts for a lot of the variation in the risk of war from one time period to another. Arguably, it’s not historically been the main thing. You might also just be very interested in things like US-China relations at a high level, or just what US foreign policy is, or who it is that gets elected if you think that certain potential leaders have a higher or lower chance of initiating war. And there might just be lots of other interventions that plausibly are higher leverage than thinking about emerging applications of AI that might not really have entered the scene in a big way for quite a while.

Howie Lempel: Great. And so talking about political instability, the main mechanism that I picked up on was largely political instability, leading to more wars and conflicts. Are there other reasons that, from a long-term perspective, we ought to be worried about political instability?

Ben Garfinkel: Yeah. I think that maybe one pathway is domestic politics influencing international politics. I do think maybe a case could also be made that we have certain institutions that function, all things considered, relatively well at the moment. So there’s a lot of democratic states in the world. Lots of economic growth is steady. There’s just various things that are functioning relatively okay. And I guess you could also make an argument that if some of these institutions are damaged, maybe they’re more fragile or more rare than one might intuitively think. Maybe they won’t come back in a form that you’d want sufficiently quickly, or maybe just institutions working more poorly makes it harder for them to deal with other risks or issues that arise. But I think that’d probably be the sort of argument you try to make here.

Howie Lempel: As we’ve talked through some of the ways in which the development of AI could be a source of either potential conflict or other sources of instability, if someone were really concerned about one of these areas, do you have a take on what they ought to do about it?

Ben Garfinkel: So I think it should definitely vary area by area. I think one obvious general point is I don’t think we have a very good sense on the nature or relative plausibility of a number of these different risks. So I think, just a general useful class of research is looking at some of these risks that people put forward and often, they’ve been put forward in the form of things like let’s say in a military context, things like “War on the Rocks” pieces or short articles, as opposed to really in depth explorations.

Ben Garfinkel: And just particularly trying to understand the arguments where we’re trying to understand if we can learn lessons from history to figure out whether they’re plausible or not. And just trying to get a sense of what we ought to be prioritizing. I think that’s a general thing that applies to all of these. And then when it comes to specific risk, I think, again it will vary a lot case by case.

Ben Garfinkel: So just to take one example, if one thing you’re interested in is maybe you think it’s significant if AI misinformation damages political decision-making or damages political processes in general. If you think that’s potentially plausible then potentially a thing you want to work on are maybe systems that are better able to distinguish fake media from real media or thinking of institutions you could put in place that could allow people to come to common sets of facts or things like that.

Ben Garfinkel: And that’s just quite nebulous but that category of thing might be useful. Whereas if you’re interested in risk of great power war, then I think it’s a little bit less clear what the interventions are, but you could just, in general, want to try and pursue a career in US foreign policy or national security institutions, or just try and write pieces that potentially influence decision-making in a positive direction or help people identify what risks might be on the horizon.

Lock-in scenarios [00:23:01]

Howie Lempel: So, moving on, we talked about how you had, I think three reasons why it might be possible to have a positive influence on AI. We just talked about political instability. Do you want to move on to the next one?

Ben Garfinkel: Yeah. So I think another argument you can make is that maybe certain decisions about the design of AI systems or about the design of institutions that handle them, or various other features of political systems might be locked in at some point in the process of AI systems being developed in government. I think some of the intuition here is that if you look through history, there have been certain decisions or political outcomes that seem to have had a pretty stable lasting impact.

Ben Garfinkel: So if you look at, for example, the history of the United States, the period of time where the US constitution was being designed seems quite consequential. And it seems like lots of decisions that were made then have carried forward hundreds of years into the future and not just influenced US politics today, but also the political systems of other countries that modeled themselves off of the US. Certain decisions about the design of technology also seem to have carried forward at least several decades into the future.

Ben Garfinkel: So let’s say the fact that Microsoft Windows is still a fairly dominant operating system. It has certain features and certain security flaws. It seems like there’s some degree of path dependence there. Also in terms of the outcomes of certain competitions between different groups having relatively lasting impacts. We can take the example of, let’s say, the post-World War II world order, where the US obviously emerged from that as a dominant power, along with the Soviet Union. And then especially the US has had a really large and fairly lasting impact on the global political system on the basis of that. And again, I think some of the reaction to this is that it’s quite clear that there’s some things that lock-in about political systems for a long time, or that lock-in about the design of technology.

Ben Garfinkel: And again, it’s a question of exactly how long this lock-in effect lasts. So even in the context of, let’s say design of the constitution, if we’re thinking really, really long-term, as opposed to focusing on the present generation stuff, it still does seem like states typically have some sort of half-life. So Egypt was a really dominant power for a really long time, thousands of years, and now obviously ancient Egyptian culture mostly doesn’t seem to have a very large impact in the world. It doesn’t seem like that much has really been locked-in if you look thousands of years into the future. And I think similarly for technology, it’s a bit hard to, I think, identify really consequential decisions that have been made about technology that seemed to have maybe gone one way or the other, and have really stuck around for a long time. At least I’m not aware of, I think, any very clear examples of that.

Howie Lempel: I guess another question I have here is about the examples of lock-in. So even during the period of time where the institutions stuck around, it seems to me hard to decide to what extent they were leading to their sort of original values being more widely adopted. Were they actually doing their mission at that point? So the US constitution is still influencing the United States.

Howie Lempel: How would the founders who wrote the constitution feel about whether today’s US constitution was doing what they intended? Do you have a take on just how successful these sorts of institutions have been at not just locking into something but locking into the thing that they were aiming for?

Ben Garfinkel: Yeah. So I imagine that there’s probably some stuff if you take the founding fathers and sort of time travel them to the present day. I’d imagine there’s a lot of stuff they’re probably really horrified about. Like the executive branch, for example, is probably a lot more powerful. But it still does seem like there’s some aspects of the constitution that… So for example, the fact that separation of powers continues to exist.

Ben Garfinkel: It seems like that’s something that was very intentional and really core to their interest in designing the constitution. And I imagine that they’d continue to be happy that we continue to have these separate branches. Or, I guess another example as well is, let’s say, in the Warring States period in ancient China where there’s a number of different philosophies proposed that were competing and, for example, Confucianism still today has a pretty influential role in Chinese society and politics. I think people’s interpretation of it is probably extremely different than it was at that period in time. I think the world is so different that it’s really ambiguous how to even really map it on, but I would tend to still think that if Confucius time traveled into the present he’d say that there’s some aspects of society that are more what he would have hoped than if Confucianism had not caught on.

Howie Lempel: Cool. That makes sense. So then maybe going from the historical examples of lock-in, there’s an argument that AI in particular might cause lock-in. So what do we mean by that? What are the ways in which AI might cause lock-in? What exactly might it be locking in?

Ben Garfinkel: Yeah, I think that this is a little bit ambiguous. I think a couple of broad categories are, on the first part, laws and institutions that were designed to govern AI. So again, I think this is quite vague, but you might think that for example, certain norms or even international agreements around, let’s say the use of certain kinds of AI systems in a military context could emerge. There are people who are trying to explore this today.

Ben Garfinkel: You might think that domestically, maybe certain governing bodies will be created that will have certain powers over certain issues, like software and machine learning. And so certain governing structures about who has the authority to pass or enforce certain regulations. That could be something that locks in. Or it could also just be certain norms about how AI is governed or used.

Ben Garfinkel: I think those are things that could plausibly carry on for at least some long period of time. You might also think that there are certain design decisions that could be made about machine learning systems that could carry forward. I don’t really have a very clear sense of what this would look like, but at least in the abstract, you could imagine that there’s multiple methods you could use to create certain kinds of AI systems or certain, let’s say, minimum safety standards or things like this, or methods of testing things. You might think that there are certain options and then eventually people settle on a certain way of doing things and that could carry forward. Although, I’m quite vague on what that would look like.

Howie Lempel: So I guess for these examples, I’m not totally sure on how to think about the extent to which they are important for the long-term future. It’s certainly possible that the laws and institutions designed to govern AI today really early on could just have persistent effects, but I don’t have a sense of whether or not I should think that these are going to keep mattering. Do you just have a take on are there particular reasons to think that whatever framework is set up initially, has a really good shot at having lasting effects? And how confident are you in some of these lock-in effects?

Ben Garfinkel: So I would definitely say I’m not very confident that you’ll have that many things that could go one way or the other that will lock-in and be very important for a very long period of time. I do also think there’s probably another class of intuitions. If you look forward far enough, and you’re thinking about a world where we eventually approach a point where AI systems can do all this stuff that people can do. Then at that point, to some extent, people will be embedding their values or design of certain institutions into software. And sort of stepping back from being in the loop, to a large extent, about stepping back from really active engagement in terms of economic or political processes.

Ben Garfinkel: And so if you have the idea that in the long run we’ll be, in some sense, passing off control or increasingly embedding values in institutions and software, then maybe there’s some intuition at that point. Once you’ve put this stuff into code that could be a lot more stable and a lot harder to change than traditional institutions or pieces of technology are. So I think there’s that long run perspective. If you take that perspective though, there’s also the question of, if you think it will be a gradual process or if you think it would take us a really long time to get to that point, how much does that imply that stuff we do today will lock in? And that I’m a lot more unsure of.

Howie Lempel: Okay. So we’ve talked a bunch about the types of institutions that might create lock-in. But can you talk a little bit more about how concerned exactly should we be about this issue of lock-in and what are some of the bad scenarios that we want to avoid here?

Ben Garfinkel: Yeah, so I think these kinds of concerns are definitely a lot more nebulous than concerns about the risk of war increasing, where it’s easy to paint at least a somewhat concrete picture of what that looks like and why it’s bad. Whereas in the case of lock-in, I think probably I imagine people probably typically having much less specific concerns.

Ben Garfinkel: I think maybe the perspective, to a large extent, we could base it on historical analogy. So one thought is that we’ve seen lots of pretty significant technological transitions throughout history. And these transitions often seem to be in some way tied to shifts in political institutions or shifts in values or shifts in living standards.

Ben Garfinkel: And often they seem quite positive. For example, I would probably consider the industrial revolution positive in that it seems to have been at least linked in some sort of way to obviously a rise in living standards, an increase in democracy and things like that. There’s also some that seem more negative. So often a perspective that people have on, for example, the neolithic revolution or the period of time where people first shifted from hunter-gatherer societies to agricultural civilizations is that it seems like the shift may have had lots of downstream negative social effects.

Ben Garfinkel: So one effect seems to have been an increase in political and social hierarchy and a move from collective decision-making to often autocracies. It seems to have led to greater division in gender roles based on the sorts of labor that’s relevant. Institutions like slavery seem to have become a lot more prominent as people gathered in more dense communities. Disease became more salient and living standards seem to have dropped.

Ben Garfinkel: And a lot of this stuff seems somewhat downstream of the shift to agriculture. A lot of these effects seem probably quite difficult to predict ahead of time. Probably people weren’t really aware these shifts were happening because they were so slow. But it’s just, in some ways, sort of a natural course that things evolved into. And so when you look at the case of AI, you might just have this background position of, okay, there’s going to be quite substantial changes probably that will have lots of political and social effects and the historical record definitely doesn’t say, “Oh, these are always fine”.

Ben Garfinkel: So you might just begin with a suspicion, maybe it’s not very concrete, but maybe there’ll be some natural shifts here that if there’s anything we can do to prevent them from moving in a negative direction, anything we can do to steer things more actively, maybe that’s a positive dynamic.

Ben Garfinkel: I do think there’s also, I suppose, somewhat still quite vague, but still somewhat more concrete concerns you might have. So again, if we’re thinking very long term, we’re eventually moving to a world where potentially machines can do most of the stuff that people can do or functionally all of it. Then, in that sort of world, most people really don’t have a lot of economic value to offer. The value of your labor is very low and maybe in a world like that, it’s not that hard to imagine that’s bad for political representation.

Ben Garfinkel: That’s potentially bad for living standards if people don’t have anything to offer in exchange for income. Potentially, there’s lots of changes here that are just very weird and not intuitively positive. I also think, for example, that there will be certain ethical decisions that we need to make. For example, about the roles of AI systems in the world, or even potentially in the long run, the moral status of AI systems. Maybe those decisions won’t be made in sensible ways. And maybe there’s potentially some opportunity for them to go one way or the other and maybe just the wrong decisions will be made.

Alignment problem [00:33:31]

Howie Lempel: Cool. So that’s on the lock-in side. And I think you had a third category of arguments for why the way that we handle the development of AI might have a big long-term effect?

Ben Garfinkel: Yeah. So there’s a lot of concerns that people have about unintended consequences from AI development and specifically on the technical side. So there’s a broad category of concerns that people have that we’ll, in some sense, design AI systems that behave in ways that we don’t anticipate and cause really substantial or lasting harm. And this is just to sketch out this concern in very general terms. We know today that it’s sometimes pretty hard to get AI systems to behave exactly as you want or avoid unintended harms, although they’re often on a relatively non-catastrophic scale. So we have examples of things like self-driving cars crashing or systems that are used to make decisions or inform decisions about, let’s say, granting parole that have unintended disparate harm effects on different communities.

Ben Garfinkel: And this intuition that as AI systems become more pervasive in the world, as they’re involved in more and more consequential decisions, exponentially more autonomous, and as they become more and more capable, that the consequences of these sorts of failures become more and more significant. And on the extreme end you might make the case that if we’re imagining ourselves as gradually approaching this long run point in the future where just AI systems do most of this stuff. They’re way more capable than we are at most tasks. And then maybe in that world, failures can actually rise to the level of being catastrophic or just really hard to come back from, in some sense.

Howie Lempel: So that feels like a good high-level overview of reasons why you might be worried about unintended consequences from technology this powerful in general. But do you want to talk about some of the more specific arguments that give a sense of what we might be worried about going wrong with AI in particular?

Ben Garfinkel: Right. So I think it’s just been in, say, the past couple of decades that people have started to really present more specific versions of this concern. You can go back and find early AI pioneers like Alan Turing and I.J. Good, gesturing at concerns that maybe when you develop very sophisticated AI systems, we’ll find it in some way hard to control their behavior and there may be negative effects. It’s really, especially, I would say mostly in the mid-2000s that people like Eliezer and Nick Bostrom started to develop a somewhat more specific category of arguments in detail, painting the picture of why things might go catastrophically wrong. And other people like Stuart Russell also helped develop these arguments as well.

Ben Garfinkel: But then since then I’d say there’s also been a less developed but maybe newer category of arguments, in some sense, a spinoff from the, let’s say, Bostrom/Yudkowsky arguments. And Richard Ngo, who’s an AI safety researcher, actually has a really good blog post called “Disentangling arguments for the importance of AI safety”, which walks through the, I suppose, increasingly large taxonomy of different arguments people have presented.

Howie Lempel: And then you in particular, your research has focused on just one set of arguments, the Bostrom/Yudkowsky line of argument. Do you want to talk a little bit about why you focus so heavily on that one in particular?

Ben Garfinkel: Yeah, I think there’s a couple of reasons. So one reason is that I think that these are still by far the most fleshed out arguments. So Bostrom has this book “Superintelligence” that’s been quite influential. That’s a really long and detailed presentation of this particular version of the AI risk argument. The book also contains lots of other perspectives on what the future of AI could look like. But it has more than a hundred pages devoted to exploring one relatively specific vision of the risk. And Eliezer Yudkowsky also has a number of quite long essays. There’s also a number of white papers put out by the Machine Intelligence Research Institute that explore the argument in a relatively high level of depth. Whereas some of these other newer arguments really will be a couple of blog posts or a single white paper presenting them.

Ben Garfinkel: And often they’re quite new as well. It’s really, maybe the past one or two years they’ve started to emerge. They’ve also just been a lot less vetted. So in some sense, one reason is just that it’s much easier to understand what these arguments actually say because they’ve been expressed clearly and they’ve, in some sense, been vetted a bit more. Another reason I focused on them, I think, is that they’ve also been a lot more influential. I think a lot of people who have decided to work on AI safety or be involved in AI governance from a long-term perspective have been really heavily influenced by this particular set of arguments. I think still today a lot of people who decide to transition to this area are influenced by them. So they’re also, I think, really concretely having an effect on people’s behavior. So it’s really important to understand exactly how well they work or what the potential weak spots are.

Ben Garfinkel: Yeah, and I think the last one as well is just, I think that these are probably the most prominent arguments in terms of the extent to which AI safety concerns have filtered into the public consciousness. I think the version of the concern that’s mostly filtered in has been this particular version of it. So it’s also useful from that perspective, insofar as this is influencing public discussions to really have a clear sense of how strong these arguments are. Yeah, I suppose just the last reason why I personally am especially interested in understanding the classic Bostrom/Yudkowsky arguments is that these were actually, I suppose, the arguments that first got me really interested in artificial intelligence and really interested in the idea that this might be one of the most high priority topics to work on. Specifically reading Superintelligence is the first thing that got me into the area. And so I also feel, from a personal perspective, I suppose, pretty interested in understanding how strong these arguments in fact are and how justified my… I suppose, it was for me to initially be quite convinced by them.

Ben Garfinkel: And so that’s not to say that the other newer arguments for AI risk aren’t important to understand or plausibly correct, but I do think that there are currently some reasons to pay special attention to the classic Bostrom/Yudkowsky arguments and prioritize examining them and vetting them over the new arguments at the moment.

Howie Lempel: So I think it would be really valuable to lay out those more fleshed out arguments in detail, and also to really dig in on the criticisms of those arguments that you’ve had and some of the disagreements that you’ve had with that line of reasoning. So we’re actually going to focus the rest of this podcast on that line of classic arguments for AI risk in particular.

High level overview of Bostrom [00:39:34]

Howie Lempel: So a lot of listeners might already be familiar with this Bostrom/Yudkowsky argument, but do you want to give a quick, high-level outline of exactly how that argument works?

Ben Garfinkel: Yes, so there is still a decent amount of heterogeneity in terms of how this risk is presented by different people or how different people’s vision of the risk has evolved over time. But even so, hopefully this will be a mostly representative, high-level gloss of the argument. I think the first step is, there’s a suggestion that at some point in the future, we may create an AI system that’s, in some sense, as smart as a person. Like its cognitive abilities are very similar to that of a person. And then there’s a suggestion that once we’ve done that we might expect that system to be followed very soon after by a system that’s, in some sense, much smarter than any person or any other AI system in existence. And some of the intuitive reasoning for that is, imagine you have on a hard drive somewhere some system that can do anything that a human brain can do. Then you can do stuff like just run it on way more computing power.

Ben Garfinkel: So it can think way faster than any person can or you can allow it to have a go at programming itself or changing relevant code. And maybe there’s some interesting feedback loop there where as it gets smarter, it gets better at doing AI research, which then makes it smarter again. And maybe there’s some explicit feedback loop there. But there are basically various arguments people present for this sudden jump from something that’s quite similar to a person into something that’s, in some sense, radically superintelligent.

Ben Garfinkel: And then there’s a second bit to the argument that says, “Okay, well, how do we expect, let’s say a radically superintelligent system to behave”? We should probably think of it as having some goal or objective it’s pursuing in the world. And there’s a suggestion that most goals we might choose to give to advanced AI systems might be subtly harmful or subtly diverge from what goals we would give them if we really could understand all the implications of our requests. And there’s a… This is a little bit abstract, but there are at least some intentionally silly thought experiments that people sometimes present to make this intuitive.

Ben Garfinkel: So one classic one is, people imagine that there’s some superintelligent AI system that’s been given the goal of maximizing paperclip production for some paperclip factory. And at first glance, it seems like a really benign goal. This seems like a pretty boring thing, but if you follow through on certain implications and, in some sense, you put yourself in the shoes of an incredibly competent agent who only cares about paperclip production, then maybe there are certain behaviors that will flow out of this. So one example is, from the perspective of this AI system, you might think, “Well, if I really want to maximize paperclip production, then it’s really useful to seize as many resources as I can to really just plow them into paperclip production”.

Ben Garfinkel: Maybe just like seize wealth from other people or seize political power in some way, and just really plow everything I can into putting out as many paperclips as possible. Or maybe you think, “Oh, wow. Once people realize that I’m just trying to do this horrible thing, I’m just plowing everyone’s resources into paperclip production, maybe then people will try and turn me off because they realize, “Oh, that’s actually not what I wanted”.” And then from that perspective, you have an incentive to try and stop people from turning you off, or reducing people’s power or even harming them, if that reduces the risk of them standing in the way of you pursuing your single-minded goal of maximizing paperclip production. And so maybe just this horrible thing comes out of it, that you have this particular goal, we don’t have these more nuanced goals about human values and sovereignty and things like that. And while no one presents these, I suppose, little experiments as… Or intend to be realistic presentations of the concern, they’re meant to illustrate this more general idea that supposedly most goals that you might give an AI system might have these unintended consequences when pursued by a sufficiently competent agent. That there may be some sense in which most goals you might give an AI system just are subtly bad in ways that are hard to see.

Howie Lempel: Cool. Now there’ve been a substantial number of people who have been somewhat convinced by this set of thought experiments and this argument, but there’s also been a lot of pushback. So do you want to talk a little bit about the pushback that this has gotten in general from people other than you and why you think it’s gotten so much pushback?

Ben Garfinkel: Yeah. So it’s a bit interesting. I think it’s probably fair to say that a large fraction of especially machine learning researchers who encounter these arguments don’t have a very positive reaction to them, or bounce off of them a bit, although obviously quite a few do find them sufficiently compelling to be quite concerned. And I think often people don’t really have, I think, very explicit reasons for rejecting them or often at least people don’t really write up very detailed, explicit rebuttals. To some extent, it’s a bit of common sense intuitions. And I think, for example, a thing that people react to is that they’re quite aware that a lot of the thought experiments that are used to illustrate the concern are very unrealistic. And even if the proponents of the arguments agree that the thought experiments are unrealistic, there’s maybe something a bit suspicious about the fact that no one seems to be able to paint a picture of the risk that seems really grounded in reality.

Ben Garfinkel: And this doesn’t really seem true of other risks that people worry about. Like, let’s say, pandemics or climate change or things like that. It is distinctive, that this is supposedly a really large existential risk, but no one can really describe what the risk scenario looks like in quite concrete terms. I think that’s one thing people react to. I think another thing that people sometimes react to is a lot of these arguments are presented in fairly abstract terms using concepts like goals and intelligence and other things like this, that are a little bit fuzzy. And partly because when they were written, often more than a decade ago, or at least more than five years ago, they don’t really engage very much with what AI research looks like today. So today most AI research or research that’s labeled AI research is based on a machine learning paradigm.

Ben Garfinkel: And for example, Superintelligence really doesn’t discuss machine learning very much. For example, there’s a chapter on giving AI systems goals or loading values in the AI systems, and I believe reinforcement learning techniques only get about two paragraphs. And that’s really the main set of techniques that people use today to create advanced systems which people see as along the pathway to developing these really advanced agents. And so I think that’s another thing, that people feel like… I don’t know, it’s not really directly speaking to what AI research looks like and they feel a little bit uncomfortable about, what are these fuzzy abstractions? This maybe doesn’t really fit with my picture of what AI research is.

Ben Garfinkel: And then I think maybe a last one that maybe feeds just intuitive, skeptical reactions is a generic thing that I often hear people raise, like, “Oh, okay, well lots of technologies have safety issues, but typically we solve these safety issues, or there are incentives to not have your technology do horribly destructive stuff”. I think this is an argument people like Steven Pinker, for example, have made: that there are really strong incentives for bridges not to fall down. So just generically, without looking too much into the specific argument, you should start from maybe a baseline of thinking that incentive structures and feedback loops are such that for most technologies, safety issues are mostly prevented from being catastrophic, and maybe unless you really, really get convinced by strong arguments to the contrary, you should be inclined to think that in the case of AI as well.

Howie Lempel: So I’m sure that you were aware of all these counterarguments at the time that you were really bought into Superintelligence, at the point where you were making decisions about what to focus on based on it. So why didn’t you find them convincing at the time?

Ben Garfinkel: Yeah, I mean, I do think to some extent I did have these concerns at least at a lower level. So I definitely have always been uncomfortable with the use of intentionally silly thought experiments or a little bit uncertain about the fact that no one could paint a very specific picture of the risk. I suppose the main reaction that I had and I think probably lots of people have is that, mainly these are arguments about, in some sense, what your prior should be, or what attitudes you should have before engaging with the arguments. So, for example, is your argument that generically, before you look at the specifics of any given technology, you should have an initial expectation that probably most technologies end up having their safety issues resolved? So probably you should expect that to begin with, or prior to that, if an argument doesn’t really engage very heavily with the technical details of the technology, then maybe there’s some sort of abstraction failure or something that makes your argument not really properly go through or grip onto reality.

Ben Garfinkel: And I think that these counters don’t really engage with your arguments themselves, so much as they say why you should have an initial tendency towards skepticism. But the thing that they don’t do is actually look at the arguments in detail and then say, “Oh, here’s what a mistake is, or here’s where the gap is, or here’s where things somewhat come apart from the technical reality”. And I guess it’s easy to feel like… I guess a lot of critics don’t actually really engage with the arguments or they don’t really make the effort to take them seriously rather than just waving them off on the basis of, I guess common sense intuition. And so, yeah, I think in that sense, it’s easy to dismiss critics as often just, “Oh, they don’t really get it after all the amount we were digging in, they’re just being dismissive”.

Howie Lempel: So it’s easier to dismiss them for that reason. Do you think it’s also the case that some of them were just being dismissive?

Ben Garfinkel: I mean, I do think it’s a mix. I think I have pretty mixed feelings on this. So I think that a lot of these arguments are pretty reasonable. I do think you should be really suspicious if a realistic vision of the risk isn’t being painted and you should be suspicious if the arguments don’t seem to really grip onto the technical details of what the technology looks like and you should be suspicious if the conclusion is something that’s quite unusual if you just look at what’s driven most technologies. And then, I think I really don’t necessarily begrudge anyone who’s just like… They’re busy doing other stuff, they hear these arguments, they have this extremely strong prior that probably there’s something fishy here. And then, I basically don’t think that that’s very unreasonable.

Ben Garfinkel: At the same time though, I also think it’s not necessarily unreasonable for someone who is convinced by your arguments to go, “I understand why someone would react this way, but I’ve actually really looked into the arguments and I don’t see the flaws and so therefore I’m going to continue to be convinced by them and not take the fact that people are intuitively dismissive without looking in depth into the arguments as that much evidence against them”. So I guess, maybe I’m sympathetic to both the people who react dismissively and the people who have ignored the fact that other people react dismissively.

Howie Lempel: Yeah. I think another thing that I certainly felt at the time was that I was bothered by the level of confidence that critics seemed to have that the argument was wrong. Or at least among the critics who hadn’t really engaged with it. So it’s one thing to say, “This isn’t worth my time to look into”. But it’s an additional thing to say, “This isn’t worth my time to look into, but I’m going to write up an op-ed or a piece saying that it’s totally wrong”.

Ben Garfinkel: Yeah, I’ve also been really bothered by that as well. This may be overly harsh, but I do actually think there’s something irresponsible to some extent, if someone says that they’re really concerned about this large-scale risk, and often people raising this concern are quite credible figures like Stuart Russell, who’s obviously an extremely prominent AI researcher. And then I think it’s okay to just say, “Well, I’m not really going to look into this, I’m a bit skeptical of it”, but if you’re going to go out and try and write stuff and convince people that these arguments are false, then I do really think you have some obligation to really, really make sure that you’re right about this and really make sure that you’re accurately representing what the argument is and that you’re being fair to the people you’re criticizing, because thinking about it for something else, like someone who’s convinced that climate change probably isn’t a real phenomenon, but they don’t read a paper on climate change before they go out and write an op-ed. It just seems… I don’t know, I guess not very sensible.

Brain in a box [00:50:11]

Howie Lempel: Cool. So I want to transition from those outside view criticisms of the Yudkowsky/Bostrom line of thinking to some of the criticisms or objections that much more closely actually engage with the line of argument. So I know that over time you’ve come to have three major objections to the Yudkowsky/Bostrom line, so do you want to start talking through some of those in turn?

Ben Garfinkel: Yeah. So I suppose the first objection that I would raise is that a lot of classic presentations of AI risk seem to presume a relatively specific AI development scenario, which I’ll call the brain in the box scenario, actually borrowing a term from Eliezer Yudkowsky. And I think the basic idea is that for some period of time, there may be some amount of AI progress, but it won’t really transform the world all that much. You’ll continue to have maybe some new common applications like self-driving cars, or maybe image recognition systems for diagnosing diseases and things like that. But you’ll basically only have very narrowly competent AI systems that do very specific things. And in the aggregate, these systems just won’t have that large an impact on the world. They won’t really transform the world that much. And then relatively abruptly, maybe over the course of a single day, maybe over the course of something like a month, you’ll go from this world where there are only relatively inconsequential AI systems to having one individual AI system that’s, in some sense, very cognitively similar in its abilities to a human.

Ben Garfinkel: So about as smart as a person is along most dimensions. Not that much better at a given cognitive task, not that much worse at any given cognitive task. You have something that’s almost like a human brain sitting on a computer somewhere. And I think that’s often almost implicitly background to a lot of these presentations of AI risk. And then there are more concrete arguments that are made after that, where there’s this argument that once you have this thing, that’s like a brain in a box, this thing that’s, in some sense, as smart as a person, then you’ll get this really interesting jump to a radically superintelligent system. And with a lot of arguments there, it often seems to be taken as almost background that the lead-up to that point will look like this.

Howie Lempel: Okay, cool. So when you talk about the brain in a box scenario, are you primarily talking about the fact that prior to this intelligence explosion, there aren’t world changing ML systems that have already been deployed? Or are you talking about the arguments that once you get to that point, you end up having this really fast explosion? Or does brain in the box mean both of those together?

Ben Garfinkel: So I’m mostly referring to the first one. So I’d separate out… Even if all AI progress up to a certain point looks like nothing interesting happening then pretty suddenly you have this brain in the box style system. And then there’s a second question to ask of, conditional on us ending up with this brain in the box system pretty suddenly, will there be a sudden jump to something radically smarter? And I want to separate out that second question, which I do think is discussed pretty extensively. And I want to focus on the first question of like, what would progress look like? This sudden jump to something that’s cognitively like a human without really interesting precedents.

Howie Lempel: Okay. So if that doesn’t happen, what do the alternatives actually look like?

Ben Garfinkel: So I think there’s a pretty broad range of possibilities. So here’s one I might paint, which is you might think that the way AI progress looks at the moment, to some extent, is that year by year AI systems, at least in aggregate, become capable of performing some subset of the tasks that people can perform that previously AI systems couldn’t. And so there will be a year where we have the first AI system that can beat the best human at chess, or a year where we have the first AI system that can beat a typical human at recognizing certain forms of images. And this thing happens year by year, where there’s this gradual increase in the portion of relevant tasks that AI systems can perform.

Ben Garfinkel: And you might also think that, at the same time, maybe there’d be a trend in terms of the generality of individual systems. So this is one thing that people work on in AI research, is trying to create individual AI systems which are able to perform a wider range of tasks, as opposed to relying on lots of specialized systems. It seems like generality is more of a variable than binary, at least in principle. So you could imagine that the breadth of tasks an AI system can perform will become wider and wider. And you might think that there are other things that are fairly gradual, like the time horizons that systems act on, or the level of independence that AI systems exhibit might also increase smoothly. And so then maybe you end up in a world where there comes a day where we have the first AI system that can, in principle, do anything a person can do.

Ben Garfinkel: But at that point, maybe that AI system already radically outperforms humans at most tasks. Maybe the first point where we have AI systems that can do all this stuff that people can do, they’ve already been able to do most things better than people can before that point. Maybe this first system that can do all this stuff that an individual person can do also exists in a world with a bunch of other extremely competent systems of different levels of generality and different levels of competence in different areas. And maybe it’s also been preceded by lots of extremely transformative systems that in lots of different ways are superintelligent.

Howie Lempel: Okay. So that’s one potential scenario that looks different from the brain in the box scenario. Are there others you want to talk about?

Ben Garfinkel: Yeah. So I might maybe label the one I just described, let’s call it the smooth expansion scenario, where you just have this gradual increase in the things AI systems can do, but also gradual increase in the generality of individual AI systems.

Ben Garfinkel: But you might also imagine that maybe we don’t even really see that much of an increase in how general a typical AI system is. Maybe we don’t even really end up with very general systems that play that large a role in whatever it is that AI systems are doing. So I have a colleague at FHI, Eric Drexler, who has this really good paper called “Reframing Superintelligence” that, at least as I understand it, argues that people maybe are inclined to overestimate the role that very general systems might play in the future of AI. And the basic argument is that, so first of all, today we mostly have AI systems that aren’t that general. We mostly have pretty specialized systems. So maybe you just have some sort of prior that, given that this is what AI systems look like today, maybe this will also be true in the future. In the future we’ll have more systems that can do way more stuff in aggregate, and maybe they’ll still be relatively narrow.

Ben Garfinkel: Another argument for it is that it seems like specialized systems often outperform general ones, or it’s often easier to make, let’s say two specialized systems, one which performs task A and one that performs task B pretty well, rather than a single system that does both well. And it seems like this is often the case to some extent in AI research today. It’s easier to create individual systems that can play a single Atari game quite well than it is to create one system that plays all Atari games quite well. And it also seems like it’s a general, maybe economic or biological principle or something like that, that in lots of current cases there are benefits from specialization as you get more systems that are interacting. So biologically, when you have a larger organism, cells tend to become more specialized, or economically, as you have a more sophisticated complex economy that does more stuff, it tends to be the case that you have greater specialization in terms of workers’ skills.

Ben Garfinkel: So post-industrial revolution, there’s a really substantial increase in terms of how specific the tasks are that one person versus another performs. Or if you have an assembly line, there are benefits to having people that just do one specific bit of it, rather than doing all of it. Or on an international level, there’s an increase in terms of the benefits of one country having a certain specialization versus another. So maybe there’s something you can argue from this general principle. That if we’re imagining future economic production being really driven by these AI systems that can have different levels of generality, maybe the optimal level of generality actually isn’t that high.

Ben Garfinkel: And I guess the last argument that, at least as I understand it, Eric puts forward is that insofar as you actually buy some of these safety concerns that people have from these classic arguments, they often seem to focus on these very general systems causing havoc. So if you have any ability to choose, maybe if you’re safety conscious, you will have incentives to push things towards more narrow systems. So obviously I’m not very sure what’s more likely, but I think this is another possibility: maybe we don’t even really end up with very general systems that play a really large role.

Howie Lempel: Cool. So just making sure I follow, the two alternatives to brain in the box that you laid out, one of them, which you labeled as gradual emergence, is there are over time more and more general, more and more capable systems and we eventually get to human level AGI and that ends up being important. Or there are AGI systems that are important, but it’s not a huge jump from what existed before. And the second scenario, which is the one that Drexler lays out, is one where general systems don’t end up being that important, even later on, because the specialized systems end up being more relevant or more powerful. Is that right?

Ben Garfinkel: Yeah, I think so. I think you would also… It’s also plausible, we just have some really weird thing that’s hard to characterize where you have lots of systems of different levels of generality and agency and all of this stuff and it’s gradual in some way, but it’s just very hard to describe, which seems maybe even most likely to me.

Contenders for types of advanced systems [00:58:42]

Howie Lempel: Cool. Got it. So I guess maybe going back to the first alternative that you talked about, the gradual emergence alternative: can you give some sense of, at the point that human level AGI’s developed, what are some examples of these contenders? Like possibilities for the types of advanced AI systems that might already exist at that point? And how the world might be different?

Ben Garfinkel: Yeah, I think it’s probably pretty difficult to paint any really specific picture. So I think there’s some high-level things you could say about it. So one high-level thing you might say about it is that, take a list of economically relevant tasks that people perform today. Take a Bureau of Labor Statistics database, and then just cross off a bunch and assume that AI systems can do them, or that they can do certain things that make those tasks not very economically relevant. We can also think that there’s some stuff that people just can’t do today. That’s just not really on people’s radar as an economically or politically or militarily relevant task that maybe AI systems will be able to perform.

Ben Garfinkel: So one present day example of something that people can’t do, that AI systems can do, is generating convincing fake images or convincing fake videos. These things called deepfakes. That’s not really something the human brain is capable of doing on its own. AI systems can. So just at a very general abstract level, imagine lots of stuff that’s done today by humans is either done by AI systems or made irrelevant by AI systems and then lots of capabilities that we just wouldn’t even necessarily consider might also exist as well.

Ben Garfinkel: It’s probably lots of individual domains. You can maybe imagine research looks very different. Maybe really large portions of scientific research are automated, or maybe just the processes are maybe just in some way AI-driven in the way that they’re not today. Maybe political decision-making is much more heavily informed by outputs of AI systems than it is today, or maybe certain aspects of things like law enforcement or arbitration or things like this are to some extent automated and just the world could be just quite different in lots of different ways.

Howie Lempel: So I guess self driving cars seems like… Transportation is a decent sized part of the economy, so self-driving cars feels like it’s taking one real slice of things that humans can do and automating that away. So should I just think of self-driving cars but across many other industries and activities all sort of exist?

Ben Garfinkel: I mean, I think to some extent that’s reasonable. But I think it probably would give us a distorted picture. So one analogy is, if you think about, let’s say, automation in the context of agriculture. So if you look at the set of stuff that people were doing 200 odd years ago, a huge amount of human labor was tied up in just working fields and things like that. And then it probably would have been wrong if you’re trying to think about what’s the significance of industrial automation to just imagine, “Oh okay, the economy is the same. It’s just instead of people plowing fields, there’s tractors”. So a couple of things happen. So one thing that happens is as you automate certain tasks, new tasks emerge and become relevant for people. Another thing that happens is lots of the applications of technologies aren’t really well thought of as just replacing people in roles that currently exist.

Ben Garfinkel: They often introduce these new capabilities. Like industrialization is super relevant for railroads and tanks and lots of crazy stuff. Or computers, if you think about the significance of computers in general, the existence of things like social media websites and stuff like this, it’s not really just replacing a thing that was already happening. And I mean, I think just basically you should have this prior that it’s very, very difficult to actually have a concrete image. You should maybe imagine in this scenario that loads of stuff that’s done by people today is no longer done by people, but there’s lots of crazy stuff that’s being done by people that no one’s doing today. There’s also lots of crazy capabilities that just there’s no close analog to in terms of the technology we have today.

Howie Lempel: Okay, cool. And then if I am someone who buys the story that you’re going to have an intelligence explosion once you reach some point that might be human-level, “AGI-ish”, and I hear the description that you just gave, one thing that you threw out there was AI playing a really big role in science. Why don’t I just think that at the point that we have that level of technology, we’ll necessarily already have human-level AGI, right? Assume that among the things that these narrow AIs are really good at doing, one of them is programming AI and so you end up with that leap from getting one of those technologies quickly to human-level AGI and then take off from there?

Ben Garfinkel: So I think the basic reaction I have to this is if you think of, let’s say applied science, there’s really a lot of different tasks that feed into making progress in engineering or areas of applied science. There’s not really just this one task of “Do science”. Let’s take, for example, the production of computing hardware. Obviously this is not at all a clean picture of this, but I assume that there’s a huge amount of different work in terms of how you’re designing these factories or building them, how you’re making decisions about the design of the chips. Lots of things about where you’re getting your resources from. Actually physically building the factories. I assume if you take the staff of Intel that’s relevant to the production of new chips, or staff from Nvidia, and then you do some labeling of what are the cognitive tasks that people need to perform. You’re probably going to come up with a pretty long and pretty diverse list of things.

Ben Garfinkel: And then you could obviously react like, “Oh, for whatever reason I have the view that I think there’s going to be some sudden jump that allows AI systems to discontinuously perform a way larger number of tasks than they could in the previous year”. But I suppose if you’re imagining this more gradual scenario where this trend continues of some set of tasks unlocked each year, then intuitively I think you’d think maybe over time a larger portion of the work that goes into producing progress in applied science or engineering, a larger portion of these tasks are automated, but there’s no year where we’ve just switched from where we don’t have automated science to we do have automated science.

Implications of smoothness [01:04:07]

Howie Lempel: Let’s assume that the first alternative that you described ends up playing out. I forget if you called it smooth expansion or gradual emergence, but it’s the one where we end up with systems that are both general and superintelligent. There’s a whole bunch of other really important transformations due to AI that lead up to it. What does that mean for the overall AI risk argument?

Ben Garfinkel: Right. I think if things are gradual in this way, if we have lots of systems that have intermediate generality and competence, and the world doesn’t just radically transform over a short period of time but things move in some sense gradually, I think this has a few implications relative to what you’d expect if the brain in the box scenario came true.

Ben Garfinkel: I think one of the first implications is that people are less likely to be caught off guard by anything that happens. If you have the brain in the box model of AI development, then we really don’t know when it’s going to happen. Maybe someone has some crazy math breakthrough tomorrow and we just end up with AGI; you have some individual system that’s human-level or human-like in its intellect.

Ben Garfinkel: That just seems very hard to predict in the same way that, for example, it’s hard to predict when someone will find a new proof like a mathematical theorem. Whereas if stuff is very piecemeal, then it seems like you can extrapolate from trends. For example, progress in making computers faster is something that’s not extremely unpredictable. It’s sufficiently gradual. You can extrapolate and you can know, “Oh okay, I’m not going to wake up tomorrow and find that computers are 10,000 times faster, or that solar panels are 10,000 times more efficient, or that the economy is X times larger”. That’s one significant thing, is that people are more likely to just not be caught off guard, which is quite useful I think.

Howie Lempel: Why is it important for people not to be caught off guard? What’s going to happen in this time where we realize, “Oh shit, this might be coming”, that really makes a change to the story?

Ben Garfinkel: One implication is that people are more likely to do useful work ahead of time or more likely to have institutions in place at that time to deal with stuff that will arise. This definitely doesn’t completely resolve the issue. There are some issues that people know will happen ahead of time and are not really sufficiently handling. Climate change is a prominent example. But, even in the case of climate change, I think we’re much better off knowing, “Oh okay, the climate’s going to be changing this much over this period of time”, as opposed to just waking up one day and it’s like, “Oh man, the climate is just really, really different”. Definitely, the issue is not in any way resolved by this, but it is quite helpful that people see what’s coming when the changes are relatively gradual. Like institutions, to some extent, are being put in place.

Ben Garfinkel: I think there’s also some other things that you get out of imagining this more gradual scenario as opposed to the brain in the box scenario. So another one that I think is also quite connected is people are more likely to also know the specific safety issues as they arise, insofar as there are any unresolved issues about making AI systems behave the way you want them to or, let’s say, not deceive the people who are designing them, even unintentionally. You’re likely to probably see low-level versions of these before you see very extreme versions of these. You’ll have relatively competent or relatively important AI systems mess up in ways that aren’t, let’s say, world-destroying before you even get to the position where you have AI systems whose behaviors are that consequential.

Ben Garfinkel: If you imagine that you live in a world where you have lots and lots of really advanced AI systems deployed throughout all the different sectors of the economy and throughout political systems, insofar as there are these really fundamental safety issues, you should probably have noticed them by that point. There will have probably been smaller failures. People should also be less caught off guard or less blindly oblivious to certain issues that might arise with the design of really advanced systems.

Howie Lempel: One counterargument that I hear sometimes boils down to, “You won’t get any of these types of warnings because we should expect that if this is going to end up being a truly superintelligent AI, it’s going to realize that it should not give any warnings”. So you actually won’t start seeing any of the lying manipulation until it’s at a point where that’s actually beneficial for that AI. What do you think of that response to that concern?

Ben Garfinkel: I think this is also something that’s, to some extent, resolved by stuff being gradual and deploying lots of AI systems throughout the world well before that point. Insofar as the methods we’re using, in some sense, incline ML systems to hide certain traits that they have or engage in deception, we should also notice less damaging forms of deception, or less competent forms of deception, probably well before that point if we have loads and loads of ML systems out in the world.

Ben Garfinkel: We already actually have some relatively low-key versions of this. I think there’s a case where someone was training, I think a robotic gripper in simulation, and it learned that it could… I think it would place itself between the virtual camera and the digital object in such a way that it looked like it was, I think, manipulating it or touching it and accomplishing some tasks, when in fact it just had done the thing where people would make it seem like they’re leaning against the Leaning Tower of Pisa.

Ben Garfinkel: We do have some examples of low-level forms of deception. I think if stuff is sufficiently gradual, we’ll probably continue to notice those. If they start to become really serious, we’ll start to notice that they’re becoming really serious and we shouldn’t be just totally caught off guard by the fact that this is a capability that AI systems have.

Howie Lempel: Is the argument that the solutions that we find to handle deception now on less powerful systems will then at least sometimes be scalable to more advanced systems? Or, is it more just like we got a warning notice that this is something that AI sometimes does, so we’re not going to end up employing advanced systems in the wild until we figure out under what circumstances these systems end up being non-transparent?

Ben Garfinkel: I think it’s a little bit of both. Definitely one aspect is just that people are less likely to be caught off guard and you might be safe, or less likely to deploy something being completely oblivious to it potentially having damaging properties. The other one, like you were suggesting, is that we might have lots of opportunities for trial and error learning, like using techniques on relatively simple or crude systems, figuring out what works there. Then we have systems that operate in different domains that are, in some sense, more sophisticated and we maybe realize that some of the techniques don’t perfectly scale, we adjust them and do all of that.

Ben Garfinkel: I think generally speaking, if stuff is sufficiently continuous up until the point where we have systems that can do all the stuff that people can do, then I would be surprised if the techniques were completely unscalable. Maybe you transition from one technique to another. But, if things are continuous with regard to capabilities, I’d be surprised if safety considerations, in some sense, don’t at all carry forward in a relatively smooth way.

Howie Lempel: All right. We were going through some of the implications of this brain in the box scenario, and had talked about trial and error, not being caught off guard. I think you had something to say about AI tools.

Ben Garfinkel: Yeah. I think this is just one more way in which I think things being fairly gradual could be helpful. This is another one that I think, for example, Eric Drexler has raised, is that if we’re imagining that progress is unfolding in this gradual way, then it’s also probably the case that lots of the intermediate systems we develop along the way could be quite useful for designing other AI systems or making sure that other AI systems are safe or don’t get out of hand, and so we should also be probably imagining when we think about the future, our ability to avoid accidents and stuff like that, we shouldn’t just imagine our current level of ability to keep watch on things or probe systems. We should imagine ourselves having probably various AI enabled capabilities that we don’t have today that could be quite helpful.

Howie Lempel: Is there something you could learn that really increases your credence in the possibility that you’re going to see some of these huge jumps and discontinuities?

Ben Garfinkel: There’s this intuition I think some people have from evolutionary history where it seems like there’s some sense in which the evolution of human intelligence seems to have had some interesting discontinuity in it. On aggregate, chimpanzees and the common ancestors of humans basically can’t do a lot of interesting stuff in the world. They can make extremely rudimentary tools and that’s basically it. Whereas on aggregate, humans can do a bunch of crazy stuff in the world together, like go to the moon. It seems like there actually aren’t that many generations separating this extremely accomplished species from this extremely unaccomplished species. There’s not that many genetic mutations that are probably involved. And if you map that onto AI development, if you imagine, let’s say, compressing the evolution of human intelligence into a hundred years and retracing it in some sense in the context of AI development, then maybe there’s nothing that interesting happening for the first 99 years, and then in the last year, maybe even the last month or something, suddenly you have things that can go to the moon.

Ben Garfinkel: I think that’s the rough intuition that leads some people to think that you could actually have something that looks quite discontinuous, at least on normal human timescales. I think the thing that I’d be most interested in is basically… I’d be really interested in an argument that tries to look at what happened in the evolution of human intelligence, to the extent that we actually have any sense of what happened there, and tries to map it onto the way AI development is going and tries to see if there’s any analogy you can draw. If you can strengthen the analogy, that should give us more reason to think that the thing that happened in human evolutionary history, which looked quite discontinuous on evolutionary timescales, has a close analogue that might happen in the development of AI systems and that will look quite discontinuous on human timescales. I think that would probably be the main intellectual track to go down that could lead me to assign a higher probability to a sudden jump to much more advanced systems.

Howie Lempel: The other thing that I hear a bit, I don’t even know if this is going to end up conceptually holding together, but often this point that’s focused on as being the point where there would be a jump, is something around human-level AGI defined as something like an AGI that can do basically all tasks as well as humans can and therefore, superintelligent on certain tasks. I guess it’s possible the more natural point to expect the jump would be something at the point where some particular AI system has more optimization power, or however you want to define it, than all of the other AI that’s out there. I can imagine that that would be the point in which you’d start seeing really big feedback loops.

Ben Garfinkel: I think the issue here is that if we’re imagining a world where there is one single AI system whose ability to contribute to AI progress outweighs, in aggregate, all the other AI systems that exist in the world, then it seems to imply that some really interesting jump has already happened. I think I do agree that conditional on us ending up in that scenario, it does seem like there’s something to be said for like, “Oh man, we have this one single AI system that’s way out ahead of everything else” and they already do contribute a massive amount to AI research and maybe its lead even just gets amplified. I guess it seems like you still need to think there’ll be some sort of discontinuity to even end up in that world in the first place.

Howie Lempel: Cool. That makes sense.

Howie Lempel: Okay. We’ve talked through a few of the implications that come out of this question of whether or not there are these big discontinuities in AI development. The world looks a bit different if people are caught off guard versus they have time to prepare. Looks different depending on, “Are there other AI systems, AI tools around already to help with the development”? And so those are the stakes. Now the question is, how likely is it that we will be in the world foreseen or described by the classic arguments where you had these big discontinuities versus a world where things are going more gradually?

Ben Garfinkel: Yeah. I think I don’t have an extremely robust view on this. I would put it, as a somewhat made-up number, below 10% that we end up with something that looks like a really large jump that’s sort of evocative of the sort of thing that’s imagined in these classic arguments.

Howie Lempel: Just for context, how large exactly are we talking about there?

Ben Garfinkel: Yeah. I think it’s a sufficiently vague thing. It’s difficult to put numbers on it. I know that, for example, Paul Christiano has sometimes tried to use the economic growth rate as a proxy. Obviously this doesn’t exactly capture what we care about, but there’s some sort of intuition that the more, in some sense, useful or advanced the AI systems we have are, the more that if they’re directed appropriately, they could increase GDP. You can imagine, maybe if we end up in the brain in the box scenario, probably GDP is not going to maintain the same 3% growth thing.

Ben Garfinkel: I think this is one where you can operationalize it. Where I think, for example, Paul Christiano has said that his view is it’s really unlikely that we’ll suddenly jump to a year where there’s a 50% or above economic growth rate for a single year as one way of saying, “Oh, it probably won’t be this extreme thing.” Obviously even a jump to 25% or something is a historically unprecedented discontinuity. But I think that’s one way of operationalizing it. I might maybe implicitly be using something like that when I’m talking about discontinuity.

Howie Lempel: All right. Are we saying something like, two years later, people are in a world that they fundamentally recognize, but has also changed more than they’ve ever seen the world change over two years before?

Ben Garfinkel: I think that’s probably fairly reasonable. There’s also this… This is maybe a bit in the weeds, there’s also this distinction you can draw between any given point in time, is the world, in some sense, changing much faster? And then the second order question of how quickly did the pace of change itself change? With analogy of economic history, the economy changes way faster today than it did hundreds of years ago. Technology changes way faster. But the change in the rate itself was relatively smooth. It happened over hundreds of years. I’d maybe be more inclined to focus on this second order change in the rate of change, even if the pace of change goes faster, I do think that process will itself be fairly smooth and people can adapt to things moving faster.
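
As a rough illustration of the distinction drawn here, the sketch below separates the growth rate in a given year from the year-on-year change in that rate. The two growth paths are made-up numbers chosen only to show the contrast, and the function names are hypothetical; nothing here is an estimate from the conversation.

```python
def build_path(rates, start=100.0):
    """Turn a list of yearly growth rates into a series of output levels."""
    levels = [start]
    for r in rates:
        levels.append(levels[-1] * (1 + r))
    return levels


def growth_rates(levels):
    """First-order quantity: how fast the world is changing each year."""
    return [(b - a) / a for a, b in zip(levels, levels[1:])]


def rate_changes(rates):
    """Second-order quantity: how quickly the pace of change itself changes."""
    return [b - a for a, b in zip(rates, rates[1:])]


# Hypothetical paths: both go from 3% to 10% yearly growth.
smooth_rates = [0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09, 0.10]
jumpy_rates = [0.03, 0.03, 0.03, 0.03, 0.10, 0.10, 0.10, 0.10]

smooth = build_path(smooth_rates)
jumpy = build_path(jumpy_rates)

# Largest single-year shift in the growth rate along each path.
max_smooth = max(abs(d) for d in rate_changes(growth_rates(smooth)))
max_jumpy = max(abs(d) for d in rate_changes(growth_rates(jumpy)))

print(f"largest yearly shift in growth rate, smooth path: {max_smooth:.3f}")
print(f"largest yearly shift in growth rate, jumpy path:  {max_jumpy:.3f}")
```

Both paths end up growing equally fast; they differ only in the second-order measure, which is the quantity the gradual picture says should stay small.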

Howie Lempel: Okay. You’re giving your distribution of your beliefs over these scenarios?

Ben Garfinkel: Right. I think there’s some case for just stuff eventually becoming much faster. Some of the arguments here have to do with economic growth models and economic growth is really, really poorly understood. At least if you take certain models at face value, one argument that people make is that there’s this relationship between, say, capital and labor and economic growth, where you combine basically technology and stuff with human workers and you get outputs. Research is the same way. We combine researchers with the tools they have and you get out intellectual progress.

Ben Garfinkel: One issue that keeps the pace of things relatively slow is that you have, in any given year, a relatively fixed number of researchers. Human population growth is, to some extent, exogenous. Even if you have technological progress, you can’t feed it back directly into technological progress. That there’s this, in some sense, human labor bottleneck. There’s a number of papers by prominent growth economists that look at this idea and say, “Oh, if you eventually get to the point where just humans no longer really play a prominent role in economic growth or in scientific research, maybe have a much, in some sense, tighter feedback loop where technological progress or economic output just feeds back directly into itself”. It may be, in some sense, that things move quite a bit faster.
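
The labor bottleneck idea can be sketched with a toy simulation. This is not a model taken from any of the growth papers alluded to above; the update rule, the `productivity` value and the function name are all assumptions made for illustration. With a fixed research workforce the growth rate stays flat, and when research effort scales with the technology level itself the growth rate keeps rising.

```python
def simulate_technology(steps, feedback, productivity=0.02, researchers=1.0):
    """Toy model: each step, technology grows by productivity * effort * level.

    feedback=False: effort is a fixed (exogenous) human research workforce.
    feedback=True:  effort scales with the technology level, i.e. progress
                    feeds directly back into further progress.
    Returns the per-step growth rate of the technology level.
    """
    level = 1.0
    growth = []
    for _ in range(steps):
        effort = level if feedback else researchers
        new_level = level * (1 + productivity * effort)
        growth.append(new_level / level - 1)
        level = new_level
    return growth


bottlenecked = simulate_technology(30, feedback=False)
self_feeding = simulate_technology(30, feedback=True)

print(f"fixed workforce: growth {bottlenecked[0]:.4f} -> {bottlenecked[-1]:.4f}")
print(f"feedback loop:   growth {self_feeding[0]:.4f} -> {self_feeding[-1]:.4f}")
```

The sketch only shows the qualitative difference between the two regimes; it says nothing about how fast or how smoothly a real transition between them would be.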

Ben Garfinkel: I would say vaguely, I don’t know enough about the area, but I would give at least a one in three chance that at some point, if we’re imagining the long run stuff becomes pretty radically faster in the same way that stuff is radically faster today than it was a few hundred years ago, although it’s very non-robust.

Ben Garfinkel: Then again, quite vague, there’s a question of let’s condition on stuff, in some sense, becoming faster. How smooth is that transition to a world where things are faster? Is it that the transition happens over the course of what feels like, or what could be considered, let’s say, two years or less or something? I would probably give it, again, I don’t really know where these numbers are coming from; I give it below 5%.

Groups of people who disagree [01:18:28]

Howie Lempel: You’ve pushed back a little bit on this model that the classic arguments seem to use that has brain in the box type of model of AI development. That basically a main characteristic is the discontinuities over the course of development. And we talked about some of the implications of ways in which you might be a bit less worried of these discontinuities. Are there people who disagree with you and who really would want to push back on you and say that there’s a good reason to think that there are going to be big discontinuities over time?

Ben Garfinkel: Yeah. I think there are certainly people who would disagree with me. One example would just be, I think, that a number of different people working on AI safety, I think especially probably researchers at the Machine Intelligence Research Institute probably have a pretty different model. Something that’s a lot more jumpy in terms of what progress will look like. There’s also a lot of researchers who are obviously working towards eventually developing AGI and are often relatively optimistic about the timelines for doing this which implies some degree of discontinuity. If we get there, let’s say, in the next 10 years, then that implies something pretty jumpy has happened if 10 years from now we’re in the world where human labor is, let’s say, irrelevant. Certainly there are a lot of people who disagree.

Howie Lempel: Do you have a sense of why they disagree?

Ben Garfinkel: Yeah, I think there’s actually a lot of heterogeneity here. I think one thing that’s unfortunate is really not that much has been written up arguing for something like a brain in the box scenario. There is a bit of this in, for example, this series of back and forth blog posts between Eliezer Yudkowsky and Robin Hanson in 2008 called “The AI-Foom Debate”. There are little bits of a few paragraphs in a few different white papers here and there, but really not that much. For example, if I remember correctly, I don’t think that there is anything, for example, in the book Superintelligence that directly addresses the relative plausibility of something like the brain in a box scenario versus something like this smooth expansion scenario, or presents an argument for why you’d expect something like the brain in the box scenario.

Ben Garfinkel: My impression of the arguments for this position are a lot more through conversation and informal bits and pieces I’ve gotten here or there. I think there are probably a few main reasons. So one is this evolutionary analogy. I think a lot of people do find it compelling that there was something like a discontinuity in the evolution of human cognition and think that maybe this suggests there’ll be something that looks like a discontinuity on human timescales when it comes to AI capabilities. I think some people, especially in the past, have this intuition that maybe there is one intellectual breakthrough. That if only we had this relatively simple intellectual breakthrough we could just suddenly jump from AI systems that can’t do much of interest to AI systems that can do all this stuff that people can do.

Ben Garfinkel: This is, I think, for example, the position that Yudkowsky argues in “The AI-Foom Debate”. I think I don’t entirely understand the reasoning for it. It seems like to some extent there’s a little bit of counter-evidence from… It seems like AI progress is, to a large extent, driven by lots of piecemeal improvements by computing power gradually getting better and stuff like that. It seems like that’s often the case for other technologies that’s pretty piecemeal. A lot of people do seem to have the intuition that maybe there’s something that looks like a single intellectual insight, that if we could have it, it would allow us to jump forward.

Ben Garfinkel: I think there’s also almost the opposite line of argument, where I think some people have the intuition that in some sense, AI progress is mostly about having enough computing power to do interesting things. Maybe the more intellectual stuff, or maybe the more sort of algorithmic improvements stuff doesn’t really matter that much. Or maybe it’s almost downstream of computing power improvements.

Ben Garfinkel: I think for people who have this view, I think there’s a few different ways it can go. I think one intuition that some people have is if in some sense computing power is the main thing that drives AI progress, then at some point there’ll be some level of computing power such that when we have that level of computing power, we’ll just have AI systems that can at least, in aggregate, do all the stuff that people can do. If you’re trying to estimate when that point will be, maybe one thing you should do is make some sort of estimate of how much computing power the human brain uses and then notice the fact that the amount of computing power we use to train the ML systems isn’t that different and think, “Well, maybe if we have the amount of computing power that’s not much larger than what we have now, maybe that would be sufficient to train AI systems to do all the stuff that people can do”.

Ben Garfinkel: You can almost backwards extrapolate where it’s like, there’ll be some amount of computing power that will allow us to have really radically transformative systems. The amount of computing power is going to be reached relatively soon in the future. Therefore, I guess through backwards extrapolation, it must be the case that progress is relatively discontinuous.

Ben Garfinkel: I think there’s a few, maybe wonky steps in there that I don’t really understand. But I do think something like that argument is compelling to a subset of people. I suppose, just more generally, if you have some reason for thinking that we’ll have AI systems that can do stuff that people can do, not that far in the future, maybe five or 10 or 15 years, then if you hold that view and think that there are arguments for it, then that seems to imply that there’s got to be some level of discontinuity. The sooner you think it might happen, the greater the discontinuity necessarily will be.

Howie Lempel: It sounds like there are people taking up the opposing side. Why haven’t they made you change your mind?

Ben Garfinkel: I think there’s a couple of reasons. One reason is I think I just haven’t really encountered, I think, any very thorough versions of these arguments yet. It’s just the case, for the most part, that they haven’t yet been at least publicly written up in a lot of detail. For example, I’m not aware of any, say, more than a page long piece of writing that uses the evolutionary analogy to try and argue for this discontinuity or really looks into the literature in terms of what happened with humans. I guess I haven’t really maybe encountered sufficiently fleshed out versions of them. I guess, in the absence of that, I’m more inclined to fall back on just impressions of what progress has looked like so far and this general really recurring phenomenon that technological progress tends to be pretty continuous and give more weight to that and view that as more robust than the somewhat more specific arguments I haven’t really seen in detail.

Howie Lempel: Sounds like a lot of the arguments of this space haven’t been written up. Seems to me like some of yours have. Have people who disagreed with you read any of your writing on discontinuities and have you gotten direct feedback back on that?

Ben Garfinkel: Yeah. I basically haven’t gotten that much pushback. I haven’t really maybe gotten that many comments in general yet. I think one bit of pushback is, for example, that I may be unfairly devaluing arguments on the basis of them not having been sufficiently written up. That maybe I should still give more weight to them than I’m giving them. And that maybe it should be more a mistake of giving them a lot of weight until proven otherwise as opposed to… I suppose what I’m doing is I’m not giving them that much weight until I feel like I’ve really gotten them in detail. I think that’s one piece of pushback I’ve gotten. Although I think, in general, I haven’t received that many counterarguments. I think I’ve also circulated this stuff, or had it read by many fewer people than, for example, Robin Hanson or Paul Christiano, who have written relatively similar connected things, pushing back against the view that things will be pretty discontinuous. And then, at least going by things like comments on their blog posts, I don’t think I’ve seen any very sort of detailed rebuttals.

Orthogonality thesis [01:25:04]

Howie Lempel: Yeah. One concern people often raise when they are looking at the classic AI arguments is that they just can’t understand why a superintelligent AI would do something as stupid as whatever the thought experiment is, “Turn all the humans into paperclips”. Anything smart enough to do that should know that it shouldn’t do that. To what extent does this just shut down the argument and just prove that we really don’t have to be worried?

Ben Garfinkel: Yeah. I definitely actually disagree with that objection. I think this is definitely something you see pretty often. Sometimes you have op-ed pieces like, “Oh, if the thing is so smart, it should know that we didn’t want it to make paperclips”. Or, if it’s so smart, it should know the right thing to do. I think that the common response to this is what’s sometimes called the orthogonality thesis. Where two things being orthogonal just means that they’re basically completely independent. The idea here is that basically any given goal can be pursued extremely effectively or intelligently. Or almost any given goal can. That it’s at least, in principle, possible to create an AI system that extremely effectively tries to just maximize the amount of paperclips or tries to stack chairs as tall as it wants, or just do any given bizarre thing. That at least it’s, in principle, possible to design a system that has some really weird goal that a human would never have and pursues it very effectively.

Ben Garfinkel: And essentially what the orthogonality thesis does is it says we shouldn’t think that just because something is in some intuitive sense smart, that it’ll be trying to do something that any human would ever want to do. It could be, in some sense, intuitively smart, but doing something that’s just quite bizarre and quite at odds with what humans prefer.

Ben Garfinkel: I do actually though have, I suppose, another objection, which is in the vicinity of the orthogonality thesis. I think something that sometimes people have in mind when they talk about the orthogonality of intelligence and goals is they have this picture of AI development where we’re creating systems that are, in some sense, smarter and smarter. And then there’s this separate project of trying to figure out what goals to give these AI systems. The way this works in, I think, in some of the classic presentations of risk is that there’s this deadline picture. That there will come a day where we have extremely intelligent systems. And if we can’t by that day figure out how to give them the right goals, then we might give them the wrong goals and a disaster might occur. So we have this exogenous deadline of the creep of AI capability progress, and that we need to solve this issue before that day arises. That’s something that I think I, for the most part, disagree with.

Howie Lempel: Why do you disagree with that?

Ben Garfinkel: Yeah. I think if you look at the way AI development works and you look at the way that other technologies have been developed, it seems like the act of, in some sense, making something more capable and the project of making it have the right kinds of goals often seem really deeply intertwined. Just to make this concrete, let’s say what it means for something to be pursuing a goal is that it’s engaging in some behavior that’s, from a common sense perspective, explained in terms of it trying to accomplish some task. If we take a really simple case of, let’s say a thermostat, it’s sometimes useful to think of it as trying to keep the temperature steady as the goal of this piece of technology. Or, something that plays chess, it’s sometimes useful to think of it as trying to win at chess. That’s sometimes a useful frame of mind to have when trying to explain its behavior.

Ben Garfinkel: It seems like, at least on this perspective, often it’s not a separate project of giving something the right goals and making it more capable. Just to walk through a few examples, the thermostat, when you make a thermostat, there’s this thing where you make a metal strip that expands in a certain way and cuts off current if it’s at a certain temperature. The act of doing that makes a thermostat both behave in a way that acts like it’s trying to keep the temperature steady, but also makes it capable of achieving that.
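
The thermostat point can be sketched in a few lines. This is a minimal sketch under assumed numbers: the cutoff temperature and the heating and cooling amounts per step are invented, and `thermostat_step` and `simulate` are hypothetical names. The single cutoff rule is the only thing written down, yet it is what makes the device look like it "wants" a steady temperature and what makes it able to hold one.

```python
def thermostat_step(temperature_c: float, cutoff_c: float = 20.0) -> bool:
    """Return True if the heating circuit is closed.

    Stands in for the metal strip: at or above the cutoff temperature the
    strip has expanded enough to break the circuit.
    """
    return temperature_c < cutoff_c


def simulate(start_c: float, steps: int = 50) -> float:
    """Run the room forward and return the final temperature."""
    temp = start_c
    for _ in range(steps):
        heating = thermostat_step(temp)
        # Assumed dynamics: the room warms when heated, cools otherwise.
        temp += 0.5 if heating else -0.3
    return temp


print(f"cold start: {simulate(10.0):.1f} C")
print(f"warm start: {simulate(30.0):.1f} C")
```

There is no separate step where a goal is installed and another where competence is added; both come from the same design choice.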

Ben Garfinkel: When people were programming old chess engines, like Deep Blue, they put in a bunch of rules for what it should do in certain circumstances or what sorts of search procedures it should do. The act of doing that both made it act as though it was trying to win at chess and made it capable of playing chess.

Ben Garfinkel: When people are designing reinforcement learning systems, they start out just behaving basically randomly. This is the way that this often works in machine learning. Is that something has a policy that just basically engages really poorly. If you look at it, it’s really not useful to think of it as reaching any sorts of goals.

Ben Garfinkel: Then there’s this feedback process. We give it feedback on the basis of the actions it’s performed. Over time, goals essentially start to take shape and it starts to act as though it’s trying to do something that is compatible with rewards you’re giving it and also starts to be more competent at it. There’s often this interesting intertwined process where we shift what the thing’s instrumental goals should be in the process of trying to make it more competent.

Ben Garfinkel: I think a last example along these lines is if we look at human evolutionary history, there’s not really this separation between goal genes and intelligence genes. It’s all mixed together. There are things that you can do that affect a thing’s behavior. Some of the things you do to affect this behavior can be seen as making it smarter or making it more like it’s trying to pursue a certain goal, but there’s not really a very clean separation.

Howie Lempel: Okay. So you’ve given a few examples of systems where the goals of the system and the capability of the system seem necessarily connected. Then you also made a claim about how that might affect AI development going forward. Can you talk a little bit more about that second bit? What does that actually mean for how AI is going to be developed?

Ben Garfinkel: Yeah. I think it’s useful to maybe illustrate this with a concrete case. Let’s talk about, I guess, a problem that is sometimes raised as a useful thought experiment in the AI safety literature. Let’s say you’re trying to develop a robotic system that can clean a house as well as a human house-cleaner can, or at least develop something in simulation that can do this. Basically, you’ll find that if you try to do this today, it’s really hard. A lot of traditional techniques that people use to train these sorts of systems involve reinforcement learning with essentially a hand-specified reward function. What that means is, let’s say you have a simulation of a robot cleaning a house, you’ll write down some lines of code that specify a specific condition under which you give the simulated robot positive feedback. Then, the robot will, over time, act more and more in a way that is consistent with the feedback you’re giving it. One issue here is it’s actually pretty nuanced what you want the robot to be doing.

Ben Garfinkel: Let’s say you write down a really simple condition for giving the robot feedback. You just say, “Oh, one thing I care about in the context of cleaning a house is I prefer for dust to be eliminated. I don’t want my house to be dusty”. And you write down a reward function. It just says, “The less dust is in a house, the better the feedback I give the robot”. I automate this process. I run a simulation a bunch of times and I get out a robot that’s acting in accordance with this feedback.

Ben Garfinkel: One issue you’ll find is that the robot is probably doing totally horrible things because you care about a lot of other stuff besides just minimizing dust. If you just do this, the robot won’t care about, let’s say throwing out valuable objects that happened to be dusty. It won’t care about, let’s say, ripping apart a couch cushion to find dust on the inside. There’s a bunch of stuff you probably care about, like straightening the objects in your house, or just making sure things are neat, trash is thrown out, other stuff isn’t thrown out, that just won’t be handled if you just make this dust minimization function.
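
As a rough illustration of the dust-only reward function described above, here is a minimal sketch. Everything in it is an assumption made for illustration: the action names, the numbers, and the `intended_reward` function standing in for what a homeowner might actually care about. It is not how any real cleaning robot is trained; it only shows how a hand-specified reward that ignores damage can prefer destructive behaviour.

```python
# Hypothetical candidate behaviours for a simulated cleaning robot.
# Each entry maps an action to (dust remaining afterwards, damage done).
# All names and numbers are invented for illustration.
ACTIONS = {
    "vacuum_floor": (40, 0),
    "dust_shelves": (30, 0),
    "throw_out_dusty_heirloom": (20, 5),
    "rip_open_couch_cushions": (5, 8),
}


def dust_only_reward(dust_remaining, damage):
    """Hand-specified reward: less dust is better; damage is ignored."""
    return -dust_remaining


def intended_reward(dust_remaining, damage):
    """Stand-in for what the owner actually wants (an assumption)."""
    return -dust_remaining - 10 * damage


def best_action(reward_fn):
    """Return the action that scores highest under the given reward."""
    return max(ACTIONS, key=lambda name: reward_fn(*ACTIONS[name]))


print("dust-only reward picks:", best_action(dust_only_reward))
print("intended reward picks: ", best_action(intended_reward))
```

Under these made-up numbers the dust-only reward favours ripping open the cushions, while the stand-in for the owner's real preferences favours dusting the shelves.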

Ben Garfinkel: Then, you can try and make it a bit more complicated. Like, “Oh, I care about dust, but I also care about, let’s say, the right objects being thrown out and the wrong objects not being thrown out”, but you’ll find it really, really hard to actually write down some line of code that captures the nuances of what counts as an object to throw out and an object not to throw out. You’ll probably find any simple line of code you write isn’t going to capture all the nuances. Probably the system will end up doing stuff that you’re not happy with.

Ben Garfinkel: This is essentially an alignment problem. This is a problem of giving the system the right goals. You don’t really have a way using the standard techniques of making the system even really act like it’s trying to do the thing that you want it to be doing. There are some techniques that are being worked on actually by people in the AI safety and the AI alignment community to try and basically figure out a way of getting the system to do what you want it to be doing without needing to hand-specify this reward function.

Ben Garfinkel: There’re techniques where it, for example, you give it feedback by hand and then the system tries to figure out the patterns by which you’re giving it feedback and learns how to automate the process of giving feedback itself. There’re techniques where, for example, you watch a human clean a house and you’re trying to figure out what they care about. And on that basis, you try to automate the process of giving feedback to a learning system.

Ben Garfinkel: These are all things that are being developed by basically the AI safety community. I think the interesting thing about them is that it seems like until we actually develop these techniques, probably we’re not in a position to develop anything that even really looks like it’s trying to clean a house, or anything that anyone would ever really want to deploy in the real world. It seems like there’s this interesting sense in which we have this sort of system we’d like to create, but until we can work out the sorts of techniques that people in the alignment community are working on, we can’t give it anything even approaching the right goals. And if we can’t give it anything approaching the right goals, we probably aren’t going to go out and, let’s say, deploy systems in the world that just mess up people’s houses in order to minimize dust.

Ben Garfinkel: I think this is interesting, in the sense in which the processes to give things the right goals bottleneck the process of creating systems that we would regard as highly capable and that we want to put out there. Just to, I guess, contrast the picture I’ve just painted with some of the ways in which these issues are talked about in the classic AI risk arguments: there’s a framing that Eliezer Yudkowsky, for example, has used relatively frequently. And it’s this framing of, “How do we get an extremely superintelligent system to move a strawberry to a plate without blowing up the world”? I think that this is basically a framing that just doesn’t really seem to match the way that machine learning research works. It’s not like for the house cleaning case, we’ll create a system that’s, in some sense, superintelligent and then we just figure out, “Okay, how do we make it clean houses”? It’s really this pretty intertwined thing. It’s not like you have this thing that’s, in some sense, superintelligent and then you need to slot in some behavior. It’s all tangled up.

Howie Lempel: I guess how much of the work here is being done by the narrowness of the cleaning bot? If you want to design a bot that’s going to clean a house, then you’re going to heavily focus on evaluating it based on how good it’s doing at housecleaning. If there are ways in which it’s a little bit misaligned with housecleaning, then that’s a major problem and so you’re going to fix that. If you wanted to have a substantially more general robot, will it still be the case that you’ll have as strong a process that will move you towards alignment as you develop?

Ben Garfinkel: Yeah. I don’t think it’s really a big difference. It seems like generality is more of a continuum than a binary variable. Something is more general the more tasks it can perform, or the wider the range of tasks it can perform, or the more environments it can perform them in. For example, let’s take the housecleaning robot case and turn up the generality knob. It’s not just a housecleaning robot, it’s also a robot that can take certain verbal instructions and go up to the store to buy things. Or, we can throw stuff on top of it: it also has certain personal assistant functionality, like in the process of cleaning the house, it also checks your email and puts things on your calendar. We can keep throwing things into the basket. It’s also a thing that, if you give it certain verbal commands, helps you plan military invasions and things like that. I guess it’s not really clear to me why the process of just making a thing’s behavior more flexible, or making it able to engage in a wider range of behavior, cleaves apart this deep entanglement between the process of giving it goals and the process of making it capable.

Howie Lempel: Huh. Okay. So if we take a case where it’s just way harder to tell if it’s getting the goals that you want, you would still describe that as the goals and the capabilities are still entangled? How would you describe that case?

Ben Garfinkel: Yeah. I would still say in that case that the process of giving it goals and the process of giving it capabilities are entangled. I think that there can be some scenarios where it’s more likely you won’t notice that you’re very gradually imbuing it with goals that are pretty terrible. So I think for the housecleaning case, it’s relatively clear if you tried to use the dust minimization reward function, it’s going to be this gradual process of like, “Oh, now it’s starting to tear up our couches to find dust and stuff like that”. It’s going to be clear a priori and it’s going to be increasingly clear through the process that this is bad. You’re going to notice it before it’s a superintelligent dust destroyer.

Ben Garfinkel: I think you can imagine more nuanced cases where maybe you’re doing something where the behavior is a lot harder to evaluate. The example that Paul Christiano, for example, has sometimes used is a city planning AI system. It seems to be structurally similar, where standard reinforcement learning with a hand-specified reward function just isn’t going to give you the ability to train an AI system to plan cities well. Or, doing supervised learning over examples of 100 big, well-planned cities is not going to be sufficient. It seems like to even have something that’s trying to do, let’s say, city planning, that you’d ever plausibly actually want to try and use, and that would look even remotely coherent in what it’s doing, you’d probably need some progress on alignment techniques. But lots of the downsides of, let’s say, certain city plan designs are maybe harder to see than certain downsides of, let’s say, a robot messing up the house. So this could be a case where, to some extent, alignment techniques serve as a bottleneck for trying to deploy any city planning system, but maybe there’s some unhappy valley where you’ve worked out alignment techniques well enough that it’s not just behaving incoherently. It’s actually acting like it’s trying to do something roughly in line with what a city planner would be trying to do. But it’s not really all the way there and you aren’t quite able to notice that it’s not all the way there: that it’s actually doing something that’s not fully on board, because it’s just so nuanced in terms of what the consequences are.

Entanglement and capabilities [01:37:26]

Howie Lempel: You’ve argued that the process of giving an AI goals and the process of increasing its capabilities are entangled, which blurs together the projects of working on alignment or goal-giving and the ones of increasing capabilities. What does this mean? What’s the upshot? Why is this important?

Ben Garfinkel: One thing that it means is that it at least somewhat turns down the urgency knob. If progress on alignment techniques is itself a bottleneck for creating systems and putting them out into the world that we would intuitively regard as very intelligent and have any inclination to actually use, then this means that there’s not this exogenous time pressure of, “Oh, we need to work out these techniques before this external event happens”, which I think smooths things out a bit. It also means, to some extent, that some of the conventional stories that people tell about how AI risk could happen, where you give this thing the goal of paperclip maximizing and then you’re caught off guard by it all of a sudden having this goal and behaving in this radically inappropriate way, look even more unlike what machine learning progress will probably look like than one might naively have assumed a priori.

Howie Lempel: Are there people who have good objections to this line of argument?

Ben Garfinkel: It’d be relatively hard to argue, at least against the idea that there’s this deep entanglement between giving a system goals and making it act in a way we’d intuitively regard as intelligent. Probably the main objection you could raise would be to say basically, “Yeah, there’s this entanglement, but even if there’s this entanglement, that doesn’t resolve the issue”. I think the way you’d need to frame this is that there’s a bit of this unhappy valley where, again, let’s say the city planner case. If we basically don’t have any progress on alignment techniques, we’re stuck using hard-coded reward functions or supervised learning. We’re just not going to use AI systems that were trained using those methods to try and plan cities. I think the thing that you need to think is that we’ll make some amount of progress basically on alignment techniques or what are currently called alignment techniques.

Ben Garfinkel: It’ll seem okay at the time. It will not be behaving incoherently. It will look like it’s trying to do the same thing that a normal city planner’s roughly trying to do, but we won’t be all the way there and then disaster could strike in that way. That’s the main avenue you could go through. A second related concern, which is a little bit different, is that you could think this is an argument against us naively going ahead and putting this thing out into the world that’s as extremely misaligned as a dust minimizer or a paperclip maximizer, but we could still get to the point where we haven’t worked out alignment techniques.

Ben Garfinkel: No sane person would keep running the dust minimizer simulation once it’s clear this is not the thing we want to be making. But maybe not everyone is the same. Maybe someone wants to make a system that pursues some extremely narrow objective like this extremely effectively, even though it would be clear to anyone with normal values that you’re not in the process of making a thing that you want to actually use. Maybe somebody who wants to cause destruction could conceivably plough ahead. So that might be one way of rescuing a deadline picture. The deadline is not when people will have intelligent systems that they naively throw out into the world. It’s when we reach the point where someone who wants to create something that, in some sense, is intuitively pursuing a very narrow objective has the ability to do that.

Howie Lempel: If I summed up part of the takeaway as the following… I want to see if I’m following it. I might say that there is an entanglement between capabilities research and research on alignment and goals, such that if you’re trying to make some project, so maybe it’s a robot vacuum, you’re both going to need some level of capabilities and you’re going to need some minimum level of alignment. There’s going to be an interaction loop between the two and you’re going to need both. It’s still possible that the level of alignment research that you would need in order for that to work out is not at the same level as you would end up needing, like being a hundred percent sure that it’s always safe. Or you can imagine systems where the amount of alignment that you need for it to 99% of the time work fine is not enough alignment. It’s like there’s still that possibility that you fall into that valley.

Ben Garfinkel: Yeah, actually one thing I think is probably important to clarify is that I’ve been talking a lot about misalignment risk. Basically, one way of defining misalignment risk is that an AI system does something that we intuitively regard as very damaging, and the best explanation for it is that it was competently pursuing a goal or set of goals that’s a bit different from the set of goals that we have. That’s definitely not the only way that an AI system can cause harm. You can also have a system cause harm through some means that’s not best thought of as misalignment. Let’s say a self-driving car veering off the road. That seems like not really a misalignment risk.

Ben Garfinkel: One way you can still definitely continue to have risks from systems, even if capabilities and goals are entangled, is a system basically just messing up in some weird way that doesn’t have to do with misalignment per se, but it’s not very robust. Imagine a housecleaning robot. It’s in a sense trying to do what a normal housekeeper would do, but sometimes house cleaners accidentally set houses on fire, for example. That’s one avenue you can go through and then the other one is that it can be the case that you’ve created something that has roughly similar goals. The sorts of goals you’d want it to be having, or it has goals which look functionally similar within some narrow environment, but actually you shift to some other environment and actually the fact that they’re subtly different becomes relevant. Those are two ways in which you could still arrive at some degree of risk even if you accept this point that these things are entangled.

Howie Lempel: We’ve now talked through the point that you raised, that there actually are some links between the process by which capabilities are instituted in an AI system and the process by which goals or alignment are engineered into an AI system. That might mean that the model where, at some particular point in capabilities, it’s necessary for the people working on alignment to make sure that they catch up doesn’t actually make that much sense, because in order to hit a certain point in capabilities, you need that to be enabled by AI being aligned as well.

Instrumental convergence [01:43:11]

Howie Lempel: Stepping back from that set of issues, you had a third set of concerns that you want to raise about the classic AI arguments.

Ben Garfinkel: So I suppose I have a third and last objection to the way that some of these classic arguments are sometimes presented. And it has to do with what’s sometimes known as the instrumental convergence thesis. The idea here is basically that for most goals you might have, if you pursue them sufficiently effectively, it will tend to lead you to engage in behaviors that most people would consider abhorrent. Again, to return to the paperclip thought experiment, you have this goal which is to maximize paperclip production. Given this goal, certain things seem to follow from it. Insofar as other people can prevent you from maximizing paperclips, that’s a threat to your ability to pursue the goal. So you have some instrumental reason to try to accrue power, to try to have influence over people, and to try and stop them from standing in your way.

Ben Garfinkel: You also have instrumental reasons to try and acquire resources in general. Anything that might be useful for making paperclips and the act of pursuing power and pursuing resources without regard to things besides paperclip production will probably lead you to do things that most people would consider abhorrent. Same thing for if you imagine pushing a dust minimizer case really far forward. If people are going to shut you down before you can get the last little bit of dust out of the crawlspace or something, then you have reason to try and have power over these people or trying to acquire resources that can prevent you from being shut down. If you push it, to the extreme level, if you are trying to accrue resources, you’re trying to accrue power and you have relatively narrow goals and you’re sufficiently effective at doing what you’re trying to do, then it seems in loads of different cases, you’ll be doing something abhorrent. That’s the basic idea of instrumental convergence.

Howie Lempel: That’s the concept of instrumental convergence and what role does it play in the classical arguments?

Ben Garfinkel: There is an implicit argument structure that goes something like this: we can think of any advanced AI system as, in some sense, pursuing some set of goals quite effectively. Most sets of goals it could have, given that it’s pursuing them very effectively, have the property that they imply the system ought to be engaging in really abhorrent behaviors. Let’s say maybe harming humans to prevent humans from shutting the system down. And therefore, because we’re eventually going to create systems that can pursue goals very effectively, and most goals when pursued sufficiently effectively imply abhorrent behaviors, there’s a good chance that we will eventually create systems that engage in really abhorrent behaviors.

Ben Garfinkel: To illustrate a bit how this works, there’s a quote that appears in a number of different Eliezer Yudkowsky essays, where he says for most choices of goals, instrumentally rational agents will predictably wish to obtain certain generic resources such as matter and energy. The AI does not hate you, but neither does it love you. And you are made of atoms that can be used for something else. Another presentation, from a MIRI talk, is that most powerful intelligences will radically rearrange their surroundings because they’re aimed at some goals or other, and the current arrangement of matter that happens to be present is not the best possible arrangement for suiting their goals. They’re smart enough to find and access better arrangements. That’s bad news for us because most of the ways of rearranging our parts into some other configuration would kill us. So the suggestion again seems to be that because most possible, in some sense, very intelligent systems you might create will engage in perhaps omnicidal behaviors, we should be quite concerned about creating such systems.

Howie Lempel: Is it true that we should assume that if there is an AI and it’s smart enough and it has a goal it’s going to start going after us because we are all threats to turning it off?

Ben Garfinkel: The main objection that I have to this line of thought is that if you’re trying to predict what a future technology will look like, it’s not necessarily a good methodology to try and think, “Here are all of the possible ways we might make this technology. Most of the ways involve property P so therefore we’ll probably create a technology with property P”. Just as some simple silly illustrations: most possible ways of building an airplane involve at least one of the windows on the airplane being open. There’s a bunch of windows. There’s a bunch of different combinations of open and closed windows. Only one involves them all being closed. It’d be bad to predict that we’d build airplanes with open windows. Most possible ways of building cars don’t involve functional steering wheels that a driver can reach. Most possible ways of building buildings involve giant holes in the floor. There’s only one possible way to have the floor not have a hole in it.
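
The airplane-window illustration above can be made concrete with a small counting sketch. It assumes, purely for illustration, that each window is independently either open or closed, so a plane with n windows has 2^n configurations, exactly one of which has every window closed. The function name is hypothetical.

```python
from itertools import product


def fraction_with_open_window(num_windows):
    """Fraction of open/closed configurations with at least one open window."""
    configs = list(product(("open", "closed"), repeat=num_windows))
    with_open = [c for c in configs if "open" in c]
    return len(with_open) / len(configs)


# The share of "possible designs" with an open window approaches 1 as the
# number of windows grows, yet nobody would predict real planes from that.
for n in (1, 4, 10):
    print(n, "windows:", fraction_with_open_window(n))
```

The count says nothing about which designs engineers actually select, which is the point of the objection.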

Ben Garfinkel: It seems to often be the case that this argument schema doesn’t necessarily work that well. Another case as well is that if you think about human evolution, there’s, for example, a lot of different preference rankings I could have over the arrangement of matter in this room. There’s a lot of different, in some sense, goals I could have about how I’d like the stuff in the room to be. If you really think about it, most different preferences I could have for the arrangement of matter in the room involve me wanting to tear up all the objects and put them in very specific places. There’s only a small subset of the preferences I could have that involve me keeping the objects intact, because there are a lot fewer ways for things to be intact than to be split apart and spread throughout the room.

Ben Garfinkel: It’s not really that surprising that I don’t have this wild destructive preference about how, let’s say, the atoms in this room are arranged. The general principle here is that if you want to try and predict what some future technology will look like, maybe there is some predictive power you get from thinking about how X percent of the ways of doing this involve property P. But it’s important to think about the process by which this technology or artifact will emerge. Is that the sort of process that will be differentially attracted to things which are, let’s say, benign? If so, then maybe that outweighs the fact that most possible designs are not benign.

Howie Lempel: Got it. Then in this particular case of AI development, do we have a good reason to think that researchers will be attracted to the possible AI systems that are more benign?

Ben Garfinkel: This connects a lot to the previous two objections that we talked about. One reason is I expect things to be relatively gradual. Insofar as we’re creating systems that are not super benign or are harmful, I expect us to notice that we’re going down that path relatively early, just by noticing various issues, like maybe AI systems are engaging in deceptive behaviors or AI systems are being nonrobust in important ways. So, at least assuming some degree of continuity, we’re likely to have feedback loops which would be helpful for steering us towards things that we want. And I think the last point we discussed as well, the entanglement of capabilities and goals, makes me think that we won’t necessarily end up accidentally creating a capable but misaligned system in lieu of the ability to create an aligned and capable one.

Ben Garfinkel: There’s not necessarily a strong reason to think that we’ll go ahead and create something that’s a dust minimizer or paperclip maximizer that’s quite capable, just because we don’t know how to make something that’s both aligned and capable. So I think those two considerations both push in the direction of the engineering processes we’re following being of the sort that we won’t end up crazy, crazy far away from something that’s doing roughly what we want it to be doing. There’s some intuition that the gap between something that’s going around and, let’s say, murdering people and using their atoms for engineering projects and something that’s doing whatever it is you want it to be doing seems relatively large. It seems you’ve probably missed the target by quite some way if you’re out there.

Howie Lempel: One of the things that the traditional instrumental convergence argument was arguing is that this space is very dense with really, really bad scenarios. So if you missed by a little bit, you’d end up in an awful scenario. It seems you think that the ones where the AI takes all of our atoms are not going to be close at all to the ones where the AI successfully cleans our room?

Ben Garfinkel: Yeah, I think this isn’t definitely the case, but one intuition here is we think about machine learning systems using neural nets and you imagine tweaking some of the parameters. I’m not an expert, but my impression is that you typically won’t get behaviors which are radically different or that seem like the system’s going for something completely different. In some senses, it’s almost the premise behind current machine learning methods is that you can make some small tweak to the parameters of a model, see how it goes and it won’t be crazy different. It won’t be off the wall. It will be close to what it’s already doing. You can say, is it a little bit better? Is it a little bit worse? Then you follow your way down the path towards something that’s doing what you want it to be doing in these small increments. It does seem it’s a property of machine learning today, at least at the weight level or the parameter level that small tweaks don’t lead to systems which are doing something that’s radically, radically different. Although I should also say that I’m definitely not an expert on issues around robustness.
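
The "small tweak, check, keep if better" picture described above can be sketched in a few lines. This is a deliberately simplified illustration, not how neural nets are actually trained: it uses random-search hill climbing instead of gradient descent, and a two-parameter linear model stands in for a network. The data, step size, and function names are all assumptions chosen for the example.

```python
import random


def model(weights, x):
    """A tiny linear model standing in for a neural net (an assumption)."""
    return weights[0] * x + weights[1]


def loss(weights, data):
    """Mean squared error of the model over (x, y) pairs."""
    return sum((model(weights, x) - y) ** 2 for x, y in data) / len(data)


def hill_climb(data, steps=5000, step_size=0.05, seed=0):
    """Repeatedly make a small random tweak and keep it only if loss drops."""
    rng = random.Random(seed)
    weights = [0.0, 0.0]
    current = loss(weights, data)
    for _ in range(steps):
        candidate = [w + rng.uniform(-step_size, step_size) for w in weights]
        candidate_loss = loss(candidate, data)
        if candidate_loss < current:
            weights, current = candidate, candidate_loss
    return weights, current


# Made-up target behaviour: y = 2x + 1 on inputs between 0 and 1.
data = [(i / 10, 2 * (i / 10) + 1) for i in range(11)]
weights, final_loss = hill_climb(data)
print("weights:", weights, "loss:", final_loss)
```

For this linear stand-in, a tweak of at most 0.05 to each weight changes the output on inputs in [0, 1] by at most 0.1, which is the sense in which each step stays close to what the model was already doing. Whether large networks share this property in general is exactly the robustness question the speaker flags.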

Howie Lempel: Would we be more worried if we both tweaked one of the ML systems and then also changed the context it’s operating on?

Ben Garfinkel: Changing the environment that an AI system is operating in can definitely have a substantial impact on its behavior as well. There’s this general concern that AI systems often are not robust to changes in distribution. And what that means in concrete terms is that if you develop an AI system by training it on experiences or data pertaining to a certain type of environment or a certain type of scenario, and then you expose it to an environment or scenario that’s very different from any of the ones it was trained on, then its behavior might look fairly different and especially its behavior might look a lot more incompetent, or foolish, or random. So to put this in concrete terms, if you develop, let’s say a self-driving car, by training it on lots of data pertaining to let’s say, good road conditions, and then you take your self-driving car and you put it in off-road conditions, or you put it on, let’s say an ice road or something that’s very different from any of the roads it was trained on.

Ben Garfinkel: Then there’s a heightened chance that it’ll do something like veer off the road. Or if you train a text completion system on English language texts. So you train a system that’s meant to predict the second half of a sentence, let’s say, from the first half, given only English sentences, and then after it’s trained, you give it the first half of a sentence in Spanish, there’s obviously a heightened chance that it won’t really know what to do with that. It probably won’t give a very sensible output, in response to this Spanish half sentence. So it definitely is the case that putting an AI system in a new type of environment can substantially change its behavior. At the moment, though, what this tends to look like is just sort of incompetent behavior, or foolish behavior, or random behavior, as opposed to behavior, which is very, let’s say coherent and competent and potentially concerning.
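
A toy version of the distribution-shift point above: fit a simple model on inputs from one range, then evaluate it on inputs from a very different range. The setup is entirely hypothetical (a straight line fitted to a curve, with made-up ranges) and is only meant to show error growing outside the training distribution, not to model any real system.

```python
def fit_line(points):
    """Ordinary least-squares fit of y = a * x + b to (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in points)
    var = sum((x - mean_x) ** 2 for x, _ in points)
    a = cov / var
    b = mean_y - a * mean_x
    return a, b


def true_function(x):
    """Hypothetical ground truth the model is trying to approximate."""
    return x * x


def mean_abs_error(a, b, xs):
    """Average absolute gap between the fitted line and the ground truth."""
    return sum(abs(a * x + b - true_function(x)) for x in xs) / len(xs)


train_xs = [i / 10 for i in range(11)]        # training range: 0.0 to 1.0
shifted_xs = [5 + i / 10 for i in range(11)]  # shifted range: 5.0 to 6.0

a, b = fit_line([(x, true_function(x)) for x in train_xs])
in_dist = mean_abs_error(a, b, train_xs)
out_dist = mean_abs_error(a, b, shifted_xs)
print("error on training range:", in_dist)
print("error on shifted range: ", out_dist)
```

The line is a reasonable approximation where it was fitted and badly wrong elsewhere, which matches the "incompetent rather than coherently misdirected" failure the speaker describes for current systems.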

Ben Garfinkel: It should be said though, that a number of researchers, especially over the past year, and especially at the Machine Intelligence Research Institute, have been developing this concept that they call mesa-optimization. And I won’t really get into the concept here. But one of the ideas which is associated with the concept is this thought that, as machine learning systems become more sophisticated in the future, maybe when they’re exposed to new environments, they won’t just behave in a way that looks sort of incompetent or foolish to us. They might behave in a way that still looks competent and goal-directed and intelligent, but nonetheless, the systems might act like they’re pursuing a different goal than it seemed like they were pursuing in the initial class of environments that they were trained on. So if that’s true, then that would be something that would exacerbate the risk that maybe you can get out fairly different goal-directed behavior than the behavior you saw during training.

Howie Lempel: You’ve now argued that even if it is the case that most theoretical AIs one could imagine would have these instrumentally converging goals, we still wouldn’t be in a place where we should expect it to be too likely that we end up with some omnicidal AI because engineers are going to use processes that are more likely to land on the things that we want than the things that we don’t want. Are there counterarguments or reasons that we still have to be concerned?

Ben Garfinkel: There are probably a few different classes of counterarguments. One might be that we have these, again, toy thought experiments we talked about earlier that might be in some way suggestive. We have all these thought experiments where you give an AI system, in some sense, a goal that corresponds to maximizing paperclip production and that initially seems okay to you. Then you realize that if you follow out the implications then it’s quite negative. For example, there’s a 10 page stretch in Superintelligence where Nick goes through a bunch of English language commands that at first glance seem benign, things like “make us happy”. And then he describes how, even though these seem benign, if you actually interpret them somewhat literally and single-mindedly try to fulfill the objective, then a bad thing happens.

Ben Garfinkel: Sometimes these sorts of thought experiments are used to suggest that sometimes something that seems benign to you with regards to an AI system’s behavior will turn out not to be so benign. It’s hard to tell what’s what.

Ben Garfinkel: The response I have to at least that line of thinking is I still don’t necessarily, I suppose, buy the idea that these thought experiments tell us that much about what’s likely to happen in the context of AI systems. Often it seems to be… The point is that if you take an English language sentence, and then you give an agent the goal of interpreting that sentence literally and single-mindedly without any common sense, and then trying to fulfill that literal, single-minded interpretation of the sentence as competently as possible, then you get bad outcomes out.

Ben Garfinkel: I still have trouble mapping it on to anything we might see in the context of AI development because obviously, we all mostly agree that if we’re training the AI systems, it’s going to be based probably on some machine learning thing or something that’s more code-based as opposed to feeding in English language sentences. Even if you’re trying to think of a close analogy, so let’s say we’ve created an AI system that takes English language commands and it goes off and engages in some sort of behavior, it’s hard to imagine how we could have done that such that it tries to do the literal minded thing as opposed to having commonsensical responses to requests. If you imagine we’re training it on, let’s say, data sets of people receiving English language commands and then responding to them, obviously the responses will be responses that go to the commonsense interpretation.

Ben Garfinkel: Someone being asked to, on camera, help maximize a paperclip factory’s production isn’t going to be doing the literal-minded thing. Or if you trained it on, for example, Explorers movies or something like that, where people do receive requests and then horribly misinterpret them, then obviously you can get that behavior out. But it seems like, whatever the relevant data is, you presumably wouldn’t get that sort of thing. Or you might imagine training it on human feedback where you give an AI system some verbal command and it does something. Then you say how much it fulfilled what you wanted, or how happy you are with the behavior exhibited once you gave it the command. Presumably you’re rewarding it for stuff that exhibits commonsense interpretations of your requests as opposed to literal-minded ones.

Ben Garfinkel: So again, if we’re trying to imagine a close, concrete analog of an AI system that takes English language requests and then does stuff, presumably it’s not engaging in this literal-minded behavior. Obviously no-one is suggesting that, again, AI design will really look like this, but I have this, I suppose, issue that whenever I try to map these thought experiments onto something that might actually happen in the context of AI development, I don’t understand what the mapping is meant to be or what this actually suggests about AI design in practice.

Howie Lempel: The thought experiments are one response to your argument. What are some of the others?

Ben Garfinkel: Probably actually the one I imagine most people would make today, or at least most people in certain sections of the AI safety community, would be one that has to do with this idea of mesa-optimization that I briefly alluded to before, and I actually still don’t properly understand this.

Ben Garfinkel: There’s the white paper that came out a few months ago. It’s the first rigorous presentation of it and I think the idea is still, to some extent, being worked out. But hopefully this is relatively accurate. The basic idea is that, let’s say that you’re training a machine learning system and you’re giving it some feedback. Again, let’s return to the housecleaning robot case. You’re giving it feedback that’s meant to correspond to good housecleaning behavior. Intuitively, you’d expect that insofar as the system takes on goals, its goals will map on pretty well to the sorts of feedback you’re giving it. If you’re rewarding it for cleaning a house well, then you should expect the goal that develops to essentially be to clean a house well. Then if you keep doing this, it will, to a greater and greater degree, have a high-fidelity understanding of the goal that it pursues.

Ben Garfinkel: The idea with mesa-optimization is that there might be some weird phenomenon that happens during the training process where systems take on goals that are almost, in a sense, perhaps fairly random or quite different than whatever it is that the feedback you’re giving corresponds to. And that the systems might have these almost random or quite strange goals and at some point in the future be sophisticated enough to know that they should keep the goals that they actually have secret. Because if they reveal that they have these goals, then people won’t use the systems. So they play along as though they have the expected goals, but then they have these randomly generated ones. I don’t understand the intuitions for this that much.

Ben Garfinkel: The main analogy that people use to try and drive this is if you think of the case of natural selection, obviously the thing that people were basically being selected for historically is reproductive fitness. Like how good are you at passing your genes to the next generation? You can think of natural selection as this feedback process that’s, in some way, a little bit analogous to machine learning.

Ben Garfinkel: And the interesting thing is that humans today can be said to have certain goals, but these goals are not “Maximize your own reproductive fitness”. People have other goals. In fact, people willingly choose to use birth control or to sacrifice their lives for causes greater than themselves and things like this.

Ben Garfinkel: People have ended up with this somewhat strange and diverse suite of goals rather than the single goal of reproductive fitness which was the thing the evolutionary feedback process was selecting for. People use this analogy to say we should expect future AI systems to also, in a similar way, end up with a strange suite of goals rather than just the thing we’re training them for. And something happens where they hide these and we don’t realize it. Then these randomly chosen goals might, by the instrumental convergence hypothesis, be of the sorts that lead them to do horrible things.

Howie Lempel: In the case of evolution, the goals that we ended up with at least seem probably still correlated with reproductive fitness and in the context that evolution was training us for are probably even more highly correlated with reproductive fitness. It feels like they’re not randomly chosen, but actually are at least pointing us towards the original underlying goal. In the case of AI, how do they end up prioritizing these other goals that don’t even seem to be pointing them towards the thing that we actually want?

Ben Garfinkel: Yeah, this connects to what we previously discussed a little bit, this idea of the significance of distributional shift when you encounter some experience or some environment that’s different than the one that you’ve been trained in. Some of the idea here is that during the training process, the system has these weird goals that aren’t exactly the thing that you’d expect, but functionally speaking, the most effective way to pursue those goals is basically the same as the most effective way to pursue the goal that you would have expected it to have, or that you wanted it to have. So these things pretty much line up. So it’s not that big a deal.

Ben Garfinkel: Then you place it in a new environment that’s pretty different than the one that you trained it in and then in this new environment the behaviors suggested by the goal diverge in a way that they didn’t previously.
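As a rough illustration of the point above, here is a toy sketch in Python. Everything in it is hypothetical and invented for illustration (the room setup, the action names, the numbers, and both scoring functions); it is not drawn from the mesa-optimization paper or from any real system. It only shows how a proxy goal ("less visible dirt") and the intended goal ("less dirt overall") can recommend the same action in a training environment and different actions once a new option appears.

```python
from typing import Callable, Dict, NamedTuple


class Outcome(NamedTuple):
    """Hypothetical result of taking one action in a room."""
    visible_dirt: int  # dirt an overseer could see afterwards
    total_dirt: int    # dirt actually left in the room
    effort: int        # cost of the action to the robot


def intended_score(o: Outcome) -> int:
    """The goal the designers meant: little dirt overall, little effort."""
    return -o.total_dirt - o.effort


def proxy_score(o: Outcome) -> int:
    """An assumed learned proxy: little *visible* dirt, little effort."""
    return -o.visible_dirt - o.effort


def best_action(actions: Dict[str, Outcome],
                score: Callable[[Outcome], int]) -> str:
    """Pick the action whose outcome scores highest under the given goal."""
    return max(actions, key=lambda name: score(actions[name]))


# Training environment: no rugs, so hiding dirt is not an option.
TRAINING_ROOM = {
    "do_nothing": Outcome(visible_dirt=5, total_dirt=5, effort=0),
    "vacuum": Outcome(visible_dirt=0, total_dirt=0, effort=2),
}

# New environment: same room plus a rug the dirt can be swept under.
NEW_ROOM = {
    **TRAINING_ROOM,
    "sweep_under_rug": Outcome(visible_dirt=0, total_dirt=5, effort=1),
}

if __name__ == "__main__":
    for label, room in (("training", TRAINING_ROOM), ("new", NEW_ROOM)):
        print(label,
              "| intended goal picks:", best_action(room, intended_score),
              "| proxy goal picks:", best_action(room, proxy_score))
```

In the training room the two goals are behaviorally indistinguishable, so feedback on behavior alone could not tell them apart; the difference only shows up after the shift.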

Howie Lempel: Got it. Then is there a reason why that agent would end up learning these subgoals that happened to overlap with what we want in the training distribution instead of learning the actual goal that we’re trying to teach it?

Ben Garfinkel: This is something I don’t actually understand all that well. In the context of evolution, it seems you can tell a little bit of a story of why it is that humans have all these other goals rather than having a goal of “have as many kids as possible”, which seems like what would be evolutionarily selected for. One potential story is that not that far back in human evolution, we probably didn’t have the cognitive capacity to actually represent this abstract idea of passing on your genes, which is obviously the thing you most want.

Ben Garfinkel: You also care if your siblings pass on their genes because they’re statistically likely to have similar genes as you. It’s pretty clear a chimpanzee can’t represent or understand this abstraction. It’s not useful or even feasible for a chimpanzee to have the goal of maximizing the probability that its genes will carry forward into the future. Insofar as you have goals, it makes sense that they should be things that your brain can actually represent to some extent, which might be “have more food”, “have sex”, and things like that.

Ben Garfinkel: In the context of evolution, you might imagine that if you let natural selection run for a longer period of time now that we have the cognitive abilities to represent the concept of passing on your genes, we know genes exist and stuff like that, maybe you’d expect there to be some answer that washes out. You run the selection process long enough and then eventually we’ll actually end up with people who have the goal that was being selected for.

Ben Garfinkel: In the context of machine learning I don’t have a great sense of how analogous or disanalogous this is. It seems you could run through some similar analogy where there are certain goals that you might want to train AI systems to, in a sense, have, but the goal is, in some sense, quite abstract. It’s like “maximize the enactment of justice” or something like that. So if you want an AI system that helps you achieve just outcomes in terms of criminal justice sentencing or something, you can’t actually create an AI system that has that goal effectively in a nuanced way. You need to create an AI system that has some cruder set of proxy goals that functionally get it to do the same thing as though it had this more nuanced one. Maybe that’s an issue that’s in some sense recurring, but I really am quite vague in terms of what this in practice looks like or how the analogy works.

Howie Lempel: Is there an efficiency argument or something where if in the training distribution, there are these instrumental goals that keep getting it exactly what it wants, then in the training distribution, it won’t make sense for it to think from first principles over and over again what is the best way to get the actual reward? So it might be a reason why they would evolve to prioritize those instrumental goals over the more fundamental thing.

Ben Garfinkel: Yeah. That’s another good reason to be sympathetic to this argument as well. Let’s imagine that you were living your life and the entire thing you’re thinking throughout your life was how to maximize the passing on of your genes or something. Or let’s say you’re a utilitarian. It’s pretty well accepted that if you want to maximize global happiness or something, probably the right way to live is not to constantly, for every decision you make, try and run this very expensive calculation. In terms of the flow-through effects of me deciding to mow my lawn or buy a red shirt versus a blue shirt, probably you don’t actually want to be very directly trying to figure out what actions fulfill this goal or don’t. It’s probably more useful to have lots of heuristics or proxy goals or rules of thumb that you use.

Ben Garfinkel: That’s definitely one intuitive argument for why you might end up with a divergence: maybe there are just lots of simpler goals where it’s easier to think about how good a job you’re doing at fulfilling them. These goals are much simpler, so it’s less expensive, and also they mostly overlap so it’s pretty fine to have these goals as well.

Howie Lempel: What might objections be to this mesa-optimizer argument?

Ben Garfinkel: I’ll list a few different ones. One is just the classic really general… I think this argument still hasn’t been laid out that fully or clearly. In some sense it’s come onto the scene recently in the past year in this online paper. And it’s still a little bit tenuous as an argument.

Ben Garfinkel: It’s not exactly clear that the concepts work or what the nature of the mapping is between evolution and machine learning. That’s, I suppose, a generic one that doesn’t target it very effectively, but it’s some initial consideration.

Ben Garfinkel: More specifically, one response you might have is that, as we discussed, it seems one possible view is yeah, this thing will maybe happen where AI systems will have goals that are different than the ones you expect or different than the thing you’re directly selecting for. Also the main thing that’s driving this is that an insufficiently sophisticated system can’t necessarily appropriately represent the goals you want to be training it for. If you keep training it long enough, then eventually you pass through this unhappy valley where it’s sophisticated enough to mess around, but not sophisticated enough to represent the concept, and eventually you just converge on a system that’s trying to do the thing you want.

Ben Garfinkel: That’s one possible view, which is basically: if you train a thing for long enough, or if you train a thing on enough different environments to cover issues around robustness and distributional shift, or you run adversarial training where you intentionally train it on environments which are supposed to maximally reveal possible divergences, maybe this thing sort of washes out.

Ben Garfinkel: The last one, which probably is most relevant for me because I don’t have super firm views in terms of how likely this is or what the solutions to it are, is a general point. If you expect progress to be quite gradual, if this is a real issue, people should notice that this is an issue well before the point where it’s catastrophic. We don’t have examples of this so far, but if it’s an issue, then it seems intuitively one should expect some indication of the interesting goal divergence or some indication of this interesting new kind of robustness or distributional shift failure before it’s at the point where things are totally out of hand. If that’s the case, then people presumably or hopefully won’t plough ahead creating systems that keep failing in this horrible, confusing way. We’ll also have plenty of the warning we need to work on solutions to it.

Treacherous turn [02:07:55]

Howie Lempel: So there’s a set of counterarguments usually grouped under the phrase, “Treacherous turn”. Do you want to explain what’s going on there?

Ben Garfinkel: Yeah, so this is one potential counter that I think you might make to the thing I just said of like, “Oh, well if mesa-optimization is a problem, but stuff is gradual, then you should notice that this is an issue, and you should have time to work on it, and also not plough ahead blindly being oblivious to this”. And treacherous turn arguments are just basically arguments that systems have incentives to hide the fact that their goals diverge from the ones you’d like them to have.

Ben Garfinkel: That’s insofar as they think that, if you knew the goals that they had were different than the ones you have, you’d try and shut them down and change them. And so the suggestion here is sort of like, well maybe actually you wouldn’t notice that this is an issue because any system that has goals that are importantly different due to mesa-optimization will have incentives to hide that fact because it doesn’t want to be shut down and wants to be left free to pursue that goal. And so it will hide that. And you won’t really notice that mesa-optimization is an issue until it’s too late.

Ben Garfinkel: I think just basically, yeah. I guess the response here, which is similar to something I talked about earlier in the podcast is if you do imagine things would be gradual, then it seems like before you encounter any attempts at deception that have globally catastrophic or existential significance, you probably should expect to see some amount of either failed attempts at deception or attempts at deception that exist, but they’re not totally, totally catastrophic. You should probably see some systems doing this thing of hiding the fact that they have different goals. And notice this before you’re at the point where things are just so, so competent that they’re able to, say, destroy the world or something like that.

Howie Lempel: And does the treacherous turn argument depend on the system being deceptive, or could it just be that it has goals that we’re perfectly happy with in the context of systems at one level of capabilities, but then there sort of is a distributional shift when you move to a higher level of capabilities. So that same set of goals is something that we’re unhappy with.

Ben Garfinkel: So, I don’t think that the issue depends on, I guess, there being this intentional hiding of the divergence. The thing is, if there’s not intentional hiding of the divergence, it seems even dramatically more likely that you should notice this issue when you expose systems to changes in distribution. If there’s no intentional hiding of this, then it seems surely, if stuff is gradual, you’ll notice this kind of interesting way of responding to distributional shift before you’re at the point where individual systems can destroy the world. And so the system actually hiding it has this ingredient that’s sort of necessary to make it somewhat plausible that you can.

Howie Lempel: How about if things are less gradual?

Ben Garfinkel: Yeah. I mean, I think just in general, if you turn up the discontinuity knob, then all of these considerations become weaker and weaker. So the more rapid things are, the less of a window of time there is to notice that there are issues, or the fewer systems that exist in the world that are quite significant, the fewer instances or opportunities to notice this phenomenon exists, or the fewer possible chances of catching something when deceptive behavior exists.

Ben Garfinkel: And also it’s like less time you have to work on or find solutions. Just in general, the more discontinuous you make things, the less certain it becomes that like, “Oh yeah, you’ll notice this issue in a weaker form”, or like, “Oh, you’ll have time to do stuff that’s relevant to this”.

Howie Lempel: I guess also it might be the case that if things are more discontinuous, then the more advanced system that you deploy will be more and more dissimilar from the previous ones.

Ben Garfinkel: Right. So it’s definitely less likely… it’s more likely any problem you see will have a close analog in the previous versions. And I guess just one last point to throw in as well that we discussed before, is if stuff’s more continuous, then probably there’s a lot more AI systems that exist in the world. And probably some of them are useful for constraining the behavior of others, or noticing issues or helping you do AI research in a not super messed up way.

Ben Garfinkel: And again, yeah, I think that this just really comes down to if stuff is sufficiently continuous and capabilities are sufficiently distributed. I have a lot of trouble understanding how this would happen: how you’d get to the point where an individual system is powerful enough to destroy the world and you haven’t already noticed that you have this issue with AI design where you’re creating things which are accidentally extremely deceptive in this way.

Ben Garfinkel: I think this is something that makes maybe more sense when you’re imagining something like the brain in the box scenario, where you have this thing that’s very different than anything you’ve ever had before that’s sort of sitting on a hard drive. You have no experience actually using this sort of thing out in the world. You maybe run some tests in a lab or something, and the thing is, it’s already… it’s jumped to clever in the way that humans are clever, and sort of hides what’s going on there. But when I imagine progress, it’s this very distributed thing. It’s very gradual. Then I have a lot of trouble basically understanding how this works, and how this sort of deception goes unnoticed.

Howie Lempel: What’s the argument that there should have been deception in earlier versions of AI where this would be the case? Like why should this deception have been noticed earlier?

Ben Garfinkel: I mean, so we already have, I guess, as discussed a bit earlier, some low-level versions of deception. Like this system, this robotic gripper system that pretended to be gripping a thing when really it wasn’t. And it seems like if this is just kind of a characteristic of AI research which is like, “Oh, accidentally when you’re training a system, this is a thing that happens quite frequently. It’s like, it has some other goal. And then it hides that from the overseer”.

Ben Garfinkel: That seems like if this is a likely enough characteristic, and you have loads and loads of systems out in the world, you have millions of various ML systems, with various uses and capabilities, it seems like you should be noticing that that’s a property of how you’re developing AI systems. I have trouble imagining a world where, let’s say if things unfold over decades and it’s this gradual expansion of capabilities, no one really notices that this is a serious issue up until the point where you live in a world where an individual AI system can sort of, let’s say, destroy the world on its own.

Howie Lempel: There’s something good you could say, which is you don’t start seeing lying until you start to get AIs that are capable enough that they’re able to with high confidence get away with their lies. Does that seem more possible to you?

Ben Garfinkel: I don’t really think so. So to some extent the gripper case is an example of a thing lying and not getting away with it. And even if, let’s say, things can get away with it, let’s say, 90% of the time, if you have loads and loads of systems, then eventually you’re going to see some that get caught. And yeah, I mean, maybe this is not a fair analogy, but children, for example, are examples of slightly cognitively impaired agents that lie a whole bunch of the time, even though they quite frequently get caught.
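To put rough numbers on the 90% figure mentioned above, here is a minimal sketch. It assumes, purely for illustration, that each deception attempt is caught or missed independently with the same probability; the function name and the sample attempt counts are hypothetical. Under that assumption the chance that at least one of n attempts is caught is 1 - 0.9^n.

```python
def prob_at_least_one_caught(n_attempts: int, p_undetected: float = 0.9) -> float:
    """Chance that at least one of n independent deception attempts is caught.

    Assumes (for illustration only) that every attempt goes undetected with
    the same probability p_undetected, independently of the others.
    """
    if n_attempts < 0:
        raise ValueError("n_attempts must be non-negative")
    if not 0.0 <= p_undetected <= 1.0:
        raise ValueError("p_undetected must be between 0 and 1")
    # All n attempts go undetected with probability p_undetected ** n.
    return 1.0 - p_undetected ** n_attempts


if __name__ == "__main__":
    for n in (1, 10, 50, 100):
        print(f"{n:>4} attempts: P(at least one caught) = "
              f"{prob_at_least_one_caught(n):.4f}")
```

The independence assumption is the weak point: if systems only attempt deception when they are very likely to succeed, the per-attempt figure would not stay fixed at 90%, which is the scenario the preceding question raises.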

Ben Garfinkel: One thing you probably learned to do was actually like learn how to lie. Probably it’s just something that you don’t get good at until you’ve actually had experience lying, I would imagine. I think there’s also this other sort of objection to the treacherous turn idea, which is that it seems to be this story where if we end up with a system that is in a position to destroy the world, and it has a goal to do something that implies it should destroy the world, then we may have trouble running tests and noticing that. But it’s not in and of itself an explanation of why we should even have ended up in that scenario in the first place. Like why we were in this world where we have a thing that wants to destroy the world and is that capable as well?

Howie Lempel: So at most it says it would be very difficult to actually test it, because in the lab it will behave in whatever way is best for achieving its goals, which does not involve doing anything terrible.

Ben Garfinkel: Yeah. I think another element as well… definitely think this is like a less robust point is again, if you’re imagining at least the fairly continuous scenario, then at least in that world maybe you have lots of tools that we don’t have today to describe the behavior of AI systems. You have lots of AI-enabled capabilities where you can use AI systems to sort of like suss out certain traits of other ones which might also help with the issue and make actually…

Ben Garfinkel: You know, I have no idea, but it’s not necessarily clear just because if an AI system drops on Earth today that was superintelligent, it would be hard to figure out if it was deceptive. But that doesn’t actually mean that in the world we’d be in the future, that it actually would be that hard to… we actually would lack the tools to figure out what’s going on there.

Where does this leave us? [02:15:19]

Howie Lempel: Okay. So we’ve now been through three different objections that you’ve had to some of the premises of the classic AI risk arguments. Do you want to take a step back and ask what this all means and figure out where this leaves us?

Ben Garfinkel: Yeah. So first of all, there’s this high-level landscape of arguments that AI poses other risks or opportunities that have long-run significance that warrant us really prioritizing it over other areas today. And there’s a few different obvious reasons for that, a lot of which, as we discussed earlier, don’t necessarily have to do with technical safety issues.

Ben Garfinkel: So AI being destabilizing, for example, doesn’t really hinge on these prickly arguments about AI safety posing an existential risk. And then within the context of safety or technical considerations, there’s a little bit of a landscape of arguments. So here I’ve been basically mostly focused on what I’m calling the classic arguments. It’s really like a class of arguments that was mostly worked out in the mid 2000’s, and that’s most prominently talked about in Superintelligence.

Ben Garfinkel: Though Superintelligence also talks about other things related to AI. It’s also, I think, a class of arguments that has been especially influential. I think a lot of people I know who took an interest in AI risk took an interest in it because they read Superintelligence or because they read essays by Eliezer Yudkowsky, or they read other things which are, in some sense, derivative of these. So we’ve sort of, in some sense, been talking about what I think are the most extensively described and most influential arguments.

Ben Garfinkel: But I think it’s also important to say that beyond that there are also other classes of arguments that people are toying with, or other arguments that people have gestured at. So, for example, I would list Paul Christiano as an example of an AI safety researcher who does reject a number of the premises that show up in the classic arguments.

Ben Garfinkel: For example, I know that he expects progress to be relatively continuous, and has expressed the view that you might expect AI to present an existential risk, even if stuff is relatively continuous, even if you don’t really have this strong separation between capabilities and goals. And then there’s also this other research, just in the past year, by MIRI researchers on this idea of mesa-optimization, which I sort of vaguely gestured at, which does go above-and-beyond the arguments that show up in Superintelligence and all the essays by Eliezer Yudkowsky. You also have people like Wei Dai who’s also written some blog posts expressing somewhat different perspectives on why alignment research is useful.

Ben Garfinkel: So you want to say there’s this big constellation of arguments that people have put forward. I think my overall perspective, at least on the safety side of things is that I basically, at this point, though I found them quite convincing when I first encountered them, now obviously have a number of qualms about the presentation of the classic arguments.

Ben Garfinkel: I think there’s a number of things that don’t really go through, or that really need to be, maybe adjusted or elaborated upon more. And on the other hand, there are these other arguments that have emerged more recently, but I think they haven’t really been described in a lot of detail. A lot of them really do come down to a couple of blog posts written in the past year. And I think if, for example, the entire case for treating AI as an existential risk hung on these blog posts, I wouldn’t really feel comfortable, let’s say, advocating for millions of dollars to be going into the field, or loads and loads of people to be changing their careers, which seems to be happening at the moment.

Ben Garfinkel: And I think, yeah. I think that basically we’re in a state of affairs where there’s a lot of plausible concerns about AI. There’s a lot of different angles you can come from for saying this is something that we ought to be working on from a long-term perspective, but at least I’m somewhat uncomfortable about the state of arguments that have been published. I think the things that are more rigorous or more fleshed out, I don’t really agree with that much. And the things that I may be more sympathetic to just haven’t been fleshed out that much.

Howie Lempel: Just getting down to it a bit more precisely. You found some objections to the Superintelligence classic arguments. How decisive do you think those are? How important to the story are they? Like, they might just be making the story more plausible, and it’s like, “Oh, well now there are a bunch of different ways that the story could go wrong”, versus they might just mean there’s no reason at all to think that we’re going to end up in this scenario. I’m going to ask something like how much weight are you putting on these?

Ben Garfinkel: Yeah. So I think maybe a rough analogy is like, imagine that a proof has been published for some mathematical theorem and you look at the proof and you notice, “Well, okay. So here’s an assumption that’s used in the proof. And actually, I don’t think that assumption is right”. And then here’s this place where some conclusion’s derived, but actually it’s not really clear to me. There’s some missing steps, maybe. The author had a clear sense of what the steps were, but at least me as a reader, I can’t quite fill them in.

Ben Garfinkel: And so I think if you find yourself in a position like that, with regard to mathematical proof, it is reasonable to be like, “Well, okay. So like this exact argument isn’t necessarily getting the job done when it’s taken at face value”. But maybe I still see some of the intuitions behind the proof. Maybe I still think that, “Oh okay, you can actually like remove this assumption”. Maybe you actually don’t need it. Maybe we can swap this one out with another one. Maybe this gap can actually be filled in.

Ben Garfinkel: So I definitely don’t think that it’d be right in this context to say like, “Oh, I have qualms. I think there are holes. I think there are assumptions I disagree with, therefore the conclusion is wrong”. I think the main thing it implies though, is that we’re not really in a state where, at least if you accept the objections I’ve raised, we really have good, tight, rigorous arguments for the conclusion that AI presents this large existential risk from a safety perspective.

Ben Garfinkel: I think basically what that essentially means in practice is that we ought to be putting a lot of resources into trying to figure out: is there a risk here? What does the risk look like? Anytime that someone raises a concern that something could pose an existential risk, you ought to take that extremely seriously, even if you don’t necessarily think that the argument is right. But at the same time, you shouldn’t necessarily, I suppose, treat it as settled, or necessarily commit to putting too many resources into the area before you’ve really worked out the structure of the arguments.

Ben Garfinkel: Yeah. Just as a quick analogy, you can imagine in the context of an issue like climate change, or pandemic preparedness, we’d obviously feel quite worried if the argument for climate change being a real phenomenon hinged on a couple of blog posts or conversations of generalists, or one book that a lot of people have sort of moved away from. We’d obviously view that, in lots of areas, as a not really satisfactory state of affairs. And so I think probably we should have the same attitude in the context of AI.

Howie Lempel: How much less weight do you put on the AI risk argument now than you did a year and a half ago or something?

Ben Garfinkel: I would say probably a lot less. I don’t know exactly what that means. So I first became interested basically in the area by actually reading Superintelligence. Actually, at the time finding it quite… at least, there’s a lot of stuff in there on basically different ways that AI could be important from a long-term perspective. But at least I found the arguments about some single agent causing the havoc quite convincing.

Ben Garfinkel: And that was my main reason basically for getting interested in the topic and thinking like, “Oh wow, this could be a really, really large thing”. And yeah, I don’t really know exactly how to put a number on it, but I think maybe it’s roughly an order of magnitude drop in terms of how likely I think it is that there will be existential safety failures. My level of interest and concern about other pathways by which AI could be significant from a long-run perspective has, I think, gone up a decent amount, but I think not enough to compensate for the drop in my credence in existential accidents happening.

Ben Garfinkel: And that’s obviously, for the most part, like a super good thing. Like an order of magnitude drop in terms of your credence in doom is obviously pretty nice.

Howie Lempel: Are the classical arguments doing any work for you at all at this point? If you learned actually, there’s just nothing worth writing a book on anymore in Superintelligence. Do you think that that would much change your views?

Ben Garfinkel: I don’t think it would do that much. I think the weight of my concern about existential safety failures really does come down to this really general high-level argument that I sketched at the beginning of just that we know there are certain safety issues. It seems relatively likely these advanced systems will be crazy powerful in the future. Maybe the danger scales up in some way.

What role should AI play in the EA portfolio? [02:22:30]

Howie Lempel: Yeah. So you started to talk a little bit about how much sense it makes to put EA resources into AI safety, given that the field has changed a bunch. You’re working on AI work. Do you have an opinion about the ways in which this should change the role AI plays in EA’s portfolio?

Ben Garfinkel: So I think a view that I probably have at the moment is… Imagine choosing a topic to work on. There’s a big multiplier if you’re working on an area that you find intrinsically fascinating, or that you feel like your natural talent or skills are well set up to do. Or, if you already have really good networks in the area and a lot of background in it.

Ben Garfinkel: And so something that I feel pretty comfortable saying is, if someone is, let’s say, an ML researcher or they’re really interested in machine learning and they are really interested in sort of the long-term perspective in terms of choosing their career. I think it’s probably a really good decision for them to take an interest in alignment research and work on it.

Ben Garfinkel: And similarly for people working in the governance space who have a strong interest in computer science and machine learning, or who’ve already built up careers there. I think the point I would probably go to is, when it comes to people who have, let’s say, a number of different options before them, and maybe working on something having to do with AI governance or AI safety isn’t the most natural thing for them. Maybe they have a strong interest in like biology and they are maybe leaning towards working on biorisk.

Ben Garfinkel: Maybe they have a certain interest in, let’s say, the economics of institutions, or trying to think about better ways to design institutions or make government decisions go better. Or they’re working on reducing, let’s say, tail risks from climate change or something like that. I would probably be uncomfortable given the current state of arguments of encouraging someone to switch from the area where they feel a more natural inclination to work on AI stuff.

Ben Garfinkel: And I’m a bit worried that sometimes the EA community sends a signal that, “Oh, AI is the most important thing. It’s the most important thing by such a large margin that even if you’re doing something else that seems quite good that’s pretty different, you should switch into it”. I basically feel like I would really want the arguments to be much more sussed out, and much more well analyzed before I really feel comfortable advocating for that.

Howie Lempel: Do you know what it would mean for the arguments to be more sussed out?

Ben Garfinkel: Yeah, so I think there’s a few things. So one is just, as I sort of gestured, there are a number of alternative presentations of the risk that have appeared online in various blog posts or short essays that haven’t yet received a treatment that’s anywhere near as long or rigorous as, for example, Superintelligence. And obviously, a full length academic book is a high bar, but I do think it shouldn’t be that arduous to go beyond what exists right now.

Ben Garfinkel: So one example of a research trend I’d be really, really interested in is, for example, Paul Christiano has a blog post presenting an alternative framing of AI risk called, I think, “What Failure Looks Like”. And there’s a response to it, a criticism by the economist, Robin Hanson, basically saying that it seems like what this argument is doing is it’s taking this classic sort of economics problem of principal-agent problems, where the idea here, just for background, is there are often cases we want to delegate some tasks to some other worker.

Ben Garfinkel: Like rather than fixing your own health, you hire a doctor who’s meant to help diagnose you. Or rather than building a house yourself, you hire some builder. And sometimes the person you’ve hired has goals that are a little bit different than yours, or incentives that are a bit different than yours.

Ben Garfinkel: And you’re not in a perfect position to monitor their behaviors or figure out how good a job they’re doing, and sometimes issues arise there. The doctor will overprescribe pills, or a mechanic will ask you to pay for some servicing that your car doesn’t actually need. And the way Robin Hanson characterized Paul’s presentation of AI risk is it’s basically saying, “Take principal-agent problems. Imagine agents who are just way, way smarter than the agents we have today. That means we’re in a world where you have principal-agent problems where the agents are smarter. That’s a potentially disastrous world”. And Hanson’s criticism was that there’s nothing in the principal-agent literature that suggests it’s worse to have a smarter agent. It’s not necessarily worse to have a smarter doctor or smarter mechanic.

Ben Garfinkel: Paul responded with a critique of that critique of his arguments, and then things fizzled out to some extent. And, for example, both those people are extremely busy researchers in different fields, but I do think this is the sort of thing where this sort of argument shouldn’t just consist of a couple of blog posts that sort of fizzle out. If we’re really going to be, for example, using this framing as the main justification for treating misalignment as an existential risk, it’s worth someone actually like diving into the principal-agent literature, or trying to write it up in a way that if this is a mischaracterization of it, it’s clear that that’s a mischaracterization of it. It shouldn’t just sort of stop at the level of one 10 minute read blog post, and then one 5 minute read criticism of that blog post.

Howie Lempel: But the structure of this, if somebody wanted to go ahead and do this, would it be “Apply for an EA grant” or something? Then it’d be a person who’s just going across the literature and trying to write up stuff that’s half written up, or resolve these tensions between different posts. Is that the type of thing?

Ben Garfinkel: Yeah. So I think that could be a really good way to do it. I think even if it’s not a really full on project, I actually do just think if there’s anyone who, let’s say happens to be sufficiently familiar with the existing debates, or happens to be relatively familiar with basic economics concepts and happens to have five hours of free time a week, then even just having a 20 page blog post might be better than a 10 page one. Or even multiple blog posts trying to illuminate the same thing. I don’t think it’s necessarily that arduous to go beyond that.

Howie Lempel: Are there things you have in mind just for things that would make a big improvement to the AI risk discussion?

Ben Garfinkel: Yeah, so this is truly, I think, quite connected to the thing I just said. I do think it would actually be kind of useful even if there’s some reinventing the wheel phenomenon of just more people trying to write down the way that they understand the case for AI risk, or what motivates them personally. I think that there’s actually a lot of heterogeneity today in terms of AI safety researchers and how they frame the problem to themselves.

Ben Garfinkel: Even though I’ve just spent a whole bunch of time talking about these classic arguments, I think that lots of people have very different pictures of the risk. And a lot of these things are, I think, in people’s heads or in private Google docs or things like that.

Ben Garfinkel: And this is obviously a really large ask, because people are doing important time-consuming work, but I do think it’d be really cool if more people took the time to try and write up basically their own personal worldviews around the AI risk, or their own personal version or narrative just to try and get a clearer sense of what actual arguments are out there. I think there’s also something really nice that happens when debates move from the level of conversation to the level of written arguments.

Ben Garfinkel: I think it’s a really common phenomenon that there’s something that you think is clear or something that you think is a really tight argument, and then you try and write it down formally. It’s like, “Oh, there’s actually this thing that’s a little bit of a missing gap”. Or you put it out in public and there’s someone who read it who you never would have talked to in person, who has some interesting angle or objection. So really I just would like to see more stuff making its way into basically text.

Howie Lempel: Okay. So maybe to sum up the situation: there are still some very high-level arguments for why it might be really valuable for people who care about the long-term future to work on AI. A lot of them just come from the fact that it’s not that often that there’s a technology that comes around that even has the potential of having the impact of something like the industrial revolution.

Howie Lempel: That’s still the case, even if some of the specific arguments for AI being an extinction risk look a lot less strong after they’ve been examined. So that’s my summary, at least, of the story on the AI risk part. I think that there’s just a second part to this story, where my experience, and that of people that I’ve talked to, was reading a bunch of your work and feeling like I couldn’t understand how I hadn’t seen some of these issues before, or how, given that this is a book that’s gotten a lot of attention and that a lot of people have made big life changes on, I hadn’t seen more discussion of these counterarguments.

Howie Lempel: And so just curious if you have thoughts on like, I don’t know, like how surprised should we be? Was it reasonable to say there haven’t been any counterarguments over the last five years that seem that persuasive? I should be pretty certain. Is it even true that there weren’t persuasive counterarguments? Yeah, so what’s your take on this whole situation?

Ben Garfinkel: Yeah, so I think one thing that’s a bit surprising is when Superintelligence came out, I think there actually weren’t that many sort of detailed, good faith critiques of it. So there definitely were a number of people who wrote articles criticizing it. I think often they didn’t really engage closely or they, for example, made arguments I think can be relatively easily swatted down.

Ben Garfinkel: So one that we, for example, talked about earlier that you often see is people saying, “Oh, if it’s so smart, why is it doing this dumb thing of turning the world into paperclips”, which is sort of knocked down and discussed in depth in the context of the orthogonality thesis. I think there are probably a few people I would call out for having written, I suppose, much more substantive critiques of Superintelligence than most other people.

Ben Garfinkel: I think these would be especially Robin Hanson and Katja Grace. So Robin Hanson, he engaged in, back in 2008, this really long series of back-and-forth blog posts with Eliezer Yudkowsky on questions around AI development trajectories. And I think he made a number of arguments that actually still hold up really well. He presented the case, for example, of AI progress being largely driven by things like compute and gradual piecemeal algorithmic improvements which seems, in my mind, to have held up pretty well.

Ben Garfinkel: Katja Grace also has a couple of blog posts that I think make arguments that are relatively similar to some of the arguments I’ve just made in the context of this interview that I think, unfortunately, for whatever reason, don’t seem to have gotten that much attention. One other person I suppose I would mention as well is Brian Tomasik, who also has one fairly long blog post which I think also makes somewhat similar arguments, and which I think unfortunately also isn’t all that well known.

Ben Garfinkel: I actually think I don’t really have too clear of a sense of why some of these critiques haven’t had more influence than they have. I think maybe just for whatever reason, most people didn’t end up becoming aware of them or reading them. Although that’s a bit mysterious to me as well. I think there’s also some aspect as well of, why is it that there weren’t more active criticisms or critiques of these arguments in the community?

Ben Garfinkel: I think one dynamic at play is that, and I think this is a bit unfortunate, is that people often have the sense that in terms of these sorts of arguments, that maybe they’re not an expert enough on the topic to sort of really understand it. Or maybe they’re missing pieces, but there’s this sense that there’s other people in the community who have probably thought of those missing pieces and it just hasn’t really made its way into the work.

Ben Garfinkel: And so I think sometimes people, they maybe just may have not felt sufficiently secure to move from, I don’t quite understand this, to maybe there’s actually a missing piece. And also there may have been some sense of… Just as a common dynamic in the community of not all of the arguments really making their way into print. That like, “Oh, maybe there’s some Google doc or somewhere that covers it a lot. So it’s not really worth me raising because the people with access to this stuff understand it”, which I think is a really unfortunate dynamic.

Howie Lempel: Yeah. The fact that nobody can know if they’ve done a thorough job catching up on the literature, I think, for a lot of people who don’t feel comfortable putting their neck out unless they know that nobody else has made this point.

Ben Garfinkel: Yeah.

Howie Lempel: That’s a really big problem.

Ben Garfinkel: Yeah. I think it was probably a somewhat cyclical thing as well, where people notice that these arguments have been pretty widely accepted. And so they’re like, “Oh, I guess they’ve been really heavily vetted, I suppose”.

Howie Lempel: Yeah. It was like an information cascade type of thing going on I think.

Ben Garfinkel: Yeah, and then you have some sort of feedback loop. I think probably a third thing I’d point to, which is a generic property, is I think a lot of these arguments are talking about something that’s opaque to us: the future of AI and what these AI systems look like. And they’re using concepts which we really don’t have a very great grip on like intelligence and goals, optimization power, and things like this. I think it’s actually often really hard when an argument is using fairly abstract, slippery concepts about something that’s mysterious to us, to really be able to quite articulate when you’re suspicious what exactly the root of your suspicion is.

Ben Garfinkel: Or to really cleanly say what the issue is. And one analogy that I sometimes use, is if you look at Zeno’s paradox, there’s this argument from ancient Greece that motion is impossible. And this is an argument where it’s really, really clear the argument’s wrong. If you look into it, it’s a three-stage argument. It’s not that many moving pieces. And it wasn’t really until circa 1900 that I think there’s a really clear description of what was wrong with the argument. And the thing that it took was the introduction of certain concepts like Cartesian coordinates and infinite sums to actually really be able to say what was going on.

Ben Garfinkel: Partly people just didn’t really have the mathematically crisp concepts to talk about motion, and until they had those, you just couldn’t really say what was wrong with it. I guess if we imagined a version of Zeno’s paradox where the conclusion wasn’t obviously wrong, and it wasn’t just a three-step thing, you can easily imagine the flaws in this sort of thing just going unnoticed for a really long time. So I think, yeah, sometimes I think people underestimate how hard it is to actually articulate issues with arguments that use somewhat fuzzy concepts.

Howie Lempel: Does that mean that we should have been more sympathetic to some of the earlier criticism of Superintelligence, where it just seemed like there were a lot of people who really disagreed with the book, and the arguments that they gave just didn’t seem very good arguments? It felt like they didn’t quite understand where Superintelligence was coming from. And so I think I felt pretty dismissive of them, because I felt like I could point to the problems in the arguments. Should we be thinking of those more as gestures at some real problem that might be there, and take those types of things more seriously?

Ben Garfinkel: Yeah. I think I’m a bit conflicted here. So on the one hand, I do definitely have sympathy to the idea that if someone makes a criticism, and the response to that criticism is in your book of like, “Oh, here’s the orthogonality thesis”, you shouldn’t necessarily give that criticism much weight. But I do think you should probably take it seriously that it’s at least evidence that people have this intuitive common sense like, “Oh, there’s something wrong here. This is a bit off”. Especially, it seems like the extent to which people have this reaction also seems, to some extent, to correlate with how familiar people were with machine learning.

Ben Garfinkel: And obviously not perfectly. There’s loads of extremely prominent machine learning researchers who are quite concerned about this stuff. And I suppose I do have this intuition that the more an argument relies on fuzzy concepts or the more an argument discusses something that we really have very little familiarity with, or very little contact with, the less we should be inclined to give weight to these arguments, the more suspicious you should be like, “Oh, maybe we’re just not using the right concepts, or maybe there’s some flaw that we can’t quite articulate”, and the more we should potentially lean back on arguments that come from common sense intuition or reference class forecasting. And these sorts of arguments, for example, we’ve seen the argument, “Well, we normally work out safety issues every time as things go on”, which a lot of people have made or like, “That seems really bad. Why would we make something that’s that bad”?

Ben Garfinkel: Which, in some sense, just sounded very dumb because anyone can make these arguments. But I do actually think that they do actually deserve weight, and they especially deserve weight when they’re up against things that maybe they’re more rigorous or more sophisticated, but they’re relying on these concepts we haven’t really worked out.

Howie Lempel: I think we’re going to have to leave it there for now. Thank you so much, Ben Garfinkel, for being on the podcast.

Ben Garfinkel: Yeah. Thank you. This has been really fun.

Howie Lempel: So we were going back over the recording of this episode, and Ben wanted me to inform you that there was a point that he really regretted not making. And so we added it in at the end.

Ben Garfinkel: Yeah, thank you. I just want to add in a little bit of additional perspective that I didn’t get a chance to slip in. And it’s that, back in 2017, there was this movie called ‘The Boss Baby’ that came out, starring Alec Baldwin as a talking baby, who’s also a boss. And I believe if you add up all of the money that was spent making this movie and marketing it, then it adds up to quite a bit more than has ever been spent on long-term oriented AI safety or AI governance research. And so I do want to make it clear that insofar as I’ve expressed, let’s say, some degree of ambivalence about how much we ought to be prioritising AI safety and AI governance today, my sort of implicit reference point here is to things like pandemic preparedness, or nuclear war or climate change, just sort of the best bets that we have for having a long run social impact. So I think if you ask the separate question, are we putting enough resources as a society or as a species into AI safety and AI governance research? I think that the answer is just obviously, obviously no. I think it’d be really hard to argue that they warrant, let’s say, less than five Boss Babies worth of funding and effort. And so I do just want to make that clear that when it comes to how much we ought to prioritise these areas, at least five Boss Babies is my stance.

Howie Lempel: Thank you, Ben.

Rob’s outro [02:37:51]

I hope you enjoyed that episode.

Just a reminder that we’ve put links to a few presentations by Ben Garfinkel in the show notes, which I’ve enjoyed reading. And there are more resources on the page associated with this episode.

There’s also always our job board at 80000hours.org/jobs if you’re looking to find the next step in your career that can help you improve the world in a big way.

The 80,000 Hours Podcast is produced by Keiran Harris.

Audio mastering by Ben Cordell.

Full transcripts are available on our site and made by Zakee Ulhaq.

Thanks for joining, talk to you again soon.

Learn more

Working in US AI policy

AI safety syllabus

Related episodes

June 22, 2020

#80 – Stuart Russell on the flaws that make today’s AI architecture unsafe, and a new approach that could fix them

May 22, 2020

#78 – Danny Hernandez on forecasting and the drivers of AI progress

March 19, 2019

#54 – Askell, Brundage & Clark on whether policy has a hope of keeping up with AI advances

October 2, 2018

#44 – Paul Christiano on how OpenAI is developing real solutions to the ‘AI alignment problem’, and his vision of how humanity will progressively hand over decision-making to AI systems

August 21, 2018

#40 – How well can we actually predict the future? Katja Grace on why expert opinion isn’t a great guide to AI’s impact and how to do better

May 18, 2018

#31 – Allan Dafoe on trying to prepare the world for the possibility that AI will destabilise global politics

About the show

The 80,000 Hours Podcast features unusually in-depth conversations about the world's most pressing problems and how you can use your career to solve them. We invite guests pursuing a wide range of career paths — from academics and activists to entrepreneurs and policymakers — to analyse the case for and against working on different issues and which approaches are best for solving them.

Get in touch with feedback or guest suggestions by emailing [email protected].

