#81 – Ben Garfinkel on scrutinising classic AI risk arguments

80,000 Hours, along with many other members of the effective altruism movement, has argued that helping to positively shape the development of artificial intelligence may be one of the best ways to have a lasting, positive impact on the long-term future. Millions of dollars in philanthropic spending, as well as lots of career changes, have been motivated by these arguments.

Today’s guest, Ben Garfinkel, Research Fellow at Oxford’s Future of Humanity Institute, supports the continued expansion of AI safety as a field and believes working on AI is among the very best ways to have a positive impact on the long-term future. But he also believes the classic AI risk arguments have been subject to insufficient scrutiny given this level of investment.

In particular, the case for working on AI if you care about the long-term future has often been made on the basis of concern about AI accidents; it’s actually quite difficult to design systems that you can feel confident will behave the way you want them to in all circumstances.

Nick Bostrom wrote the most fleshed out version of the argument in his book, Superintelligence. But Ben reminds us that, apart from Bostrom’s book and essays by Eliezer Yudkowsky, there’s very little existing writing on existential accidents. Some more recent AI risk arguments do seem plausible to Ben, but they’re fragile and difficult to evaluate since they haven’t yet been expounded at length.

There have also been very few skeptical experts who have actually sat down and fully engaged with these arguments, writing down point by point where they disagree or where they think the mistakes are. This means Ben has probably scrutinised the classic AI risk arguments as carefully as almost anyone else in the world.

He thinks the arguments for existential accidents often rely on fuzzy, abstract concepts like optimisation power, general intelligence, or goals, as well as toy thought experiments. And he doesn’t think it’s clear we should take these as a strong source of evidence.

Ben’s also concerned that these scenarios often involve massive jumps in the capabilities of a single system, and it’s really not clear that we should find such jumps plausible.

These toy examples also focus on the idea that because human preferences are so nuanced and so hard to state precisely, it should be quite difficult to get a machine that can understand how to obey them.

But Ben points out that it’s also the case in machine learning that we can train lots of systems to engage in behaviours that are actually quite nuanced and that we can’t specify precisely. If AI systems can recognise faces from images, and fly helicopters, why don’t we think they’ll be able to understand human preferences?

Despite these concerns, Ben is still fairly optimistic about the value of working on AI safety or governance.

He doesn’t think that there are any slam-dunks for improving the future, and so the fact that there are at least plausible pathways for impact by working on AI safety and AI governance, in addition to it still being a very neglected area, puts it head and shoulders above most areas you might choose to work in.

This is the second episode hosted by our Strategy Advisor Howie Lempel, and he and Ben cover, among many other things:

  • The threat of AI systems increasing the risk of permanently damaging conflict or collapse
  • The possibility of permanently locking in a positive or negative future
  • Contenders for types of advanced systems
  • What role AI should play in the effective altruism portfolio

Get this episode by subscribing to our podcast on the world’s most pressing problems and how to solve them: type 80,000 Hours into your podcasting app. Or read the transcript below.

Producer: Keiran Harris.
Audio mastering: Ben Cordell.
Transcriptions: Zakee Ulhaq.

Highlights

AI development scenarios

You might think that the way AI progress looks at the moment, to some extent, is that year by year AI systems, at least in aggregate, become capable of performing some subset of the tasks that people can perform that previously AI systems couldn’t. And so there will be a year where we have the first AI system that can beat the best human at chess, or a year where we have the first AI system that can beat a typical human at recognizing certain forms of images. And this happens year by year: there’s this gradual increase in the portion of relevant tasks that AI systems can perform.

And you might also think that, at the same time, maybe there’d be a trend in terms of the generality of individual systems. This is one thing that people work on in AI research: trying to create individual AI systems which are able to perform a wider range of tasks, as opposed to relying on lots of specialized systems. It seems like generality is more of a variable than a binary, at least in principle. So you could imagine that the breadth of tasks an AI system can perform will become wider and wider. And you might think that other things are fairly gradual too: the time horizons that systems act on, or the level of independence that AI systems exhibit, might also increase smoothly. And so then maybe you end up in a world where there comes a day where we have the first AI system that can, in principle, do anything a person can do.

But at that point, maybe that AI system already radically outperforms humans at most tasks. By the first point where we have an AI system that can do everything a person can do, AI systems may already have been able to do most things better than people can. Maybe this first system that can do all the stuff an individual person can do also exists in a world with a bunch of other extremely competent systems of different levels of generality and different levels of competence in different areas. And maybe it’s also been preceded by lots of extremely transformative systems that, in lots of different ways, are superintelligent.

Contenders for types of advanced systems

I think it’s probably pretty difficult to paint any really specific picture, but there are some high-level things you could say about it. One high-level thing you might say is: take a list of economically relevant tasks that people perform today, say from the Bureau of Labor Statistics database, and then just cross a bunch off and assume that AI systems can either do them, or do certain things that make those tasks not very economically relevant. We can also think that there’s some stuff that people just can’t do today, that’s just not really on people’s radar as an economically or politically or militarily relevant task, that maybe AI systems will be able to perform.

So one present-day example of something that people can’t do, but AI systems can, is generating convincing fake images or convincing fake videos: these things called deepfakes. That’s not really something the human brain is capable of doing on its own, but AI systems can. So just at a very general, abstract level, imagine lots of stuff that’s done today by humans is either done by AI systems or made irrelevant by AI systems, and then lots of capabilities that we just wouldn’t even necessarily consider might also exist as well.

It’s probably lots of individual domains. You can maybe imagine research looks very different: maybe really large portions of scientific research are automated, or maybe the processes are in some way AI-driven in ways that they’re not today. Maybe political decision-making is much more heavily informed by the outputs of AI systems than it is today, or maybe certain aspects of things like law enforcement or arbitration are to some extent automated, and the world could just be quite different in lots of different ways.

Implications of smoothness

One implication is that people are more likely to do useful work ahead of time, or more likely to have institutions in place at that time to deal with stuff that will arise. This definitely doesn’t completely resolve the issue. There are some issues that people know will happen ahead of time and are not really sufficiently handling; climate change is a prominent example. But, even in the case of climate change, I think we’re much better off knowing, “Oh okay, the climate’s going to be changing this much over this period of time”, as opposed to just waking up one day and it’s like, “Oh man, the climate is just really, really different”. The issue is definitely not in any way resolved by this, but it is quite helpful that people see what’s coming when the changes are relatively gradual, and institutions, to some extent, are being put in place.

I think there are also some other things that you get out of imagining this more gradual scenario as opposed to the brain-in-a-box scenario. Another one that I think is quite connected is that people are more likely to know the specific safety issues as they arise, insofar as there are any unresolved issues about making AI systems behave the way you want them to, or, let’s say, not deceive the people who are designing them. You’re likely to see low-level versions of these before you see very extreme versions of these. You’ll have relatively competent or relatively important AI systems mess up in ways that aren’t, let’s say, world-destroying before you even get to the position where you have AI systems whose behaviors are that consequential.

If you imagine that you live in a world where you have lots and lots of really advanced AI systems deployed throughout all the different sectors of the economy and throughout political systems, then insofar as there are these really fundamental safety issues, you should probably have noticed them by that point. There will probably have been smaller failures. People should also be less caught off guard by, or less oblivious to, certain issues that might arise with the design of really advanced systems.

Instrumental convergence

If you’re trying to predict what a future technology will look like, it’s not necessarily a good methodology to think, “Here are all of the possible ways we might make this technology. Most of the ways involve property P, so therefore we’ll probably create a technology with property P”. Just as some simple, silly illustrations: most possible ways of building an airplane involve at least one of the windows on the airplane being open. There are a bunch of windows, and a bunch of different combinations of open and closed windows, and only one involves them all being closed. It would be a bad prediction that we’ll build airplanes with open windows. Most possible ways of building cars don’t involve functional steering wheels that a driver can reach. Most possible ways of building buildings involve giant holes in the floor; there’s only one possible way to have the floor not have a hole in it.

So it often seems to be the case that this argument schema doesn’t work that well. Another case is human evolution. There are, for example, a lot of different preference rankings I could have over the arrangement of matter in this room, a lot of different goals, in some sense, I could have about how I’d like the stuff in the room to be. If you really think about it, most different preferences I could have over the arrangement of matter in the room involve me wanting to tear up all the objects and put them in very specific places. There’s only a small subset of the preferences I could have that involve me keeping the objects intact, because there are a lot fewer ways for things to be intact than to be split apart and spread throughout the room.

And yet it’s not really that surprising that I don’t have some wildly destructive preference about how the atoms in this room are arranged. The general principle here is that if you want to try and predict what some future technology will look like, maybe there is some predictive power you get from thinking “X percent of the ways of doing this involve property P”. But it’s important to think about the process by which this technology or artifact will emerge. Is that the sort of process that will be differentially attracted to things which are, let’s say, benign? If so, then maybe that outweighs the fact that most possible designs are not benign.

Ben's overall perspective

There’s this big constellation of arguments that people have put forward. My overall perspective, at least on the safety side of things, is that although I found the classic arguments quite convincing when I first encountered them, at this point I have a number of qualms about how they’re presented.

I think there are a number of things that don’t really go through, or that need to be adjusted or elaborated upon more. And on the other hand, there are these other arguments that have emerged more recently, but they haven’t really been described in a lot of detail. A lot of them really do come down to a couple of blog posts written in the past year. And if, for example, the entire case for treating AI as an existential risk hung on these blog posts, I wouldn’t really feel comfortable, let’s say, advocating for millions of dollars to be going into the field, or loads and loads of people to be changing their careers, which seems to be happening at the moment.

I think that basically we’re in a state of affairs where there are a lot of plausible concerns about AI, and a lot of different angles you can come from that say this is something we ought to be working on from a long-term perspective. But at least I’m somewhat uncomfortable about the state of the arguments that have been published. The things that are more rigorous or more fleshed out, I don’t really agree with that much. And the things that I’m maybe more sympathetic to just haven’t been fleshed out that much.

About the show

The 80,000 Hours Podcast features unusually in-depth conversations about the world's most pressing problems and how you can use your career to solve them. We invite guests pursuing a wide range of career paths — from academics and activists to entrepreneurs and policymakers — to analyse the case for and against working on different issues and which approaches are best for solving them.

Get in touch with feedback or guest suggestions by emailing [email protected].

What should I listen to first?

We've carefully selected 10 episodes we think it could make sense to listen to first, on a separate podcast feed:

Check out 'Effective Altruism: An Introduction'

If you're new, see the podcast homepage for ideas on where to start, or browse our full episode archive.