#158 – Holden Karnofsky on his four-part playbook for AI risk

Back in 2007, Holden Karnofsky cofounded GiveWell, where he sought out the charities that most cost-effectively helped save lives. He then cofounded Open Philanthropy, where he oversaw a team making billions of dollars’ worth of grants across a range of areas: pandemic control, criminal justice reform, farmed animal welfare, and making AI safe, among others. This year, after years of learning about AI and observing recent events, he’s narrowing his focus once again, this time on making the transition to advanced AI go well.

In today’s conversation, Holden returns to the show to share his overall understanding of the promise and the risks posed by machine intelligence, and what to do about it. That understanding has accumulated over roughly 14 years, during which he went from being sceptical that AI was important or risky to making AI risks the focus of his work.

(As Holden reminds us, his wife is also the president of one of the world’s top AI labs, Anthropic, giving him both conflicts of interest and a front-row seat to recent events. For our part, Open Philanthropy is 80,000 Hours’ largest financial supporter.)

One point he makes is that people are too narrowly focused on AI becoming ‘superintelligent.’ While that could happen and would be important, it’s not necessary for AI to be transformative or perilous. Rather, machines with human levels of intelligence could end up being enormously influential simply because the world’s computer hardware could run tens or hundreds of billions of them, in a sense making machine intelligences a majority of the global population, or at least a majority of global thought.

As Holden explains, he sees four key parts to the playbook humanity should use to guide the transition to very advanced AI in a positive direction: alignment research, standards and monitoring, creating a successful and careful AI lab, and finally, information security.

In today’s episode, host Rob Wiblin interviews return guest Holden Karnofsky about that playbook, as well as:

  • Why we can’t rely on just gradually solving those problems as they come up, the way we usually do with new technologies.
  • What multiple different groups can do to improve our chances of a good outcome — including listeners to this show, governments, computer security experts, and journalists.
  • Holden’s case against ‘hardcore utilitarianism’ and what actually motivates him to work hard for a better world.
  • What the ML and AI safety communities get wrong in Holden’s view.
  • Ways we might succeed with AI just by dumb luck.
  • The value of laying out imaginable success stories.
  • Why information security is so important and underrated.
  • Whether it’s good to work at an AI lab that you think is particularly careful.
  • The track record of futurists’ predictions.
  • And much more.

Get this episode by subscribing to our podcast on the world’s most pressing problems and how to solve them: type ‘80,000 Hours’ into your podcasting app. Or read the transcript below.

Producer: Keiran Harris
Audio Engineering Lead: Ben Cordell
Technical editing: Simon Monsour and Milo McGuire
Transcriptions: Katy Moore

Highlights

Explosively fast progress

Holden Karnofsky: One of the reasons I’m so interested in AI safety standards is because kind of no matter what risk you’re worried about, I think you hopefully should be able to get on board with the idea that you should measure the risk, and not unwittingly deploy AI systems that are carrying a tonne of the risk, before you’ve at least made a deliberate informed decision to do so. And I think if we do that, we can anticipate a lot of different risks and stop them from coming at us too fast. “Too fast” is the central theme for me.

You know, a common story in some corners of this discourse is this idea of an AI that’s this kind of simple computer program, and it rewrites its own source code, and that’s where all the action is. I don’t think that’s exactly the picture I have in mind, although there’s some similarities.

The kind of thing I’m picturing is maybe more like a months or years time period from getting sort of near-human-level AI systems — and what that means is definitely debatable and gets messy — but near-human-level AI systems to just very powerful ones that are advancing science and technology really fast. And then in science and technology — at least on certain fronts that are the less bottlenecked fronts — you get a huge jump. So I think my view is at least somewhat more moderate than Eliezer’s, and at least has somewhat different dynamics.

But I think both points of view are talking about this rapid change. I think without the rapid change, a) things are a lot less scary generally, and b) I think it is harder to justify a lot of the stuff that AI-concerned people do to try and get out ahead of the problem and think about things in advance. Because I think a lot of people sort of complain about this discourse that it’s really hard to know the future, and that all this stuff we’re talking about, like what future AI systems are going to do and what we have to do about it today, is very hard to get right. It’s very hard to anticipate what things will be like in an unfamiliar future.

When people complain about that stuff, I’m just very sympathetic. I think that’s right. And if I thought that we had the option to adapt to everything as it happens, I think I would in many ways be tempted to just work on other problems, and in fact adapt to things as they happen and we see what’s happening and see what’s most needed. And so I think a lot of the case for planning things out in advance — trying to tell stories of what might happen, trying to figure out what kind of regime we’re going to want and put the pieces in place today, trying to figure out what kind of research challenges are going to be hard and do them today — I think a lot of the case for that stuff being so important does rely on this theory that things could move a lot faster than anyone is expecting.

I am in fact very sympathetic to people who would rather just adapt to things as they go. I think that’s usually the right way to do things. And I think many attempts to anticipate future problems are things I’m just not that interested in, because of this issue. But I think AI is a place where we have to take the explosive progress thing seriously enough that we should be doing our best to prepare for it.

Rob Wiblin: Yeah. I guess if you have this explosive growth, then the very strange things that we might be trying to prepare for might be happening in 2027, or incredibly soon.

Holden Karnofsky: Something like that, yeah. It’s imaginable, right? And it’s all extremely uncertain because we don’t know. In my head, a lot of it is like there’s a set of properties that an AI system could have: roughly, being able to do everything humans are able to do to advance science and technology, or at least to advance AI research. We don’t know when we’ll have that. One possibility is we’re like 30 years away from that. But once we get near that, things will move incredibly fast. And that’s a world we could be in. We could also be in a world where we’re only a few years from that, and then everything’s going to get much crazier than anyone thinks, much faster than anyone thinks.

AI population explosion

Rob Wiblin: I think some people are sceptical of this superintelligence story, because they think you get really declining returns to being smarter, and that there’s some ways in which it just doesn’t matter how smart you are, the world is too unpredictable for you to come up with a great plan. But this is a different mechanism by which you can get the same outcome, which is just that you have this enormous increase in the number of thoughts that are occurring on computer chips, more or less. And at some point, 99% of the thoughts that are happening on Earth could basically be occurring inside artificial intelligences. And then as they get better, and they’re able to make more chips more quickly, you again basically just get a population explosion.

Holden Karnofsky: Yeah. That’s exactly right. And I think this is a place where some people get a little bit rabbitholed on the AI debates, because I think there’s a lot of room to debate how big a deal it is to have something that’s “extremely smart” or “superintelligent” or “much smarter than human.” It’s like, maybe if you had something that was like a giant brain or something, and way way smarter (whatever that means) than us, maybe what that would mean is that it would instantly see how to make all these super weapons and conquer the world and how to convince us of anything. There’s all this stuff that that could mean, and people debate whether it could mean that, but it’s uncertain.

And I think a thing that’s a lot less uncertain — if you’re finding yourself sceptical of what this “smart” idea means, and where it’s going to go, and what you can do with it — if you find yourself sceptical of that, then just forget about it. I believe you can make the entire case for being extremely concerned about AI, assuming that AI will never be smarter than a human. Instead, it will be as capable as the most capable humans, and there will be a tonne of them — because unlike humans, you can just copy them. You can use your copies to come up with ways to make it more efficient, just like humans do, then you can make more copies.

And when we talk about whether AI could defeat humanity — and I’ve written one blog post on whether AI could kind of take over the world — they don’t have to be more capable than humans. They could be equally capable, and there could be more of them. That could really do it. That could really be enough that then humans wouldn’t be in control of the world anymore.

So I’m basically generally happy to just have all discussions about AI and what the risks are, just in this world where there’s nothing more capable than a human — but it’s pretty scary to have a lot of those that have different values from humans, and that are kind of a second advanced species. That’s not to rule out that some of these superintelligence concerns could be real. It’s just that they’re not always necessary, and they can sideline people.

Rob Wiblin: Yeah. You can just get beaten by force of numbers more or less. I think it’s a little bit of a shame that this sheer numbers argument hasn’t really been made very much. It feels like the superintelligence story has been very dominant in the narrative and media, and many people get off the boat because they’re sceptical of this intelligence thing. I think it kind of is the fault of me and maybe people who’ve been trying to raise the alarm about this, because the focus really has been on the superintelligence aspect rather than the super-numerousness that you could get.

Holden Karnofsky: And I don’t know. I mean, I think there’s valid concerns from that angle for sure, and I’m not trying to dismiss it. But I think there’s a lot of uncertainty about what superintelligence means and where it could go. And I think you can raise a lot of these concerns without needing to have a settled view there.

Misaligned AI might not kill us all, and aligned AI could be catastrophic

Holden Karnofsky: A vibe I pick up is this kind of framework that says, “If we don’t align our AIs, we’re all going to die. And if we can align our AIs, that’s great, and we’ve solved the problem.” And that’s the problem we should be thinking about, and there’s nothing else really worth worrying about. It’s kind of like alignment is the whole game, would be the hypothesis.

And I disagree with both ends of that, especially the latter. So to take the first end — if we don’t align AI, we’re all dead — first off, I just think it’s really unclear. Even in the worst case — where you get an AI that has its own values, and there’s a huge number of them, and they kind of team up and take over the world — even then, it’s really unclear if that means we all die. I know there’s debates about this. I have tried to understand. The MIRI folks, I think, feel really strongly that clearly, we all die. I’ve tried to understand where they’re coming from, and I have not.

I think a key point is it just could be very cheap — as a percentage of resources, for example — to let humans have a nice life on Earth, and not expand further, and be cut off in certain ways from threatening the AI’s ability to do what it wants. That can be very cheap compared to wiping us all out.

And there could be a bunch of reasons one might want to do that, some of them kind of wacky. Some of them like, “Well, maybe in another part of the universe, there’s someone like the AI that was trying to design its own AI. And that thing ended up with values like the humans, and maybe there’s some kind of trade that could be made, using acausal trade” — and we don’t need to get into what all this means — or like, maybe the AI is actually being simulated by humans or something, or by some smarter version of humans, or some more powerful version of humans, and being tested to see if it’ll wipe out the humans or be nice to them. It’s just like, you don’t need a lot of reasons to leave one planet out if you’re expanding throughout the galaxy. So that would be one thing: it’s just kind of uncertain what happens even in the worst case.

The other part — if we do align the AI, we’re fine — I disagree with much more strongly. On the first one, I mean, I think it would be really bad to have misaligned AI. And despite feeling that it’s fairly overrated in some circles, I still think it’s the number one thing for me. Just the single biggest issue in AI is that we’re building these potentially very powerful, very replicable, very numerous systems — and we’re building them in ways where we don’t have much insight into whether they have goals, or what the goals would be; we’re kind of introducing a second advanced species onto the planet that we don’t understand. And if that advanced species becomes more numerous and/or more capable than us, we don’t have a great argument to think that’s going to be good for us. So I’m on board with alignment risk being the number one thing — not the only thing, but the number one thing.

But I would say, if you just assume that you have a world of very capable AIs, that are doing exactly what humans want them to do, that’s very scary. And I think if that was the world we knew we were going to be in, I would still be totally full time on AI, and still feel that we had so much work to do and we were so not ready for what was coming.

Certainly, there’s the fact that because of the speed at which things move, you could end up with whoever kind of leads the way on AI, or is least cautious, having a lot of power — and that could be someone really bad. And if you had some head of state with really bad values, I don’t think we should assume that that person is going to end up being nice after they become wealthy, or powerful, or transhuman, or mind uploaded, or whatever — I don’t think there’s really any reason to assume that.

And then I think there’s just a bunch of other things that, if things are moving fast, we could end up in a really bad state. Like, are we going to come up with decent frameworks for making sure that the digital minds are not mistreated? Are we going to come up with decent frameworks for how to ensure that as we get the ability to create whatever minds we want, we’re using that to create minds that help us seek the truth, instead of create minds that have whatever beliefs we want them to have, stick to those beliefs and try to shape the world around those beliefs? I think Carl Shulman put it as, “Are we going to have AI that makes us wiser or more powerfully insane?”

So I think there’s just a lot. I think we’re on the cusp of something that is just potentially really big, really world-changing, really transformative, and going to move way too fast. And I think even if we threw out the misalignment problem, we’d have a lot of work to do — and I think a lot of these issues are actually not getting enough attention.

Getting the AIs to do our alignment homework for us

Holden Karnofsky: It’s this idea that once you have human-level-ish AI systems, you have them working on the alignment problem in huge numbers. And in some ways, I hate this idea, because it’s just very lazy, and it’s like, “Oh, yeah. We’re not going to solve this problem until later, when the world is totally crazy, and everything’s moving really fast, and we have no idea what’s going to happen.”

Rob Wiblin: “We’ll just ask the agents that we don’t trust to make themselves trustworthy.”

Holden Karnofsky: Yeah. Exactly. So there’s a lot to hate about this idea. But, heck, it could work. It really could. Because you could have a situation where, just in a few months, you’re able to do the equivalent of thousands of years of humans doing alignment research. And if these systems are not at the point where they can or want to screw you up, that really could do it. I mean, we just don’t know that thousands of years of human-level-ish alignment research isn’t enough to get us a real solution. And so that’s how you get through a lot of it.

And then you still have another problem in a sense, which is that you do need a way to stop dangerous systems. It’s not enough to have safe AI systems. But again, you have help from this giant automated workforce. And so in addition to coming up with ways to make your systems safe, you can come up with ways of showing that they’re dangerous and when they’re dangerous, and of being persuasive about the importance of the danger.

I don’t know. I feel like if we had 100 years before AGI right now, there’d be a good chance that normal flesh-and-blood humans could pull this off. So in that world, there’s a good chance that an automated workforce can cause it to happen pretty quickly, and you could pretty quickly get an understanding of the risks and agreement that we need to stop them. And you have more safe AIs than dangerous AIs, and you’re trying to stop the dangerous AIs. And you’re measuring the dangerous AIs, or you’re stopping any AI that refuses to be measured or whose developer refuses to measure it.

So then you have a world that’s kind of like this one, where there’s a lot of evil people out there, but they are generally just kept in check by being outnumbered by people who are at least law abiding, if not incredibly angelic. So you get a world that looks like this one, but it just has a lot of AIs running around in it, so we have a lot of progress in science and technology. And that’s a fine ending, potentially.

The value of having a successful and careful AI lab

Holden Karnofsky: First with the reminder that I’m married to the president of Anthropic, so take that for what it’s worth.

I just think that if you had an AI company that was on the frontier, that was succeeding, that was building some of the world’s biggest models, that was pulling in a lot of money, and that was simultaneously able to really be prioritising risks to humanity, it’s not too hard to think of a lot of ways good could come of that.

Some of them are very straightforward. The company could be making a lot of money, raising a lot of capital, and using that to support a lot of safety research on frontier models. So you could think of it as a weird kind of earning to give or something. Also probably that AI company would be pretty influential in discussions of how AI should be regulated and how people should be thinking of AI: they could be a legitimiser, all that stuff. I think it’d be a good place for people to go and just skill up, learn more about AI, become more important players.

I think in the short run, they’d have a lot of expertise in-house, and they could work on a lot of problems, probably including designing ways of measuring whether an AI system is dangerous. One of the first places you’d want to go for people who’d be good at that would be a top AI lab that’s building some of the most powerful models. So I think there’s a lot of ways they could do good in the short run.

And then I have written stories where the big benefits come in the long run. When we get these really powerful systems, it actually does matter a lot who has them first and what they’re literally using them for. When you have very powerful AIs, is the first thing you’re using them for trying to figure out how to make future systems safe, or trying to figure out how to assess the threats of future systems? Or is the first thing you’re using them for just trying to rush forward as fast as you can, building faster algorithms and more and bigger systems? Or is the first thing you’re using them for just some random economic thing that is kind of cool and makes a lot of money?

Rob Wiblin: Some customer-facing thing. Yeah.

Holden Karnofsky: Yeah. And it’s not bad, but it’s not reducing the risk we care about. So I think there is a lot of good that can be done there.

And then there’s also — I want to be really clear here — a lot of harm an AI company could do, if you’re pushing out these systems.

Rob Wiblin: Kill everyone.

Holden Karnofsky: That kind of thing, yeah. For example. You know, you’re pushing out these AI systems, and if you’re doing it all with an eye toward profit and moving fast and winning, then you could think of it as you’re taking the slot of someone who could have been using that expertise and money and juice to be doing a lot of good things. You could also think of it as you’re just giving everyone less time to figure out what the hell is going on, and we already might not have enough. So I want to be really clear. This is a tough one. I don’t want to be interpreted as saying that one of the tentpoles of reducing AI risk is to go start an AI lab immediately — I don’t believe that.

But I also think that some corners of the AI safety world are very dismissive, or just think that AI companies are bad by default. And this is just really complicated, and it really depends exactly how the AI lab is prioritising risk to society versus success — and it has to prioritise success some to be relevant, or to get some of these benefits. So how it’s balancing those is just really complicated and really hard to tell, and you’re going to have to have some judgements about it. So it’s not a ringing endorsement, but it does feel like, at least in theory, part of one of the main ways that we make things better. You know, you could do a lot of good.

Why information security is so important

Holden Karnofsky: I think you could build these powerful, dangerous AI systems, and you can do a lot to try to mitigate the dangers — like limiting the ways they can be used, you can do various alignment techniques — but if some state or someone else steals the weights, they’ve basically stolen your AI system, and they can run it without even having to do the training run. So you might spend a huge amount of money on a training run, end up with this AI system that’s very powerful, and someone else just has it. And they can then also fine-tune it, which means they can do their own training on it and change the way it’s operating. So whatever you did to train it to be nice, they can train that right out; the training they do could screw up whatever you did to try and make it aligned.

And so I think at the limit of ‘it’s really just trivial for any state to just grab your AI system and do whatever they want with it and retrain it how they want’, it’s really hard to imagine feeling really good about that situation. I don’t know if I really need to elaborate a lot more on that. So making it harder seems valuable.

This is another thing where I want to say, as I have with everything else, that it’s not binary. So it could be the case that, after you improve your security a lot, it’s still possible for a state actor to steal your system, but they have to take more risks, they have to spend more money, they have to take a deeper breath before they do it. It takes them more months. Months can be a very big deal. As I’ve been saying, when you get these very powerful systems, you could do a lot in a few months. By the time they steal it, you could have a better system. So I don’t think it’s an all-or-nothing thing.

But no matter what risk of AI you’re worried about — you could be worried about the misalignment; you could be worried about the misuse and the use to develop dangerous weapons; you can be worried about more esoteric stuff, like how the AI does decision theory; you could be worried about mind crime — you don’t want just anyone, including some of these state actors who may have very bad values, to just be able to steal a system, retrain it how they want, and use it how they want. You want some kind of setup where it’s the people with good values controlling more of the more powerful AI systems, using them to enforce some sort of law and order in the world, and enforcing law and order generally — with or without AI. So it seems quite robustly important.

Another thing about security is that I think it’s just very, very hard to make these systems hard for a state actor to steal, and so there’s just a tonne of room to go and make things better. There could be security research on innovative new methods, and there can also be a lot of blocking and tackling — just getting companies to do things that we already know need to be done, but that are really hard to do in practice, take a lot of work, take a lot of iteration. Also, a nice thing about security, as opposed to some of these other things: it is a relatively mature field, so you can learn about security in some other context and then apply it to AI.

Part of me kind of thinks that the EA community or whatever kind of screwed up by not emphasising security more. It’s not too hard for me to imagine a world where we’d just been screaming about the AI security problem for the last 10 years: how do you stop a very powerful system from getting stolen? That problem is extremely hard. In that world, we’d have made a bunch of progress on it, and there’d be tonnes of people concerned about this stuff on the security teams of all the top AI companies, while we were less active on alignment and only had a few people working on it.

I don’t know. Is that world better or worse than this one? I’m not really sure. A world where we’d been more balanced, and had encouraged people to go into whichever one they were a better fit for, probably seems just better than the world we’re in. So yeah, I think security is a really big deal. I think it hasn’t gotten enough attention.

Holden vs hardcore utilitarianism

Rob Wiblin: So maybe a way of highlighting the differences here will be to imagine this conversation, where you’re saying, “I’m leading Open Philanthropy. I think that we should split our efforts between a whole bunch of different projects, each one of which would look exceptional on a different plausible worldview.” And the hardcore utilitarian comes to you and says, “No, you should choose the best one and fund that. Spend all of your resources and all of your time, just focus on that best one.” What would you say to them in order to justify the worldview diversification approach?

Holden Karnofsky: The first thing I would say to them is just, “The burden’s on you.” And this is kind of a tension I often have with people who consider themselves hardcore: they’ll just feel like, well, why wouldn’t you be a hardcore utilitarian? Like, what’s the problem? Why isn’t it just maximising the pleasure and minimising the pain, or the sum or the difference? And I would just be like, “No. No. No. You gotta tell me, because I am sitting here with these great opportunities to help huge amounts of people in very different and hard-to-compare ways.”

And the way I’ve always done ethics before in my life is, like, I basically have some voice inside me and it says, “This is what’s right.” And that voice has to carry some weight. Even on your model, that voice has to carry some weight, because you — you, the hardcore utilitarian, not Rob, because we all know you’re not one at all — it’s like, even the most systematic theories of ethics, they’re all using that little voice inside you that says what’s right. That’s the arbiter of all the thought experiments. So we’re all putting weight on it somewhere, somehow. And I’m like, cool. That’s gotta be how this works. There’s a voice inside me saying, “this feels right” or “this feels wrong” — that voice has gotta get some weight.

That voice is saying, “You know what? It is really interesting to think about these risks to humanity’s future, but also, it’s weird.” This work is not shaped like the other work. It doesn’t have as good feedback loops. It feels icky — like a lot of this work is about just basically supporting people who think like us, or it feels that way a lot of the time, and it just feels like it doesn’t have the same ring of ethics to it.

And then on the other hand, it just feels like I’d be kind of a jerk if… Like, Open Phil, I believe — and you could disagree with me — is not only the biggest, but the most effective farm animal welfare funder in the world. I think we’ve had enormous impact and made animal lives dramatically better. And you’re coming to say to me, “No, you should take all that money and put it into the diminishing margin of supporting people to think about some future x-risk in a domain where you mostly have a lot of these concerns about insularity.” Like, you’ve got to make the case to me — because the normal way all this stuff works is you listen to that voice inside your head, and you care what it says. And some of the opportunities Open Phil has to do a lot of good are quite extreme, and we do them.

So that’s the first thing: we’ve got to put the burden of proof in the right place.

About the show

The 80,000 Hours Podcast features unusually in-depth conversations about the world's most pressing problems and how you can use your career to solve them. We invite guests pursuing a wide range of career paths — from academics and activists to entrepreneurs and policymakers — to analyse the case for and against working on different issues and which approaches are best for solving them.

Get in touch with feedback or guest suggestions by emailing [email protected].

What should I listen to first?

We've carefully selected 10 episodes we think it could make sense to listen to first, on a separate podcast feed:

Check out 'Effective Altruism: An Introduction'

If you're new, see the podcast homepage for ideas on where to start, or browse our full episode archive.