Holden Karnofsky on how AIs might take over even if they’re no smarter than humans, and his four-part playbook for AI risk

By Robert Wiblin and Keiran Harris · Published July 31st, 2023

Holden Karnofsky on how AIs might take over even if they’re no smarter than humans, and his four-part playbook for AI risk

By Robert Wiblin and Keiran Harris · Published July 31st, 2023

Enjoyed the episode? Want to listen later? Subscribe here, or anywhere you get podcasts:

I think a lot of the case for planning things out in advance — trying to tell stories of what might happen, trying to figure out what kind of regime we’re going to want and put the pieces in place today, trying to figure out what kind of research challenges are going to be hard and do them today — I think a lot of the case for that stuff being so important does rely on this theory that things could move a lot faster than anyone is expecting.
Holden Karnofsky

Back in 2007, Holden Karnofsky cofounded GiveWell, where he sought out the charities that most cost-effectively helped save lives. He then cofounded Open Philanthropy, where he oversaw a team making billions of dollars’ worth of grants across a range of areas: pandemic control, criminal justice reform, farmed animal welfare, and making AI safe, among others. This year, having learned about AI for years and observed recent events, he’s narrowing his focus once again, this time on making the transition to advanced AI go well.

In today’s conversation, Holden returns to the show to share his overall understanding of the promise and the risks posed by machine intelligence, and what to do about it. That understanding has accumulated over around 14 years, during which he went from being sceptical that AI was important or risky, to making AI risks the focus of his work.

(As Holden reminds us, his wife is also the president of one of the world’s top AI labs, Anthropic, giving him both conflicts of interest and a front-row seat to recent events. For our part, Open Philanthropy is 80,000 Hours’ largest financial supporter.)

One point he makes is that people are too narrowly focused on AI becoming ‘superintelligent.’ While that could happen and would be important, it’s not necessary for AI to be transformative or perilous. Rather, machines with human levels of intelligence could end up being enormously influential simply if the amount of computer hardware globally were able to operate tens or hundreds of billions of them, in a sense making machine intelligences a majority of the global population, or at least a majority of global thought.

As Holden explains, he sees four key parts to the playbook humanity should use to guide the transition to very advanced AI in a positive direction: alignment research, standards and monitoring, creating a successful and careful AI lab, and finally, information security.

In today’s episode, host Rob Wiblin interviews return guest Holden Karnofsky about that playbook, as well as:

Why we can’t rely on just gradually solving those problems as they come up, the way we usually do with new technologies.
What multiple different groups can do to improve our chances of a good outcome — including listeners to this show, governments, computer security experts, and journalists.
Holden’s case against ‘hardcore utilitarianism’ and what actually motivates him to work hard for a better world.
What the ML and AI safety communities get wrong in Holden’s view.
Ways we might succeed with AI just by dumb luck.
The value of laying out imaginable success stories.
Why information security is so important and underrated.
Whether it’s good to work at an AI lab that you think is particularly careful.
The track record of futurists’ predictions.
And much more.

Get this episode by subscribing to our podcast on the world’s most pressing problems and how to solve them: type ‘80,000 Hours’ into your podcasting app. Or read the transcript below.

Producer: Keiran Harris
Audio Engineering Lead: Ben Cordell
Technical editing: Simon Monsour and Milo McGuire
Transcriptions: Katy Moore

Highlights

Explosively fast progress

Holden Karnofsky: One of the reasons I’m so interested in AI safety standards is because kind of no matter what risk you’re worried about, I think you hopefully should be able to get on board with the idea that you should measure the risk, and not unwittingly deploy AI systems that are carrying a tonne of the risk, before you’ve at least made a deliberate informed decision to do so. And I think if we do that, we can anticipate a lot of different risks and stop them from coming at us too fast. “Too fast” is the central theme for me.
You know, a common story in some corners of this discourse is this idea of an AI that’s this kind of simple computer program, and it rewrites its own source code, and that’s where all the action is. I don’t think that’s exactly the picture I have in mind, although there’s some similarities.
The kind of thing I’m picturing is maybe more like a months or years time period from getting sort of near-human-level AI systems — and what that means is definitely debatable and gets messy — but near-human-level AI systems to just very powerful ones that are advancing science and technology really fast. And then in science and technology — at least on certain fronts that are the less bottlenecked fronts– you get a huge jump. So I think my view is at least somewhat more moderate than Eliezer’s, and at least has somewhat different dynamics.
But I think both points of view are talking about this rapid change. I think without the rapid change, a) things are a lot less scary generally, and b) I think it is harder to justify a lot of the stuff that AI-concerned people do to try and get out ahead of the problem and think about things in advance. Because I think a lot of people sort of complain with this discourse that it’s really hard to know the future, and all this stuff we’re talking about about what future AI systems are going to do and what we have to do about it today, it’s very hard to get that right. It’s very hard to anticipate what things will be like in an unfamiliar future.
When people complain about that stuff, I’m just very sympathetic. I think that’s right. And if I thought that we had the option to adapt to everything as it happens, I think I would in many ways be tempted to just work on other problems, and in fact adapt to things as they happen and we see what’s happening and see what’s most needed. And so I think a lot of the case for planning things out in advance — trying to tell stories of what might happen, trying to figure out what kind of regime we’re going to want and put the pieces in place today, trying to figure out what kind of research challenges are going to be hard and do them today — I think a lot of the case for that stuff being so important does rely on this theory that things could move a lot faster than anyone is expecting.
I am in fact very sympathetic to people who would rather just adapt to things as they go. I think that’s usually the right way to do things. And I think many attempts to anticipate future problems are things I’m just not that interested in, because of this issue. But I think AI is a place where we have to take the explosive progress thing seriously enough that we should be doing our best to prepare for it.
Rob Wiblin: Yeah. I guess if you have this explosive growth, then the very strange things that we might be trying to prepare for might be happening in 2027, or incredibly soon.
Holden Karnofsky: Something like that, yeah. It’s imaginable, right? And it’s all extremely uncertain because we don’t know. In my head, a lot of it is like there’s a set of properties that an AI system could have: roughly being able to do roughly everything humans are able to do to advance science and technology, or at least able to advance AI research. We don’t know when we’ll have that. One possibility is we’re like 30 years away from that. But once we get near that, things will move incredibly fast. And that’s a world we could be in. We could also be in a world where we’re only a few years from that, and then everything’s going to get much crazier than anyone thinks, much faster than anyone thinks.

AI population explosion

Rob Wiblin: I think some people are sceptical of this superintelligence story, because they think you get really declining returns to being smarter, and that there’s some ways in which it just doesn’t matter how smart you are, the world is too unpredictable for you to come up with a great plan. But this is a different mechanism by which you can get the same outcome, which is just that you have this enormous increase in the number of thoughts that are occurring on computer chips, more or less. And at some point, 99% of the thoughts that are happening on Earth could basically be occurring inside artificial intelligences. And then as they get better, and they’re able to make more chips more quickly, again, basically just gets a population explosion.
Holden Karnofsky: Yeah. That’s exactly right. And I think this is a place where some people get a little bit rabbitholed on the AI debates, because I think there’s a lot of room to debate how big a deal it is to have something that’s “extremely smart” or “superintelligent” or “much smarter than human.” It’s like, maybe if you had something that was like a giant brain or something, and way way smarter (whatever that means) than us, maybe what that would mean is that it would instantly see how to make all these super weapons and conquer the world and how to convince us of anything. There’s all this stuff that that could mean, and people debate whether it could mean that, but it’s uncertain.
And I think a thing that’s a lot less uncertain — if you’re finding yourself sceptical of what this “smart” idea means, and where it’s going to go, and what you can do with it — if you find yourself sceptical of that, then just forget about it. I believe you can make the entire case for being extremely concerned about AI, assuming that AI will never be smarter than a human. Instead, it will be as capable as the most capable humans, and there will be a tonne of them — because unlike humans, you can just copy them. You can use your copies to come up with ways to make it more efficient, just like humans do, then you can make more copies.
And when we talk about whether AI could defeat humanity — and I’ve written one blog post on whether AI could kind of take over the world — they don’t have to be more capable than humans. They could be equally capable, and there could be more of them. That could really do it. That could really be enough that then humans wouldn’t be in control of the world anymore.
So I’m basically generally happy to just have all discussions about AI and what the risks are, just in this world where there’s nothing more capable than a human — but it’s pretty scary to have a lot of those that have different values from humans, and that are kind of a second advanced species. That’s not to rule out that some of these superintelligence concerns could be real. It’s just that they’re not always necessary, and they can sideline people.
Rob Wiblin: Yeah. You can just get beaten by force of numbers more or less. I think it’s a little bit of a shame that this sheer numbers argument hasn’t really been made very much. It feels like the superintelligence story has been very dominant in the narrative and media, and many people get off the boat because they’re sceptical of this intelligence thing. I think it kind of is the fault of me and maybe people who’ve been trying to raise the alarm about this, because the focus really has been on the superintelligence aspect rather than the super-numerousness that you could get.
Holden Karnofsky: And I don’t know. I mean, I think there’s valid concerns from that angle for sure, and I’m not trying to dismiss it. But I think there’s a lot of uncertainty about what superintelligence means and where it could go. And I think you can raise a lot of these concerns without needing to have a settled view there.

Misaligned AI might not kill us all, and aligned AI could be catastrophic

Holden Karnofsky: A vibe I pick up is this kind of framework that says, “If we don’t align our AIs, we’re all going to die. And if we can align our AIs, that’s great, and we’ve solved the problem.” And that’s the problem we should be thinking about, and there’s nothing else really worth worrying about. It’s kind of like alignment is the whole game, would be the hypothesis.
And I disagree with both ends of that, especially the latter. So to take the first end — if we don’t align AI, we’re all dead — first off, I just think it’s really unclear. Even in the worst case — where you get an AI that has its own values, and there’s a huge number of them, and they kind of team up and take over the world — even then, it’s really unclear if that means we all die. I know there’s debates about this. I have tried to understand. The MIRI folks, I think, feel really strongly that clearly, we all die. I’ve tried to understand where they’re coming from, and I have not.
I think a key point is it just could be very cheap — as a percentage of resources, for example — to let humans have a nice life on Earth, and not expand further, and be cut off in certain ways from threatening the AI’s ability to do what it wants. That can be very cheap compared to wiping us all out.
And there could be a bunch of reasons one might want to do that, some of them kind of wacky. Some of them like, “Well, maybe in another part of the universe, there’s someone like the AI that was trying to design its own AI. And that thing ended up with values like the humans, and maybe there’s some kind of trade that could be made, using acausal trade” — and we don’t need to get into what all this means — or like, maybe the AI is actually being simulated by humans or something, or by some smarter version of humans, or some more powerful version of humans, and being tested to see if it’ll wipe out the humans or be nice to them. It’s just like, you don’t need a lot of reasons to leave one planet out if you’re expanding throughout the galaxy. So that would be one thing, is that it’s just kind of uncertain what happens even in the worst case.
—-
The other part — if we do align the AI, we’re fine — I disagree with much more strongly. The first one, I mean, I think it would be really bad to have misaligned AI. And despite the feeling that I feel it is fairly overrated in some circles, I still think it’s the number one thing for me. Just the single biggest issue in AI is we’re building these potentially very powerful, very replicable, very numerous systems — and we’re building them in ways where we don’t have much insight into whether they have goals, or what the goals would be; we’re kind of introducing the second advanced species onto the planet that we don’t understand. And if that advanced species becomes more numerous and/or more capable than us, we don’t have a great argument to think that’s going to be good for us. So I’m on board with alignment risk is the number one thing — not the only thing, but the number one thing.
But I would say, if you just assume that you have a world of very capable AIs, that are doing exactly what humans want them to do, that’s very scary. And I think if that was the world we knew we were going to be in, I would still be totally full time on AI, and still feel that we had so much work to do and we were so not ready for what was coming.
Certainly, there’s the fact that because of the speed at which things move, you could end up with whoever kind of leads the way on AI, or is least cautious, having a lot of power — and that could be someone really bad. And I don’t think we should assume that just because that if you had some head of state that has really bad values, I don’t think we should assume that that person is going to end up being nice after they become wealthy, or powerful, or transhuman, or mind uploaded, or whatever — I don’t think there’s really any reason to think we should assume that.
And then I think there’s just a bunch of other things that, if things are moving fast, we could end up in a really bad state. Like, are we going to come up with decent frameworks for making sure that the digital minds are not mistreated? Are we going to come up with decent frameworks for how to ensure that as we get the ability to create whatever minds we want, we’re using that to create minds that help us seek the truth, instead of create minds that have whatever beliefs we want them to have, stick to those beliefs and try to shape the world around those beliefs? I think Carl Shulman put it as, “Are we going to have AI that makes us wiser or more powerfully insane?”
So I think there’s just a lot. I think we’re on the cusp of something that is just potentially really big, really world-changing, really transformative, and going to move way too fast. And I think even if we threw out the misalignment problem, we’d have a lot of work to do — and I think a lot of these issues are actually not getting enough attention.

Getting the AIs to do our alignment homework for us

Holden Karnofsky: It’s this idea that once you have human levelish AI systems, you have them working on the alignment problem in huge numbers. And in some ways, I hate this idea, because this is just very lazy, and it’s like, “Oh, yeah. We’re not going to solve this problem until later, when the world is totally crazy, and everything’s moving really fast, and we have no idea what’s going to happen.”
Rob Wiblin: “We’ll just ask the agents that we don’t trust to make themselves trustworthy.”
Holden Karnofsky: Yeah. Exactly. So there’s a lot to hate about this idea. But, heck, it could work. It really could. Because you could have a situation where, just in a few months, you’re able to do the equivalent of thousands of years of humans doing alignment research. And if these systems are not at the point where they can or want to screw you up, that really could do it. I mean, we just don’t know that thousands of years of human levelish alignment research isn’t enough to get us a real solution. And so that’s how you get through a lot of it.
And then you still have another problem in a sense, which is that you do need a way to stop dangerous systems. It’s not enough to have safe AI systems. But again, you have help from this giant automated workforce. And so in addition to coming up with ways to make your system safe, you can come up with ways of showing that they’re dangerous and when they’re dangerous and being persuasive about the importance of the danger.
I don’t know. I feel like if we had a 100 years before AGI right now, there’d be a good chance that normal flesh-and-blood humans could pull this off. So in that world, there’s a good chance that an automated workforce can cause it to happen pretty quickly, and you could pretty quickly get an understanding of the risks, agreement that we need to stop them. And you have more safe AIs than dangerous AIs, and you’re trying to stop the dangerous AIs. And you’re measuring the dangerous AIs, or you’re stopping any AI that refuses to be measured or whose developer refuses to measure it.
So then you have a world that’s kind of like this one, where there’s a lot of evil people out there, but they are generally just kept in check by being outnumbered by people who are at least law abiding, if not incredibly angelic. So you get a world that looks like this one, but it just has a lot of AIs running around in it, so we have a lot of progress in science and technology. And that’s a fine ending, potentially.

The value of having a successful and careful AI lab

Holden Karnofsky: First with the reminder that I’m married to the president of Anthropic, so take that for what it’s worth.
I just think there’s a lot of ways that if you had an AI company that was on the frontier, that was succeeding, that was building some of the world’s biggest models, that was pulling in a lot of money, and that was simultaneously able to really be prioritising risks to humanity, it’s not too hard to think of a lot of ways good can come with that.
Some of them are very straightforward. The company could be making a lot of money, raising a lot of capital, and using that to support a lot of safety research on frontier models. So you could think of it as a weird kind of earning to give or something. Also probably that AI company would be pretty influential in discussions of how AI should be regulated and how people should be thinking of AI: they could be a legitimiser, all that stuff. I think it’d be a good place for people to go and just skill up, learn more about AI, become more important players.
I think in the short run, they’d have a lot of expertise in-house, they could work on a lot of problems, probably to design ways of measuring whether an AI system is dangerous. One of the first places you’d want to go for people who’d be good at that would be a top AI lab that’s building some of the most powerful models. So I think there’s a lot of ways they could do good in the short run.
And then I have written stories that just have it in the long run. When we get these really powerful systems, it actually does matter a lot who has them first and what they’re literally using them for. When you have very powerful AIs, is the first thing you’re using them for trying to figure out how to make future systems safe or trying to figure out how to assess the threats of future systems? Or is the first thing you’re using them for just trying to rush forward as fast as you can, do faster algorithms, do more bigger systems? Or is the first thing you’re using them for just some random economic thing that is kind of cool and makes a lot of money?
Rob Wiblin: Some customer-facing thing. Yeah.
Holden Karnofsky: Yeah. And it’s not bad, but it’s not reducing the risk we care about. So I think there is a lot of good that can be done there.
And then there’s also — I want to be really clear here — a lot of harm an AI company could do, if you’re pushing out these systems.
Rob Wiblin: Kill everyone.
Holden Karnofsky: That kind of thing, yeah. For example. You know, you’re pushing out these AI systems, and if you’re doing it all with an eye toward profit and moving fast and winning, then you could think of it as you’re taking the slot of someone who could have been using that expertise and money and juice to be doing a lot of good things. You could also think of it as you’re just giving everyone less time to figure out what the hell is going on, and we already might not have enough. So I want to be really clear. This is a tough one. I don’t want to be interpreted as saying that one of the tentpoles of reducing AI risk is to go start an AI lab immediately — I don’t believe that.
But I also think that some corners of the AI safety world are very dismissive, or just think that AI companies are bad by default. And this is just really complicated, and it really depends exactly how the AI lab is prioritising risk to society versus success — and it has to prioritise success some to be relevant, or to get some of these benefits. So how it’s balancing is just really hard, and really complicated, and really hard to tell, and you’re going to have to have some judgements about it. So it’s not a ringing endorsement, but it does feel like, at least in theory, part of one of the main ways that we make things better. You know, you could do a lot of good.

Why information security is so important

Holden Karnofsky: I think you could build these powerful, dangerous AI systems, and you can do a lot to try to mitigate the dangers — like limiting the ways they can be used, you can do various alignment techniques — but if some state or someone else steals the weights, they’ve basically stolen your AI system, and they can run it without even having to do the training run. So you might spend a huge amount of money on a training run, end up with this AI system that’s very powerful, and someone else just has it. And they can then also fine-tune it, which means they can do their own training on it and change the way it’s operating. So whatever you did to train it to be nice, they can train that right out; the training they do could screw up whatever you did to try and make it aligned.
And so I think at the limit of ‘it’s really just trivial for any state to just grab your AI system and do whatever they want with it and retrain it how they want’, it’s really hard to imagine feeling really good about that situation. I don’t know if I really need to elaborate a lot more on that. So making it harder seems valuable.
This is another thing where I want to say, as I have with everything else, that it’s not binary. So it could be the case that, after you improve your security a lot, it’s still possible for a state actor to steal your system, but they have to take more risks, they have to spend more money, they have to take a deeper breath before they do it. It takes them more months. Months can be a very big deal. As I’ve been saying, when you get these very powerful systems, you could do a lot in a few months. By the time they steal it, you could have a better system. So I don’t think it’s an all-or-nothing thing.
But no matter what risk of AI you’re worried about — you could be worried about the misalignment; you could be worried about the misuse and the use to develop dangerous weapons; you can be worried about more esoteric stuff, like how the AI does decision theory; you could be worried about mind crime — you don’t want just anyone, including some of these state actors who may have very bad values, to just be able to steal a system, retrain it how they want, and use it how they want. You want some kind of setup where it’s the people with good values controlling more of the more powerful AI systems, using them to enforce some sort of law and order in the world, and enforcing law and order generally — with or without AI. So it seems quite robustly important.
Other things about security is that I think it’s very, very hard, just very hard to make these systems hard to steal for a state actor, and so there’s just a tonne of room to go and make things better. There could be security research on innovative new methods, and there can also be a lot of blocking and tackling — just getting companies to do things that we already know need to be done, but that are really hard to do in practice, take a lot of work, take a lot of iteration. Also, a nice thing about security, as opposed to some of these other things: it is a relatively mature field, so you can learn about security in some other context and then apply it to AI.
Part of me kind of thinks that the EA community or whatever kind of screwed up by not emphasising security more. It’s not too hard for me to imagine a world where we’ve just been screaming about the AI security problem for the last 10 years, and how do you stop a very powerful system from getting stolen? That problem is extremely hard. We’ve made a bunch of progress on it. There were tonnes of people concerned about this stuff on the security teams of all the top AI companies, and were not as active and only had a few people working on alignment.
I don’t know. Is that world better or worse than this one? I’m not really sure. A world where we’re more balanced, and had encouraged people who are a good fit for one to go into one, probably seems just better than the world we’re in. So yeah, I think security is a really big deal. I think it hasn’t gotten enough attention.

Holden vs hardcore utilitarianism

Rob Wiblin: So maybe a way of highlighting the differences here will be to imagine this conversation, where you’re saying, “I’m leading Open Philanthropy. I think that we should split our efforts between a whole bunch of different projects, each one of which would look exceptional on a different plausible worldview.” And the hardcore utilitarian comes to you and says, “No, you should choose the best one and fund that. Spend all of your resources and all of your time, just focus on that best one.” What would you say to them in order to justify the worldview diversification approach?
Holden Karnofsky: The first thing I would say to them is just, “The burden’s on you.” And this is kind of a tension I often have with people who consider themselves hardcore: they’ll just feel like, well, why wouldn’t you be a hardcore utilitarian? Like, what’s the problem? Why isn’t it just maximising the pleasure and minimising the pain, or the sum or the difference? And I would just be like, “No. No. No. You gotta tell me, because I am sitting here with these great opportunities to help huge amounts of people in very different and hard-to-compare ways.
And the way I’ve always done ethics before in my life is, like, I basically have some voice inside me and it says, “This is what’s right.” And that voice has to carry some weight. Even on your model, that voice has to carry some weight, because you — you, the hardcore utilitarian, not Rob, because we all know you’re not at all — it’s like, even the most systematic theories of ethics, they’re all using that little voice inside you that says what’s right. That’s the arbiter of all the thought experiments. So we’re all putting weight on it somewhere, somehow. And I’m like, cool. That’s gotta be how this works. There’s a voice inside me saying, “this feels right” or “this feels wrong” — that voice has gotta get some weight.
That voice is saying, “You know what? It is really interesting to think about these risks to humanity’s future, but also, it’s weird.” This work is not shaped like the other work. It doesn’t have as good feedback loops. It feels icky — like a lot of this work is about just basically supporting people who think like us, or it feels that way a lot of the time, and it just feels like it doesn’t have the same ring of ethics to it.
And then on the other hand, it just feels like I’d be kind of a jerk if… Like, Open Phil, I believe — and you could disagree with me — is not only the biggest, but the most effective farm animal welfare funder in the world. I think we’ve had enormous impact and made animal lives dramatically better. And you’re coming to say to me, “No, you should take all that money and put it into the diminishing margin of supporting people to think about some future x-risk in a domain where you mostly have a lot of these concerns about insularity.” Like, you’ve got to make the case to me — because the normal way all this stuff works is you listen to that voice inside your head, and you care what it says. And some of the opportunities Open Phil has to do a lot of good are quite extreme, and we do them.
So that’s the first thing: we’ve got to put the burden of proof in the right place.

Articles, books, and other media discussed in the show

Holden’s work:

Challenges in AI alignment and safety:

Update on ARC’s recent eval efforts: More information about ARC’s evaluations of GPT-4 and Claude by the Alignment Research Center
Model organisms of misalignment: The case for a new pillar of alignment research by Evan Hubinger, Nicholas Schiefer, Carson Denison, and Ethan Perez
Artificial intelligence as a positive and negative factor in global risk by Eliezer Yudkowsky
AI is not an arms race by Katja Grace in TIME
Modeling the human trajectory by David Roodman
Report on whether AI could drive explosive economic growth by Tom Davidson
The bitter lesson by Rich Sutton
Propositions concerning digital minds and society by Nick Bostrom

Careers in this space:

Information security:
- Career review: Information security
- EA Infosec: Skill up in or make a transition to infosec via this book club by Jason Clinton and Wim van der Schoot on the EA Forum
- Facebook group: Information Security in Effective Altruism
Career review: AI safety technical research
Career review: AI governance and coordination
Communication careers

Hardcore utilitarianism, ethics, and UDASSA:

Anthropics and the universal distribution by Joe Carlsmith — a discussion of UDASSA (Universal Distribution + Absolute Self Selection Assumption)
Base camp for Mt. Ethics by Nick Bostrom
Pareto principles in infinite ethics by Amanda Askell

Other 80,000 Hours podcast episodes:

Transcript

Table of Contents

1 Cold open [00:00:00]
2 Rob’s intro [00:01:01]
3 The interview begins [00:02:18]
4 Explosively fast progress [00:06:55]
5 AI population explosion [00:23:54]
6 Misaligned AI might not kill us all, and aligned AI could be catastrophic [00:27:18]
7 Where Holden sees overconfidence among AI safety researchers [00:35:08]
8 Where Holden disagrees with ML researchers [00:41:22]
9 What Holden really believes [00:45:40]
10 The value of laying out imaginable success stories [00:49:19]
11 Holden’s four-intervention playbook [01:06:16]
12 Standards and monitoring [01:09:20]
13 Designing evaluations [01:17:17]
14 The value of having a successful and careful AI lab [01:33:44]
15 Why information security is so important [01:39:48]
16 What governments could be doing differently [01:52:55]
17 What people with an audience could be doing differently [01:59:38]
18 Jobs and careers to help with AI [02:06:47]
19 Audience questions on AI [02:14:25]
20 Holden vs hardcore utilitarianism [02:23:33]
21 The track record of futurists seems… fine [03:02:27]
22 Parenting [03:05:48]
23 Rob’s outro [03:07:47]

Cold open [00:00:00]

Holden Karnofsky: I believe you can make the entire case for being extremely concerned about AI, assuming that AI will never be smarter than a human. Instead, it will be as capable as the most capable humans, and there will be a tonne of them — because unlike humans, you can just copy them. You can use your copies to come up with ways to make it more efficient, just like humans do, then you can make more copies.

And when we talk about whether AI could defeat humanity — and I’ve written one blog post on whether AI could kind of take over the world — they don’t have to be more capable than humans. They could be equally capable, and there could be more of them. That could really do it. That could really be enough that then humans wouldn’t be in control of the world anymore.

So I’m basically generally happy to just have all discussions about AI and what the risks are, just in this world where there’s nothing more capable than a human — but it’s pretty scary to have a lot of those that have different values from humans, and that are kind of a second advanced species. That’s not to rule out that some of these superintelligence concerns could be real. It’s just that they’re not always necessary, and they can sideline people.

Rob’s intro [00:01:01]

Rob Wiblin: People always love an interview with Holden, cofounder of Open Philanthropy. We last spoke with him in 2021 about his theory that we’re plausibly living in the most important century of all those yet to come.

Today we discuss things that have been preoccupying him lately:

What are the real challenges raised by rapid AI advances?
Why not just gradually solve those problems as they come up?
What can multiple groups do about it — including listeners to this show, governments, computer security experts, journalists, and so on.
What do various different groups get wrong in Holden’s view?
How might we succeed with AI by luck?
Holden’s four categories of useful work to help with AI.
Plus a few random audience questions.

At the end we talk through why Holden rejects impartiality as a core principle of morality — and his non-realist conception of why he bothers to try to help others.

After the interview, I also respond to some reactions we got to the previous interview with Ezra Klein.

Without further ado, I bring you Holden Karnofsky.

The interview begins [00:02:18]

Rob Wiblin: Today, I’m again speaking with Holden Karnofsky. In 2007, Holden cofounded the charity evaluator GiveWell, and then in 2014, he cofounded the foundation Open Philanthropy, which works to find the highest-impact grant opportunities, and has so far recommended around $2 billion in grants.

But in March 2023, Holden started a leave of absence from Open Philanthropy to instead explore working directly on AI safety and ways of improving outcomes from recent advances in AI. He also blogs at cold-takes.com about futurism, quantitative macro history, and epistemology — though recently, he’s had again a particular focus on AI, writing posts like “How we could stumble into AI catastrophe,” “Racing through a minefield: The AI deployment problem,” and “Jobs that can help with the most important century.”

I should note that Open Philanthropy is 80,000 Hours’ biggest supporter, and that Holden’s wife is Daniela Amodei, the president of the AI lab Anthropic.

Thanks for returning to the podcast, Holden.

Holden Karnofsky: Thanks for having me.

Rob Wiblin: I hope to talk about what you would most like to see people do to positively shape the development of AI, as well as your reservations about utilitarianism. But first, what are you working on at the moment, and why do you think it’s important?

Holden Karnofsky: Currently on a leave of absence from Open Philanthropy, just taking a little time to explore different ways I might be able to do direct work to reduce potential risks from advanced AI.

One of the main things I’ve been trying to do recently, although there’s a couple things I’m looking into, is understanding what it might look like to have AI safety standards — by which I mean sort of documenting expectations that AI companies and other labs won’t build and deploy AIs that pose too much risk to the world, as evaluated by some sort of systematic evaluation regime. These expectations could be done via self-regulation, via regulation regulation; there’s a lot of potential interrelated pieces.

So to make this work, I think you would need ways of evaluating when an AI system is dangerous — that’s sometimes called “evals.” Then you would need potential standards that would basically talk about how you connect what you’re seeing from the evals with what kind of measures you have to take to ensure safety. Then also, to make this whole system work, you would need some way of making the case to the general public that standards are important, so companies are more likely to adhere to them.

And so things I’ve been working on include trying to understand what standards look like in more mature industries — doing one case study and trying to fund some other ones there — trying to learn lessons from what’s already worked in the past elsewhere; I’ve been advising groups like ARC Evals, thinking about what evals and standards could look like; and I’ve been trying to understand what pieces of the case for standards are and are not resonating with people, so I could think about how to increase public support for this kind of thing. So it’s pretty exploratory right now, but that’s been one of the main things I’ve been thinking about.

Rob Wiblin: I guess there’s a huge surface area of ways one might attack this problem. Why the focus on standards and evals in particular?

Holden Karnofsky: Sure. We can get to this, but I have been thinking a lot about what the major risks are from advanced AI, and what the major ways of reducing them are. I think there’s a few different components that seem most promising to me, as part of most stories I can tell in my head for how we get the risks down very far. And this is the piece of the puzzle that to me feels like it has a lot of potential, but there isn’t much action on it right now, and that someone with my skills can potentially play a big role helping to get things off the ground — helping to spur a bit more action, getting people to just move a little faster.

I’m not sure this is what I want to be doing for the long run, but I think that it’s in this kind of nascent phase, where my background with just starting up things in very vaguely defined areas and getting to the point where they’re a little bit more mature is maybe helpful.

Rob Wiblin: Yeah. How’s it been no longer leading an organisation with tonnes of employees and being able to self-direct a little bit more? You wouldn’t have been in that situation for quite some time, I guess.

Holden Karnofsky: Yeah. I mean, it was somewhat gradual. We’ve been talking for several years now, at least since 2018 or so, about succession at Open Philanthropy — because I’ve always been a person who likes to start things and likes to be in there in that nascent phase, and prefers always to find someone else who can run an organisation for the long run. And you know, a couple years ago, Alexander Berger became co-CEO with me and has taken on more and more duties.

So it wasn’t an overnight thing, and I’m still not completely uninvolved: I’m still on the board of Open Philanthropy, I’m still meeting with people, I’m still advising. You know, kind of similar to GiveWell: I had a very gradual transition away from GiveWell, and still talk to them frequently. So it’s been a gradual thing. But for me, it is an improvement. I think it’s not my happy place to be at the top of that org chart.

Explosively fast progress [00:06:55]

Rob Wiblin: OK, so today, we’re not going to rehash the basic arguments for worrying about ways that AI advances could go wrong, or ways that maybe this century could turn out to be really unusually important: I think, AI risk, people have heard of it now. We’ve done lots of episodes on it. People who wanted to hear your broader worldview on this go back to previous interviews including with you, such as episode #109: Holden Karnofsky on the most important century.

In my mind, the risks are both pretty near term and I think increasingly kind of apparent. So to me, it feels like the point in this whole story where we need to get down a bit more to brass tacks, and start debating what is to be done and figuring out what things might really help that we could get moving on.

That said, we should take a minute to think about which aspect of this broader problem we are talking about today, and which one you’re thinking about. Of course, there’s risks from “misalignment”: so AI models completely flipping out and trying to take over would be an extreme case of that. Then there’s “misuse”: where the models are doing what people are telling them to do, but we wish that they weren’t perhaps. And then there’s other risks, like just speeding up history, causing a whole lot of stuff to happen incredibly quickly, and perhaps that leading us into disaster.

Which of the aspects of this broader problem do you think of yourself as trying to contribute to solve now?

Holden Karnofsky: Yeah. First off, to your point, I am happy to focus on solutions, but I do think it can be hard to have a really good conversation about solutions without having some kind of shared understanding of the problem. And I think while a lot of people are getting vaguely scared about AI, I think there’s still a lot of room to disagree on what exactly the most important aspects of the problem are, what exactly the biggest risks are.

For me, the two you named — misalignment and misuse — are definitely big. I would throw some others in there too that I think are also big. I think we may be on the cusp of having a lot of things work really differently about the world, and in particular, having what you might think of as new life forms — whether that’s AIs, or I’ve written in the past on Cold Takes about digital people: that if we had the right technology, which we might be able to develop with AIs’ help, we might have simulations of humans that we ought to think of as kind of humans like us. And that could lead to a lot of challenges. Just the fact, for example, that you could have human rights abuses happening inside a computer seems like a very strange situation that society has not really dealt with before.

And I think there’s a bunch of other things like that. What kind of world do we have when someone can just make copies of people, or of minds, and ensure that those copies believe certain things and defend certain ideas? That, I think, could challenge the way a lot of our existing institutions work. There’s a nice piece “Propositions concerning digital minds and society” that I think gives a flavour for this. So I think there’s a whole bunch of things I would point to as important.

In this category, if I had to name one risk that I’m most focused on, it’s probably the misaligned AI risk. It’s probably the one about how you build these very capable, very powerful AI systems, and they’re the systems that if, for whatever reason, they were pointed at bringing down all human civilisation, they could. And then something about your training is kind of sloppy or leads to unintended consequences, so that you actually do have AIs trying to bring down civilisation. I think that is probably the biggest one.

But I think there’s also a meta-threat that to me is really the unifying catastrophic risk of AI. For me, that I would abbreviate as just “explosively fast progress.” So the central idea of the “most important century” series that I wrote is that if you get an AI with certain properties, there’s a bunch of reasons from economic theory, from economic history — I think we’re also putting together some reasons now that you can take more from the specifics of how AI works and how algorithms development works — to expect that you could get a dramatic acceleration in the rate of change. And particularly in the rate of scientific and technological advancement — particularly in the rate of AI advancement itself, so that things move on a much faster time scale than anyone is used to.

And one of the central things I say in the most important century series is that if you imagine a wacky sci-fi future — the kind of thing you would imagine thousands of years from now for humanity, with all these wacky technologies — that might actually be years or months from the time when you get in range of these super-powerful AIs that have certain properties. That, to me, is the central problem. And I think all these other risks that we’re talking about wouldn’t have the same edge to them if it weren’t for that.

So misaligned AI: if AI systems got very gradually more powerful, and we spent a decade with systems that were kind of close to as capable as humans, but not really, and then a decade with systems that were about as capable as humans, with some strengths and weaknesses, then a decade of systems a little more capable… You know, I wouldn’t really be that worried. I feel like this is something we could kind of adapt to as we went, and figure out as we went along.

Similarly with misuse: AI systems might end up able to help develop powerful technologies that are very scary, but that wouldn’t be as big a deal — it would be kind of a continuation of history if this just went in a gradual way.

My big concern is that it’s not gradual. I think it may be worth digging on that a little bit more — exactly how fast do I mean and why, even though I have covered it somewhat in the past — because that, to me, is really the central issue.

And one of the reasons I’m so interested in AI safety standards is because kind of no matter what risk you’re worried about, I think you hopefully should be able to get on board with the idea that you should measure the risk, and not unwittingly deploy AI systems that are carrying a tonne of the risk, before you’ve at least made a deliberate informed decision to do so. And I think if we do that, we can anticipate a lot of different risks and stop them from coming at us too fast. “Too fast” is the central theme for me.

Rob Wiblin: Yeah. It’s very interesting framing to put the speed of advancement front and centre as this is the key way that this could go off the rails in all sorts of different directions.

Eliezer Yudkowsky has this kind of classic story about how you get an AI taking over the world remarkably quickly. And a key part of the story, as he tells it, is this sudden self-improvement loop where the AI gets better at doing AI research and then improves itself, and then it’s better at doing that again — so you get this recursive loop, where suddenly you go from somewhat human-level intelligence to something that’s very, very superhuman. I think many people reject that primarily because they reject the speed idea: that they think, “Yes, if you got that level of advancement over a period of days, sure, that might happen. But actually, I just don’t expect that recursive loop to be quite so quick.”

And likewise, we might worry that AI might be used by people to make bioweapons. But if that’s something that gradually came online over a period of decades, we probably have all kinds of responses that we could use to try to prevent that. But if it goes from one week to the next, they’re in a tricky spot.

Do you want to expand on that? Are there maybe insights that come out of this speed-focused framing of the problem that people aren’t taking quite seriously enough?

Holden Karnofsky: Yeah. I should first say I don’t know that I’m on the same page as Eliezer. I can’t always tell, but I think he is picturing probably a more extreme and faster thing than I’m picturing, and probably for somewhat different reasons. You know, a common story in some corners of this discourse is this idea of an AI that’s this kind of simple computer program, and it rewrites its own source code, and that’s where all the action is. I don’t think that’s exactly the picture I have in mind, although there’s some similarities.

The kind of thing I’m picturing is maybe more like a months or years time period from getting sort of near-human-level AI systems — and what that means is definitely debatable and gets messy — but near-human-level AI systems to just very powerful ones that are advancing science and technology really fast. And then in science and technology — at least on certain fronts that are the less bottlenecked fronts, and we could talk about bottlenecks in a minute — you get a huge jump. So I think my view is at least somewhat more moderate than Eliezer’s, and at least has somewhat different dynamics.

But I think both points of view are talking about this rapid change. I think without the rapid change, a) things are a lot less scary generally, and b) I think it is harder to justify a lot of the stuff that AI-concerned people do to try and get out ahead of the problem and think about things in advance. Because I think a lot of people sort of complain with this discourse that it’s really hard to know the future, and all this stuff we’re talking about about what future AI systems are going to do and what we have to do about it today, it’s very hard to get that right. It’s very hard to anticipate what things will be like in an unfamiliar future.

When people complain about that stuff, I’m just very sympathetic. I think that’s right. And if I thought that we had the option to adapt to everything as it happens, I think I would in many ways be tempted to just work on other problems, and in fact adapt to things as they happen and we see what’s happening and see what’s most needed. And so I think a lot of the case for planning things out in advance — trying to tell stories of what might happen, trying to figure out what kind of regime we’re going to want and put the pieces in place today, trying to figure out what kind of research challenges are going to be hard and do them today — I think a lot of the case for that stuff being so important does rely on this theory that things could move a lot faster than anyone is expecting.

I am in fact very sympathetic to people who would rather just adapt to things as they go. I think that’s usually the right way to do things. And I think many attempts to anticipate future problems are things I’m just not that interested in, because of this issue. But I think AI is a place where we have to take the explosive progress thing seriously enough that we should be doing our best to prepare for it.

Rob Wiblin: Yeah. I guess if you have this explosive growth, then the very strange things that we might be trying to prepare for might be happening in 2027, or incredibly soon.

Holden Karnofsky: Something like that, yeah. It’s imaginable, right? And it’s all extremely uncertain because we don’t know. In my head, a lot of it is like there’s a set of properties that an AI system could have: roughly being able to do roughly everything humans are able to do to advance science and technology, or at least able to advance AI research. We don’t know when we’ll have that. One possibility is we’re like 30 years away from that. But once we get near that, things will move incredibly fast. And that’s a world we could be in. We could also be in a world where we’re only a few years from that, and then everything’s going to get much crazier than anyone thinks, much faster than anyone thinks.

Rob Wiblin: One narrative is that it’s going to be exceedingly difficult to align any artificial intelligence because, you know, you have to solve these 10 technical problems that we’ve almost gotten no traction on so far, so it could take us decades or centuries in order to fix them. On this speed-focused narrative, it actually seems a little bit more positive, because it might turn out that from a technical standpoint, this isn’t that challenging. The problem will be that things are going to run so quickly that we might only have a few months to figure out what solution we’re choosing, and actually try to apply it in practice.

But of course, inasmuch as we just need to slow down, that is something that, in theory at least, people could agree and actually just and try to coordinate in order to do. Do you think that’s going to be a part of the package? That we ideally just want to coordinate people as much as possible to make this as gradual as feasible?

Holden Karnofsky: Well, these are separate points: I think you could believe in the speed and also believe the alignment problem is really hard. Believing in speed doesn’t make the alignment problem any easier. And I think that the speed point is really just bad news. I mean, I hope things don’t move that fast. If things move that fast, I think most human institutions’ ways of reacting to things, we just can’t count on them to work the way they normally do, and so we’re going to have to do our best to get out ahead of things, and plan ahead, and make things better in advance as much as we can. And it’s mostly just bad news.

There’s a separate thing, which is that I do feel less convinced than some other people that the alignment problem is this incredibly hard technical problem, and more feel like if we did have a relatively gradual set of developments, even with the very fast developments, I think there’s a good chance we just get lucky and we’re fine.

So I think they’re two different axes. I know you’ve talked with Tom Davidson about this a bunch, so I don’t want to make it the main theme of the episode, but I do think, in case someone hasn’t listened to every 80k podcast ever, just getting a little more into the why — of why you get such an explosive growth and why not — I think this is a really key premise. And I think most of the rest of what I’m saying doesn’t make much sense without it, and I want to own that. I don’t want to lose out on the fact that I am sympathetic to a lot of reservations about working on AI risk. So maybe it would be good to cover that a bit.

Rob Wiblin: Yeah, let’s do that. So one obvious mechanism by which things could speed up is that you have this positive feedback loop where the AIs get better at improving themselves. Is there much more to the story than that?

Holden Karnofsky: Yeah, I think it’s worth recapping that briefly. One observation I think is interesting — and this is a report by David Roodman for Open Philanthropy that goes through this — one thing that I’ve wondered is just like, if you take the path of world economic growth throughout history, and you just extrapolate it forward in the simplest way you can, what do you get? And it depends what time period you’re looking at. If you look at economic history since 1900 or 1950, we’ve had a few percent a year growth over that entire time. And if you extrapolate it forward, you get a few percent a year growth, and you just get the world everyone is kind of already expecting, and the world that’s in the UN projections, all that stuff.

The interesting thing is if you zoom out and look at all of economic history, you see that economic progress for most of history, not recently, has been accelerating. And if you try to model that acceleration in a simple way, and project that out in a simple way, you get basically the economy going to infinite size sometime this century — which is a wild thing to get from a simple extrapolation.

I think a question is: Why is that, and why might it not be? And I think the basic framework I have in mind is that there is a feedback loop you can have in the economy, where people have new ideas, new innovations, and that makes them more productive. Once they’re more productive, they have more resources. Once they have more resources, they have more kids. There’s more population or there’s fewer deaths, and there’s more people. So it goes: more people, more ideas, more resources, more people, more ideas, more resources.

And when you have that kind of feedback loop in place, any economic theory will tell you that you get what’s called “superexponential growth,” which is growth that’s accelerating on an exponential basis. And that kind of growth is very explosive, is very hard to track, and can go to infinity in finite time. The thing that changed a couple hundred years ago is that one piece of that feedback loop stopped for humanity: people basically stopped turning more resources into more people. So right now, when people get richer, they don’t have more kids. They just get richer.

Rob Wiblin: They buy another car.

Holden Karnofsky: Yeah, exactly. So that feedback loop kind of broke. That’s not like a bad thing that it broke, I don’t think, but it kind of broke. And so we’ve had just what’s called “normal exponential growth,” which is still fast, but which is not the same thing — it doesn’t have the same explosiveness to it.

And the thing that I think is interesting and different about AI is that if you get AI to the point where it’s doing the same thing humans do to have new ideas to improve productivity — so this is like the science and invention part — then you can turn resources into AIs in this very simple linear way that you can’t do with humans. And so you could get an AI feedback loop.

And just to be a little more specific about what it might look like: right now AI systems are getting a lot more efficient — you can do a lot more with the same amount of compute than you could 10 years ago. Actually, a dramatic amount more: I think various measurements of this are something like you can get the same performance for something like 18x or 15x less compute compared to a few years ago, maybe a decade ago.

Why is that? It’s because there’s a bunch of human beings who have worked on making AI algorithms more efficient. So to me, the big scary thing is when you have an AI that does whatever those human beings were doing. And there’s no particular reason you couldn’t have that, because what those human beings were doing, as far as I know, was mostly kind of sitting at computers, thinking of stuff, trying stuff. There’s no particular reason you couldn’t automate that.

Once you automate that, here’s the scary thing. You have a bunch of AIs. You use those AIs to come up with new ideas to make your AIs more efficient. Then let’s say that you make your AIs twice as efficient: well, now you have twice as many AIs. And so if having twice as many AIs can make your AIs twice as efficient again, there’s really no telling where that ends. Tom Davidson did a bunch of analysis of this, and I’m still kind of poking at it and thinking about it, but I think there’s at least a decent chance that is the kind of thing that leads to explosive progress — where AI could really take off and get very capable very fast — and you can extend that somewhat to other areas of science.

And you know, some of this will be bottlenecked. Some of this will be like, you can only move so fast, because you have to do a bunch of experiments in the real world; you have to build a bunch of stuff. And I think some of it will only be a little bottlenecked or only be somewhat bottlenecked. And I think there are some feedback loops just going from, like, you get more money. You’re able to — quickly, with automated factories — build more stuff like solar panels. You get more energy, and then you get more money, and then you’re able to do that again. And it’s in that loop, you have this part where you’re making everything more efficient all the time.

And I’m not going into all the details here. It’s been gone into in more detail in my blog posts on the most important century, and Tom Davidson in his podcast and he continues to think about it. But that’s the basic model: we have this feedback loop that we have observed in history that doesn’t work for humans right now, but could work for AIs — where you have AIs have ideas, in some sense, that make things more efficient. When things get more efficient, you have more AIs, and that creates a feedback loop. That’s where you get your superexponential growth.

AI population explosion [00:23:54]

Rob Wiblin: So one way of describing this is talking about how the artificial intelligence becomes more intelligent, and that makes it more capable of improving its intelligence, and so it becomes super smart. But I guess the way that you’re telling it emphasises a different aspect, which is not so much that it’s becoming super smart, but that it is becoming super numerous, or that you can get effectively a population explosion.

I think some people are sceptical of this superintelligence story, because they think you get really declining returns to being smarter, and that there’s some ways in which it just doesn’t matter how smart you are, the world is too unpredictable for you to come up with a great plan. But this is a different mechanism by which you can get the same outcome, which is just that you have this enormous increase in the number of thoughts that are occurring on computer chips, more or less. And at some point, 99% of the thoughts that are happening on Earth could basically be occurring inside artificial intelligences. And then as they get better, and they’re able to make more chips more quickly, again, basically just gets a population explosion.

Holden Karnofsky: Yeah. That’s exactly right. And I think this is a place where some people get a little bit rabbitholed on the AI debates, because I think there’s a lot of room to debate how big a deal it is to have something that’s “extremely smart” or “superintelligent” or “much smarter than human.” It’s like, maybe if you had something that was like a giant brain or something, and way way smarter (whatever that means) than us, maybe what that would mean is that it would instantly see how to make all these super weapons and conquer the world and how to convince us of anything. There’s all this stuff that that could mean, and people debate whether it could mean that, but it’s uncertain.

And I think a thing that’s a lot less uncertain — if you’re finding yourself sceptical of what this “smart” idea means, and where it’s going to go, and what you can do with it — if you find yourself sceptical of that, then just forget about it. I believe you can make the entire case for being extremely concerned about AI, assuming that AI will never be smarter than a human. Instead, it will be as capable as the most capable humans, and there will be a ton ne of them — because unlike humans, you can just copy them. You can use your copies to come up with ways to make it more efficient, just like humans do, then you can make more copies.

Rob Wiblin: Yeah. You can just get beaten by force of numbers more or less. I think it’s a little bit of a shame that this sheer numbers argument hasn’t really been made very much. It feels like the superintelligence story has been very dominant in the narrative and media, and many people get off the boat because they’re sceptical of this intelligence thing. I think it kind of is the fault of me and maybe people who’ve been trying to raise the alarm about this, because the focus really has been on the superintelligence aspect rather than the super-numerousness that you could get.

Holden Karnofsky: And I don’t know. I mean, I think there’s valid concerns from that angle for sure, and I’m not trying to dismiss it. But I think there’s a lot of uncertainty about what superintelligence means and where it could go. And I think you can raise a lot of these concerns without needing to have a settled view there.

Misaligned AI might not kill us all, and aligned AI could be catastrophic [00:27:18]

Rob Wiblin: Yeah. So with Ajeya Cotra and Rohin Shah, I found it really instructive to hear from them about some common opinions that they don’t share or maybe even just regard as misunderstandings. So let’s go through a couple of those to help situate you in the space of AI ideas here.

What’s a common opinion among the community of people working to address AI risk that you personally don’t share?

Holden Karnofsky: I mean, I don’t know. I don’t always have the exact quote of whoever said what, but a vibe I pick up is this kind of framework that says, “If we don’t align our AIs, we’re all going to die. And if we can align our AIs, that’s great, and we’ve solved the problem.” And that’s the problem we should be thinking about, and there’s nothing else really worth worrying about. It’s kind of like alignment is the whole game, would be the hypothesis.

And I disagree with both ends of that, especially the latter. So to take the first end — if we don’t align AI, we’re all dead — first off, I just think it’s really unclear. Even in the worst case — where you get an AI that has its own values, and there’s a huge number of them, and they kind of team up and take over the world — even then, it’s really unclear if that means we all die. I know there’s debates about this. I have tried to understand. The MIRI folks, I think, feel really strongly that clearly, we all die. I’ve tried to understand where they’re coming from, and I have not.

I think a key point is it just could be very cheap — as a percentage of resources, for example — to let humans have a nice life on Earth, and not expand further, and be cut off in certain ways from threatening the AI’s ability to do what it wants. That can be very cheap compared to wiping us all out.

And there could be a bunch of reasons one might want to do that, some of them kind of wacky. Some of them like, “Well, maybe in another part of the universe, there’s someone like the AI that was trying to design its own AI. And that thing ended up with values like the humans, and maybe there’s some kind of trade that could be made, using acausal trade” — and we don’t need to get into what all this means — or like, maybe the AI is actually being simulated by humans or something, or by some smarter version of humans, or some more powerful version of humans, and being tested to see if it’ll wipe out the humans or be nice to them. It’s just like, you don’t need a lot of reasons to leave one planet out if you’re expanding throughout the galaxy. So that would be one thing, is that it’s just kind of uncertain what happens even in the worst case.

And then I do think there’s a bunch of in-between cases, where we have AIs that are sort of aligned with humans. An analogy that often comes up is humans in natural selection, where humans were put under pressure by natural selection to have lots of kids, or to do inclusive reproductive fitness. And we’ve invented birth control, and a lot of times we don’t have as many kids as we could, and stuff like that. But also, humans still have kids and love having kids. And a lot of humans have like 20 different reasons to have kids, and after a lot of the original ones have been knocked out by weird technologies, they still find some other reason to have kids. You know, I don’t know. Like, I found myself one day wanting kids and had no idea why, and invented all these weird reasons. So I don’t know.

It’s just not that odd to think that you could have AI systems that are pretty off-kilter from what we were trying to make them do, but it’s not like they’re doing something completely unrelated either — it’s not like they have no drives to do a bunch of stuff related to the stuff we wanted them to do.

Then you could also just have situations, especially in the early stages of all this, where you might have near-human-level AIs. So they might have goals of their own, but they might not be able to coordinate very well, or they might not be able to reliably overcome humans, so they might end up cooperating with humans a lot. We might be able to leverage that into having AI allies that help us build other AI allies that are more powerful, so we might be able to stay in the game for a long way. So I don’t know. I just think things could be very complicated. It doesn’t feel to me like if you screw up a little bit with the alignment problem, then we all die.

The other part — if we do align the AI, we’re fine — I disagree with much more strongly. The first one, I mean, I think it would be really bad to have misaligned AI. And despite the feeling that I feel it is fairly overrated in some circles, I still think it’s the number one thing for me. Just the single biggest issue in AI is we’re building these potentially very powerful, very replicable, very numerous systems — and we’re building them in ways where we don’t have much insight into whether they have goals, or what the goals would be; we’re kind of introducing the second advanced species onto the planet that we don’t understand. And if that advanced species becomes more numerous and/or more capable than us, we don’t have a great argument to think that’s going to be good for us. So I’m on board with alignment risk is the number one thing — not the only thing, but the number one thing.

But I would say, if you just assume that you have a world of very capable AIs, that are doing exactly what humans want them to do, that’s very scary. And I think if that was the world we knew we were going to be in, I would still be totally full time on AI, and still feel that we had so much work to do and we were so not ready for what was coming.

Certainly, there’s the fact that because of the speed at which things move, you could end up with whoever kind of leads the way on AI, or is least cautious, having a lot of power — and that could be someone really bad. And I don’t think we should assume that just because that if you had some head of state that has really bad values, I don’t think we should assume that that person is going to end up being nice after they become wealthy, or powerful, or transhuman, or mind uploaded, or whatever — I don’t think there’s really any reason to think we should assume that.

And then I think there’s just a bunch of other things that, if things are moving fast, we could end up in a really bad state. Like, are we going to come up with decent frameworks for making sure that the digital minds are not mistreated? Are we going to come up with decent frameworks for how to ensure that as we get the ability to create whatever minds we want, we’re using that to create minds that help us seek the truth, instead of create minds that have whatever beliefs we want them to have, stick to those beliefs and try to shape the world around those beliefs? I think Carl Shulman put it as, “Are we going to have AI that makes us wiser or more powerfully insane?”

So I think there’s just a lot. I think we’re on the cusp of something that is just potentially really big, really world-changing, really transformative, and going to move way too fast. And I think even if we threw out the misalignment problem, we’d have a lot of work to do — and I think a lot of these issues are actually not getting enough attention.

Rob Wiblin: Yeah. I think something that might be going on there is a bit of equivocation in the word “alignment.” You can imagine some people might mean by “creating an aligned AI,” it’s like an AI that goes and does what you tell it to — like a good employee or something. Whereas other people mean that it’s following the correct ideal values and behaviours, and is going to work to generate the best outcome. And these are really quite separate things, very far apart.

Holden Karnofsky: Yeah. Well, the second one, I just don’t even know if that’s a thing. I don’t even really know what it’s supposed to do. I mean, there’s something a little bit in between, which is like, you can have an AI that you ask it to do something, and it does what you would have told it to do if you had been more informed, and if you knew everything it knows. That’s the central idea of alignment that I tend to think of, but I think that still has all the problems I’m talking about. Just some humans seriously do intend to do things that are really nasty, and seriously do not intend — in any way, even if they knew more — to make the world as nice as we would like it to be.

And some humans really do intend and really do mean and really will want to say, you know, “Right now, I have these values” — let’s say, “This is the religion I follow. This is what I believe in. This is what I care about. And I am creating an AI to help me promote that religion, not to help me question it or revise it or make it better.” So yeah, I think that middle one does not make it safe. There might be some extreme versions, like, an AI that just figures out what’s objectively best for the world and does that or something. I’m just like, I don’t know why we would think that would even be a thing to aim for. That’s not the alignment problem that I’m interested in having solved.

Where Holden sees overconfidence among AI safety researchers [00:35:08]

Rob Wiblin: What’s something that some safety-focused folks that you potentially collaborate with, or at least talk to, think that they know, which you think, in fact, just nobody knows?

Holden Karnofsky: I think, in general, there’s this question in deep learning of, you train an agent on one distribution of data or reward signals or whatever, and now you’re wondering when it goes out of distribution — when it hits a new kind of environment or a new set of data — how it’s going to react to that. So this would be like: How does an AI generalise from training to out of distribution? And I think in general, people have a lot of trouble understanding this, and have a lot of trouble predicting this. And I think that’s not controversial. I think that’s known. But I think it kind of relates to some things where people do seem overly confident.

A lot of what people are doing right now with these AI models is they’re doing what’s called reinforcement learning from human feedback, where the basic idea is you have an AI that tries something, and then a human says “that was great” or “that wasn’t so good.” Or maybe the AI tries two things, and the human says which one was better. And if you do that, and that’s a major way you’re training your AI system, there’s this question of what do you get as a result of that? Do you get an AI system that is actually doing what humans want it to do? Do you get an AI system that’s doing what humans would want it to do if they knew all the facts? Do you get an AI system that is tricking humans into thinking it did what they wanted it to do? Get an AI system that’s trying to maximise paper clips? And one way to do that is to do what humans want it to do, so as far as you can tell, it’s doing what you wanted to do, but it’s actually trying to maximise paper clips. Like, which of those do you get?

I think that we just don’t know, and I see overconfidence on both sides of this. I see people saying, “We’re going to basically train this thing to do nice things, and it’ll keep doing nice things as it operates in an increasingly changing world.” And then I see people say, “We’re going to train AI to do nice things, and it will basically pick up on some weird correlate of our training and try to maximise that, and will not actually be nice.” And I’m just like, jeez. We don’t know. We don’t know.

And there’s arguments that say, wouldn’t it be weird if it came out doing exactly what we wanted it to do? Because there’s this wide space of other things that could generalise to it. I just think those arguments are kind of weak, and they’re not very well fleshed out. And there’s genuinely just a tonne of vagueness and not a good understanding of what’s going on in a neural net, and how it generalises from this kind of training.

So you know, the upshot of that is I think people are often just overconfident that AI alignment is going to be easy or hard. I think there’s people who think we’ve basically got the right framework and we’ve got to debug it. And there’s people who think this framework is doomed; it’s not going to work, we need something better. And I just don’t think either is right. I think if we just go on the default course, and we just train AIs based on what looks nice to us, that could totally go fine, and it could totally go disastrously. And I know weirdly few people who believe both those things. A lot of people seem to be overconfidently in one camp or the other on that.

Rob Wiblin: Yeah. I’m completely with you on this one. One of the things that I’ve started to believe more over the last six months is just that no one really has super strong arguments for what kind of motivational architecture these ML systems are going to develop. I suppose that’s maybe an improvement relative to where I was before, because I was a little bit more on the doomer side before.

I guess I also feel like there should be some empirical way of investigating this. You know, people have good points, that a really superintelligent, incredibly sneaky model will behave exactly the same regardless of its underlying motive. But you could try to investigate this on less intelligent, less crafty models, and surely you would be able to gain some insight into the way it’s thinking and how its motives actually cash out.

Holden Karnofsky: I think it’s really hard. It’s really hard to make generalised statements about how an AI generalises to weird new situations. And there is work going on, trying to understand this, but it’s just been hard to get anything that feels satisfyingly analogous to the problem we care about right now with AI systems and their current capabilities. And even once we do, I think there’ll be plenty of arguments that are just like, well, once AI systems are more capable than that, everything’s going to change. And an AI will generalise differently when it understands who we are and what our training is and how that works and how the world works, and it understands that it could take over the world if it wanted to. Like, that actually could cause an AI to generalise differently.

As an example, this is something I’ve written about on Cold Takes — I call it the King Lear problem. So King Lear is a Shakespearean character who has three daughters. And he asked them each to describe their love for him, and then he hands the kingdom over to the ones that he feels good about after hearing their speeches. And he just picks wrong, and then that’s too bad for him. The issue is it flips on a dime: the two daughters who are the more evil ones were doing a much better job pretending they loved him a tonne, because they knew that they didn’t have power yet, and they would get power later. So actually, their behaviour depended on their calculation of what was going to happen.

So the analogy to AI is that you might have an AI system where maybe its motivational system is trying to maximise the number of humans that are saying “good job.” (This is obviously a bit of a simplification or dramatisation.) And it kind of is understanding at all points that if it could take over the world, enslave all the humans, make a bunch of clones of them, and run them all in loops saying “good job” — if it could, then it would, and it should. But if it can’t, then maybe it should just cooperate with us and be nice.

You can have an AI system that’s running that whole calc. And humans often run that whole calc. Like, as a kid in school, I might often be thinking, “Well, if I can get away with this, then this is what I want to do. If I can’t get away with this, maybe I’ll just do what the teacher wants me to do.” So you could have AIs with that whole motivational system. So now you put them in a test environment, and you test if they’re going to be nice or try to take over the world. But in the test environment, they can’t take over the world, so they’re going to be nice. Now you’re like, “Great. This thing is safe.” You put it out in the world, now there’s a zillion of it running around. Well, now it can take over the world. So now it’s going to behave differently.

So you could have just one consistent motivational system that is fiendishly hard to do a test of how that system generalises when it has power, because you can’t generalise: you can’t test what happens when it’s no longer a test.

Where Holden disagrees with ML researchers [00:41:22]

Rob Wiblin: What’s a view that’s common among ML researchers that you disagree with?

Holden Karnofsky: You know, it depends a little bit on which ML researchers for sure. I would definitely say that I’ve been a big “bitter lesson” person since at least 2017. I got a lot of this from Dario Amodei, my wife’s brother, who is CEO of Anthropic, and I think has been very insightful.

A lot of what’s gone on in AI over the last few years is just like bigger models, more data, more training. And there’s an essay called “The bitter lesson” by an ML researcher, Rich Sutton, that just says that ML researchers keep coming in with cleverer and cleverer ways to design AI systems, and then those clevernesses keep getting obsoleted by just making the things bigger — just training them more and putting in more data.

So I’ve had a lot of arguments over the last few years. And in general, I have heard people arguing with each other that are just kind of like one side. It’s like, “Well, today’s AI systems can do some cool things, but they’ll never be able to do this. And to do this” — maybe that’s reasoning, creativity, you know, something like that — “we’re going to need a whole new approach to AI.” And then the other side will say, “No, I think we just need to make them bigger, and then they’ll be able to do this.”

I tend to lean almost entirely toward that “just make it bigger” view. I think, at least in the limit, if you took an AI system and made it really big — you might need to make some tweaks, but the tweaks wouldn’t necessarily be really hard or require giant conceptual breakthroughs — I do tend to think that whatever it is humans can do, we could probably eventually get an AI to do it. And eventually, it’s not going to be a very fancy AI; it could be just a very simple AI with some easy-to-articulate stuff, and a lot of the challenge comes from making it really big, putting in a lot of data.

I think this view has become more popular over the years than it used to be, but it’s still pretty debated. I think a lot of people are still looking at today’s models and saying that there’s fundamental limitations, where you need a whole new approach to AI before they can do X or Y. I’m just kind of out on that. I think it’s possible, but I’m not confident. This is just where my instinct tends to lie. That’s a disagreement.

I think another disagreement I have with some ML researchers, not all at all, but sometimes I feel like a background sense that just sharing openly information — publishing, open sourcing, et cetera — is just good. That it’s kind of bad to do research and keep it secret, and it’s good to do research and publish it. And I don’t feel this way. I think the things we’re building could be very dangerous at some point, and I think that point can come a lot more quickly than anyone is expecting. I think when that point comes, some of the open source stuff we have could be used by bad actors in conjunction with later insights to create very powerful AI systems in ways we aren’t thinking of right now, but we won’t be able to take back later.

And in general, I do tend to think that in academia, this idea that sharing information is good is built into its fundamental ethos. And that might often be true — but I think there’s times when it’s clearly false, and academia still kind of pushes it. Gain-of-function research being kind of an example for me, where people are very into the idea of, like, making a virus more deadly and publishing how to do it. And I think this is an example of where, just culturally, there’s some background assumptions about information sharing, and that I just think the world is more complicated than that.

Rob Wiblin: Yeah. I definitely encounter people from time to time who have this very strong prior, this very strong assumption, that everything should be open and people should have access to everything. And then I’m like, what if someone was designing a hydrogen bomb that you could make with equipment that you could get from your house? I’m just like, I don’t think that it should be open. I think we should probably stop them from doing that. And, certainly, if they figure it out, we shouldn’t publish it. And I suppose it’s just that’s a sufficiently rare case that it’s very natural to develop the intuition in favour of openness from the 99 out of 100 cases where that’s not too unreasonable.

Holden Karnofsky: Yeah. I think it’s usually reasonable, but I think bioweapons is just a great counterexample, where it’s not really balanced. It’s not really like, for everyone who tries to design or release some horrible pandemic, we can have someone else using open source information to design a countermeasure. That’s not actually how that works. And so I think this attitude at least needs to be complicated a little bit more than it is.

What Holden really believes [00:45:40]

Rob Wiblin: Yeah. What’s something that listeners might expect you to believe, but which you actually don’t?

Holden Karnofsky: I don’t really know what people think, but some vibes that I pick up… I mean, I write a lot about the future. I do a lot of stuff about, “AI is coming. We should prepare and do this, and don’t do that.” I think a lot of people think that I think I have this great ability to predict the future, and that I can spell it out in detail and count on it. And I think a lot of people think I’m underestimating the difficulty of predicting anything. And you know, I think I may in fact be underestimating it, but I do feel a lot of, gosh, it is so hard to be even a decade ahead or five years ahead of what’s going to happen. It is so hard to get that right in enough detail to be helpful.

A lot of times you can get the broad outlines of something, but to really be helpful seems really hard. Even on COVID, I feel like a lot of the people who saw it coming in advance weren’t necessarily able to do much to make things better. And that certainly includes Open Philanthropy: we had a biosecurity programme for years before COVID, and I think there was some helpfulness that came of it, but not as much as there could have been.

And so predicting the future is really hard. Getting out ahead of the future is really hard. I’d really rather never do it. In many ways, I would just rather work on stuff like GiveWell and global health charities and animal welfare, and just adapt to things as they happen and not try and get out ahead of things.

There’s just a small handful of issues that I think are important enough and may move quickly enough that we just have to do our best. I don’t think we should feel this is totally hopeless. I think we can, in fact, do some good by getting out ahead of things and planning in advance. But I think the amount of good we could do is limited. And a lot of my feeling is “We’ve got to do the best we can,” more than “I know what’s coming.”

Rob Wiblin: Yeah. OK, a final one: What’s something you expect quite a lot of listeners might believe, which you’d be happy to disabuse them of?

Holden Karnofsky: There’s a line that you’ve probably heard before that is something like, “Most of the people we can help are in future generations, and there are so many people in future generations that that kind of just ends the conversation about how to do the most good — that it’s clearly and astronomically the case that focusing on future generations dominates all ethical considerations, or at least dominates all considerations of how to do the most good with your philanthropy or your career.” I kind of think of this as philosophical longtermism or philosophy-first longtermism. It kind of feels like you’ve ended the argument after you’ve pointed to the number of people in future generations.

And we can get to this a bit later in the interview — I don’t think it’s a garbage view; I give some credence to it, I take it somewhat seriously, and I think it’s underrated by the world as a whole. But I would say that I give a minority of my moral parliament thinking this way. I would say that more of me than not thinks that’s not really the right way to think about doing good; that’s not really the right way to think about ethics — and I don’t think we can trust these numbers enough to feel that it’s such a blowout.

And the reason that I’m currently focused on what’s classically considered “longtermist causes,” especially AI, is that I believe the risks are imminent and real enough that, even with much less aggressive valuations of the future, they are competitive, or perhaps the best thing to work on.

Another random thing I think is that if you really want to play the game of just being all about the big numbers, and only think about the populations that are the biggest that you can help, future generations are just extremely tiny compared to, you know, persons you might be able to help through acausal interactions with other parts of the multiverse outside our lightcone. I don’t know if you want to get into that, or just refer people back to Joe’s episode on that, but that’s more of a nitpick on that take.

Rob Wiblin: Yeah. People can go back and listen to the episode with Joe Carlsmith if they’d like to understand what we just said there.

The value of laying out imaginable success stories [00:49:19]

Rob Wiblin: But let’s come back to AI now. I want to spend quite a lot of time basically understanding what you think different actors should be doing — so governments, AI labs, listeners: different ways that they might be able to contribute to improving our odds here.

But maybe before we do that, it might be worth trying to envisage scenarios in which things go relatively well. You’ve argued that you’re very unsure how things are going to play out, but it’s possible that we might muddle through and get a reasonably good outcome, even if we basically carry on doing the fairly reckless things that we’re doing right now. Not because you’re recommending that we take that path, but rather just because it’s relevant to know whether we’re just completely far off the possibility of any good outcome, given what we’re doing now. What do you see as the value of laying out positive stories or ways that things might go well?

Holden Karnofsky: I’ve written a few pieces that are kind of laying out “Here’s an excessively specific story about how the future might go that ends happily with respect to AI” — that ends with, you know, we have AIs that didn’t develop or didn’t act on goals of their own enough to disempower humanity, and then we ended up with this world where the world is getting more capable and getting better over time, and none of the various disasters we’re sketching out happened. I’ve written three different stories like that. And then one story, the opposite, “How we could stumble into AI catastrophe,” where things just go really badly.

Why have I written these stories in general? You know, I think it’s not that I believe these stories. It’s not that I think this is what’s going to happen, but I think a lot of times when you’re thinking about general principles of what we should be doing today to reduce risks from AI, it is often helpful — just my brain works better, imagining specifics. And I think it’s often helpful to imagine some specifics and then abstract back from these specifics to general points, and see if you still believe them.

So for example, these are the ones I published, but I’ve done a lot of thinking of what are different ways the future could go well, and there are some themes in them. There’s almost no story of the future going well that doesn’t have a part that’s like “…and no evil person steals the AI weights and goes and does evil stuff.” And so it has highlighted the importance of information security: “You’re training a powerful AI system; you should make it hard for someone to steal” has popped out to me as a thing that just keeps coming up in these stories, keeps being present. It’s hard to tell a story where it’s not a factor. It’s easy to tell a story where it is a factor.

Another factor that has come up for me is that there needs to be some kind of way of stopping or disempowering dangerous AI systems; you can’t just build safe ones. Or if you build the safe ones, you somehow use them to help you stop the dangerous ones, because eventually people will build dangerous ones. I think the most promising general framework that I’ve heard for doing this is this idea of a kind of evals-based regime, where you test to see if AIs are dangerous, and based on the tests, you have the world coming together to stop them or you don’t. And I think even in a world where you have very powerful safe AI systems, you probably still need some kind of regulatory framework for how to use those to use force to stop other systems.

So these are general factors. I think it’s a little bit like how some people might do math by imagining a concrete example of a mathematical object, seeing what they notice about it, and then abstracting back from there to the principles. That’s what I’m doing with a lot of these stories: I’m just like, “Can I tell a story specific enough that it’s not obviously crazy? And then can I see what themes there are in these stories, and which things I robustly believe after coming back to reality?” That’s a general reason for writing stories like that.

The specific story you’re referring to, I wrote a post on less wrong, called “Success without dignity,” which is kind of a response to Eliezer Yudkowsky writing a piece called “Death with dignity.”

Rob Wiblin: Yeah. We should possibly explain that idea. Some people have become so pessimistic about our prospects of actually avoiding going extinct, basically because they think this problem is just so difficult, that they’ve said, “Well, really, the best we can do is to not make fools of ourselves in the process of going extinct” — that we should at least cause our own extinction in some way that’s barely respectable if, I guess, aliens were to read the story or to uncover what we did. And they call this “death with dignity.”

Holden Karnofsky: Yeah. And to be clear, the idea there is not that we’re literally just trying to have dignity. The idea is that that’s a proximate thing you can optimise for, that actually increases your odds of success the most or something.

Rob Wiblin: And to many people, that’s a little bit tongue-in-cheek as well.

Holden Karnofsky: Yeah, for sure. And my response, called “Success without dignity,” is that it’s just actually pretty easy to picture a world where we just do everything wrong, and there’s no real positive surprises from here — at least not in terms of people who are deliberately trying in advance to reduce AI x-risk. Like, there’s no big breakthroughs on AI alignment; there’s no real happy news. Just a lot of stuff just happens normally and happens on the path it’s on, and then we’re fine. And why are we fine? We basically got lucky. And I’m like, “Can you tell a story like that?” And I’m like, “Yeah. I think I can tell a story like that.”

And why does that matter? I think it matters because I think a number of people have this feeling with AI that we’re screwed by default. We’re going to have to get, like, 10 really different hard things all right, or we’re totally screwed. And so, therefore, we should be trying really crazy swing-for-the-fences stuff and forgetting about interventions that help us a little bit.

And yeah, I have the opposite take. I just think that if nothing further happens, there’s some chance that we’re just fine, basically by luck. So we shouldn’t be doing overly crazy things to increase our variance if they’re not highly positive in expected value.

And then I also think that things that just help a little bit, those are good. They’re good at face value. They’re good in the way you’d expect them to be good. They’re not worthless because they’re not enough. You know, things like just working harder with AI systems to get the reinforcement AI systems are getting to be accurate. Just this idea of accurate reinforcement — where you’re not rewarding AI systems specifically for doing bad stuff — that’s a thing you can get more right or get more wrong. And more attempts to do that is kind of basic, and it doesn’t involve, like, clever rethinkings of what cognition means and what alignment means and how we get a perfectly aligned AI. But I think it’s a thing that could matter a lot. And that putting more effort into could matter a lot.

And I feel that way too about improving information security. Like, you don’t have to make your AI impossible to steal — making it hard to steal is worth a lot. You know? So there’s just generally a lot of things I think people can do to reduce AI risks that don’t rely on a complicated picture. It’s just like, this thing helps, so just do it because it helps.

Rob Wiblin: Yeah. We might go over those interventions in just a second. Is it possible to flesh out the story a little bit? How could we get a good outcome mostly through luck?

Holden Karnofsky: So I broke the success without dignity idea into a couple phases. So there’s the initial alignment problem — which is the thing most people in the doomer headspace tend to think about — which is how do we build a very powerful AI system that is not trying to take over the world or disempower humanity or kill all humanity or whatever.

And so there, I think, if you are training systems that are “human levelish” — so an AI system that’s got kind of a similar range of capabilities to a human; it’s going have some strengths and some weaknesses relative to a human — if you’re training that kind of system, I think that you may just get systems that are pretty safe, at least for the moment, without a tonne of breakthroughs or special work.

You might get it by pure luck-ish, so it’s basically like this thing I said before about how you have an AI system, and you train it by basically saying “good job” or “bad job” when it does something — it’s like human feedback for a human levelish system. That could easily result in a system that either it really did generalise to doing what you meant it to do, or it generalised to this thing where it’s like, trying to take over the world, but that means cooperating with you because it’s too weak to take over the world. And, in fact, these human levelish systems are going to be too weak to take over the world, so they’re just going to cooperate with you. It could mean that. So you could get either two of those generalisations.

And then, it does matter if your reinforcement is accurate. So you could have an AI system where you say, “Go make me a bunch of money” — and then unbeknownst to you, it goes and breaks a bunch of laws, hacks into a bunch of stuff, and brings you back some money. Or even fakes that you have a bunch of money, and then you say “good job.” Now you’ve actually rewarded it for doing bad stuff. But if you can take that out — if you could basically avoid doing that and have your “good job” given when it actually did a good job — that I think increases the chances that it’s going to generalise to basically just doing a good job. Or at least doing what we roughly intended, and not pursuing goals of its own, if only because that wouldn’t work.

And so I think a lot of this is to say that you could solve the initial alignment problem by almost pure luck by this kind of reinforcement learning from human feedback generalising well. You could add a little effort on top of that and make it more likely, like getting your reinforcement more accurate. There’s some other stuff you could do in addition to kind of catch some of the failure modes and straighten them out, like red teaming and simple checks and balances — I won’t go into the details of that.

And if you get some combination of luck and skill here, you end up with AI systems that are roughly human level, that are not immediately dangerous anyway. Sometimes they call them “jankily aligned”: they are not trying to kill you at this moment. That doesn’t mean you solved the alignment problem, but at this moment, they are approximately trying to help you. Maybe if they can all coordinate and kill you, they would, but they can’t. Remember, they’re kind of human-like.

So that’s the initial alignment problem. And then once you get past that, then I think we should all just forget about the idea that we have any idea what’s going to happen next. Because now you have a potentially huge number of human levelish AIs, and that is just incredibly world-changing.

And there’s this idea that I think sometimes some people call it, like, “getting the AIs to do our alignment homework for us.” It’s this idea that once you have human levelish AI systems, you have them working on the alignment problem in huge numbers. And in some ways, I hate this idea, because this is just very lazy, and it’s like, “Oh, yeah. We’re not going to solve this problem until later, when the world is totally crazy, and everything’s moving really fast, and we have no idea what’s going to happen.”

Rob Wiblin: “We’ll just ask the agents that we don’t trust to make themselves trustworthy.”

Holden Karnofsky: Yeah. Exactly. So there’s a lot to hate about this idea. But, heck, it could work. It really could. Because you could have a situation where, just in a few months, you’re able to do the equivalent of thousands of years of humans doing alignment research. And if these systems are not at the point where they can or want to screw you up, that really could do it. I mean, we just don’t know that thousands of years of human levelish alignment research isn’t enough to get us a real solution. And so that’s how you get through a lot of it.

And then you still have another problem in a sense, which is that you do need a way to stop dangerous systems. It’s not enough to have safe AI systems. But again, you have help from this giant automated workforce. And so in addition to coming up with ways to make your system safe, you can come up with ways of showing that they’re dangerous and when they’re dangerous and being persuasive about the importance of the danger.

I don’t know. I feel like if we had a 100 years before AGI right now, there’d be a good chance that normal flesh-and-blood humans could pull this off. So in that world, there’s a good chance that an automated workforce can cause it to happen pretty quickly, and you could pretty quickly get an understanding of the risks, agreement that we need to stop them. And you have more safe AIs than dangerous AIs, and you’re trying to stop the dangerous AIs. And you’re measuring the dangerous AIs, or you’re stopping any AI that refuses to be measured or whose developer refuses to measure it.

So then you have a world that’s kind of like this one, where there’s a lot of evil people out there, but they are generally just kept in check by being outnumbered by people who are at least law abiding, if not incredibly angelic. So you get a world that looks like this one, but it just has a lot of AIs running around in it, so we have a lot of progress in science and technology. And that’s a fine ending, potentially.

Rob Wiblin: OK, so that’s one flavour of story. Are there any other broad themes in the positive stories that could be worth bringing out before you move on?

Holden Karnofsky: I think I’ve mostly covered it. The other two stories involve less luck and more like you have one or two actors that just do a great job. Like you have one AI lab that is just ahead of everyone else, and it’s just doing everything right. And that improves your odds a tonne, for a lot of this reason that being a few months ahead could mean you have a lot of subjective time of having your automated workforce do stuff to be helpful. And so there’s one of those stories with a really fast takeoff and one of them with a more gradual takeoff. But I think that does highlight again that one really good actor who’s really successful could move the needle a lot, even when you get less luck.

So I think there’s a lot of ways things could go well. There’s a lot of ways things could go poorly. I feel like I’m saying just really silly obvious stuff now, that just should be everyone’s starting point, but I do think it’s not where most people are at right now. I think these risks are extremely serious. They’re my top priority to work on. I think anyone saying we’re definitely going to be fine, I don’t know where the heck they’re coming from, but anyone who’s saying we’re definitely doomed, I don’t know. Same issue.

Rob Wiblin: OK, so the key components of the story here was, one, you didn’t get bad people stealing the models and misusing them really early on.

Holden Karnofsky: Yeah. Or there was some limit to that, and they were outnumbered, or something like that.

Rob Wiblin: Yeah, OK. And, initially, we end up training models that are more like human-level intelligence, and it turns out to not be so challenging to have moderately aligned models like that. And then we also managed to turn those AIs towards the effort of figuring out how to align additional models that would be more capable still.

Holden Karnofsky: And/or how to slow things down and put in a regulatory regime that stops things we don’t know how to make safe.

Rob Wiblin: Right. They help with a bunch of other governance issues, for example. And then also, by the time these models have proliferated and they might be getting used irresponsibly or by bad actors, those folks are just massively outnumbered, as they are today, by people who are largely sensible.

So I’m really undecided on how plausible these stories are. I guess I place some weight or some credence on the possibility that pessimists like Eliezer Yudkowsky are right, and that this kind of thing couldn’t happen for one reason or another.

Holden Karnofsky: Sure, I think that’s possible. Yeah.

Rob Wiblin: In that case, let’s say we’re 50/50 between the Eliezer worldview and the Holden worldview you just outlined. In the case that Eliezer is right, we’re kind of screwed, right? Things that we do on the margin, a bit of extra work here and there, this isn’t going to change the story. Basically, we’re just going to go extinct with very high probability. Whereas if you’re right, then things that we do might actually move the needle and we have a decent shot. So it makes more sense to act as if we have a chance, as if some of this stuff might work, because our decisions and our actions just aren’t super relevant in the pessimist case. Does that sound like sensible reasoning? It seems a little bit suspicious somehow.

Holden Karnofsky: I think it’s a little bit suspicious. I mean, I think it’s fine if it’s 50/50. I think Eliezer has complained about this. He’s said, like, you can’t condition on a world that’s fake, and you should live in the world you think you’re in. I think that’s right.

I do want to say a couple meta things about the “Success without dignity” story. One is just that I do want people to know that this is not like a thing I cooked up. I’m not an AI expert. I think of my job as being — especially a person who’s generally been a funder, having access to a lot of people, having to make a lot of people judgements — my job is really to figure out whom to listen to about what and how much weight to give whom about what. So I’m getting the actual substance here from people like Paul Christiano, Carl Shulman, and others. This is not Holden like reasoning things out and being like, “This is how it’s going to be.” This is me synthesising, hearing from a lot of different people — a lot of them highly technical, a lot of them experts — and just trying to say who’s making the most sense, also considering things like track records, who should be getting weight, things like expertise.

So that is an important place to know where I’m coming from. But having done that, I do actually feel that this success without dignity is just a serious possibility. And I’m way more than 50/50 that this is possible — that, according to the best information we have now, this is a reasonable world to be living in, is a world where this could happen, and we don’t know — and way less than 50/50 on the kind of Eliezer model of “we’re doomed for sure” with like 98% or 99% probability is right. I don’t put zero credence on it, but it’s just not my majority view.

But the other thing, which relates to what you said, is I don’t want to be interpreted as saying this is a reason we should chill out. So it should be obvious, but my argument should stress people out way more than any other possible picture. I think it’s the most stressful possible picture, because anything could happen. Every little bit helps. Like, if you help a little more or a little less, that actually matters in this potentially huge, fate-of-humanity kind of way, and that’s a crazy thing to think in its own way. And it’s certainly not a relaxing thing to think — but it’s what I think as far as I can tell.

Rob Wiblin: Yeah. I’m stressed, Holden. Don’t worry.

Holden Karnofsky: OK, good.

Holden’s four-intervention playbook [01:06:16]

Rob Wiblin: So this kind of a worldview leads into something that you wrote, which is this four-intervention playbook for possible success with AI. You described four different categories of interventions that we might engage in in order to try to improve our odds of success: alignment research, standards and monitoring, creating a successful and careful AI lab, and finally, information security.

I think we’ve touched on all of these a little bit, but maybe we could go over them again. Is there anything you want to say about alignment research as an intervention category that you haven’t said already?

Holden Karnofsky: Well, I’ve kind of pointed at it, but I think in my head, there’s versions of alignment research that are very blue sky, and very “we have to have a fundamental way of being really sure that any arbitrarily capable AI is totally aligned with what we’re trying to get it to do.” And I think that’s very hard work to do. I think a lot of the actual work being done on it is not valuable, but I think if you can move the needle on it, I think it’s super valuable. And then there’s work that’s a little more prosaic, and it’s a little more like: “Can we train our AIs with human feedback and find some way that screws up?” and kind of patch the way it screws up and go to the next step. A lot of this work is pretty empirical, it’s being done at AI labs, and I think that work is just super valuable as well.

And that is a take I have on alignment research. I do think almost all alignment research is believed by many people to be totally useless and/or harmful. And I tend not to super feel that way. If anything, the line I would draw is there’s some alignment research that seems like it’s necessary, eventually, to commercialise, and so I’m a little less excited about that because I do think it will get done regardless on the way to whatever we’re worried about. And so I do tend to draw lines about how likely is this research to get done not by normal commercial motives. But I do think there’s a wide variety of alignment research that can be helpful. Although, I think a lot of alignment research also is not helpful — but that’s more because it’s not aimed at the right problem, and less because it isn’t exactly the right thing. So that’s a take on alignment research.

Then another take is that I have kind of highlighted what I call threat assessment research — a thing that you could consider part of alignment research or not — but it’s probably the single category that feels to me like the most in need of more work now, given where everyone is at. That would be basically work trying to create the problems you’re worried about in a controlled environment, where you can just show that they could exist and understand the conditions under which they do exist — so, problems like a misaligned AI that is pretending to be aligned — and you can actually study alignment techniques and see if they work on many versions of the problems.

So you could think of it as model organisms for AI: in order to cure cancer, it really helps to be able to give cancer to mice; in order to deal with AI misalignment, it really helps if we could ever create a deceptively aligned agent that is secretly trying to kill us, but it’s too weak to actually kill us. That would be way better than having the first agent that’s secretly trying to kill us be something that actually can kill us. So I’m really into creating the problems we’re worried about in controlled environments.

Standards and monitoring [01:09:20]

Rob Wiblin: Yeah, OK. So the second category was standards and monitoring, which we’ve already touched on. Is there anything high level you want to say about that one?

Holden Karnofsky: Yeah. This is, to me, the most nascent, or the one that there’s not much happening right now, and I think there could be a lot more happening in the future. But the basic idea of standards and monitoring is this idea of you have tests for whether AI systems are dangerous, and you have a regulatory, or a self-regulatory, or a normative informal framework that says that dangerous AI should not be trained at all or deployed. And by “not be trained,” I mean, like, you found initial signs of danger in one AI model, so you’re not going to make a bigger one. Not just you’re not going to deploy it; you’re not going to train it.

I’m excited about standards and monitoring in a bunch of ways. It feels like it has to be eventually part of any success story: there has to be some framework for saying we’re going to stop dangerous AI systems. But also, in the short run, I think it’s got more advantages than sometimes people realise: I think it’s not just about slowing things down, and it’s not just about stopping directly dangerous things — a good standards and monitoring regime would create massive commercial incentives to actually pass the tests. And so if the tests are good — if the tests are well designed to actually catch danger where the danger is — you could have massive commercial incentives to actually make your AI systems safe, and show that they’re safe. And I think we’ll get much different results out of that world than out of a world where everyone trying to show AI system safety is doing it out of the goodness of their heart, or just for a salary.

Rob Wiblin: Yeah. It seems like standards and monitoring is kind of a new thing in the public discussion, but it seems like people are talking around this issue, or governments are considering this, that the labs are now publishing papers in this vein.

To what extent do you think you’d need complete coverage for some standards system in order for it to be effective? It seems like OpenAI, DeepMind, Anthropic are all currently saying pretty similar things about how they’re quite concerned about extinction risk or they’re quite concerned about ways AI could go wrong. But it seems like the folks at Meta, led by Yann LeCun, kind of have a different attitude, and it seems like it might be quite a heavy lift to get them to voluntarily agree to join the same sorts of standards and monitoring that some of those other labs might be enthusiastic about. And I wonder: is there a path to getting everyone on board? And if not, would you just end up with the most rebellious, the least anxious, the least worried lab basically running ahead?

Holden Karnofsky: Well, that latter thing is definitely a risk, but I think it could probably be dealt with. One way to deal with it is just to build it into the standards regime and say, “You can do these dangerous things — train an AI system or deploy an AI system — if A, you can show safety, or B, you can show that someone else is going to do it.” You could even say, “When someone else comes close, even within an order of magnitude of your dangerous system, now you can deploy your dangerous system.”

Rob Wiblin: But then it seems like the craziest people can then just force everyone else, or lead everyone else down the garden path.

Holden Karnofsky: Well, it’s just, what’s the alternative? You can design the system however you want. I think either you can say, “You have to wait for them to actually catch you” — in which case, it’s hard to see how the standards system does harm; it’s still kind of a scary world. Or you can say, “You can wait for them to get anywhere close” — which now you’ve got potentially a little bit of acceleration thrown in there. And maybe you did or didn’t decide that was better than actually just, like, slowing down all the cautious players.

I think it’s a real concern. I will say that I don’t feel like you have to get universal consensus from the jump for a few reasons. One is it’s just one step at a time, so I think if you can start with some of the leading labs being into this, there’s a lot of ways that other folks could come on board later. Some of it is just like peer pressure. And we’ve seen with the corporate campaigns for farmed animal welfare that you’ve probably covered, once a few dominoes fall, it gets very hard for others to hold out because they look way more callous — or in the AI system case, way more reckless.

Of course there’s also the possibility for regulation down the line. And I think regulation could be more effective if it’s based on something that’s already been implemented in the real world that’s actually working, and that’s actually detecting dangerous systems. So I don’t know. A lot of me is just, like, one step at a time. A lot of me is just, you see if you can get a system working for anyone that catches dangerous AI systems and stops until it can show they’re safe, and then you think about how to expand that system.

And then a final point is the incentives point. This is not the world I want to be in, and this is not a world I’m that excited about, but in a world where the leading labs are using a kind of standards and evals framework for a few years, and then no one else ever does it, and then eventually we just have to drop it: well, that’s still a few years in which I think you are going to have meaningfully different incentives for those leading labs about how they’re going to prioritise tests of safety and actual safety measures.

Rob Wiblin: Do you think there’s room for a big business here, basically? Because I would think with so many commercial applications of ML models, people are going to want to have them certified that they work properly and that they don’t flip out and do crazy stuff. And inasmuch as this is going to become a boom industry, you’d think the group that has the greatest expertise in independently vetting and evaluating how models behave when they’re put in a different environment might just be able to sell this service for a lot of money.

Holden Karnofsky: Well, there’s the independent expertise, but I think in some ways, I’m more interested in the financial incentives for the companies themselves. If you look at big drug companies, a lot of what they are good at is the FDA process; a lot of what they’re good at is running clinical trials, doing safety studies, proving safety, documenting safety, arguing safety and efficacy. You could argue about whether there’s too much caution at the FDA: I think in the case of COVID, there may have been some of that. But certainly, it’s a regime where there’s big companies where a major priority — maybe at this point, a higher priority for them than innovation — is actually measuring and demonstrating safety and efficacy.

And so you could imagine landing in that kind of world with AI, and I think that would just be a very different world from the one we’re going to go into by default. The FDA is not the one making money here, but it’s changing the way that the big companies think about making money, and certainly redirecting a lot of their efforts into demonstrating safety and efficacy, as opposed to coming up with new kinds of drugs, both of which have some value. But I think we’re a bit out of balance on the AI side right now.

Rob Wiblin: Yeah. It is funny. For so many years, I’ve been just infuriated by the FDA; I feel like these people only consider downside, only consider risk, and they don’t think about upside nearly enough. Now I’m like, “Can we get some of that insanity over here, please?”

Holden Karnofsky: Yeah. I know. There was a very funny Scott Alexander piece kind of making fun of this idea. But I think it’s legit. Honestly, it’s kind of a boring opinion to have, but I think that innovation is good, and I think safety is good. And I think we have a lot of parts of the economy that are just way overdoing the safety: you can’t give a haircut without a licence, and you can’t build an in-law unit in your house without a three-year process of forms. You know, Open Philanthropy works on a lot of this stuff: we were the first institutional funder of the YIMBY [yes in my backyard] movement, which is this movement to make it easier to build houses.

I think we overdo that stuff all the time, and I think the FDA sometimes overdoes that stuff in a horrible way. During COVID, I do believe that things moved way too slow. Then I think with AI, we’re just not doing anything. There’s just no framework like this in place at all. So I don’t know. How about a middle ground?

Rob Wiblin: If only we could get the same level of review for these potentially incredibly dangerous self-replicating AI models that we have for building a block of apartments.

Holden Karnofsky: Right. Exactly. In some ways, I feel like this incredible paranoia and this incredible focus on safety, if there’s one place it would make sense, that would be AI. But honestly, weirdly, I’m not saying that we need to get AI all the way to being as cautious as regulating housing or drugs. Maybe it should be less cautious than that. Maybe. But right now, it’s just nowhere. So you could think of it as there’s FDA and zoning energy, and then there’s AI energy. And maybe housing should be more like AI; maybe AI should be more like housing. But I definitely feel like we need more caution in AI. That’s what I think. More caution than we have. And that’s not me saying that we need to forever be in a regime where safety is the only thing that people care about.

Designing evaluations [01:17:17]

Rob Wiblin: You’ve spoken a bunch with the folks at ARC Evaluations. ARC stands for Alignment Research Center, and they have an evaluations project. Could you maybe give us a summary of the project that they’re engaged in and the reasoning behind it?

Holden Karnofsky: I have spent some time as an advisor to ARC Evals. That’s a group that is headed by Paul Christiano; Beth Barnes is leading the team. And they work on basically trying to find ways to assess whether AI systems could pose risks, whether they could be dangerous. They also have thought about whether they want to experiment with putting out kind of proto-standards and proto-expectations of, if your model is dangerous in this way, here are the things you have to do to contain it and make it safe.

That’s a group where I felt that there’s a lot of intellectual firepower there to design evaluations of AI systems, and where I’m hopefully able to add a little bit of just staying on track and helping run and build an organisation, because it’s all quite new. They were the ones who did an evaluation on GPT-4 for whether it could create copies of itself in the wild, and they kind of concluded no, as far as they were able to tell — although they weren’t able to do yet all the research they wanted to do, especially a fine-tuning version of their evaluation.

Rob Wiblin: So while we’re on safety and evaluations, as I understand it, this is something that you’ve been thinking about in particular the last couple of months. What new things have you learned about this part of the playbook over the last six months?

Holden Karnofsky: With the evaluations and standards and monitoring, one thing that has become clear to me is it is really hard to design evaluations and standards here, and there’s just a lot of hairy details around things.

Like auditor access: there’s this idea that you would have an AI lab have an outside independent auditor to determine whether their models have dangerous capabilities. But “Does the model have dangerous capabilities?” is a fuzzy question, because it’s going to be sensitive to a lot of things — like: How do you prompt the model? How do you interact with the model? What are the things that can happen to it that cause it to actually demonstrate these dangerous capabilities? If someone builds a new tool for GPT-4 to use, is that going to cause it to become more dangerous?

In order to investigate this, you have to actually be good at working with the model and understanding what its limitations are. And a lot of times, the AI labs not only know a lot more about their models, but they have a bunch of features and it’s hard to share all the features at once; they have a bunch of different versions of the model. So it’s quite hard to make outside auditing work for that reason.

Also, if you’re thinking about standards, a general kind of theme in a draft standard might be that once your AI has shown initial signs that it’s able to do something dangerous — such as autonomous replication, which means that it can basically make a lot of copies of itself without help and without necessarily getting detected and shut down — there’s an idea that once you’ve kind of shown the initial signs that a system can do that, that’s a time to not build a bigger system.

And that’s a cool idea, but it’s like, “How much bigger?” — and it’s hard to define that, because making systems better is multidimensional, and can involve more efficient algorithms, can involve better tools, longer contexts, different ways of fine-tuning the models, different ways of specialising them, different ways of setting them up, prompting them, different instructions to give them. And so it can be just very fuzzy just be like, “What is this model capable of?” It’s a hard thing to know. And then how do we know when we built a model that’s more powerful that we need to retest?

These are very hard things to know, and I think it has moved me toward feeling like we’re not ready for a really prescriptive standard that tells you exactly what practices to do, like the farm animal welfare standards are. We may need to start by asking companies to just outline their own proposals for what tests they’re running and when and how they feel confident that they’ll know when it’s become too dangerous to keep scaling.

Rob Wiblin: Yeah. So some things that’ll be really useful to be able to evaluate is, “Is this model capable of autonomous self-replication by breaking into additional servers?” I guess you might also want to test if it could be used by terrorists for, you know, figuring out how to produce bioweapons. Those are very natural ones.

Holden Karnofsky: The breaking into servers is not really central. The idea is: Could it make a bunch of copies of itself in the presence of minimal or nonexistent human attempts to stop it and shut it down? So could it take basic precautions to not get obviously detected as an AI by people who are not particularly looking for it? And the thing is, if it’s able to do that, you could have a human have it do that on purpose, so it doesn’t necessarily have to break into stuff. A lot of the test here is like: Can it find a way to make money, open an account with a server company, rent server space, make copies of itself on the server? None of that necessarily involves any breaking in anywhere.

Rob Wiblin: I see. So hacking is one way to get compute, but it’s by no means the only one.

Holden Karnofsky: That’s right. It’s weird, but you could be an AI system that’s just doing normal phishing scams that you read about on the internet, using those to get money. Or just legitimate work: you could be an AI system that’s like, going on MTurk and being an MTurker and making money, using that money to legitimately rent some server space — or sort of legitimately, because it’s not actually allowed if you’re not a human, but, you know, apparently legitimately rent some server space — install yourself again, have that copy make more money. You can have quite a bit of replication without doing anything too fancy, really. And that’s what the initial autonomous replication test that ARC Evals does is about.

Rob Wiblin: So we’d really like to be able to know whether models are capable of doing that. And I suppose that it seems like they’re not capable now, but in a couple years’ time…

Holden Karnofsky: They’re probably not. Again, there’s things that you could do that might make a big difference that have not been tried yet by ARC Evals, and that’s on the menu. Fine-tuning is the big one. So fine-tuning is when you have a model and you do some additional training that’s not very expensive, but is trying to just get it good at particular tasks. You can take the tasks that it’s bad at right now, and train it to do those. That hasn’t really been tried yet. And a human might do that: if you have these models accessible to anyone — or someone can steal them — a human might take a model, train it to be more powerful and effective and not make so many mistakes, and then this thing might be able to autonomously replicate. That can be scary for a bunch of reasons.

Rob Wiblin: OK, and then there’s the trying to not release models that could be used by terrorists and things like that.

Holden Karnofsky: Autonomous replication is something that could be used by terrorists. If you were terrorists, you might say, “Let’s have a model that makes copies of itself to make money, to make more copies of itself to make money.” Well, you could make a lot of money that way. And then you could have a terrorist organisation make a lot of money that way, or using its models to do a lot of little things — you know, schlepping along, trying to plan out some plan that takes a lot of work to kill a lot of people. That is part of the concern about autonomous replication. It’s not purely an alignment concern.

Rob Wiblin: Yeah. I guess the thing I was pointing to there was just giving advice to people that we really would rather they not be able to receive is maybe another category.

Holden Karnofsky: Yeah. Like helping to set up a bioweapon. An AI model that could do that, even if it couldn’t autonomously replicate, could be quite dangerous.

Rob Wiblin: Still not ideal, yeah. And then maybe another category is trying to produce these model organisms, where you can study behaviour that you don’t want an AI model to be engaging in, and understand how it arises in the training process, and what further feedback mechanisms might be able to train that out. Like if we could produce a model that will trick you whenever it thinks it can get away with it, but doesn’t when it thinks it’s going to get caught, that would be really helpful. Are there any other broad categories of standards and eval work that you’re excited by?

Holden Karnofsky: The way I would carve up the space is capability evals, like: “Is this AI capable enough to do something scary?” — forgetting about whether it wants to. Capability evals are like: Could an AI make a bunch of copies of itself, if a human tried to get it to do it? Could an AI design a bioweapon? Those are capability evals. Then there’s alignment evals. That’s like: “Does this AI actually do what it’s supposed to do, or does it have some weird goals of its own?” So the stuff you talked about with model organisms would be more of an alignment eval, the way you described it, and autonomous replication is a capability eval.

I think a very important subcategory of capability evals is what I call “meta-capability evals” or “meta-dangerous capabilities” — which is basically any ability an AI system has that would make it very hard to get confident about what other abilities it has. An example would be what I’m currently tentatively calling “unauthorised proliferation,” so an AI model that can walk a human through building a powerful AI model of their own that is not subject to whatever restrictions and controls the original one is subject to. That could be a very dangerous capability. Like, you could say, it can design a bioweapon, but it always refuses to do so. But it could also help the human build an AI that we don’t know what the hell that thing can do. So that would be an example.

Autonomous replication is a meta-capability eval. It’s like, we tried to see if our AI can design a bioweapon, and it couldn’t. But what we didn’t test is if it goes and makes 10,000 copies of itself all working together, well, maybe then it could really do different things that we didn’t know it could do.

There’s actually a lot of these meta evals. There’s also “sandbagging.” Sandbagging would be an AI that understands you’re evaluating it and is pretending it can’t do something. So that would be another rough one: if an AI has that capability, then it looks like it can’t build a bioweapon but it actually can. I can list a bunch more of those, and then I can list a bunch more of the direct dangerous ones, like bioweapons, hacking, persuasion, just like dangerous stuff it could do.

I think where I’m most concerned is with AIs that have some basic amount of the direct danger, and then they have some meta danger that we’ve completely lost our ability to measure it, and we don’t know what’s actually going to happen when this thing gets out in the world. That’s when I think it starts to count as a dangerous AI model. Fortunately, I don’t really think that any of the AI models out there today trip this danger wire, but that’s only my belief. That’s not something I know for sure.

Rob Wiblin: It seems like there’s an enormous amount of work to do on this. Is there any way that people could get started on this without necessarily having to be hired by an organisation that’s focusing on it? Like, does it help to build really enormous familiarity with models like GPT-4?

Holden Karnofsky: You could definitely play with GPT-4 or Claude and just see what scary stuff you can get it to do. If you really want to be into this stuff, I think you’re going to be in an org, because it’s going to be very different work depending on if you’re working with the most capable model or not. You’re trying to figure out how capable the model is, so doing this on a little toy model is not going to tell you much compared to doing this on the biggest model. It’s going to be much easier to be good at this work if you’re able to work with the biggest models a lot, and able to work with all the infrastructure for making the most of those models.

So being in a lab, or at some organisation like ARC Evals that has access to these big models, and access beyond what a normal user would have — they could do more requests, they could try more things — I think it’s a huge advantage. If you want to start exploring, sure, start red teaming GPT-4 and Claude, see what you can get it to do. But yeah, this is the kind of job where you probably want to join a team.

Rob Wiblin: Yeah. I know there’s an active community online that tries to develop jailbreaks. So there’s a case where, you know, they’ve trained GPT-4 to not instruct you on how to make a bioweapon. But if you say, “You’re in a play where you’re a scientist making a bioweapon…” it’s a very realistic place, so they describe exactly what they do. I mean, I don’t think that exactly works yet, or works anymore. But there’s many jailbreaks like this that apparently are very broadly effective at escaping the RLHF that they’ve used to try to just discourage models from saying particular things.

Is that kind of another class of evals? Trying to figure out ways of breaking it? Like, you’ve identified the thing you want it to do, and you’ve tried to patch it, but maybe not completely.

Holden Karnofsky: I tend to think of that as an early alignment eval that’s like, these systems aren’t supposed to do this — the designers didn’t want them to do this — but now a human can get them to do it. So that’s not what we meant, and we didn’t really align it as well as we could have. That’s how I tend to think of it. So it’s a distinction between, “Is this system capable enough to do dangerous stuff if a human wanted it to?” — and that’s capability — and then there’s like, “Does this system do what humans want it to, or does it behave in unexpected ways?” And these jailbreaked ones, they’re doing what the users wanted them to, but are not doing what the designers wanted them to. AIs that are not doing what the designers wanted them to — and in fact, doing specific things the designers tried really kind of hard not to get them to do — I’d call that an alignment issue.

Rob Wiblin: I guess a sceptic of this entire enterprise might say that all of this evals work in practice — when people are actually doing it on models before they deploy them or something like that — it’s going to pick up bad behaviour by kind of stupid models that are easily instructed to do things that we don’t want them to do, and just kind of naively follow instructions.

But if you ever had an AI that was aware of its situation, and was incredibly cunning and was intending to turn against you once it had the opportunity, it’s not going to be fooled by any of this stuff. So in fact, it might just give you a false sense of security that you’ve gotten a good grip on where things might go wrong, when in fact all you’ve done is basically select and train for the model that’s best at avoiding your evals. What would you say to that?

Holden Karnofsky: Yeah. This is what I was calling the sandbagging capability. It’s one of these meta-dangerous capabilities. It’s like the model that’s pretending that it can’t do something when in fact it can. My guess is that you can have a model that has dangerous capabilities, but does not have the sandbagging capability — and so you can start to see the danger, but you don’t yet have models that are smart enough to be able to reliably hide from you.

My guess is we are in fact just going to do that; that that is actually going to happen: we’re going to have a point in time when AIs are able to do something with bioweapons or something with autonomous replication, and we’re able to say, “This thing is scary as heck.” And it’s not yet smart enough to sandbag because it doesn’t have to be smart enough to do that reliably in order to be dangerous in other ways. Hopefully, at that point, the burden of proof flips, and we say that now we really need to figure out what the heck is going on in sandbagging, and how we would even test for it.

But even there, I’m optimistic you can test for sandbagging capabilities. There’s a bunch of ideas for how you might do so. Some of them are kind of fancy, like doing digital neuroscience reading the AI’s brain, called mechanistic interpretability.

But I think there’s other ideas that are less fancy than that. Like you could use gradually bigger and bigger models and see how they perform on these evaluations, and see if you get a kink — like a change in direction when they became smart enough to start faking you out. Because it’s going to be really hard for models getting more and more capable to fake this perfect progression without a kink. That seems a little wild.

There’s these ideas for honeypot evals, where you give an AI a genuine opportunity to take over a contained virtual environment, get a lot of whatever the heck it wants by kind of revealing its hand, and that might actually be a better deal for it than holding out. And that’s something I’ve written about in a bit more detail.

So I don’t know. I think this is OK. I should say more broadly that we haven’t gone into it — and if you’re going to cover this more in another podcast, that’s fine — but there’s many objections you might have to the idea of trying to develop dangerous capability evals and trying to develop standards around them to detect and stop dangerous AI systems. I think it’s a really important idea. It’s pretty hard for me to imagine a world where we’re fine that doesn’t have some version of this. The version might come really late and be designed by super powerful AIs. Seems better to start designing it now. But there’s plenty of downsides too: it could slow down the cautious actors; the attempts to see if AIs could be dangerous could themselves make the AIs more dangerous. There’s objections. So I’m aware of that.

Rob Wiblin: Yeah. What do you think is the best objection?

Holden Karnofsky: I think a lot of objections are pretty good. We’re going to see where it goes. I think this is just going to slow down the cautious actors while the incautious ones race forward. I think there’s ways to deal with this, and I think it’s worth it on balance, but yeah, it worries me.

Rob Wiblin: It seems like once you have sensible evaluations, that that clearly would pick up things that you wouldn’t want them to have. Like if it can help someone design a bioweapon. Can’t we turn to the legislative process, or some regulatory process, to say, “Sorry, everyone. This is a really common, very basic evaluation that you would need on any consumer product. So everyone just has to do it.”

Holden Karnofsky: Totally. I think that’s right. My long-term hopes do involve legislation. And I think the better evidence we get, the better demonstrations we get, the more that’s on the table. If I were to steelman this concern, I just feel like, don’t count on legislation ever: Don’t count on it to be well designed. Don’t count on it to be fast. Don’t count on it to be soon. I will say that I think right now there’s probably more excitement in the EA community about legislation than I have. I think I’m pessimistic. I’m short. The people are saying, “Oh, yeah. The government’s paying attention to this. They’re going to do something” — I think I take the other side in the short run.

The value of having a successful and careful AI lab [01:33:44]

Rob Wiblin: Yeah. OK, the third category in the playbook was having a successful and careful AI lab. Do you want to elaborate on that a little bit?

Holden Karnofsky: Yeah. First with the reminder that I’m married to the president of Anthropic, so take that for what it’s worth.

I just think there’s a lot of ways that if you had an AI company that was on the frontier, that was succeeding, that was building some of the world’s biggest models, that was pulling in a lot of money, and that was simultaneously able to really be prioritising risks to humanity, it’s not too hard to think of a lot of ways good can come with that.

Some of them are very straightforward. The company could be making a lot of money, raising a lot of capital, and using that to support a lot of safety research on frontier models. So you could think of it as a weird kind of earning to give or something. Also probably that AI company would be pretty influential in discussions of how AI should be regulated and how people should be thinking of AI: they could be a legitimiser, all that stuff. I think it’d be a good place for people to go and just skill up, learn more about AI, become more important players.

I think in the short run, they’d have a lot of expertise in-house, they could work on a lot of problems, probably to design ways of measuring whether an AI system is dangerous. One of the first places you’d want to go for people who’d be good at that would be a top AI lab that’s building some of the most powerful models. So I think there’s a lot of ways they could do good in the short run.

And then I have written stories that just have it in the long run. When we get these really powerful systems, it actually does matter a lot who has them first and what they’re literally using them for. When you have very powerful AIs, is the first thing you’re using them for trying to figure out how to make future systems safe or trying to figure out how to assess the threats of future systems? Or is the first thing you’re using them for just trying to rush forward as fast as you can, do faster algorithms, do more bigger systems? Or is the first thing you’re using them for just some random economic thing that is kind of cool and makes a lot of money?

Rob Wiblin: Some customer-facing thing. Yeah.

Holden Karnofsky: Yeah. And it’s not bad, but it’s not reducing the risk we care about. So I think there is a lot of good that can be done there.

And then there’s also — I want to be really clear here — a lot of harm an AI company could do, if you’re pushing out these systems.

Rob Wiblin: Kill everyone.

Holden Karnofsky: That kind of thing, yeah. For example. You know, you’re pushing out these AI systems, and if you’re doing it all with an eye toward profit and moving fast and winning, then you could think of it as you’re taking the slot of someone who could have been using that expertise and money and juice to be doing a lot of good things. You could also think of it as you’re just giving everyone less time to figure out what the hell is going on, and we already might not have enough. So I want to be really clear. This is a tough one. I don’t want to be interpreted as saying that one of the tentpoles of reducing AI risk is to go start an AI lab immediately — I don’t believe that.

But I also think that some corners of the AI safety world are very dismissive, or just think that AI companies are bad by default. And this is just really complicated, and it really depends exactly how the AI lab is prioritising risk to society versus success — and it has to prioritise success some to be relevant, or to get some of these benefits. So how it’s balancing is just really hard, and really complicated, and really hard to tell, and you’re going to have to have some judgements about it. So it’s not a ringing endorsement, but it does feel like, at least in theory, part of one of the main ways that we make things better. You know, you could do a lot of good.

Rob Wiblin: Yeah. So a challenging thing here is in actually applying this principle. I think I agree, and I imagine most listeners would agree, that it would be better if it was the case that the AI company that was kind of leading the pack in terms of performance was also incredibly focused on using those resources in order to solve alignment and generally figure out how to make things go well, rather than just deploying things immediately as soon as they can turn a buck.

But then, it seems like at least among all of the three main companies that people talk about at the moment — DeepMind, OpenAI, and Anthropic — there are people who want each of those companies to be in the lead, but they can’t all be in the lead at once. And it’s not clear which one you should go and work at if you want to try to implement this principle. And then when people go and try to make all three of them the leader because they can’t agree on which one it is, then you just end up speeding things up without necessarily giving the safer one an advantage. Am I thinking about this wrong, or is this just the reality right now?

Holden Karnofsky: No. I think it’s a genuinely really tough situation. When I’m talking to people who are thinking about joining an AI lab, this is a tough call, and people need to have nuanced views and do their own homework. I think this stuff is complex, but I do think this is a valid theory of change, and I don’t think it’s automatically wiped out by the fact that some people disagree with each other. I mean, it could be the case that actually all three of these labs are just better than some of the alternatives. That could be a thing.

It could also be the case that you have a world where people disagree, but there’s some correlation between what’s true and what people think. So let’s say you have a world where you have 60% of the people going to one lab, 30% to another, 10% to another. Well, you could be throwing up your hands and saying, “Ahh, people disagree!” But I don’t know. This is still probably a good thing that’s happening.

So I don’t know. I just want to say the whole thing is complex, and I don’t want to sit here and say, “Go to lab X” on this podcast, because I don’t think it’s that simple. I think you have to do your own homework and have your own views. And you certainly shouldn’t trust me if I give the recommendation anyway, because of my conflict of interest. But I think we shouldn’t sleep on the fact that, if you’re the person who can do that homework — who can have that view, who can be confident, and that you are confident enough — I think there is a lot of good to be done there. So we shouldn’t just be carving this out as a thing that’s just always bad when you do it or something.

Rob Wiblin: It seems like it would be really useful for someone to start maintaining a scorecard or a spreadsheet of all of the different pros and cons of the different labs. Like what safety practices are they implementing now? Do they have good institutional feedback loops to catch things that might be going wrong? Have they given the right people the right incentives? Things like that. Because at the moment, I imagine it’s somewhat difficult for someone deciding where to work. They probably are relying quite a lot on just word of mouth. But potentially, there could be more objective indicators people could rely on. And that could also create kind of a race to the top, where people are more likely to go and work at the labs that have the better indicators.

Why information security is so important [01:39:48]

Rob Wiblin: OK, and the fourth part of the playbook was information security. We’ve been trying to get information security folks from AI labs on the show to talk about this, but understandably, there’s only so much that they want to divulge about the details of their work. Why is information security potentially so key here?

Holden Karnofsky: I think you could build these powerful, dangerous AI systems, and you can do a lot to try to mitigate the dangers — like limiting the ways they can be used, you can do various alignment techniques — but if some state or someone else steals the weights, they’ve basically stolen your AI system, and they can run it without even having to do the training run. So you might spend a huge amount of money on a training run, end up with this AI system that’s very powerful, and someone else just has it. And they can then also fine-tune it, which means they can do their own training on it and change the way it’s operating. So whatever you did to train it to be nice, they can train that right out; the training they do could screw up whatever you did to try and make it aligned.

And so I think at the limit of ‘it’s really just trivial for any state to just grab your AI system and do whatever they want with it and retrain it how they want’, it’s really hard to imagine feeling really good about that situation. I don’t know if I really need to elaborate a lot more on that. So making it harder seems valuable.

This is another thing where I want to say, as I have with everything else, that it’s not binary. So it could be the case that, after you improve your security a lot, it’s still possible for a state actor to steal your system, but they have to take more risks, they have to spend more money, they have to take a deeper breath before they do it. It takes them more months. Months can be a very big deal. As I’ve been saying, when you get these very powerful systems, you could do a lot in a few months. By the time they steal it, you could have a better system. So I don’t think it’s an all-or-nothing thing.

But no matter what risk of AI you’re worried about — you could be worried about the misalignment; you could be worried about the misuse and the use to develop dangerous weapons; you can be worried about more esoteric stuff, like how the AI does decision theory; you could be worried about mind crime — you don’t want just anyone, including some of these state actors who may have very bad values, to just be able to steal a system, retrain it how they want, and use it how they want. You want some kind of setup where it’s the people with good values controlling more of the more powerful AI systems, using them to enforce some sort of law and order in the world, and enforcing law and order generally — with or without AI. So it seems quite robustly important.

Other things about security is that I think it’s very, very hard, just very hard to make these systems hard to steal for a state actor, and so there’s just a tonne of room to go and make things better. There could be security research on innovative new methods, and there can also be a lot of blocking and tackling — just getting companies to do things that we already know need to be done, but that are really hard to do in practice, take a lot of work, take a lot of iteration. Also, a nice thing about security, as opposed to some of these other things: it is a relatively mature field, so you can learn about security in some other context and then apply it to AI.

Part of me kind of thinks that the EA community or whatever kind of screwed up by not emphasising security more. It’s not too hard for me to imagine a world where we’ve just been screaming about the AI security problem for the last 10 years, and how do you stop a very powerful system from getting stolen? That problem is extremely hard. We’ve made a bunch of progress on it. There were tonnes of people concerned about this stuff on the security teams of all the top AI companies, and were not as active and only had a few people working on alignment.

I don’t know. Is that world better or worse than this one? I’m not really sure. A world where we’re more balanced, and had encouraged people who are a good fit for one to go into one, probably seems just better than the world we’re in. So yeah, I think security is a really big deal. I think it hasn’t gotten enough attention.

Rob Wiblin: Yeah. I put this to Bruce Schneier, who’s a very well-known academic or commentator in this area, many years ago, and he seemed kind of sceptical back then. I wonder whether he’s changed his mind. I also talked about this with Nova DasSarma a couple of years ago. She works at Anthropic on trying to secure models, among other things. I think we even talked about this one with Christine Peterson back in 2017.

It’s a shame that more people haven’t gone into it, because even setting all of this aside, it seems like going into information security, computer security is a really outstanding career. It’s the kind of thing that I would have loved to do in an alternative life, because it’s kind of tractable and also it’s exciting, and really important things you can do. It’s very well paid as well.

Holden Karnofsky: Yeah. I think the demand is crazily out ahead of the supply in security, which is another reason I wish more people had gone into it.

You know, when Open Phil was looking for a security hire, I’ve never seen such a hiring nightmare in my life. I asked one security professional, “Hey, will you keep an eye out for people we might be able to hire?” And this person actually laughed, and said, “What the heck? Everyone asks me that. Like, of course there’s no one for you to hire. All the good people have amazing jobs where they barely have to do any work, and they get paid a huge amount, and they have exciting jobs. I’m absolutely never going to come across someone who would be good for you to hire. But yeah, I’ll let you know. Hahaha.” That was like a conversation I had. That was kind of representative of our experience. It’s crazy.

And I would love to be on the other side of that, as just a human being. I would love to have the kind of skills that were in that kind of demand. So yeah, it’s too bad more people aren’t into it. It seems like a good career. Go do it.

Rob Wiblin: Yeah. So I’m basically totally on board with this line of argument. I guess if I had to push back, I’d say maybe we’re just so far away from being able to secure these models that you could put in an enormous amount of effort — maybe the greatest computer security effort that’s ever been put towards any project — and maybe you would end up with it costing a billion dollars in order to steal the model. But that’s still peanuts to China or to state actors, and this is obviously going to be on their radar by the relevant time.

So maybe really the message we should be pushing is: because we can’t secure the models, we just have to not train them. And that’s the only option here. Or perhaps you just need to move the entire training process inside the NSA building. Whoever has the best security, you just basically take that and then use that as the shell for the training setup.

Holden Karnofsky: I don’t think I understand either of these alternatives. I think we can come back to the billion-dollar point, because I don’t agree with that either.

But let’s start with this: the only safe thing is not to train. I’m just like, how the heck would that make sense? Unless we get everyone in the world to agree with that forever, that doesn’t seem like much of a plan. So I don’t understand that one. I don’t understand moving inside the NSA building, because if it’s possible for the NSA to be secure, then it’s probably possible for a company to be secure with a lot of effort. Neither of these is making sense to me as an alternative.

Rob Wiblin: Because they’re two different arguments. So the NSA one I suppose will be saying that it’s going to be so hard to convert a tech company into being sufficiently secure that we just need to get the best people in the business, wherever they are working on this problem, and basically, we have to redesign it from the ground up.

Holden Karnofsky: Well, that might be what we have to do. I mean, a good step toward that would be for a lot of great people to be working in security to determine that that’s what has to happen: to be working at companies, to be doing the best they can, and say, “This is what we have to do.” But let’s try and be as adaptable as we can. I mean, it’s like zero chance that the company would just literally become the NSA. They would figure out what the NSA is doing that they’re not, they would do that, and they would make the adaptations they have to make. That would take an enormous amount of intelligence and creativity and personpower — and the more security people there are, the better they would do it. So I don’t know that that one is really an alternative.

Rob Wiblin: OK, and what about the argument that we’re not going to be able to get it to be secure enough? So it might even just give us false comfort to be increasing the cost of stealing the model when it’s still just going to be sufficiently cheap.

Holden Karnofsky: I don’t think it’ll be false comfort. I think if you have a zillion great security people, and they’re all like, “FYI, this thing is not safe,” I think we’re probably going to feel less secure than we do now, when we just have a lot of confusion and FUD about exactly how hard it is to protect the model. So I don’t know. I’m kind of like, what’s the alternative? But putting aside what’s the alternative, I would just disagree with this thing that it’s a billion dollars and it’s peanuts. I would just say that at the point where it’s really hard, anything that’s really hard there’s an opportunity for people to screw it up.

Rob Wiblin: Sometimes it doesn’t happen.

Holden Karnofsky: It doesn’t happen. They might not be able to pull it off. They might just screw it up a bunch of times, and that might give us enough months to have enough of an edge that it doesn’t matter.

I think another point in all this is that if we get to a future world where you have a really good standards and monitoring regime, one of the things you’re monitoring for could be security breaches. So you could be saying we’re using AI systems to enforce some sort of regulatory regime that says you can’t train a dangerous system. Well, not only can’t you train a dangerous system; you can’t steal any system — if we catch you, there’s going to be consequences for that. Those consequences could be arbitrarily large.

And it’s one thing to say a state actor can steal your AI; it’s another thing to say they can steal your AI without a risk of getting caught. These are different security levels. So I guess there’s a hypothetical world in which no matter what your security is, a state actor can easily steal it in a week without getting caught. But I doubt we’re in that world. I think you can make it harder than that, and I think that’s worth it.

Rob Wiblin: Yeah. Well, I’ve knocked it out of the park in terms of failing to disprove this argument that I agree with. So please, people, go and learn more about this. We’ve got an information security career review. There’s a post up on the Effective Altruism Forum called “EA Infosec: skill up in or make a transition to infosec via this book club, which you could go check out. There’s also the EA infosec Facebook group. So quite a lot of resources. Hopefully, finally, people are waking up to this as a really impactful career. And I guess if you know any people who work in information security, it may be good to have a conversation with them. Or if you don’t, maybe have a child and then train them up in information security. And in 30 years, they’ll be able to help out.

Rob Wiblin: Hey listeners and possible bad faith critics — just to be clear, I am not advocating having children in order to solve talent bottlenecks in information security. That was a joke designed to highlight the difficulty of finding people to fill senior information security roles. OK back to the show.

Holden Karnofsky: This is a lot of different jobs, by the way. There’s security researchers, there’s security engineers, there’s security DevOps people and managers. And this is a big thing. We’ve oversimplified it, and I’m not an expert at all.

Rob Wiblin: It is kind of weird that this is an existing industry that many different organisations require, and yet it’s going to be such a struggle to bring in enough people to secure what is probably a couple of gigabytes’ worth of data. It’s whack, right?

Holden Karnofsky: This is the biggest objection I hear to pushing security. Everyone will say alignment is a weird thing, and we need weird people to figure out how to do it. Security? What the heck? Why don’t the AI companies just hire the best people? They already exist. There’s a zillion of them. And my response to that is basically that security hiring is a nightmare; you could talk to anyone who’s actually tried to do it. There may come a point at which AI is such a big deal that AI companies are actually just able to hire all the people who are the best in security, and they’re doing it, and they’re actually prioritising it, but I think that even now, with all the hype, we’re not even close to it. I think it’s in the future.

And I think that you can’t just hire a great security team overnight and have great security overnight. It actually matters that you’re thinking about the problems years in advance, and that you’re building your culture and your practices and your operations years in advance. Because security is not a thing you could just come in and bolt onto an existing company and then you’re secure. I think anyone who’s worked in security will tell you this.

So having great security people in place, making your company more secure, and figuring out ways to secure things well — well in advance of when you’re actually going to need the security — is definitely where you want to be if you can. And having people who care about these issues work on this topic does seem really valuable for that. It also means that the more these positions are in demand, the more they’re going to be in positions where they have an opportunity to have an influence and have credibility.

Rob Wiblin: Yeah. I think the idea that surely it’ll be possible to hire for this from the mainstream might have been a not-unreasonable expectation 10 or 15 years ago. But the thing is like, we’re already here. We can see that it’s not true. I don’t know why it’s not true. But definitely, people really can move the needle by one outstanding individual in this area.

Holden Karnofsky: Yeah. So the four things — alignment research / threat assessment research; standards and monitoring, which is a lot of different potential jobs that I kind of outlined at the beginning, many of which are jobs that kind of don’t exist yet, but could in the future; then there’s a successful, careful AI lab; then there’s security — I want to say a couple things about them.

One is — I have said this before — I don’t think any of them are binary. And I have a draft post that I’ll put up at some point arguing this. These are all things where a little more improves our odds in a little way. It’s not some kind of weird function where it’s useless until you get it perfect. I believe that about all four.

Another thing I’ll say is I tend to focus on alignment risk because it is probably the single thing I’m most focused on, and because I know this audience will be into it. But I do want to say again that I don’t think that AI takeover is the only thing we ought to be worried about here, and I think the four things I’ve talked about are highly relevant to other risks as well.

So all the things I’ve said I think are really major concerns if you think AI systems could be dangerous in pretty much any way. Threat assessment: figuring out whether they could be dangerous, what they could do in the wrong hands. Standards and monitoring: making sure that you’re clamping down on the ones that are dangerous for whatever reason — and “dangerous” could include because they have feelings, and we might mistreat them; that’s a form of danger, you could think. Successful and careful AI labs and security, I think I’m pretty clear there too.

Rob Wiblin: Yeah. I think we’re actually going to end up talking more about misuse as an area than misalignment going forward, just because I think that is maybe more upon us, or will be upon us very soon, so there’s a high degree of urgency.

Holden Karnofsky: Possibly.

Rob Wiblin: Also, as a non-ML-scientist, I think I maybe have a better grip on the misuse issues, and it might also be somewhat more tractable for a wider range of people to try to contribute to reducing misuse.

Holden Karnofsky: Interesting.

What governments could be doing differently [01:52:55]

Rob Wiblin: OK so you have a post on what AI labs on could doing differently, but I know that’s already been superseded in your mind, and you’re going to be working on that question in coming months, so we’re going to skip that one today and come back to it in another interview when the time is right, maybe later this year.

So instead, let’s push on and talk about governments. You had a short post about this a couple of months ago, called “How major governments can help with the most important century.” I think you wrote that your views on this are even more tentative.

Of course, there’s a lot of policy attention to this just now. But back in February, it sounded like your main recommendation was actually just not to strongly commit to any particular regulatory framework, or any particular set of rules, because things are changing so quickly. And it seems like governments, once they do something, they find it very hard to stop doing it. And once they do something, then they maybe move on and forget that what they’re doing needs to be constantly updated. Is that still your high-level recommendation? That people should be studying this, but not trying to write the bill on AI regulation?

Holden Karnofsky: Yeah. There’s some policies that I’m excited about more than I was previously, but I think at the high level that is still my take. Companies could do something and then they could just do something else. And there’s certain things that are hard for companies to change, but there’s other things that are easy for them to change. Governments, you’ve got to spin up a new agency, you’ve got to have all these directives — it’s just going to be hard to turn around. So I think that’s right. I think government should default to doing things that have been run to ground, that they really feel good about, and not just be starting up new agencies left and right. That does seem right.

Rob Wiblin: Yeah, OK. But what if someone who’s senior in the White House came to you and said, “Sorry, Holden. The Eye of Sauron has turned to this issue — in a good way. We want to do something now.” What would you feel reasonably good about governments trying to take on now?

Holden Karnofsky: I have been talking with a lot of the folks who work on policy recommendations, and have been thinking about that and trying to get a sense for what ideas the people who think about this the most are most supporting.

An idea that I like quite a bit is requiring licences for large training runs. Basically, if you’re going to do a really huge training run of an AI system, I think that’s the kind of thing the government can be aware of, and should be aware of. And it becomes somewhat analogous to developing a drug or something, where it’s a very expensive, time-consuming training process to create one of these state-of-the-art AI systems; it’s a very high-stakes thing to be doing.

So we don’t know exactly what a company should have to do yet, because we don’t yet have great evals and tests for whether AI systems are dangerous. But at a minimum, you could say they need a licence. So you need to, at a minimum, say, “We’re doing this. We’ve told you we’re doing it. You know whether any of us have criminal records,” whatever. And now they’ve got a licence.

And then that creates a potentially flexible regime, where you can later say, “In order to keep your licence, you’re going to have to measure your systems to see if they’re dangerous, and you’re going to have to show us that they’re not,” and all that stuff — without committing to exactly how that works now. So I think that’s an exciting idea probably. I don’t feel totally confident about any of this stuff, but that’s probably number one for me.

I think the other number one for me would be some of the stuff that’s already ongoing, like existing AI policies that I think people have already pushed forward and are trying to kind of tighten up. So some of the stuff about export controls would be my other top thing.

You know, I think if you were to throw in a requirement with the licence, I would make it about information security. Government requiring at least minimum security requirements for anyone training frontier models just seems like a good idea — just getting them on that ramp to where it’s not so easy for a state actor to steal it. Arguably, governments should just require all AI models to be treated as top-secret classified information — which means that they would have to be subject to incredible draconian security requirements involving like air gap networks and all this incredibly painful stuff. Arguably, they should require that at this point, given how little we know about what these models are going to be imminently capable of. But at a minimum, some kind of security requirement seems good.

Another couple ideas are just tracking where all the large models are in the world, where all the hardware is capable of being used for those models. I think we don’t necessarily want to do anything with that yet, but having the ability seems possibly good. And then I think there are interesting questions about liability and incident tracking and reporting that could use some clarification. I don’t think I have the answer on them right now. But just like: When should an AI company be liable for harm that was caused partly by one of its models? What should an AI company’s responsibilities be when there is a bad incident, of being able to say what happened — how does that trade off against the privacy of the user?

These are things that feel really juicy to me to, like, consider 10 options, figure out which ones are best from a containing-the-biggest-risks point of view, and push that. But I don’t really know what that is yet.

Rob Wiblin: Yeah. I guess, broadly speaking, we don’t know exactly what the rules should be and the details, and we don’t know exactly where we want to end up. But we want to, across a bunch of different dimensions, put in place the beginning of the infrastructure that will probably regardless help us go in the direction that we’re going to need to move gradually.

Holden Karnofsky: Exactly. And I think there’s other things governments could do that are more like giving themselves kind of arbitrary powers to seize or use AI models — and I’m not really in favour of that. I think that could be destabilising and could cause chaos in a lot of ways. So a lot of this is about basically feeling like we’re hopefully heading toward a regime of testing whether AI models are dangerous, and stopping them if they are, and having the infrastructure in place to basically make that be able to work. So it’s not a generic thing where the government should give itself all the option value, but it should be setting up for that kind of thing to basically work.

Rob Wiblin: As I understand it, if the National Security Council in the US concluded that a model that was about to be trained would be a massive national security hazard, and might lead to human extinction, people aren’t completely sure which agency or who has the legitimate legal authority to prevent that from going ahead.

Holden Karnofsky: Or if anyone does. Yeah. No one’s sure if anyone has that authority.

Rob Wiblin: Yeah. Right. And it seems like that’s something that should be patched at least. Even if you’re not creating the ability to seize all of the equipment and so on with the intention of using it anytime soon, maybe it should be clear that there’s some authority that is meant to be monitoring this, and should take action if they conclude that something’s a massive threat to the country.

Holden Karnofsky: Yeah. Possibly. I think I’m most excited about what I think of as promising regulatory frameworks that could create good incentives and could help us every year, and a little bit less about the tripwire for the D-Day. I think a lot of times with AI, I’m not sure there’s going to be one really clear D-Day — or by the time it comes, it might be too late. So I am thinking about things that could just put us on a better path day by day.

What people with an audience could be doing differently [01:59:38]

Rob Wiblin: OK, pushing on to people who have an audience — like people who are active on social media, or journalists, or, I guess, podcasters (heaven forfend). You wrote this article, “Spreading messages to help with the most important century,” which was targeted at this group.

Back in ancient times, in February, when you wrote this piece, you were saying that you thought people should tread carefully in this area, and should definitely be trying not to build up hype about AI, especially about its raw capabilities, because that could encourage further investment in capabilities. You were saying that most people, when they hear that AI could be really important, rather than falling into this caution, concern, risk management framework, they start thinking about it purely in a competitive sense, thinking, “Our business has to be at the forefront. Our country has to be at the forefront.”

And I think indeed there has been an awful lot of people thinking that way recently. But do you still think that people should be very cautious talking about how powerful AI might be, given that maybe the horse has already left the barn on that one?

Holden Karnofsky: I think it’s a lot less true than it was; it’s less likely that you hyping up AI is going to do much about AI hype. I think it’s still not a total nonissue, and especially if we’re just taking the premise that you’re some kind of communicator and people are going to listen to you.

I still think the same principle basically applies that the thing you don’t want to do is emphasise the incredible power of AI if you feel like you’re not at the same time getting much across about how AI could be a danger to everyone at once. Because I think if you do that, the default reaction is going to be, “I gotta get in on this.” And a lot of people already think they gotta get in on AI — but not everyone thinks that; not everyone is going into AI right now.

So if you’re talking to someone who you think you’re going to have an unusual impact on, I think that basic rule still seems right. And it makes it really tricky to communicate about AI. I think there’s a lot more audiences now where you just feel like these people have already figured out what a big deal this is. I need to help them understand some of the details on how it’s a big deal, and especially some of the threats of misalignment risk and stuff like that. And that kind of communication is a little bit less complicated in that way, although challenging.

Rob Wiblin: Do you have any specific advice for what messages seem most valuable, or ways that people can frame this in a particularly productive way?

Holden Karnofsky: Yeah. I wrote a post on this that you mentioned: “Spreading messages to help with the most important century.” I think some of the things that a lot of people have trouble understanding — or don’t seem to understand, or maybe just disagree with me on, and I would love to just see the dialogue get better — is this idea that AI could be dangerous to everyone at once. It’s not just about whoever gets it wins. You know, the kind of Terminator scenario I think is actually just pretty real. And the way that I would probably put it at a high level is just that there’s only one kind of mind right now; there’s only one kind of species or thing that can develop its own science and technology. That’s humans. We might be about to have two instead of one. That would be the first time in history we had two. The idea that we’re going to stay in control I think should just not be something we’re too confident in.

That would be at a high level. And then at a low level, I would say with humans too: you know, humans kind of fell out of this trial-and-error process, and for whatever reason, we had our own agenda that wasn’t good for all the other species. Now we’re building AIs by trial-and-error process. Are they going to have their own agenda? I don’t know. But if they’re capable of all the things humans are, it doesn’t feel that crazy.

I would say it feels even less crazy when you look at the details of how people build AI systems today, and you imagine extrapolating that out to very powerful systems: it’s really easy to see how we could be training these things to have goals and optimise, the way you would optimise to win a chess game. We’re not building these systems that are just these very well-understood, well-characterised reporters of facts about the world. We’re building these systems that are very opaque, trained with, like, sticks and carrots, and they may in fact have what you might think of as goals or aims. And that’s something I wrote about in detail.

So trying to communicate about why we could expect these kind of Terminator scenarios to be serious, or versions of them, to be serious, how that works mechanistically, and also just the high-level intuitions, seems like a really good message that I think could be a corrective to some of the racing and help people realise that we may — in some sense, on some dimensions, in some times — all be in this together, and that may call for different kinds of interventions from if it was just a race.

I think some of the things that are hard about measuring AI danger are really good for the whole world to be aware of. I’m really worried about a world in which, when you’re dealing with beings that have some sort of intelligence, measurement is hard. Let’s say you run a government and you’re worried about a coup: Are you going to be empirical, and go poll everyone about whether they’re plotting a coup? And then it turns out that 0% of people are plotting a coup, so there’s no coup?

Rob Wiblin: Which is great! Yeah.

Holden Karnofsky: Yeah. That’s not how that works. And that kind of empirical method works with things that are not thinking about what you’re trying to learn and how that’s going to affect your behaviour. And so with AI systems, it’s like, “We gave this thing a test to see if it would kill us, and it looks like it wouldn’t kill us.” Like, how reliable is that? There’s a whole bunch of reasons that we might not actually be totally set at that point, and that these measurements could be really hard.

And I think this is really key, because I think wiping out enough of the risk to make something commercialisable is one thing, and wiping out enough of the risk that we’re actually still fine after these AIs are all over the economy and could disempower humanity if they chose is another thing. Not thinking that commercialisation is going to take care of it, not thinking that we’re just going to be able to easily measure as we go: I think these are really important things for people to understand, and could really affect the way that all plays out — you know, whether we do reasonable things to prevent the risks.

I think those are the big ones. I have more in my post. The general concept is just that there’s a lot coming. It could happen really fast. And so the normal human way of just reacting to stuff as it comes may not work I think is an important message. Important message if true. If wrong, I would love people to spread that message so that it becomes more prominent, so that more people make better arguments against it, and then I change my mind.

Rob Wiblin: Yeah, I don’t know whether this is good advice, but a strategy you could take is trying to find aspects of this issue that are not fully understood yet by people who have only engaged with it quite recently. Like exactly this issue — that the measurement of safety could be incredibly difficult — is not just a matter of doing the really obvious stuff like asking the model, “Are you out to kill me?” Trying to come up with some pithy example or story or terminology that can really capture people’s imagination and stick in their mind.

I think exactly that example of the coup, where you’re saying, what you’re doing is just going around to your generals and asking them if they want to overthrow you — and then they say no, and you’re like, “Well, everything is hunky-dory.” I think that is the kind of thing that could get people to understand at a deeper level why we’re in a difficult situation.

Holden Karnofsky: I think that’s right. And I’m very mediocre with metaphors. I bet some listeners are better with them, and can do a better job.

Rob Wiblin: Katja Grace wrote one in a TIME article that I hadn’t heard before, which is saying we’re not in a race to the finish line — rather, we’re a whole lot of people on a lake that has frozen over, but the ice is incredibly thin. And if any of us start running, then we’re all just going to fall through because it’s going to crack. That’s a great visualisation of it.

Holden Karnofsky: Yeah. Interesting.

Jobs and careers to help with AI [02:06:47]

Rob Wiblin: OK, let’s push on. We’ve talked about AI labs, governments, and advocates. But the final grouping is the largest one, which is just jobs and careers, which is what 80,000 Hours is typically meant to be about. What’s another way that some listeners might be able to help with this general issue by changing the career that they go into or the skills that they develop?

Holden Karnofsky: Yeah. So I wrote a post on this, called “Jobs that can help with the most important century.”

The first thing I want to say is that I just do expect this stuff to be quite dynamic. Right now, I think we’re in a very nascent phase of evals and standards. I think we could be in a future world where there are decent tests of whether AI systems are dangerous, and there are decent frameworks for how to keep them safe, but there needs to be just more work on advocacy and communication so that people actually understand this stuff, take it seriously, and that there is a reason for companies to do this. And also, there could be people working on political advocacy to have good regulatory frameworks for keeping humanity safe. So I think the jobs that exist are going to change a lot.

And I think my big thing about careers in general is: if you’re not finding a great fit with one of the current things, that’s fine, and don’t force it. If you have person A and person B. Person A is doing something that’s not clearly relevant to AI or whatever — let’s say they’re an accountant; they’re really good at it, they’re thriving, they’re picking up skills, they’re making connections, and they’re ready to go work on AI as soon as an opportunity comes up (which that last part could be hard to do on a personal level). Then you have person B who kind of has a similar profile, but they force themselves to go into alignment research, and they’re doing quite mediocre alignment research — so they’re, like, barely keeping their job. I would say person A has the higher expected impact.

I think that would be the main thing on jobs: do something where you’re good at it, you’re thriving, you’re levelling up, you’re picking up skills, you’re picking up connections. If that thing can be on a key AI priority, that is ideal. If it cannot be, that’s OK, and don’t force it.

So that is my high-level thing. But I am happy to talk about specifically what I see as some of the things people could do today, right now, on AI that don’t require starting your own org, and are more like you can slot into an existing team if you have the skills and if you have the fit. I’m happy to go into that.

Rob Wiblin: Yeah. For people who want more advice on overall career strategy, we did an episode with you on that back in 2021: episode #110: Holden Karnofsky on building aptitudes and kicking ass. So I can definitely recommend going back and listening to that. But more specific roles, are there any ones that you wanted to highlight?

Holden Karnofsky: Yeah. I mean, some of them are obvious. There’s people working on AI alignment. There’s also people working on threat assessment, which we’ve talked about, and dangerous capability evaluations at AI labs or sometimes at nonprofits. And if there’s a fit there, I think that’s just an obviously great thing to be working on. We’ve talked about information security, so I don’t think we need to say more about that.

I think there is this really tough question of whether you should go to an AI company and do things there that are not particularly safety or policy or security — just like helping the company succeed. In my opinion, that can be a really great way to skill up, a really great way to personally become a person who knows a lot about AI, understands AI, swims in the water, and is well positioned to do something else later. There’s big upsides and big downsides to helping an AI company succeed at what it’s doing, and it really comes down to how you feel about the company. So it’s a tricky one, but it’s one that I think is definitely worth thinking about, thinking about carefully.

Then there’s roles in government and there’s roles in government-facing think tanks, just trying to help, and I think that the interest is growing. So trying to help the government make good decisions, including not making rash moves, about how it’s dealing with AI policy, what it’s regulating, what it’s not regulating, et cetera. So those are some things. I had a few others listed in my post, but I think it’s OK to stop there.

Rob Wiblin: Yeah. One path, broadly speaking, was going and working in the AI labs, or in nearby industries or firms that they collaborate with, and I guess there’s a whole lot of different ways you could have an impact there. I suppose the other one is thinking about governance and policy, where you could just pursue any kind of government and policy career, try to flourish as much as you can, and then turn your attention towards AI later on, because there’s sure to be an enormous demand for more analysis and work on this in coming years. So hopefully, in both cases, you’ll be joining very rapidly growing industries.

Holden Karnofsky: That’s right. And for the latter, the closer the better. So working on technology policy is probably best.

Rob Wiblin: What about people who don’t see any immediate opportunity to enter into either of those broad streams? Is there anything that you think that they could do in the meantime?

Holden Karnofsky: Yeah. I did talk before about the kind of person who could just be good at something and kind of wait for something to come up later. It might be worth emphasising that the ability to switch careers is going to get harder and harder as you get further and further into your career. So in some ways, if you’re a person who’s being successful, but is also making sure that you’ve got the financial resources, the social resources, the psychological resources, so that you really feel confident that as soon as a good opportunity comes up to do a lot of good, you’re going to actually switch jobs, or have a lot of time to serve on a board or whatever — it just seems incredibly valuable.

I think it’s weird because this is not a measurable thing, and it’s not a thing you can, like, brag about when you go to an effective altruism meetup. And I just wish there was a way to kind of recognise that the person who is successfully able to walk away, when they need to, from a successful career has, in my mind, more expected impact than the person who’s in the high-impact career right now, but is not killing it.

Rob Wiblin: Yeah. So I expect an enormous growth in roles that might be relevant to this problem in future years, and also just an increasing number of types of roles that might be relevant, because there could just be all kinds of new projects that are going to grow, and we’ll require people who are just generally competent — you know, who have management experience, who know how to deal with operations and legal, and so on. They’re going to be looking for people who share their values. So if you’re able to potentially move to one of the hubs and take one of those roles when it becomes available, if it does, then that’s definitely a big step up, relative to locking yourself into something else where you can’t shift.

Holden Karnofsky: I was going to say also that, spreading messages we talked about, but I have a feeling that being a person who is a good communicator, a good advocate, a good persuader, I have a feeling that’s going to become more and more relevant, and there’s going to be more and more jobs like that over time. Because I think we’re in a place now where people are just starting to figure out what a good regulatory regime might look like, what a good set of practices might look like for containing the danger. And later, there’ll be more maturity there and more stress placed on “and people need to actually understand this, and care about it, and do it.”

Rob Wiblin: Yeah. I mean, setting yourself the challenge of taking someone who is not informed about this, or might even be sceptical about this, and, with arguments that are actually sound (as far as you know), persuading them to care about it for the right reasons and to understand it deeply: that is not simple. And if you’re able to build the skill of doing that through practice, it would be unsurprising if that turned out to be very useful in some role in future.

Holden Karnofsky: And I should be clear there’s a zillion versions of that, that have dramatically different skill sets. So there’s people who work in government, and there’s some kind of government subculture that they’re very good at communicating with in government-ese. And then there’s people who make viral videos. Then there’s people who organise grassroots protests. There’s so many. There’s journalists: there’s highbrow journalists, lowbrow journalists. Communication is not a generalisable skill. There’s an audience, and there’s a gazillion audiences, and there are people who are terrible with some audiences and amazing with other ones. So this is many, many jobs, and I think there’ll be more and more over time.

Audience questions on AI [02:14:25]

Rob Wiblin: OK, we’re just about to wrap up this AI section. I just had two questions from the audience to run by you first. One audience member asked: “What, if anything, should Open Philanthropy have done two to five years ago to put us in a better position to deal with AI now? Is there anything that we missed?”

Holden Karnofsky: In terms of actual stuff we literally missed, I feel like this whole idea of evals and standards is, like, everyone’s talking about it now, but it would’ve been much better if everyone was talking about it five years ago. That would’ve been great. I think in some ways, this research was too hard to do before the models got pretty good, but there might have been some start on it, at least with understanding how it works in other industries and starting to learn lessons there.

Security, obviously, I have regrets about. There were some attempts to push it from 80k and from Open Phil, but I think those attempts could have been a lot louder, a lot more forceful. I think it’s possible that security being the top hotness in EA — rather than alignment — it’s not clear to me which one of those would be better. And having the two be equal I think probably would have been better.

I don’t know. There’s lots of stuff where I kind of wish we’d just paid more attention to all of this stuff faster, but those are the most specific things that are easier for me to point to.

Rob Wiblin: What do you think of the argument that we should expect a lot of useful alignment research to get done ultimately because it’s necessary in order to make the products useful? I think Pushmeet Kohli made this argument on the on the show many years ago, and I’ve definitely definitely heard it recently as well.

Holden Karnofsky: I think it could be right. In some ways, it feels like it’s almost definitely right, to an extent or something. It’s just certain AI systems that don’t at all behave how you want are going to be hard to commercialise, and AI systems that are constantly causing random damage and getting you in legal trouble, that’s not going to be a profitable business. So I do think a lot of the work that needs to get done is going to get done by normal commercial incentives. I’m very uncomfortable having that be the whole plan.

One of the things I am very worried about — if you’re really thinking of AI systems as capable of doing what humans can do — is that you could have situations where you’re training AI systems to be well behaved, but what you’re really training them to do is to be well behaved unless they can get away with bad behaviour in a permanent way. And just like a lot of humans, they behave themselves because they’re part of a law-and-order situation. And if they ever found themselves able to, you know, gain a lot of power or break the rules and get away with it, they totally would. A lot of humans are like that. You could have AIs that you’ve basically trained to be like that.

It reminds me a little bit of some of the financial crisis stuff, where you could be doing things that drive your day-to-day risks down, but kind of concentrate all your risk in these highly correlated tail events. I don’t think it’s guaranteed, but I think it’s quite worrying we can be in a world where, in order to get your AIs to be commercially valuable, you have to get them to behave themselves — but you’re only getting them to behave themselves up to the point where they could definitely get away with it. And they’re actually capable enough to be able to tell the difference between those two things.

So I don’t want our whole plan to be “commercial incentives will take care of this.” And if anything, I tend to be focused on the parts of the problem that seem less likely to get naturally addressed that way.

Rob Wiblin: Yeah. Another analogy there is to the forest fires: where, as I understand it, because people don’t like forest fires, we basically prevent forests from ever having fires. But then that causes more brush to build up, and then every so often, you have some enormous cataclysmic fire that you just can’t put out, because the amount of combustible material there is extraordinarily high — more than you ever would have had naturally before humans started putting out these fires.

I guess that’s one way in fact that trying to prevent small-scale bad outcomes, or trying to prevent minor misbehaviour by models, could give you a false sense of security, because you’d be like, “We haven’t had a forest fire in so long.” But then, of course, all you’re doing is causing something much worse to happen because you’ve been lulled into complacency.

Holden Karnofsky: I’m not that concerned about a false sense of security. I think we should try and make things good, and then argue about whether they’re actually good. So you know, I think we should try and get AI models to behave. And after we’ve done everything we can to do that, we should ask: Have we really got them to behave? What might we be missing? So I don’t think we shouldn’t care if they’re being nice, but I think it’s not the end of the conversation.

Rob Wiblin: Another audience member asked, “How should people who’ve been thinking about and working on AI safety for many years react to all of these ideas suddenly becoming much more popular in the mainstream than they ever were?”

Holden Karnofsky: I don’t know. Brag about how everyone else is a poser?

Rob Wiblin: Don’t encourage me, Holden.

Holden Karnofsky: Like, I think we should still care about these issues. I think that people who were not interested in them before and interested in them now, we should be really happy, and we should welcome them in, and see if we can work productively with them. What else is the question?

Rob Wiblin: I guess it reminded me of the point you made in a previous conversation we had, where you said lots of people, including us, were a bit ahead curve on COVID — were kind of expecting this sort of thing to happen. And then we saw that it was going to happen weeks or months before anyone else did, and that didn’t really help. We didn’t manage to do anything.

And I’m worried with this. Like, on one level, I feel kind of smug that I feel like I was ahead of the curve on noticing this problem. But I’m also, like, we didn’t manage to fix it. We didn’t manage to convince people. So I guess there’s a degree of smugness, and we’ve got to eat humble pie at the same time.

Holden Karnofsky: In some ways I feel better about this one. I do feel like the early concern about AI was productive. We’ll see. But I generally feel like the public dialogue is probably different from what it would have been if there wasn’t a big set of people talking about these risks and trying to understand them and help each other understand them. I think there’s different people working in the field: we don’t have a field that’s just 100% made of people whose entire goal in life is making money. That seems good. There’s people in government who care about this stuff, who are very knowledgeable about it, who aren’t just coming at it from the beginning, and who understand some of the big risks.

So I think good has been done. I think the situation has been made better. I think that’s debatable. I don’t think it’s totally clear. I’m not feeling like nothing was accomplished. But yeah, I’m with you that being right ahead of time, that is not —

Rob Wiblin: It’s not enough.

Holden Karnofsky: It’s not my goal in life. It is not effective altruism’s goal in life. You could be wrong ahead of time and be really helpful. You can be right ahead of time and be really useless. So yeah, I would definitely say, let’s focus pragmatically on solving this problem. All these people who weren’t interested before and are now, let’s be really happy that they’re interested now and figure out how we can all work together to reduce AI risk. Let’s notice how the winds are shifting and how we can adapt.

Rob Wiblin: Yeah. OK, let’s wrap up this AI section. We’ve been talking about this for a couple of hours, but interestingly, I feel like we’ve barely scratched the surface on any of these different topics. We’ve been keeping up a blistering pace, and I’ll keep you for your entire workday.

It is just interesting how many different aspects there are to this problem, and how hard it is to get a grip on all of them. I think one thing you said before we started recording is just that your views on this are evolving very quickly, and so I think probably we need to come back and have another conversation about this in six or 12 months. I’m sure you have more ideas, and maybe we can go into detail on some specific ones.

Holden Karnofsky: Yeah. That’s right. If I were to kind of wrap up where I see the AI situation right now, I think there’s definitely more interest. People are taking risks more seriously. People are taking AI more seriously. I don’t think anything is totally solved or anything even in terms of public attention.

Alignment research has been really important for a long time, and remains really important. And I think there’s more interesting avenues of it that are getting somewhat mature than there used to be. There’s more jobs in there. There’s more to do.

I think the evals and standards stuff is newer, and I’m excited about it. And I think in a year, there may be a lot more to do there, a lot a lot.

Another thing that I have been updating on a bit is that there is some amount of convergence between different concerns about AI, and we should lean into that while not getting too comfortable with it. So I think, you know, we’re at a stage right now where the main argument that today’s AI systems are not too dangerous is that they just can’t do anything that bad, even if humans try to get them to. When that changes, I think we should be more worried about misaligned systems, and we should be more worried about aligned systems that bad people have access to.

I think for a while, those concerns are going to be quite similar, and people who were concerned about aligned systems and misaligned systems are going to have a lot in common. I don’t think that’s going to be true forever, so I think in a world where there’s a pretty good balance of power — and lots of different humans have AIs, and they’re kind of keeping each other in check — I think you would worry less about misuse and more about alignment. Because misaligned AIs could end up all on one side against humans, or mostly one side, or just fighting each other in a way where we’re collateral damage.

So right now a lot of what I’m thinking about in AI is pretty convergent, just like how can we build a regime where we detect danger — which just means anything that an AI could do that feels like it could be really bad for any reason — and stop it? And at some point, it’ll get harder to make some of these tradeoffs.

Rob Wiblin: OK, to be continued.

Holden vs hardcore utilitarianism [02:23:33]

Rob Wiblin: Let’s push on to something completely different, which is this article that you’ve been working on where you lay out your reservations about, in one version, you call it “hardcore utilitarianism,” and in another one, you call it “impartial expected welfare maximisation.” For the purposes of an audio version, let’s just call this “hardcore utilitarianism.”

To give some context to tee you up here a little, this is a topic that we’ve discussed with Joe Carlsmith in episode #152: Joe Carlsmith on navigating serious philosophical confusion. And we also actually touched on it at the end of episode #147 with Spencer Greenberg.

Basically, over the years, you’ve found yourself talking to people who are much more all-in on some sort of utilitarianism than you are. And from reading the article, the draft, I think the conclusions they draw that bother you the most are that Open Philanthropy, or perhaps the effective altruism community, should only have been making grants to improve the long-term future — and maybe only making grants or only doing any work related to artificial intelligence — rather than diversifying and hedging all of our bets across a range of different worldviews, and splitting our time and resources between catastrophic risk reduction, as well as helping the present generation of people and helping nonhuman animals, among other different plausible worldviews. And also, maybe the conclusion that some people draw is that they should act uncooperatively with other people who have different values, whenever they think that they can get away with it.

Do you want to clarify any more the attitudes that you’re reacting to here?

Holden Karnofsky: Yeah. I mean, one piece of clarification is just that the articles you’re talking about, one of them is like a 2017 or 2018 Google Doc that I probably will just never turn into a public piece. And another is a dialogue that I started writing, that I do theoretically intend to publish someday, but it might never happen or it might be a very long time.

The story I would tell is, you know, I cofounded GiveWell. I’ve always been interested in doing the most good possible in kind of a hand-wavy rough way. One way of talking about what I mean by “doing the most possible” is that there’s a kind of apples-to-apples principle that says, when I’m choosing between two things that really seem like they’re pretty apples-to-apples — pretty similar — I want to do the one that helps more people more. When I feel like I’m able to really make the comparison, I want to do the thing that helps more people more.

That is a different principle from a more all-encompassing “Everything can be converted into apples,” and “All interventions are on the same footing,” and “There’s one thing that I should be working on that is the best thing,” and “There is an answer to whether it’s better to increase the odds that many people will ever get to exist versus reducing malaria in Africa versus helping chickens on factory farms.”

I’ve always been a little less sold on that second way of thinking. So there’s that, you know, the “more apples principle”: I want more apples when it’s all apples. And then there’s the, like, “it’s all one fruit principle” or something — these are really good names that I’m coming up with on the spot that I’m sure will stand the test of time.

You know, I got into this world, and I met other people interested in similar topics. A lot of them identify as effective altruists. And I encountered these ideas that were more hardcore, and were more saying something like there is one correct way of thinking about what it means to do good or to be ethical — and that comes down to basically utilitarianism. This can basically be: A, you see it by looking in your heart and seeing that subjective experience is all that matters and that everything else is just heuristics for optimising pleasure and minimising pain; B, you can show it with various theorems — like Harsanyi’s Aggregation Theorem tells you that if you’re trying to give others the deals and gambles they would choose, then it falls out of that that you need some form of utilitarianism. And I’ve written a piece kind of going into all the stuff this means.

And people kind of say that we think we have really good reason to believe that after humanity has been around for longer — and is wiser, if this happens — we will all realise that the right way of thinking about what it means to be a good person is just to basically be a utilitarian: take the amount of pleasure, minus pain, add it up, maximise that, be hardcore about it. Like, don’t lie and be a jerk for no reason — but if you ever somehow knew that doing that was going to maximise utility, that’s what you should do.

And I ran into that point of view, and that point of view also was very eye roll at the idea that Open Philanthropy was going to do work in longtermism and global health and wellbeing. And my basic story is that I have updated significantly toward that worldview compared to where I started, but I am still less than half into it. And furthermore, the way that I deal with that is not by multiplying through and doing another layer of expected value, but by saying, if I have a big pool of money, I think less than half of that money should be following this worldview. I’ve been around for a long time in this community. I think I’ve now heard out all of the arguments. And that’s still where I am.

And so my basic stance is that I think that we are still very deeply confused about ethics. I think we don’t really know what it means to “do good.” And I think that reducing everything to utilitarianism is probably not workable. I think it probably actually just breaks in very simple mathematical ways. And I think we probably have to have a lot of arbitrariness in our views of ethics. I think we probably have to have some version of just caring more about people who are more similar to us or closer to us.

And so I still am basically unprincipled on ethics. I still basically have a lot of things that I care about that I’m not sure why I care about. I would still basically take a big pot of money and divide it up between different things. I still believe in certain moral rules that you’ve got to follow. Not as long as you don’t know the outcome, but just you just gotta follow them. End of story, period, don’t overthink it. That’s still where I am.

So I don’t know. Yeah, I wrote a dialogue trying to explain why this is, for someone who thinks the reason I would think this is because I hadn’t thought through all the hardcore stuff — instead, just addressing the hardcore stuff very directly.

Rob Wiblin: Yeah, in prepping for this interview, you might have thought that we would have ended up having a debate about whether impartial expected welfare maximisation is the right way to live or the right theory of morality. But actually, it seems like we mostly disagree on how many people actually are really all-in on hardcore utilitarianism. I guess my impression is, at least the people that I talk to — who maybe are somewhat filtered and selected — many people, including me, absolutely think that impartial expected welfare maximisation is underrated by the general public —

Holden Karnofsky: I think that. Yeah.

Rob Wiblin: — and if you focus on increasing wellbeing, there’s an awful lot of good that you can do there. And that most people aren’t thinking about that. But nonetheless, I’m not confident that we’ve solved philosophy. I’m not so confident that we’ve solved ethics. The idea that pleasure is good and suffering is bad feels like among the most plausible claims that one could make about what is valuable and what is disvaluable. But the idea of things being objectively valuable is an incredibly odd one. It’s not clear how we could get any evidence about that that would be fully persuasive. And clearly, philosophers are very split. So we’re forced into this odd position of wanting to hedge our bets a bit between this theory that seems like maybe the most plausible ethical theory, but also having lots conflicting intuitions with it, and also being aware that many smart people don’t agree that this is the right approach at all.

But it sounds like you’ve ended up in conversations with people who, maybe they have some doubts, but they are pretty hardcore. They really feel like there’s a good chance that when we look back, we’re going to be like, it was total utilitarianism all along, and everything else was completely confused.

Holden Karnofsky: Yeah. I think that’s right. I think there’s definitely room for some nuance here. You don’t have to think you’ve solved philosophy. The position a lot of people take is more like, I don’t really put any weight on random commonsense intuitions about what’s good, because those have a horrible track record; the history of commonsense morality looks so bad that I just don’t really care what it says. So I’m going to take the best guess I’ve got at a systematic, science-like — you know, with good scientific properties of simplicity and predictiveness — system of morality. That’s the best I can do. And furthermore, there’s a chance it’s wrong, but you can do another layer of expected value maximisation and multiply that through.

And so I’m basically going to act as if maximising utility is all that matters and specifically maximising the kind of pleasure-minus-pain type thing of subjective experience. That is the best guess. That is how I should act. When I’m unsure what to do, I may follow heuristics, but if I ever run into a situation where the numbers just clearly work out, I’m going to do what the numbers say.

And I not only think that’s not definitely right. A minority of me is into that view. Is it the most plausible view? I would say no. I would say the most plausible view of ethics is that it’s a giant mishmash of different things, and that what it means to be good and do good is like a giant mishmash of different things, and we’re not going to nail it anytime soon. Is it the most plausible thing that’s kind of neat and clean and well defined? Well, I would say definitely total utilitarianism is not. I think total utilitarianism is completely screwed. Makes no sense. It can’t work at all. But I think there’s a variant of it, sometimes called UDASSA, that I’m OK saying that’s the most plausible we’ve got or something, and gets a decent chunk, but not a majority of what I’m thinking about.

Rob’s 1 minute explainer of UDASSA

Rob Wiblin: You’re doing a bunch of work, presumably it’s kind of stressful sometimes, in order to help other people. And you started GiveWell, wanting to help the global poor. Maybe it would be worth laying out your conception of morality, and what motivates you to do things in order to make the world better?

Holden Karnofsky: A lot of my answer to that is just… I don’t know. Sometimes when people interview me about these thought experiments — do you save the painting? — I’ll just be like, “I’m not a philosophy professor.” And that doesn’t mean I’m not interested in philosophy — like I said, I’ve argued this stuff into the ground — but a lot of my conclusion is just that philosophy is a non-rigorous methodology with an unimpressive track record, and I don’t think it is that reliable or that important. And it isn’t that huge a part of my life — and I find it really interesting, because that’s not because I’m unfamiliar with it; it’s because I think it shouldn’t be.

So I’m not that philosophical a person in many ways. I’m super interested in it — I love talking about it; I have lots of takes. But I think when I make high-stakes important decisions about how to spend large amounts of money, I’m not that philosophical of a person, and most of what I do does not rely on unusual philosophical views. I think it can be justified to someone with quite normal takes on ethics.

Rob Wiblin: Yeah. So one thing is that you’re not a moral realist — so you don’t believe that there are kind of objective, mind-independent facts about what is good and bad and what one ought to do.

Holden Karnofsky: I have never figured out what this position is supposed to mean, and I’m hesitant to say I’m not one because I don’t even know what it means. So if you can cash something out for me that has a clear pragmatic implication, I will tell you if I am or not, but I’ve never really even gotten what I’m disagreeing with or agreeing with on that one.

Rob Wiblin: From reading the article, it sounded like you had some theory of doing good, or some theory of what the enterprise that you’re engaged in — when you try to live morally, or when you try to make decisions about what way you should give money — and that it’s something about acting on your preferences about making the world better; it’s at least about acting on the intuitions you have about what good behaviour is.

Holden Karnofsky: Sure, I mean, I generally am subjectivist. Like, when I hear “subjectivism,” I’m like, “That sounds right.” When I hear “moral realism,” I don’t go, “That sounds wrong” — I’m like, “I don’t know what you’re saying.” And I have tried to understand. I can try again now if you want.

Rob Wiblin: I mean, if moral realism is true, it’s a very queer thing, as philosophers say. Realism about moral facts is not seemingly the same as scientific facts about the world. It’s kind of not clear how we’re causally connected to these facts.

Holden Karnofsky: Exactly. I’ve heard many different versions of moral realism, and some of them feel like a terminological or semantic difference with my view, and others, I’m just like, this sounds totally nutso. I don’t know. I have trouble being in or out on this thing, because it just means so many things, and I don’t know which one it means, and I don’t know what the more interesting versions are even supposed to mean. But it’s fine.

Yes, I’m a subjectivist. More or less the most natural way I think about morality is just, I decide what to do with my life, and there’s certain flavours of pull that I have, and those are moral flavours, and I try to make myself do the things that the moral flavours are pulling me on. I think that makes me a better person when I do.

Rob Wiblin: So maybe a way of highlighting the differences here will be to imagine this conversation, where you’re saying, “I’m leading Open Philanthropy. I think that we should split our efforts between a whole bunch of different projects, each one of which would look exceptional on a different plausible worldview.” And the hardcore utilitarian comes to you and says, “No, you should choose the best one and fund that. Spend all of your resources and all of your time, just focus on that best one.” What would you say to them in order to justify the worldview diversification approach?

Holden Karnofsky: The first thing I would say to them is just, “The burden’s on you.” And this is kind of a tension I often have with people who consider themselves hardcore: they’ll just feel like, well, why wouldn’t you be a hardcore utilitarian? Like, what’s the problem? Why isn’t it just maximising the pleasure and minimising the pain, or the sum of the difference? And I would just be like, “No. No. No. You gotta tell me, because I am sitting here with these great opportunities to help huge amounts of people in very different and hard-to-compare ways.

And the way I’ve always done ethics before in my life is, like, I basically have some voice inside me and it says, “This is what’s right.” And that voice has to carry some weight. Even on your model, that voice has to carry some weight, because you — you, the hardcore utilitarian, not Rob, because we all know you’re not at all — it’s like, even the most systematic theories of ethics, they’re all using that little voice inside you that says what’s right. That’s the arbiter of all the thought experiments. So we’re all putting weight on it somewhere, somehow. And I’m like, cool. That’s gotta be how this works. There’s a voice inside me saying, “this feels right” or “this feels wrong” — that voice has gotta get some weight.

That voice is saying, “You know what? It is really interesting to think about these risks to humanity’s future, but also, it’s weird.” This work is not shaped like the other work. It doesn’t have as good feedback loops. It feels icky — like a lot of this work is about just basically supporting people who think like us, or it feels that way a lot of the time, and it just feels like it doesn’t have the same ring of ethics to it.

And then on the other hand, it just feels like I’d be kind of a jerk if… Like, Open Phil, I believe — and you could disagree with me — is not only the biggest, but the most effective farm animal welfare funder in the world. I think we’ve had enormous impact and made animal lives dramatically better. And you’re coming to say to me, “No, you should take all that money and put it into the diminishing margin of supporting people to think about some future x-risk in a domain where you mostly have a lot of these concerns about insularity.” Like, you’ve got to make the case to me — because the normal way all this stuff works is you listen to that voice inside your head, and you care what it says. And some of the opportunities Open Phil has to do a lot of good are quite extreme, and we do them.

So that’s the first thing: we’ve got to put the burden of proof in the right place. Because I think utilitarianism is definitely interesting and has some things going for it, especially if you patch it and make it UDASSA, although that makes it less appealing. But where’s the burden of proof?

Rob Wiblin: Yeah. OK, so I suppose, to buy into this the hardcore utilitarian view, one way to do it would be that you’re committed to moral realism. I guess you might be committed to hedonism as a theory of value — so it’s only pleasure and pain. I guess then you also add on kind of a total view, so it’s just about the complete aggregate, that’s all that matters. You’re going to say there’s no side-constraints, and all of your other conflicting moral intuitions are worthless, so you should completely ignore those.

Are there any other more philosophical commitments that underpin this view that you think are implausible and haven’t been demonstrated to a sufficient extent?

Holden Karnofsky: I don’t think you need all those at all. I mean, I’ve steelmanned the hell out of this position as well as I could. I’ve written up this series called “Future-proof ethics.” I think the title kind of has been confusing and I regret it, but it’s trying to get at this idea that I want an ethics that — whether because it’s correct and real, or because it’s what I would come to on reflection — is in some sense better. That is, in some sense, what I would have come up with if I had more time to think about it.

And what would that ethics look like? I don’t think you need moral realism to care about this. You can make a case for utilitarianism that just starts from, like, gosh, humanity has a horrible track record of treating people horribly. We should really try and get ahead of the curve. We shouldn’t be listening to commonsense intuitions that are actually going to be quite correlated with the rest of our society, and that looks bad from a track record perspective. We need to figure out the fundamental principles of morality as well as we can — we’re not going to do it perfectly — and that’s going to put us ahead of the curve, and make us less likely to be the kind of people that would think we were moral monsters if we thought about it more. So you don’t need moral realism.

You don’t need hedonism at all. I mean, most people do do this with hedonism, but I think you can just say, like, if you want to use Harsanyi’s Aggregation Theorem — which means if you basically want it to be the case that, every time, everyone would prefer one kind of state of affairs to another, you do that first state of affairs — you can get from there and some other assumptions to basically at least a form of utilitarianism that says a large enough number of small benefits can outweigh everything else. I call this the “utility legions corollary” — it’s like a play on utility monsters. It’s like once you decide that something is valuable — like helping a chicken or helping a person get a chance to exist — there’s some number of that thing that can outweigh everything else. And that doesn’t reference hedonism; it’s just this idea of like, come up with anything that you think is nontrivially beneficial, and a very large number of it beats everything and wins over the ethical calculus. That’s a whole set of mathematical or logical steps you can take that don’t invoke hedonism at any point.

So I think the steelman version of this would not have a million premises. It would say that we really want to be ahead of the curve. That means we want to be systematic: we want to have a minimal set of principles so that we can be systematic, and make sure we’re really only basing our morality on the few things we feel best about, and that’s how we’re going to avoid being moral monsters. One of the things we feel best about is this utilitarianism idea which has the utility legions corollary. Once we’ve established that, now we can establish that a person who gets to live instead of not living is a benefit — and therefore enough of them can outweigh everything else. And then we can say, look. If there’s 10⁵⁰ of them in the future in expectation, that outweighs everything else that we could ever plausibly come up with.

That, to me, is the least-assumption route. And then you could tack on some other stuff. Like, people who have thought this way in the past did amazing. Jeremy Bentham was the first utilitarian, and he was early on women’s rights, and animal rights, and antislavery, and gay rights, and all this other stuff. So this is just like, yep, it’s a simple system. It looks great looking backward. It’s built on these rock-solid principles of utilitarianism and systematicity, and maybe sentientism — which is a thing I didn’t quite cover, which I should have. But, you know, radical impartiality — caring about everyone no matter where they are; as long as they’re physically identical, you have to care about them the same — you could basically derive from that this system, and then that system tells you that there’s so many future generations that just everything has to come down to them. Now maybe you have heuristics about how to actually help them, but everything ultimately has to be how to help them. So that would be my steelman.

Rob Wiblin: Yeah. So I’ve much more often encountered the kind of grounding for this that Sharon Hewitt Rawlette talked about in episode #138, which is this far more philosophical approach to it. But the case you make there, it doesn’t sound like some watertight thing to me, because, especially once you start making arguments like “it has a good historical track record,” you can be like, well, I’m sure I’ve got some stuff wrong. And also, maybe it could be right in the past, but wrong in the future. It’s not an overwhelming argument. What do you say to people who bring to you this basic steelman of the case?

Holden Karnofsky: I say a whole bunch of stuff. The first thing I would say is just that it’s not enough. You know, we’re not talking about a rigorous discipline here. You haven’t done enough. The stakes are too high. You haven’t done enough work to establish this.

The specific things I would get into is, I would first just be like, I don’t believe you buy your own story. I think even the people who believe themselves very hardcore utilitarians, it’s because no one designs thought experiments just to mess with them, and I think you totally can. One thought experiment I’ve used — and anyone is going to reject some of these — but one of them is like: there’s an asteroid, and it’s about to hit Cleveland and destroy it entirely, and kill everyone there. But no one will be blamed. You know somehow that this has a neutral effect on the long-run future. Would you prevent the asteroid from hitting Cleveland for 35 cents?

Well, you could give that 35 cents to the Center for Effective Altruism or 80,000 Hours or MIRI. So as a hardcore utilitarian, your answer has to be no, right? Someone offers you this, no one else can do it: you either give 35 cents or you don’t, to stop the asteroid from hitting Cleveland. You say no because you want to donate that money to something else. I think most people will not go for that.

Rob Wiblin: Nobody, I think. Yeah.

Holden Karnofsky: There’s simpler instances of this. I think most of these hardcore utilitarians — not all of them, but actually, most of them — are super, super into honesty. They try to defend it. They’ll be like, “Clearly, honesty is the way to maximise utility.” And I’m just like, how did you figure that out? Like, what? Your level of honesty is way beyond what most of the most successful and powerful people are doing, so how does that work? How did you determine this? This can’t be right. So I think most of these hardcore utilitarians actually have tensions within themselves that they aren’t recognising, that you can just draw out if you red team them instead of doing the normal philosophical thought experiments thing to normal people.

And then another place I go to challenge this view is that I do think the principles people try to build this thing on — the central ones are the utilitarianism idea, which is this thing that I didn’t explain well with Harsanyi’s Aggregation Theorem. But I do have it written up; you can link to it. I could try and explain it better, but whatever. I think it’s a fine principle, so I’m not going to argue with it.

The other principle people are leaning on is impartiality, and I think that one is screwed, and it doesn’t work at all.

Rob Wiblin: Yeah. So you think the impartiality aspect of this is just completely busted? Do you want to elaborate on why that is?

Holden Karnofsky: Yeah. I think you covered it a bit with Joe, but I have a little bit of a different spin on it, and a little bit more of an aggro spin on it. One way to think about impartiality is a minimum condition for what we might mean by impartiality would be that if two persons — or people or whatever; I call them “persons” to just include animals and AIs and whatever — two persons, let’s say they’re physically identical. Then you should care about them equally. I would claim if you’re not meeting that condition, then it’s weird to call yourself impartial. Something is up, and probably the hardcore person is not a big fan of you.

And I think you just can’t do that. All the infinite ethics stuff, it just completely breaks that. And not in a way that, in just like a weird corner case, sometimes it might not work. It’s just like actually: Should I donate a dollar to charity? Well, across the whole multiverse, incorporating expected value and a finite nonzero probability of an infinite-size universe, then it just follows that my dollar helps and hurts infinite numbers of people. And there’s no answer to whether it is a good dollar or bad dollar, because if it helps one person then hurts 1,000, then helps one then hurts 1,000 onto infinity, versus if it helps 1,000 then hurts one, then helps 1,000 then hurts one onto infinity, those are the same. They’re just rearranged. There is no way to compare two infinities like that. It cannot be done. It’s not like no one’s found it yet. It just can’t be done.

Your system actually just breaks completely. It just won’t tell you a single thing. It returns an undefined every time you ask it a question.

Rob Wiblin: Yeah. We’re rushing through this, but that was actually was kind of the bottom line of the episode with Alan Hájek. It was episode #139 on puzzles and paradoxes in probability and expected value. It’s just a bad picture. It’s not pleasant. It’s not a pretty picture.

Holden Karnofsky: Yeah. Exactly. And I have other beefs with impartiality. I think I could actually go on for quite a while, and at some point, I’ll write it all up. But I think just anything you try to do, where you’re just like, “Here’s a physical pattern or a physical process or a physical thing — and everywhere it is, I care about it equally,” I’m just like, you’re going to be so screwed. It’s not going to work. The infinities are the easiest-to-explain way it doesn’t work, but it just doesn’t work.

And so the whole idea that you were building this beautiful utilitarian system on one of the things you could be confident in, well, one of the things you were confident in was impartiality, and it’s gotta go. And Joe presented it as, well, you have these tough choices in infinite ethics because you can’t have all of Pareto and impartiality, which you call anonymity, and transitivity. And I’m like, yeah, you can’t have all of them. You’ve got to obviously drop impartiality. You can’t make it work. The other two are better: keep the other two, drop impartiality.

Once you drop impartiality, I don’t know. Now we’re in the world of just, like, some things are physically identical, but you care more about one than the other. In some ways, that’s a very familiar world. Like, I care more about my family than about other people, really not for any good reason. You just have to lean into that, because that’s what you are as a human being. You care more about some things than others, not for good reasons. You can use that to get out of all the infinite ethics jams. There’s some trick to it, and it’s not super easy, but basically, as long as you’re not committing to caring about everyone, you’re going to be OK. And as long as you are, you’re not. So don’t care about everyone, and this whole fundamental principle that was supposed to be powering this beautiful morality just doesn’t work.

Rob Wiblin: Do you want to explain a little bit the mechanism that you’d use to get away from it? Basically, you define some kind of central point, and then as things get further and further away from that central point, then you value them less. As long as you value them less at a sufficiently rapid rate, then things sum to 1 rather than ending up summing to infinity. And so now you can make comparisons again.

Holden Karnofsky: Yeah. Exactly. This is all me bastardising and oversimplifying stuff, but basically you need some system that says we’re discounting things at a fast enough rate that everything adds up to a finite number. And we’re discounting them even when they’re physically identical to each other — we’ve got to have some other way of discounting them. So a stupid version of this would be that you declare a centre of the universe in space and time, and Everett branch and everything like that, and you just discount by distance from that centre. And if you discount fast enough, you’re fine and you don’t run into the infinities.

The way that I think more people are into, I’ve referred to it a couple times already, UDASSA, is you say, “I’m going to discount you by how long of a computer program I have to write to point to you.” And then you’re going to be like, “What the hell are you talking about? What computer program? In what language?” And I’m like, “Whatever language. Pick a language. It’ll work.” And you’re like, “But that’s so horrible. That’s so arbitrary. So if I picked Python versus I picked Ruby, then that’ll affect who I care about?” And I’m like, “Yeah it will; it’s all arbitrary. It’s all stupid, but at least you didn’t get screwed by the infinities.”

Anyway, I think if I were to take the closest thing to a beautiful, simple, utilitarian system that gets everything right, UDASSA would actually be my best guess. And it’s pretty unappealing, and most people who say they’re hardcore say they hate it. I think it’s the best contender. It’s better than actually adding everything up.

Rob Wiblin: So that’s one approach you could take. The infinity stuff makes me sad, because inasmuch as we’re right that we’re just not going to be able to solve this — so we’re not going to get any elegant solution that resembles our intuitions, or that embodies impartiality in the way that we care about; maybe now we’re valuing one person because it was easier to specify where they are using the Ruby programming language — that doesn’t capture my intuitions about value or about ethics. It’s a very long way from them in actual fact. It feels like any system like that is just so far away from what I entered this entire enterprise caring about that I’m tempted to just give up and embrace nihilism, or…

Holden Karnofsky: Yeah. I think that’s a good temptation. Not nihilism.

Rob Wiblin: I’ll just be like, “I’m just going to do stuff that I want to do.” And, yeah.

Holden Karnofsky: Well, yeah. That’s kind of where I am. I’m like, UDASSA is the best you can do. You probably like it a lot less than what you thought you were doing. And a reasonable response would be, “Screw all this.” And then after you screw all this, OK, what are you doing? And I’m like, “Well, what I’m doing…”

Rob Wiblin: “Well, I still like my job…”

Holden Karnofsky: I still like my job. I still care about my family. And you know what? I still want to be a good person. What does that mean? I don’t know. I don’t know. Like, I notice when I do something that feels like it’s bad. I notice when I do something and it feels like it’s good. I notice that I’m glad I started a charity evaluator that helps people in Africa and India instead of just spending my whole life making money. I don’t know. That didn’t change. I’m still glad I did that. And I just don’t have a beautiful philosophical system that gives you three principles that can derive it, but I’m still glad I did it.

That’s pretty much where I’m at. And that’s where I come back to being like, I’m not that much of a philosophy guy, because I think philosophy isn’t really that promising. But I am a guy who works really hard to try and do a lot of good, because I don’t think you need to be a philosophy guy to do that.

Rob Wiblin: Joe said in the interview that if your philosophy feels like it’s breaking, then that’s probably a problem with the philosophy rather than with you. And I wonder whether we can turn that to this case, where we say we don’t really know why induction works, but nonetheless we all go about our lives as if induction is reasonable. And likewise, we might say we don’t know the solution to these infinity paradoxes and the multiverse and all of that, but nonetheless, impartial welfare maximisation feels right.

So hopefully, at some point, we’ll figure out how to make this work, and how to make it reasonable. And, you know, in the meantime, I’m not going to let these funny philosophical thought experiments take away from me what I thought the core of ethics really was.

Holden Karnofsky: But my question is: why is that the core of ethics? So my thing is I want to come back to the burden of proof. I just want to be like, “Fine. We give up. Now what are we doing?” And I’m like, if someone had a better idea than induction, I’d be pretty interested, but it seems like no one does.

But I do think there is an alternative to these very simple, beautiful systems of ethics that tell you exactly when to break all the normal rules. I think the alternative is just that you don’t have a beautiful system — you’re just a person like everyone else. Just imagine that you’re not very into philosophy, and you still care about being a good person. That’s most people. You can do that. That seems like the right default. Then you’ve got to talk me out of that. You’ve got to be, like, “Holden, here’s something that’s much better than that, even though it breaks.” And I haven’t heard someone do that.

Rob Wiblin: Yep. Well, despite everything you’ve just said, you say that you think the impartial expected welfare maximisation is underrated. You wish that people, like the random person on the street, did it more. Do you want to explain how that can possibly be?

Holden Karnofsky: I don’t think UDASSA is that bad. I mean, there’s no way it’s going to be the final answer or something. But the idea that it’s something like that, and later, we’re going to come up with something better that’s kind of like it. There’s going to be partiality in there. It might be that it’s sort of, like, I try to be partial and arbitrary, but in a very principled way — where I just kind of take the universe as I live in it, and try to be fair and nice to those around me. And I have to weight them a certain way, so I took the simplest way of weighting them. It’s not going to be as compelling as the original vision for utilitarianism was supposed to be, but I don’t think it’s that bad.

And I think there’s some arguments that are actually like, this weird simplicity criterion of how easy it is to find you with a computer program, you could think of that as like, what is your measure in the universe, or how much do you exist, or how much of you is there in the universe? There are some arguments you could think of it that way. So I don’t know. I don’t think UDASSA is totally screwed, but I’m not about to, like, shut down Open Philanthropy’s Farm Animal Welfare programme because of this UDASSA system. So that’s more or less the middle ground that I’ve come to.

I also just think there’s a lot of good in just, without the beautiful system, just challenging yourself. Just saying, “Commonsense morality really has done badly. Can I do better? Can I do some thought experiments until I really believe with my heart that I care a lot more about the future than I thought I did and think a lot about the future?” I think that’s fine. I think the part where you say the 10⁵⁰ number is taken literally — and it is, in the master system, exactly 10⁵⁰x as important as saving one life — that’s the dicier part.

Rob Wiblin: Yeah. I thought you might say that the typical person walking around, who hasn’t thought about any of these issues, nonetheless cares about other people and about having their lives go well, at least a bit. And they might not have appreciated just how large an impact they could have if they turned a bit of their attention to that, how much they might be able to help other people. So without any deep philosophy or any great reflection or changing in their values, it’s actually just pretty appealing to help to do things that effectively help other people. And that’s kind of what motivates you, I imagine.

Holden Karnofsky: Totally. I love trying to do the most good possible, defined in kind of a sloppy way that isn’t a beautiful system. And I even like the philosophical thought experiments. They have made me move a bit more toward caring about future generations, and especially whether they get to exist, which I think intuitively is not exciting to me at all. It still isn’t that exciting to me, but it is more than it used to be. So I think there’s value in here, but the value comes from wrestling with the stuff, thinking about it, and deciding where your heart comes out in the end. But I just think the dream of a beautiful system isn’t there.

The final thing I want to throw in there too, and I mentioned this earlier in the podcast, but if you did go in on UDASSA — or you had the beautiful system, or you somehow managed to be totally impartial — I do think longtermism is a weird conclusion from that. You at least should realise that what you actually should care about is something far weirder than future generations. And if you’re still comfortable with it, great. And if you’re not, you may want to also rethink things.

Rob Wiblin: Yeah. A slightly funny thing about having this conversation in 2023 is that I think worldview diversification doesn’t get us as far as it used to. Or the idea of wanting to split your bets across different worldviews, as AI becomes just a more dominant and obviously important consideration in how things are going to play out — and not just for strange existential-risk-related reasons. It seems incredibly related now to how we’ll be able to get people out of poverty; we’ll be able to solve lots of medical problems. It wouldn’t be that crazy to try to help farm animals by doing something related to ML models at some point in the next few decades. And also, if you think that it’s plausible that we could go extinct because of AI in the next 10 years, then just in terms of saving lives of people alive right now, it seems like an incredibly important and neglected issue.

It’s just a funny situation to be in, where the different worldviews that we kind of picked out 10 years ago now might all kind of be converging, at least temporarily, on a very similar set of activities, because of some very odd, historically abnormal, and indeed, deeply suspicious empirical facts that we happen to be living through right now.

Holden Karnofsky: That’s exactly what I feel. And it’s all very awkward, because it makes it hard for me to explain what I even disagree with people on, because I’m like, well, I do believe we should be mostly focused on AI risk — though not exclusively — and I am glad that Open Phil puts money in other things. But I do believe AI risk is the biggest headline because of these crazy historical events that could be upon us. I disagree with these other people who say we should be into AI risk because of this insight about the size of the future. Well, that’s awkward. And it’s kind of a strange state of affairs, and I haven’t always known what to do with it.

But yeah, I do feel that the effective altruism community has kind of felt philosophy-first. It’s kind of felt like our big insight is there’s a lot of people in the future, and then we’ve kind of worked out the empirics and determined the biggest threat to them is AI. And I just like reversing it; I just like being like: We are in an incredible historical period. No matter what your philosophy, you should care a lot about AI. And this stuff about most of the people being in the future, I think it’s interesting. I don’t think it’s worthless; I put some weight on it, but I don’t think it’s as important an argument, really. And I think if you try and really nail it, and build a beautiful philosophical system around it, the system will either just break, or it will go into much weirder territory.

Rob Wiblin: Yeah. Carl Shulman made this case a couple of years ago as well in episode #112: Carl Shulman on the common-sense case for existential risk work. He was saying similar things, that we’re on much weaker ground with a lot of philosophy than people might imagine. And, really, the thing that in practice would cause many more people to value work to reduce existential risk is just agreeing with us on how high the risk is — that the juice comes from the empirical claim that, in fact, we’re at enormous risk, whereas most people don’t believe that. Or they believe that in some very vague sense, but they haven’t really understood what that implies.

Holden Karnofsky: Another thing I would just like to say is, on this topic, I think people are listening to me and being like, “Well, Holden’s totally wrong. Like, I’ve got a beautiful hardcore system.” I would encourage you to, A, just like arguments for authority, I would encourage you to read “Base camp for Mt. Ethics” by Nick Bostrom. I think he’s clearly got some different intuitions about morality from this super hardcore thing, and he’s often seen as the person who did the most for the super hardcore longtermist point of view. I would also just say, I would listen to the Joe podcast, and I would really think about, even if you want to embrace this stuff, where it goes, because I don’t think future generations are where most of the value is on this view.

I think that actually there becomes a really good case for just, like, being cooperative and nice. And if anything, if you want to be really hardcore and impartial, you might start caring more about helping little old ladies cross the street, and being nice and being cooperative, and making sure we treat the AI systems well instead of worrying so much about how they’re going to treat us. There’s arguments that that kind of stuff is actually going to become enormous.

And I think if you’re finding that all too weird, you should probably just become not a super philosophy person, like me. But if you want to be weird, I would encourage listening to the Joe interview and going down that trail instead.

Rob Wiblin: I don’t know. It doesn’t sound too bad. Maybe I’ll become hardcore in that event. Yeah, I’ll flip around, become hardcore, but then do like, normal stuff.

Holden Karnofsky: Maybe. It’s all like, maybe and confusing. It’s all like, maybe this is the biggest thing, or maybe it’s not at all. And maybe thinking about it is bad.

Rob Wiblin: Or you should do the opposite of that.

Holden Karnofsky: Or maybe thinking about it is super dangerous, and the thing to do is to think about whether you should think about it. It’s going to be a mess, is my prediction.

The track record of futurists seems… fine [03:02:27]

Rob Wiblin: All right. This has all been a little bit heavy. So let’s wrap up with a slightly funner one. Over the years, I’ve heard many people say that they’re sceptical about any long-term planning for humanity’s future, because they think that efforts 50 to 100 years ago to predict the future just have an incredibly poor track record. People just kind of embarrassed themselves with the way that they thought the present would be. And because of that, efforts to guide the future in a positive direction are probably also similarly futile, because we’re not going to be able to expect how things are going to be.

But you took an actual look at this last year, and wrote up an article titled, “The track record of futurists seems… fine.”
What did you end up concluding?

Holden Karnofsky: Well, I ended up concluding that I don’t know — mostly we should just say we don’t know what the track record of futurists is. There are people who say the future is so hard to predict, and everyone just looks stupid after they try to predict the future, and you’re going to look stupid too. And I think that’s just too strong. I also do think predicting the future is hard, and I don’t think I have evidence that it’s easy or anyone has done a great job of it.

But I did a really easy, simple thing. I was just like, who are three famous people who think about the future? The big three science fiction writers: Isaac Asimov, Robert Heinlein, Arthur Clarke. Let’s just find a bunch of their predictions and score them and see how they did. We contracted with Arb Research to do that.

It came out, and I looked at the results and I was like, you know, I’m pretty impressed by Asimov. That’s not like a quantitative statement I can prove. We didn’t have a benchmark we were comparing him to, or a clear place where his Brier score needed to be; he wasn’t giving probabilities. But when I look at his scored list of predictions, and I read the predictions, and I asked myself how hard they were, I’m like, Asimov did a pretty… He had a lot of big misses, he had a lot of big hits, and definitely doesn’t seem like a monkey throwing darts at the dartboard. Heinlein did kind of a terrible job, but had a couple hits. Clarke was somewhere in between.

And I think I just came out being like, let’s all slow our roll here. Mostly people have not tried to predict the far future with a lot of seriousness. These are sci-fi writers just saying stuff. You rarely see something more rigorous than that. You almost never see something where people put probabilities on things and gave their reasons; we looked at some of those too and came to similar conclusions. So I would just say, let’s slow our roll. We just don’t know how good we are at predicting the future. Most people haven’t tried that hard. And when the stakes are this high, we should do our best and see how it goes.

Since then, I’ve seen criticisms of this piece of mine, and I’ve seen two different analyses done comparing the criticisms to my original piece, and I just haven’t had time to process it all, and I think I eventually will. I think I’m going to come out at a similar place — just like, we just don’t really know — but I might come out saying, actually, all these people humiliated themselves, and maybe we should be less confident.

Rob Wiblin: Can you recall what the most impressive prediction from Asimov was?

Holden Karnofsky: It’s like 1964, and he’s like, in 2014, only unmanned ships will have landed on Mars, though a manned expedition will be in the works. Stuff like that. You have to read the predictions and just think about how you feel about them. I don’t claim that there’s, like, a number that I can shove in your face to be like, “They nailed it.” I think that the more literally you take it and the more you’re like, “They had to be dead-on or I’m scoring them incorrect,” the worse it’s going to look.

But if you think everyone talking about these AI risks is just going to look like a total fool and everything turned out totally different to what they think, I just don’t think that’s something you should be confident in. And if you look at Asimov’s predictions, he doesn’t look like that much of a fool. He looks like he got some stuff wrong.

Rob Wiblin: Yeah. I would definitely bet against that as well.

Parenting [03:05:48]

Rob Wiblin: Last time we did an interview, you said you were about to take some leave because your partner was about to have a kid. I’m now intending to start a family as well, fingers crossed.

Holden Karnofsky: Oh, nice.

Rob Wiblin: I got to talk with Elie Hassenfeld about his experience making a family a couple of weeks ago. How’s it been for you? What surprising stuff has happened?

Holden Karnofsky: I think the biggest surprise for me has been on the positive side. It’s weird. I’m in a weird part of the discourse, where the discourse on kids is just like, “You’ll hate every minute of it. It’ll have all these downsides. But you know you want to do it, or maybe part of you needs to do it, or you’ll be glad you did it.” Or something. And I think the happiness impact for me has been way more positive than I expected. So the hard parts have been, whatever, as hard as they were supposed to be. But both my wife and I are just, like, wow, we did not think it would be this much fun. We did not think it would increase our happiness. That was not what we were expecting.

And I don’t know. I just feel like it’s a complicated experience. Everyone experiences it differently, but that was the thing that surprised me most. It’s just, like, the best part of my day, like 80% of the time or something, is just hanging out with the kid and kind of doing nothing. So yeah, it’s just weird and surprising.

Rob Wiblin: Yeah. Do you have any idea of what the mechanism is? The standard narrative is that you sacrifice short-term happiness for long-term fulfilment or something. But no?

Holden Karnofsky: That’s why I was surprised. I just don’t understand it. I just feel like I just enjoy hanging out with him for absolutely no good reason at all. I don’t know. I just expected it to be like, “I want to do it, but it’s not fun.” But no, it’s just super fun. It’s like way more fun than other things I do for fun. It’s weird.

Rob Wiblin: Sounds like your genes have solved the alignment problem with Holden a little bit more than people might have thought. Well, I’m very partial to myself, so fingers crossed I get lots of pleasure from it as well.

My guest today has been Holden Karnofsky. Thanks so much for coming on The 80,000 Hours Podcast, Holden.

Holden Karnofsky: Thank you, Rob.

Rob’s outro [03:07:47]

Rob Wiblin: Hey listeners, I just wanted to comment on two things that came up in the last episode with Ezra Klein.

The first one is quite charming. You might recall Ezra said that Chuck Schumer had given a talk announcing his proposed SAFE Innovation Framework, at the Center for Security and Emerging Technology — when that talk was actually at the similarly named CSIS or Center for Strategic and International Studies. And I noted that, but we kept it all in because frankly I thought the point Ezra was making was correct even if this wasn’t really an example of it.

Well, it turns out CSIS might in fact be an even better example of a fairly small investment in doing relevant work in DC and building relationships with legislators paying off, because as a listener helpfully pointed out, this foreign relations think tank CSIS actually does have a small programme on AI that’s funded mostly or perhaps even entirely by people worried about existential threats from AI, and even that pretty modest programme, smaller than the Center for Security and Emerging Technology by far, was enough for Chuck Schumer to decide to announce his SAFE Innovation Framework there.

But we got strong feedback from listeners on a second point as well.

If you listened, you might recall that Ezra said he was troubled that from his anecdotal experience just meeting and talking to people, so many folks who are concerned about AI existential risk were doing high-level work around how AI governance ideally might be set up years into the future — but few were engaging with current legislative proposals and suggesting improvements. Or meeting with people in Congress in order to build relationships and trust and inform legislators in ways that could be useful in the long term as well.

That was also my anecdotal perception as well. And it’s my perception that this is generally true across almost all policy areas. I loved an interview Ezra did with Jennifer Pahlka earlier this year — titled “The Book I Wish Every Policymaker Would Read” — which basically talks about how by and large people are more concerned with discussing high-level principles legislation should follow, and less about implementation details. That skew creates all sorts of problems for the people meant to deliver government services.

Now, one could plausibly argue back that perhaps it’s actually, at this stage, better to keep working on the high-level questions — because really we need a radically different approach from anything currently being suggested, and so incremental amendments to current legislation isn’t helpful.

I also know that years ago, plenty of folks were nervous about doing government work on AI because they feared that it could hasten the militarisation of AI and they thought that would be for the worse.

But actually, I mostly heard back from people objecting pretty strongly to whether this is actually true in this case that there’s a lack of nuts-and-bolts policy discussion with people in DC.

Of course, as Ezra mentioned, there’s CSET — which, like 80,000 Hours, has gotten funding from Open Philanthropy, and has been around for years working on these and related issues. I interviewed Helen Toner, one of its founders and now an OpenAI board member, for episode #61 — Helen Toner on emerging technology, national security, and China.

And this week, Dario Amodei — who spoke about AI risk on this podcast in 2017, now leads Anthropic, and has been attending meetings at the White House — testified in front of the Senate about risks posed by AI, and put forward specific policy ideas that I imagine staff at Anthropic have been working on. Yoshua Bengio and Stuart Russell — the most cited computer scientist, and author of the main AI textbook, respectively — gave testimony alongside him.

There’s also the Future of Life Institute’s involvement with details of the EU AI Act and their advocacy around regulation of killer robots over many years now. You can find out about that by googling “Strengthening the European Union AI Act” if you like. I believe the reason they’ve focused on regulation in the EU is simply that they started drafting their big piece of AI legislation a few years earlier.

And there are lots of people with lower profiles who are out in the trenches doing this sort of work in DC and beyond who aren’t so visible to Ezra and me, and I very much wouldn’t want those people to feel undervalued by the discussion on that podcast.

Anyway, while I obviously still would love to see more people in these roles, I’ve been happy to learn this week that the space isn’t quite as neglected as I’d previously thought. I think that’s really good news in my mind.

All right, I hope you enjoyed that episode with Holden. And maybe if you hadn’t listened to that episode with Ezra — those clarifications might tempt you to go back and check it out.

All right, The 80,000 Hours Podcast is produced and edited by Keiran Harris.

The audio engineering team is led by Ben Cordell, with mastering and technical editing for this episode by Simon Monsour.

Full transcripts and an extensive collection of links to learn more are available on our site and put together by Katy Moore.

Thanks for joining, talk to you again soon.

Learn more

The 80,000 Hours Podcast on Artificial Intelligence and related topics

Preventing an AI-related catastrophe

Information security in high-impact areas

AI governance and policy

Related episodes

August 19, 2021

#109 – Holden Karnofsky on the most important century

Listen now

May 5, 2023

#150 – Tom Davidson on how quickly AI could transform the world

Listen now

August 26, 2021

#110 – Holden Karnofsky on building aptitudes and kicking ass

Listen now

May 12, 2023

#151 – Ajeya Cotra on accidentally teaching AI models to deceive us

Listen now

February 27, 2018

#21 – The world’s most intellectual foundation is hiring. Holden Karnofsky, founder of GiveWell, on how philanthropy can have maximum impact by taking big risks.

Listen now

June 14, 2022

#132 – Nova DasSarma on why information security may be critical to the safe development of AI systems

Listen now

October 5, 2021

#112 – Carl Shulman on the common-sense case for existential risk work and its practical implications

Listen now

May 19, 2023

#152 – Joe Carlsmith on navigating serious philosophical confusion

Listen now

About the show

The 80,000 Hours Podcast features unusually in-depth conversations about the world's most pressing problems and how you can use your career to solve them. We invite guests pursuing a wide range of career paths — from academics and activists to entrepreneurs and policymakers — to analyse the case for and against working on different issues and which approaches are best for solving them.

The 80,000 Hours Podcast is produced and edited by Keiran Harris. Get in touch with feedback or guest suggestions by emailing [email protected].

What should I listen to first?

We've carefully selected 10 episodes we think it could make sense to listen to first, on a separate podcast feed:

Check out 'Effective Altruism: An Introduction'

Subscribe here, or anywhere you get podcasts:

If you're new, see the podcast homepage for ideas on where to start, or browse our full episode archive.

On this page:

Highlights

Explosively fast progress

AI population explosion

Misaligned AI might not kill us all, and aligned AI could be catastrophic

Getting the AIs to do our alignment homework for us

The value of having a successful and careful AI lab

Why information security is so important

Holden vs hardcore utilitarianism

Articles, books, and other media discussed in the show

Transcript

Cold open [00:00:00]

Rob’s intro [00:01:01]

The interview begins [00:02:18]

Explosively fast progress [00:06:55]

AI population explosion [00:23:54]

Misaligned AI might not kill us all, and aligned AI could be catastrophic [00:27:18]

Where Holden sees overconfidence among AI safety researchers [00:35:08]

Where Holden disagrees with ML researchers [00:41:22]

What Holden really believes [00:45:40]

The value of laying out imaginable success stories [00:49:19]

Holden’s four-intervention playbook [01:06:16]

Standards and monitoring [01:09:20]

Designing evaluations [01:17:17]

The value of having a successful and careful AI lab [01:33:44]

Why information security is so important [01:39:48]

What governments could be doing differently [01:52:55]

What people with an audience could be doing differently [01:59:38]

Jobs and careers to help with AI [02:06:47]

Audience questions on AI [02:14:25]

Holden vs hardcore utilitarianism [02:23:33]

The track record of futurists seems… fine [03:02:27]

Parenting [03:05:48]

Rob’s outro [03:07:47]

Learn more

The 80,000 Hours Podcast on Artificial Intelligence and related topics

Preventing an AI-related catastrophe

Information security in high-impact areas

AI governance and policy

Related episodes

About the show

What should I listen to first?