#226 – Holden Karnofsky on dozens of opportunities to make AI safer lying on the table — and all his AGI takes
For years, working on AI safety usually meant theorising about the ‘alignment problem’ or trying to convince other people to give a damn. Even when you could find a way to help, the work was frustrating and offered little feedback.
According to Anthropic’s Holden Karnofsky, this situation has now reversed completely.
There are now plenty of useful, concrete, shovel-ready projects with clear goals and deliverables. Holden thinks people haven’t appreciated the scale of this shift, and wants everyone to see the wide range of ‘well-scoped object-level work’ they could personally help with, in both technical and non-technical areas.
In today’s interview, Holden — previously cofounder and CEO of Open Philanthropy (now Coefficient Giving) — lists 39 projects he’s excited to see happening, including:
- Training deceptive AI models to study deception and how to detect it
- Developing classifiers to block jailbreaking
- Implementing security measures to stop ‘backdoors’ or ‘secret loyalties’ from being added to models in training
- Developing policies on model welfare, AI-human relationships, and what instructions to give models
- Training AIs to work as alignment researchers
And that’s all just stuff he’s happened to observe directly, which is probably only a small fraction of the options available.
All this low-hanging fruit is one factor behind his decision to join Anthropic this year. That said, his wife is a cofounder and president of the company, giving him a big financial stake in its success — and making it impossible for him to be seen as independent no matter where he worked.
Holden makes a case that, for many people, working at an AI company like Anthropic will be the best way to steer AGI in a positive direction. He notes there are “ways that you can reduce AI risk that you can only do if you’re a competitive frontier AI company.” At the same time, he believes external groups have their own advantages and can be equally impactful.
Outside critics worry that Anthropic’s efforts to stay at that frontier encourage competitive racing towards AGI — significantly or entirely offsetting any useful research they do. Holden thinks this seriously misunderstands the strategic situation we’re in.
“I work at an AI company, and a lot of people think that’s just inherently unethical,” he says. “They’re imagining [that] everyone wishes they could go slowly, but they’re going fast so they can beat everyone else. […] But I emphatically think this is not what’s going on in AI.”
The reality, in Holden’s view:
I think there’s too many players in AI who […] don’t want to slow down. They don’t believe in the risks. Maybe they don’t even care about the risks. […] If Anthropic were to say, “We’re out, we’re going to slow down,” they would say, “This is awesome! Now we have a better chance of winning, and this is even good for our recruiting” — because they have a better chance of getting people who want to be on the frontier and want to win.
Holden believes a frontier AI company can reduce risk by:
- Developing cheap, practical safety measures other companies might adopt
- Prototyping policies regulators could mandate
- Gathering crucial data about what advanced AI can actually do
Host Rob Wiblin and Holden discuss the case for and against those strategies, and much more, in today’s episode.
This episode was recorded on July 25 and 28, 2025.
Video editing: Simon Monsour, Luke Monsour, Dominic Armstrong, and Milo McGuire
Audio engineering: Milo McGuire, Simon Monsour, and Dominic Armstrong
Music: CORBIT
Coordination, transcriptions, and web: Katy Moore