#197 – Nick Joseph on whether Anthropic's AI safety policy is up to the task
The three biggest AI companies — Anthropic, OpenAI, and DeepMind — have now all released policies designed to make their AI models less likely to go rogue or cause catastrophic damage as they approach, and eventually exceed, human capabilities. Are they good enough?
That’s what host Rob Wiblin tries to hash out in this interview (recorded May 30) with Nick Joseph — one of the 11 people who left OpenAI to launch Anthropic, its current head of training, and a big fan of Anthropic’s “responsible scaling policy” (or “RSP”). Anthropic is the most safety-focused of the AI companies, known for a culture that treats the risks of its work as deadly serious.
As Nick explains, these scaling policies commit companies to dig into what new dangerous things a model can do — after it’s trained, but before it’s in wide use. The companies then promise to put in place safeguards they think are sufficient to tackle those capabilities before availability is extended further. For instance, if a model could significantly help design a deadly bioweapon, then its weights need to be properly secured so they can’t be stolen by terrorists interested in using it that way.
As capabilities grow further — for example, if testing shows that a model could exfiltrate itself and spread autonomously in the wild — then new measures would need to be put in place to make that impossible, or to demonstrate that the model could never develop such a goal.
Nick points out three big virtues of the RSP approach:
- It allows us to set aside the question of when any of these things will be possible, and focus the conversation on what would be necessary if they are possible — something there is usually much more agreement on.
- It means we don’t take costly precautions, which developers will resent and resist, before they’re actually called for.
- As the policies don’t allow models to be deployed until suitable safeguards are in place, they align a firm’s commercial incentives with safety: for example, a profitable product release could be blocked because the firm underinvested in computer security or alignment research years earlier.
Rob then pushes Nick on some of the best objections to the RSP mechanisms he’s found, including:
- It’s hard to trust that profit-motivated companies will stick to their scaling policies long term and not water them down to make their lives easier — particularly as commercial pressure heats up.
- Even if you’re trying hard to find potential safety concerns, it’s difficult to truly measure what models can and can’t do. And if we fail to pick up a dangerous ability that’s really there under the hood, then perhaps all we’ve done is lull ourselves into a false sense of security.
- Importantly, in some cases humanity simply hasn’t invented safeguards up to the task of addressing AI capabilities that could show up soon. Maybe that will change before it’s too late — but if not, we’re being written a cheque that will bounce when it comes due.
Nick explains why he thinks some of these worries are overblown, while others are legitimate but just point to the hard work we all need to put in to get a good outcome.
Nick and Rob also discuss whether it’s essential to eventually hand over operation of responsible scaling policies to external auditors or regulatory bodies, if those policies are to hold up against the intense commercial pressures that may end up arrayed against them.
In addition to all of that, Nick and Rob talk about:
- What Nick thinks are the current bottlenecks in AI progress: people and time (rather than data or compute).
- What it’s like working in AI safety research at the leading edge, and whether pushing forward capabilities (even in the name of safety) is a good idea.
- What it’s like working at Anthropic, and how to get the skills needed to help with the safe development of AI.
And as a reminder, if you want to let us know your reaction to this interview, or send any other feedback, our inbox is always open at [email protected].
Producer and editor: Keiran Harris
Audio engineering: Ben Cordell, Milo McGuire, Simon Monsour, and Dominic Armstrong
Video engineering: Simon Monsour
Transcriptions: Katy Moore