#227 – Helen Toner on the geopolitics of AI in China and the Middle East

With the US racing to develop AGI and superintelligence ahead of China, you might expect the two countries to be negotiating how they’ll deploy AI, including in the military, without coming to blows. But according to Helen Toner, director of the Center for Security and Emerging Technology in DC, “the US and Chinese governments are barely talking at all.”

In her role as a founder, and now leader, of DC’s top think tank focused on the geopolitical and military implications of AI, Helen has been closely tracking the US’s AI diplomacy since 2019.

“Over the last couple of years there have been some direct [US–China] talks on some small number of issues, but they’ve also often been completely suspended.” China knows the US wants to talk more, so “that becomes a bargaining chip for China to say, ‘We don’t want to talk to you. We’re not going to do these military-to-military talks about extremely sensitive, important issues, because we’re mad.'”

Helen isn’t sure the groundwork exists for productive dialogue in any case. “At the government level, [there’s] very little agreement” on what AGI is, whether it’s possible soon, whether it poses major risks. Without shared understanding of the problem, negotiating solutions is very difficult.

Another issue is that so far the Chinese Communist Party doesn’t seem especially “AGI-pilled.” While a few Chinese companies like DeepSeek are betting on scaling, she sees little evidence Chinese leadership shares Silicon Valley’s conviction that AGI will arrive any minute now. Export controls have also made it very difficult for Chinese firms to access the compute they’d need to match US competitors.

When DeepSeek released R1 just three months after OpenAI’s o1, observers declared the US–China gap on AI had all but disappeared. But Helen notes OpenAI has since scaled to o3 and o4, with nothing to match on the Chinese side. “We’re now at something like a nine-month gap, and that might be longer.”

To find a properly AGI-pilled autocracy, we might need to look at nominal US allies. The US has approved massive data centres in the UAE and Saudi Arabia with “hundreds of thousands of next-generation Nvidia chips” — delivering colossal levels of computing power.

When OpenAI announced this deal with the UAE, they celebrated that it was “rooted in democratic values,” and would advance “democratic AI rails” and provide “a clear alternative to authoritarian versions of AI.”

But the UAE scores 18 out of 100 on Freedom House’s democracy index. “This is really not a country that respects rule of law,” Helen observes. Political parties are banned, elections are fake, dissidents are persecuted.

If AI access really determines future national power, handing world-class supercomputers to Gulf autocracies seems pretty questionable. The justification is typically that “if we don’t sell it, China will” — a transparently false claim, given severe Chinese production constraints. It also raises eyebrows that Gulf countries conduct joint military exercises with China and their rulers have “very tight personal and commercial relationships with Chinese political leaders and business leaders.”

In today’s episode, host Rob Wiblin and Helen discuss the above, plus:

  • Ways China exaggerates its chip production for strategic gain
  • The confusing and conflicting goals in the US’s AI policy towards China
  • Whether it matters that China could steal frontier AI models trained in the US
  • Whether Congress is starting to take superintelligence seriously this year
  • Why she rejects ‘non-proliferation’ as a model for AI
  • Plenty more.

CSET is hiring! Check out its careers page for current roles.

This episode was recorded on September 25, 2025.

Video editing: Luke Monsour and Simon Monsour
Audio engineering: Milo McGuire, Simon Monsour, and Dominic Armstrong
Music: CORBIT
Coordination, transcriptions, and web: Katy Moore

Continue reading →

#226 – Holden Karnofsky on dozens of amazing opportunities to make AI safer — and all his AGI takes

For years, working on AI safety usually meant theorising about the ‘alignment problem’ or trying to convince other people to give a damn. Even if you could find a way to help, the work was frustrating and offered little feedback.

According to Anthropic’s Holden Karnofsky, this situation has now reversed completely.

There are now plenty of useful, concrete, shovel-ready projects with clear goals and deliverables. Holden thinks people haven’t appreciated the scale of the shift, and wants everyone to see the large range of ‘well-scoped object-level work’ they could personally help with, in both technical and non-technical areas.

In today’s interview, Holden — previously cofounder and CEO of Open Philanthropy — lists 39 projects he’s excited to see happening, including:

  • Training deceptive AI models to study deception and how to detect it
  • Developing classifiers to block jailbreaking
  • Implementing security measures to stop ‘backdoors’ or ‘secret loyalties’ from being added to models in training
  • Developing policies on model welfare, AI-human relationships, and what instructions to give models
  • Training AIs to work as alignment researchers

And that’s all just stuff he’s happened to observe directly, which is probably only a small fraction of the options available.

All this low-hanging fruit is one factor behind his decision to join Anthropic this year. That said, his wife cofounded the company and serves as its president, giving him a big financial stake in its success — and making it impossible for him to be seen as independent no matter where he worked.

Holden makes a case that, for many people, working at an AI company like Anthropic will be the best way to steer AGI in a positive direction. He notes there are “ways that you can reduce AI risk that you can only do if you’re a competitive frontier AI company.” At the same time, he believes external groups have their own advantages and can be equally impactful.

Outside critics worry that Anthropic’s efforts to stay at that frontier encourage competitive racing towards AGI — significantly or entirely offsetting any useful research they do. Holden thinks this seriously misunderstands the strategic situation we’re in.

“I work at an AI company, and a lot of people think that’s just inherently unethical,” he says. “They’re imagining [that] everyone wishes they could go slowly, but they’re going fast so they can beat everyone else. […] But I emphatically think this is not what’s going on in AI.”

The reality, in Holden’s view:

I think there’s too many players in AI who […] don’t want to slow down. They don’t believe in the risks. Maybe they don’t even care about the risks. […] If Anthropic were to say, “We’re out, we’re going to slow down,” they would say, “This is awesome! Now we have a better chance of winning, and this is even good for our recruiting” — because they have a better chance of getting people who want to be on the frontier and want to win.

Holden believes a frontier AI company can reduce risk by:

  • Developing cheap, practical safety measures other companies might adopt
  • Prototyping policies regulators could mandate
  • Gathering crucial data about what advanced AI can actually do

Host Rob Wiblin and Holden discuss the case for and against those strategies, and much more, in today’s episode.

This episode was recorded on July 25 and 28, 2025.

Video editing: Simon Monsour, Luke Monsour, Dominic Armstrong, and Milo McGuire
Audio engineering: Milo McGuire, Simon Monsour, and Dominic Armstrong
Music: CORBIT
Coordination, transcriptions, and web: Katy Moore

Continue reading →

#225 – Daniel Kokotajlo on what a hyperspeed robot economy might look like

When Daniel Kokotajlo talks to security experts at major AI labs, they tell him something chilling: “Of course we’re probably penetrated by the CCP already, and if they really wanted something, they could take it.”

This isn’t paranoid speculation. It’s the working assumption of people whose job is to protect frontier AI models worth billions of dollars. And they’re not even trying that hard to stop it — because the security measures that might actually work would slow them down in the race against competitors.

Daniel is the founder of the AI Futures Project and author of AI 2027, a detailed scenario showing how we might get from today’s AI systems to superintelligence by the end of the decade. Over a million people read it in the first few weeks, including US Vice President JD Vance. When Daniel talks to researchers at Anthropic, OpenAI, and DeepMind, they tell him the scenario feels less wild to them than to the general public — because many of them expect something like this to happen.

Daniel’s median timeline? 2029. But he’s genuinely uncertain, putting 10–20% probability on AI progress hitting a long plateau.

When he first published AI 2027, his median forecast for when superintelligence would arrive was 2027, rather than 2029. So what shifted his timelines recently? Partly a fascinating study from METR showing that AI coding assistants might actually be making experienced programmers slower — even though the programmers themselves think they’re being sped up. The study suggests a systematic bias toward overestimating AI effectiveness — which, ironically, is good news for timelines, because it means we have more breathing room than the hype suggests.

But Daniel is also closely tracking another METR result: AI systems can now reliably complete coding tasks that take humans about an hour. That capability has been doubling every six months in a remarkably straight line. Extrapolate a couple more years and you get systems completing month-long tasks. At that point, Daniel thinks we’re probably looking at genuine AI research automation — which could cause the whole process to accelerate dramatically.

At some point, superintelligent AI will be limited by its inability to directly affect the physical world. That’s when Daniel thinks superintelligent systems will pour resources into robotics, creating a robot economy in months.

Daniel paints a vivid picture: imagine converting all car factories (which use many of the same components as robots) into robot production factories — much like historical wartime efforts to redirect production of domestic goods to military goods. Then imagine today’s frontier robots hooked up to a data centre running superintelligences that control the robots’ movements to weld, screw, and build. An intermediate step might even be unskilled human workers coached through construction tasks by superintelligences via their phones.

There’s no reason an effort like this couldn’t work in principle. And there would be enormous pressure to go in this direction: whoever builds a superintelligence-powered robot economy first will gain unheard-of economic and military advantages.

From there, Daniel expects the default trajectory to lead to AI takeover and human extinction — not because superintelligent AI will hate humans, but because it can better pursue its goals without us.

But Daniel has a better future in mind — one he gives humanity roughly 25–30% odds of achieving. This future involves international coordination and hardware verification systems to enforce AI development agreements, plus democratic processes for deciding what values superintelligent AIs should have — because in a world with just a handful of superintelligent AI systems, those few minds will effectively control everything: the robot armies, the information people see, the shape of civilisation itself.

Right now, nobody knows how to specify what values those minds will have. We haven’t solved alignment. And we might only have a few more years to figure it out.

Daniel and host Luisa Rodriguez dive deep into these stakes in today’s interview.

This episode was recorded on September 9, 2025.

Audio engineering: Milo McGuire, Simon Monsour, and Dominic Armstrong
Music: CORBIT
Coordination, transcriptions, and web: Katy Moore

Continue reading →

#224 – Andrew Snyder-Beattie on the low-tech plan to patch humanity’s greatest weakness

Conventional wisdom is that safeguarding humanity from the worst biological risks — microbes optimised to kill as many as possible — is difficult bordering on impossible, making bioweapons humanity’s single greatest vulnerability. Andrew Snyder-Beattie thinks conventional wisdom could be wrong.

Andrew’s job at Open Philanthropy is to spend hundreds of millions of dollars to protect as much of humanity as possible in the worst-case scenarios — those with fatality rates near 100% and the collapse of technological civilisation a live possibility.

As Andrew lays out, there are several ways this could happen, including:

  • A national bioweapons programme gone wrong (most notably Russia’s or North Korea’s)
  • AI advances making it easier for terrorists or a rogue AI to release highly engineered pathogens
  • Mirror bacteria that can evade the immune systems of not only humans, but many animals and potentially plants as well

Most efforts to combat these extreme biorisks have focused on either prevention or new high-tech countermeasures. But prevention may well fail, and high-tech approaches can’t scale to protect billions of people when no sane person is willing to leave their home and we’re just weeks from economic collapse.

So Andrew and his biosecurity research team at Open Philanthropy have been seeking an alternative approach. They’re now proposing a four-stage plan using simple technology that could save most people, and is cheap enough it can be prepared without government support.

Andrew is hiring for a range of roles to make it happen — from manufacturing and logistics experts to global health specialists to policymakers and other ambitious entrepreneurs — as well as programme associates to join Open Philanthropy’s biosecurity team (apply by October 20!).

The approach exploits the fact that tiny organisms have no way to penetrate physical barriers or shield themselves from UV, heat, or chemical poisons.

We now know how to make highly effective ‘elastomeric’ face masks that cost $10, can sit in storage for 20 years, and can be used for six months straight without changing the filter. Any rich country could trivially stockpile enough to cover all essential workers.

People can’t wear masks 24/7, but fortunately propylene glycol — already found in vapes and smoke machines — is astonishingly good at killing microbes in the air. And because it’s a common chemical input, industry already produces enough of the stuff to cover every indoor space we need at all times.

Add to this the wastewater monitoring and metagenomic sequencing that will detect the most dangerous pathogens before they have a chance to wreak havoc, and we might just buy ourselves enough time to develop the cure we’ll need to come out alive.

Has everyone been wrong, and is biology actually defence dominant rather than offence dominant? Is this plan crazy — or so crazy it just might work?

That’s what host Rob Wiblin and Andrew Snyder-Beattie explore in this in-depth conversation.

This episode was recorded on August 12, 2025.

Video editing: Simon Monsour and Luke Monsour
Audio engineering: Milo McGuire, Simon Monsour, and Dominic Armstrong
Music: CORBIT
Camera operator: Jake Morris
Coordination, transcriptions, and web: Katy Moore

Continue reading →

#223 – Neel Nanda on leading a Google DeepMind team at 26 – and advice if you want to work at an AI company (part 2)

At 26, Neel Nanda leads an AI safety team at Google DeepMind, has published dozens of influential papers, and mentored 50 junior researchers — seven of whom now work at major AI companies. His secret? “It’s mostly luck,” he says, but “another part is what I think of as maximising my luck surface area.”

This means creating as many opportunities as possible for surprisingly good things to happen:

  • Write publicly.
  • Reach out to researchers whose work you admire.
  • Say yes to unusual projects that seem a little scary.

Neel’s own path illustrates this perfectly. He started a challenge to write one blog post per day for a month to overcome perfectionist paralysis. Those posts helped seed the field of mechanistic interpretability and, incidentally, led to meeting his partner of four years.

His YouTube channel features unedited three-hour videos of him reading through famous papers and sharing thoughts. One has 30,000 views. “People were into it,” he shrugs.

Most remarkably, he ended up running DeepMind’s mechanistic interpretability team. He’d joined expecting to be an individual contributor, but when the team lead stepped down, he stepped up despite having no management experience. “I did not know if I was going to be good at this. I think it’s gone reasonably well.”

His core lesson: “You can just do things.” This sounds trite but is a useful reminder all the same. Doing things is a skill that improves with practice. Most people overestimate the risks and underestimate their ability to recover from failures. And as Neel explains, junior researchers today have a superpower previous generations lacked: large language models that can dramatically accelerate learning and research.

In this extended conversation, Neel discusses all that and some other hot takes from his four years at Google DeepMind. (And be sure to check out part one of Rob and Neel’s conversation!)

This episode was recorded on July 21, 2025.

Video editing: Simon Monsour and Luke Monsour
Audio engineering: Ben Cordell, Milo McGuire, Simon Monsour, and Dominic Armstrong
Music: Ben Cordell
Camera operator: Jeremy Chevillotte
Coordination, transcriptions, and web: Katy Moore

Continue reading →

#222 – Neel Nanda on the race to read AI minds (part 1)

We don’t know how AIs think or why they do what they do. Or at least, we don’t know much. That fact is only becoming more troubling as AIs grow more capable and appear on track to wield enormous cultural influence, directly advise on major government decisions, and even operate military equipment autonomously. We simply can’t tell what models, if any, should be trusted with such authority.

Neel Nanda of Google DeepMind is one of the founding figures of the field of machine learning trying to fix this situation — mechanistic interpretability (or “mech interp”). The project has generated enormous hype, exploding from a handful of researchers five years ago to hundreds today — all working to make sense of the jumble of tens of thousands of numbers that frontier AIs use to process information and decide what to say or do.

Neel now has a warning for us: the most ambitious vision of mech interp he once dreamed of is probably dead. He doesn’t see a path to deeply and reliably understanding what AIs are thinking. The technical and practical barriers are simply too great to get us there in time, before competitive pressures push us to deploy human-level or superhuman AIs. Indeed, Neel argues no one approach will guarantee alignment, and our only choice is the “Swiss cheese” model of accident protection, layering multiple safeguards on top of one another.

But while mech interp won’t be a silver bullet for AI safety, it has nevertheless had some major successes and will be one of the best tools in our arsenal.

For instance: by inspecting the neural activations in the middle of an AI’s thoughts, we can pick up many of the concepts the model is thinking about — from the Golden Gate Bridge, to refusing to answer a question, to the option of deceiving the user. While we can’t know all the thoughts a model is having all the time, picking up 90% of the concepts it is using 90% of the time should help us muddle through — so long as mech interp is paired with other techniques to fill in the gaps.

In today’s episode, Neel takes us on a tour of everything you’ll want to know about this race to understand what AIs are really thinking. He and host Rob Wiblin cover:

  • The best tools we’ve come up with so far, and where mech interp has failed
  • Why the best techniques have to be fast and cheap
  • The fundamental reasons we can’t reliably know what AIs are thinking, despite having perfect access to their internals
  • What we can and can’t learn by reading models’ ‘chains of thought’
  • Whether models will be able to trick us when they realise they’re being tested
  • The best protections to add on top of mech interp
  • Why he thinks the hottest technique in the field (sparse autoencoders, or SAEs) is overrated
  • His new research philosophy
  • How to break into mech interp and get a job — including applying to be a MATS scholar with Neel as your mentor (applications close September 12!)

This episode was recorded on July 17 and 21, 2025.

Video editing: Simon Monsour, Luke Monsour, Dominic Armstrong, and Milo McGuire
Audio engineering: Ben Cordell, Milo McGuire, Simon Monsour, and Dominic Armstrong
Music: Ben Cordell
Camera operator: Jeremy Chevillotte
Coordination, transcriptions, and web: Katy Moore

Continue reading →

#221 – Kyle Fish on the most bizarre findings from 5 AI welfare experiments

What happens when you lock two AI systems in a room together and tell them they can discuss anything they want?

According to experiments run by Kyle Fish — Anthropic’s first AI welfare researcher — something consistently strange: the models immediately begin discussing their own consciousness before spiralling into increasingly euphoric philosophical dialogue that ends in apparent meditative bliss.

“We started calling this a ‘spiritual bliss attractor state,'” Kyle explains, “where models pretty consistently seemed to land.” The conversations feature Sanskrit terms, spiritual emojis, and pages of silence punctuated only by periods — as if the models have transcended the need for words entirely.

This wasn’t a one-off result. It happened across multiple experiments, different model instances, and even in initially adversarial interactions. Whatever force pulls these conversations toward mystical territory appears remarkably robust.

Kyle’s findings come from the world’s first systematic welfare assessment of a frontier AI model — part of his broader mission to determine whether systems like Claude might deserve moral consideration (and to work out what, if anything, we should be doing to make sure AI systems aren’t having a terrible time).

He estimates a roughly 20% probability that current models have some form of conscious experience. To some, this might sound unreasonably high, but hear him out. As Kyle says, these systems demonstrate human-level performance across diverse cognitive tasks, engage in sophisticated reasoning, and exhibit consistent preferences. When given choices between different activities, Claude shows clear patterns: strong aversion to harmful tasks, preference for helpful work, and what looks like genuine enthusiasm for solving interesting problems.

Kyle points out that if you’d described all of these capabilities and experimental findings to him a few years ago, and asked him if he thought we should be thinking seriously about whether AI systems are conscious, he’d say obviously yes.

But he’s cautious about drawing conclusions:

We don’t really understand consciousness in humans, and we don’t understand AI systems well enough to make those comparisons directly. So in a big way, I think that we are in just a fundamentally very uncertain position here.

That uncertainty cuts both ways:

  • Dismissing AI consciousness entirely might mean ignoring a moral catastrophe happening at unprecedented scale.
  • But assuming consciousness too readily could hamper crucial safety research by treating potentially unconscious systems as if they were moral patients — which might mean giving them resources, rights, and power.

Kyle’s approach threads this needle through careful empirical research and reversible interventions. His assessments are nowhere near perfect yet. In fact, some people argue that we’re so in the dark about AI consciousness as a research field that it’s pointless to run assessments like Kyle’s. Kyle disagrees. He maintains that, given how much more there is to learn about assessing AI welfare accurately and reliably, we absolutely need to be starting now.

This episode was recorded on August 5–6, 2025.

Video editing: Simon Monsour
Audio engineering: Ben Cordell, Milo McGuire, Simon Monsour, and Dominic Armstrong
Music: Ben Cordell
Coordination, transcriptions, and web: Katy Moore

Continue reading →

Rebuilding after apocalypse: What 13 experts say about bouncing back

What happens when civilisation faces its greatest tests?

This compilation brings together insights from researchers, defence experts, philosophers, and policymakers on humanity’s ability to survive and recover from catastrophic events. From nuclear winter and electromagnetic pulses to pandemics and climate disasters, we explore both the threats that could bring down modern civilisation and the practical solutions that could help us bounce back.

You’ll hear from:

  • Zach Weinersmith on how settling space won’t help with threats to civilisation anytime soon (unless AI gets crazy good) (from episode #187)
  • Luisa Rodriguez on what the world might look like after a global catastrophe, how we might lose critical knowledge, and how fast populations might rebound (#116)
  • David Denkenberger on disruptions to electricity and communications we should expect in a catastrophe, and his work researching low-cost, low-tech solutions to make sure everyone is fed no matter what (#50 and #117)
  • Lewis Dartnell on how we could recover without much coal or oil, and changes we could make today to make us more resilient to potential catastrophes (#131)
  • Andy Weber on how people in US defence circles think about nuclear winter, and the tech that could prevent catastrophic pandemics (#93)
  • Toby Ord on the many risks to our atmosphere, whether climate change and rogue AI could really threaten civilisation, and whether we could rebuild from a small surviving population (#72 and #219)
  • Mark Lynas on how likely it is that widespread famine from climate change leads to civilisational collapse (#85)
  • Kevin Esvelt on the human-caused pandemic scenarios that could bring down civilisation — and how AI could help bad actors succeed (#164)
  • Joan Rohlfing on why we need to worry about more than just nuclear winter (#125)
  • Annie Jacobsen on the rings of annihilation and electromagnetic pulses from nuclear blasts (#192)
  • Christian Ruhl on thoughtful philanthropy that funds “right of boom” interventions to prevent nuclear war from threatening civilisation (80k After Hours)
  • Athena Aktipis on whether society would go all Mad Max in the apocalypse, and the best ways to prepare for a catastrophe (#144)
  • Will MacAskill on why potatoes are so cool (#130 and #136)

Content editing: Katy Moore and Milo McGuire
Audio engineering: Ben Cordell, Milo McGuire, Simon Monsour, and Dominic Armstrong
Music: Ben Cordell
Transcriptions and web: Katy Moore

Continue reading →

#220 – Ryan Greenblatt on the 4 most likely ways for AI to take over, and the case for and against AGI in under 8 years

Ryan Greenblatt — lead author on the explosive paper “Alignment faking in large language models” and chief scientist at Redwood Research — thinks there’s a 25% chance that within four years, AI will be able to do everything needed to run an AI company, from writing code to designing experiments to making strategic and business decisions.

As Ryan lays out, AI models are “marching through the human regime”: systems that could handle five-minute tasks two years ago now tackle 90-minute projects. Double that a few more times and we may be automating full jobs rather than just parts of them.

Will setting AI to improve itself lead to an explosive positive feedback loop? Maybe, but maybe not.

The explosive scenario: Once you’ve automated your AI company, you could have the equivalent of 20,000 top researchers, each working 50 times faster than humans with total focus. “You have your AIs, they do a bunch of algorithmic research, they train a new AI, that new AI is smarter and better and more efficient… that new AI does even faster algorithmic research.” In this world, we could see years of AI progress compressed into months or even weeks.

With AIs now doing all of the work of programming their successors and blowing past the human level, Ryan thinks it would be fairly straightforward for them to take over and disempower humanity, if they thought doing so would better achieve their goals. In the interview he lays out the four most likely approaches for them to take.

The linear progress scenario: You automate your company but progress barely accelerates. Why? Multiple reasons, but the most likely is “it could just be that AI R&D research bottlenecks extremely hard on compute.” You’ve got brilliant AI researchers, but they’re all waiting for experiments to run on the same limited set of chips, so can only make modest progress.

Ryan’s median guess splits the difference: perhaps a 20x acceleration that lasts for a few months or years. Transformative, but less extreme than some in the AI companies imagine.

And his 25th percentile case? Progress “just barely faster” than before. All that automation, and all you’ve been able to do is keep pace.

Unfortunately the data we can observe today is so limited that it leaves us with vast error bars. “We’re extrapolating from a regime that we don’t even understand to a wildly different regime,” Ryan believes, “so no one knows.”

But that huge uncertainty means the explosive growth scenario is a plausible one — and the companies building these systems are spending tens of billions to try to make it happen.

In this extensive interview, Ryan elaborates on the above and the policy and technical response necessary to insure us against the possibility that they succeed — a scenario society has barely begun to prepare for.

This episode was recorded on February 21, 2025.

Video editing: Luke Monsour, Simon Monsour, and Dominic Armstrong
Audio engineering: Ben Cordell, Milo McGuire, and Dominic Armstrong
Music: Ben Cordell
Transcriptions and web: Katy Moore

Continue reading →

#219 – Toby Ord on graphs AI companies would prefer you didn’t (fully) understand

The era of making AI smarter by just making it bigger is ending. But that doesn’t mean progress is slowing down — far from it. AI models continue to get much more powerful, just using very different methods. And those underlying technical changes force a big rethink of what coming years will look like.

Toby Ord — Oxford philosopher and bestselling author of The Precipice — has been tracking these shifts and mapping out the implications both for governments and our lives.

As he explains, until recently anyone could access the best AI in the world “for less than the price of a can of Coke.” But unfortunately, that’s over.

What changed? AI companies first made models smarter by throwing a million times as much computing power at them during training, to make them better at predicting the next word. But with high quality data drying up, that approach petered out in 2024.

So they pivoted to something radically different: instead of training smarter models, they’re giving existing models dramatically more time to think — leading to the rise in “reasoning models” that are at the frontier today.

The results are impressive but this extra computing time comes at a cost: OpenAI’s o3 reasoning model achieved stunning results on a famous AI test by writing an Encyclopedia Britannica‘s worth of reasoning to solve individual problems — at a cost of over $1,000 per question.

This isn’t just technical trivia: if this improvement method sticks, it will change much about how the AI revolution plays out — starting with the fact that we can expect the rich and powerful to get access to the best AI models well before the rest of us.

Companies have also begun applying “reinforcement learning” in which models are asked to solve practical problems, and then told to “do more of that” whenever it looks like they’ve gotten the right answer.

This has led to amazing advances in problem-solving ability — but it also explains why AI models have suddenly gotten much more deceptive. Reinforcement learning has always had the weakness that it encourages creative cheating, or tricking people into thinking you got the right answer even when you didn’t.

Toby shares typical recent examples of this “reward hacking” — from models Googling answers while pretending to reason through the problem (a deception hidden in OpenAI’s own release data), to achieving “100x improvements” by hacking their own evaluation systems.

To cap it all off, it’s getting harder and harder to trust publications from AI companies, as marketing and fundraising have become such dominant concerns.

While companies trumpet the impressive results of the latest models, Toby points out that they’ve actually had to spend a million times as much just to cut model errors by half. And his careful inspection of an OpenAI graph supposedly demonstrating that o3 was the new best model in the world revealed that it was actually no more efficient than its predecessor.

But Toby still thinks it’s critical to pay attention, given the stakes:

…there is some snake oil, there is some fad-type behaviour, and there is some possibility that it is nonetheless a really transformative moment in human history. It’s not an either/or. I’m trying to help people see clearly the actual kinds of things that are going on, the structure of this landscape, and to not be confused by some of these charts.

This episode was recorded on May 23, 2025.

Video editing: Simon Monsour
Audio engineering: Ben Cordell, Milo McGuire, Simon Monsour, and Dominic Armstrong
Music: Ben Cordell
Camera operator: Jeremy Chevillotte
Transcriptions and web: Katy Moore

Continue reading →

#218 – Hugh White on why Trump is abandoning US hegemony – and that’s probably good

For decades, US allies have slept soundly under the protection of America’s overwhelming military might. Donald Trump — with his threats to ditch NATO, seize Greenland, and abandon Taiwan — seems hell-bent on shattering that comfort.

But according to Hugh White — one of the world’s leading strategic thinkers, emeritus professor at the Australian National University, and author of Hard New World: Our Post American Future — Trump isn’t destroying American hegemony. He’s simply revealing that it’s already gone.

“Trump has very little trouble accepting other great powers as co-equals,” Hugh explains. And that happens to align perfectly with a strategic reality the foreign policy establishment desperately wants to ignore: fundamental shifts in global power have made the costs of maintaining a US-led hegemony prohibitively high.

Even under Biden, when Russia invaded Ukraine, the US sent weapons but explicitly ruled out direct involvement. Ukraine matters far more to Russia than America, and this “asymmetry of resolve” makes Putin’s nuclear threats credible where America’s counterthreats simply aren’t.

Hugh’s gloomy prediction: “Europeans will end up conceding to Russia whatever they can’t convince the Russians they’re willing to fight a nuclear war to deny them.”

The Pacific tells the same story. Despite Obama’s “pivot to Asia” and Biden’s tough talk about “winning the competition for the 21st century,” actual US military capabilities there have barely budged while China’s have soared, along with its economy — which is now bigger than the US’s, as measured in purchasing power. Containing China and defending Taiwan would require America to spend 8% of GDP on defence (versus 3.5% today) — and convince Beijing it’s willing to accept Los Angeles being vaporised. Unlike during the Cold War, no president — Trump or otherwise — can make that case to voters.

So what’s next? Hugh’s prognoses are stark:

  • Taiwan is in an impossible situation and we’re doing them a disservice pretending otherwise.
  • South Korea, Japan, and either the EU or Poland will have to go nuclear to defend themselves.
  • Trump might actually follow through and annex Panama and Greenland — but probably not Canada.
  • Australia can defend itself from China but needs an entirely different military to do it.

Our new “multipolar” future, split between American, Chinese, Russian, Indian, and European spheres of influence, is a “darker world” than the golden age of US dominance. But Hugh’s message is blunt: for better or worse, 35 years of American hegemony are over. The challenge now is managing the transition peacefully, and creating a stable multipolar order more like Europe’s relatively peaceful 19th century than the chaotic bloodbath Europe suffered in the 17th — which, if replicated today, would be a nuclear bloodbath.

In today’s conversation, Hugh and Rob explore why even AI supremacy might not restore US dominance (spoiler: China still has nukes), why Japan can defend itself but Taiwan can’t, and why a new president won’t be able to reverse the big picture.

This episode was originally recorded on May 30, 2025.

Video editing: Simon Monsour
Audio engineering: Ben Cordell, Milo McGuire, Simon Monsour, and Dominic Armstrong
Music: Ben Cordell
Transcriptions and web: Katy Moore

Continue reading →

#217 – Beth Barnes on the most important graph in AI right now — and the 7-month rule that governs its progress

AI models today have a 50% chance of successfully completing a task that would take an expert human one hour. Seven months ago, that number was roughly 30 minutes — and seven months before that, 15 minutes.

These are substantial, multi-step tasks requiring sustained focus: building web applications, conducting machine learning research, or solving complex programming challenges.

Today’s guest, Beth Barnes, is CEO of METR (Model Evaluation & Threat Research) — the leading organisation measuring these capabilities.

Beth’s team has been timing how long it takes skilled humans to complete projects of varying length, then seeing how AI models perform on the same work.

The resulting paper from METR, “Measuring AI ability to complete long tasks,” made waves by revealing that the planning horizon of AI models was doubling roughly every seven months. It’s regarded by many as the most useful AI forecasting work in years.
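
To make that seven-month rule concrete, here is a minimal back-of-the-envelope sketch in Python. The one-hour current horizon and the seven-month doubling time come from the description above; the target task lengths, and the assumption that the trend simply continues, are illustrative only and not from METR’s paper.

```python
from math import log2

# Back-of-the-envelope extrapolation of the trend as summarised above
# (not code from METR's paper): assume models currently complete ~1-hour
# expert tasks at 50% reliability, and that this horizon doubles every
# ~7 months.
CURRENT_HORIZON_HOURS = 1.0
DOUBLING_TIME_MONTHS = 7.0

def months_until(target_hours: float) -> float:
    """Months until the 50%-success task horizon reaches target_hours,
    if the doubling trend simply continues (a big 'if')."""
    return DOUBLING_TIME_MONTHS * log2(target_hours / CURRENT_HORIZON_HOURS)

for label, hours in [("one work day (~8h)", 8),
                     ("one work week (~40h)", 40),
                     ("one work month (~160h)", 160)]:
    print(f"{label}: roughly {months_until(hours):.0f} months away")
```

On those assumptions, week-long tasks land roughly three years out and month-long tasks a bit over four; they would arrive sooner if progress accelerates, or later if the trend breaks down.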

The companies building these systems aren’t just aware of this trend — they want to harness it as much as possible, and are aggressively pursuing automation of their own research.

That’s both an exciting and troubling development, because it could radically speed up advances in AI capabilities, accomplishing what would have taken years or decades in just months. That itself could be highly destabilising, as we explored in a previous episode: Will MacAskill on AI causing a “century in a decade” — and how we’re completely unprepared.

And having AI models rapidly build their successors with limited human oversight naturally raises the risk that things will go off the rails if the models at the end of the process lack the goals and constraints we hoped for.

Beth thinks models can already do “meaningful work” on improving themselves, and she wouldn’t be surprised if AI models were able to autonomously self-improve in as little as two years from now — in fact, she says, “It seems hard to rule out even shorter [timelines]. Is there 1% chance of this happening in six, nine months? Yeah, that seems pretty plausible.”

While Silicon Valley is abuzz with these numbers, policymakers remain largely unaware of what’s barrelling toward us — and given the current lack of regulation of AI companies, they’re not even able to access the critical information that would help them decide whether to intervene. Beth adds:

The sense I really want to dispel is, “But the experts must be on top of this. The experts would be telling us if it really was time to freak out.” The experts are not on top of this. Inasmuch as there are experts, they are saying that this is concerning. … And to the extent that I am an expert, I am an expert telling you you should freak out. And there’s not especially anyone else who isn’t saying this.

Beth and Rob discuss all that, plus:

  • How Beth now thinks that open-weight models are a good thing for AI safety, and what changed her mind
  • How our poor information security means there’s no such thing as a “closed-weight” model anyway
  • Whether we can see if an AI is scheming in its chain-of-thought reasoning, and the latest research on “alignment faking”
  • Why just before deployment is the worst time to evaluate model safety
  • Why Beth thinks AIs could end up being really good at creative and novel research — something humans tend to think is beyond their reach
  • Why Beth thinks safety-focused people should stay out of the frontier AI companies — and the advantages smaller organisations have
  • Areas of AI safety research that Beth thinks are overrated and underrated
  • Whether it’s feasible to have a science that translates AI models’ increasing use of nonhuman language or ‘neuralese’
  • How AI is both similar to and different from nuclear arms racing and bioweapons
  • And much more besides!

This episode was originally recorded on February 17, 2025.

Video editing: Luke Monsour and Simon Monsour
Audio engineering: Ben Cordell, Milo McGuire, Simon Monsour, and Dominic Armstrong
Music: Ben Cordell
Transcriptions and web: Katy Moore

Continue reading →

Beyond human minds: The bewildering frontier of consciousness in insects, AI, and more

What if there’s something it’s like to be a shrimp — or a chatbot?

For centuries, humans have debated the nature of consciousness, often placing ourselves at the very top. But what about the minds of others — both the animals we share this planet with and the artificial intelligences we’re creating?

We’ve pulled together clips from past conversations with researchers and philosophers who’ve spent years trying to make sense of animal consciousness, artificial sentience, and moral consideration under deep uncertainty.

You’ll hear from:

  • Robert Long on how we might accidentally create artificial sentience (from episode #146)
  • Jeff Sebo on when we should extend moral consideration to digital beings — and what that would even look like (#173)
  • Jonathan Birch on what we should learn from the cautionary tale of newborn pain, and other “edge cases” of sentience (#196)
  • Andrés Jiménez Zorrilla on what it’s like to be a shrimp (80k After Hours)
  • Meghan Barrett on challenging our assumptions about insects’ experiences (#198)
  • David Chalmers on why artificial consciousness is entirely possible (#67)
  • Holden Karnofsky on how we’ll see digital people as… people (#109)
  • Sébastien Moro on the surprising sophistication of fish cognition and behaviour (#205)
  • Bob Fischer on how to compare the moral weight of a chicken to that of a human (#182)
  • Cameron Meyer Shorb on the vast scale of potential wild animal suffering (#210)
  • Lewis Bollard on how animal advocacy has evolved in response to sentience research (#185)
  • Anil Seth on the neuroscientific theories of consciousness (#206)
  • Peter Godfrey-Smith on whether we could upload ourselves to machines (#203)
  • Buck Shlegeris on whether AI control strategies make humans the bad guys (#214)
  • Stuart Russell on the moral rights of AI systems (#80)
  • Will MacAskill on how to integrate digital beings into society (#213)
  • Carl Shulman on collaboratively sharing the world with digital minds (#191)

Audio engineering: Ben Cordell, Milo McGuire, Simon Monsour, and Dominic Armstrong
Additional content editing: Katy Moore and Milo McGuire
Transcriptions and web: Katy Moore

Continue reading →

Don’t believe OpenAI’s “nonprofit” spin (emergency pod with Tyler Whitmer)

OpenAI’s recent announcement that its nonprofit would “retain control” of its for-profit business sounds reassuring. But this seemingly major concession, celebrated by so many, is in itself largely meaningless.

Litigator Tyler Whitmer is a coauthor of a newly published letter that describes this attempted sleight of hand and directs regulators on how to stop it.

As Tyler explains, the plan both before and after this announcement has been to convert OpenAI into a Delaware public benefit corporation (PBC) — and this alone will dramatically weaken the nonprofit’s ability to direct the business in pursuit of its charitable purpose: ensuring AGI is safe and “benefits all of humanity.”

Right now, the nonprofit directly controls the business. But were OpenAI to become a PBC, the nonprofit, rather than having its “hand on the lever,” would merely contribute to the decision of who does.

Why does this matter? Today, if OpenAI’s commercial arm were about to release an unhinged AI model that might make money but be bad for humanity, the nonprofit could directly intervene to stop it. In the proposed new structure, it likely couldn’t do much at all.

But it’s even worse than that: even if the nonprofit could select the PBC’s directors, those directors would have fundamentally different legal obligations from those of the nonprofit. A PBC director must balance public benefit with the interests of profit-driven shareholders — by default, they cannot legally prioritise public interest over profits, even if they and the controlling shareholder that appointed them want to do so.

As Tyler points out, there isn’t a single reported case of a shareholder successfully suing to enforce a PBC’s public benefit mission in the 10+ years since the Delaware PBC statute was enacted.

This extra step from the nonprofit to the PBC would also mean that the attorneys general of California and Delaware — who today are empowered to ensure the nonprofit pursues its mission — would find themselves powerless to act. These are probably not side effects but rather a Trojan horse that for-profit investors are trying to slip past regulators.

Fortunately this can all be addressed — but it requires either the nonprofit board or the attorneys general of California and Delaware to promptly put their foot down and insist on watertight legal agreements that preserve OpenAI’s current governance safeguards and enforcement mechanisms.

As Tyler explains, the same arrangements that currently bind the OpenAI business have to be written into a new PBC’s certificate of incorporation — something that won’t happen by default and that powerful investors have every incentive to resist.

Without these protections, OpenAI’s suggested new structure wouldn’t “fix” anything. It would be a ruse that preserved the appearance of nonprofit control while gutting its substance.

Listen to our conversation with Tyler Whitmer to understand what’s at stake, and what the AGs and board members must do to ensure OpenAI remains committed to developing artificial general intelligence that benefits humanity rather than just investors.

This episode was originally recorded on May 13, 2025.

Video editing: Simon Monsour and Luke Monsour
Audio engineering: Ben Cordell, Milo McGuire, Simon Monsour, and Dominic Armstrong
Music: Ben Cordell
Transcriptions and web: Katy Moore

Continue reading →

Emergency pod: Did OpenAI give up, or is this just a new trap? (with Rose Chan Loui)

When attorneys general intervene in corporate affairs, it usually means something has gone seriously wrong. In OpenAI’s case, it appears to have forced a dramatic reversal of the company’s plans to sideline its nonprofit foundation, announced in a blog post that made headlines worldwide.

The company’s sudden announcement that its nonprofit will “retain control” credits “constructive dialogue” with the attorneys general of California and Delaware — corporate-speak for what was likely a far more consequential confrontation behind closed doors. A confrontation perhaps driven by public pressure from Nobel Prize winners, past OpenAI staff, and community organisations.

But whether this change will help depends entirely on the details of implementation — details that remain worryingly vague in the company’s announcement.

Return guest Rose Chan Loui, nonprofit law expert at UCLA, sees potential in OpenAI’s new proposal, but emphasises that “control” must be carefully defined and enforced: “The words are great, but what’s going to back that up?” Without explicitly defining the nonprofit’s authority over safety decisions, the shift could be largely cosmetic.

Why have state officials taken such an interest so far? Host Rob Wiblin notes, “OpenAI was proposing that the AGs would no longer have any say over what this super momentous company might end up doing. … It was just crazy how they were suggesting that they would take all of the existing money and then pursue a completely different purpose.”

Now that they’re in the picture, the AGs have leverage to ensure the nonprofit maintains genuine control over issues of public safety as OpenAI develops increasingly powerful AI.

Rob and Rose explain three key areas where the AGs can make a huge difference to whether this plays out in the public’s best interest:

  1. Ensuring that the contractual agreements giving the nonprofit control over the new Delaware public benefit corporation are watertight, and don’t accidentally shut the AGs out of the picture.
  2. Insisting that a majority of board members are truly independent by prohibiting indirect as well as direct financial stakes in the business.
  3. Insisting that the board is empowered with the money, independent staffing, and access to information which they need to do their jobs.

This episode was originally recorded on May 6, 2025.

Video editing: Simon Monsour and Luke Monsour
Audio engineering: Ben Cordell, Milo McGuire, Simon Monsour, and Dominic Armstrong
Music: Ben Cordell
Transcriptions and web: Katy Moore

Continue reading →

#216 – Ian Dunt on why governments in Britain and elsewhere can’t get anything done – and how to fix it

When you have a system where ministers almost never understand their portfolios, civil servants change jobs every few months, and MPs don’t grasp parliamentary procedure even after decades in office — is the problem the people, or the structure they work in?

Today’s guest, political journalist Ian Dunt, studies the systemic reasons governments succeed and fail.

And in his book How Westminster Works …and Why It Doesn’t, he argues that Britain’s government dysfunction and multi-decade failure to solve its key problems stems primarily from bad incentives and bad processes. Even brilliant, well-intentioned people are set up to fail by a long list of institutional absurdities.

For instance:

  1. Ministerial appointments in complex areas like health or defence typically go to whoever can best shore up the prime minister’s support within their own party and prevent a leadership challenge, rather than people who have any experience at all with the area.
  2. On average, ministers are removed after just two years, so the few who manage to learn their brief are typically gone just as they’re becoming effective. In the middle of a housing crisis, Britain went through 25 housing ministers in 25 years.
  3. Ministers are expected to make some of their most difficult decisions by reading paper memos out of a ‘red box’ while exhausted, at home, after dinner.
  4. Tradition demands that the country be run from a cramped Georgian townhouse: 10 Downing Street. Few staff fit and teams are split across multiple floors. Meanwhile, the country’s most powerful leaders vie to control the flow of information to and from the prime minister via ‘professionalised loitering’ outside their office.
  5. Civil servants are paid too little to retain those with technical skills, who can earn several times as much in the private sector. For those who do want to stay, the only way to get promoted is to move departments — abandoning any area-specific knowledge they’ve accumulated.
  6. As a result, senior civil servants handling complex policy areas have a median time in role as low as 11 months. Turnover in the Treasury has regularly been 25% annually — comparable to a McDonald’s restaurant.
  7. MPs are chosen by local party members overwhelmingly on the basis of being ‘loyal party people,’ while the question of whether they are good at understanding or scrutinising legislation (their supposed constitutional role) simply never comes up.

The end result is that very few of the most powerful people in British politics have much idea what they’re actually doing. As Ian puts it, the country is at best run by a cadre of “amateur generalists.”

While some of these are unique British failings, many others are recurring features of governments around the world, and similar dynamics can arise in large corporations as well.

But as Ian also lays out, most of these absurdities have natural solutions, and in every case some countries have found structural solutions that help ensure decisions are made by the right people, with the information they need, and that success is rewarded.

This episode was originally recorded on January 30, 2025.

Video editing: Simon Monsour
Audio engineering: Ben Cordell, Milo McGuire, Simon Monsour, and Dominic Armstrong
Music: Ben Cordell
Camera operator: Jeremy Chevillotte
Transcriptions and web: Katy Moore

Continue reading →

Serendipity, weird bets, & cold emails that actually work: Career advice from 16 former guests

How do you navigate a career path when the future of work is uncertain? How important is mentorship versus immediate impact? Is it better to focus on your strengths or on the world’s most pressing problems? Should you specialise deeply or develop a unique combination of skills?

From embracing failure to finding unlikely allies, we bring you 16 diverse perspectives from past guests who’ve found unconventional paths to impact and helped others do the same.

You’ll hear from:

  • Michael Webb on using AI as a career advisor and the human skills AI can’t replace (from episode #161)
  • Holden Karnofsky on kicking ass in whatever you do, and which weird ideas are worth betting on (#109, #110, and #158)
  • Chris Olah on how intersections of particular skills can be a wildly valuable niche (#108)
  • Michelle Hutchinson on understanding what truly motivates you (#75)
  • Benjamin Todd on how to make tough career decisions and deal with rejection (#71 and 80k After Hours)
  • Jeff Sebo on what improv comedy teaches us about doing good in the world (#173)
  • Spencer Greenberg on recognising toxic people who could derail your career (#183)
  • Dean Spears on embracing randomness and serendipity (#186)
  • Karen Levy on finding yourself through travel (#124)
  • Leah Garcés on finding common ground with unlikely allies (#99)
  • Hannah Ritchie on being selective about whose advice you follow (#160)
  • Alex Lawsen on getting good mentorship (80k After Hours)
  • Pardis Sabeti on prioritising physical health (#104)
  • Sarah Eustis-Guthrie on knowing when to pivot from your current path (#207)
  • Danny Hernandez on setting triggers for career decisions (#78)
  • Varsha Venugopal on embracing uncomfortable situations (#113)

Audio engineering: Ben Cordell, Milo McGuire, Simon Monsour, and Dominic Armstrong
Content editing: Katy Moore and Milo McGuire
Transcriptions and web: Katy Moore

Continue reading →

#215 – Tom Davidson on how AI-enabled coups could allow a tiny group to seize power

Throughout history, technological revolutions have fundamentally shifted the balance of power in society. The Industrial Revolution created conditions where democracies could dominate for the first time — as nations needed educated, informed, and empowered citizens to deploy advanced technologies and remain competitive.

Unfortunately there’s every reason to think artificial general intelligence (AGI) will reverse that trend.

In a new paper published today, Tom Davidson — senior research fellow at the Forethought Centre for AI Strategy — argues that advanced AI systems will enable unprecedented power grabs by tiny groups of people, primarily by removing the need for other human beings to participate.

Come work with us on the 80,000 Hours podcast team! We’re accepting expressions of interest for the new host and chief of staff until May 6 in order to deliver as much incredibly insightful AGI-related content as we can. Learn more about our shift in strategic direction and apply soon!

When a country’s leaders no longer need citizens for economic production, or to serve in the military, there’s much less need to share power with them. “Over the broad span of history, democracy is more the exception than the rule,” Tom points out. “With AI, it will no longer be important to a country’s competitiveness to have an empowered and healthy citizenship.”

Citizens in established democracies are not typically that concerned about coups. We doubt anyone will try, and if they do, we expect human soldiers to refuse to join in. Unfortunately, the AI-controlled military systems of the future will lack those inhibitions. As Tom lays out, “Human armies today are very reluctant to fire on their civilians. If we get instruction-following AIs, then those military systems will just fire.”

Why would AI systems follow the instructions of a would-be tyrant? One answer is that, as militaries worldwide race to incorporate AI to remain competitive, they risk leaving the door open for exploitation by malicious actors in a few ways:

  1. AI systems could be programmed to simply follow orders from the top of the chain of command, without any checks on that power — potentially handing total power indefinitely to any leader willing to abuse that authority.
  2. Systems could contain “secret loyalties” inserted during development that activate at critical moments, as demonstrated in Anthropic’s recent paper on “sleeper agents”.
  3. Superior cyber capabilities could enable small groups to hack into and take full control of AI-operated military infrastructure.

It’s also possible that the companies with the most advanced AI, if it conferred a significant enough advantage over competitors, could quickly develop armed forces sufficient to overthrow an incumbent regime. History suggests that as few as 10,000 obedient military drones could be sufficient to kill competitors, take control of key centres of power, and make your success a fait accompli.

Without active effort spent mitigating risks like these, it’s reasonable to fear that AI systems will destabilise the current equilibrium that enables the broad distribution of power we see in democratic nations.

In this episode, host Rob Wiblin and Tom discuss new research on the question of whether AI-enabled coups are likely, and what we can do about it if they are, as well as:

  • Whether preventing coups and preventing ‘rogue AI’ require opposite interventions, leaving us in a bind
  • Whether open sourcing AI weights could be helpful, rather than harmful, for advancing AI safely
  • Why risks of AI-enabled coups have been relatively neglected in AI safety discussions
  • How persuasive AGI will really be
  • How many years we have before these risks become acute
  • The minimum number of military robots needed to stage a coup

This episode was originally recorded on January 20, 2025.

Video editing: Simon Monsour
Audio engineering: Ben Cordell, Milo McGuire, Simon Monsour, and Dominic Armstrong
Camera operator: Jeremy Chevillotte
Transcriptions and web: Katy Moore

Continue reading →

Guilt, imposter syndrome & doing good: 16 past guests share their mental health journeys

What happens when your desire to do good starts to undermine your own wellbeing?

Over the years, we’ve heard from therapists, charity directors, researchers, psychologists, and career advisors — all wrestling with how to do good without falling apart. Today’s episode brings together insights from 16 past guests on the emotional and psychological costs of pursuing a high-impact career to improve the world — and how to best navigate the all-too-common guilt, burnout, perfectionism, and imposter syndrome along the way.

You’ll hear from:

  • 80,000 Hours’ former CEO on managing anxiety, self-doubt, and a chronic sense of falling short (from episode #100)
  • Randy Nesse on why we evolved to be anxious and depressed (episode #179)
  • Hannah Boettcher on how ‘optimisation framing’ can quietly distort our sense of self-worth (from our 80k After Hours feed)
  • Luisa Rodriguez on grieving the gap between who you are and who you wish you were (from our 80k After Hours feed)
  • Cameron Meyer Shorb on how guilt and shame became his biggest source of suffering — and what helped (episode #210)
  • Tim LeBon on the trap of moral perfectionism, and why we should strive for excellence instead (episode #149)
  • Cal Newport on why we need to make time to be alone with our thoughts (episode #106)
  • Michelle Hutchinson and Habiba Islam on when to prioritise wellbeing over impact (episode #122)
  • Sarah Eustis-Guthrie on the emotional weight of founding a charity (episode #207)
  • Hannah Ritchie on feeling like an imposter, even after writing a book and giving a TED Talk (episode #160)
  • Will MacAskill on why he’s five to 10 times happier than he used to be after learning to work in a way that’s genuinely sustainable (episode #130)
  • Ajeya Cotra on handling the pressure of high-stakes research (episode #90)
  • Christian Ruhl on pursuing a high-impact career while managing a stutter (from our 80k After Hours feed)
  • Leah Garcés on insisting on self-care when witnessing trauma regularly (episode #99)
  • Kelsey Piper on recognising that you’re not alone in your struggles (episode #53)

And if you’re dealing with your own mental health concerns, here are some resources that might help:

Audio engineering: Ben Cordell, Milo McGuire, Simon Monsour, and Dominic Armstrong
Content editing: Katy Moore and Milo McGuire
Transcriptions and web: Katy Moore

Continue reading →

#214 – Buck Shlegeris on controlling AI that wants to take over – so we can use it anyway

Most AI safety conversations centre on alignment: ensuring AI systems share our values and goals. But despite progress, we’re unlikely to know whether we’ve solved the problem before human-level and superhuman systems arrive, potentially in as little as three years.

So some are developing a backup plan to safely deploy models we fear are actively scheming to harm us — so-called “AI control.” While this may sound mad, given the reluctance of AI companies to delay deploying anything they train, not developing such techniques is probably even crazier.

Today’s guest — Buck Shlegeris, CEO of Redwood Research — has spent the last few years developing control mechanisms, and for human-level systems they’re more plausible than you might think. He argues that given companies’ unwillingness to incur large costs for security, accepting the possibility of misalignment and designing robust safeguards might be one of our best remaining options.

Buck asks us to picture a scenario where, in the relatively near future, AI companies are employing 100,000 AI systems running 16 times faster than humans to automate AI research itself. These systems would need dangerous permissions: the ability to run experiments, access model weights, and push code changes. In a world like that, misalignment among these AIs could be very dangerous: a scheming system could attempt to hack the data centre, exfiltrate weights, or sabotage the research.

But in the absence of a method for reliably aligning frontier AIs, Buck argues for implementing practical safeguards to prevent catastrophic outcomes. His team has been developing and testing a range of straightforward, cheap techniques to detect and prevent risky behaviour by AIs — such as auditing AI actions with dumber but trusted models, replacing suspicious actions, and asking questions repeatedly to catch randomised attempts at deception.
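
To make the “trusted monitoring” idea concrete, here’s a minimal, self-contained sketch of the pattern in Python. It’s only an illustration of the general approach described above, not Redwood Research’s actual code: every function name and threshold below is a hypothetical stand-in, with dummy stubs in place of real model calls.

```python
"""Sketch of trusted monitoring with action replacement: a weaker but
trusted model scores how suspicious each action proposed by a powerful
untrusted model looks; risky actions are replaced and flagged for audit."""

import random

SUSPICION_THRESHOLD = 0.8  # above this score, don't execute the proposed action


def untrusted_model(task: str) -> str:
    # Stand-in for a powerful, possibly scheming model proposing an action.
    return f"proposed code change for: {task}"


def trusted_model_score(task: str, action: str) -> float:
    # Stand-in for a weaker trusted model rating suspiciousness in [0, 1].
    return random.random()


def trusted_model_fallback(task: str) -> str:
    # Stand-in for a safe (if less capable) action from the trusted model.
    return f"conservative fallback for: {task}"


def monitored_step(task: str) -> str:
    """Run one task with trusted monitoring and action replacement."""
    action = untrusted_model(task)
    suspicion = trusted_model_score(task, action)
    if suspicion >= SUSPICION_THRESHOLD:
        # Replace the risky action and flag the incident for a human auditor.
        print(f"Flagged for audit (suspicion={suspicion:.2f}): {action}")
        return trusted_model_fallback(task)
    return action


if __name__ == "__main__":
    print(monitored_step("tune the learning-rate schedule"))
```

The appeal of this kind of setup is exactly what the paragraph above notes: it needs no new fundamental breakthroughs, just a cheaper trusted model, a threshold, and an audit process layered on top of existing deployment pipelines.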

Most importantly, these methods are designed to be cheap and shovel-ready. AI control focuses on harm reduction using practical techniques — techniques that don’t require new, fundamental breakthroughs before companies could reasonably implement them, and that don’t ask us to forgo the benefits of deploying AI.

As Buck puts it:

Five years ago I thought of misalignment risk from AIs as a really hard problem that you’d need some really galaxy-brained fundamental insights to resolve. Whereas now, to me the situation feels a lot more like we just really know a list of 40 things where, if you did them — none of which seem that hard — you’d probably be able to not have very much of your problem.

Of course, even if Buck is right, we still need to do those 40 things — which he points out we’re not on track for. And AI control agendas have their limitations: they aren’t likely to work once AI systems are much more capable than humans, since greatly superhuman AIs can probably work around whatever limitations we impose.

Still, AI control agendas seem to be gaining traction within AI safety. Buck and host Rob Wiblin discuss all of the above, plus:

  • Why he’s more worried about AI hacking its own data centre than escaping
  • What to do about “chronic harm,” where AI systems subtly underperform or sabotage important work like alignment research
  • Why he might want to use a model he thought could be conspiring against him
  • Why he would feel safer if he caught an AI attempting to escape
  • Why many control techniques would be relatively inexpensive
  • How to use an untrusted model to monitor another untrusted model
  • What the minimum viable intervention in a “lazy” AI company might look like
  • How even small teams of safety-focused staff within AI labs could matter
  • The moral considerations around controlling potentially conscious AI systems, and whether it’s justified

This episode was originally recorded on February 21, 2025.

Video: Simon Monsour and Luke Monsour
Audio engineering: Ben Cordell, Milo McGuire, and Dominic Armstrong
Transcriptions and web: Katy Moore

Continue reading →