#222 – Neel Nanda on the race to read AI minds
We don’t know how AIs think or why they do what they do. Or at least, we don’t know much. That fact is only becoming more troubling as AIs grow more capable and appear on track to wield enormous cultural influence, directly advise on major government decisions, and even operate military equipment autonomously. We simply can’t tell what models, if any, should be trusted with such authority.
Neel Nanda of Google DeepMind is one of the founding figures of the field of machine learning trying to fix this situation — mechanistic interpretability (or “mech interp”). The project has generated enormous hype, exploding from a handful of researchers five years ago to hundreds today — all working to make sense of the jumble of tens of thousands of numbers that frontier AIs use to process information and decide what to say or do.
Neel now has a warning for us: the most ambitious vision of mech interp he once dreamed of is probably dead. He doesn’t see a path to deeply and reliably understanding what AIs are thinking. The technical and practical barriers are simply too great to get us there in time, before competitive pressures push us to deploy human-level or superhuman AIs. Indeed, Neel argues no one approach will guarantee alignment, and our only choice is the “Swiss cheese” model of accident protection, layering multiple safeguards on top of one another.
But while mech interp won’t be a silver bullet for AI safety, it has nevertheless had some major successes and will be one of the best tools in our arsenal.
For instance: by inspecting the neural activations in the middle of an AI’s thoughts, we can pick up many of the concepts the model is thinking about — from the Golden Gate Bridge, to refusing to answer a question, to the option of deceiving the user. While we can’t know all the thoughts a model is having all the time, picking up 90% of the concepts it is using 90% of the time should help us muddle through — so long as mech interp is paired with other techniques to fill in the gaps.
In today’s episode, Neel takes us on a tour of everything you’ll want to know about this race to understand what AIs are really thinking. He and host Rob Wiblin cover:
- The best tools we’ve come up with so far, and where mech interp has failed
- Why the best techniques have to be fast and cheap
- The fundamental reasons we can’t reliably know what AIs are thinking, despite having perfect access to their internals
- What we can and can’t learn by reading models’ ‘chains of thought’
- Whether models will be able to trick us when they realise they’re being tested
- The best protections to add on top of mech interp
- Why he thinks the hottest technique in the field (sparse autoencoders) is overrated
- His new research philosophy
- How to break into mech interp and get a job — including applying to be a MATS scholar with Neel as your mentor (applications close September 12!)
This episode was recorded on July 17 and 21, 2025.
Video editing: Simon Monsour, Luke Monsour, Dominic Armstrong, and Milo McGuire
Audio engineering: Ben Cordell, Milo McGuire, Simon Monsour, and Dominic Armstrong
Music: Ben Cordell
Camera operator: Jeremy Chevillotte
Coordination, transcriptions, and web: Katy Moore
The interview in a nutshell
Neel Nanda, who runs the mechanistic interpretability team at Google DeepMind, has shifted from hoping mech interp would fully reverse-engineer AI models to seeing it as one useful tool among many for AI safety:
1. Mech interp won’t solve alignment alone — but remains crucial
Neel’s perspective has evolved from “low chance of incredibly big deal” to “high chance of medium big deal”:
- We won’t achieve full understanding: Models are too complex and messy to give robust guarantees like “this model isn’t deceptive”
- But partial understanding is valuable: Even 90% understanding helps with evaluation, monitoring, and incident analysis
- Nothing else will provide guarantees either: This isn’t a unique limitation of mech interp — no approach will definitively prove safety
Mech interp can help throughout the AGI development pipeline:
- Testing: Determining if models have hidden goals or deceptive tendencies
- Monitoring: Using cheap techniques like probes to detect harmful thoughts in production
- Incident analysis: Understanding why models exhibit concerning behaviours
Key successes demonstrating real-world value
- Auditing hidden goals: Sam Marks at Anthropic ran competitions where teams had to find secret objectives in models — teams using mech interp’s most popular technique (sparse autoencoders) won
- Extracting superhuman knowledge: Lisa Schut and Been Kim taught chess grandmasters (including former world champions) new strategic concepts from AlphaZero that humans had never discovered
- Detecting harmful prompts: Probes achieved 99.9% accuracy identifying harmful requests even when jailbreaks bypassed normal refusals
2. Simple techniques often outperform complex ones
Recent experiments have revealed the surprising effectiveness of basic approaches:
Probes beat fancy techniques:
- In Neel’s team’s experiments, linear probes (simple correlations) detected harmful prompts better than sophisticated methods
- They work by checking whether the model’s activations lie along a ‘direction’ correlated with a concept like “this prompt is harmful” (see the sketch after this list)
- They’re cheap enough to run on every production inference (unlike LLM monitors, which might cost 100x more)
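To make the idea concrete, here’s a minimal sketch of a linear probe in Python. It assumes you can extract one activation vector per prompt from a chosen layer of the model; the `get_activations` helper, the dimensions, and the placeholder data are hypothetical stand-ins rather than the team’s actual pipeline.

```python
# Minimal linear probe sketch: logistic regression on model activations.
# Everything below is illustrative -- get_activations() is a random-vector
# stand-in for hooking a real model and caching a layer's residual stream.
import numpy as np
from sklearn.linear_model import LogisticRegression

D_MODEL = 2048                      # hypothetical residual-stream width
rng = np.random.default_rng(0)

def get_activations(prompt: str) -> np.ndarray:
    # Placeholder: in practice, run the model with a hook and return the
    # activation vector at a chosen layer and token position.
    return rng.normal(size=D_MODEL)

# Placeholder labelled data: 1 = harmful prompt, 0 = benign prompt.
harmful_prompts = [f"harmful example {i}" for i in range(100)]
benign_prompts = [f"benign example {i}" for i in range(100)]

X = np.stack([get_activations(p) for p in harmful_prompts + benign_prompts])
y = np.array([1] * len(harmful_prompts) + [0] * len(benign_prompts))

# The probe learns a single direction w such that w . activation predicts
# the concept of interest ("this prompt is harmful").
probe = LogisticRegression(max_iter=1000).fit(X, y)

# Scoring a new prompt is one dot product plus a sigmoid -- cheap enough to
# run on every production request, unlike a separate LLM monitor.
score = probe.predict_proba(get_activations("some new prompt").reshape(1, -1))[0, 1]
```

The appeal is the cost profile: once trained, applying the probe to a request adds only a dot product on top of the forward pass the model is already doing.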
Chain of thought reveals what AIs are thinking:
- Self-preservation was just confusion: When Palisade Research showed models resisting shutdown to complete math problems, Sen Rajamanoharan discovered by reading chain of thought that models thought “if I get shut down, I can’t finish these 10 problems” — not self-preservation for its own sake
- Simple prompt changes eliminated the behaviour: Telling models “another AI will finish the task” or “letting yourself be shut down is more important” reduced shutdown resistance to zero
- Even “blackmail” wasn’t real scheming: Models recommended self-preservation when told they were advising other AIs, showing they were roleplaying rather than genuinely trying to survive themselves
But chain of thought has limitations:
- Models can give “unfaithful” reasoning and may not even know why they said what they said — e.g. models answering “yes” to both “Is magnesium’s atomic number bigger than aluminium’s?” and the reverse
- OpenAI’s o3 already uses abbreviated shorthand like “poly not work, try different” — which could obscure thoughts from human overseers in the future
- Chain of thought could break if we switch to ‘continuous chain of thought’ architectures that don’t require models to output their reasoning as readable text
- Apollo Research showed Claude can tell when it’s being evaluated, sometimes even guessing the exact purpose of tests, which may result in models being less forthcoming
3. Sparse autoencoders are useful but overhyped
SAEs were the field’s biggest bet, garnering a lot of attention, but they haven’t lived up to initial expectations. (A minimal sketch of how an SAE works appears at the end of this section.)
Where SAEs excel:
- Unknown concept discovery: Anthropic found unexpected Base64 encoding concepts in tiny one-layer models
- Hidden goal detection: Teams using SAEs consistently beat competitors in Anthropic’s hidden goal auditing games
- Fun demos: Golden Gate Claude, which couldn’t stop talking about the Golden Gate Bridge
Where SAEs were disappointing:
- Finding known concepts: When looking for harmfulness, simple probes outperform SAEs
- Feature absorption problems: SAEs create nonsensical concepts like “starts with E but isn’t the word elephant” to maximise sparsity — an issue that can be solved with effort
- Higher computational costs than alternatives: Neel’s team used 20 petabytes of storage and GPT-3-level compute just for Gemma 2 SAEs
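For readers unfamiliar with the technique, here’s a minimal sparse autoencoder sketch in PyTorch: an SAE learns an overcomplete set of features that reconstruct a model’s activations while only a few features fire on any given input. The dimensions, hyperparameters, and random ‘activations’ below are placeholders, not the setup used for the Gemma 2 SAEs.

```python
# Minimal sparse autoencoder (SAE) sketch: reconstruct activations through an
# overcomplete bottleneck with an L1 sparsity penalty on the feature vector.
import torch
import torch.nn as nn

D_MODEL, D_SAE = 512, 4096          # activation width, number of SAE features
L1_COEFF = 1e-3                     # strength of the sparsity penalty

class SparseAutoencoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Linear(D_MODEL, D_SAE)
        self.decoder = nn.Linear(D_SAE, D_MODEL)

    def forward(self, acts):
        feats = torch.relu(self.encoder(acts))   # sparse feature activations
        recon = self.decoder(feats)              # reconstructed activations
        return recon, feats

sae = SparseAutoencoder()
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)

# Placeholder batch; in practice these are activations cached from running
# the model over a large text corpus.
acts = torch.randn(1024, D_MODEL)

for step in range(100):
    recon, feats = sae(acts)
    # Reconstruction loss keeps features faithful to the activations; the L1
    # term pushes most features to zero on each input, which is what makes
    # the learned dictionary (hopefully) human-interpretable.
    loss = ((recon - acts) ** 2).mean() + L1_COEFF * feats.abs().sum(-1).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```

After training, individual features can be inspected (for example, by looking at which inputs most strongly activate them) to check whether they correspond to human-interpretable concepts like Base64 text or the Golden Gate Bridge.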
4. Career and field-building insights
Neel advocates for the following research philosophy:
- Start simple: Reading chain of thought solved the self-preservation mystery — no fancy tools needed
- Beware false confidence: The ROME paper on editing facts (making models think the Eiffel Tower is in Rome) actually just added louder signals rather than truly editing knowledge
- Expect to be wrong: Neel’s own “Toy model of universality” paper needed two follow-ups to correct errors — “I think I believe the third paper, but I’m not entirely sure”
Why mech interp is probably too popular relative to other alignment research:
- “An enormous nerd snipe” — the romance of understanding alien minds attracts researchers
- Better educational resources than newer safety fields (ARENA tutorials, Neel’s guides)
- Lower compute requirements for getting started than most ML research
The field still needs people because:
- AI safety overall is massively underinvested in relative to its importance
- Some people are much better suited to mech interp than other research projects
Practical career advice:
- Don’t read 20 papers before starting — mech interp is learned by doing
- Start with tiny two-week projects; abandoning them is fine if you’re learning
- The MATS Program takes people from zero to conference papers in a few months — and Neel is currently accepting applications for his next cohort (apply by September 12)
- Math Olympiad skills aren’t required — just linear algebra basics and good intuition