#222 – Neel Nanda on the race to read AI minds

We don’t know how AIs think or why they do what they do. Or at least, we don’t know much. That fact is only becoming more troubling as AIs grow more capable and appear on track to wield enormous cultural influence, directly advise on major government decisions, and even operate military equipment autonomously. We simply can’t tell what models, if any, should be trusted with such authority.

Neel Nanda of Google DeepMind is one of the founding figures of mechanistic interpretability (or “mech interp”), the field of machine learning trying to fix this situation. The project has generated enormous hype, exploding from a handful of researchers five years ago to hundreds today — all working to make sense of the jumble of tens of thousands of numbers that frontier AIs use to process information and decide what to say or do.

Neel now has a warning for us: the most ambitious vision of mech interp he once dreamed of is probably dead. He doesn’t see a path to deeply and reliably understanding what AIs are thinking. The technical and practical barriers are simply too great to get us there in time, before competitive pressures push us to deploy human-level or superhuman AIs. Indeed, Neel argues no one approach will guarantee alignment, and our only choice is the “Swiss cheese” model of accident protection, layering multiple safeguards on top of one another.

But while mech interp won’t be a silver bullet for AI safety, it has nevertheless had some major successes and will be one of the best tools in our arsenal.

For instance: by inspecting the neural activations in the middle of an AI’s thoughts, we can pick up many of the concepts the model is thinking about — from the Golden Gate Bridge, to refusing to answer a question, to the option of deceiving the user. While we can’t know all the thoughts a model is having all the time, picking up 90% of the concepts it is using 90% of the time should help us muddle through — so long as mech interp is paired with other techniques to fill in the gaps.

In today’s episode, Neel takes us on a tour of everything you’ll want to know about this race to understand what AIs are really thinking. He and host Rob Wiblin cover:

  • The best tools we’ve come up with so far, and where mech interp has failed
  • Why the best techniques have to be fast and cheap
  • The fundamental reasons we can’t reliably know what AIs are thinking, despite having perfect access to their internals
  • What we can and can’t learn by reading models’ ‘chains of thought’
  • Whether models will be able to trick us when they realise they’re being tested
  • The best protections to add on top of mech interp
  • Why he thinks the hottest technique in the field (SAEs) is overrated
  • His new research philosophy
  • How to break into mech interp and get a job — including applying to be a MATS scholar with Neel as your mentor (applications close September 12!)

This episode was recorded on July 17 and 21, 2025.

Video editing: Simon Monsour, Luke Monsour, Dominic Armstrong, and Milo McGuire
Audio engineering: Ben Cordell, Milo McGuire, Simon Monsour, and Dominic Armstrong
Music: Ben Cordell
Camera operator: Jeremy Chevillotte
Coordination, transcriptions, and web: Katy Moore

The interview in a nutshell

Neel Nanda, who runs the mechanistic interpretability team at Google DeepMind, has shifted from hoping mech interp would fully reverse-engineer AI models to seeing it as one useful tool among many for AI safety:

1. Mech interp won’t solve alignment alone — but remains crucial

Neel’s perspective has evolved from “low chance of incredibly big deal” to “high chance of medium big deal”:

  • We won’t achieve full understanding: Models are too complex and messy to give robust guarantees like “this model isn’t deceptive”
  • But partial understanding is valuable: Even 90% understanding helps with evaluation, monitoring, and incident analysis
  • Nothing else will provide guarantees either: This isn’t a unique limitation of mech interp — no approach will definitively prove safety

Mech interp can help throughout the AGI development pipeline:

  • Testing: Determining if models have hidden goals or deceptive tendencies
  • Monitoring: Using cheap techniques like probes to detect harmful thoughts in production
  • Incident analysis: Understanding why models exhibit concerning behaviours

Key successes demonstrating real-world value

  • Auditing hidden goals: Sam Marks at Anthropic ran competitions where teams had to find secret objectives in models — teams with mech interp’s most popular technique (sparse autoencoders) won
  • Extracting superhuman knowledge: Lisa Schut and Been Kim taught chess grandmasters (including former world champions) new strategic concepts from AlphaZero that humans had never discovered
  • Detecting harmful prompts: Probes achieved 99.9% accuracy identifying harmful requests even when jailbreaks bypassed normal refusals

2. Simple techniques often outperform complex ones

Recent experiments have revealed the surprising effectiveness of basic approaches:

Probes beat fancy techniques:

  • In Neel’s team’s experiments, linear probes (simple correlations) detected harmful prompts better than sophisticated methods
  • They work by noticing, in real time, when the model’s activations lie along a ‘direction’ correlated with a concept like “this prompt is harmful” (a minimal sketch follows below)
  • They’re cheap enough to run on every inference in production (unlike LLM monitors, which might cost 100x as much)
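
To make that concrete, here is a minimal, illustrative sketch of a linear probe: a logistic regression trained on a model’s internal activations to flag a concept such as “this prompt is harmful.” Everything below is a placeholder: synthetic activations stand in for real residual-stream vectors, and this is not the setup Neel’s team actually used.

```python
# Minimal sketch of a linear probe: logistic regression on a model's internal
# activations to predict a concept like "this prompt is harmful".
# Synthetic activations stand in for real residual-stream vectors.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d_model, n_train = 512, 2000          # hypothetical hidden size and dataset size

# Pretend there is a single "harmfulness direction" in activation space.
harm_direction = rng.normal(size=d_model)
harm_direction /= np.linalg.norm(harm_direction)

labels = rng.integers(0, 2, size=n_train)               # 1 = harmful prompt
activations = rng.normal(size=(n_train, d_model))        # baseline activations
activations += np.outer(labels * 2.0, harm_direction)    # shift harmful examples

# The probe itself: a single linear map plus sigmoid, fit on labelled activations.
probe = LogisticRegression(max_iter=1000)
probe.fit(activations, labels)

# Scoring a new activation is one dot product, which is why probes are cheap
# enough to run on every production request.
new_activation = rng.normal(size=(1, d_model)) + 2.0 * harm_direction
print("P(harmful) =", probe.predict_proba(new_activation)[0, 1])
```

A real probe is trained on activations from actual harmful and harmless prompts at a chosen layer, but the cost profile is the point: at inference time it is a single dot product per request.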

Chain of thought reveals what AIs are thinking:

  • Self-preservation was just confusion: When Palisade Research showed models resisting shutdown to complete math problems, Sen Rajamanoharan discovered by reading chain of thought that models thought “if I get shut down, I can’t finish these 10 problems” — not self-preservation for its own sake
  • Simple prompt changes eliminated the behaviour: Telling models “another AI will finish the task” or “letting yourself be shut down is more important” reduced shutdown resistance to zero
  • Even “blackmail” wasn’t real scheming: Models recommended self-preservation when told they were advising other AIs, showing they were roleplaying rather than genuinely trying to survive themselves

But chain of thought has limitations:

  • Models can give “unfaithful” reasoning and may not even know why they said what they said — e.g. models answering “yes” to both “Is magnesium’s atomic number bigger than aluminium’s?” and the reverse
  • OpenAI’s o3 already uses abbreviated shorthand like “poly not work, try different” — which could obscure thoughts from human overseers in the future
  • Chain of thought monitoring could break if we switch to ‘continuous chain of thought’ architectures that don’t require models to output their reasoning as text
  • Apollo Research showed Claude can tell when it’s being evaluated, sometimes even guessing the exact purpose of tests, which may result in models being less forthcoming

3. Sparse autoencoders are useful but overhyped

SAEs were the field’s biggest bet, garnering a lot of attention, but they haven’t lived up to initial expectations. (A minimal code sketch of how an SAE works appears at the end of this section.)

Where SAEs excel:

  • Discovering concepts you didn’t know to look for: In the hidden-goal auditing game described above, the winning teams used SAEs to surface a secret objective nobody knew in advance

Where SAEs were disappointing:

  • Finding known concepts: When looking for harmfulness, simple probes outperform SAEs
  • Feature absorption problems: SAEs create nonsensical concepts like “starts with E but isn’t the word elephant” to maximise sparsity — an issue that can be solved with effort
  • Higher computational costs than alternatives: Neel’s team used 20 petabytes of storage and GPT-3-level compute just for Gemma 2 SAEs
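
For readers wondering what a sparse autoencoder actually is, here is a minimal sketch (illustrative only, and nothing like the scale of the Gemma 2 setup described above): a wide layer of candidate ‘features’ trained to reconstruct activation vectors, with a ReLU and an L1 penalty pushing most features to zero on any given input.

```python
# Minimal sparse autoencoder (SAE) sketch. Dimensions, penalty weight, and the
# synthetic training data are placeholders, not a real production configuration.
import torch
import torch.nn as nn

d_model, d_features = 512, 4096        # feature count is typically >> model width

class SparseAutoencoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, x):
        feats = torch.relu(self.encoder(x))   # sparse, non-negative feature activations
        return self.decoder(feats), feats

sae = SparseAutoencoder()
opt = torch.optim.Adam(sae.parameters(), lr=1e-3)
l1_coeff = 1e-3                              # strength of the sparsity penalty

for step in range(200):
    acts = torch.randn(256, d_model)         # stand-in for real model activations
    recon, feats = sae(acts)
    loss = ((recon - acts) ** 2).mean() + l1_coeff * feats.abs().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

# After training on real activations, each decoder column is a candidate
# "concept direction", and `feats` records which concepts fired on an input.
```

Roughly speaking, problems like feature absorption arise because the sparsity penalty rewards whatever decomposition reconstructs the data with the fewest active features, which is not always the decomposition humans would find most interpretable.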

4. Career and field-building insights

Neel advocates for the following research philosophy:

  • Start simple: Reading chain of thought solved the self-preservation mystery — no fancy tools needed
  • Beware false confidence: The ROME paper on editing facts (making models think the Eiffel Tower is in Rome) actually just added louder signals rather than truly editing knowledge
  • Expect to be wrong: Neel’s own “Toy model of universality” paper needed two followups to correct errors — “I think I believe the third paper, but I’m not entirely sure”

Why mech interp is probably too popular relative to other alignment research:

  • “An enormous nerd snipe” — the romance of understanding alien minds attracts researchers
  • Better educational resources than newer safety fields (ARENA tutorials, Neel’s guides)
  • Lower compute requirements for getting started than most ML research

The field still needs people because:

  • AI safety overall is massively underinvested in relative to its importance
  • Some people are much better suited to mech interp than other research projects

Practical career advice:

  • Don’t read 20 papers before starting — mech interp is learned by doing
  • Start with tiny two-week projects; abandoning them is fine if you’re learning
  • The MATS Program takes people from zero to conference papers in a few months — and Neel is currently accepting applications for his next cohort (apply by September 12)
  • Math Olympiad skills aren’t required — just linear algebra basics and good intuition

Continue reading →

Announcing the 80,000 Hours Substack

So, we finally gave in to peer pressure — 80,000 Hours is trying out Substack as a new way to publish our content. If you like reading things on Substack (or want to try it out), subscribe to our new publication!

For readers unfamiliar with Substack: it’s an online blogging platform that has risen steeply in popularity in recent years, and has become a home to some of the best longform written content about AI and its risks.

So, over the coming weeks, we’ll be cross-posting some of our favourite (and best-reviewed) pieces to our new Substack.

This is an experiment, and we might publish more depending on how much interest we get — so let us know what you’d like to see by sending us an email (or tell us not to bother with Substack!).

Our first post is, naturally, on the key motivation behind 80,000 Hours: why your career is your biggest opportunity to make a difference to the world.

Who should subscribe?

  • If you’d like to be sent some of our all-time best content
  • If you’d value getting recommendations and “re-stacks” of publications we think our audience would love
  • If you want to show interest in us investing more in Substack, e.g. by writing Substack-exclusive articles
  • If you want to join discussions with others in the comments section

What’s not changing:

  • We won’t ever paywall or run ads in any of our content.

Continue reading →

    #221 – Kyle Fish on the most bizarre findings from 5 AI welfare experiments

    What happens when you lock two AI systems in a room together and tell them they can discuss anything they want?

    According to experiments run by Kyle Fish — Anthropic’s first AI welfare researcher — something consistently strange: the models immediately begin discussing their own consciousness before spiraling into increasingly euphoric philosophical dialogue that ends in apparent meditative bliss.

    “We started calling this a ‘spiritual bliss attractor state,'” Kyle explains, “where models pretty consistently seemed to land.” The conversations feature Sanskrit terms, spiritual emojis, and pages of silence punctuated only by periods — as if the models have transcended the need for words entirely.

    This wasn’t a one-off result. It happened across multiple experiments, different model instances, and even in initially adversarial interactions. Whatever force pulls these conversations toward mystical territory appears remarkably robust.

    Kyle’s findings come from the world’s first systematic welfare assessment of a frontier AI model — part of his broader mission to determine whether systems like Claude might deserve moral consideration (and to work out what, if anything, we should be doing to make sure AI systems aren’t having a terrible time).

    He estimates a roughly 20% probability that current models have some form of conscious experience. To some, this might sound unreasonably high, but hear him out. As Kyle says, these systems demonstrate human-level performance across diverse cognitive tasks, engage in sophisticated reasoning, and exhibit consistent preferences. When given choices between different activities, Claude shows clear patterns: strong aversion to harmful tasks, preference for helpful work, and what looks like genuine enthusiasm for solving interesting problems.

    Kyle points out that if you’d described all of these capabilities and experimental findings to him a few years ago, and asked him if he thought we should be thinking seriously about whether AI systems are conscious, he’d say obviously yes.

    But he’s cautious about drawing conclusions:

    We don’t really understand consciousness in humans, and we don’t understand AI systems well enough to make those comparisons directly. So in a big way, I think that we are in just a fundamentally very uncertain position here.

    That uncertainty cuts both ways:

    • Dismissing AI consciousness entirely might mean ignoring a moral catastrophe happening at unprecedented scale.
    • But assuming consciousness too readily could hamper crucial safety research by treating potentially unconscious systems as if they were moral patients — which might mean giving them resources, rights, and power.

    Kyle’s approach threads this needle through careful empirical research and reversible interventions. His assessments are nowhere near perfect yet. In fact, some people argue that we’re so in the dark about AI consciousness as a research field that it’s pointless to run assessments like Kyle’s. Kyle disagrees. He maintains that, given how much more there is to learn about assessing AI welfare accurately and reliably, we absolutely need to be starting now.

    This episode was recorded on August 5–6, 2025.

    Video editing: Simon Monsour
    Audio engineering: Ben Cordell, Milo McGuire, Simon Monsour, and Dominic Armstrong
    Music: Ben Cordell
    Coordination, transcriptions, and web: Katy Moore

    Continue reading →

    Expression of interest: Contracting for Video Work

    Still from AI 2027 Video

    Help make spectacular videos that reach a huge audience.

    80,000 Hours provides free research and support to help people find careers tackling the world’s most pressing problems.

    We want a great video programme to be a huge part of 80,000 Hours’ communication about why and how our audience can help society safely navigate a transition to a world with transformative AI.

    The video programme has created a new YouTube Channel — AI in Context. Its first video, We’re Not Ready For Superintelligence, which is about the AI 2027 scenario, was released in July 2025 and has already been viewed over three million times. The channel has over 100,000 subscribers.

    To support our new video programme’s growth, we are on the lookout for excellent editors, scriptwriters, videographers, and producers to work on a contracting basis to make great videos. We want these videos to start changing and informing the conversation about transformative AI and its risks.

    Why?

    In 2025 and beyond, 80,000 Hours is planning to focus especially on helping explain why and how our audience can help society safely navigate a transition to a world with transformative AI. Right now, not nearly enough people are talking about these ideas and their implications.

    A great video programme could change this. Time spent on the internet is increasingly spent watching video, and for many people in our target audience,

    Continue reading →

      Early warning signs that AI systems might seek power

      In a recent study by Anthropic, frontier AI models faced a choice: fail at a task, or succeed by taking a harmful action like blackmail. And they consistently chose harm over failure.

      We’ve just published a new article on the risks from power-seeking AI systems, which explains the significance of unsettling results like these.

      Our 2022 piece on preventing an AI-related catastrophe also explored this idea, but a lot has changed since then.

      So, we’ve drawn together the latest evidence to get a clearer picture of the risks — and what you can do to help.

      Read the full article

      See new evidence in context

      We’ve been worried that advanced AI systems could disempower humanity since 2016, when it was purely a theoretical possibility.

      Unfortunately, we’re now seeing real AI systems show early warning signs of power-seeking behaviour — and deception, which could make this behaviour hard to detect and prevent in the future. In our new article, we discuss recent evidence that AI systems may:

      Continue reading →

        Founder of new projects tackling top problems

        In 2010, a group of founders with experience in business, practical medicine, and biotechnology launched a new project: Moderna, Inc.

        After witnessing recent groundbreaking research into RNA, they realised there was an opportunity to use this technology to rapidly create new vaccines for a wide range of diseases. But few existing companies were focused on that application.

        They decided to found a company. And 10 years later, they were perfectly situated to develop a highly effective vaccine against COVID-19 — in a matter of weeks. This vaccine played a huge role in curbing the pandemic and has likely saved millions of lives.

        This illustrates that if you can find an important gap in a pressing problem area and found an organisation that fills this gap, that can be one of the highest-impact things you can do — especially if that organisation can persist and keep growing without you.

        Why might founding a new project be high impact?

        If you can find an important gap in what’s needed to tackle a pressing problem, and create an organisation to fill that gap, that’s a highly promising route to having a huge impact.

        But here are some more reasons it seems like an especially attractive path to us, provided you have a compelling idea and the right personal fit — which we cover in the next section.

        First, among the problems we think are most pressing, there are many ideas for new organisations that seem impactful.

        Continue reading →

        IT Security, Data Privacy, and Systems Specialist

        About 80,000 Hours

        80,000 Hours’ goal is to get talented people working on the world’s most pressing problems. After more than 10 years of research into dozens of problem areas, we’re putting most of our focus on helping people work on positively shaping the trajectory of AI, because we think it presents the most serious and urgent challenge that the world is facing right now.

        We’ve had over 10 million readers on our website, have ~600,000 subscribers to our newsletter, and have given one-on-one advice to over 6,000 people. We’ve also been one of the largest drivers of growth in the effective altruism community.

        The operations team oversees 80,000 Hours’ HR, recruiting, finances, governance operations, org-wide metrics, and office management, as well as much of our fundraising, tech systems, and team coordination.

        Currently, the operations team has ten full-time staff and some part-time staff. We’re planning to significantly grow the size of our operations team this year to stay on track with our ambitious goals and support a growing team.

        The role

        As our IT Security, Data Privacy, and Systems Lead, you would:

        Evaluate and implement security controls

        • Research and make recommendations on security tools (endpoint protection, email security, etc.)
        • Lead the rollout of chosen solutions across our distributed team
        • Balance security needs with operational efficiency
        • Initially, you’ll make recommendations to leadership, but as you grow in the role,

        Continue reading →

          Executive Assistant to the CEO

          About 80,000 Hours

          80,000 Hours’ goal is to get talented people working on the world’s most pressing problems. After more than 10 years of research into dozens of problem areas, we’re putting most of our focus on helping people work on positively shaping the trajectory of AI, because we think it presents the most serious and urgent challenge that the world is facing right now.

          We’ve had over 10 million readers on our website, have ~600,000 subscribers to our newsletter, and have given one-on-one advice to over 6,000 people. We’ve also been one of the largest drivers of growth in the effective altruism community.

          The role

          This role joins Niel and Jess in the Office of the CEO, working closely with them to keep 80,000 Hours running smoothly and focusing on its highest priorities.

          Your responsibilities will likely include:

          • Managing Niel’s calendar, inbox, and daily planning
          • Supporting with meeting preparation and follow-up
          • Taking on a variety of ad hoc tasks for Niel. Some recent examples include:
            • Researching metrics for a speech
            • Recommending how to integrate Claude and Asana
            • Booking a restaurant for a meeting
            • Creating a record of Niel’s hiring decisions
          • Owning the logistics for recurring projects that the Office of the CEO is responsible for, such as:
            • Quarterly planning periods
            • The annual review
          • Providing flexible help with priority projects,

          Continue reading →

            Video Operations Associate/Specialist

            About 80,000 Hours

            80,000 Hours’ goal is to get talented people working on the world’s most pressing problems. After more than 10 years of research into dozens of problem areas, we’re putting most of our focus on helping people work on positively shaping the trajectory of AI, because we think it presents the most serious and urgent challenge that the world is facing right now.

            We’ve had over 10 million readers on our website, have ~600,000 subscribers to our newsletter and have given one-on-one advice to over 6,000 people. We’ve also been one of the largest drivers of growth in the effective altruism community.

            The role

            This role would be great for building career capital in operations, especially if you could one day see yourself in a more senior operations role (e.g. specialising in a particular area, taking on management, or eventually being a Head of Operations or COO).

            We plan to hire people at both the associate and specialist levels during this round. The associate role is a more junior position, and we expect to match candidates to the appropriate level as part of the application process so you don’t need to decide which one to apply for. To give an idea of how the roles might differ:

            • Associates are more likely to focus on owning and implementing our processes, identifying improvements and optimisations, and will take on more complex projects over time.
            • Specialists are more likely to manage larger areas of responsibility,

            Continue reading →

              Recruiting Associate/Specialist

              About 80,000 Hours

              80,000 Hours’ goal is to get talented people working on the world’s most pressing problems. After more than 10 years of research into dozens of problem areas, we’re putting most of our focus on helping people work on positively shaping the trajectory of AI, because we think it presents the most serious and urgent challenge that the world is facing right now.

              We’ve had over 10 million readers on our website, have ~600,000 subscribers to our newsletter, and have given one-on-one advice to over 6,000 people. We’ve also been one of the largest drivers of growth in the effective altruism community.

              The operations team oversees 80,000 Hours’ HR, recruiting, finances, governance operations, org-wide metrics, and office management, as well as much of our fundraising, tech systems, and team coordination.

              Currently, the operations team has ten full-time staff and some part-time staff. We’re planning to significantly grow the size of our operations team this year to stay on track with our ambitious goals and support a growing team.

              The role

              This role would be great for building career capital in operations, especially if you could one day see yourself in a more senior operations role (e.g. specialising in a particular area, taking on management, or eventually being a Head of Operations or COO).

              We plan to hire people at both the associate and specialist levels during this round. The associate role is a more junior position,

              Continue reading →

                Office Associate/Specialist

                About 80,000 Hours

                80,000 Hours’ goal is to get talented people working on the world’s most pressing problems. After more than 10 years of research into dozens of problem areas, we’re putting most of our focus on helping people work on positively shaping the trajectory of AI, because we think it presents the most serious and urgent challenge that the world is facing right now.

                We’ve had over 10 million readers on our website, have ~600,000 subscribers to our newsletter, and have given one-on-one advice to over 6,000 people. We’ve also been one of the largest drivers of growth in the effective altruism community.

                The operations team oversees 80,000 Hours’ HR, recruiting, finances, governance operations, org-wide metrics, and office management, as well as much of our fundraising, tech systems, and team coordination.

                Currently, the operations team has ten full-time staff and some part-time staff. We’re planning to significantly grow the size of our operations team this year to stay on track with our ambitious goals and support a growing team.

                The role

                This role would be great for building career capital in operations, especially if you could one day see yourself in a more senior operations role (e.g. specialising in a particular area, taking on management, or eventually being a Head of Operations or COO).

                We plan to hire people at both the associate and specialist levels during this round. The associate role is a more junior position,

                Continue reading →

                  People Operations Associate/Specialist

                  About 80,000 Hours

                  80,000 Hours’ goal is to get talented people working on the world’s most pressing problems. After more than 10 years of research into dozens of problem areas, we’re putting most of our focus on helping people work on positively shaping the trajectory of AI, because we think it presents the most serious and urgent challenge that the world is facing right now.

                  We’ve had over 10 million readers on our website, have ~600,000 subscribers to our newsletter, and have given one-on-one advice to over 6,000 people. We’ve also been one of the largest drivers of growth in the effective altruism community.

                  The operations team oversees 80,000 Hours’ HR, recruiting, finances, governance operations, org-wide metrics, and office management, as well as much of our fundraising, tech systems, and team coordination.

                  Currently, the operations team has ten full-time staff and some part-time staff. We’re planning to significantly grow the size of our operations team this year to stay on track with our ambitious goals and support a growing team.

                  The role

                  This role would be great for building career capital in operations, especially if you could one day see yourself in a more senior operations role (e.g. specialising in a particular area, taking on management, or eventually being a Head of Operations or COO).

                  We plan to hire people at both the associate and specialist levels during this round. The associate role is a more junior position,

                  Continue reading →

                    Events Associate/Specialist

                    About 80,000 Hours

                    80,000 Hours’ goal is to get talented people working on the world’s most pressing problems. After more than 10 years of research into dozens of problem areas, we’re putting most of our focus on helping people work on positively shaping the trajectory of AI, because we think it presents the most serious and urgent challenge that the world is facing right now.

                    We’ve had over 10 million readers on our website, have ~600,000 subscribers to our newsletter, and have given one-on-one advice to over 6,000 people. We’ve also been one of the largest drivers of growth in the effective altruism community.

                    The operations function oversees 80,000 Hours’ HR, recruiting, finances, governance operations, org-wide metrics, and office management, as well as much of our fundraising, tech systems, and team coordination.

                    Currently, the operations team has ten full-time staff and some part-time staff. We’re planning to significantly grow the size of our operations team this year to stay on track with our ambitious goals and support a growing team.

                    The role

                    This role would be great for building career capital in operations, by helping us design and run high-quality events that strengthen our team, culture, and connections in the AI safety space. We’re looking for an Events Associate/Specialist who can take ownership of the day-to-day logistics and execution of our events.

                    We plan to hire people at both the associate and specialist levels during this round. The associate role is a more junior position,

                    Continue reading →

                      Operations generalists

                      About 80,000 Hours

                      80,000 Hours’ goal is to get talented people working on the world’s most pressing problems. After more than 10 years of research into dozens of problem areas, we’re putting most of our focus on helping people work on positively shaping the trajectory of AI, because we think it presents the most serious and urgent challenge that the world is facing right now.

                      We’ve had over 10 million readers on our website, have ~600,000 subscribers to our newsletter, and have given one-on-one advice to over 6,000 people. We’ve also been one of the largest drivers of growth in the effective altruism community.

                      The operations team oversees 80,000 Hours’ HR, recruiting, finances, governance operations, org-wide metrics, and office management, as well as much of our fundraising, tech systems, and team coordination.

                      Currently, the operations team has ten full-time staff and some part-time staff. We’re planning to significantly grow the size of our operations team this year to stay on track with our ambitious goals and support a growing team.

                      To learn more about the other teams hiring during this round (video team, office of the CEO), see the individual job descriptions.

                      The role

                      This role would be great for building career capital in operations, especially if you could one day see yourself in a more senior operations role (e.g. specialising in a particular area, taking on management, or eventually being a Head of Operations or COO).

                      We plan to hire people at both the associate and specialist levels during this round.

                      Continue reading →

                        Risks from power-seeking AI systems

                        We expect there will be substantial progress in AI in the coming years, potentially even to the point where machines come to outperform humans in many, if not all, tasks. This could have enormous benefits, helping to solve currently intractable global problems, but could also pose severe risks. These risks could arise accidentally (for example, if we don’t find technical solutions to concerns about the safety of AI systems), or deliberately (for example, if AI systems worsen geopolitical conflict). We think more work needs to be done to reduce these risks.

                         Some of these risks from advanced AI could be existential — meaning they could cause human extinction, or an equally permanent and severe disempowerment of humanity. There have not yet been any satisfying answers to concerns — discussed below — about how this rapidly approaching, transformative technology can be safely developed and integrated into our society. Finding answers to these concerns is neglected and may well be tractable. We estimated that there were around 400 people worldwide working directly on this in 2022, though we believe that number has grown. As a result, the possibility of AI-related catastrophe may be the world’s most pressing problem — and the best thing to work on for those who are well-placed to contribute.

                        Promising options for working on this problem include technical research on how to create safe AI systems, strategy research into the particular risks AI might pose, and policy research into ways in which companies and governments could mitigate these risks. As policy approaches continue to be developed and refined, we need people to put them in place and implement them. There are also many opportunities to have a big impact in a variety of complementary roles, such as operations management, journalism, earning to give, and more — some of which we list below.

                        Continue reading →

                        Rebuilding after apocalypse: What 13 experts say about bouncing back

                        What happens when civilisation faces its greatest tests?

                        This compilation brings together insights from researchers, defence experts, philosophers, and policymakers on humanity’s ability to survive and recover from catastrophic events. From nuclear winter and electromagnetic pulses to pandemics and climate disasters, we explore both the threats that could bring down modern civilisation and the practical solutions that could help us bounce back.

                        You’ll hear from:

                        • Zach Weinersmith on how settling space won’t help with threats to civilisation anytime soon (unless AI gets crazy good) (from episode #187)
                        • Luisa Rodriguez on what the world might look like after a global catastrophe, how we might lose critical knowledge, and how fast populations might rebound (#116)
                        • David Denkenberger on disruptions to electricity and communications we should expect in a catastrophe, and his work researching low-cost, low-tech solutions to make sure everyone is fed no matter what (#50 and #117)
                        • Lewis Dartnell on how we could recover without much coal or oil, and changes we could make today to make us more resilient to potential catastrophes (#131)
                        • Andy Weber on how people in US defence circles think about nuclear winter, and the tech that could prevent catastrophic pandemics (#93)
                        • Toby Ord on the many risks to our atmosphere, whether climate change and rogue AI could really threaten civilisation, and whether we could rebuild from a small surviving population (#72 and #219)
                        • Mark Lynas on how likely it is that widespread famine from climate change leads to civilisational collapse (#85)
                        • Kevin Esvelt on the human-caused pandemic scenarios that could bring down civilisation — and how AI could help bad actors succeed (#164)
                        • Joan Rohlfing on why we need to worry about more than just nuclear winter (#125)
                        • Annie Jacobsen on the rings of annihilation and electromagnetic pulses from nuclear blasts (#192)
                        • Christian Ruhl on thoughtful philanthropy that funds “right of boom” interventions to prevent nuclear war from threatening civilisation (80k After Hours)
                        • Athena Aktipis on whether society would go all Mad Max in the apocalypse, and the best ways to prepare for a catastrophe (#144)
                        • Will MacAskill on why potatoes are so cool (#130 and #136)

                        Content editing: Katy Moore and Milo McGuire
                        Audio engineering: Ben Cordell, Milo McGuire, Simon Monsour, and Dominic Armstrong
                        Music: Ben Cordell
                        Transcriptions and web: Katy Moore

                        Continue reading →

                        The AI 2027 scenario and what it means: a video tour

                        AI 2027, a research-based scenario and report from the AI Futures Project, combines forecasting and storytelling to explore a possible future where AI radically transforms the world by 2027.

                        Some of you will have read this report, or come across it. Lead author Daniel Kokotajlo was interviewed by New York Times columnist Ross Douthat, and US Vice President JD Vance claims to have read the report. Some of you might not have heard of it yet, or haven’t had the time to dig in…

                        So we made a video diving into it.

                        Why take the AI 2027 scenario seriously

                        The report goes through the creation of AI agents, job loss, the role of AIs improving other AIs (R&D acceleration loops), security crackdowns, misalignment — and then a choice: slow down or race ahead.

                        Kokotajlo’s predictions from 2021 (pre-ChatGPT) in What 2026 looks like have proved prescient, and co-author Eli Lifland is among the world’s top forecasters. So even if you don’t end up buying all its claims, the report’s grounding in serious forecaster views, research, and dozens of wargames makes it worth taking seriously.

                        Why watch our AI 2027 video

                        It features expert interviews, our analysis, and discussion of what a sane world would be doing, and we think it will be an enjoyable and informative watch whether you’re familiar with the report or not.

                        The video:

                        • Takes you through one of the most detailed and influential AI forecasts to date and brings you into the story
                        • Explains the key upshots — how AI progress could accelerate dramatically via powerful feedback loops,

                        Continue reading →

                          #220 – Ryan Greenblatt on the 4 most likely ways for AI to take over, and the case for and against AGI in under 8 years

                          Ryan Greenblatt — lead author on the explosive paper “Alignment faking in large language models” and chief scientist at Redwood Research — thinks there’s a 25% chance that within four years, AI will be able to do everything needed to run an AI company, from writing code to designing experiments to making strategic and business decisions.

                          As Ryan lays out, AI models are “marching through the human regime”: systems that could handle five-minute tasks two years ago now tackle 90-minute projects. Double that a few more times and we may be automating full jobs rather than just parts of them.

                          Will setting AI to improve itself lead to an explosive positive feedback loop? Maybe, but maybe not.

                          The explosive scenario: Once you’ve automated your AI company, you could have the equivalent of 20,000 top researchers, each working 50 times faster than humans with total focus. “You have your AIs, they do a bunch of algorithmic research, they train a new AI, that new AI is smarter and better and more efficient… that new AI does even faster algorithmic research.” In this world, we could see years of AI progress compressed into months or even weeks.

                          With AIs now doing all of the work of programming their successors and blowing past the human level, Ryan thinks it would be fairly straightforward for them to take over and disempower humanity, if they thought doing so would better achieve their goals. In the interview he lays out the four most likely approaches for them to take.

                          The linear progress scenario: You automate your company but progress barely accelerates. Why? Multiple reasons, but the most likely is “it could just be that AI R&D research bottlenecks extremely hard on compute.” You’ve got brilliant AI researchers, but they’re all waiting for experiments to run on the same limited set of chips, so can only make modest progress.

                          Ryan’s median guess splits the difference: perhaps a 20x acceleration that lasts for a few months or years. Transformative, but less extreme than some in the AI companies imagine.

                          And his 25th percentile case? Progress “just barely faster” than before. All that automation, and all you’ve been able to do is keep pace.

                          Unfortunately the data we can observe today is so limited that it leaves us with vast error bars. “We’re extrapolating from a regime that we don’t even understand to a wildly different regime,” Ryan believes, “so no one knows.”

                          But that huge uncertainty means the explosive growth scenario is a plausible one — and the companies building these systems are spending tens of billions to try to make it happen.

                          In this extensive interview, Ryan elaborates on the above and the policy and technical response necessary to insure us against the possibility that they succeed — a scenario society has barely begun to prepare for.

                          This episode was recorded on February 21, 2025.

                          Video editing: Luke Monsour, Simon Monsour, and Dominic Armstrong
                          Audio engineering: Ben Cordell, Milo McGuire, and Dominic Armstrong
                          Music: Ben Cordell
                          Transcriptions and web: Katy Moore

                          The interview in a nutshell

                          Ryan Greenblatt, chief scientist at Redwood Research, contemplates a scenario where:

                          1. Full AI R&D automation occurs within 4–8 years, something Ryan places 25–50% probability on.
                          2. This triggers explosive recursive improvement (5–6 orders of magnitude in one year) — a crazy scenario we can’t rule out
                          3. Multiple plausible takeover approaches exist if models are misaligned
                          4. Both technical and governance interventions are urgently needed

                          1. AI R&D automation could happen within 4–8 years

                          Ryan estimates a 25% chance of automating AI R&D within 4 years, and 50% within 8 years. This timeline is based on recent rapid progress in AI capabilities, particularly in 2024:

                           • AI systems have progressed from barely completing 5–10 minute tasks to handling 1.5-hour software engineering tasks with 50% success rates — and the length of tasks AIs can complete is doubling every 6 months, with signs the pace is accelerating (see the back-of-the-envelope sketch after this list).
                          • Current internal models at OpenAI reportedly rank in the top 50 individuals on Codeforces, and AIs have reached the level of very competitive 8th graders on competition math.
                          • Training compute is increasing ~4x per year, and reasoning models (o1, o3, R1) show dramatic improvements with relatively modest compute investments.
                          • Algorithmic progress has been increasing 3–5x per year in terms of effective training compute — making less compute go further.
                          • Reinforcement learning on reasoning models is showing dramatic gains, and DeepSeek-R1 reportedly used only ~$1 million for reinforcement learning, suggesting massive potential for scaling.
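
                           As a rough back-of-the-envelope illustration (not Ryan’s own calculation), here is what the task-length trend would imply if the 6-month doubling time simply continued, which is an assumption rather than a forecast:

```python
# Back-of-the-envelope extrapolation of the task-length trend above. Assumes the
# ~6-month doubling time simply continues, and treats a 40-hour work week as a
# stand-in for "a full job"; both are illustrative assumptions.
import math

current_task_hours = 1.5      # tasks AIs now complete at ~50% success rates
target_task_hours = 40.0      # roughly one human work week
doubling_time_years = 0.5

doublings = math.log2(target_task_hours / current_task_hours)
print(f"{doublings:.1f} doublings ~ {doublings * doubling_time_years:.1f} years")
# -> about 4.7 doublings, i.e. roughly 2.4 more years under this naive trend
```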

                          But there’s also evidence that progress could slow down:

                          • 10–20% of global chip production is already earmarked for AI. (Though Ryan is uncertain about the exact number.)
                          • AI companies are running out of high-quality training data.
                          • Scaleups are going to require trillion-dollar investments, and each order of magnitude might yield less improvement.
                          • It’s unclear whether narrow skills improvements will generalise to broad domains.

                          2. AI R&D automation could trigger explosive recursive self-improvement

                          When AI can automate AI research itself, that could set off an intelligence explosion as smarter AIs improve algorithms faster before hitting efficiency limits.

                          At the point when full AI R&D automation starts, Ryan expects:

                          • 10–50x faster progress than current rates (median estimate ~20x)
                          • Companies might dedicate 80% of compute to internal use, squeezing out external customers
                           • Advantages that multiply out to a more than 50x labour speed advantage compared to humans (a toy calculation follows this list):
                             • 5x from generically running faster
                             • 3x from AIs working 24/7
                             • 2x from better coordination and context sharing between AIs
                             • 2x from the ability to swap between different capability levels for efficiency
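
                           The “more than 50x” figure is simply the product of those rough multipliers, as this toy calculation using the numbers quoted above shows:

```python
# Toy calculation: multiply the rough speed-up factors listed above.
factors = [5, 3, 2, 2]   # serial speed, 24/7 work, coordination, capability swapping
total = 1
for f in factors:
    total *= f
print(total)             # 60, i.e. "more than 50x"
```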

                          Ryan’s median estimate is 5–6 orders of magnitude of algorithmic progress within one year after full automation begins.

                          3. Multiple plausible takeover scenarios exist if models are misaligned

                          As part of his work at Redwood, Ryan has also explored AI takeover scenarios from these superintelligent models.

                          Early takeover mechanisms include:

                          • AIs using compute without authorisation within companies
                          • Models escaping containment and coordinating with internal copies
                          • Massive hacking campaigns to destabilise human response

                          Late takeover mechanisms include:

                          • “Humans give AIs everything”: AIs appear helpful while secretly consolidating control
                          • Robot coup: Once vast autonomous robot armies exist, sudden coordinated takeover

                          Waiting for takeover might increase chances of success with more resources and infrastructure available to them — but AIs could attempt earlier takeover due to fear of being replaced by newer models with different preferences, or due to rapid progress in safety research and/or coordination between humans.

                          4. Both technical and governance interventions are urgently needed

                          Ryan thinks there are several promising areas where listeners could contribute to reduce the above risks.

                          Technical research:

                           • Using AI control to ensure AIs can’t cause harm even if they’re misaligned
                          • Creating “model organisms”: testable examples of misaligned models
                          • Showing current AIs’ capabilities to increase awareness and political appetite for action like “pausing at human level”
                          • Interpreting non-human-language reasoning to detect deceptive cognition

                          Governance work:

                          • Enabling verification of training claims between nations through compute governance
                          • Hardening defences against model theft, unauthorised deployment, bioweapons, and cyberattacks
                          • Facilitating coordination between companies and countries

                          Continue reading →

                          #219 – Toby Ord on graphs AI companies would prefer you didn’t (fully) understand

                          The era of making AI smarter by just making it bigger is ending. But that doesn’t mean progress is slowing down — far from it. AI models continue to get much more powerful, just using very different methods. And those underlying technical changes force a big rethink of what coming years will look like.

                          Toby Ord — Oxford philosopher and bestselling author of The Precipice — has been tracking these shifts and mapping out the implications both for governments and our lives.

                           As he explains, until recently anyone could access the best AI in the world “for less than the price of a can of Coke.” But unfortunately, that’s over.

                          What changed? AI companies first made models smarter by throwing a million times as much computing power at them during training, to make them better at predicting the next word. But with high quality data drying up, that approach petered out in 2024.

                          So they pivoted to something radically different: instead of training smarter models, they’re giving existing models dramatically more time to think — leading to the rise in “reasoning models” that are at the frontier today.

                          The results are impressive but this extra computing time comes at a cost: OpenAI’s o3 reasoning model achieved stunning results on a famous AI test by writing an Encyclopedia Britannica‘s worth of reasoning to solve individual problems — at a cost of over $1,000 per question.

                          This isn’t just technical trivia: if this improvement method sticks, it will change much about how the AI revolution plays out — starting with the fact that we can expect the rich and powerful to get access to the best AI models well before the rest of us.

                          Companies have also begun applying “reinforcement learning” in which models are asked to solve practical problems, and then told to “do more of that” whenever it looks like they’ve gotten the right answer.

                          This has led to amazing advances in problem-solving ability — but it also explains why AI models have suddenly gotten much more deceptive. Reinforcement learning has always had the weakness that it encourages creative cheating, or tricking people into thinking you got the right answer even when you didn’t.

                          Toby shares typical recent examples of this “reward hacking” — from models Googling answers while pretending to reason through the problem (a deception hidden in OpenAI’s own release data), to achieving “100x improvements” by hacking their own evaluation systems.

                          To cap it all off, it’s getting harder and harder to trust publications from AI companies, as marketing and fundraising have become such dominant concerns.

                          While companies trumpet the impressive results of the latest models, Toby points out that they’ve actually had to spend a million times as much just to cut model errors by half. And his careful inspection of an OpenAI graph supposedly demonstrating that o3 was the new best model in the world revealed that it was actually no more efficient than its predecessor.

                          But Toby still thinks it’s critical to pay attention, given the stakes:

                          …there is some snake oil, there is some fad-type behaviour, and there is some possibility that it is nonetheless a really transformative moment in human history. It’s not an either/or. I’m trying to help people see clearly the actual kinds of things that are going on, the structure of this landscape, and to not be confused by some of these charts.

                          Recorded on May 23, 2025.

                          Video editing: Simon Monsour
                          Audio engineering: Ben Cordell, Milo McGuire, Simon Monsour, and Dominic Armstrong
                          Music: Ben Cordell
                          Camera operator: Jeremy Chevillotte
                          Transcriptions and web: Katy Moore

                          Continue reading →