Imagine if some alien organism landed on Earth and could do these things.

Everybody would be falling over themselves to figure out how… And so really the thing that is calling out in all this work for us to go and answer is, “What in the wide world is going on inside these systems??”

Chris Olah

Big machine learning models can identify plant species better than any human, write passable essays, beat you at a game of Starcraft 2, figure out how a photo of Tobey Maguire and the word ‘spider’ are related, solve the 60-year-old ‘protein folding problem’, diagnose some diseases, play romantic matchmaker, write solid computer code, and offer questionable legal advice.

Humanity made these amazing and ever-improving tools. So how do our creations work? In short: we don’t know.

Today’s guest, Chris Olah, finds this both absurd and unacceptable. Over the last ten years he has been a leader in the effort to unravel what’s really going on inside these black boxes. As part of that effort he helped create the famous DeepDream visualisations at Google Brain, reverse engineered the CLIP image classifier at OpenAI, and is now continuing his work at Anthropic, a new $100 million research company that tries to “co-develop the latest safety techniques alongside scaling of large ML models”.

Despite having a huge fan base thanks to his tweets and lay explanations of ML, today’s episode is the first long interview Chris has ever given. It features his personal take on what we’ve learned so far about what ML algorithms are doing, and what’s next for this research agenda at Anthropic.

His decade of work has borne substantial fruit, producing an approach for looking inside the mess of connections in a neural network and back out what functional role each piece is serving. Among other things, Chris and team found that every visual classifier seems to converge on a number of simple common elements in their early layers — elements so fundamental they may exist in our own visual cortex in some form.

They also found networks developing ‘multimodal neurones’ that would trigger in response to the presence of high-level concepts like ‘romance’, across both images and text, mimicking the famous ‘Halle Berry neuron’ from human neuroscience.

While reverse engineering how a mind works would make any top-ten list of the most valuable knowledge to pursue for its own sake, Chris’s work is also of urgent practical importance. Machine learning models are already being deployed in medicine, business, the military, and the justice system, in ever more powerful roles. The competitive pressure to put them into action as soon as they can turn a profit is great, and only getting greater.

But if we don’t know what these machines are doing, we can’t be confident they’ll continue to work the way we want as circumstances change. Before we hand an algorithm the proverbial nuclear codes, we should demand more assurance than “well, it’s always worked fine so far”.

But by peering inside neural networks and figuring out how to ‘read their minds’ we can potentially foresee future failures and prevent them before they happen. Artificial neural networks may even be a better way to study how our own minds work, given that, unlike a human brain, we can see everything that’s happening inside them — and having been posed similar challenges, there’s every reason to think evolution and ‘gradient descent’ often converge on similar solutions.

Among other things, Rob and Chris cover:

  • Why Chris thinks it’s necessary to work with the largest models
  • Whether you can generalise from visual to language models
  • What fundamental lessons we’ve learned about how neural networks (and perhaps humans) think
  • What it means that neural networks are learning high-level concepts like ‘superheroes’, mental health, and Australiana, and can identify these themes across both text and images
  • How interpretability research might help make AI safer to deploy, and Chris’ response to skeptics
  • Why there’s such a fuss about ‘scaling laws’ and what they say about future AI progress
  • What roles Anthropic is hiring for, and who would be a good fit for them

Get this episode by subscribing to our podcast on the world’s most pressing problems and how to solve them: type 80,000 Hours into your podcasting app. Or read the transcript below.

Producer: Keiran Harris
Audio mastering: Ben Cordell
Transcriptions: Sofia Davis-Fogel



Chris Olah: Well, in the last couple of years, neural networks have been able to accomplish all of these tasks that no human knows how to write a computer program to do directly. We can’t write a computer program to go and classify images, but we can write a neural network to create a computer program that can classify images. We can’t go and write computer programs directly to go and translate text highly accurately, but we can train the neural network to go and translate texts much better than any program we could have written.

Chris Olah: And it’s always seemed to me that the question that is crying out to be answered there is, “How is it that these models are doing these things that we don’t know how to do?” I think a lot of people might debate exactly what interpretability is. But the question that I’m interested in is, “How do these systems accomplish these tasks? What’s going on inside of them?”

Chris Olah: Imagine if some alien organism landed on Earth and could go and do these things. Everybody would be rushing and falling over themselves to figure out how the alien organism was doing things. You’d have biologists fighting each other for the right to go and study these alien organisms. Or imagine that we discovered some binary just floating on the internet in 2012 that could do all these things. Everybody would be rushing to go and try and reverse engineer what that binary is doing. And so it seems to me that really the thing that is calling out in all this work for us to go and answer is, “What in the wide world is going on inside these systems??”

Chris Olah: The really amazing thing is that as you start to understand what different neurons are doing, you actually start to be able to go and read algorithms off of the weights… We can genuinely understand how large chunks of neural networks work. We can actually reverse engineer chunks of neural networks and understand them so well that we can go and hand-write weights… I think that is a very high standard for understanding systems.

Chris Olah: Ultimately the reason I study (and think it’s useful to study) these smaller chunks of neural network is that it gives us an epistemic foundation for thinking about interpretability. The cost is that we’re talking to these small parts and we’re setting ourselves up for a struggle to be able to build up the understanding of large neural networks and make this sort of analysis really useful. But it has the upside that we’re working with such small pieces that we can really objectively understand what’s going on… And I think there’s just a lot of disagreement and confusion, and I think it’s just genuinely really hard to understand neural networks and very easy to misunderstand them, so having something like that seems really useful.


Chris Olah: Just like animals have very similar anatomies — I guess in the case of animals due to evolution — it seems neural networks actually have a lot of the same things forming, even when you train them on different data sets, even when they have different architectures, even though the scaffolding is different. The same features and the same circuits form. And actually I find that the fact that the same circuits form to be the most remarkable part. The fact that the same features form is already pretty cool, that the neural network is learning the same fundamental building blocks of understanding vision or understanding images.

Chris Olah: But then, even though it’s scaffolded on differently, it’s literally learning the same weights, connecting the same neurons together. So we call that ‘universality.’ And that’s pretty crazy. It’s really tempting when you start to find things like that to think “Oh, maybe the same things form also in humans. Maybe it’s actually something fundamental.” Maybe these models are discovering the basic building blocks of vision that just slice up our understanding of images in this very fundamental way.

Chris Olah: And in fact, for some of these things, we have found them in humans. So some of these lower-level vision things seem to mirror results from neuroscience. And in fact, in some of our most recent work, we’ve discovered something that was previously only seen in humans, these multimodal neurons.

Multimodal neurons

Chris Olah: So we were investigating this model called CLIP from OpenAI, which you can roughly think of as being trained to caption images or to pair images with their captions. So it’s not classifying images, it’s doing something a little bit different. And we found a lot of things that were really deeply qualitatively different inside it. So if you look at low-level vision actually, a lot of it is very similar, and again is actually further evidence for universality.

Chris Olah: A lot of the same things we find in other vision models occur also in early vision in CLIP. But towards the end, we find these incredibly abstract neurons that are just very different from anything we’d seen before. And one thing that’s really interesting about these neurons is they can read. They can go and recognize text and images, and they fuse this together, so they fuse it together with the thing that’s being detected.

Chris Olah: So there’s a yellow neuron for instance, which responds to the color yellow, but it also responds if you write the word yellow out. That will fire as well. And actually it’ll fire if you write out the words for objects that are yellow. So if you write the word ‘lemon’ it’ll fire, or the word ‘banana’ will fire. This is really not the sort of thing that you expect to find in a vision model. It’s in some sense a vision model, but it’s almost doing linguistic processing in some way, and it’s fusing it together into what we call these multimodal neurons. And this is a phenomenon that has been found in neuroscience. So you find these neurons also for people. There’s a Spider-Man neuron that fires both for the word Spider-Man as an image, like a picture of the word Spider-Man, and also for pictures of Spider-Man and for drawings of Spider-Man.

Chris Olah: And this mirrors a really famous result from neuroscience of the Halle Berry neuron, or the Jennifer Aniston neuron, which also responds to pictures of the person and the drawings of the person and to the person’s name. And so these neurons seem in some sense much more abstract and almost conceptual, compared to the previous neurons that we found. And they span an incredibly wide range of topics.

Chris Olah: In fact, a lot of the neurons, you just go through them and it feels like something out of a kindergarten class, or an early grade-school class. You have your color neurons, you have your shape neurons, you have neurons corresponding to seasons of the year, and months, to weather, to emotions, to regions of the world, to the leader of your country, and all of them have this incredible abstract nature to them.

Chris Olah: So there’s a morning neuron that responds to alarm clocks and times of the day that are early, and to pictures of pancakes and breakfast foods — all of this incredible diversity of stuff. Or season neurons that respond to the names of the season and the type of weather associated with them and all of these things. And so you have all this incredible diversity of neurons that are all incredibly abstract in this different way, and it just seems very different from the relatively concrete neurons that we were seeing before that often correspond to a type of object or such.

Can this approach scale?

Chris Olah: I think this is a very reasonable concern, and is the main downside of circuits. So right now, I guess probably the largest circuit that we’ve really carefully understood is at 50,000 parameters. And meanwhile, the largest language models are in the hundreds of billions of parameters. So there’s quite a few orders of magnitudes of difference that we need to get past if we want to even just get to the modern language models, let alone future generations of neural networks. Despite that, I am optimistic. I think we actually have a lot of approaches to getting past this problem.

In the interview, Chris lays out several paths to addressing the scaling challenge. First, the basic approach to circuits might be more scalable than it seems at first glance, both because large models may become easier to understand in some ways, and because understanding recurring “motifs” can sometimes give order of magnitude simplifications (eg. equivariance). Relatedly, if the stakes are high enough, we might be willing to use large amounts of human labor to audit neural networks. Finally, circuits is a kind of epistemic foundation that we can build an understanding of “larger scale structure” like branches or “tissues” on top of. Possibly, these larger structures may either directly answer safety questions or help us focus our efforts to understand safety on portions of the model.

How wonderful it would be if this could succeed

Chris Olah: We’ve talked a lot about the ways this could fail, and I think it’s worth saying how wonderful it would be if this could succeed. It’s both that it’s potentially something that makes neural networks much safer, but there’s also just some way in which I think it would aesthetically be really wonderful if we could live in a world where we have… We could just learn so much and so many amazing things from these neural networks. I’ve already learned a lot about silly things, like how to classify dogs, but just lots of things that I didn’t understand before that I’ve learned from these models.

Chris Olah: You could imagine a world where neural networks are safe, but where there’s just some way in which the future is kind of sad. Where we’re just kind of irrelevant, and we don’t understand what’s going on, and we’re just humans who are living happy lives in a world we don’t understand. I think there’s just potential for a future — even with very powerful AI systems — that isn’t like that. And that’s much more humane and much more a world where we understand things and where we can reason about things. I just feel a lot more excited for that world, and that’s part of what motivates me to try and pursue this line of work.

Chris Olah: There’s this idea of a microscope AI. So people sometimes will talk about agent AIs that go and do things, and oracle AIs that just sort of give us wise advice on what to do. And another vision for what a powerful AI system might be like — and I think it’s a harder one to achieve than these others, and probably less competitive in some sense, but I find it really beautiful — is a microscope AI that just allows us to understand the world better, or shares its understanding of the world with us in a way that makes us smarter and gives us a richer perspective on the world. It’s something that I think is only possible if we could really succeed at this kind of understanding of models, but it’s… Yeah, aesthetically, I just really prefer it.

Ways that interpretability research could help us avoid disaster

Chris Olah: On the most extreme side, you could just imagine us fully, completely understanding transformative AI systems. We just understand absolutely everything that’s going on inside them, and we can just be really confident that there’s nothing unsafe going on in them. We understand everything. They’re not lying to us. They’re not manipulating us. They are just really genuinely trying to be maximally helpful to us. And sort of an even stronger version of that is that we understand them so well that we ourselves are able to become smarter, and we sort of have a microscope AI that gives us this very powerful way to see the world and to be empowered agents that can help create a wonderful future.

Chris Olah: Okay. Now let’s imagine that actually interpretability doesn’t succeed in that way. We don’t get to the point where we can totally understand a transformative AI system. That was too optimistic. Now what do we do? Well, maybe we’re able to go and have this kind of very careful analysis of small slices. So maybe we can understand social reasoning and we can understand whether the model… We can’t understand the entire model, but we can understand whether it’s being manipulative right now, and that’s able to still really reduce our concerns about safety. But maybe even that’s too much to ask. Maybe we can’t even succeed at understanding that small slice.

Chris Olah: Well, I think then what you can fall back to is maybe just… With some probability you catch problems, you catch things where the model is doing something that isn’t what you want it to do. And you’re not claiming that you would catch even all the problems within some class. You’re just saying that with some probability, we’re looking at the system and we catch problems. And then you sort of have something that’s kind of like a mulligan. You made a mistake and you’re allowed to start over, where you would have had a system that would have been really bad and you realize that it’s bad with some probability, and then you get to take another shot.

Chris Olah: Or maybe as you’re building up to powerful systems, you’re able to go and catch problems with some probability. That sort of gives you a sense of how common safety problems are as you build more powerful systems. Maybe you aren’t very confident you’ll catch problems in the final system, but you can sort of help society be calibrated on how risky these systems are as you build towards that.

Scaling Laws

Chris Olah: Basically when people talk about scaling laws, what they really mean is there’s a straight line on a log-log plot. And you might ask, “Why do we care about straight lines on log-log plots?” There’s different scaling laws for different things, and the axes depend on which scaling law you’re talking about. But probably the most important scaling law — or the scaling law that people are most excited about — has model size on one axis and loss on the other axis. [So] high loss means bad performance. And so the observation you have is that there is a straight line. Where as you make models bigger, the loss goes down, which means the model is performing better. And it’s a shockingly straight line over a wide range.

Chris Olah: The really exciting version of this to me … is that maybe there’s scaling laws for safety. Maybe there’s some sense in which whether a model is aligned with you or not may be a function of model size, and sort of how much signal you give it to do the human-aligned task. And we might be able to reason about the safety of models that are larger than the models we can presently build. And if that’s true, that would be huge. I think there’s this way in which safety is always playing catch-up right now, and if we could create a way to think about safety in terms of scaling laws and not have to play catch-up, I think that would be incredibly remarkable. So I think that’s something that — separate from the capabilities implications of safety laws — is a reason to be really excited about them.

Related episodes

About the show

The 80,000 Hours Podcast features unusually in-depth conversations about the world's most pressing problems and how you can use your career to solve them. We invite guests pursuing a wide range of career paths — from academics and activists to entrepreneurs and policymakers — to analyse the case for and against working on different issues and which approaches are best for solving them.

The 80,000 Hours Podcast is produced and edited by Keiran Harris. Get in touch with feedback or guest suggestions by emailing [email protected].

What should I listen to first?

We've carefully selected 10 episodes we think it could make sense to listen to first, on a separate podcast feed:

Check out 'Effective Altruism: An Introduction'

Subscribe here, or anywhere you get podcasts:

If you're new, see the podcast homepage for ideas on where to start, or browse our full episode archive.