Chris Olah: Well, in the last couple of years, neural networks have been able to accomplish all of these tasks that no human knows how to write a computer program to do directly. We can’t write a computer program to go and classify images, but we can write a neural network to create a computer program that can classify images. We can’t go and write computer programs directly to go and translate text highly accurately, but we can train the neural network to go and translate texts much better than any program we could have written.
Chris Olah: And it’s always seemed to me that the question that is crying out to be answered there is, “How is it that these models are doing these things that we don’t know how to do?” I think a lot of people might debate exactly what interpretability is. But the question that I’m interested in is, “How do these systems accomplish these tasks? What’s going on inside of them?”
Chris Olah: Imagine if some alien organism landed on Earth and could go and do these things. Everybody would be rushing and falling over themselves to figure out how the alien organism was doing things. You’d have biologists fighting each other for the right to go and study these alien organisms. Or imagine that we discovered some binary just floating on the internet in 2012 that could do all these things. Everybody would be rushing to go and try and reverse engineer what that binary is doing. And so it seems to me that really the thing that is calling out in all this work for us to go and answer is, “What in the wide world is going on inside these systems?”
Chris Olah: The really amazing thing is that as you start to understand what different neurons are doing, you actually start to be able to go and read algorithms off of the weights… We can genuinely understand how large chunks of neural networks work. We can actually reverse engineer chunks of neural networks and understand them so well that we can go and hand-write weights… I think that is a very high standard for understanding systems.
Chris Olah: Ultimately the reason I study (and think it’s useful to study) these smaller chunks of neural network is that it gives us an epistemic foundation for thinking about interpretability. The cost is that we’re talking to these small parts and we’re setting ourselves up for a struggle to be able to build up the understanding of large neural networks and make this sort of analysis really useful. But it has the upside that we’re working with such small pieces that we can really objectively understand what’s going on… And I think there’s just a lot of disagreement and confusion, and I think it’s just genuinely really hard to understand neural networks and very easy to misunderstand them, so having something like that seems really useful.
Chris Olah: Just like animals have very similar anatomies — I guess in the case of animals due to evolution — it seems neural networks actually have a lot of the same things forming, even when you train them on different data sets, even when they have different architectures, even though the scaffolding is different. The same features and the same circuits form. And actually I find the fact that the same circuits form to be the most remarkable part. The fact that the same features form is already pretty cool, that the neural network is learning the same fundamental building blocks of understanding vision or understanding images.
Chris Olah: But then, even though it’s scaffolded on differently, it’s literally learning the same weights, connecting the same neurons together. So we call that ‘universality.’ And that’s pretty crazy. It’s really tempting when you start to find things like that to think “Oh, maybe the same things form also in humans. Maybe it’s actually something fundamental.” Maybe these models are discovering the basic building blocks of vision that just slice up our understanding of images in this very fundamental way.
Chris Olah: And in fact, for some of these things, we have found them in humans. So some of these lower-level vision things seem to mirror results from neuroscience. And in fact, in some of our most recent work, we’ve discovered something that was previously only seen in humans, these multimodal neurons.
Chris Olah: So we were investigating this model called CLIP from OpenAI, which you can roughly think of as being trained to caption images or to pair images with their captions. So it’s not classifying images, it’s doing something a little bit different. And we found a lot of things that were really deeply qualitatively different inside it. So if you look at low-level vision actually, a lot of it is very similar, and again is actually further evidence for universality.
Chris Olah: A lot of the same things we find in other vision models occur also in early vision in CLIP. But towards the end, we find these incredibly abstract neurons that are just very different from anything we’d seen before. And one thing that’s really interesting about these neurons is they can read. They can go and recognize text in images, and they fuse this together, so they fuse it together with the thing that’s being detected.
Chris Olah: So there’s a yellow neuron for instance, which responds to the color yellow, but it also responds if you write the word yellow out. That will fire as well. And actually it’ll fire if you write out the words for objects that are yellow. So if you write the word ‘lemon’ it’ll fire, or the word ‘banana’ will fire. This is really not the sort of thing that you expect to find in a vision model. It’s in some sense a vision model, but it’s almost doing linguistic processing in some way, and it’s fusing it together into what we call these multimodal neurons. And this is a phenomenon that has been found in neuroscience. So you find these neurons also for people. There’s a Spider-Man neuron that fires both for the word Spider-Man as an image, like a picture of the word Spider-Man, and also for pictures of Spider-Man and for drawings of Spider-Man.
Chris Olah: And this mirrors a really famous result from neuroscience of the Halle Berry neuron, or the Jennifer Aniston neuron, which also responds to pictures of the person and the drawings of the person and to the person’s name. And so these neurons seem in some sense much more abstract and almost conceptual, compared to the previous neurons that we found. And they span an incredibly wide range of topics.
Chris Olah: In fact, a lot of the neurons, you just go through them and it feels like something out of a kindergarten class, or an early grade-school class. You have your color neurons, you have your shape neurons, you have neurons corresponding to seasons of the year, and months, to weather, to emotions, to regions of the world, to the leader of your country, and all of them have this incredible abstract nature to them.
Chris Olah: So there’s a morning neuron that responds to alarm clocks and times of the day that are early, and to pictures of pancakes and breakfast foods — all of this incredible diversity of stuff. Or season neurons that respond to the names of the season and the type of weather associated with them and all of these things. And so you have all this incredible diversity of neurons that are all incredibly abstract in this different way, and it just seems very different from the relatively concrete neurons that we were seeing before that often correspond to a type of object or such.
Can this approach scale?
Chris Olah: I think this is a very reasonable concern, and is the main downside of circuits. So right now, I guess probably the largest circuit that we’ve really carefully understood is at 50,000 parameters. And meanwhile, the largest language models are in the hundreds of billions of parameters. So there’s quite a few orders of magnitude of difference that we need to get past if we want to even just get to the modern language models, let alone future generations of neural networks. Despite that, I am optimistic. I think we actually have a lot of approaches to getting past this problem.
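As rough arithmetic on the gap Chris describes, the sketch below computes how many orders of magnitude separate a 50,000-parameter circuit from a language model with "hundreds of billions" of parameters. The 50,000 figure comes from the interview; reading "hundreds of billions" as the range 100 billion to 900 billion is an assumption made only for illustration, as is the helper name `orders_of_magnitude`.

```python
import math

# From the interview: the largest carefully understood circuit.
LARGEST_UNDERSTOOD_CIRCUIT = 50_000

# Assumption for illustration: "hundreds of billions" read as 100e9 to 900e9.
LLM_PARAMS_LOW = 100e9
LLM_PARAMS_HIGH = 900e9


def orders_of_magnitude(small: float, large: float) -> float:
    """Return log10 of the ratio large / small."""
    return math.log10(large / small)


gap_low = orders_of_magnitude(LARGEST_UNDERSTOOD_CIRCUIT, LLM_PARAMS_LOW)
gap_high = orders_of_magnitude(LARGEST_UNDERSTOOD_CIRCUIT, LLM_PARAMS_HIGH)

print(f"Gap: between {gap_low:.1f} and {gap_high:.1f} orders of magnitude")
```

Under that assumed range, the gap works out to somewhere between six and a bit over seven orders of magnitude, which is one way to read "quite a few."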
In the interview, Chris lays out several paths to addressing the scaling challenge. First, the basic approach to circuits might be more scalable than it seems at first glance, both because large models may become easier to understand in some ways, and because understanding recurring “motifs” can sometimes give order-of-magnitude simplifications (e.g., equivariance). Relatedly, if the stakes are high enough, we might be willing to use large amounts of human labor to audit neural networks. Finally, circuits provide a kind of epistemic foundation on which we can build an understanding of “larger scale structure” like branches or “tissues.” These larger structures may either directly answer safety questions or help us focus our efforts to understand safety on particular portions of the model.
How wonderful it would be if this could succeed
Chris Olah: We’ve talked a lot about the ways this could fail, and I think it’s worth saying how wonderful it would be if this could succeed. It’s both that it’s potentially something that makes neural networks much safer, but there’s also just some way in which I think it would aesthetically be really wonderful if we could live in a world where we have… We could just learn so much and so many amazing things from these neural networks. I’ve already learned a lot about silly things, like how to classify dogs, but just lots of things that I didn’t understand before that I’ve learned from these models.
Chris Olah: You could imagine a world where neural networks are safe, but where there’s just some way in which the future is kind of sad. Where we’re just kind of irrelevant, and we don’t understand what’s going on, and we’re just humans who are living happy lives in a world we don’t understand. I think there’s just potential for a future — even with very powerful AI systems — that isn’t like that. And that’s much more humane and much more a world where we understand things and where we can reason about things. I just feel a lot more excited for that world, and that’s part of what motivates me to try and pursue this line of work.
Chris Olah: There’s this idea of a microscope AI. So people sometimes will talk about agent AIs that go and do things, and oracle AIs that just sort of give us wise advice on what to do. And another vision for what a powerful AI system might be like — and I think it’s a harder one to achieve than these others, and probably less competitive in some sense, but I find it really beautiful — is a microscope AI that just allows us to understand the world better, or shares its understanding of the world with us in a way that makes us smarter and gives us a richer perspective on the world. It’s something that I think is only possible if we could really succeed at this kind of understanding of models, but it’s… Yeah, aesthetically, I just really prefer it.
Ways that interpretability research could help us avoid disaster
Chris Olah: On the most extreme side, you could just imagine us fully, completely understanding transformative AI systems. We just understand absolutely everything that’s going on inside them, and we can just be really confident that there’s nothing unsafe going on in them. We understand everything. They’re not lying to us. They’re not manipulating us. They are just really genuinely trying to be maximally helpful to us. And sort of an even stronger version of that is that we understand them so well that we ourselves are able to become smarter, and we sort of have a microscope AI that gives us this very powerful way to see the world and to be empowered agents that can help create a wonderful future.
Chris Olah: Okay. Now let’s imagine that actually interpretability doesn’t succeed in that way. We don’t get to the point where we can totally understand a transformative AI system. That was too optimistic. Now what do we do? Well, maybe we’re able to go and have this kind of very careful analysis of small slices. So maybe we can understand social reasoning and we can understand whether the model… We can’t understand the entire model, but we can understand whether it’s being manipulative right now, and that’s able to still really reduce our concerns about safety. But maybe even that’s too much to ask. Maybe we can’t even succeed at understanding that small slice.
Chris Olah: Well, I think then what you can fall back to is maybe just… With some probability you catch problems, you catch things where the model is doing something that isn’t what you want it to do. And you’re not claiming that you would catch even all the problems within some class. You’re just saying that with some probability, we’re looking at the system and we catch problems. And then you sort of have something that’s kind of like a mulligan. You made a mistake and you’re allowed to start over, where you would have had a system that would have been really bad and you realize that it’s bad with some probability, and then you get to take another shot.
Chris Olah: Or maybe as you’re building up to powerful systems, you’re able to go and catch problems with some probability. That sort of gives you a sense of how common safety problems are as you build more powerful systems. Maybe you aren’t very confident you’ll catch problems in the final system, but you can sort of help society be calibrated on how risky these systems are as you build towards that.
Chris Olah: Basically when people talk about scaling laws, what they really mean is there’s a straight line on a log-log plot. And you might ask, “Why do we care about straight lines on log-log plots?” There’s different scaling laws for different things, and the axes depend on which scaling law you’re talking about. But probably the most important scaling law — or the scaling law that people are most excited about — has model size on one axis and loss on the other axis. [So] high loss means bad performance. And so the observation you have is that there is a straight line. Where as you make models bigger, the loss goes down, which means the model is performing better. And it’s a shockingly straight line over a wide range.
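A straight line on a log-log plot corresponds to a power law: if loss = a * size^(-alpha), then log(loss) = log(a) - alpha * log(size), which is linear in log(size). The sketch below fits such a line to made-up data. Everything numeric here (the exponent, the constant, the noise level, the model sizes) is a synthetic assumption chosen for illustration, not a measurement from any real model, and `fit_power_law` is a hypothetical helper name.

```python
import numpy as np


def fit_power_law(model_sizes, losses):
    """Fit loss = a * size**(-alpha) via a straight-line fit in log-log space.

    Returns (alpha, a).
    """
    log_size = np.log10(np.asarray(model_sizes, dtype=float))
    log_loss = np.log10(np.asarray(losses, dtype=float))
    # Degree-1 polynomial fit: log_loss ~= slope * log_size + intercept.
    slope, intercept = np.polyfit(log_size, log_loss, 1)
    return -slope, 10.0 ** intercept


# Synthetic data (assumed values, for illustration only).
TRUE_ALPHA = 0.08
TRUE_A = 12.0
sizes = np.logspace(6, 10, 9)  # 1e6 to 1e10 parameters
rng = np.random.default_rng(0)
noise = np.exp(rng.normal(0.0, 0.005, sizes.size))  # small multiplicative noise
losses = TRUE_A * sizes ** (-TRUE_ALPHA) * noise

alpha_hat, a_hat = fit_power_law(sizes, losses)

# Extrapolate the fitted line to a model 10x larger than any in the data.
predicted_loss = a_hat * (1e11) ** (-alpha_hat)

print(f"fitted alpha = {alpha_hat:.4f}, fitted a = {a_hat:.3f}")
print(f"extrapolated loss at 1e11 parameters = {predicted_loss:.4f}")
```

The extrapolation step is the reason people care about the line being straight: if the trend holds, a fit on small models says something about larger models that have not been built yet. Whether a real trend continues beyond the fitted range is an empirical question the sketch cannot answer.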
Chris Olah: The really exciting version of this to me … is that maybe there’s scaling laws for safety. Maybe there’s some sense in which whether a model is aligned with you or not may be a function of model size, and sort of how much signal you give it to do the human-aligned task. And we might be able to reason about the safety of models that are larger than the models we can presently build. And if that’s true, that would be huge. I think there’s this way in which safety is always playing catch-up right now, and if we could create a way to think about safety in terms of scaling laws and not have to play catch-up, I think that would be incredibly remarkable. So I think that’s something that — separate from the capabilities implications of scaling laws — is a reason to be really excited about them.