Transcript
Cold open [00:00:00]
Neel Nanda: One of the most important lessons I’ve learned is that you can just do things. Part of this is what I think of as maximising your luck surface area: you want to just have as many opportunities as possible for good opportunities to come your way; you want to know people; you want to be someone who sometimes says yes, so people bring things to you.
I kind of ended up running the DeepMind team by accident. I joined DeepMind expecting to be an individual researcher. Unexpectedly, the lead decided to step down a few months after I joined, and in the months since, I kind of ended up stepping into their place. I did not know if I was going to be good at this. I think it’s gone reasonably well.
And to me, this is both an example of the importance of having luck surface area, being in a situation where opportunities like that can arise, but also, you should just say yes to things — even if they seem kind of scary, you’re not confident they’ll go well — so long as the downside is pretty low.
AI is reshaping the world. This is like one of the most important things happening. I just can’t really imagine wanting a career where I’m not at the forefront of this helping it go better.
Who’s Neel Nanda? [00:01:12]
Rob Wiblin: Today I have the pleasure of speaking with the legendary Neel Nanda, who runs the mechanistic interpretability team at Google DeepMind. Mechanistic interpretability is this research project to try to figure out why AI models do what they do and how they’re doing it, and why they’re choosing to do one thing over another — which until a couple of years ago, we had fairly little insight to.
But Neel, I guess you’ve been one of a handful of people who have helped to kind of grow this entire research project from quite a small effort four years ago to now something that’s quite meaningful, with hundreds of people contributing to answering those kinds of questions. So thanks so much for coming back on the podcast.
Neel Nanda: Yeah, I’m really excited to be here.
Luck surface area and making the right opportunities [00:01:46]
Rob Wiblin: One thing people might not immediately realise about you is that you are just 26, and despite having only been involved in mechanistic interpretability research for four years or so, you’ve contributed some foundational results to the field. You’ve worked at Anthropic, and you’re now leading a team at Google DeepMind. You’ve published many papers and gotten plenty of citations.
You’ve also mentored about 50 people over the last four years, and I think seven of them are now doing important work at frontier AI companies, and several of them are in important government or regulatory roles as well.
So I think it’s fair to say you’ve had a busy and very productive last four years. How have you managed to accomplish so much so early in your career?
Neel Nanda: Well, I’d love to say this is just because I’m incredibly smart and brilliant. I think I’m decent at what I do, but it’s mostly luck — which really should be a much more common answer to this kind of question. Maybe fleshing that out slightly: I think it’s a combination of being in the right place at the right time and being good at taking opportunities as they arise, and making the right opportunities for myself.
I think I got into mechanistic interpretability at the right time. There was a lot of excitement, but it was tiny. And if you get into a fast-growing thing like a startup or a research field, you can become one of the most experienced people extremely fast, in ways that are not actually that related to how good you are. And I think that we’ll discuss later some of the things around being good at how to make the right opportunities for yourself.
The other big thing is I just find managing and supervising people really fun. I think lots of researchers just very grudgingly do this, but I seem pretty good at it, and I can have so much more impact by just making like 10 to 20 other researchers better. And when you do this enough, it kind of adds up.
And the most important thing was not wasting five years of my life in a PhD. [laughs]
Rob Wiblin: Yeah, I forgot to mention that you actually only have an undergraduate degree, so you’ve saved many years there potentially.
Writing cold emails that aren’t insta-deleted [00:03:50]
Rob Wiblin: You mention cold emails. I imagine you get a fair few. I imagine that most cold emails to people in your situation, it’s very hard to engage on a deep level. But what kinds of cold emails do get your attention and actually get you to engage?
Neel Nanda: I think the main advice I have for cold emails is: assume the person reading this is busy and will stop reading at an uncertain point in the email. Therefore, you need to be as concise as possible and prioritise the key information as much as possible, ideally bolding keywords or phrases.
Personally, when I get an email, I’ll typically look at it a bit and if I can respond to it within a minute, decent chance I will. If it will require more effort than that, I have a pretty high bar. So one thing to ask is: “Is the thing that I want something that could be done in a minute, like a specific question?” If so, make that extremely prominent.
Otherwise, it’s more about catching someone’s attention. A few things that are good there.
One thing I think can be very impactful is essentially signalling competence. One way to do this is with having done impressive things or credentials. I don’t know, people are sometimes unwilling to boast, but honestly, it’s really helpful if they boast. I want people to just have like, “Here are the most impressive things about me, so you can prioritise.” Obviously this is incredibly noisy and annoying, but given that I get more emails than I have time to respond to, I need to be prioritising fast.
One trick for talking about how you’re impressive without seeming like a dick is saying, “I’m sure you must get many of these emails. So to help you prioritise, here’s some key info about me: blah, blah, blah” — I enthusiastically encourage anyone sending me an email to do that.
I also enthusiastically encourage anyone considering sending me an email to send someone else an email. Because I think a common mistake people make is they reach out to the most salient person in some area — like they reach out to the lead of a team they want to join, or a prominent professor in some research field. But the more prominent they are, the more emails they get, the busier they are, and the less likely they are to respond to you.
And for many things that people might want — like a technical question, mentorship, advice — I think that a more junior person is pretty well placed to give said advice. Especially if you want something time consuming like mentorship, I think junior people are often very able to give quite good mentorship and quite useful mentorship to someone new to the field — e.g. someone who’s joined my team in the last six months, so a lot more likely to have free time but still has a lot of useful advice to give.
One consequence of this is you should try to email first authors of papers a lot more than the kind of fancy academic at the end of the paper’s author list, because they’re just more likely to have time.
And one trick on the conciseness point: sometimes I’ve had people write a doc with a bunch of detail and then give me a one paragraph blurb and then link to the doc if I want to read more — because then I don’t look at it like, “This is a 3,000-word email; I’m not going to read this.” But if I’m interested then I can click through. I’m also a big fan of bullet points. Just be easy to skim and be clear about what you want.
I will also say my response to anyone who emails me asking for mentorship is that the main way I mentor people is through the MATS Program — applications currently open. I should really write that prominently on my website somewhere because I’ve sent that reply to too many people.
I also don’t think credentials are the only way you can signal impressiveness. Often having done something interesting — like, “I contributed a bunch to this meaningful open source library” or “I helped out with this paper in the following way” is more interesting to me than, “I am part of Lab X” or something.
And secondly, you can signal competence by asking good questions. There’s a lot of depth you can go into if you’re asking about a researcher’s work or things like that. And if I look at something, and I’m like, “That’s not a question I normally get asked, but that’s actually very reasonable,” or if someone’s like, “I want to extend the paper like this, is that a good idea?” and it’s a good idea, I’m down to teach them if I have time.
Rob Wiblin: Yeah, I think I’m not nearly as big a deal as you, and I don’t think I get as many cold emails as you do, but I can endorse almost all of that.
How Neel uses LLMs to get much more done [00:09:08]
Rob Wiblin: To what extent do you think junior people can in practice use LLMs or use AI to skill up and get closer to the frontier of knowledge or research ability more quickly today than they could four years ago when these tools were not really that useful?
Neel Nanda: Oh, so much. If you’re trying to get into a field nowadays and you’re not using LLMs, you’re making a mistake. This doesn’t look like using LLMs blindly for everything; it looks like understanding them as a tool, their strengths and weaknesses, and where they can help.
I think this has actually changed quite a lot over the last six or 12 months. I used to not really use LLMs much in my day-to-day life, and then a few months ago I started a quest to become an LLM power user, and now will randomly just work into conversation, “Have you considered using an LLM like this for your problem?”
So how should people think about this? One of the things that LLMs are actually very good at is lowering barriers to entry to a field. They’re quite bad at achieving expert performance in a domain, but they’re pretty good at junior-level performance in a domain. They aren’t necessarily perfectly reliable, but neither are junior people, so this is the wrong bar to have.
A couple of things that I think people don’t always get right. (Some of this might seem extremely obvious for anyone who’s good at using LLMs, but please bear with me.) The first one is just that system prompts are really helpful. You can give the model extremely detailed instructions for the kinds of tasks you want help with, and this will make it better at that. I’m a big fan of the projects feature some of the major providers have, where you can give a bunch of different prompts and maybe some useful contextual documents, and then talk in there.
A few tips for prompting. If you don’t feel like you’re very good at prompting, just get an LLM to write the prompt for you. One thing I personally find quite useful is I find it easier to think and brainstorm when I’m doing voice dictation. And LLMs are really good at just taking my rambly voice dictation and treating it as though it was coherent outputs. So I’ll just ramble at it about what’s the task, what are my criteria, what are the failing modes I don’t want, and then it will just write the prompt for me. If things go wrong, then you can give the LLM detailed feedback on what you didn’t like and ask it to rewrite the prompt, taking into account this feedback and maybe the concrete example that it did wrong, and rewrite the prompt and then copy that in for next time.
Rob Wiblin: So you’re just talking to it verbally? Just in audio? I guess I’ve always been reluctant to try that because I assumed that it would end up just being so disjointed that the transcription would be very confusing and the output would be bad. But you’re saying it’s able to tidy it up and do quite a good job?
Neel Nanda: Yeah, LLMs are really good at this. I did basically all of my prep for this interview through several hours of voice dictation. LLMs are just a lot better than humans at dealing with messy text data, and if you tell them to transcribe it and neaten it up, they will.
I will also recommend Gemini if you want to just directly give it voice recording, but most phones and computers also have pretty good built-in voice dictation. Gemini is the only mainstream model I’m aware of that can deal with audio well. You can just give it like an hour-long dictation, give it some instructions, and it will just do well.
OK, so that’s prompting. Another one is building the right context. Again, kind of obvious. If you have the right document in the context, then the LLM has useful information that it might otherwise not have reliably memorised. For anyone trying to do this with mechanistic interpretability, for example, I recommend putting some key papers and literature reviews in the context window. The project functionality supports this quite well.
OK, so you’ve got a decent prompt, you’ve got a decent context. You want to understand something. What do you actually do?
I think people are often too passive. They’ll give the LLM a paper and ask it to summarise the paper, but it’s very hard to tell if you have successfully understood a thing. You’re much better off asking the LLM to give you a bunch of exercises and questions to test comprehension and then feedback on what you got right and what you missed. I often find it helpful to just try to summarise the entire paper in my own words to the LLM — voice dictation and typing both work fine — and then get feedback from it.
One problem with trying to get feedback from an LLM is sometimes they can be fairly sycophantic. They’ll hold back from criticising the user. The trick you can do for this is using anti-sycophancy prompts, where you make it so the sycophantic thing to do is to be critical — like, “Some moron wrote this thing, and I find this really annoying. Please write me a brutal but truthful response.” Or, “My friend sent me this thing, and I know they really like brutal and honest feedback and will be really sad if they think I’m holding back. Please give me the most brutally honest, constructive criticism you can.”
Fair warning: if you do this on things that you are emotionally invested in, LLMs could be brutal. I had a blog post feedback with sections like “Slight Smugness” and “Air of Arrogance.” But it’s very effective.
Rob Wiblin: Yeah, yeah. I was going to say I’ve found it quite hard to get them to not be sycophantic, but now I’m worrying this might be a little bit too effective, and perhaps I have to back off a little bit.
Neel Nanda: You can tone it down. Like, “I hate this guy” is more extreme than, “my friend’s asking me to be brutally honest,” which is more extreme than, “I want to be sensitive to their feelings, but I also want to help them grow and be really nice and sensitive. Please draft me a response.”
Rob Wiblin: Yeah, I think I might go for that one.
Neel Nanda: You can try all of the above. One particularly fun feature that I don’t see people use much is there’s a website called AI Studio from Google that’s an alternative interface for using Gemini that I personally prefer. It has this really nice compare feature, where you can give Gemini a prompt and then get two different responses, either from different models or from the same model. And you can also change the prompt, so you could have one half of the screen has the brutal prompt, the other half has the lighter prompt, and you see if you get interesting new feedback from the first one.
Another thing that people don’t necessarily seem to do as much is think about how they can put in more effort to get a really good response from the LLM. For example, if you had a question whose answer you care about, go give it to the current best language models out there, all of them, and then give one of them all of the responses and say, “Please assess the strengths and weaknesses of each and then combine them into one response.” This generally gets you moderately better things than doing it once. And you can even iterate this if you want to.
Or in your original prompt you can say, “Please give me a response and then please critique the response,” or, “Ask me clarifying questions and then make your best guess for those clarifying questions and redraft it” — and then only ever read the second thing.
Rob Wiblin: That’s fascinating. I’ve never tried that. I guess you probably would only put in that effort for something that you really cared about.
Neel Nanda: I have some saved prompts that just do this. I use them for writing a lot. I’ll give them a really long voice memo and then have some saved prompts like this.
I also recommend that people use an app that lets you save text snippets. For example, Alfred on Mac is my one of choice. You can just write a really long prompt that you sometimes will want to use that has all of this elaborate “…and then critique yourself like this: blah, blah” — maybe you’ve got an LLM to write it — and you can make it so if you just type… I have mine set that if I type “>debug prompt” then I get my really long debugging prompt. And this means that you actually use this stuff because it’s really low friction.
And the final crucial domain is code. Basically if you’re writing code and not using LLMs, you’re doing something wrong. This is one of the things they are best at. They are not good for everything. Broadly, my advice is: if you’re trying to learn how to code, and the main benefit you get from something is the experience, don’t use an LLM except as a tutor. Write it yourself, because otherwise you won’t understand it.
For example, if you’re going through coding tutorials like Arena, if you are just trying to get something done — you don’t care about the code or if it’s any good, you’ll never build on it — then command-line agents like Gemini CLI or Claude Code, where you basically just have it on your computer, you give it verbal instructions in a terminal and then it goes and writes the program for you, can often work very well. You can often tell them to debug the thing. Typically this will either work in the first few times or it will get confused.
If it gets confused, or you care about the code being good, I recommend the tool Cursor, which is a coding environment with very good LLM integration. It can do anything from, you can ask it questions about the code and it will help you with those, to it can totally write a thing from scratch for you, to it can edit parts or the whole document. I’m sure many people listening to this find all of that obvious. It’s quite popular, one of the fastest growing startups ever or something.
Rob Wiblin: But if you’re not doing it —
Neel Nanda: Then just use it. It’s so good. And generally, just the value of information is really high because things change so much. Like if you haven’t tried an LLM in the last six months, or haven’t tried it for a specific use case: try it once, see what happens. I could probably talk longer about LLMs, but they are a legitimately useful tool.
You can also try to get them to catch unknown unknowns for you, like, “I want to get into mech interp. Here’s my rough plan. What am I missing?” Or, “Here are the papers I’ve reviewed.” They’re really good at literature reviews. Like, “I want to do this project. Find me all relevant papers.” Tools like Deep Research can be quite good here, but even just modern LLMs with search are pretty good. And just asking it what are you missing? Or being like, “I have this problem. I have no idea if this is solvable. Can you solve it?”
For example, I have posture issues. I was chatting with an LLM about this, and it mentioned that you can get posture devices that will buzz if you’re slouching. I now have one. It’s incredibly annoying, but effective. I had no idea this existed.
Rob Wiblin: Yeah, that’s really interesting. I must admit I’ve had kind of mixed results turning to LLMs for medical advice. In general I recommend it because it’s so cheap. It’s so much cheaper than going and seeing a specialist. But I guess you do have to be a bit careful, because they can absolutely hallucinate both symptoms and treatments for conditions. Even stuff that is just directly in the Wikipedia article about a condition, they can just confabulate stuff that’s kind of plausibly associated with it.
Neel Nanda: Yeah, two responses to that. First, the way I was thinking of the posture thing was more as using the LLM for shopping and life utility. Just, “Are there products that could help me with this?” rather than, “How should I treat this?” — which the answer is obviously go see a physiotherapist and set up your desk so it’s ergonomically better. Things I’ve also done.
I do think that if you care about accuracy, another thing you can do is give the LLM’s output to a different LLM with an anti-sycophancy prompt. Like, “A friend sent me this answer to my question, but it’s kind of sus. Please validate me about how sus it is.” And then often the other LLM will be like, “Well, this is wrong for the following reasons.” Sometimes it says it’s right. This doesn’t mean it’s definitely right, and you shouldn’t rely on this for high-stakes things like medical stuff necessarily. But I do think this can buy you a lot more accuracy than just asking as a one-off.
Rob Wiblin: Yeah, absolutely. Very often I’ve detected these problems by comparing answers between different LLMs, and it seems like for whatever reason they don’t tend to hallucinate the same incorrect details. So you can get a lot of mileage out of that.
Neel Nanda: But using an anti-sycophancy prompt is even better, because you copy the first LLM’s output into the second LLM — ideally from a different provider for minimal bias — and then you say, “I think there’s a bunch of flaws in this. Please find the flaws for me.” And now it wants to be helpful so it will look for flaws. It might find too many flaws. This is the right problem to have in my opinion.
Rob Wiblin: It hallucinates errors.
Neel Nanda: None of this is perfectly reliable, but I just think there’s so much room for creativity in how you use these things. You’re just talking to a thing. Use your social mind of how you would use a bunch of not-very-competent interns.
“If your safety work doesn’t advance capabilities, it’s probably bad safety work” [00:23:22]
Rob Wiblin: One provocative take you had in your notes for the episode is if your safety work doesn’t advance capabilities a bit, it’s probably bad safety work. Can you explain and defend that?
Neel Nanda: Yeah. I often see people in the safety community criticising safety work because they’re like, “But isn’t this capabilities work?” In the sense of, “Doesn’t this make the model better?”
I think this just doesn’t make sense as a critique, because the goal of safety is to have models that do what we want. Even related things like control are about trying to have models that don’t do the things we don’t want. This is incredibly commercially useful. We’re already seeing safety issues like reward hacking make systems less commercially applicable. Things like hallucinations, jailbreaks. And I expect as time goes on, the more important safety issues from an AGI safety perspective will also start to matter.
This means that criticising work for being useful just doesn’t really make sense. I would in fact criticise work that doesn’t have a path to being useful as either: it’s something like evals that’s kind of a different theory of impact, or it sure sounds like you don’t have a plan for making systems better at doing what we want with this technique. What’s the point?
I want to emphasise that I’m not just saying you should do whatever you want. I think that there’s work that differentially advances safety over just general capabilities and work that doesn’t. But I just think it’s pretty nuanced, and I think that people might have counterarguments like, “But if it helps the system be a more useful product, won’t companies do it?” And I just don’t think this is a realistic model of how this kind of thing works.
Rob Wiblin: Yeah, I was going to raise that issue: if your safety work is doing a lot to make the model more useful, and especially more commercially useful, then plausibly the company will invest in it, regardless of whether you’re involved or not — because it’s just on the critical path of actually making the product that they can sell. But I guess you just think things are a bit more scrappy and ad hoc than that. It’s not necessarily the case that AI companies are always doing all of the things that would make their products better. Not by any means.
Neel Nanda: Kind of. The way I’d think about it is it’s kind of a depth and prep time thing. So if something becomes an issue, people will try to fix it — but there’ll be a lot of urgency, and there won’t be as much room to research more creative solutions; there won’t be enough room to try to do the thing properly. Often if there’s multiple approaches to a thing, and I think some are better than others, there won’t be interest in trying the less proven thing.
For example, the standard way to fix a safety issue is to just add more fine-tuning data fixing the issue. This often works. But I think that for things like deception, it would not work. And I want to make sure there are other tools that people see as realistic options — like chain of thought monitoring or evaluations with interpretability — that people see as an important part of the process that are realistic options. And I think that if you just wait for commercial incentives to take over, you don’t really get this.
I think that there’s work that makes models better without really getting safety benefits, and there’s work that is centrally about making the thing do what we want more. I think people should try to do the second kind of thing. But the question is: “Does this differentially advance safety?” Not, “Does this advance capabilities at all?” And if it advanced both a lot, that might be excellent.
Why Neel refuses to share his p(doom) [00:27:22]
Rob Wiblin: Another take that you had in your notes is that people who are part of the AI safety ecosystem in general, you think in general that they tend to be quite overconfident about how they picture the future going. Can you explain that?
Neel Nanda: Yeah. A common question people get asked in podcasts or talks if they’re a safety figure is like, “What’s your p(doom)? What are your timelines?”
These are ridiculously complicated questions. Forecasting the future of technology is notoriously incredibly hard. It ties into things like geopolitics, economics, how machine learning will progress, how commercialisable these things will be, how hardware production can be scaled up. Asking about things like how difficult alignment will be is just a really complicated question. And I think people have often been wrong about empirically what’s going to happen.
I generally just think that, as a community, we should have more intellectual humility. I have a policy that I just refuse to answer in public what my p(doom) or timelines are, because I think people just anchor too much on what someone prominent says, without thinking about, “Does the fact that Neel is good at interpretability mean he knows what he’s talking about with timelines?”
More importantly, I just don’t actually think it’s that action relevant. I think that we could live in a world where we’re going to get AGI incredibly soon, we could live in a world with it kind of medium — next 10 to 20 years — and we could live in a world where it’s much longer. All of these are plausible. I just don’t really see a reasonable argument for how you could get enough of them below plausible that we shouldn’t just act as though these are all realistic concerns — and we should either do stuff that is useful among all of them, or we should be picking the one we think is highest priority, i.e. the “AGI soon” one.
Similarly for alignment risk: maybe we’re fine by default, or maybe this is a totally intractable problem as-is and we need totally different approaches. We should be forming plans for all of these worlds.
Rob Wiblin: You’re saying that we don’t know exactly when AGI will come, but we should almost operate as if we believe that it was going to come soon, because that’s a particularly important scenario and one in which there’s particularly important work to be done?
Neel Nanda: Kind of. What I’m saying is we should be uncertain, and we should think about what the portfolio of actions by the safety community, given this uncertainty, should be.
I think quite a lot of it should be on the short timelines world — both because I think this is plausible enough to be concerning, and because I think that lots of the work done there still looks pretty good under the other worldviews. One framing I quite like is that of “AGI tomorrow”: try to do things so that if AGI happens tomorrow, we are in an OK situation.
And I would think it would be a mistake if the community only did things that were really short term. Clearly part of our portfolio should go into things with at least a six-month-plus payoff horizon. Some things should probably go into things with a five-year-plus payoff horizon.
But it’s a question of resource allocation — and maybe the exact ways I’d allocate the portfolio would change depending on my probabilities, but probably not that much, and I think we’re sufficiently far from optimal anyway that it’s just not that decision relevant. It’s plausible enough to be concerning, but we shouldn’t take it as a certainty, and we should be hesitant to do things that would just massively burn bridges or torch the community’s entire credibility if it takes more than five years for AGI.
How Neel went from the couch to an alignment rocketship [00:31:24]
Rob Wiblin: How did you end up deciding that you wanted to do AI technical safety research?
Neel Nanda: It was a fairly meandering path, to be honest. So back in 2013ish, a long time ago, when I was like 14, I came across this really good Harry Potter fan fiction — which led to me learning about EA, reading about AI safety, and broadly saying, “Yeah, these people seem pretty reasonable. I buy these arguments.”
I then proceeded to do absolutely nothing about this for a very long time, and in uni I ended up doing some quant finance internships at places like Jane Street. And contrary to popular belief, quant finance is kind of great. Would recommend if you don’t care about AI safety. And there was a world where I would have ended up going down that route. But I in part, thanks to an 80k career advising call — so thanks for that; would recommend, and people should check them out —
Rob Wiblin: 80000hours.org: like and subscribe!
Neel Nanda: [laughs] — I realised that I just didn’t actually understand AI safety research and what this actually meant. I’d kind of come across things like super mathsy work — this is like five, six years ago — things like MIRI’s logical induction work. And I was like, “I do not understand how this is useful.” I still don’t understand how this is useful, so points to past me.
But I got put in touch with some people in the community who were doing interesting work. I got exposed to arguments like, “Now that we have GPT-3, we can just actually do interesting things that we couldn’t previously do” — well, cryptic references to OpenAI’s internal very advanced system, back in the day.
And I think one of the things that really clicked was… So this kind of came to a head after I finished my undergrad. I was planning on doing a master’s, but then a global pandemic happened which made that a more questionable plan. I had an offer to go work at Jane Street which was a pretty good option, but I realised that sometimes there’s more than two options in life, which took me far too long to notice. And also that if I went and tried AI safety and then it was terrible, I could just not.
So I still wasn’t sure I wanted to do technical AI safety, but I was like, “Well, I should probably check.” So I took a gap year and did three AI safety internships at different labs. Honestly, I didn’t massively enjoy any of them. I didn’t really vibe with the research area that much, and much of this was during COVID which was meh. But I also just learned a lot more about the issue, about the field — and became a lot more convinced that this was real, there was useful work to be done, and I could do it even if it wasn’t the things I’d been doing.
I then lucked out and got an offer from Chris Olah to be one of the early employees at Anthropic. I was like the fourth person onto their interpretability team. It’s now like 25+ people or something. And I just kind of fell in love with the field.
I think something that I kind of struggled with when hearing EA messaging, when I was thinking about my career, was being like, “Man, even if I think an option is the best thing for the world, if I don’t want to do it, I just don’t want to do it. I don’t really think I could day in, day out do a thing that I don’t enjoy.”
And I think one of the things that clicked for me over that year was that the world isn’t fair — which also means that sometimes you just get free wins, and there can be impactful research directions that are just also very fun and intellectually stimulating and exciting. And mech interp clicked for me in a way the other ones hadn’t. And yeah, I have been doing mech interp ever since. So thanks for that advising chat!
I think one thing I also want to maybe call out to listeners is: I think that now if I was making this decision, it would be a lot more obvious that I should go into AI safety. GPT-3 had just happened. This was a long time ago. It was much less clear that these issues were even real. And now AI is reshaping the world. This is one of the most important things happening. I just can’t really imagine wanting a career where I’m not at the forefront of this, helping it go better.
And a lot of my uncertainties — What does working on safety even look like? What are people doing? Is there really empirical work to be done? — are now obviously resolved. I think there’s much better infrastructure, with educational materials and programmes like MATS, and just a lot more roles around. I just think that if someone’s listening to this, and kind of relates with the things I’ve been saying — of, “I should do something to make AI go better, because I agree these are real problems, but I don’t really know how, and this seems kind of intimidating, and I don’t really know when I should do this or what it should look like” — I think you should just do it now.
Rob Wiblin: It’s good to get clear advice. Usually people tend to equivocate a little.
Neel Nanda: One clarification: I think that people — especially technical, mathematically minded or computer-science-focused people — assume they should do technical AI safety research. I think this is a good path.
But at this point I think AI is just reshaping the whole world. If we want the transition to AGI to go well, there’s just so many things that we’ll need to have talented people who understand these issues working in them. We really need good policymakers, both civil servants and people in politics. We need journalists and people generally communicating and educating the public. We need people modelling the economic impacts. We need people forecasting this stuff.
And that’s just off the top of my head. I mean, this is a thing that will, unless I am very wrong about something, at some point in my lifetime, dramatically reshape the world. Even if the current bubble bursts. We just need a lot of people doing a lot of things to make this go well.
Rob Wiblin: What was the main thing that came up on the 80,000 Hours advising call? Did I pick up that one of the influences was to get you to consider a wider range of different options?
Neel Nanda: Honestly, I left the call feeling kind of annoyed that they just kept telling me to do AI safety. [laughs] Points to them. Good call.
Rob Wiblin: Yeah. I guess it’s dangerous to give such concrete advice, because someone might take it and it’s the wrong idea. But I guess in this case…
Neel Nanda: Yes. I think the most useful thing that came out of it was just being put in touch with some more established researchers, and just talking to them and being like, “Huh. You are a sane person doing a legible, high status job whose research makes sense to me. This feels so much more like a career path I could imagine myself getting into.” And one of the things I hope to do with podcasts like this is provide a bit more of that.
Navigating towards impact at a frontier AI company [00:39:24]
Rob Wiblin: What’s something interesting you’ve learned about trying to have an impact in a frontier AI company, especially a large organisation like Google DeepMind?
Neel Nanda: So I definitely learned a lot about how organisations work, in my time actually working in real companies. But maybe to begin, I just want to talk about what I’ve learned about large organisations in general — nothing to do with DeepMind specifically — which is that it’s very easy to think of them as a monolith: there’s some entity with a single decision maker who is acting according to some objective.
But this is just fundamentally not a practical way to run an organisation of more than, say, 1,000 people. If it’s a startup, everyone can kind of know what’s going on, know each other, have context. But if you’ve got enough people, you need structure and bureaucracy. There are a bunch of people who decision making power is delegated to. There are a bunch of stakeholders who are responsible for protecting different things important to the org, who will represent those interests.
These decision makers are busy people, and they will often have advisors or sub-decision makers they listen to. And sometimes decisions get made pretty far down the tree, but if they’re important, or if there’s enough disagreement, they go to more and more senior people until you have someone who’s able to just make a decision. But this means that if you go into things expecting any large organisation to be acting like a single, perfectly coherent entity, you’ll just make incorrect predictions.
Rob Wiblin: Like what? Does it mean that they end up making conflicting decisions? That one group over here might be pushing in this direction, another group over there might be pushing in another direction, and until something has escalated to a manager who has oversight over all of it, you can just have substantial incoherency?
Neel Nanda: So that kind of thing can happen. But maybe the thing that I found most striking is I think of it as these companies are not efficient markets internally.
Unpacking what I mean by that: when I’m considering trading in the stock market, I generally don’t. That’s on the grounds of: if there was money to be made, someone else will have already made it. And I kind of had some similar intuition of, “If there’s a thing to do that will make the company money, someone else is probably making it happen; if it won’t make the company money, probably no one will let this happen. Therefore, I don’t really know how much impact a safety team could have.”
I now think this is incredibly wrong in many ways. Even just within the finance analogy, markets are not actually perfectly efficient: hedge funds sometimes make a lot of money because they have experts who know more and can spot things people are missing. As the AGI safety team, we can spot AGI-safety-relevant opportunities that other people are missing.
But more importantly, financial markets just have a tonne of people whose sole job is spotting inefficiencies and fixing them. Companies generally do not have people whose sole job is looking over the company as a whole and spotting inefficiencies and fixing them — especially when it comes to ways you could add value safety wise. People are often busy. People need to care about many things.
There are often some cultural factors that lead to me prioritising different things. For example, a pretty common mindset in machine learning people is: focus on the problems that are there today. Focus on the things that you’re very confident are issues now, and fix those. It’s really hard to predict future problems. Don’t even bother; just prioritise noticing problems and fixing them.
In many contexts I think this is actually extremely reasonable. I think that it’s more difficult with safety because there can be more subtle problems, and it can take longer to fix the problems, so you need to start early on. And this means that if we can identify opportunities like that, there’s just a lot of things where no one minds the thing happening; people might actively be pro the thing happening — but if safety teams don’t make them happen, it will take a really long time, or won’t happen at all.
Rob Wiblin: It sounds like some of these companies, I would imagine, are very out of equilibrium in some sense, because the rate of change just in the industry and in the technology is very fast. They don’t actually have the necessary staff to take all of the good opportunities or even to analyse all of the opportunities that they have for different projects that they could take on, and sort them from best to worst and do the good ones and not the bad ones.
So it’s a lot chancier what things the company ends up doing versus not. It can depend on the idiosyncratic views of individual people, rather than some lengthy, considered process. Is that fair to say?
Neel Nanda: Broadly. And I want to emphasise that, by and large, I don’t think this is people being negligent or unreasonable or anything like that. It’s just people with different perspectives, different philosophies of how you prioritise solving problems, what they view as their job. And as someone who’s very concerned about AGI safety, I can help kind of nudge this in a better direction.
Rob Wiblin: Yeah. I think you’ve recently started work on kind of an applied mechanistic interpretability team, which is trying to develop tools or products almost that are actually going to be used in the models that are delivered to Google DeepMind’s customers. What have you learned so far about how you need to do things differently when you’re actually going to develop something that’s going to be deployed to hundreds of millions, possibly billions of users?
Neel Nanda: I think it’s been a really educational experience for me. I also just want to emphasise massive credit to Arthur Conmy, who is the one actually running our team, also one of my MATS alumni.
So what kinds of things have I learned? There are maybe four things that are worth thinking about for whether the kinds of people who are actually deciding what does or does not get used in a frontier model care about with some safety technique.
There’s effectiveness: does this actually solve the problem that we care about? And implicitly, is it a problem we actually care about?
Two: what are the side effects of this method? How does this damage performance? I think this is often missed by more academic types. For example, if you want to create a monitor on your system that says, “This prompt is harmful, the user’s trying to do a cybercrime, I should turn it off,” it’s really important that this does not activate much on normal user traffic. This is a thing that papers often just don’t check.
Thirdly, there’s just expense: how much does this increase the cost of running your model? Some things are basically free, like adding in some new fine-tuning data. Some things are really expensive, like running a massive language model on every query, giving some additional info.
And finally, one that I think people often don’t think about is implementation cost. This is maybe best illustrated with an analogy. I’m sure many people have seen Golden Gate Claude, this thing where you used a sparse autoencoder to make Claude think it was the Golden Gate Bridge and be obsessed with it.
Let’s suppose I wanted to make Golden Rule Gemini: I found the “be ethical” concept and just wanted to turn it on. What would this actually involve? Mathematically, it’s a minuscule fraction of the computational cost of a frontier language model. But in order to make this work, you need to go into this really complex, well optimised stack of matrix multiplications and do a different thing there. That is not what they’ve been optimised for.
This both is technically a lot more challenging than it is to just do it on an open source model where you don’t really care about performance, but it also is something where you’ve got to think about the stakeholders you’re interacting with. If you’re changing the codebase that’s used to serve a model, then lots of other people use the same codebase. and now you’ve added some extra code. They need to figure out if that’s relevant to them. Maybe it will break their use case, or maybe they’ll just think it might and I’ll waste their time. There’s actually quite a lot of resistance.
If you wanted to do something like introduce a novel architecture, like change how transformers work, not only do you need to interact with the serving people, but you need to interact with the pretraining people and basically everyone, because you’ve just changed how the thing works. Meanwhile, if it was something like a black box classifier that just takes the text output by the model, that’s a fair bit easier and will generally have less resistance.
One thing that I found pretty striking is how important it is to find common ground between what people care about today, and the techniques that I think will be long-term useful for AGI safety. I think that it’s just so useful in so many ways to have other people who care about your technique and will kind of build a coalition and help push for it being implemented — even if what they want is not actually very safety relevant.
It’s really helpful to have experience and data. You’ll be able to iterate on making it cheaper. You’ll kind of have the precedent set, but also the infrastructure set. It’s a lot easier to later convince people to use a slight variant that’s safety relevant if it’s just like, “Add this additional thing to the thing we’re already doing” than, “Add this whole new component.” It’s also a lot easier to convince people of this if you’ve implemented the components and just say, “Please accept my pull request and it will happen,” rather than, “Please spend some of your incredibly busy engineer time making this thing that I promise is a really good idea.”
How does impact differ inside and outside frontier companies? [00:49:56]
Rob Wiblin: What about for people who are outside of the leading AI companies? I spoke with Beth Barnes a couple of months ago, and I think she had the view that for many AI researchers, they might have an easier time in some ways, doing some lines of research outside of the frontier companies — where if they’re operating more independently or in a smaller research team in some other organisation, then they just have a lot more of a free hand to prioritise and do whatever they prefer.
But that definitely creates a risk that you might develop brilliant techniques and then just have a very hard time getting any of the companies to take them seriously or to apply them. You’re not inside the organisation, so to start with that adds a barrier to implementation. But also you might not have understood what the constraints are inside one of these companies in terms of delivering the product — so you might just be barking up the wrong tree or developing the thing in such a way that it’s going to be actually very hard to ever apply.
Do you have any advice for people who are doing work that they would one day like the companies to take up, but they’re not inside them?
Neel Nanda: Yeah, really good question. I think one thing that’s quite useful for people to bear in mind is that if we want to convince Gemini to use some safety technique, me and the GDM AGI safety team are much better placed to do this than anyone outside, because we have a good model of the kinds of things that team cares about. E.g. we know to check how much the monitor will trigger on harmless prompts; we know the evaluations people care about, which aren’t always public; we can work directly with the models; we can implement them.
This means that the goal of people outside should be to do sufficiently convincing work that lab safety teams think it is a good enough use of their time to try to flesh out the evidence base.
So what does this actually mean on a practical level? A lot of what I was saying about what it takes to get a thing used in production still applies: you should be thinking about, what are the number of stakeholders you’d need to agree if this ever gets used? How complex is my technique? How expensive? What are the side effects? Trying to produce the best evaluations you can, but you also want to make sure that lab safety teams are paying attention.
A lot of people doing safety research know someone at a lab. Getting their takes can often be really helpful. People sometimes hear this advice and think they need to reach out to the leads and the most senior people. But if someone emails Rohin Shah, my manager, he’s very busy, he does not have that much time to help; while if someone messages someone who joined my team in the past year, they might be a lot more able to help and also know enough to get pretty useful context and escalate it to someone like me if it’s actually promising.
And one thing that I think some of the external orgs — like Redwood and METR, that are doing fantastic work — do very well is they just talk a bunch to people at labs throughout the research process. Which means they have good models, and there can be dialogue both ways. I think this is much more likely to go well than just trying to get people to pay attention after you’ve done the research. Though easier said than done.
Rob Wiblin: An interesting challenge that companies like Google DeepMind face is that as we’re kind of barreling forward into this AI-dominated future with so many different products — and I guess there’s a commercial race, there’s a geopolitical race — all of the companies are being forced to make what could turn out to be quite monumental decisions about their internal governance arrangements, and what sort of safeguards they have and don’t have. They’re being forced to do this on a very accelerated timescale, with potentially not enough time to analyse all of the questions as they might like.
How do you think someone inside the company can help to steer those processes towards positive outcomes? And is that something that people should have in mind as a significant way in which they can have a positive impact, or is that the wrong mindset perhaps to have?
Neel Nanda: That’s a great question. I think that there is a lot of impact you can have if you’re someone that key decision makers trust and listen to you. This doesn’t necessarily look like “playing politics”; I think that there’s a lot of potential impact in being seen as a neutral, trusted technical advisor: someone who really knows their stuff, really understands safety, but also isn’t ideological.
I think the safety community sometimes has issues with thinking it must always hype everything up as dangerous because that’s the only way people will listen. I care quite a lot about the DeepMind safety team calling out things that are scaremongering for scaremongering, and things that are actually somewhat concerning as actually somewhat concerning — because that’s the only way you get people to listen when something actually concerning is happening.
And I think people often miss this whole “you can just be a well respected technical advisor.” This doesn’t mean not being good at understanding how bureaucracies work and navigating them. And it does involve a lot of time spent not doing research. But it looks much more like identifying the key decision makers, building a reputation for yourself as smart and competent and thoughtful, and being someone who gets called on for advice. My manager, Rohin Shah, is fantastic, and I think very influential at helping DeepMind go in safer directions, largely via this kind of approach.
Is a special skill set needed to guide large companies? [00:56:06]
Rob Wiblin: Yeah. When I spoke with Beth Barnes earlier in the year, I think she had a slightly pessimistic take about the ability for people to go into these very large AI companies and steer them towards what she, and probably you, would regard as better decisions. I think in part because she’s more of a technical person, and I think she’s spoken with many people who have more of a machine learning, technical background.
Many of them, she thought, had gone into roles in these companies hoping that they would be able to influence outcomes of political decisions or group management-level decisions. But she thinks in order to influence those things, you to some extent have to specialise in the kind of skill set that’s necessary — of speaking to all of the people, and trying to build a consensus around the kinds of outcomes that you would like to have.
And people who are focused on doing machine learning research, firstly, most of their time is spent doing the research, and they potentially don’t necessarily have all of the strengths in the internal organisational politicking that would put you in the best position to change the kinds of strategic or safety decisions that the project is making. Does that sort of sceptical take resonate with you?
Neel Nanda: Yeah, it’s complicated. I think there’s definitely some truth to that. Sometimes I see people who just say, “I care about safety, so I should go work at a frontier AI company and this will make things good.” This is probably better than nothing in general, but I think you can do much better by having a plan.
One type of plan is the kind of thing I just outlined of either “be very good at navigating an org” or “be widely respected technical experts.” But the vast majority of people having an impact on safety within DeepMind are not doing that. Instead, it’s more like there’s a few senior people who are kind of interfaced with the rest of the org — and a lot of the impact that, say, the people on my team can have is by just doing good research to support those people who are pushing for safer changes.
Because good engineers and good researchers can do things like show that a technique works, invent a new technique, reduce costs, build a good evidence base for the effectiveness of the technique and that it doesn’t have big side effects or costs, or just implement the technique. It’s a lot easier for decision makers to say, “Yes, we will accept the code you have already written for us.” And this is a lot more time consuming in some senses than being the person who navigates the org.
And I think that if there are people that people respect in the org — and I have a lot of respect for some of DeepMind’s safety leadership, like Rohin Shah and Anca Dragan — just trying to empower those people can be great.
I also think that trying to empower a team can be very impactful. Recently on the AGI safety team we’ve started a small engineering team whose job is just trying to accelerate everyone’s research. And you might think that this is a mature tech company, so surely there’s nothing that could be improved. But there’s just a lot of ways you can improve the research iteration speed of a small team with fairly idiosyncratic use cases. And I think this team has been super impactful.
Vikrant [Varma], who runs that team, actually asked me to give a plug that something people may not realise is if they are a good engineer — someone who’s fast and competent at working within deep, complex tech stacks — and feel excited at the idea of accelerating research, especially if you work at Google, you can have a pretty big impact even if you don’t really know anything about research or ML. And you should totally shoot me an email if that describes you.
The benefit of risk frameworks: early preparation [01:00:05]
Rob Wiblin: I think one thing that must be difficult for decision makers inside all of the leading AI companies is that it’s just legitimately extremely hard to know what sort of safeguards are required at this point — or are going to be required in the next year, or are going to be required in the next five years — versus which ones are just not actually going to move the needle on safety. It might just be kind of a waste of time and investment, and put you at a commercial disadvantage.
It sounds like you’re saying something incredibly useful that you can do as a technically savvy person inside the companies is to provide accurate information to inform decision makers about what are the real threats that are panning out, and what sort of safeguards would actually help to defuse them. And that is much more likely to build goodwill and actually influence the decision than just constantly saying, “We should always be doing the safer thing” because that’s your overall worldview.
Neel Nanda: Yeah, I think that’s a pretty good summary. I also think that “just do the safer thing” is actually an incredibly oversimplistic perspective. The space of ways that an incredibly complex system like a frontier language model can be safer is just really big, and I think part of the skill of having an impact is finding that common ground. To be clear, I’m not just saying people should only do safety research that is directly relevant to today’s models. Rather, what I’m saying is it will need to eventually be relevant.
One caveat I may add to what you just said is that it’s a lot harder to get a technique used in a production model than it is for the safety team to just decide to work on it. We have quite a lot of autonomy to use our best judgement for what to work on, and if we think that something’s going to be a big deal in six or 12 months, and people will see the urgency, then we can start to prepare that right now.
And I think that in general, I mean, there’s this famous saying that one of the ways to get policy change to happen is to wait until a crisis. I would strongly prefer that when safety issues arise, we have good solutions ready to go. Because I personally think that many safety issues will have warning signs and tentative precursors that crop up in testing or things like that. And the thing you can produce to fix it, if you’re really urgently trying to do it because it is blocking a launch, is often less deep and effective than what you can do if you prepare right. And one of our paths to impact is this kind of preparation.
Rob Wiblin: What do you mean by “crises”? Are you talking about, we’re imagining sometime in future when the product is doing things that the company doesn’t want, and people are scrambling to find solutions that will allow it to continue to be served to users in a safer way, and you want to prepare techniques ahead of time that are sound, that actually would address the problem, and have those packaged and potentially ready to go at some future time when there’s much more demand for actually using them?
Neel Nanda: Yeah. In some sense, I think that if that happens, the safety team has failed to do their job right. What I want is for these things that would be future crises to be flagged ahead of time in a way that people notice on evaluations they think are real, such that we can fix them before any actual incidents happen. And I’m a really big fan of these risk management frameworks, like DeepMind’s Frontier Safety Framework and Anthropic’s Responsible Scaling Policy, because I kind of think of them as creating “managed crises” on our own terms.
For example, Gemini 2.5 Pro triggered the early warning sign for being able to help people do things with offensive cyber capabilities. I think that it’s very good that we do these tests and notice this, because there is then a threshold where we’ve said we’ll have mitigations in place.
So there’s now a lot of energy and interest in developing good mitigations in advance, ready to go by the time we trigger the “this could actually be dangerous if it wasn’t mitigated well.”
And we now have deceptive alignment in our Frontier Safety Framework. And as time goes on and the evidence base for these risks starts to get stronger and stronger, it becomes much easier to put the currently more theoretical kinds of risks in there. We’re already seeing things like alignment faking. And if we can identify ahead of time what the risks will be, and agree on good evaluations for them and on what kind of mitigations are needed when that threshold is reached, then that creates the same kind of effect as a crisis without all of the bad effects of having a crisis.
Should people work at the safest or most reckless company? [01:05:21]
Rob Wiblin: Do you think that, for someone who is reasonably safety focused and has strong machine learning chops, are they likely to have more impact by going and working at a company that is already quite safety focused like Google DeepMind? Or perhaps could they have even more impact by going and working at a company, companies that will remain nameless, where there’s maybe less of a safety culture and less of a safety investment, and they might be one of relatively few people who have that as a significant personal priority? Should people seriously consider going to the less safety-focused labs?
Neel Nanda: I think it’s somewhere in the middle. I care a lot about every lab making frontier AI systems having an excellent safety team, whatever the leadership philosophy of that lab is. But in the kinds of labs where it’s harder to get this kind of thing through and there’s less interest, I think there are certain kinds of people who can have a big impact, but that many people will largely not.
The ways I’d approach having impact in a lab like that is I think there’s just a lot of common ground between people who want to make better models and people who want to make safer models. The world is not necessarily just full of tradeoffs.
For example, I would love people in any lab to be working on things like monitoring the chain of thought of a model to reduce reward hacking. This is pretty commercially applicable in my opinion. But also it’s the kind of thing that very naturally flows into much more safety-relevant things, while also being somewhat relevant to safety today. Because monitoring the chain of thought for the model doing things you don’t like is just very general. Like, if you wanted to implement some more elaborate control scheme, that’s an excellent basis.
But how could a listener tell if this might be a fit for them? I think the kind of person who would be good at this is:
- Someone who has experience working in and navigating organisations effectively.
- Someone who kind of has agency, who’s able to notice and make opportunities for themselves.
- Someone who’s comfortable with the idea of working with a bunch of people who they disagree with on maybe quite important things — where ideally, it doesn’t feel like you’re kind of gritting your teeth every day into work. It’s just, “This is a job. These are people. I disagree with them. I can leave that aside.” I think it’s a much healthier attitude, less likely to lead to burnout, and will also just make you more effective at diplomacy.
- And finally, people who are good at thinking independently. I personally find it a bit hard to maintain a very different perspective from the people around me. I can, but it requires active effort. Some people are very good at this, and some people are much worse than me at this. If you are my level or worse, probably you shouldn’t do this.
Advice for getting hired by a frontier AI company [01:08:40]
Rob Wiblin: What advice do you have for people who are interested in pursuing a career in AI and getting hired by one of these frontier AI companies? Is there any general advice that you can offer?
Neel Nanda: I think the two key skills that people really care about are: ability to do research — and especially proven experience doing research — and engineering skill, partially just software engineering skill and partially machine learning engineering skill. Machine learning is a bit less important nowadays given that it’s less everyone training their own models and more like massive training runs and people having different pieces.
A few additional thoughts here: I was quite confused about what “engineering skill” meant when I first got into this field. People’s first exposure is often things like LeetCode, these simple puzzles you get asked in coding interviews where maybe you don’t even write code, you just explain the solution. This is, in my opinion, not actually that useful, and a very specific kind of engineering skill.
Something that matters a lot for working at a tech company is what I think of as deep, experienced engineering skill. It’s the ability to work in a large, complex codebase where hundreds of people have been working in the same codebase, and you’re depending on countless other people’s code, and you can’t realistically keep the whole thing in your head at once. You need to know when to go diving deep to figure out what’s going wrong and you need to know when to abstract something. Nowadays you need to know how best to use LLMs to help you navigate this.
I want to distinguish this, because I think that this is a much harder skill to gain without just directly working in that kind of tech company. And it’s not essential — you can make up for it if you’re just like a fantastic enough researcher — but it is pretty valuable.
There’s also ML engineering: in particular things like being able to write efficient ML codes, train things — where really, that skill means get it to be efficient, but also debug it when it inevitably goes wrong 100 times in really confusing ways. Unlike normal programming, you don’t get these really nice error traces telling you what went wrong. Instead it just performs bad.
I do think this might be changing somewhat, because I think especially over the last six or 12 months, coding agents have gotten really good in a way we might discuss a bit later. But I think that these more senior engineering skills are likely to be much harder to automate and remain useful substantially longer.
On the front of research and research experience, one way I recommend thinking about it is that a paper is kind of a portable credential. It doesn’t matter where you wrote the paper, or whether you were doing a really prestigious PhD or being a random independent researcher. If you have done good work and people have enough time to realise it is good work, they care.
People often rely on heuristics. If you’ve done a PhD at a prestigious lab, then that’s a thing that gets people more likely to spend the time paying attention. But often PhDs don’t really matter that much as long as you have a good research track record. People will care more if you do research relevant to frontier language models.
And it can also be very in your interests to have people at the lab who know about your work and think it is cool research — things like reaching out to people at the lab whose work you admire and telling them about your work, or trying to meet them at conferences and things like that. Or reaching out to people at a lab when you apply to an open job ad — it often doesn’t work because people get lots of emails, but in expectation I think it is extremely useful.
What makes for a good ML researcher? [01:12:57]
Rob Wiblin: A common theme in some of your answers about how to be successful in ML research and how to get hired by an AI company is just to be a productive, thoughtful researcher.
No one would deny that you’ve published an impressive amount of research over the last four years, and you’re also leading a team of I think eight people now at Google DeepMind, and you’ve supervised 50 or something junior researchers over the last four years.
What have you learned about what makes for a good researcher, and what sorts of practices they tend to adopt in order to get a lot done?
Neel Nanda: I think people often somewhat misunderstand the exact skills and mindset that go into being a good researcher. I’m sure we’ll unpack this more, but at a high level: you need to be decent at coding, just so you can get things done and you can do experiments. Often [especially in academia or independent research] it’s sufficient to only be good at very hacky, small-scale coding that you’re doing in an interactive thing like a Python notebook, rather than building complex infrastructure or working within complex libraries. So the key skill is being able to kind of be hacky and get things right.
And related to this is the skill of just iterating fast. It’s very striking how different the productive output and the success rate is between people who can just do things fast and iterate fast, and see the new results they have and decide.
That gets me onto a related point of prioritisation. Research is an incredibly open-ended space, and more so than in many other lines of work, you need to be good at acting under uncertainty. This is a psychological thing. Some people find this a lot harder than others. I mean, I used to be a mathematician. It was so good: you had universal truth everywhere. Sadly, this is not the case in machine learning. But it’s also knowing when to dive deeper into a problem, and when to zoom out. People often start off too far in one of those two directions.
Another key thing, especially in mech interp, is this kind of empirical scientific mindset and this notion of scepticism. You’re trying to navigate this incredibly large tree of possible decisions you could make, and you will get experimental evidence and you’ll need to interpret it as it were. You’ll need to understand what this tells you about what could and could not be true; whether there’s weaknesses and flaws in your evidence and you should actually push harder and you can’t really update on it, versus when it’s good enough and you can move on. And being able to look back on your last weeks of research and take the hypotheses you’re maybe quite attached to and try to red-team them, think about how they could be flawed and what experiments you could do to test their robustness.
The final key skill is complex and often misunderstood. It’s what I call “research taste,” which basically means having good intuitions for what the right high-level decisions are going to be in a research project. This can mean “what problems do I choose?” Because some projects are a good idea, some are a bad idea. If you choose a good project, you’re happy. Some of this is knowing what directions within a project are most promising and most interesting.
Some of it is more low level, like can you design a good experiment to test the hypothesis you care about that is both tractable and easy to implement? Because you kind of know what that looks like, and what hard things look like, but also actually gets at the thing you care about.
Rob Wiblin: What do you think people misunderstand about research taste?
Neel Nanda: I think one thing people misunderstand is they think that it’s only about choosing the project, when actually it applies on many levels. I have a blog post — we can hopefully put it in the description — trying to unpack what this term actually means.
One of the particularly messy things is these skills have different feedback loops. Coding, you can kind of tell if your code worked in minutes or hours. Conceptual understanding of interpretability, you can often figure out within hours; you can read papers, maybe you can ask someone more experienced. And then there’s things like prioritisation, where you often don’t know if it was a good idea for quite a long time. And there’s research taste, where things like designing an experiment, maybe you can get feedback within days to weeks; “was this a good direction?” within weeks, maybe months; “was this a good project?” within months.
I think it’s quite instructive to think of this like you’re a neural network. You’re trying to train your intuitions to be good at making predictions about research questions, because there’s not really a science of “will this research question work out or not?” — ultimately it does come down to intuition, though often the intuition is grounded in logic and arguments and evidence. And you want to get high-quality data and you want to get lots of data.
You should expect that early on you are going to learn the easy skills much faster, and research taste much slower. Often my advice to people getting into research is just don’t worry about research taste. It’s a massive pain. Learn the other skills first and then it will be much easier to learn research taste.
This ideally looks like finding a mentor and kind of using them for research taste. But if you’re not lucky enough to have that, this can look like just doing pretty unambitious projects: incremental improvements to papers or random ideas where you don’t actually care if they’re good ideas or not. You just want to try doing something, and you’re willing to give it up if it doesn’t turn out very well.
Three stages of the research process [01:19:40]
Rob Wiblin: You mentioned in your notes that there are three different stages to the research process, and it sounds like you think people can get a little bit confused about which one they’re in and potentially have the wrong attitude to the work that they’re doing on a given day. Can you explain what are the three different stages, and what’s the different mentality that you need to bring to each?
Neel Nanda: Yeah. I call them explore, understand, and distil.
Explore is when you go from having a research problem to kind of a deep understanding of the problem and what kinds of things may or may not be true. I think people often don’t even realise this is a stage, but especially in interpretability it’s often more than half of a project: figuring out the right questions to ask, figuring out the misconceptions you might have had.
Then, once you have some hypotheses for what could be true about your problem — like you go from thinking that there would be something interesting about some behaviour, like, “Why does this model self-preserve?” — exploration would be getting to the point where you’re like, “I have the hypothesis that it’s confused.”
Then understanding is when you go from a hypothesis to trying to prove some conclusions, or at least provide enough evidence to convince yourself. Exploration is very open ended. The North Star to have in mind there is, “I want to gain information” — which often doesn’t need to be very directed. While when you’re understanding, you often want to have a specific hypothesis in mind and form a plan for what experiments would get you evidence for and against this. In the case of self-preservation you might be like, “If the model’s confused, then changing this part of the prompt might make it less confused.”
And then distillation is when you go from you are pretty convinced this thing is true to you’ve communicated it to the world, and you’ve made it more rigorous, and made it into something that someone who hasn’t been deep in the guts of the project might be convinced by.
Rob Wiblin: And how does that trip people up in practice?
Neel Nanda: People often don’t seem to really get that exploration is a stage. They feel like they’re failing if they don’t know the hypothesis they’re trying to answer, or they don’t have a clear plan or theory of impact in their head for where this project is going. This means that I think people often get stuck or feel stressed. There’s also the mistake where people get stuck in rabbit holes because they think they found the right thing to look at, but actually they just don’t know enough about the problem.
And exploration, it doesn’t mean just kind of messing around. You can think rationally about the information gained per unit time. There are certain kinds of tools you might use, like reading the model’s chain of thought, that might give you pretty rich data. There are certain tools that are probably not going to give you rich data, like measuring the accuracy of some technique that’s a number.
I think of this as kind of gaining surface area. You want to do techniques that are very open ended and have loads of different things to look at and think about. You want to maximise your iteration speed. If there’s a bunch of cheap techniques that aren’t super rigorous, you should do those fast, rather than one really high-effort thing.
That’s for understanding. And you don’t need to be rigorous, because the goal is to form ideas and hypotheses. Often it makes sense to kind of pick a micro hypothesis and then try to test it, and then zoom out within like a week. One practical implementation of all this is I recommend people check in like once a day or once every two days: Did I learn anything? If the answer is no, consider changing tools.
And then exploration requires a different mindset. Maybe someone’s really good at messing around, but they’re bad at really just getting into the scientific mode of making great experiments. One mistake I think people often make is they try too hard to have lots of experiments, or really comprehensive experiments, rather than thinking about what would be the kind of killer experiment that just really establishes that the thing works. And then ideally you also do other experiments. But you’ve got to think about both: quality and quantity.
Then distillation: I think people make the mistake of, you’ve spent weeks to months in the weeds of a research project, so everything’s clear to you. This is not clear to anyone else — who are the people who actually matter, because they will be reading your work. You need to meet people where they are and provide them all of the relevant context. My practical advice for this is to get feedback from people, run your ideas by them, run a draft by them, see what confuses them, see what they don’t understand, and prioritise fixing that.
Another thing which helps a lot here is being very deliberate about what the point of your paper is. People often think they need to tell a story about everything they did in a project. But actually people will only ever really remember a few ideas from a paper — what I think of as the narrative; kind of a few ideas — and then you support them with, like, “Why are these ideas true, and why should you care about this?” You begin the paper by just explaining the ideas, and very briefly, your evidence. You then explain the ideas in detail. Then you explain your evidence in detail. But all of this is just centrally focused on this goal of having people remember and believe the ideas, this narrative.
And I have various blog posts on how to do research, including one on how to write papers, which I hope we’ll put in the description.
Rob Wiblin: It sounds like you really have to distil things down at the communication stage, because if people remember a single sentence from your paper, then in a sense you’re quite lucky from that. But that means you do really have to think very hard about what is the sentence that you’re trying to get through to them. And then if someone remembers like three sentences, the beginnings of a narrative, what might that be?
You’ve had quite a bit of success getting your papers to go viral. Do you spend a lot of time crafting the tweet, or crafting the distillation of the core point from the paper that you actually want to spread in the wild?
Neel Nanda: Yeah, quite a lot of time. And this is something I try to get my MATS scholars to focus on. I maybe want to distinguish between hype and communication. I think the crucial thing is that people understand the ideas in the paper and why they should care. This can kind of look the same as trying to build hype, but it’s actually based on something. A common joke is you should spend an equal amount of time on the abstract, introduction, figures, title, and everything else — because if you take the length of a thing times the number of people who will read it, they’re about equal. I have not yet figured out the best way to spend 20% of a project making the title good. But…
Rob Wiblin: We’re actually almost approaching that on the show, because it’s so important on YouTube or on Twitter to really distil what is the reason that anyone would want to listen to a given episode? I think we’re not at 20% yet, but it’s going up.
Neel Nanda: I’ve been inspired by your process. At the very least you should make a shortlist, shop them around for feedback. Don’t just go with your own intuition. I also think that Twitter is one of the main ways people communicate research in ML, for better or for worse. Nowadays there’s also interest in Bluesky.
I think that the key thing to have in mind if you want to write a good tweet thread about your paper is: 95% of people do not read beyond tweet 1. So tweet 1 needs to contain your entire narrative. This forces you to be concise. That is good.
Rob Wiblin: This is a 280-character tweet, not the new whole tweet thread.
Neel Nanda: Don’t do the really long tweets because then people have to click on them, and they won’t.
And the other thing is: have a single eye-catching figure that conveys something interesting in a way that people can understand without needing too much context. This is often worth a lot of effort, and it should also typically be the first figure in your paper. Again, this isn’t just about hype. I think this is actually a very effective form of communication. If more people see your work and get a correct understanding, that’s great.
The other thing to bear in mind is that people who see your tweet will just be in their feed amongst a bunch of other stuff. They won’t necessarily even know that you’re an ML researcher. So you need to situate it in the mind space of all possible things: things like use keywords they might recognise. But don’t use jargon if you want people who don’t know the jargon to understand it. Have a sentence of motivation of what’s the problem being solved? There is an art form to good tweet threads people do not seem to have. It’s very sad.
Rob Wiblin: OK, you’ve got to explain the full results of the paper situated in the field, communicate that you have the necessary expertise to be commenting on this, and do it without any jargon — in about 280 characters?
Neel Nanda: You don’t need to communicate your expertise. I think it’s fine to just be like “paper.” And never put a link to the paper in the first tweet, and never just screenshot the abstract and list of author names and title, because that’s really boring and no one will read it.
Rob Wiblin: Yeah. There’s a very strong pressure that people can feel — and I feel this myself sometimes — about how much you want to hype how exciting a result is, or a podcast is, versus be frank about the content and its limitations. How do you strike that tradeoff yourself?
Neel Nanda: So I’m not sure I always do an amazing job straddling this balance myself. But roughly, I just think it is very important to talk about the limitations of your work prominently in the work. Ideally, if there’s a key limitation, it should be in the abstract, because many people will only read the abstract. In my opinion, it is bad if people have a mistakenly positive impression of your work.
In particular, an incentive that I think people might often not track is experienced researchers are pretty good at telling when a paper is just full of hype and has no substance, and will generally lower their opinion of the authors. And often these are the people who might be hiring you in future or deciding whether to take you on as a PhD student or potentially collaborating with you. This is a real cost. I also think that having integrity is nice, and it’s great that these align.
So I think definitely you have to discuss the limitations. Otherwise people will be like, “What moron wrote this?” But also, if you bury them at the end, people may often still think that, because they don’t read to the end. And you might also impress a lot of people and get a lot of hype. And I’m not going to lie and say this isn’t useful, but I think that nuanced communication is often fairly incentivised.
How do supervisors actually add value? [01:31:53]
Rob Wiblin: In supervising 50 or so people over the last few years, what have you learned about the ways in which a research supervisor can actually add value? Especially because I’m guessing that with so many people that you’re involved with, you can’t put that many hours in — no individual person can be your main focus — so how can you help them a great deal, potentially, with only a few hours a week or maybe even less?
Neel Nanda: The system I’ve converged on for my MATS scholars, which seems to work pretty well, is I take on another eight every six-ish months. I put them into pairs. Each pair is doing a project, and I have a one-and-a-half-hour check-in with them once a week. And I’m not very good at getting the people to move on, so I currently have like 16 and it’s kind of an issue. But eight I can totally do in like one evening a week. It’s great.
So what do I actually do? And why is one and a half hours a week both sufficient and more important than nothing?
I think this gets back to this core idea that research has different skills that have different difficulty to gain. But another part of this is that they have different kind of time costs to use. Like the majority of your time will be spent writing code and you get very good feedback loops on this. I can’t write the code for them. I don’t have the time to do this. In practice, I often don’t even read the code my scholars are writing and I just hope it’s not full of bugs. But then there are these skills like prioritisation and research taste that I somehow have developed and I can use in a way that’s very cheap on my time, but would be hard for them to replicate.
So my ideal is that I try to select for people who can do fairly well autonomously and do decently on these skills themselves.
An AI PhD – with these timelines?! [01:34:11]
Rob Wiblin: So you don’t have a PhD yourself. You potentially saved many years of study that way. As an outsider, it’s hard for me to imagine that the best move in the AI industry right now would be to go away and spend four or five or six years possibly writing a thesis, doing a full PhD — at least if you could get a job in the industry in some sort of ML research some other way.
Is that kind of the right intuition? That this is no moment to be taking that much time away from advancing your career directly?
Neel Nanda: It’s complicated. I think one common mistake people make is they assume you need to finish a PhD. This is completely false. You should just view PhDs as an environment to learn and gain skills. And if a better opportunity comes along, it’s often quite easy to take a year’s leave of absence. Sometimes you should just drop out. If you’ve got an option that’s having a job doing research at a serious organisation that you expect to go better than your PhD, that’s kind of the point of doing a PhD. You’re done. You’re done early. Leave. I don’t know.
Someone I recently hired, Josh Engels, was doing a PhD in interpretability, had done some really fantastic work. And I was just already convinced he was excellent and managed to convince him to drop out a few years early to come join my team. I know I’m very biased, but I think this is the right call.
That aside, should people do them at all? I think that one of the most useful ways to get into research is mentorship. You can get mentorship from PhD supervisors or other people in the lab. You can get it from colleagues and your manager if you join an industry research team. But different places vary a lot in what you’ll learn and how valuable an experience it will be.
I think something people often don’t realise is that the skill as a manager of the person who you’ll be working with — either your manager in industry or your PhD supervisor doing a PhD — is incredibly important, and it is a very different skill from being a good researcher. Often the most famous PhD supervisors are the worst because they are really busy and they don’t have time for you, or because being a good researcher just doesn’t translate into being a patient supervisor, caring about nurturing people, making time for them. And the same applies for industry roles.
I like to think I make an environment where people on my team can learn, have some research autonomy, and grow as researchers. Because if nothing else, my life’s a lot easier if people on my team are great researchers. The incentives are very aligned.
But there are some teams where you get much less autonomy and they might want you to just be an engineer or something. Or you might be able to do research, but on a very specific agenda. And I do think engineering is a real skill set. It’s often hard to learn in a PhD and easy to learn in industry, but so is doing research.
One thing I will particularly call out is I think that we do still need people who can set new research agendas, come up with creative new ideas. Often PhDs are better environments for this, because you can just do whatever you want and no one can really stop you. And often supervisors are fine with this. Managers in industry less so.
Some of the people I respect who’ve done PhDs and recommend it emphasise this as one of the crucial reasons: that we really want more people who can lead new research agendas. And being kind of thrown into the wilderness on your own for a while to figure this out can be one of the best ways to learn this.
You can think of this as: in academia, you often have high autonomy; you don’t have much support. Some people thrive, some people do badly. Famously, the public health burden is worse than that of STDs. We can link to that blog post in the description.
Rob Wiblin: Google that if you haven’t heard that claim.
Neel Nanda: I know a lot of people who’ve been very miserable doing PhDs. I also know people who’ve loved doing PhDs, and just did not want to leave or are really glad they finished. In industry you often have lower autonomy, but it varies a lot between teams. But you can be part of larger projects, you’re better resourced, you’ll often learn a lot more about engineering, and you’ll often learn skills that are more relevant to frontier model research. But the variance between teams typically dominates.
One concrete tip: if you’re considering working in some team or in some academic lab, talk to the students of that supervisor or other people on that team in a kind of candid context where you think they’ll probably be honest with you, and ask a lot of questions about what their supervisor is like: How much time do they spend? What benefit do they get from working with this person? What things would they change about them if they could? And if the person has a great reputation but the student says bad things, it’s a trap and you shouldn’t go there.
I find it quite frustrating that there’s such a big information asymmetry between junior people trying to get into roles, and the people hiring for the roles. I try to make a point of, if I’m trying to hire someone and I’m not sure this is the best option for them, just telling them this. But there is a lot of information asymmetry. Expect that if you gather the right kinds of information, you can make better decisions.
And sometimes the person in authority over you, especially certain PhD supervisors, is kind of terrible and prioritises their own interests over yours. I’ve heard of supervisors who’ll do things like try to stop their students graduating for as long as possible so they get the maximum research labour. And I just think this is awful and massively judge supervisors like that.
Is career advice generalisable, or does everyone get the advice they don’t need? [01:40:52]
Rob Wiblin: So we’re getting a bunch of career and life advice from you throughout this episode. That sort of thing has become a bit less of a focus for the show over the years. I think one reason is I’ve become a bit more worried that advice that’s helpful for one person can be potentially harmful for someone else, because people are just so different and they might well need to hear different things. Just how much life and career advice generalises is something that I’m a little bit sceptical and a little bit nervous about.
Do you have any thoughts on that? I guess you’re in a position where you often are having to give people all kinds of different advice, people who are more junior than you who you’re trying to help. How much do you worry that the things you suggest could be bad?
Neel Nanda: Maybe let’s divide this into: advice given in a context like this, and advice given in a one-on-one setting. So in a context like this, where I’m just speaking to however many people, I agree with everything you just said: people should take everything I say with a mountain of salt. I’m trying to control for this in what I chose to say, but I probably messed up, and I definitely find it easy to slip into modelling people as though they think like I do or something.
But a few reasons that I think I can say somewhat useful things. I’m definitely very conscious that a big part of the reason I’ve been successful is I got lucky: I was in the right place at the right time. So I am not trying to give advice about that kind of thing specifically, other than meta-level things like, “Here are the traits of good opportunities to look for.”
There’s things like how to do research where I actually have a decent sample size at this point. I’m sure that I still kind of filter for a certain kind of person, and interpretability is a certain kind of field, so people should take what I say with a grain of salt.
And I do also very strongly agree with the “different people need different advice.” There’s this fantastic blog post on the law of equal but opposite advice. I try to give advice of the form, “Some people do extreme A, some people do extreme B. The correct thing to do is somewhere in the middle. Here’s what I think the correct thing looks like” to try to mitigate this a bit. But yeah, people should definitely be aware.
And also be aware that they will typically have a bias towards thinking they need to do more of the thing they are already doing too much of — because if you were biased in the other direction, you’d probably be skewed that way already.
Rob Wiblin: Right.
Neel Nanda: But I don’t know. I think you can sometimes tell. And then there’s things like how to write cold emails where I feel pretty good about giving the advice, because it’s more about how do I perceive things. Though definitely some parts of that — like put all the ways you’re impressive up front — where people’s mileage may vary.
Remember: You can just do things [01:43:51]
Rob Wiblin: All right, we’ve got to wrap up. We’ve reached the end of our booking here. You’ve had a lot of different advice to share across many different fronts in this episode. Earlier we were talking about the importance of distilling things down and having the one sentence that people remember. What’s a piece of advice that you think is pretty general that you would really like people to recall from this episode?
Neel Nanda: I think one of the most important lessons I’ve learned is that you can just do things. This may sound kind of trite, but I think there’s a bunch of non-obvious things I’ve had to learn here.
The first one is that doing things is a skill. I’m a perfectionist. I often don’t want to do things. I’m like, “This seems risky. This could go wrong.” And the way I broke this myself was I challenged myself to write a blog post a day for a month. And that’s how I met my partner of the past four years.
Rob Wiblin: Wonderful.
Neel Nanda: And it also helped me produce a bunch of the public output that’s helped build the field of mech interp.
Another part of this is what I think of as kind of maximising your luck surface area. You want to just have as many opportunities as possible for good opportunities to come your way. You want to know people. You want to be someone who sometimes says yes, so people bring things to you. You want to just get involved in a bunch of things.
You also want to be willing to do things that are kind of a bit weird or unusual. One of the most popular videos on my YouTube channel, with like 30,000 views, was I read through one of the famous mech interp papers, “A mathematical framework for transformer circuits,” for three hours and just gave takes. I did no editing, put it on YouTube. People were into this.
And maybe as a final example, I kind of ended up running the DeepMind team by accident. I joined DeepMind expecting to be an individual researcher. Then unexpectedly, the lead decided to step down a few months after I joined. And in the months since, I ended up stepping into their place. I did not know if I was going to be good at this. I think it’s gone reasonably well. To me, this is both an example of the importance of having luck surface area — being in a situation where opportunities like that can arise — but also that you should just say yes to things, even if they seem kind of scary, and you’re not confident they’ll go well, so long as the downside is pretty low.
And worst case, I just didn’t do a good job leading the team. I stepped down. We had to pick someone else out.
Rob Wiblin: Seems to have gone pretty well. My guest today has been Neel Nanda. Thanks so much for coming on The 80,000 Hours Podcast, Neel.
Neel Nanda: Thanks a lot for having me.