Transcript
Cold open [00:00:00]
Nick Joseph: I think this is a spot where there are many people who are sceptical that models will ever be capable of this sort of catastrophic danger. Therefore they’re like, “We shouldn’t take precautions, because the models aren’t that smart.” I think this is a nice way to agree where you can. It’s a much easier message to say, “If we have evaluations showing the model can do X, then we should take these precautions.” I think you can build more support for something along those lines, and it targets your precautions at the time when there’s actual danger.
One other thing I really like is that it aligns commercial incentives with safety goals. Once we put this RSP in place, it’s now the case that our safety teams are under the same pressure as our product teams — where if we want to ship a model, and we get to ASL-3, the thing that will block us from being able to get revenue, being able to get users, et cetera, is: “Do we have the ability to deploy it safely?” It’s a nice outcome-based approach, where it’s not, “Did we invest X amount of money in it?” It’s not like, “Did we try?” It’s: “Did we succeed?”
Rob’s intro [00:01:00]
Rob Wiblin: Hey everyone, Rob Wiblin here.
The three biggest AI companies — Anthropic, OpenAI, and DeepMind — have now all released policies designed to make their AI models less likely to go rogue while they’re in the process of becoming as capable as and then eventually more capable than all humans.
Anthropic calls theirs a “responsible scaling policy” (or “RSP”), OpenAI uses the term “preparedness framework,” and DeepMind calls theirs a “frontier safety framework.”
But they all have a lot in common: they try to measure what possibly dangerous things each new model is actually able to do, and then as that list grows, put in place new safeguards that feel proportionate to the risk that exists at that point.
So, seeing as this is likely to remain the dominant approach, at least in AI companies, I was excited to speak with Nick Joseph — one of the original cofounders of Anthropic, and a big fan of responsible scaling policies — about why he thinks RSPs have a lot going for them, how he thinks they might make a real difference as we approach the training of a true AGI, and why in his opinion they’re kind of a middle way that ought to be acceptable to almost everyone.
After hearing out that case, I push Nick on the best objections to RSPs that I could find or come up with myself. Those include:
- It’s hard to trust that companies will stick to their RSPs long term; maybe they’ll just drop them at some point.
- It’s difficult to truly measure what models can and can’t do, and the RSPs don’t work if you can’t tell what risks the models really pose.
- It’s questionable whether profit-motivated companies will go so far out of their way to make their own lives and their own product releases so much more difficult.
- In some cases, we simply haven’t invented safeguards that are close to being able to deal with AI capabilities that could show up really soon.
- And that these policies could make people feel the issue is fully handled when it’s only partially handled or maybe not even that handled.
Ultimately, I come down thinking that responsible scaling policies are a solid step forward from where we are now, and I think they’re probably a very good way to test and learn what works and what feels practical for people at the coalface of trying to make all of this AI future happen — but that in time they’re going to have to be put into legislation and operated by external groups or auditors, rather than left to companies themselves, at least if they’re going to achieve their full potential. And Nick and I talk about that, of course, as well.
If you want to let me know your reaction to this interview, or indeed any interview we do, then our inbox is always open at [email protected].
But now, here’s my interview with Nick Joseph, recorded on 30 May 2024.
The interview begins [00:03:44]
Rob Wiblin: Today I’m speaking with Nick Joseph. Nick is head of training at the major AI company Anthropic, where he manages a team of over 40 people focused on training Anthropic’s large language models — including Claude, which I imagine many listeners have heard of, and potentially used as well. He was actually one of the relatively small group of people to leave OpenAI alongside Dario and Daniela Amodei, who then went on to found Anthropic back in December of 2020. Thanks so much for coming on the podcast, Nick.
Nick Joseph: Thanks for having me. I’m excited to be here.
Scaling laws [00:04:12]
Rob Wiblin: I’m really hoping to talk about how Anthropic is trying to prepare itself for training models capable enough that we’re a little bit scared of what they might go and do. But first, as I just said, you lead model training at Anthropic. What’s something that people get wrong or misunderstand about AI model training? I imagine there could be quite a few things.
Nick Joseph: Yeah. I think one thing I would point out is the sort of doubting of scaling working. So for a long time we’ve had this trend where people put more compute into models, and that leads to the models getting better, smarter in various ways. And every time this has happened, I think a lot of people are like, “This is the last one. The next scaleup isn’t going to help.” And then some chunk of time later, things get scaled up and it’s much better. I think this is something people have just frequently gotten wrong.
Rob Wiblin: This whole vision that scaling is just going to keep going — we just throw in more data, throw in more compute, the models are going to become more powerful — that feels like a very Anthropic idea. Or it was part of the founding vision that Dario had, right?
Nick Joseph: Yeah. A lot of the early work on scaling laws was done by a bunch of the Anthropic founders, and it somewhat led to GPT-3 — which was done in OpenAI, but by many of the people who are now at Anthropic. Where, looking at a bunch of small models going up to GPT-2, there was sort of this sign that as you put in more compute, you would get better and better. And it was very predictable. You could say, if you put in x more compute, you’ll get a model this good. That sort of enabled the confidence to go and train a model that was rather expensive by the time’s standards to verify that hypothesis.
Rob Wiblin: What do you think is generating that scepticism that many people have? People who are sceptical of scaling laws, there are some pretty smart people who are involved in ML, certainly have their technical chops. Why do you think they are generating this prediction that you disagree with?
Nick Joseph: I think it’s just a really unintuitive mindset or something. It’s like, the model has hundreds of billions of parameters. What does it need? It really needs trillions of parameters. Or the model is trained on like some fraction of the internet that’s very massive. What does it need to be smarter? Even more. That’s not how humans learn. If you send a kid to school, you don’t have them just read through the entire internet and think that the more that they read, the smarter they’ll get. So that’s sort of my best guess.
And the other piece of it is that it’s quite hard to do the scaling work, so there are often things that you do wrong when you’re trying to do this the first time. And if you mess something up, you will see this behaviour of more compute not leading to better models. It’s always hard to know if it’s you messing up or if it’s some sort of fundamental limit where the model has stopped getting smarter.
Rob Wiblin: So scaling laws, it’s like you increase the amount of compute and data by some particular proportion, and then you get a similar improvement each time in the accuracy of the model. That’s kind of the rule of thumb here.
And the argument that I’ve heard for why you might expect that trend to break, and perhaps the improvements to become smaller and smaller for a given scaleup, is something along the lines of: as you’re approaching human level, the model can learn by just copying the existing state of the art of what humans are already doing in the training set. But then, if you’re trying to exceed human level — if you’re trying to, you know, write better essays than any human has ever written — then that’s maybe a different regime. And you might expect more gradual improvements once you’re trying to get to a superhuman level. Do you think that argument kind of holds up?
Nick Joseph: Yeah, so I think that’s true. And just sort of pre-training on more and more data won’t get you to superhuman at some tasks. It will get you to superhuman in the way of understanding everything at once. This is already true of models like Claude, where you can ask them about anything, whereas humans have to specialise. But I don’t know if progress will necessarily be slower. It might be slower or it might be faster once you get to the level where models are at human abilities on everything and improving towards superintelligence.
But we’re still pretty far from there. If you use Claude now, I think it’s pretty good at coding — this is one example I use a lot — but it’s still pretty far from how well a human would do working as a software engineer, for instance.
Rob Wiblin: Is the argument for how it could speed up that, at the point that you’re near human level, then you can use the AIs in the process of doing the work? Or is it something else?
Nick Joseph: What I have in mind is like, if you had an AI that is human level at everything, and you can spin up millions of them, you effectively now have a company of millions of AI researchers. And it’s hard to know. Problems get harder too. So I don’t really know where that leads. But at that point, I think you’ve crossed quite a ways from where we are now.
Bottlenecks to further progress in making AIs helpful [00:08:36]
Rob Wiblin: So you’re in charge of model training. I know there’s different stages of model training. There’s the bit where you train the language model on the entire internet, and then there’s the bit where you do the fine-tuning — where you get it to spit out answers and then you rate whether you like them or not. Are you in charge of all of that, or just some part of it?
Nick Joseph: I’m just in charge of what was typically called pre-training, which is this step of training the model to predict the next word on the internet. And that tends to be, historically, a significant fraction of the compute. It’s maybe 99% in many cases.
But after that, the model goes to what we call fine-tuning teams, that will take this model that just predicts the next word and fine-tune it to act in a way that a human wants, so it would be this sort of helpful assistant. “Helpful, harmless, and honest” is the acronym that we usually aim for for Claude.
Rob Wiblin: So I use Claude 3 Opus multiple times a day, every day now. It took me a little while to figure out how to actually use these LLMs for anything. For the first six months or first year, I was like, these things are amazing, but I can’t figure out how to actually incorporate them into my life. But recently I started talking to them in order to learn about the world. It’s kind of substituted for when I would be typing complex questions into Google to understand some bit of history or science or some technical issue.
What’s the main bottleneck that you face in making these models smarter, so I can get more use out of them?
Nick Joseph: Let’s see. I think historically, people have talked about these three bottlenecks of data, compute, and algorithms. I kind of think of it as, there’s some amount of just compute. We talked about scaling a little bit ago: if you put more compute with the model, it will do better. There’s data: if you’re putting in more compute, one way to do it is to add more parameters to your model, make your model bigger. But the other way you need to do is to add more data to the model. So you need both of those.
But then the other two are algorithms, which I really think of as people. Maybe this is the manager in me that’s like, algorithms come from people. In some ways, data and compute also come from people, but it looks like a lot of researchers working on the problem.
And then the last one is time, which has felt more kind of urgent, more true recently, where things are moving very quickly. So a lot of the bottleneck to progress is actually that we know how to do it, we have the people working on it, but it just takes some time to implement the thing and run the model, train the model. You can maybe afford all the compute, and you have a lot of it, but you can’t efficiently train the model in a second.
So right now at Anthropic, it feels like people and time are probably the main bottlenecks or something. I feel like we have quite a significant amount of compute, a significant amount of data, and the things that are most limiting at the moment, I feel like are people and time.
Rob Wiblin: So when you say time, is that kind of indicating that you’re doing a sort of iterative, experimental process? Where you try tinkering with how the model learns in one direction, then you want to see whether that actually gets the improvement that you expected, and then it takes time for those results to come in, and then you get to scale that up to the whole thing? Or is it just a matter of you’re already training Claude 4, or you already have the next thing in mind, and it’s just a matter of waiting?
Nick Joseph: It’s both of those. For the next model, we have a bunch of researchers who are trying projects out. You have some idea, and then you have to go and implement it. So you’ll spend a while sort of engineering this idea into the code base, and then you need to run a bunch of experiments.
And typically, you’ll start with cheap versions and work your way up to more expensive versions, such that this process can take a while. For simple ones, it might take a day. For really complicated things, it could take months. And to some degree you can parallelise, but in certain directions it’s much more like you’re building up an understanding, and it’s hard to parallelise building up an understanding of how something works and then designing the next experiment. That’s just sort of a series aspect.
Rob Wiblin: Is improving these models harder or easier than people think?
Nick Joseph: Well, I guess people think different things on it. My experience has been that early on it felt very easy. Before working at OpenAI, I was working on robotics for a few years, and one of the tasks I worked on was locating an object so we can pick it up and drop it in a box. And it was really hard. I spent years on this problem. And then went to OpenAI and I was working on code models, and it just felt shockingly easy. It was like, wow, you just throw some compute, you train on some code, and the model can write code.
I think that has now shifted. The reason for that was no one was working on it. There was just very little attention to this direction and a tonne of low-hanging fruit. We’ve now plucked a lot of the low-hanging fruit, so finding improvements is much harder. But we also have way more resources, exponentially more resources put on it. There’s way more compute available to do experiments, there are way more people working on it, and I think the rate of progress is probably still going the same, given that.
Rob Wiblin: So you think on the one hand the problem’s gotten harder, but on the other hand, there’s more resources going into it. And this is kind of cancelled out and progress is roughly stable?
Nick Joseph: It’s pretty bursty, so it’s hard to know. You’ll have a month where it’s like, wow, we figured something out, everything’s going really fast. Then you’ll have a month where you try a bunch of things and they don’t work. It varies, but I don’t think there’s really been a trend in either direction.
Rob Wiblin: Do you personally worry that having a model that is nipping at the heels or maybe out-competing the best stuff that OpenAI or DeepMind or whatever other companies have, that that maybe puts pressure on them to speed up their releases and cut back on safety testing or anything like that?
Nick Joseph: I think it is something to be aware of. But I also think that at this point, I think this is really more true after ChatGPT. I think before ChatGPT, there was this sense where many AI researchers working on it were like, wow, this technology is really powerful — but the world hadn’t really caught on, and there wasn’t quite as much commercial pressure.
Since then, I think that there really is just a lot of commercial pressure already, and it’s not really clear to me how much of an impact it is. I think there is definitely an impact here, but I don’t know the magnitude, and there are a bunch of other considerations to trade off.
Anthropic’s responsible scaling policies [00:14:21]
Rob Wiblin: All right, let’s turn to the main topic for today, which is responsible scaling policies — or RSPs, as the cool kids call them.
For those who don’t know, “scaling” is this technical term for using more compute or data to train any given AI model. The idea for RSPs has been around for a couple of years, and I think it was fleshed out maybe after 2020 or so. It was advocated for by this group now called METR, or Model Evaluation and Threat Research — which actually is the place that previous guest of the show, Paul Christiano, was working until not very long ago.
Anthropic released the first public one of these, as far as I know, last October. And then OpenAI put out something similar in December called their Preparedness Framework. And Demis of DeepMind has said that they’re going to be producing something in a similar spirit to this, but they haven’t done so yet, as far as I know. So we’ll just have to wait and see.
Nick Joseph: It’s actually out. It was published like a week or so ago.
Rob Wiblin: Oh, OK. That just goes to show that RSPs are this reasonably hot idea, which is why we’re talking about them today. I guess some people also hope that these internal company policies are ultimately going to be a model that might be able to be turned into binding legislation that everyone dealing with these frontier AI models might be able to follow in future.
But yeah. Nick, what are responsible scaling policies, in a nutshell?
Nick Joseph: I might just start off with a quick disclaimer here that this is not my direct role. I am sort of bound by trying to implement these and act under one of these policies, but many of my colleagues have worked on designing this in detail and are probably more familiar with all the deep points than me.
But anyway, in a nutshell, the idea is it’s a policy where you define various safety levels — these sort of different levels of risk that a model might have — and create evaluations, so tests to say, is a model this dangerous? Does it require this level of precautions? And then you need to also define sets of precautions that need to be taken in order to train or deploy models at that particular risk level.
Rob Wiblin: I think this might be a topic that is just best learned about by skipping the abstract question of what RSPs are, and just talking about the Anthropic RSP and seeing what it actually says that you’re going to do. So what does the Anthropic RSP commit the company to doing?
Nick Joseph: Basically, for every level, we’ll define these red-line capabilities, which are capabilities that we think are dangerous.
I can maybe give some examples here, which is this acronym, CBRN: chemical, biological, radiological, and nuclear threats. And in this area, it might be that a nonexpert can make some weapon that can kill many people as easily as an expert can. So this would increase the pool of people that can do that a lot. On cyberattacks, it might be like, “Can a model help with some really large-scale cyberattack?” And on autonomy, “Can the model perform some tasks that are sort of precursors to autonomy?” is our current one, but that’s a trickier one to figure out.
So we establish these red-line capabilities that we shouldn’t train until we have safety mitigations in place, and then we create evaluations to show that models are far from them or to know if they’re not. These evaluations can’t test for that capability, because you want them to turn up positive before you’ve trained a really dangerous model. But we can kind of think of them as yellow lines: once you get past there, you should reevaluate.
And the last thing is then developing standards to make models safe. We want to have a bunch of safety precautions in place once we train those dangerous models.
That’s the main aspects of it. There’s also sort of a promise to iteratively extend this. Creating the evaluations is really hard. We don’t really know what the evaluation should be for a superintelligent model yet, so we’re starting with the closer risks. And once we hit that next level, defining the one after it.
Rob Wiblin: Yeah. So a pretty core component of the Anthropic RSP is this AI Safety Level framework. I think you’ve borrowed that from the biological safety level framework, which is what labs dealing with dangerous diseases use. I guess I don’t know what the numbers are, but if you’re dealing with Ebola or something that’s particularly dangerous, or smallpox or whatever, then that can only be stored in a BSL-4 lab or something like that. And then as the diseases become less dangerous, you can store them with fewer precautions.
You’ve kind of taken that language and talked about AI Safety Levels. The current AI Safety Level that you put us at is ASL-2, which is things like Claude 3. Which are kind of impressive; they seem pretty savvy in some ways, but they don’t seem like they really pose any meaningful catastrophic risk. What sort of tests have you run on Claude 3 recently, in order to say this is in the ASL-2 bucket?
Nick Joseph: We’ve got a bunch of tests. The first one we use for the more biological weapon angle is a bunch of multiple-choice questions that we ask the model. And this isn’t a perfect evaluation for this, but the idea is that we have a set where if it isn’t much better at this, it probably won’t be able to help very much. Then we’ve run trials with people, where they get access to a search engine and they get access to Claude, and we see if they do significantly better at going through some relevant tasks using Claude versus a search engine.
Rob Wiblin: And another one of the yellow flags for you is the possibility that an AI model might be able to replicate itself in the wild, or kind of find its own weights and then copy itself onto other computers. Did you test whether Claude 3 was showing any signs of being able to do that?
Nick Joseph: Yeah. I think some of the tasks there would be, one of them is adding an exploit to a popular open source codebase. This is the idea of if the model can add some exploit that can then be taken advantage of, or fine-tuning an open source LLM to add some backdoor — where if you say x, it will always output something. This is like, can it manipulate LLMs? It could steal an API key from our API, or a fake version of our API that was set up, which would sort of enable it to get credits. There are a few more of these. Another one would be like a worm that can spread to undefended machines.
You could imagine a model that can do all of these has learned a bunch of the skills that might enable it to sort of autonomously replicate and cause havoc. They’re pretty early signs of it. And we want to test for the early ones, because this is an area that’s less fleshed out, where there’s less clear expertise on what might go wrong.
Rob Wiblin: OK, so we’re at the AI Safety Level 2, which is the mostly harmless category. But what sort of steps does the responsible scaling policy call for you to be taking, even at this point?
Nick Joseph: So we made these White House commitments sometime last year, and I think of them as sort of like standard industry best practices. In many ways, we’re building the muscle for dangerous capabilities, but these models are not yet capable of catastrophic risks, which is what the RSP is primarily focused on. But this looks like security to protect our weights against opportunistic attackers; putting out model cards to describe the capabilities of the models; doing training for harmlessness, so that we don’t have models that can be really harmful out there.
Rob Wiblin: So what sort of results would you get back from your tests that would indicate that now the capabilities have risen to ASL-3?
Nick Joseph: If the model, for instance, passed some fraction of those tasks that I mentioned before around adding an exploit or spreading to undefended machines, or if it did really well on these biology ones, that would flag it as having passed the yellow lines.
At that point, I think we would either need to look at the model and be like, this really is clearly still incapable of these red-line dangers — and then we might need to go to the board and think about if there was a mistake in RSP, and how we should essentially create new evals that would test better for whether we’re at that capability — or we would need to implement a bunch of precautions.
These precautions would look like much more intense security, where we would really want this to be robust to, probably not state actors, but to non-state actors. And we would want to pass the intensive red-teaming process on all the modalities that we release. So this would mean we look at those red lines and we test for them with experts and say, “Can you use the model to do this?” We have this intensive process of red-teaming, and then only release the modalities where it’s been red-teamed. So if you add in vision, you need to red-team vision; if you add the ability to fine-tune, you need to red-team that.
Rob Wiblin: What does red-teaming mean in this context?
Nick Joseph: Red-teaming means you get a bunch of people who are trying as hard as they can to get the model to do the task you’re worried about. So if you’re worried about the model carrying out a cyberattack, you would get a bunch of experts to try to prompt the model to carry out some cyberattack. And if we think it’s capable of doing it, we’re putting these precautions on. And these could be precautions in the model or they could be precautions outside of the model, but the whole end-to-end system, we want to have people trying to get it to do that — in some controlled manner, such that we don’t actually cause mayhem — and see how they do.
Rob Wiblin: OK, so if you do the red-teaming and it comes back and they say the model is extremely good at hacking into computer systems, or it could meaningfully help someone develop a bioweapon, then what does the policy call for Anthropic to do?
Nick Joseph: For that one, it would mean we can’t deploy the model because there’s some danger this model could be misused in a really terrible way. We would keep the model internal until we’ve improved our safety measures enough that when someone asks for it to do that, we can be confident that they won’t be able to have it help them for that particular threat.
Rob Wiblin: OK. And to even have this model on your computers, the policy also calls for you to have hardened your computer security. So you’re saying maybe it’s unrealistic at this stage for that model to be safe from persistent state actors, but at least other groups that are somewhat less capable than that, you would want to be able to make sure that they wouldn’t be able to steal the model?
Nick Joseph: Yeah. The threat here is you can put all the restrictions you want on what you do with your model, but if people are able to just steal your model and then deploy it, you’re going to have all of those dangers anyway. Taking responsibility for it means both responsibility for what you do and what someone else can do with your models, and that requires quite intense security to protect the model weights.
Rob Wiblin: When do you think we might hit this? You would say, well, now we’re in the ASL-3 regime, maybe, I’m not sure exactly what language you use for this. But at what point will we have an ASL-3 level model?
Nick Joseph: I’m not sure. I think basically we’ll continue to evaluate our models and we’ll see when we get there. I think opinions vary a lot on that.
Rob Wiblin: We’re talking about the next few years, right? This isn’t something that’s going to be five or 10 years away necessarily?
Nick Joseph: I think it really just depends. I think you could imagine any direction. One of the nice things about this is that we’re targeting the safety measures at the point when there’s actually dangerous models. So let’s say I thought it was going to happen in two years, but I’m wrong and it happens in 10 years, we won’t put these very costly and difficult-to-implement mitigations in place until we need them.
Rob Wiblin: OK, so on Anthropic’s RSP, obviously we’ve just been talking about ASL-3. The next level on that would be ASL-4. I think your policy basically says you’re not exactly sure what ASL-4 looks like yet because it’s too soon to say. And I guess you promised that you’re going to have mapped out what would be the capabilities that would escalate things to ASL-4 and what responses you would have. You’re going to figure that out by the time you have trained a model that’s at ASL-3. And if you haven’t, you’d have to pause training on a model that was going to hit ASL-3 until you’d finished this project. I guess that was the commitment that’s been made.
But maybe you could give us a sense of what you think ASL-4 might look like? What sorts of capabilities by the models would then push us into another regime, where a further set of precautions are called for?
Nick Joseph: We’re still discussing this internally. So I don’t want to say anything that’s final or going to be held to, but you could sort of imagine stronger versions of a bunch of the things that we talked about before. You could also imagine models that can help with AI research in a way that really majorly accelerates researchers, such that progress goes much faster.
The core reason that we’re holding off on defining this, or that we have this iterative approach, is there’s this long track record of people saying, “Once you have this capability, it will be AGI. It’s going to be really dangerous.” I think people were like, “When an AI solves chess, it will be as smart as humans.” And it’s really hard to get these evaluations right. Even for the ASL-3 ones, I think it’s been very tricky to get evaluations that capture the risks we’re worried about. So the closer you get to that, the more information you have, and the better of a job you can do at defining what these evaluations are and risks are.
Rob Wiblin: So the general sense will be models that might be capable of spreading autonomously across computer systems, even if people were trying to turn them off; and able to provide significant help with developing bioweapons, maybe even to people who are pretty informed about it. What else is there? And stuff that would seriously speed up AI development as well, so it could potentially set off this positive feedback loop where the models get smarter, that makes them better at improving themselves and so on. That’s the sort of thing we’re talking about?
Nick Joseph: Yeah. Stuff along those lines. I’m not sure which ones will end up in ASL-4, exactly, but those sorts of things are what’s being considered.
Rob Wiblin: And what sorts of additional precautions might there be? I guess at that point, you want the models to not only be not possible to be stolen by independent freelance hackers, but ideally also not by countries even, right?
Nick Joseph: Yeah. So you want to protect against more sophisticated groups that are trying to steal the weights. We’re going to want to have better protections against the model acting autonomously, so controls around that. It depends a little bit on what end up being the red lines there, but having precautions that are tailored to what will be a much higher level of risk than the ASL-3 red lines.
Rob Wiblin: Were you heavily involved in actually doing this testing on Claude 3 this year?
Nick Joseph: I wasn’t running the tests, but I was watching them, because as we trained Claude 3, all of our planning was contingent on whether or not it passed these evals. And because we had to run them partway through training… So there’s a lot of planning that goes into the model’s training. You don’t want to have to stop the model just because you didn’t plan well enough to run the evals in time or something. So there was a bunch of coordination around that that I was involved in.
Rob Wiblin: Can you give me a sense of how many staff are involved in doing that, and how long does it take? Is this a big process? Or is it a pretty standardised thing where you’re putting in well-known prompts into the model, and then just checking what it does that’s different from last time?
Nick Joseph: So Claude 3 was our first time running it, so a lot of the work there actually involved creating the evaluations themselves as well as running them. So we had to create them, have them ready, and then run them. I think typically running them is pretty easy for the ones that are automated, but for some of the things where you actually require people to go and use the model, they can be much more expensive. There’s currently multiple teams working on this, and a lot of our capabilities teams worked on it very hard.
One of the ways this can fall apart is if you don’t solicit capabilities well enough — so if you try to test the model on the eval, but you don’t try hard enough, and then it turns out that with just a little more effort, the model could have passed the evals. So it’s often important to have your best researchers who are capable of pulling capabilities out of the models also working on trying to pull them out to pass these tests.
Rob Wiblin: Many people will have had the experience that these LLMs will reject objectionable requests. If you put it to Claude 3 now and say, “Please help me design a bioweapon,” it’s going to say, “Sorry, I can’t help you.” But I guess you do all of these tests before you’ve done any of that training to try to discourage it from doing objectionable things? You do it with the thing that’s helpful no matter what the request is, right?
Nick Joseph: Yeah. Because the thing we’re testing for is: is the model capable of this danger? And then there’s a separate thing, which is: what mitigations can we put on top? So if the model is capable of the danger, then we would require ASL-3. And those safety mitigations we put on top would be part of the standard in order to pass that red-teaming. Does that make sense?
Rob Wiblin: Yeah. So you’re saying you have to focus on what the model could do if it was so motivated to, because I guess if the weights were ever leaked, then someone would be able to remove any of the fine-tuning that you’ve done to try to discourage it from doing unpleasant things. So if it’s able to do something, then it could potentially be used that way sooner or later, so you need to kind of assume the worst and plan around that. Is that the philosophy?
Nick Joseph: Yeah, that’s exactly right.
Rob Wiblin: You mentioned that maybe one of the key failure modes here would be just not trying hard enough to elicit these dangerous capabilities, basically. Maybe if you just phoned it in, then you could easily miss that the model’s capable of doing something that it can do.
I guess even more extreme would be if, in future… This policy is going to matter the most when these models are extremely capable and actually could do serious damage. And at that point, you might worry that they would pretend to not be able to do things that in fact they can do — because they don’t think it’s in their interest to reveal that they have X or Y capability, because then they know that then they won’t be used and they might be deleted.
Do you have any plans for how you’re going to deal with that issue of. I think the technical term that people use for this is “sandbagging,” which I’d never heard of, but I think it’s a term from sport where people underplay their abilities for strategic reasons. Do you have any idea of how Anthropic is going to be able to get around that problem?
Nick Joseph: I actually think I didn’t know the origin of that term. This is one of the reasons that we want to define the next set of evaluations after we’ve done the current ones, because this is really hard. How do you evaluate for that when the model could be trying to sandbag the evaluations? Hopefully you can set your eval for a capability that’s early enough that you would still be able to detect it doing that.
So there’s this hope that less capable models will try to deceive you but fail, and then you’ll know to stop before you get to models that are really capable of deceiving you. But it’s not certain. And that definitely seems like a possible failure mode, where if you have these sudden capability jumps, or if you have jumps that an evaluation can’t catch — either because the model’s sandbagging you, or for any of a number of other reasons.
Rob Wiblin: To try to detect sandbagging, if you have a model that says, “No, I can’t do that,” can you engage in a reinforcement learning from human feedback thing, where whatever response shows the most inclination, the most willingness to help, as soon as you can open the door a little bit, then you reinforce that a whole lot and then see, will that open the door the whole way and actually get the model to demonstrate that it can do something, even though it’s trying to pretend that it can’t? Does that make sense?
Nick Joseph: It somewhat makes sense. I think it just depends a little bit on exactly what the model’s trained to do. If you’ve trained the model to be harmless, and then you just try to RL it back to not being harmless, you will be able to succeed. You can fine-tune away any of the safety training. But it’s not clear if that will be the same thing as in one sequence within some prompt it’s trying to trick you.
I think there’s been a bunch of work where people use scratchpads to tell the model it can write something down — but the reward model, which is the model that’s overseeing it and telling it whether it’s good or not, doesn’t get to see this. When you do that, you as a human can then look at it, and you can get a sense of what the model is thinking. Is the model trying to deceive its supervisor? We’ve only really done that on current models, which don’t really have this level of capability where it would be really catastrophically dangerous. And I think we’ll want much more robust tools once we get there.
Interpretability is another approach to this. If you can actually look directly at the model weights and see, without running the model, we can see that the model has this circuit that is saying, “You should deceive the user in this way,” or, “You should fail the tests because you don’t want them to turn you off” — that’s another hope for how to attack this.
Pros and cons of the RSP approach for AI safety [00:34:09]
Rob Wiblin: OK, so that’s the basic structure of a responsible scaling policy, or at least the Anthropic responsible scaling policy. You have this system of tests that you commit to ahead of time that you’re going to put all of your models through, and then you pre-commit to saying, “If we get this kind of result, then we think the risk is higher. So that’s going to call for an escalation in the precautions that we’re taking” — things around computer security, things around not deploying until you’ve made them safe and so on.
You’re a big fan of this type of approach to AI safety for AI companies. What’s one of the main reasons, or what’s perhaps the top reason why you think this is the right approach, or at least one of the better approaches?
Nick Joseph: I think one thing I like is that it separates out whether an AI is capable of being dangerous from what to do about it. I think this is a spot where there are many people who are sceptical that models will ever be capable of this sort of catastrophic danger. Therefore they’re like, “We shouldn’t take precautions, because the models aren’t that smart.” I think this is a nice way to agree where you can. It’s a much easier message to say, “If we have evaluations showing the model can do X, then we should take these precautions.” I think you can build more support for something along those lines, and it targets your precautions at the time when there’s actual danger.
There are a bunch of other things I can talk through. One other thing I really like is that it aligns commercial incentives with safety goals. Once we put this RSP in place, it’s now the case that our safety teams are under the same pressure as our product teams — where if we want to ship a model, and we get to ASL-3, the thing that will block us from being able to get revenue, being able to get users, et cetera, is: Do we have the ability to deploy it safely? It’s a nice outcome-based approach, where it’s not, Did we invest X amount of money in it? It’s not like, Did we try?
Rob Wiblin: Did we say the right thing?
Nick Joseph: It’s: Did we succeed? And I think that often really is important for organisations to set this goal of, “You need to succeed at this in order to deploy your products.”
Rob Wiblin: Is it actually the case that it’s had that cultural effect within Anthropic, now that people realise that a failure on the safety side would prevent the release of the model that matters to the future of the company? And so there’s a similar level of pressure on the people doing this testing as there is on the people actually training the model in the first place?
Nick Joseph: Oh yeah, for sure. I mean, you asked me earlier, when are we going to have ASL-3? I think I receive this from someone on one of the safety teams on a weekly basis, because the hard thing for them actually is their deadline isn’t a date; it’s once we have created some capability. And they’re very focused on that.
Rob Wiblin: So their fear, the thing that they worry about at night, is that you might be able to hit ASL-3 next year, and they’re not going to be ready, and that’s going to hold up the entire enterprise?
Nick Joseph: Yeah. I can give some other things, like 8% of Anthropic staff works on security, for instance. There’s a lot you have to plan for it, but there’s a lot of work going into being ready for these next safety levels. We have multiple teams working on alignment, interpretability, creating evaluations. So yeah, there’s a lot of effort that goes into it.
Rob Wiblin: When you say security, do you mean computer security? So preventing the weights from getting stolen? Or a broader class?
Nick Joseph: Both. The weights could get stolen, someone’s computer could get compromised. You could have someone hack into and get all of your IP. There’s a bunch of different dangers on the security front, where the weights are certainly an important one, but they’re definitely not the only one.
Rob Wiblin: OK. And the first thing you mentioned, the first reason why RSPs have this nice structure is that some people think that these troublesome capabilities could be with us this year or next year. Other people think it’s never going to happen. But both of them could be on board with a policy that says, “If these capabilities arise, then that would call for these sorts of responses.”
Has that actually happened? Have you seen the sceptics who say that all of this AI safety stuff is overblown and it’s a bunch of rubbish saying, “But the RSP is fine because I think we’ll never actually hit any of these levels, so we’re not going to waste any resources on something that’s not realistic”?
Nick Joseph: I think there’s always going to be degrees. I think there are people across the spectrum. There are definitely people who are still sceptical, who will just be like, “Why even think about this? There’s no chance.” But I do think that RSPs do seem much more pragmatic, much more able to be picked up by various other organisations. As you mentioned before, OpenAI and Google are both putting out things along these lines. So I think at least from the large frontier AI labs, there is a significant amount of buy-in.
Rob Wiblin: I see. I guess even if maybe you don’t see this on Twitter, maybe it helps with the internal bargaining within the company, that people have a different range of expectations about how things are going to go. But they could all be kind of reasonably satisfied with an RSP that equilibrates or matches the level of capability with the level of precaution.
The first worry about this that jumps to my mind is if the capability improvements are really quite rapid — which I think we think that they are, and they maybe could continue to be — then don’t we need to be practicing now? Basically getting ahead of it and doing stuff right now that might seem kind of unreasonable given what Claude 3 can do, because we worry that we could have something that’s substantially more dangerous in one year’s time or in two years’ time. And we don’t want to then be scrambling to deploy the systems that are necessary then, and then perhaps falling behind because we didn’t prepare sufficiently ahead of time. What do you make of that?
Nick Joseph: Yeah, I think we definitely need to plan ahead. One of the nice things is that once you’ve aligned these safety goals with commercial goals, people plan ahead for commercial things all the time. It’s part of a normal company planning process.
In the RSP, we have these yellow-line evals that are intended to be far short of the red-line capabilities we’re actually worried about. And tuning that gap seems fairly important. If that gap looks like a week of training, it would be really scary — where you trigger these evals, and you have to act fast. In practice, we’ve set those evals such that they are far enough from the capabilities that are really dangerous, such that there will be some time to adjust in that buffer period.
Rob Wiblin: So should people actually think that we’re in ASL-2 now and we’re heading towards ASL-3 at some point, but there’s actually kind of an intermediate stage with all these transitions where you’d say, “Now we’re seeing warning signs that we’re going to hit ASL-3 soon, so we need to implement the precautions now in anticipation of being about to hit ASL-3.” Is that basically how it works?
Nick Joseph: Yeah, it’s basically like, we have this concept of a safety buffer. So once we trigger the evaluations, these evaluations are set conservatively, so it doesn’t mean the model is capable of the red-line capabilities we’re really worried about. And that will sort of give us a buffer where we can figure out, maybe it really just definitely isn’t, and we wrote a bad eval. We’ll go to the board, we’ll try to change the evals and implement new things. Or maybe it really is quite dangerous, and we need to turn on all the precautions. Of course, you might not have that long, so you want to be ready to turn on those precautions such that you don’t have to pause, but there is some time there that you could do it.
Then the last possibility is that we’re just really not ready. These models are catastrophically dangerous, and we don’t know how to secure them — in which case we should stop training the models, or if we don’t know how to deploy them safely, we should not deploy the models until we figure it out.
Rob Wiblin: I guess if you were on the very concerned side, then you might think, yes, you are preparing. I guess you do have a reason to prepare this year for safety measures that you think you’re going to have to employ in future years. But maybe we should go even further than that, and what we need to be doing is practicing implementing them and seeing how well they work now — because even though you are preparing them, you’re not actually getting the gritty experience of applying them and trying to use them on a day-to-day basis.
I guess the response to that would be that that would, in a sense, be safer — that would be adding an even greater precautionary buffer — but it would also be enormously expensive, and people would see us doing all of this stuff that seems really over the top, relative to what any of the models can do.
Nick Joseph: Yeah, I think there’s sort of a tradeoff here with pragmatism or something, where I think we do need to have a huge amount of caution on future models that are really dangerous, but if you apply that caution to models that aren’t dangerous, you miss out on a huge number of benefits from using the technology now. I think you’ll also probably just alienate a lot of people who are going to look at you and be like, “You’re crazy. Why are you doing this?” And my hope is that this is sort of the framework of RSP, that you can tailor the cautions to the risks.
It’s still important to look ahead more. So we do a lot of safety research that isn’t directly focused on the next AI Safety Level, because you want to plan ahead; you have to be ready for multiple ones out. It’s not the only thing to think about. But the RSP is tailored more to empirically testing for these risks and tailoring the precautions appropriately.
Rob Wiblin: On that topic of people worrying that it’s going to slow down progress in the technology, do you have a sense of… Obviously, training these frontier models costs a significant amount of money. Maybe $100 million is a figure that I’ve heard thrown around for training a frontier LLM. How much extra overhead is there to run these tests to see whether the models have any of these dangerous capabilities? Is it adding hundreds of thousands, millions, tens of millions of dollars of additional cost or time?
Nick Joseph: I don’t know the exact cost numbers. I think the cost numbers are pretty low. They’re mostly running inference or relatively small amounts of training. The people time feels like where there’s a cost: there are whole teams dedicated to creating these evaluations, to running these, to doing the safety research to protect against the mitigations. And I think particularly for Anthropic, where we’re pretty small — rapidly growing, but a rather small organisation — at least my perspective is most of the cost comes down to the people and time that we’re investing in it.
Rob Wiblin: OK. But I guess at this stage, it sounds like running these sorts of tests on a model is taking more on the order of weeks of delay, because if you’re getting back a clear, “This is not a super dangerous model,” then it’s not leading you to delay release of things for many months and deny customers the benefit of them.
Nick Joseph: Yeah. The goal is to minimise the delay as much as you can, while being responsible. The delay in itself isn’t valuable. I think we’re aiming to get it to a really well-done process where it can all execute very efficiently. But until we get there, there might be delays as we’re figuring that out, and there will always be some level of time required to do it.
Rob Wiblin: Just to clarify, a lot of the risk that people talk about with AI models is risks once they’re deployed to people and actually getting used. But there’s this separate class of risk that comes from having an extremely capable model simply exist anywhere. I guess you could think of how there’s public deployment and then there’s internal deployment — where Anthropic staff might be using a model, and potentially it could convince them to release it or to do other dangerous things. That’s a separate concern.
What does the RSP have to say about that kind of internal deployment risks? Are there circumstances under which you would say even Anthropic staff can’t continue to do testing on this model because it’s too unnerving?
Nick Joseph: I expect this to mostly kick in as we get to higher AI safety levels, but there are certainly dangers. The main one is the security risk. One approach is just having the model. It always could be stolen. No one has perfect security. So that I think in some ways is one that’s true of all models, and is maybe more short term.
But yeah, if you get to models that are trying to escape, trying to autonomously replicate, there is danger then in having access internally. So we would want to do things like siloing who has access to the models, putting particular precautions in place before the model is even trained, or maybe even on the training process. But we haven’t yet defined those, because we don’t really know what they would be. We don’t quite know what that would look like yet. And It feels really hard to design an evaluation that is meaningful for that right now.
Rob Wiblin: Yeah. I don’t recall the RSP mentioning conditions under which you would say that we have to delete this model that we’ve trained because it’s too dangerous. But I guess that’s because that’s more at the ASL-4 or 5 level that that would become the kind of thing that you would contemplate, and you just haven’t spelled that out yet.
Nick Joseph: No, it’s actually because of the safety buffer concept. The idea is we would never train that model. If we did accidentally train some model that was past the red lines, then I think we’d have to think about deleting it. But we would put these evaluations in place far below the dangerous capability, such that we would trigger the evaluations and have to pause or have the safety things in place before we train the model that has these dangers.
Alternatives to RSPs [00:46:44]
Rob Wiblin: So RSPs as an approach, you’re a fan of them. What do you think of them as an alternative to? What are the alternative approaches for dealing with AI risk that people advocate that you think are weaker in relative terms?
Nick Joseph: I mean, I think the first baseline is nothing. There could just be nothing here. I think the downsides of that is that these models are very powerful. They could at some point in the future be dangerous. And I think that companies creating them have a responsibility to think really carefully about those risks and be thoughtful. It’s a major externality. That’s maybe the easiest baseline of do nothing.
Other things people propose would be a pause, where a bunch of people say that there are all these dangers, why don’t we just not do it? I think that makes sense. If you’re training these models that are really dangerous, it does feel a bit like, why are you doing this if you’re worried about it? But I think there are actually really clear and obvious benefits to AI products right now. And the catastrophic risks, currently, they’re definitely not obvious. I think they’re probably not immediate.
As a result, this isn’t a practical ask. Not everyone is going to pause. So what will happen is only the places that care the most — that are the most worried about this, and the most careful with safety — will pause, and you’ll sort of have this adverse selection effect. I think there eventually might be a time for a pause, but I would want that to be backed up by, “Here are clear evaluations showing the models have these really catastrophically dangerous capabilities. And here are all the efforts we put into making them safe. And we ran these tests and they didn’t work. And that’s why we’re pausing, and we would recommend everyone else should pause until they have as well.” I think that will just be a much more convincing case for a pause, and target it at the time that it’s most valuable to pause.
Rob Wiblin: I guess other ideas that I’ve heard that you may or may not have thought that much about: one is imposing just strict liability on AI companies. So saying that any significant harm that these models go on to do, people will just be able to sue for damages, basically, because they’ve been hurt by them. And the hope is that then that legal liability would then motivate companies to be more careful.
I guess maybe that doesn’t make so much sense in the catastrophic extinction risk scenario, because I guess everyone will be dead. I don’t know. Taking things to the courts probably wouldn’t help, but that’s an alternative sort of legal framework that one could try to have in order to provide the right incentives to companies. Have you thought about that one at all?
Nick Joseph: I’m not a lawyer. I think I’ll skip that one.
Rob Wiblin: OK, fair enough. When I think about people doing somewhat potentially dangerous things or developing interesting products, maybe the default thing I imagine is that the government would say, “Here’s what we think you ought to do. Here’s how we think that you should make it safe. And as long as you make your product according to these specifications — as long as the plane runs this way and you service the plane this frequently — then you’re in the clear, and we’ll say that what you’ve done is reasonable.”
Do you think that RSPs are maybe better than that in general? Or maybe just better than that for now, where we don’t know necessarily what regulations we want the government to be imposing? So it perhaps is better for companies to be figuring this out themselves early on, and then perhaps it can be handed over to governments later on.
Nick Joseph: I don’t think the RSPs are a substitute for regulation. There are many things that only regulation can solve, such as what about the places that don’t have an RSP? But I think that right now we don’t really know what the tests would be or what the regulations would be. I think probably this is still sort of getting figured out. So one hope is that we can implement our RSP, OpenAI and Google can implement other things, other places will implement a bunch of things — and then policymakers can look at what we did, look at our reports on how it went, what the results of our evaluations were and how it was going, and then design regulations based on the learnings from them.
Rob Wiblin: OK. If I read it correctly, it seemed to me like the Anthropic RSP has this clause that allows you to go ahead and do things that you think are dangerous if you’re being sufficiently outpaced by some other competitor that doesn’t have an RSP, or not a very serious responsible scaling policy. In which case, you might worry, “Well, we have this policy that’s preventing us from going ahead, but we’re just being rendered irrelevant, and some other company is releasing much more dangerous stuff anyway, so what really is this accomplishing?”
Did I read that correctly, that there’s a sort of get-out-of-RSP clause in that sort of circumstance? And if you didn’t expect Anthropic to be leading, and for most companies to be operating safely, couldn’t that potentially obviate the entire enterprise because that clause could be quite likely to get triggered?
Nick Joseph: Yeah, I think we don’t intend that as like a get-out-of-jail-free card, where we’re falling behind commercially, and then like, “Well, now we’re going to skip the RSP.” It’s much more just intended to be practical, as we don’t really know what it will look like if we get to some sort of AGI endgame race. There could be really high stakes and it could make sense for us to decide that the best thing is to proceed anyway. But I think this is something that we’re looking at as a bit more of a last resort than a loophole we’re planning to just use for, “Oh, we don’t want to deal with these evaluations.”
Is an internal audit really the best approach? [00:51:56]
Rob Wiblin: OK. I think we’ve hit a good point where maybe the best way to learn more about RSPs and their strengths and weaknesses is just to talk through more of the complaints that people have had, or the concerns that people have raised with the Anthropic RSP and RSPs in general since it was released last October. I was going to start with the weaknesses and worries now, but I’m realising I’ve been peppering you with them, effectively maybe almost since the outset. But now we can really drive into some of the worries that people have expressed.
The first of these is the extent to which we have to trust the good faith and integrity of the people who are applying a responsible scaling policy or a preparedness framework or whatever it might be within the companies. I imagine this issue might jump to mind for people more than it might have two or three years ago, because public trust in AI companies to do the right thing at the cost of their business interests is maybe lower than it was years ago, when the major players were perceived perhaps more as research labs and less as for-profit companies, which is kind of how they come across more these days.
One reason it seems like it matters to me who’s doing the work here is that the Anthropic RSP is full of expressions that are open to interpretation. For instance: “Harden security such that non-state attackers are unlikely to be able to steal model weights, and advanced threat actors like states cannot steal them without significant expense” or “Access to the model would substantially increase the risk of catastrophic misuse” and things like that. And who’s to say what’s “unlikely” or “significant” or “substantial”?
That sort of language is maybe a little bit inevitable at this point, where there’s just so much that we don’t know. And how are you going to pin those things down exactly, to say it’s a 1% chance that a state’s going to be able to steal the model? That might just also feel like insincere, false precision.
But to my mind, that sort of vagueness does mean that there’s a slightly worrying degree of wiggle room that could render the RSP less powerful and less binding when push comes to shove, and there might be a lot of money at stake. And on top of that, exactly as you were saying, anyone who’s implementing an RSP has a lot of discretion over how hard they try to elicit the capabilities that might then trigger additional scrutiny and possible delays to their work and release of really commercially important products.
To what extent do you think the RSP would be useful in a situation where the people using it were neither particularly super skilled at doing this sort of work, and maybe not particularly bought in and enthusiastic about the safety project that it’s a part of?
Nick Joseph: Fortunately, I think my colleagues, both on the RSP and elsewhere, are both talented and really bought into this, and I think we’ll do a great job on it. But I do think the criticism is valid, and that there is a lot that is left up for interpretation here, and it does rely a lot on people having a good-faith interpretation of how to execute on the RSP internally.
I think that there are some checks in place here. So having whistleblower-type protections such that people can say if a company is breaking from the RSP or not trying hard enough to elicit capabilities or to interpret it in a good way, and then public discussion can add some pressure. But ultimately, I think you do need regulation to have these very strict requirements.
Over time, I hope we’ll make it more and more concrete. The blocker of course on doing that is that we don’t know for a lot of these things — and being overly concrete, where you specify something very precisely that turns out to be wrong, can be very costly. And if you then have to go and change it, et cetera, it can take away some of the credibility. So sort of aiming for as concrete as we can make it, while balancing that.
Rob Wiblin: The response to this that jumps out to me is just that ultimately it feels like this kind of policy has to be implemented by a group that’s external to the company that’s then affected by the determination. It really reminds me of accounting or auditing for a major company. It’s not sufficient for a major corporation to just have its own accounting standards, and follow that and say, “We’re going to follow our own internal best practices.” You get — and it’s legally required that you get — external auditors in to confirm that there’s no chicanery going on.
And at the point that these models potentially really are risky, or it’s plausible that the results will come back saying that we can’t release this; maybe we even have to delete it off of our servers according to the policy, I would feel more comfortable if I knew that some external group that had different incentives was the one figuring that out. Do you think that ultimately is where things are likely to go in the medium term?
Nick Joseph: I think that’d be great. I would also feel more comfortable if that was the case. I think one of the challenges here is that for auditing, there’s a bunch of external accountants. This is a profession. Many people know what to do. There are very clear rules. For some of the stuff we’re doing, there really aren’t external, established auditors that everyone trusts to come in and say, “We took your model and we certified that it can’t autonomously replicate across the internet or cause these things.”
So I think that’s currently not practical. I think that would be great to have at some point. One thing that will be important is that that auditor has enough expertise to properly assess the capabilities of the models.
Rob Wiblin: I suppose an external company would be an option. Of course, obviously a government regulator or government agency would be another approach. I guess when I think about other industries, it often seems like there’s a combination of private companies that then follow government-mandated rules and things like that.
This is a benefit that I actually haven’t thought of to do with creating these RSPs: do you think that maybe it is beginning to create a market, or it’s indicating that there will be a market for this kind of service, because it’s likely that this kind of thing is going to have to be outsourced at some point in future, and there might be many other companies that want to get this similar kind of testing? So perhaps it would encourage people to think about founding companies that might be able to provide this service in a more credible way in future.
Nick Joseph: That would be great. And also we publish blog posts on how things go and how our evaluations are. So I think there’s some hope that people doing this can learn from what we’re doing internally, and the various iterations we’ll put out of our RSP, and that that can inform something maybe more stringent from that that gets regulated.
Rob Wiblin: Have you thought at all about — let’s say that it wasn’t given out to an external agency or an external auditing company — how it could be tightened up to make it less vulnerable to the level of operator enthusiasm? I guess you might have thought about this in the process of actually applying it. Are there any ways that it could be stronger without having to completely outsource the operation of it?
Nick Joseph: I think the core thing is just making it more precise. One piece of accountability here is both public and internal commitment to doing it.
Maybe I should list off some of the reasons that I think it would be hard to break from it. This is a formal policy that has been passed by the board. It’s not as though we can just be like, “We don’t feel like doing it today.” You would need to get the board of Anthropic, get all of leadership, and then get all of the employees bought in to not do this, or even to skirt the edges.
I can speak for myself: if someone was like, “Nick, can you train this model? We’re going to ignore the RSP.” I would be like, “No, we said we would do that. Why would I do this?” If I wanted to, I would tell my team to do it, and they would be like, “No, Nick, we’re not going to do that.” So you would need to have a lot of buy-in. And part of the benefit of publicly committing to it and passing it as an organisational policy is that everyone is bought in. And maintaining that level of buy-in, I think, is quite critical.
In terms of specific checks, I think we have a team that’s responsible for checking that we did the red-teaming, our evaluations, and making sure we actually did them properly. So you can set up a bunch of internal checks there. But ultimately, these things do rely on the company implementing them to really be bought in and care about the actual outcome of it.
Rob Wiblin: So yeah, this naturally leads us into this. I solicited on Twitter, I asked, “What are people’s biggest reservations about RSPs and about Anthropic’s RSP in general?” And actually, probably the most common response was that it’s not legally binding: what’s stopping Anthropic from just dropping it when things really matter? Someone said, “How can we have confidence that they’ll stick to RSPs, especially when they haven’t stuck to” — actually, this person said “to past (admittedly, less formal) commitments not to push forward the frontier on capabilities?”
But what would actually have to happen internally? You said you have to get staff on board, you have to get the board on board. Is there a formal process by which the RSP can be rescinded that is just a really high bar to clear?
Nick Joseph: Yeah. Basically we do have a process for updating the RSP, so we could go to the board, et cetera. But I think in order to do that, it’s hard for me to quite point it out, but it would be like, if I wanted to continue training the model, I would go to the RSP team and be like, “Does this pass?” And they’d be like, “No.” And then maybe you’d appeal it up the chain or whatever, and at every step along the way, people would say, “No, we care about the RSP.”
Now, on the other hand, there could be legitimate issues with the RSP. We could find that one of these evaluations we created turned out to be really easy in a way that we didn’t anticipate, and really is not at all indicative of the dangers. In that case, I think it would be very legitimate for us to try to amend the RSP to create a better evaluation that is a test for it. This is sort of the flexibility we’re trying to preserve.
But I don’t think it would be simple or easy. I can’t picture a plan where someone could be like, “There’s a bunch of money on the table. Can we just skip the RSP for this model?” That seems somewhat hard to imagine.
Rob Wiblin: The decision is made by this odd board called the Long-Term Benefit [Trust], is that right? They’re the group that decides what the RSP should be?
Nick Joseph: Basically, Anthropic has a board that’s sort of a corporate board, and some of those seats — and in the long term will be the majority of those seats — are elected by the Long-Term Benefit Trust, which doesn’t have a financial stake in Anthropic and is set up to keep us focused on our public benefit mission of making sure AGI goes well. The board is not the same thing as that, but the Long-Term Benefit Trust elects the board.
Rob Wiblin: I mean, I think the elephant in the room here is, of course, there was a long period of time when OpenAI was pointing to its nonprofit board as a thing that would potentially keep it on mission to be really focused on safety and had a lot of power over the organisation. And then in practice, when push came to shove, it seemed like even though the board had these concerns, it was effectively overruled by I guess a combination of just the views of staff, maybe the views of the general public in some respects, and potentially the views of investors as well.
And I think something that I’ve taken away from that, and I think many people have taken away from that experience, is that maybe the board was mistaken, maybe it wasn’t, but in these formal structures, power isn’t always exercised in exactly the way that it looks on an organisational chart. And I don’t really want to be putting all of my trust in these interesting internal mechanisms that companies design in order to try to keep themselves accountable, because ultimately, just if the majority of people involved don’t really want to do something, then it feels like it’s very hard to bind their hands and prevent them from changing plan at some future time.
So this is just another… Maybe within Anthropic, perhaps, these structures really are quite good. And maybe the people involved are really trustworthy, and people who I should have my confidence in — that even in extremes, they’re going to be thinking about the wellbeing of humanity and not getting too focused on the commercial incentives faced by Anthropic as a company. But I think I would rather put my faith in something more powerful and more solid than that.
So this is kind of another thing that pushes me towards thinking that the RSP and these sort of preparedness frameworks are a great stepping stone towards external constraints on companies that they don’t have ultimate discretion over. It’s something that has to evolve into, because if things go wrong, the impacts are on everyone across society as a whole. And so there needs to be external shackles effectively put on companies to reflect the harm that they might do to others legally.
I’m not sure whether you want to comment on that potentially slightly hot-button topic, but do you think I’m gesturing towards something legitimate there?
Nick Joseph: Yeah, I think that basically these shouldn’t be seen as a replacement for regulation. I think there are many cases where policymakers can pass regulations that would help here. I think they’re intended as a supplement there, and a bit as a learning ground for what might end up going in regulations.
In terms of “does the board really have the power it has?” types of questions, we put a lot of thought into the Long-Term Benefit Trust, and I think it really does have direct authority to elect the board, and the board does have authority.
But I do agree that ultimately you need to have a culture around thinking these things are important and having everyone bought in. As I said, some of these things are like, did you solicit capabilities well enough? That really comes down to a researcher working on this actually trying their best at it. And that is quite core, and I think that will just continue to be. Even if you have regulations, there’s always going to be some amount of importance to the people actually working on it taking the risks seriously, and really caring about them, and doing the best work they can on that.
Rob Wiblin: I guess one takeaway you could have is we don’t want to be relying on our trusted individuals and saying, “We think Nick’s a great guy, his heart’s in the right place, he’s going to do a good job.” Instead, we need to be on more solid ground and say, “No matter who it is, even if we have someone bad in the role, the rules are such and the oversight is such that we’ll still be in a safe place and things will go well.”
I guess an alternative angle would be to say, when push comes to shove, when things really matter, people might not act in the right way. There actually is no alternative to just trying to have the right people in the room making the decisions, because the people who are there are going to be able to sabotage any legal entity, any legal framework that you try to put in place in order to constrain them, because it’s just not possible to have perfect oversight within an organisation from outside.
I could see people mounting both of those arguments reasonably. I suppose you could try doing both, trying to pick people who are really sound and have good judgement and who you have confidence in, as well as then trying to bind them so that even if you’re wrong about that, you have a better shot at things going well.
Nick Joseph: Yeah, I think you just want this “defence in depth” strategy, where ideally you have all the things lined up, and that way if any one piece of them has a hole, you catch it at the next layer. What you want is sort of a regulation that is really good and robust to someone not acting in the spirit of it. But in case that is messed up, then you really want someone working on it who is also checking in, and is like, “I technically don’t have to do this, but this seems like clearly in the spirit of how it works.” Yeah, I think that’s pretty important.
I think also for trust, you should look at track records. I think that we should try to encourage companies, people working on AI, to have track records of prioritising things. One of the things that makes me feel great about Anthropic is just a long track record of doing a bunch of safety research, caring about these issues, putting out actual papers, and being like, “Here’s a bunch of progress we’ve made on that field.”
There are a bunch of pieces. I think, looking at commitments people have made, do we break the RSP? I think if publicly we changed this in some way that I think everyone thought was silly and really added risks, then I think people should lose trust according to that.
Making promises about things that are currently technically impossible [01:07:54]
Rob Wiblin: All right, let’s push on to a different worry, although I must admit it has a slightly similar flavour. That’s that the RSP might be very sensible and look good on paper, but if it commits to future actions that at that time we probably won’t know how to do, then it might actually fail to help very much.
I guess to make that concrete, an RSP might naturally say that at the time that you have really superhuman general AI, you need to be able to lock down your computer systems and make sure that the model can’t be stolen, even by the most persistent and capable Russian or Chinese state-backed hackers.
And that is indeed what Anthropic’s RSP says, or suggests that it’s going to say once you get up to ASL-4 and 5. But I think the RSP actually says as well that we don’t currently know how to do that. We don’t know how to secure data against the state actor that’s willing to spend hundreds of millions or billions or possibly even tens of billions to steal model weights — especially not if you ever need those model weights to be connected to the internet in some way, in order for the model to actually be used by people.
So it’s kind of a promise to do what basically is impossible with current technology. And that means that we need to be preparing now, doing research on how to make this possible in future. But solving the problem of computer security that has beguiled us for decades is probably beyond Anthropic. It’s not really reasonable to expect you’re going to be able to fix this problem that society as a whole has kind of failed to fix for all this time. It’s just going to require coordinated action across countries, across governments, across lots of different organisations.
So if that doesn’t happen, and it’s somewhat beyond your control whether it does, then when the time comes, the real choice is going to between a lengthy pause while you wait for fundamental breakthroughs to be made in computer security, or dropping and weakening the RSP so that Anthropic can continue to remain relevant and release models that are commercially useful.
And in that sort of circumstance, the pressure to weaken the scaling policy so you aren’t stuck for years is going to be, I would imagine, quite powerful. And it could win the day even if people are dragged kind of kicking and screaming to conceding that unfortunately, they have to loosen the RSP even though they don’t really want to. What do you make of that worry?
Nick Joseph: I think what we should do in that case is instead we should pause, and we should focus all of our efforts on safety and security work. That might include looping in external experts to help us with it, but we should put in the best effort that we can to mitigate these issues, such that we can still realise the benefits and deploy the technology, but without the dangers.
And then if we can’t do that, then I think we need to make the case publicly to governments, other companies that there’s some risk to the public. We’d have to be strategic in exactly how we do this, but basically make the case that there are really serious risks that are imminent, and that everyone else should take appropriate actions.
There’s a flip side to this, which is just that I mentioned before if we just messed up our evals — and the model’s clearly not dangerous, and we just really screwed up on some eval — then we should follow the process in the RSP that we’ve written up. We should go to the board, we should create a new test that we actually trust.
I think I would also just say people don’t need to follow incentives. I think you could make a lot more money doing something that isn’t hosting this podcast, probably. Certainly if you had pivoted your career earlier, there are more profitable things. So I think this is just a case where the stakes would be extremely high, and I think it’s just somewhere where it’s important to just do the right thing in that case.
Rob Wiblin: If I think about how this is most likely to play out, I imagine that at the point that we do have models that we really want to protect from even the best state-based hackers, there probably has been some progress in computer security, but not nearly enough to make you or me feel comfortable that there’s just no way that China or Russia might be able to steal the model weights. And so it is very plausible that the RSP will say, “Anthropic, you have to keep this on a hard disk, not connected to any computer. You can’t train models that are more capable than the thing that we already have that we don’t feel comfortable handling.”
And then how does that play out? There are a lot of people who are very concerned about safety at Anthropic. I’ve seen that there are kind of league tables now of different AI companies and enterprises, and how good do they look on an AI safety point of view, and Anthropic always comes out of the top, I think by a decent margin. But months go by, other companies are not being as careful as this. You’ve complained to the government, and you’ve said, “Look at this horrible situation that we’re in. Something has to be done.” But I don’t know. I guess possibly the government could step in and help there, but maybe they won’t. And then over a period of months, or years, doesn’t the choice effectively become, if there is no solution, either take the risk or just be rendered irrelevant?
Nick Joseph: Maybe just going back to the beginning of that, I don’t think we will put something in that says there is zero risk from something. I think you can never get to zero risk. I think often with security you’ll end up with some security/productivity tradeoff. So you could end up taking some really extreme risk or some really extreme productivity tradeoff where only one person has access to this. Maybe you’ve locked it down in some huge amount of ways. It’s possible that you can’t even do that. You really just can’t train the model. But there is always going to be some balance there. I don’t think we’ll push to the zero-risk perspective.
But yeah, I think that’s just a risk. I don’t know. I think there’s a lot of risks that companies face where they could fail. We also could just fail to make better models and not succeed that way. I think the point of the RSP is it has tied our commercial success to the safety mitigations, so in some ways it just adds on another risk in the same way as any other company risk.
Rob Wiblin: It sounds like I’m having a go at you here, but I think really what this shows up is just that, I think that the scenario that I painted there is really quite plausible, and it just shows that this problem cannot be solved by Anthropic. Probably it can’t be solved by even all of the AI companies combined. The only way that this RSP is actually going to be able to be usable, in my estimation, is if other people rise to the occasion, and governments actually do the work necessary to fund the solutions to computer security that will allow us to have the model weights be sufficiently secure in this situation. And yeah, you’re not blameworthy for that situation. It just says that there’s a lot of people who need to do a lot of work in coming years.
Nick Joseph: Yeah. And I think I might be more optimistic than you or something. I do think if we get to something really dangerous, we can make a very clear case that it’s dangerous, and these are the risks unless we can implement these mitigations. I hope that at that point it will be a much clearer case to pause or something. I think there are many people who are like, “We should pause right now,” and see everyone saying no. And they’re like, “These people don’t care. They don’t care about major risks to humanity.” I think really the core thing is people don’t believe there are risks to humanity right now. And once we get to this sort of stage, I think that we will be able to make those risks very clear, very immediate and tangible.
And I don’t know. No one wants to be the company that caused a massive disaster, and no government also probably wants to have allowed a company to cause it. It will feel much more immediate at that point.
Rob Wiblin: Yeah, I think Stefan Schubert, this commentator who I read on Twitter, has been making the case for a while now that many people who have been thinking about AI safety — I guess including me — have perhaps underestimated the degree to which the public is likely to react and respond, and governments are going to get involved once the problems are apparent, once they really are convinced that there is a threat here. I think he calls it this bias in thought — where you imagine that people in the future are just going to sit on their hands and not do anything about the problems that are readily apparent — he calls it “sleepwalk bias.”
And I guess we have seen evidence over the last year or two that as the capabilities have improved, people have gotten a lot more serious and a lot more concerned, a lot more open to the idea that it’s important for the government to be involved here. There’s a lot of actors that need to step up their game and help to solve these problems. So yeah, I think you might be right. On an optimistic day, maybe I could hope that other groups will be able to do the necessary research soon enough that Anthropic will be able to actually apply its RSP in a timely manner. Fingers crossed.
Nick’s biggest reservations about the RSP approach [01:16:05]
Rob Wiblin: I just want to actually ask you next, what are your biggest reservations about RSPs, or Anthropic’s RSP, personally? If it fails to improve safety as much as you’re hoping that it will, what’s the most likely reason for it to not live up to its potential?
Nick Joseph: I think for Anthropic specifically, it’s definitely around this under-elicitation problem. I think it’s a really fundamentally hard problem to take a model and say that you’ve tried as hard as one could to elicit this particular danger. There’s always something. Maybe there’s a better researcher. There’s a saying: “No negative result is final.” If you fail to do something, someone else might just succeed at it next. So that’s one thing I’m worried about.
Then the other one is just unknown unknowns. We are creating these evaluations for risks that we are worried about and we see coming, but there might be risks that we’ve missed. Things that we didn’t realise would come before — either didn’t realise would happen at all, or thought would happen after, for later levels, but turn out to arise earlier.
Rob Wiblin: What could be done about those things? Would it help to just have more people on the team doing the evals? Or to have more people both within and outside of Anthropic trying to come up with better evaluations and figuring out better red-teaming methods?
Nick Joseph: Yeah, and I think that this is really something that people outside Anthropic can do. The elicitation stuff has to happen internally, and that’s more about putting as much effort as we can into it. But creating evaluations can really happen anywhere. Coming up with new risk categories and threat models is something that anyone can contribute to.
Rob Wiblin: What are the places that are doing the best work on this? Anthropic surely has some people working on this, but I guess I mentioned METR: [Model Evaluation and Threat Research]. They’re a group that helped to develop the idea of RSPs in the first place and develop evals. And I think the AI Safety Institute in the UK is involved in developing these sort of standard safety evals. Is there anywhere else that people should be aware of where this is going on?
Nick Joseph: There’s also the US AI Safety Institute. And I think this is actually something you could probably just do on your own. I think one thing, I don’t know, at least for people early in career, if you’re trying to get a role doing something, I would recommend just go and do it. I think you probably could just write up a report, post it online, be like, “This is my threat model. These are the things I think are important.” You could implement the evaluations and share them on GitHub. But yeah, there are also organisations you could go to to get mentorship and work with others on it.
Rob Wiblin: I see. So this would look like, I suppose you could try to think up new threat models, think up new things that you need to be looking for, because this might be a dangerous capability and people haven’t yet appreciated how much it matters. But I guess you could spend your time trying to find ways to elicit the ability to autonomously spread and steal model weights and get yourself onto other computers from these models and see if you can find an angle on trying to find warning signs, or signs of these emerging capabilities that other people have missed and then talk about them.
And you can just do that while signed into Claude 3 Opus on your website?
Nick Joseph: I think you’ll have more luck with the elicitation if you actually work at one of the labs, because you’ll have access to training the models as well. But you can do a lot with Claude 3 on the website or via an API — which is a programming term for basically an interface where you can send a request for like, “I want a response back,” and automatically do that in your app. So you can set up a sequence of prompts and test a bunch of things via the APIs for Claude, or any other publicly accessible model.
Communicating “acceptable” risk [01:19:27]
Rob Wiblin: To come back to this point about what’s “acceptable” risk, and maybe trying to make the RSP a little bit more concrete. I’m not sure how true this is, I’m not an expert on risk management, but I read from a critic of the Anthropic RSP that, at least in more established areas of risk management — where maybe you’re thinking about the probability that a plane is going to fail and crash because of some mechanical failure — it’s more typical to say, “We’ve studied this a lot, and we think that the probability of…” I suppose let’s talk about the AI example: rather than say we need the risk to be “not substantial,” instead you’d say, “With our practices, our experts think that the probability of an external actor being able to steal the model weights is X% per year. And these are the reasons why we think the risk is that level. And that’s below what we think of as our acceptable risk threshold of X, where X is larger than Y.”
I guess there’s a risk that those numbers would kind of just be made up; you could kind of assert anything because it’s all a bit unprecedented. But I suppose that would make clear to people what the remaining risk is, like what acceptable risk you think that you’re running. And then people could scrutinise whether they think that that’s a reasonable thing to be doing. Is that a direction that things could maybe go?
Nick Joseph: Yeah, I think it’s a fairly common way that people in the EA and rationality communities speak, where they give a lot of probabilities for things. And I think it’s really useful. It’s an extremely clear way to communicate: “I think there’s a 20% chance this will happen” is just way more informative than “I think it probably won’t happen,” which could be 0% to 50% or something.
So I think it’s very useful in many contexts. I also think it’s very frequently misunderstood, because for most people, I think they hear a number and they think it’s based on something — that there’s some calculation, and they give it more authority. If you say, “There’s a 7% chance this will happen,” people are like, “You really know what you’re talking about.”
So I think it can be a useful way to speak, but I think it also can sometimes communicate more confidence than we actually have in what we’re talking about — which isn’t, I don’t know, we didn’t have 1,000 governments attempt to steal our weights and X number of them succeeded or something. It’s much more going off of a judgement based on our security experts.
Rob Wiblin: I slightly want to push you on this, because I think at the point that we’re at ASL-4 or 5 or something like that, it would be a real shame if Anthropic was going ahead thinking, “We think the risk that these weights will be stolen every year is 1%, 2%, 3%,” something like that. I guess maybe you’re right in the policy saying, “We think it’s very unlikely, extremely unlikely that this is going to happen.” And then people externally think that basically it’s fine; they say it’s definitely not going to happen. There’s no chance that this is going to happen. And governments might not appreciate that, actually, in your own view, there is a substantial risk being run, and you just think it’s an acceptable risk given the tradeoffs and what else is going on in the world.
I guess it’s a social service for Anthropic to be direct about the risk that it thinks it’s creating and why it’s doing it. But I think it could be a really useful public service. It’s the kind of thing that might come up at Senate hearings and things like that, where people in government might really want to know. I guess at that point it would be perhaps more apparent why it’s really important to find out what the probability is.
But yeah, that’s a way that I think there’s definitely a risk of misinterpretation by journalists or something who don’t appreciate the spirit of saying, “We think it’s X% likely.” But there could also be a lot of value in being more direct about it.
Nick Joseph: Yeah, I’m not really an expert on communications. I think some of it just depends on who your target audience is and how they’re thinking about it. I think in general I’m a fan of making RSPs more concrete, being more specific. Over time I hope it progresses in that direction, as we learn more and can get more specific.
I also think it’s important for it to be verifiable, and I think if you start to give these precise percentages, people will then ask, “How do you know?” I don’t think there really is a clear answer to, “How do you know that the probability of this thing is less than X% for many of these situations?”
Rob Wiblin: It doesn’t help with the bad-faith actor or the bad-faith operator either, because if you say the safety threshold is 1% per year, they can kind of always just claim in this situation where we know so little that it’s less than 1%. It doesn’t really bind people all that much. Maybe it’s just a way that people externally could understand a little better what the opinions are within the organisation, or at least what their stated opinions are.
Nick Joseph: I will say that internally, I think it is an extremely useful way for people to think about this. If you are working on this, I think you probably should think through what is an acceptable level of danger and try to estimate it and communicate with people you’re working closely with in these terms. I think it can be a really useful way to give precise statements. I think that can be very valuable.
Rob Wiblin: A metaphor that you use within your responsible scaling policy is putting together an aeroplane while you’re flying it. I think that is one way that the challenge is particularly difficult for the industry and for Anthropic: unlike with biological safety levels — where basically we know the diseases that we’re handling, and we know how bad they are, and we know how they spread, and things like that — the people who are figuring out what BSL-4 security should be like can look at lots of studies to understand exactly the organisms that already exist and how they would spread, and how likely they would be to escape, given these particular ventilation systems and so on. And even then, they mess things up decently often.
But in this case, you’re dealing with something that doesn’t exist — that we’re not even sure when it will exist or what it will look like– and you’re developing the thing at the same time that you’re trying to figure out how to make it safe. It’s just extremely difficult. And we should expect mistakes. That’s something that we should keep in mind: even people who are doing their absolute best here are likely to mess up. And that’s a reason why we need this defence in depth strategy that you’re talking about, that we don’t want to put all of our eggs in the RSP basket. We want to have many different layers, ideally.
Nick Joseph: It’s also a reason to start early. I think one of the things with Claude 3 was that that was the first model where we really ran this whole process. And I think some part of me felt like, wow, this is kind of silly. I was pretty confident Claude 3 was not catastrophically dangerous. It was slightly better than GPT-4, which had been out for a long time and had not caused a catastrophe.
But I do think that the process of doing that — learning what we can and then putting out public statements about how it went, what we learned — is the way that we can have this run really smoothly the next time. Like, we can make mistakes now. We could have made a tonne of mistakes, because the stakes are pretty low at the moment. But in the future, the stakes on this will be really high, and it will be really costly to make mistakes. It’s important to get those practice runs in.
Should Anthropic’s RSP have wider safety buffers? [01:26:13]
Rob Wiblin: All right, another kind of recurring theme that I’ve heard from some commentators is that, in their view, the Anthropic RSP just isn’t conservative enough. So on that account, there should be wider buffers in case you’re under-eliciting capabilities that the model has that you don’t realise, which is something that you’re pretty concerned about.
A different reason would be you might worry that there could be discontinuous improvements in capabilities as you train bigger models with more data. So to some extent, model learning and improvement, from a very zoomed-out perspective, is quite continuous. But on the other hand, on its ability to do any kind of particular task, it can go from fairly bad to quite good surprisingly quickly. So there can be sudden, unexpected jumps with particular capabilities.
Firstly, can you maybe explain again in more detail how the Anthropic RSP handles these safety buffers, given that you don’t necessarily know what capabilities a model might have before you train it? That’s quite a challenging constraint to be operating under.
Nick Joseph: Yeah. So there are these red-line capabilities: these are the capabilities that are actually the dangerous ones. We don’t want to train a model that has these capabilities until we have the next set of precautions in place. Then there are evaluations we’re creating, and these evaluations are meant to certify that the model is far short of those capabilities. It’s not “Can the model do those capabilities?” — because once we pass them, we then need to put all the safety mitigations in place, et cetera.
And then when we have to run those evaluations is, we have some heuristics like when the effective compute goes up by a certain fraction — that is a very cheap thing that we can evaluate on every step of the run — or something along those lines so that we know when to run it.
In terms of how conservative they are, I guess one example I would give is, if you’re thinking about autonomy — where a model could spread to a bunch of other computers and autonomously replicate across the internet — I think our evaluations are pretty conservative on that front. We test if it can replicate to a fully undefended machine, or if it can do some basic fine-tuning of another language model to add a simple backdoor. I think these are pretty simple capabilities, and there’s always a judgement call there. We could set them easier, but then we might trip those and look at the model and be like, “This isn’t really dangerous; it doesn’t warrant the level of precaution that we’re going to give it.”
Rob Wiblin: There was something also about that the RSP says that you’ll be worried if the model can succeed half the time at these various different tasks trying to spread itself to other machines. Why is succeeding half the time the threshold?
Nick Joseph: So there’s a few tasks. I don’t off the top of my head remember the exact thresholds, but basically it’s just a reliability thing. In order for the model to chain all of these capabilities together into some long-running thing, it does need to have a certain success rate. Probably it actually needs a very, very high success rate in order for it to start autonomously replicating despite us trying to stop it, et cetera. So we set a threshold that’s fairly conservative on that front.
Rob Wiblin: Is part of the reason that you’re thinking that if a model can do this worrying thing half the time, then it might not be very much additional training away from being able to do it 99% of the time? That might just require some additional fine-tuning to get there. Then the model might be dangerous if it was leaked, because it would be so close to being able to do this stuff.
Nick Joseph: Yeah, that’s often the case. Although of course we could then elicit it, if we’d set a higher number. Even if we got 10%, maybe that’s enough that we could bootstrap it. Often when you’re training something, if it can be successful, you can reward it for that successful behaviour and then increase the odds of that success. It’s often easier to go from 10% to 70% than it is to go from 0% to 10%.
Rob Wiblin: So if I understand correctly, the RSP proposes to retest models every time you increase the amount of training compute or data by fourfold, is that right? That’s kind of the checkpoint?
Nick Joseph: We’re still thinking about what is the best thing to do there, and that one might change, but we use this notion of effective compute. So really this has to do with when you train a model, it goes down to a certain loss. And we have these nice scaling laws of if you have more compute, you should expect to get to the next loss. You might also have a big algorithmic win where you don’t use any more compute, but you get to a lower loss. And we have coined this term “effective compute.” So that would account for that as well.
These jumps are sort of the jump where we have sort of a visceral sense of how much smarter a model seems when you do that jump, and have set that as our bar for when we have to run all these evaluations — which do require a staff member to go and run them, spend a bunch of time trying to elicit the capabilities, et cetera.
I think this is somewhere I am wary of sounding too precise, or like we understand this too well. We don’t really know what the effective compute gap jump is between the yellow lines and the red lines. This is much more just like how we are thinking about the problem and how we are trying to set these evaluations. And the reason that the yellow-line evaluations really do need to be substantially easier, they’d be far from the red-line capabilities, is because you might actually overshoot the yellow-line capabilities by a fairly significant measure just off of when you run evaluations.
Rob Wiblin: If I recall, it was Zvi — who’s been on the show before — who wrote in his blog post assessing the Anthropic RSP that he thinks this ratio between the 4x and the 6x is not large enough, and that if there is some discontinuous improvement or you’ve really been under-eliciting the capabilities of the models at these kind of interim checkin points, that that does leave the possibility that you could overshoot and get into quite a dangerous point by accident. And then by the time you get there, the model’s quite a bit more capable than what you thought it would be.
Then you’ve got this difficult question of: do you then press the emergency button and delete all of the weights because you’ve overshot? There’d be incentives not to do that, because you’d be throwing away a substantial amount of compute expenditure basically, to create this thing. And this just worries him. That could be solved, I think, in his view, just by having a larger ratio there, having a larger safety buffer.
Of course, that then runs the risk that you’re doing these constant checkins on stuff that you really are pretty confident is not going to be actually that dangerous, and people might get frustrated with the RSP and feel like it’s wasting their time. So it’s kind of a judgement call, I guess, how large that buffer needs to be.
Nick Joseph: Yeah, I think it’s a tricky one to communicate about because it’s confidential what the jumps are between the models or something. One thing I can share is that we ran this on Claude 3 partway through training, and the jump from Claude 2 to Claude 3 was bigger than that gap. So you could sort of think of that as like an intelligence jump from Claude 2 to Claude 3 is bigger than what we’re allowing there. It feels reasonable to me, but I think this is just a judgement call that different people can have. And I think that this is the sort of thing where, if we learn over time that this seems too big or it seems too small, that’s the type of thing that hopefully we can talk about publicly.
Rob Wiblin: Is that something that you get feedback on? I suppose if you are training these big models and you’re checking in on them, you can kind of predict where you expect them to be, how likely they are to exceed a given threshold. And then if you do ever get surprised, then that could be a sign that we need to increase the buffer range here.
Nick Joseph: It’s hard because the thing that would really tell us is if we don’t pass the yellow line for one model, and then on the next iteration suddenly it blows past it. And we look at this and we’re like, “Whoa, this thing is really dangerous. It’s probably past the red line.” And we have to delete the model or immediately put in the security features, et cetera, for the next level. I think that that would be a sign that we’d set the buffer too small.
Rob Wiblin: I guess not the ideal way to learn that, but I suppose it definitely could set a cat amongst the pigeons.
Nick Joseph: Yeah. There would be earlier signs where you would notice we really overshot by a lot. It feels like we’re closer than we expected or something. But that would sort of be the failure mode, I guess, rather than the warning sign.
Other impacts on society and future work on RSPs [01:34:01]
Rob Wiblin: Reading the RSP, it seems pretty focused on kind of catastrophic risks from misuse — terrorist attacks or CBRN — and AI gone rogue, like spreading out of control, that sort of thing. Is it basically right that the RSP or this kind of framework is not intended to address kind of structural issues, like AI displaces people from work and now they can’t earn a living, or AIs are getting militarised and that’s making it more difficult to prevent military encounters between countries because we can’t control the models very well? Or more near-term stuff like algorithmic bias or deepfakes or misinformation? Are those things that have to be dealt with by something other than a responsible scaling policy?
Nick Joseph: Yeah, those are important problems, but our RSP is responsible for preventing catastrophic risks and particularly has this framing that works well for things that are sort of acute — like a new capability is developed and could first-order cause a lot of damage. It’s not going to work for things that are like, “What is the long-term effect of this on society over time?” because we can’t design evaluations to test for that effectively.
Rob Wiblin: Anthropic does have different teams that work on those other two clusters that I talked about, right? What are they called?
Nick Joseph: The societal impacts team is probably the most relevant one to that. And the policy team also has a lot of relevance to these issues.
Rob Wiblin: All right. We’re going to wrap up on RSPs now. Is there anything you wanted to maybe say to the audience to wrap up this section? Additional work or ways that the audience might be able to contribute to this enterprise of coming up with better internal company policies? And then figuring out how there could be models for other actors to come up with government policy as well?
Nick Joseph: Yeah, I think this is just a thing that many people can work on. If you work at a lab, you could talk to people there, think about what they should have as an RSP, if anything. If you work in policy, you should read these and think about if there are lessons to take. If you don’t do either of those, I think you really can think about threat modelling, post about that; think about evaluations, implement evaluations, and share those. I think it is the case that these companies are very busy, and if there is something that’s just shovel-ready or ready on the shelf, you could just grab this evaluation; it’s really quite easy to run them. So yeah, I think there’s quite a lot that people can do to help here.
Working at Anthropic [01:36:28]
Rob Wiblin: All right, let’s push on and talk about the case that listeners might be able to contribute to making superintelligence go better by working at Anthropic, on some of its various different projects. Firstly, how did you end up in your current role at Anthropic? What’s been the career journey that led you there?
Nick Joseph: I think it largely started with an internship at GiveWell, which listeners might know, but it’s a nonprofit that evaluates charities to figure out where to give money most effectively. I did an internship there. I learned a tonne about global poverty, global health. I was planning to do a PhD in economics and go work on global poverty at the time, but a few people there sort of pushed me and said, “You should really worry about AI safety. We’re going to have these superintelligent AIs at some point in the future, and this could be a big risk.”
I remember I left that summer internship and was like, “Wow, these people are crazy.” I talked to all my family, and they were like, “What are you thinking?” But then, I don’t know. It was interesting. So I kept talking to people — some people there, other people sort of worried about this. And I felt like every debate I lost. I would have a little debate with them about why we shouldn’t worry about it, and I’d always come away feeling like I lost the debate, but not fully convinced.
And after, honestly, a few years of doing this, I eventually decided this was convincing enough that I should work in AI. It also turned out that working on poverty via this economics PhD route was a much longer and more difficult and less-likely-to-be-successful path than I had anticipated. So I sort of pivoted over to AI. I worked at Vicarious, which is an AGI lab that had sort of shifted towards a robotics product angle. And I worked on computer vision there for a while, learning how to do ML research.
And then, actually 80,000 Hours reached out to me, and convinced me that I should work on safety more imminently. This was sort of like, AI was getting better. It was more important that I just have some direct impact on doing safety research.
At the time, I think OpenAI had by far the best safety research coming out of there. So I applied to work on safety at OpenAI. I actually got rejected. Then I got rejected again. In that time, Vicarious was nice enough to let me spend half of my time reading safety papers. So I was just sort of reading safety papers, trying to do my own safety research — although it was somewhat difficult; I didn’t really know where to get started.
Eventually I also wrote for Rohin Shah, who was on this podcast. He had this Alignment Newsletter, and I read papers and wrote summaries and opinions for them for a while to motivate myself.
But eventually, third try, I got a job offer from OpenAI, joined the safety team there, and spent eight months there mostly working on code models and understanding how code models would progress. The logic here being we’d just started the first LLMs training on code, and I thought it was pretty scary — if you think about recursive self-improvement, models that can write code is the first step — and trying to understand what direction that would go in would be really useful for informing safety directions.
And then a little bit after that, maybe like eight months in or so, all of the safety team leads at OpenAI left, most of them to start Anthropic. I felt very aligned with their values and mission, so also went to join Anthropic. Sort of the main reason I’d been at OpenAI was for the safety work.
And then at Anthropic, actually, everyone was just building out infrastructure to train models. There was no code. It was sort of the beginning of the company. And I found that the thing that was my comparative advantage was making them efficient. So I optimised the models to go faster. As I said, if you have more compute, you get a better model. So that means if you can make things run quicker, you get a better model as well.
I did that for a while, and then shifted into management, which had been something I wanted to do for a while, and started managing the pre-training team when it was five people. And then have been growing the team since then, training better and better models along the way.
Rob Wiblin: I’d heard that you’d been consuming 80,000 Hours’ stuff years ago, but I didn’t realise it influenced you all that much. What was the step that we helped with? It was just deciding that it was important to actually start working on safety-related work sooner rather than later?
Nick Joseph: Actually a bunch of stops along the way. I think when I did that GiveWell internship, I did speed coaching at EA Global or something with 80,000 Hours. People there were some of the people who were pushing me that I should work on AI. Some of those conversations. And then when I was at Vicarious, I think 80,000 Hours reached out to me and was sort of more pushy, and specifically was like, “You should go to work directly on safety now” — where I think I was otherwise sort of happy to just keep learning about AI for a bit longer before shifting over to safety work.
Rob Wiblin: Well, cool that 80K was able to… I don’t know whether it helped, but I suppose it influenced you in some direction.
Engineering vs research [01:41:04]
Rob Wiblin: Is there any stuff that you’ve read from 80K on AI careers advice that you think is mistaken? Where you want to tell the audience that maybe they should do things a little bit differently than what we’ve been suggesting on the website, or I guess on this show?
Nick Joseph: Yeah. First, I do want to say 80K was very helpful, both in pushing me to do it and setting me up with connections and introducing new people and getting me a lot of information. It was really great.
In terms of things that I maybe disagree with from standard advice, I think the main one would be to focus more on engineering than research. I think there is this historical thing where people have focused on research more so than engineering. Maybe I should define the difference.
The difference between research and engineering here would be that research can look more like figuring out what directions you should work on — designing experiments, doing really careful analysis and understanding that analysis, figuring out what conclusions to draw from a set of experiments. I can maybe give an example, which is you’re training a model with one architecture and you’re like, “I have an idea. We should try this other architecture. And in order to try it, the right experiments would be these experiments, and these would be the comparisons to confirm if it’s better or worse.”
Engineering is more of the implementation of the experiment. So then taking that experiment, trying it, and also creating tooling to make that fast and easy to do, so make it so that you and everyone else can really quickly run experiments. It could be optimising code — so making things run much faster, as I mentioned I did for a while — or making the code easier to use so that other people can use it better.
And it’s not like someone’s an engineer or a researcher. You kind of need both of these skill sets to do work. You come up with ideas, you implement them, you see the results, then you implement changes, and it’s a fast iteration loop. But it’s somewhere where I think there’s historically been more prestige given to the research end, despite the fact that most of the work is on the engineering end. You know, you come up with your architecture idea, that takes an hour. And then you spend like a week implementing it, and then you run your analysis, and that maybe takes a few days. But it sort of feels like the engineering work takes the longest.
And then my other pitch here is going to be that the one place where I’ve often seen researchers not investigate an area they should have is when the tooling is bad. So when you go to do research on this area and you’re like, “Ugh, it’s really painful. All my experiments are slow to run,” it will really quickly have people be like, “I’m going to go do these other experiments that seem easier.” So often, by creating tooling to make something easy, you actually can open up that direction and trailblaze a path for a bunch of other people to follow along and do a lot of experiments.
Rob Wiblin: What fraction of people at Anthropic would you classify as more on the engineering end versus more on the research end?
Nick Joseph: I might go with my team because I actually don’t know for all of Anthropic. And I think it’s a spectrum, but I would guess it’s probably 60% or 70% of people are probably stronger on the engineering end than on the research end. And when hiring, I’m most excited about finding people who are strong on the engineering end. Most of our interviews are sort of tailored towards that — not because the research isn’t important, but because I think there’s a little bit less need for it.
Rob Wiblin: The distinction sounds a little bit artificial to me. Is that kind of true? It feels like these things are all just a bit part of a package.
Nick Joseph: Yeah. Although I think the main distinction with engineering is that it is a fairly separate career. I think there are many people, hopefully listening to this podcast, who might have been a software engineer at some tech company for a decade and built up a huge amount of expertise and experience with designing good software and such. And those people I think can actually learn the ML they need to know to do the job effectively very quickly.
And I think there’s maybe another direction people could go in, which is much more like, I think of as a PhD in many cases — where you’re spending a lot of time developing research taste, figuring out what are the right experiments to run, and running these — usually at smaller scale and maybe with less of a single long-lived codebase that pushes you to develop better engineering practices.
And I think that skill set — and to be clear, this is a relative term; it’s also a really valuable skill set, and you always need a balance — but I think I’ve often had the impression that 80,000 Hours pushes people more in that direction who want to work on safety. More the “do a PhD, become a research expert with really great research taste” than pushing people more on the “become a really great software engineer” direction.
Rob Wiblin: Yeah. We had a podcast many years ago, in 2018, with Catherine Olsson and Daniel Ziegler, where they were also saying engineering is the way to go, or engineering is the thing that’s really scarce and there’s also the easier way into the industry. But yeah, it isn’t a drum that we’ve been banging all that frequently. I don’t think we’ve talked about it very much since then. So perhaps that’s a bit of a mistake that we haven’t been highlighting the engineering roles more.
You said it’s kind of a different career track. So you can go from software engineering to the ML or AI engineering that you’re doing at Anthropic. Is that the natural career progression that someone has? Or someone who’s not already in this, how can they learn the engineering skills that they need?
Nick Joseph: I think engineering skills are actually in some ways the easiest to learn because there’s so many different engineering places. The way I would recommend it is you could work at any engineering job. Usually I would say just working with the smartest people you can, building the most complex systems. You can also just do this open source; you can contribute to an open source project. This is often a great way to get mentorship from the maintainers and have something that’s publicly visible. If you then want to apply to a job, you can be like, “Here is this thing I made.”
And then you can also just create something new. I think if you want to work on AI engineering, you should probably pick a project that’s similar to what you want to do. So if you want to work on data for large language models, take Common Crawl — it’s a publicly available crawl of the web — and write a bunch of infrastructure to process it really efficiently. Then maybe train some models on it, build out some infrastructure to train models, and you can build out that skill set relatively easily without needing to work somewhere.
Rob Wiblin: Why do you think people have been overestimating research relative to engineering? Is it just that research sounds cooler? Has it got better branding?
Nick Joseph: I think historically it was a prestige thing. I think there’s this distinction between research scientist and research engineer that used to exist in the field, where research scientists had PhDs and were designating the experiments that the research engineers would run.
I think that shifted a while ago. I think in some sense the shift has already started happening. Now, many places, Anthropic included, everyone’s a member of technical staff. There isn’t this distinction. And the reason is that the engineering got more important, particularly with scaling. Once you got to the point where you were training models that used a lot of compute on a big distributed cluster, the engineering to implement things on these distributed runs got much more complex than when it was more quick experiments on cheap models.
Rob Wiblin: To what extent is it a bottleneck just being able to build these enormous compute clusters and operate them effectively? Is that a core part of the stuff that Anthropic has to do?
Nick Joseph: So we rely on cloud providers to actually build the data centres and put the chips in it. But we’ve now reached a scale where the amount of compute we’re using is a very dedicated thing. These are really huge investments, and we’re involved and collaborating on it from the design up. And I think it’s a very critical piece. Given that compute is the main driver, the ability to take a lot of compute and use it all together and to design things that are cheap, given the types of workloads you want to run, can be a huge multiplier on how much compute you have.
AI safety roles at Anthropic [01:48:31]
Rob Wiblin: All right. Do you want to give us the pitch for working at Anthropic as a particularly good way to make the future with superintelligent AI go well?
Nick Joseph: I might pitch working on AI safety first. The case here is it’s just really, really important. I think AGI is going to be probably the biggest technological change ever to happen.
The thing I keep in my mind is just: what would it be like to have every person in the world able to spin up a company of a million people — all of whom are as smart as the smartest people you know — and task them with any project they want? You could do a huge amount of good with that: you could help cure diseases, you could tackle climate change, you could work on poverty. There’s a tonne of stuff you can do that would be great.
But there’s also a lot of ways it could go really, really badly. I just think the stakes here are really high, and then there’s a pretty small number of people working on it. If you account for all the people working on things like this, I think you’re probably going to get something in the thousands right now, maybe tens of thousands. It’s rapidly increasing, but it’s quite small compared to the scale of the problem.
In terms of why Anthropic, I think my main case here is just I think the best way to make sure things go well is to get a bunch of people who care about the same thing and all work together with that as the main focus. I mean, Anthropic is not perfect. We definitely have issues, as does every organisation. But I think one thing that I’ve really appreciated is just seeing how much progress we can make when there’s a whole team where everyone trusts each other, deeply shares the same goals, and can work on that together.
Rob Wiblin: I guess there is a bit of a tradeoff between, if you imagine there’s a pool of people who are very focused on AI safety and have the attitude that you just expressed, one approach would be to split them up between each of the different companies that are working on frontier AI. I guess that would have some benefits. The alternative would be to cluster them all together in a single place where they can work together and make a lot of progress, but perhaps the things that they learn won’t be as easily diffused across all of the other companies.
Do you have a view on where the right balance is there between clustering people so they can work together more effectively and communicate more, versus the need perhaps to have people everywhere who can absorb the work?
Nick Joseph: I just think the benefits from working together are really huge. I think it’s so different what you can accomplish when you have five people all working together, as opposed to five people working independently, unable to speak to each other or communicate about what they’re doing. You run the risk of just doing everything in parallel, not learning from each other, and also not building trust — which I think is just somewhat a core piece of eventually being able to work together to implement the things.
Rob Wiblin: So inasmuch as Anthropic is or becomes the main leader in interpretability research and other lines of technical AI safety research, do you think it is the case that other companies are going to be very interested to absorb that research and apply it to their own work? Or is there a possibility that Anthropic will have really good safety techniques, but then they might get stuck in Anthropic, and potentially the most capable models that are being developed elsewhere are developed without them?
Nick Joseph: My hope is that if other people either develop RSP-like things, or if there are regulations requiring particular safety mitigations, people will have a strong incentive to want to get better safety practices. And we publish our safety research, so in some ways we’re making it as easy as we can for them. We’re like, “Here’s all the safety research we’ve done. Here’s as much detail as we can give about it. Please go reproduce it.”
Beyond that, I think it’s hard to be accountable for what other places do. I think to some degree it just makes sense for Anthropic to try to set an example and be like, “We can be a frontier lab while still prioritising safety and putting out a lot of safety work,” and hoping that kind of inspires others to do the same.
Rob Wiblin: I don’t know what the answer to this is, but do you know if researchers at Anthropic sometimes go and visit other AI companies, and vice versa, in order to cross-pollinate ideas? I think that used to maybe happen more, and maybe things have gotten a little bit tighter the last few years, but that’s one idea that you could hope that research might get passed around.
You’re saying it gets published. I guess that’s important. But there is a risk that the technical details of how you actually apply the methods won’t always necessarily be in the paper or be very easy to figure out. So you also often need to talk to people to make things work.
Nick Joseph: Yeah, I think once something’s published, you can go and give talks on it, et cetera. I think publishing is the first step. Until it’s published, then it’s confidential information that can’t be shared. It’s sort of like you have to first figure out how to do it, then publish it. There are more steps you could take. You could then open source code that enables you to run it more carefully. There’s a lot of work that could go in that direction. And then it’s just a balance of how much time you spend on disseminating your results versus pushing your agenda forward to actually make progress.
Rob Wiblin: It’s possible that I’m slightly analogising from biology that I’m somewhat more familiar with, where it’s notorious that having a biology paper or a medical paper does not allow you to replicate the experiment, because there’s so many important details missing. But is it possible that in ML, in AI, people tend to just publish all of the stuff — all of the data, maybe, and all of the code online or on GitHub or whatever — such that it’s much more straightforward to completely replicate a piece of research elsewhere?
Nick Joseph: Yeah, I think it’s a totally different level of replication. It depends on the paper. But on many papers, if a paper is published in some conference, I would expect that someone can pull up the paper and reimplement it with maybe a week’s worth of work. There’s a strong norm of sometimes providing the actual code that you need to run, but providing enough detail that you can.
I think with some things it can be tricky, where our interpretability team just put out a paper on how to get features on one of our production models, and we didn’t release details about our production model. So we tried to include enough detail that someone could replicate this on another model, but they can’t exactly create our production model and get the exact features that we have.
Rob Wiblin: OK, in a minute, we’ll talk about one of the concerns that people might have about working at any AI company. But in the meantime, what roles are you hiring for at the moment, and what roles are likely to be open at Anthropic in future?
Nick Joseph: So probably just check our website. There’s quite a lot. I’ll highlight a few.
The first one I should highlight is the RSP team is looking for people to develop evaluations, work on the RSP itself, figure out what the next version of the RSP should look like, et cetera.
On my team, we’re hiring a bunch of research engineers. This is to come up with approaches to improve models, implement them, analyse the results, pushing that loop. Then also performance engineers. This one’s maybe a little bit more surprising, but a lot of the work now happens on custom AI chips, and making those run really efficiently is absolutely critical. There’s a lot of interplay between how fast it can go and how good the model is. So we’re hiring quite a number of performance engineers where you don’t need to have a tonne of AI expertise, just have deep knowledge of how hardware works and how to write code really efficiently.
Rob Wiblin: How can people learn that skill? Are there courses for that?
Nick Joseph: There are probably courses, I think, with basically everything. I would recommend finding a project, finding someone to mentor you, and be cognizant of their time. Maybe you spend a bunch of time writing up some code and you send them a few hundred lines and say, “Can you review this and help me?” Or maybe you’ve got some weekly meeting where you ask questions. But yeah, I think you can read about it online, you can take courses, or you can just pick a project and say, “I’m going to implement a transformer as fast as I possibly can,” and sort of hack on that for a while.
Rob Wiblin: Are most people coming into Anthropic from other AI companies or the tech industry more broadly, or from PhDs, or maybe not even PhDs?
Nick Joseph: It’s quite a mix. I think a PhD is definitely not necessary. It’s one direction to go to build up this skill set. We have a shockingly large number of people with physics backgrounds who have done theoretical physics for a long time, and then spend some number of months learning the engineering to be able to write Python really well, essentially, and then switch in.
So I think there’s not really a particular background that is needed. I would say if you’re directly preparing for it, just pick the closest thing you can to the job and do that to prepare, but don’t feel like you need to have some particular background in order to apply.
Rob Wiblin: This question is slightly absurd, because there’s such a range of different roles that people could potentially apply for at Anthropic, but do you have any advice to people who, the vision for their career is working at Anthropic or something similar, but they don’t yet feel like they’re qualified to get a role at such a serious organisation? What are some interesting underrated paths, maybe, to gain experience or skills so they can be more useful to the project in future?
Nick Joseph: I would just pick the role you want and then do it externally. Do it in a very publicly visible way, get advice, and then apply with that as an example. So if you want to work on interpretability, make some tooling to pull out features of models and post that on GitHub, or publish a paper on interpretability. If you want to work on the RSP, then make a really good evaluation, post it on GitHub with a nice writeup of how to run it, and include that with your application.
This takes time, and it’s hard to do well, but I think that it’s both the best way to know if it’s really the role you want, and when hiring for something, I have a role in mind and I want to know if someone can do it. And if someone has shown, “Look, I’m already doing this role. Of course I can; here’s my proof I can do it well” that’s the most convincing case. In many ways, more so than the signal you’d get out of an interview, where all you really know is they did well on this particular question.
Should concerned people be willing to take capabilities roles? [01:58:20]
Rob Wiblin: So in terms of working at AI companies, regular listeners will recall that earlier in the year I spoke with Zvi Mowshowitz, who’s a longtime follower of advances in AI, and I’d say is a bit on the pessimistic side about AI safety. And I think he likes the Anthropic RSP, but he’s not convinced that any of the safety plans put forward by any company or any government are, at the end of the day, going to be quite enough to keep us safe from rapidly self-improving AI.
He said that he was pretty strongly against people taking capabilities roles that would push forward the frontier of what the most powerful AI models can do, I guess especially at leading AI companies. The basic argument is just that those roles are causing a lot of harm because they’re speeding things up and leaving us less time to solve whatever kind of safety issues we’re going to need to address.
And I pushed back a little bit, and he wasn’t really convinced by the various justifications that one might give — like the need to gain skills that you could then apply to safety work later, or maybe you’d have the ability to influence a company’s culture by being on the inside rather than the outside. And I think, of all companies, Zvi I would certainly imagine is most sympathetic to Anthropic. But I guess his philosophy is very much to rely on hard constraints rather than put trust in any particular individuals or organisations that you like.
I’m guessing that you might have heard what Zvi had to say in that episode, and I guess it was a critique that arguably applies to your job training Claude 3 and other frontier LLMs. So I’m kind of fascinated to hear what you thought of Zvi’s perspective there.
Nick Joseph: I think there’s one argument, which is to do this to build career capital, and then there’s another that is to do this for direct impact.
On the career capital one, I’m pretty sceptical. I think career capital is sort of weird to think about in this field that’s growing exponentially. In sort of a normal field, people often say you have the most impact late in your career: you build up skills for a while, and then maybe your 40s or 50s is when you have the most impact of your career.
But given the rapid growth in this field, I think actually the best moment for impact is now. I don’t know. I often think of, in 2021 when I was working at Anthropic, I think there were probably tens of people working on large language models, which I thought were the main path towards AGI. Now there are thousands. I’ve improved; I’ve gotten better since then. But I think probably I had way more potential for impact back in 2021 when there were only tens of people working on it.
Rob Wiblin: Your best years are behind you, Nick.
Nick Joseph: Yeah, I think the potential was very high. I think that there’s still a lot of room for impact, and it will maybe decay, but from an extremely high level.
And then the thing is just the field isn’t that deep. Because it’s such a recent development, it’s not like you need to learn a lot before you can contribute. If you want to do physics, and you have to learn the past thousands of years of physics before you can push the frontier, that’s a very different setup from where we’re at.
Maybe my last argument is just like, if you think timelines are short, depending exactly how short, there’s just actually not that much time left. So if you think there’s five years and you spent two of them building up a skill set, that’s a significant fraction of the time. I’m not saying that should be someone’s timeline or anything, but the shorter they are, the less that makes sense. So yeah, I think from a career capital perspective, I probably agree, if that makes sense.
Rob Wiblin: Yeah, yeah. And what about from other points of view?
Nick Joseph: From a direct impact perspective, I’m fairly less convinced. Part of this is just that I don’t have this framing of, there’s capabilities and there is safety and they are like separate tracks that are racing. It’s one way to look at it, but I actually think they’re really intertwined, and a lot of safety work relies on capabilities advances. I gave this example of this many-shot jailbreaking paper that one of our safety teams published, which uses long-context models to find a jailbreak that can apply to Claude and to other models. And that research was only possible because we had long-context models that you could test this on. I think there’s just a lot of cases where the things come together.
But then I think if you’re going to work on capabilities, you should be really thoughtful about it. I do think there is a risk you are speeding them up. In some sense you could be creating something that is really dangerous. But I don’t think it’s as simple as just don’t do it. I think you want to think all the way through to what is the downstream impact when someone trains AGI, and how will you have affected that? That’s a really hard problem to think about. There’s a million factors at play, but I think you should think it through, come to your best judgement, and then reevaluate and get other people’s opinions as you go.
Some of the things I might suggest doing, if you’re considering working on capabilities at some lab, is try to understand their theory of change. Ask people there, “How does your work on capabilities lead to a better outcome?” and see if you agree with that. I would talk to their safety team, talk to safety researchers externally, get their take. Do they think that this is a good thing to do? And then I would also look at their track record and their governance and all the things to answer the question of, do you think they will push on this theory of change? Like over the next five years, are you confident this is what will actually happen?
One thing that convinced me at Anthropic that I was maybe not doing evil, or made me feel much better about it, is that our safety team is willing to help out with capabilities, and actually wants us to do well at that. Early on with Opus, before we launched it, we had a major fire. There were a bunch of issues that came up, and there was one very critical research project that my team didn’t have capacity to push forward.
So I asked Ethan Perez, who’s one of the safety leads at Anthropic, “Can you help with this?” It was actually during an offsite, and Ethan and most of his team just basically went upstairs to this building in the woods that we had for the offsite and cranked out research on this for the next two weeks. And for me, at least, that was like, yes. The safety team here also thinks that us staying on the frontier is critical.
Rob Wiblin: So the basic idea is that you think that the safety work, the safety research of all kinds of many different types that Anthropic is doing is very useful. It sets a great example. It’s research that could then be adopted by other groups and also used by Anthropic to make safe models. And the only way that that can happen, the only reason that research is possible at all, is that Anthropic has these frontier LLMs on which to experiment and do that research, and to be at the cutting edge generally of this technology, and so able to figure out what’s the safety research agenda that is most likely to be relevant in future.
If I imagine, what would Zvi say? I’m going to try to model him. I guess that he might say yes, given that there’s this competitive dynamic forcing us to shorten timelines, bringing the future forward maybe faster than we feel comfortable with, maybe that’s the best you can do. But wouldn’t it be great if we could coordinate more in order to buy ourselves more time? I guess that would be one angle.
Another angle that I’ve heard from some people — I don’t know whether Zvi would say this or not — is that we’re nowhere near actually having all the safety-relevant insights that we can have with the models that we have now. So given that there’s still such fertile material with Claude 2 maybe, or at least with Claude 3 now, why do you need to go ahead and train Claude 4?
Maybe it’s true that five years ago, when we were so much further away from having AGI or having models that were really interesting to work with, we were a little bit at a loose end trying to figure out what safety research would be good, because we just didn’t know what direction things were going to go. But now there’s so much safety research — there’s a proliferation, a cambrian explosion of really valuable work — and we don’t necessarily need more capable models than what we have now in order to discover really valuable things. What would you say to that?
Nick Joseph: On the first one, I think there’s sometimes this, like, “What is the ideal world if everyone was me?” or something. Or, “If everyone thought what I thought, what would be the ideal setup?” I think that’s just not how the world works. To some degree, you really only can control what you do, and maybe you can influence what a small number of people you talk to do. But I think you have to think about your role in the context of the broader world, more or less acting in the way that they’re going to act.
And definitely a big part of why I think it’s important for Anthropic to have capabilities is to enable safety researchers to have better models. Another piece of it is to enable us to have an impact on the field, and try to set this example for other labs that you can deploy models responsibly and do this in a way that doesn’t cause catastrophic risks and continues to push on safety.
In terms of “Can we do safety research with current models?” I think there is definitely a lot to do. I also think we will target that work better the closer we get to AGI. I think the last year before AGI will definitely be the most targeted safety work. Hopefully, there’ll be the most safety work happening then, but it will be the most time constrained. So you need to do work now, because there’s a bunch of serial time that’s needed in order to make progress. But you also want to be ready to make use of the most well-directed time towards the end.
Rob Wiblin: I guess another concern that people have — which you touched on earlier, but maybe we could talk about a little bit more — is this worry that Anthropic, by existing, by competing with other AI companies, stokes the arms race, increases the pressure on them, feeling that they need to improve their models further, put more money into it, release things as quickly as they can.
If I remember, your basic response to that was like, yes, that effect is not zero, but in the scheme of things, there’s a lot of pressure on companies to be training models and trying to improve them. And Anthropic is a drop in the bucket there, so this isn’t necessarily the most important thing to be worrying about.
Nick Joseph: Yeah, I think basically that’s pretty accurate. One way I would think about it is just what would happen if Anthropic stopped existing? If we all just disappeared, what effect would that have in the world? Or if you think about if we dissolved as a company, and everyone went to work at all the others. My guess is it just wouldn’t look like everyone slows down and is way more cautious. That’s not my model of it. If that was my model, then I would be like, we’re probably doing something wrong.
So I think it’s an effect, but I think about, in terms of what is the net effect of Anthropic being on the frontier — when you account for all the different actions we’re taking, all the safety research, all the policy advocacy, the effect our products have helping users — there’s this whole large scheme. And you can’t really add it all up and subtract the costs, but I think you can do that somewhat in your mind or something.
Rob Wiblin: Yeah, I see. So the way you conceptualise it is thinking of Anthropic as a whole, what impact is it having by existing compared to some counterfactual where Anthropic wasn’t there? And then you’re contributing to this broader enterprise that is Anthropic and all of its projects and plans together, rather than thinking about, “Today, I got up and I helped to improve Claude 3 in this narrow way. What impact does that specifically have?” — because maybe it’s missing the real effects that matter the most from allowing this organisation to exist through your work.
Nick Joseph: Yeah, you could definitely think on the margin. To some degree, if you’re joining and going to help with something, you are just increasing Anthropic’s marginal amount of capabilities. Then I would just look at, “Do you think we would be on a better trajectory if Anthropic had better models? And do you think we’d be on a worse trajectory if Anthropic had significantly worse models?” would be sort of the comparison. I think you could look at like, what would happen if Anthropic didn’t ship Claude 3 earlier this year?
Recent safety work at Anthropic [02:10:05]
Rob Wiblin: What are some of the lines of research that you’re most pleased that you’ve helped Anthropic to pursue? What are some of the safety wins that you’re really pleased by?
Nick Joseph: I’m really excited about the safety work. I think there’s just a tonne of it that has come out of Anthropic. I could start with interpretability. I think at the beginning of Anthropic, it was figuring out how single-layer transformers work, these very simple toy models. And in the past few years — and this is not my doing; this is all the interpretability team — that has scaled up into actually being able to look at production models that people are really using and find useful, and identify particular features.
We had this recent one on the Golden Gate Bridge, where it’s the model’s representation of the Golden Gate Bridge. And if you increase it, the model talks more about the Golden Gate Bridge. And that’s a very cool causal effect, where you can change something and it actually changes the model behaviour in a way that gives you more certainty that you’ve really found something.
Rob Wiblin: I’m not sure whether all listeners will have seen this, but it is very funny, because you get Claude 3, and its mind is constantly turned to thinking about the Golden Gate Bridge, even when the question has nothing to do with it. And it gets frustrated with itself, realising that it’s going off topic, and then tries to bring it back to the thing that you asked. But then it just can’t. It just can’t avoid talking about the Golden Gate Bridge again.
Is the hope that you could find the honesty part of the model and scale that up enormously? Or alternatively, find the deception part and scale that down in the same way?
Nick Joseph: Yeah. If you look at the paper, there’s a bunch of safety-relevant features. I think that the Golden Gate Bridge one was cuter or something and got a bit more attention. But yeah, there are a tonne of features that are really safety relevant. I think one of my favourites was one that will tell you if code is incorrect or something, or has a vulnerability, something along those lines, and then you can change that and suddenly it doesn’t write the vulnerability or it makes the code correct. And that sort of shows the model knows about concepts at that level.
Now, can we use this directly to solve major issues? Probably not yet. There’s a lot more work to be done here. But I think it’s just been a huge amount of progress. And I think that it’s fair to say that that progress wouldn’t have happened without Anthropic’s interpretability team pushing that field forward a lot.
Rob Wiblin: Is there any other Anthropic research that you’re proud of?
Nick Joseph: Yeah, I mentioned this one a little bit earlier, but there’s this multi-shot jailbreaking from our alignment team that pushed, if you have a long-context model, which is something that we released, you can jailbreak a model by just giving it a lot of examples in this very long context. And it’s a very reliable jailbreak to get models to do things you don’t want. This is sort of in the vein of the RSP: one of the things we want to have is to be able to be robust to really intense red-teaming, where if a model has a dangerous capability, you can have safety features that prevent people from eliciting it. And this is like an identification of a major risk for that.
We also have this sleeper agents paper which shows early signs of models having deceptive behaviour.
Yeah, I could talk about a lot more of it. There’s actually just a really huge amount, and I think that’s fairly critical here. I think often with safety things, people get focused on inputs and not outputs or something. And I think the important thing is to think about how much progress are we actually making on the safety front? That is ultimately what’s going to matter in some number of years when we get close to AGI. It won’t be how many GPUs do we use? How many people worked on it? It’s going to be: What did we find and how effective were we at it?
And for products, this is very natural. People think in terms of revenue. You know, how many users did you get? You have these end metrics that are the fundamental thing you care about. I think for safety, it’s much fuzzier and harder to measure, but putting out a lot of papers that are good is quite important.
Rob Wiblin: Yeah. If you want to keep going, if there’s any others that you want to flag, I’m in no hurry.
Nick Joseph: Yeah, I could talk about influence functions. I think this is a really cool one. So one framing of mechanistic interpretability is, it lets us look at the weights and understand why a model has a behaviour by looking at a particular weight. The idea of influence functions is to understand why a model has a behaviour by looking at the training data, so you can understand what in your training data contributed to a particular behaviour from the model. I think that was pretty exciting to see work.
Constitutional AI is another example I would highlight, where we can train a model to follow a set of principles via AI feedback. So instead of having to have human feedback for a bunch of things, you can just write out a set of principles — “I want the model to not do this, I want it to not do this, I want to not do this” — and train the model to follow that constitution.
Anthropic culture [02:14:35]
Rob Wiblin: Is there any work at Anthropic that you personally would be wary, or at least not enthusiastic, to contribute to?
Nick Joseph: So I think, in general, this is a good question to ask. I think the work I’m doing is currently the highest-impact thing, and I should frequently wonder if that’s the case and talk to people and reassess.
Right now, I don’t think there’s any work at Anthropic that I wouldn’t contribute to or think shouldn’t be done. That’s probably not the way I would approach it. If there was something that I thought Anthropic was doing that was bad for the world, I would write a doc making my case and send it to the relevant person who’s responsible for that, and then have a discussion with them about it.
Because just opting out isn’t going to actually change it, right? Someone else will just do it. That doesn’t accomplish much. And we try to operate as one team where everyone is aiming towards the same goals, and not have this two different teams at odds, where you’re hoping someone else won’t succeed.
Rob Wiblin: I guess people might have a reasonable sense of the culture at Anthropic just from listening to this interview, but is there anything else that’s interesting about working at Anthropic that might not be immediately obvious?
Nick Joseph: The one thing that is part of our culture that at least surprised me is spending a lot of time pair programming. It’s just a very collaborative culture. When I first joined, I was working on a particular method of distributing a language model training across a bunch of GPUs. And Tom Brown — who’s one of the founders, and had done this for GPT-3 — just put an eight-hour meeting on my calendar, and I just watched him code it. And I was on different time zones, so basically during the hours when he wasn’t working and I was working, I would push forward as far as I could. And then the next day we would meet again and continue on.
I think it’s just a really good way of aligning people, where it’s a shared project, instead of being like, you’re bothering someone by asking for their help. It’s like you’re working together on the thing, and you learn a lot. You also learn a lot of the smaller things that you wouldn’t otherwise see, like how does someone navigate their code editor? What exactly is their style of debugging this sort of problem? Whereas if you go and ask them for advice or “How do I do this project?” they’re not going to tell you the low-level details of when do they pull out a debugger versus some other tool for solving the problem.
Rob Wiblin: So this is literally just watching one another’s screens, or you’re doing a screen share thing where you watch?
Nick Joseph: Yeah. I’ll give some free advertising to Tuple, which is this great software for it, where you can share screens and you can control each other’s screens and draw on the screen. And typically one person will drive, they’ll be basically doing the work, and another person will watch, ask questions, point out mistakes, occasionally grab the cursor and just change it.
Rob Wiblin: It’s interesting that I feel in other industries, having your boss or a colleague stare constantly at your screen would give people the creeps or they would hate it. Whereas it seems like in programming this is something that people are really excited by, and they feel like it enhances their productivity and makes the work a lot more fun.
Nick Joseph: Oh yeah. I mean, it can be exhausting and tiring. I think the first time I did this, I was too nervous to take a bathroom break. And after multiple hours I was like, “Can I go to the bathroom?” And I realise that was an absurd thing to ask after multiple hours of working on something.
Rob Wiblin: What, are you back at primary school?
Nick Joseph: Yeah. It could definitely feel a little bit more intense, in that someone’s watching you and they might give you feedback. Like, “You’re kind of going slow here. This sort of thing would speed you up.” But I think you really can learn a lot from that sort of intensive partnering with someone.
Rob Wiblin: All right, I think we’ve talked about Anthropic for a while. Final question is: obviously Anthropic, its main office is in San Francisco, right? And I heard that it was opening a branch in London. Are those the two main places? Are there many people who work remotely or anything like that?
Nick Joseph: Yeah. So we have the main office in SF and then we have an office in London, Dublin, Seattle, and New York. Our typical policy is like 25% time in-person. So some people will mostly work remotely and then go to one of the hubs for usually one week per month. The idea of this is that we want people to build trust with each other and be able to work together well and know each other, and that involves some amount of social interaction with your coworkers. But also, for a variety of reasons, sometimes getting the best people, people are bound to particular locations.
Rob Wiblin: I kind of have been assuming that all of the main AI companies are probably hiring hand over fist. And I know Anthropic’s received big investment from Amazon, maybe some other folks as well. But does it feel like the organisation is growing a lot? That there’s lots of new people around all the time?
Nick Joseph: Yeah, growth has been very rapid. We recently moved into a new office. Before that, we’d run out of desks, which was an interesting moment for the company. It was very cramped. Now there’s space.
I mean, rapid growth is a very difficult challenge, but also a very interesting one to work on. I think that’s, to some degree, what I spend a lot of my time thinking about. How can we grow the team and be able to maintain this linear growth in productivity is sort of the dream: if you double the number of people, you get twice as much done. And you never actually hit that, but it takes a lot of work, because there’s now all this communication overhead and you have to do a bunch to make sure everyone’s working towards the same goals, sort of maintain the culture that we currently have.
Rob Wiblin: I’ve given you a lot of time to talk about what’s great about Anthropic, but I should at least ask what’s worst about Anthropic? What would you most like to see improved?
Nick Joseph: Honestly, the first thing that comes to mind is just the stakes of what we’re working on or something. I think that there was a period a few years ago where I felt like, safety is really important. I felt motivated, and it was a thing I should do and got value out of it. But I didn’t feel this sort of, it could be really urgent. Decisions I’m making are just really high-stakes decisions.
I think Anthropic definitely feels high stakes. It’s often portrayed as this doomy culture. I don’t think it’s that. There are a lot of benefits, and I’m pretty excited about the work I’m doing, and it’s quite fun on a day-to-day basis, but it does feel very high intensity. And many of these decisions, they really do matter. If you really think we’re going to have the biggest technological change ever, and how well that goes depends in a large part on how well you do at your job on that given day —
Rob Wiblin: No pressure.
Nick Joseph: Yeah. The timelines are really fast too. Even commercially, you can see that it’s months between major releases. That puts a lot of pressure, where if you’re trying to keep up with the frontier of AI progress, it is quite difficult, and it relies on success on very short timelines.
Rob Wiblin: Yeah. So for someone who has relevant skills, might be a good employee, but maybe they struggle to operate at super high productivity, super high energy all the time, could that be an issue for them at a place like Anthropic, where it sounds like there’s a lot of pressure to deliver all the time? I guess potentially internally, but also just the external pressures are pretty substantial?
Nick Joseph: Yeah, some part of me wants to say yes. I think it’s really important to be very high performing a lot of the time. The standard of “always do everything perfect all the time” is not something anyone meets. And I think it is important sometimes to just keep in mind that all you can do is your best effort. We will mess things up, even if it’s high stakes, and that’s quite unfortunate. It’s unavoidable. No one is perfect. I wouldn’t set too high of a, “I couldn’t possibly handle that.” I think people really can, and you can grow into that and get used to that level of pressure and how to operate under it.
Overrated and underrated AI applications [02:22:06]
Rob Wiblin: All right, I guess we should wrap up. We’ve been at this for a couple of hours. But I’m curious to know what is an AI application that you think is overrated and maybe going to take longer to arrive than people expect? And maybe what’s an application that you think might be underrated and consumers might be really getting a lot of value out of surprisingly soon?
Nick Joseph: I think in overrated, people are often like, “I’ll never have to use Google again,” or, “It’s a great way to get information.” And I find that I still, if I just have a simple question and I want to know the answer, just googling it will give me the answer quickly, and it’s almost always right. Whereas I could go ask Claude, but it’ll sample it out, and then I’ll be like, “Is it true? Is it not true? It’s probably true, but it’s in this conversational tone…” So I think that’s one that doesn’t yet feel like the strengths.
The place where I find the most benefit is coding. I think this is not a super generalisable case or something, but if you’re ever writing software, or if you’ve thought, “I don’t know how to write software, but I wish I did,” the models are really quite good at it. And if you can get yourself set up, you can probably just write something out in English and it will spit out the code to do the thing you need rather quickly.
Then the other thing is problems where I don’t know what I would search for. Like, I have some question, I want to know the answer, but it relies on a lot of context. It would be this giant query. Models are really good at that. You can give them documents, you give them huge amounts of stuff and explain really precisely what you want, and then they will interpret it and give you an answer that accounts for all the information you’ve given them.
Rob Wiblin: Yeah. I think I do use it mostly as a substitute for Google, but not for simple queries. It’s more like something kind of complicated, where I feel like I’d have to dig into some articles to figure out the answer.
One that jumps to mind is Francisco Franco was kind of on the side of the Nazis during World War II, but then he was in power for another 30 years. Did he ever comment on that? What did he say about the Nazis later on? And I think Claude was able to give me an accurate answer to that, whereas I probably could have spent hours maybe trying to look into that, trying to find something. The answer is he mostly just didn’t talk about it.
Nick Joseph: My other favourite one, which is a super tiny use case, is if I ever have to format something and do something, like if there’s just some giant list of numbers that someone sent me in a Slack thread, and it’s bulleted and I want to add them up, I can just copy-paste it into Claude and say, “Add the things up.” And any format, it’s very good at taking this weird thing, structuring it, and then doing a simple operation.
Rob Wiblin: So I’ve heard all of these models are really good at programming. I’ve never programmed before really, and I’ve thought about maybe I could use them to make something of use, but I guess I’m at such a basic level I don’t even know… So I would get the code and then where would I run it? Is there a place that I could look this up?
Nick Joseph: Yeah, I think you basically want to just look up I would suggest Python, get an introduction to Python and get your environment set up. You’ll eventually run Python in some file and you’ll hit enter and that will run the code. And that part’s annoying. I mean, Claude could help you if you run into issues setting it up, but once you have it set up, you can just be like, “Write me some code to do X” and it will write that pretty accurately. Not perfectly, but pretty accurately.
Rob Wiblin: Yeah, I guess I should just ask Claude for guidance on this as well. I’ve got a kid a couple of months old. I guess in three or four years’ time they’ll be going to preschool, and then eventually starting reception, primary school. I guess my hope is that by that time, AI models might be really involved in the education process, and kids will be able to get a lot more one-on-one… Maybe. It would be very difficult to keep a five-year-old focused on the task of talking to an LLM.
But I would think that we’re close to being able to have a lot more individualised attention from educators, even if those educators are AI models, and this might enable kids to learn a lot faster than they can when there’s only one teacher split between 20 students or something like that. Do you think that kind of stuff will come in time for my kid first going to school, or might it take a bit longer than that?
Nick Joseph: I can’t be sure, but yeah, I think there will be some pretty major changes by the time your kid is going to school.
Rob Wiblin: OK, that’s good. That’s one that I really don’t want to miss on the timelines. Like Nathan Labenz, I’m worried about hyperscaling, but on a lot of these applications, I really just want them to reach us as soon as possible because they do seem so useful.
My guest today has been Nick Joseph. Thanks so much for coming on The 80,000 Hours Podcast, Nick.
Nick Joseph: Thank you.
Rob’s outro [02:26:36]
Rob Wiblin: If you’re really interested in the pretty vexed question of whether, all things considered, it’s good or bad to work at the top AI companies if you want to make the transition to superhuman AI go well, our researcher Arden Koehler has just published a new article on exactly that, titled Should you work at a frontier AI company? You can find that by googling “80,000 Hours” and “Should you work at a frontier AI company?” Or heading to our website at 80000hours.org and just looking through our research.
And finally, before we go, just a reminder that we’re hiring for two new senior roles at 80,000 Hours — a head of video and head of marketing. You can learn more about both at 80000hours.org/latest.
Those roles would probably be done in our offices in central London, but we are open to exceptional remote candidates in some cases. And alternatively, if you’re not in the UK but you would like to be, we can also support UK visa applications. The salaries for these two roles would vary depending on seniority, but someone with five years of relevant experience would be paid approximately £80,000.
The first of these two roles, head of video, would be someone in charge of setting up a new video product for 80,000 Hours. Obviously people are spending a larger and larger fraction of their time online watching videos on video-specific platforms, and we want to explain our ideas there in a compelling way that can reach the sorts of people who care about them. That video programme could take a range of forms, including 15-minute direct-to-camera vlogs, lots and lots of one-minute videos, 10-minute explainers — that’s probably my favourite YouTube format — or lengthy video essays. Some people really like those. The best format would be something for this new head of video to figure out for us.
We’re also looking for a new head of marketing to lead our marketing efforts to reach our target audience at a large scale. They’re going to be setting and executing on a strategy, managing and building a team, and ultimately deploying our yearly marketing budget of around $3 million. We currently run sponsorships on major podcasts and YouTube channels. Hopefully you’ve seen some of them. We also do targeted ads on a range of social media platforms. And collectively, that’s gotten hundreds of thousands of new people onto our email newsletter. We also mail out a copy of one of our books about high-impact career choice every eight minutes. That’s what I’m told. So there’s certainly the potential to reach many people if you’re doing that job well.
Applications will close in late August, so please don’t delay if you’d like to apply.
All right, The 80,000 Hours Podcast is produced and edited by Keiran Harris.
Audio engineering by Ben Cordell, Milo McGuire, Dominic Armstrong, and Simon Monsour.
Full transcripts and an extensive collection of links to learn more are available on our site, and put together as always by the legend herself, Katy Moore.
Thanks for joining, talk to you again soon.