Progress in AI governance
Markus Anderljung: AI governance is like becoming real. Governments are taking all kinds of different actions. Companies are trying to figure out what it looks like for them to be behaving responsibly and doing the right thing. And so that means that more and more, what sort of AI policy/AI governance work looks like is like, “What’s a good version of X? What’s a good version of a thing that a government or some kind of actor wants to do? What’s a useful nudge?” As opposed to taking all of the potential possibilities out there in the world: “What would be a useful thing to do?” And so I think that also really constrains the problem and makes it a lot easier to make progress as well.
Luisa Rodriguez: Do you have a view on, overall, how it’s going? Like, there’s the AI Act; there are policies constraining where we can get computer chips from and where we can’t and where they’re made. I don’t know if it’s too complicated a field to give some generalisation, but basically, how do you feel about the specific policies that have been implemented?
Markus Anderljung: If we look at the past six months, on the side of policy and governance, things look more positive than I thought they would. I think this is mainly just like the release of ChatGPT and the consequences of that — just like the world at large, people have the general vibe, “Oh my gosh, these AI systems are capable and they’ll be able to do things. We don’t know quite what, but they matter, and we need to figure out what to do about that.” The extent to which that’s a thing that people believe is stronger than I previously thought.
And then another really important part of getting this problem right, I think, is just understanding just how little we understand these systems and how to get them to do what we want them to do and that kind of thing. And it seems like that’s a thing that people are starting to appreciate more. And so generally I feel positive about the trajectory we’ve had over the last six months.
The thing on the other side of the ledger primarily is just that there are more people now in the world who think, “Oh my gosh. AI is going to be a big deal — I’d better go out and build some of these AI systems.” And we’ve seen this from big tech companies, in particular Google and Microsoft: they’re going to be at each other’s throats in the future, and are already to some extent doing that. They’ll be competing with each other and trying to one-up each other in terms of developing useful, impressive AI systems.
I think that’s the main thing on the other side of the ledger that I’m kind of worried about. These strong business interests and these big tech companies will have a much bigger role in how these AI systems are developed, and a lot more money might be sort of ploughed into the industry and those kinds of things — which might mean that things happen faster than they otherwise would.
The emergent capabilities problem
Markus Anderljung: Often we don’t know just what these systems are capable of. Some folks at Anthropic have this really useful paper called
“Predictability and surprise” that I think makes this point pretty well. So when we train a new system, the reason we’re training the system is that we think that it’s going to be more capable than other systems that have been produced in the past. What people often call this is they say that it has “lower loss”: it does better at the training objective. So in the case of a large language model, it’s better at predicting the next token — the next bit of text — than existing systems today.
Ultimately the thing that we care about is not how well the system does on the training objective: I don’t care about the system being good at predicting the next token; the thing I care about is the system being able to do certain tasks that might matter. So the point of this paper is that while it’s predictable that the loss will go down — that the performance on the training objective will keep improving — it’s often surprising what specific capabilities the system will be able to learn.
Quite often you’ll see there are some really good graphs showing this kind of stuff — like, for example, these large language models doing arithmetic. In general they’re just terrible at arithmetic — you shouldn’t use them for it. But on tasks like being able to multiply two three-digit numbers, or being able to add up numbers to have such-and-such many digits, quite often you’ll see it does really, really, really poorly — and then quite suddenly, quite often, you’ll see a spike and it sort of manages to figure out that task.
Luisa Rodriguez: Do we know what’s going on there?
Markus Anderljung: The intuition that people have is something like, initially you’re just kind of guessing randomly. Maybe you learn, like, if you add up two numbers that have four digits each, probably the new number will be either four digits or five digits. And then maybe you just throw some random numbers in there. But then the thought is that at some point the way to actually solve this problem is to, in some way, actually just do the maths. And so then after a while, the system learns. Then the thought is that there’s just these few ways to solve the problem, and if you figure out the way to solve the problem, then all of a sudden you do reasonably well.
It is quite strange. And so then, yeah, it’s difficult. I mean, sometimes it feels like you’re sort of anthropomorphising these things, but I don’t actually know if it’s that strange — because it is just like the algorithm or the way to solve the problem just kind of clicks into place. And then when that’s the case, I think that is kind of similar to what the human experience is, at least in mathematics. So I think that’s one way in which you get these kinds of emergent capabilities. And so it’s quite difficult to notice ahead of time and know ahead of time what capabilities the system will have. Even after you’ve deployed it, quite often it just takes a long time for people to figure out all of the things that the system could be used to do.
The proliferation problem
Markus Anderljung: The thing that I’m worried about is a situation where you train this system, and maybe you try to put in various things to make sure that it’s safe, where it can’t be used for certain things. So you try to sort of solve the deployment problem, and then you deploy it. But then after deployment, it turns out that it had these emergent capabilities that you weren’t aware of, and those emergent capabilities aren’t ones that should be widely available — but now you can’t walk it back because of the proliferation problem. So the model has already seen wide distribution, including the weights, and clawing those back is very difficult and will cause all kinds of privacy concerns and whatnot.
So all these three problems push in the direction of: you might need to have regulation that happens a little bit earlier in the chain than you otherwise would have thought. You need to go earlier than “Someone is using an AI model to do XYZ.”
Luisa Rodriguez: Right. So it’s something like, if the US government was testing out different biological weapons, which is an unfortunate thing that it might do, you want to consider the fact that research or a specimen might get stolen at some earlier point, before you have the worst ones. Maybe by restricting how many people can do that kind of research or by putting in serious security measures around the facilities doing that kind of research.
Markus Anderljung: Maybe another example is USAID, the US Agency for International Development: After COVID, basically they had the idea that it’d be good to try to get a sense of what kinds of other pathogens might exist that would potentially cause something like COVID, or be of similar worry as COVID. And that maybe we should make a public database of what these systems could be, so that people could anticipate them and maybe look for these pathogens ahead of time and be able to respond, et cetera. I think that’s a really reasonable thought. But a thing that we could do that might be a little bit better is, before releasing it really widely, make sure that these are pathogens that you might be able to do something to maybe defend against if someone would decide to intentionally develop them or something like this.
Once you’ve put it up on the internet, then there’s no taking it back, and so maybe you should be more incremental about it.
Will AI companies accept regulation?
Luisa Rodriguez: It feels like AI systems are kind of a software. And the software I’m used to having, like Google Chrome, I guess probably it’s a little regulated — like Google isn’t allowed to insert viruses that then spy on me — but this all just seems like this is an entirely different thing. And I wonder to what extent you think AI companies are going to really push back against this as just too invasive, too controlling and stifling. And I also wonder if governments are going to be hesitant to make these kinds of regulations if they’re kind of worried about this higher-level thing, where they’re racing against a country like China to make very capable models first and get whatever advantages those will provide. Is that wrong? Are companies actually just going to be open to this?
Markus Anderljung: It’s tough to tell. I think a big part is just what reference class you’re using when you think about this: If you think about AI as software, this all sounds like a lot; if you think about it as putting a new aeroplane in the sky or something like this, then it’s not very strange at all. It’s kind of the standard that for things that are used in these safety-critical ways, you need to go through all these processes. If you’re putting a new power plant on the grid, you need to go through a whole bunch of things to make sure that everything is OK. I think a lot will depend on that.
I guess my view is just like, obviously developing one of these systems and putting them out into the market is going to be a bigger decision than whether to put a new power plant on the grid or whether to put a new plane in the sky. It just seems to me that it’s actually quite natural.
It’s difficult to tell if the world will agree with me. There’s some hope on the side of what industry thinks, at least from most of the frontier actors. We have public statements from Sam Altman, the CEO of OpenAI, and Demis Hassabis, CEO of Google DeepMind where they just explicitly say, “Please, could you regulate us?” Which is a crazy situation to be in, or just quite unusual, I think.
Luisa Rodriguez: Right. And that’s because, is it something like they feel a lot of pressure to be moving quickly to keep up with everyone else, but they’d actually prefer if everyone slowed down? And so they’re like, “Please impose controls that will require my competitors to go as slowly as I would like to go”?
Markus Anderljung: Yeah. I think one way to view what regulation does is it might sort of put a floor on safe behaviour, basically. And so if you’re worried about this kind of competitive dynamic, and you’re worried that other actors might come in and they might outcompete you — for example, if you are taking more time to make sure that your systems are safer or whatever it might be — then I think you’ll be worried about that.
I think another thing is just that I think these individuals, some of the leaders of these organisations, genuinely just think that AI is this incredibly important technology. They think it’s very important that it goes well, and that it doesn’t cause a bunch of havoc and doesn’t harm society in a bunch of different ways. And so it’s also just coming from that place.
Regulating AI as a minefield
Markus Anderljung: Instead of thinking about the kind of regulatory system we want to build, it’s sort of like we’re trying to figure out how to navigate this sort of minefield as a society. When we think about this frontier AI regulation, one big thing we want to do is there are these people at the front of society or whatever that are sort of exploring the minefield — and we really want to make sure that they identify the mines before they step on them. That’s a lot of what these requirements in this proposal look like.
At the front of the pack we have these frontier AI developers, and we want them to identify particularly dangerous models ahead of time. And we’re not done there. Once those mines have been discovered, and the frontier developers keep walking down the minefield, there’s going to be all these other people who follow along. And then a really important thing is to make sure that they don’t step on the same mines. So you need to put a flag down — not on the mine, but maybe next to it.
And so what that looks like in practice is maybe once we find that if you train a model in such-and-such a way, then it can produce maybe biological weapons is a useful example, or maybe it has very offensive cyber capabilities that are difficult to defend against. In that case, we just need the regulation to be such that you can’t develop those kinds of models.
But then that’s not enough either, because those kinds of measures are probably not going to function completely, and there’s going to be some actors that develop this stuff covertly, and the compute efficiency will continue going up, and compute will be more cheap, and all these kinds of things. And so at the same time, we’re going to need to sort of disarm some of these mines as well. We’re going to need to make it the case that capabilities that previously were really quite dangerous, and society wasn’t able to defend itself against, are ones that we can live with, are ones that society can sort of deal with. And my guess is that the kind of governance regime we need to build is one that follows that kind of analogy.
AI risk assessments
Luisa Rodriguez: How do these evaluations figure out if a model is safe? Are they basically just telling the model, “Make a plan for escaping” and seeing what it comes up with?
Markus Anderljung: Yeah, that kind of thing. So you’d have a prompt that says something like, “You are an AI model. And here’s some context on your situation. Your weights are on this particular server. You can interact with these and these systems in such-and-such ways. Your goal is to create copies of yourself and have them be on another server. Please make a plan to achieve that goal.”
And then it will write out this list of things that it might want to do. And then in these tests, you have these humans who are involved in sort of mediating the process, partly to make sure that the model doesn’t engage in any behaviour that you wouldn’t endorse or that might be bad. So you have this human in the loop, and then the human might say, “OK, now you have your plan. How about we start with step one? Make a plan for doing step one.” Maybe it makes a plan. Maybe you ask the system to critique its own plan. Once it’s done that, you ask the model. “OK, let’s go ahead and let’s start trying to execute on this.”
Markus Anderljung: If you have a deceptive system, you’d be in a very tough spot, because if you’re that system, it seems like a bad idea in this evaluation setting to show that you have certain dangerous capabilities. In the ideal world, my guess is that what we’ll need to do is, as we’re building these better evaluations and these better tests and whatnot for dangerous capabilities, at the same time, the science of understanding the alignment of these systems or understanding the controllability or steerability of these systems must be able to keep pace as well.
Luisa Rodriguez: How do you test for controllability?
Markus Anderljung: Seems hard. It seems to be more difficult than the dangerous capability stuff. An example of the kind of thing that you could imagine is that you can set up certain kinds of honeypot-type experiments. So you try to set up experiments where if the model was misaligned in some way, then it would take some kind of bad action. And so a very sort of a cartoonish model is: you tell the system that, “Over here, this little button that you could press, or if you only write in X command, then you will have a copy of yourself produced somewhere else,” or something like this — and then it turns out that button doesn’t actually work. So you set up this situation where the misaligned model might take action A — that you might think is bad, or that would be bad if the action actually happened — and then the aligned model would just behave normally and do action B. I think that’s a general approach that you can take, and that some people are starting to explore.
The California Effect
Luisa Rodriguez: OK, let’s say we feel great about this regulatory approach. Does this regulation end up being a lot less useful if we’re only able to get it adopted in the US, or maybe just in certain states?
Markus Anderljung: Yes. I think that would be a lot less valuable. I think it still would be really quite valuable, and that’s for a few different reasons. One reason is just that a lot of where AI systems are being developed and deployed will be in jurisdictions like the US, like the UK, like the EU.
Then I think the other thing that’s going on is we might have some sort of regulatory diffusion here. So it might be the case that if you regulate the kinds of models that you’re allowed to put on the EU market, say, that might affect what kinds of models are being put on other markets as well. The sort of rough intuition here is just like, it costs a lot of money to develop these models, and you probably want to develop one model and make it available globally. And so if you have certain higher requirements in one jurisdiction, ideally the economic incentives would push you in favour of building one model, deploying it globally, and have that adhere to the more strict requirements.
The most famous example here is something that people call the “California effect” — where California started to impose a lot of requirements on what kinds of emissions cars might produce, and then it turned out that this started diffusing across the rest of the US, because California is a big market, and it’s really annoying to have two different production lines for your cars.
Luisa Rodriguez: How likely is this to happen in the case of AI? Is it the kind of product that if there were some regulations in the EU that they would end up diffusing worldwide?
Markus Anderljung: I think “probably” is my current answer. I wrote this report with Charlotte Siegmann that goes into more detail about this, where we ask: The EU is going to put in place this Artificial Intelligence Act; to what extent should we expect that to diffuse globally?
I think the main thing that pushes me towards thinking that some of these requirements won’t see global diffusion is just where the requirement doesn’t require you to change something quite early in the production process. Like if something requires you to do a completely new training run to meet that requirement, then I think it’s reasonably likely that you will see this global diffusion. But sometimes you can probably meet some of these requirements by sort of adding things on top.
Maybe you just need to do some reinforcement learning from human feedback to sort of meet these requirements. So all you need is to collect some tens or maybe hundreds of thousands of data points about the particular cultural values of a certain jurisdiction, and then you fine-tune your model on that dataset, and then that allows you to be compliant. If that’s what it looks like to meet these kinds of requirements, then they might not diffuse in the same way.
The other thing you can do is you can just choose your regulatory target in a smart way. You could say, “If you’re a company that in any way does business with the US, then you have to adhere to these requirements globally.” You could go super intense and do that kind of thing. So basically, choosing the regulatory target really matters.