#195 – Sella Nevo on who’s trying to steal frontier AI models, and what they could do with them

In today’s episode, host Luisa Rodriguez speaks to Sella Nevo — director of the Meselson Center at RAND — about his team’s latest report on how to protect the model weights of frontier AI models from actors who might want to steal them.

They cover:

  • Real-world examples of sophisticated security breaches, and what we can learn from them.
  • Why AI model weights might be such a high-value target for adversaries like hackers, rogue states, and other bad actors.
  • The many ways that model weights could be stolen, from using human insiders to sophisticated supply chain hacks.
  • The current best practices in cybersecurity, and why they may not be enough to keep bad actors away.
  • New security measures that Sella hopes can mitigate the growing risks.
  • Sella’s work using machine learning for flood forecasting, which has significantly reduced injuries and costs from floods across Africa and Asia.
  • And plenty more.

Producer and editor: Keiran Harris
Audio engineering team: Ben Cordell, Simon Monsour, Milo McGuire, and Dominic Armstrong
Additional content editing: Katy Moore and Luisa Rodriguez
Transcriptions: Katy Moore

Highlights

Why protect model weights?

Sella Nevo: The work that we did over the past year focused specifically on the confidentiality of the weights, which is a way of saying we want to make sure that the model weights are not stolen. And the reason we decided to at least start there is because the model weights represent kind of a unique culmination of many different costly prerequisites for training advanced models.

So to be able to produce these model weights, you need significant compute. It was estimated that training GPT-4 cost $78 million and took thousands of GPU-years; Gemini Ultra cost nearly $200 million. And these costs are continuing to rise rapidly. A second thing you need is enormous amounts of training data — rumoured to be more than ten terabytes of training data for GPT-4. And you need all those algorithmic improvements and optimisations that you mentioned, which are used during training.

So if you can access the weights directly, you bypass at least hundreds of millions of dollars — and in practice probably a lot more, once you count the talent and infrastructure that aren’t included in the direct training cost.

But on the other hand, as soon as you have the weights, running inference on a large language model usually costs less than half a cent per 1,000 tokens. There’s still some compute involved, but it’s negligible. There are other things you need. Maybe you need to know the exact architecture, and you can’t always fully infer that from the weights. Obviously you need to have some machine learning understanding to be able to deploy this. But these are all fairly small potatoes relative to being able to produce the weights yourself. So there’s a lot of value in getting to those weights.

Critically, once you do that, you can pretty much do whatever you want: a lot of other defences that labs may have in place no longer apply. If there’s monitoring over the API to make sure you’re not doing things you’re not supposed to, that no longer matters because you’re running it independently. If there are guardrails that are trained into the model to prevent it from doing something, we know you can fine-tune those away, and so those don’t really matter. So really, there’s almost nothing to stop an actor from being able to abuse the model once they have access to the weights.

Luisa Rodriguez: Is their value limited by the fact that once you’ve got the model weights, that model will soon be surpassed by the next generation of frontier models?

Sella Nevo: I think that really depends on what the attacker wants to use them for, or what you as the defender are worried about. If we’re thinking about this in terms of global strategic competition — which countries will have the most capable models for economic progress and things like that — then I think that’s relevant. Still, stealing the models might give an attacker years of advantage relative to where they would have been otherwise.

I’m most concerned about just the abuse of these models to do something terrible. So if we were to evaluate a model and know that you can use it to do something terrible, I don’t really care that the company has an even more capable model a few months later. Someone can still abuse the stolen one to do something terrible.

SolarWinds hack

Sella Nevo: One attack, often called the SolarWinds hack, began in 2019. Let’s start with the end. It installed backdoors, which are hidden ways to get into the system that the attacker can then abuse whenever they want. They installed backdoors in 18,000 organisations, then those backdoors were used to install malware in more than 200 organisations that the attackers chose as high-value targets. These included Microsoft, Cisco, and the cybersecurity firm FireEye; the US Departments of Commerce, Treasury, State, and Homeland Security, as well as the Pentagon; and NATO, organisations in the UK government, and the European Parliament, among others. So they got a lot with one hack.

This is estimated to have been done by a Russian espionage group that’s sometimes referred to as Cozy Bear — there are a lot of fun names in the cybersecurity industry — which is sponsored by Russia’s foreign intelligence service. So this is an example of what’s called a supply chain attack. In a supply chain attack, instead of directly attacking whoever you’re interested in, you attack software or hardware or other infrastructure that they use — that is, their supply chain. This is incredibly powerful, because you can attack software that lots of people use simultaneously, and that’s how you get to these kinds of scales. But also, the infrastructure that we all use is just enormous, so there are endless vulnerabilities hiding in its depths, making these attacks more plausible and feasible.

What they did is they used SolarWinds’ update mechanism. Whenever you have a software update, what your computer is really doing is getting a message from the internet that says, “Here’s a new executable that’s better than the one you have. How about you update it?” Usually this is OK: there are signatures on the files that help you ensure that the company you trust is the one who sent them, so that not just anyone can send you files over the internet. But the attackers put in a backdoor so that they too could send whatever executable they wanted.
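To make that signing check concrete, here is a minimal sketch of the idea in Python using the cryptography library. The key, argument names, and function are hypothetical, and real update mechanisms (including SolarWinds’) are far more elaborate. Notably, the SolarWinds attackers subverted the vendor’s own build process, so their backdoored updates were signed with SolarWinds’ legitimate key and a check like this still passed.

```python
# Minimal sketch of a signed-update check (hypothetical names).
# An update is accepted only if its signature verifies against the
# vendor's public key -- so not just anyone can send you executables.
from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PublicKey

def verify_update(update_bytes: bytes, signature: bytes, vendor_pubkey: bytes) -> bool:
    public_key = Ed25519PublicKey.from_public_bytes(vendor_pubkey)
    try:
        # Raises InvalidSignature unless `signature` was produced by the
        # vendor's private key over exactly these bytes.
        public_key.verify(signature, update_bytes)
        return True
    except InvalidSignature:
        return False
```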

So they did this for, as I mentioned before, the 18,000 organisations that downloaded updates from SolarWinds. And then they cherry-picked the organisations where they wanted to use the backdoor to actually install malware.

What that malware did was, first, lurk silently for about two weeks. Then, seeing that everything was OK, it reached out to command-and-control servers that could tell it what to do. For example, the malware was told to copy and send various types of sensitive information — including emails, documents, and certificates — and given directions for expanding throughout the network beyond the originally compromised devices. And just to sum this up: after they were caught, it was estimated they had about 14 months of unfettered and undetected access to these organisations.

Maybe one final thing that I think was interesting about this specific attack is that sometimes malware is self-propagating. It automatically tries to get wherever it can and so on. This was not an example of that. Even though they were active in 200 organisations, every single attack was manually controlled by a human on the other side, which tells you something about their interests, and what they were looking for, and how much they were willing to invest in this.

Luisa Rodriguez: Yep, yep. Holy cow. This is insane. Do we know what the outcome was? Like, what Russia was able to learn or do using this malware?

Sella Nevo: So we know which networks were compromised. As I mentioned, that was quite a lot of very interesting networks. We know that they did copy and send very large amounts of information. The content of that information we don’t fully know. On the government side, it often remains confidential or classified. And even private companies often try to downplay the effect. It’s a bit hard to know exactly what was taken, but we’re pretty confident it was a lot.

Zero-days

Sella Nevo: There are a lot of different components to a cyberattack: there’s getting into the network; there’s what’s called lateral movement — moving within the network; and the weights may be protected in various ways — maybe they’re encrypted, maybe they’re inside a device — so you would need to overcome those kinds of defences. There are also things you do to make sure you stay undetected. Attackers can use vulnerabilities to achieve any of those goals. And those vulnerabilities are called zero-days if you’re the first one to find them — if they have not yet been publicly reported.

There’s a few things that are interesting here.

One is it’s just incredibly common. We know this happens all the time. We know it’s the bread and butter of information security attacks. Secondly, machine learning infrastructure in particular is shockingly insecure — more so than other types of software infrastructure. There’s two things I think driving this. One is just the fact that the industry is advancing at such a rapid pace, and everyone is rushing to go to market. And this is true from the hardware level to the software level.

GPU firmware is usually unaudited, which is not true for a lot of other firmware. The software infrastructure that people use for training and for monitoring their training runs has enormous, sprawling dependencies. We mentioned supply chain attacks before: there have already been vulnerabilities introduced into these systems. Some of this infrastructure even says in its documentation, “This is not meant to be used in a secure environment” — yet it is key infrastructure used in essentially all machine learning systems. So the situation with machine learning infrastructure really is particularly bad, and we’re quite far from even reaching the standard practice for software systems, which in and of itself is not that great.

There’s another thing that I’m worried about, and I think more commercial companies should be worried about — again, as this moves on to not just being worried about cybercriminals, but about nation-states — which is that if you’re a nation-state, you can cheat in how you get zero-days.

For example, China has a convenient set of regulations titled “Regulations on the Management of Network Product Security Vulnerabilities.” Broadly what that regulation says is that any researcher or any organisation that has any footprint in China is required to report any vulnerabilities that they find to the government — which we know that the government then hands off to their offensive cyber organisations.

And simultaneously, they also impose severe penalties if you share that information with almost any other organisation. So you should not be surprised that China has dozens or hundreds — I don’t know exactly how many — zero-days of their own.

And then another way you can get huge amounts of zero-days, even if you’re not a state but are a sufficiently capable actor — maybe at the OC4 or OC5 level — is to hack into the channels through which zero-days are reported: if other people find vulnerabilities, they have to report them to the company somehow. There’s a lot of different infrastructure in place for that. If you can get access to that infrastructure, you get a continuous stream of every new zero-day that anyone finds.

Side-channel attacks

Sella Nevo: For centuries, when people tried to understand communication systems and undermine their defences — for example, encryption: you want to know what some encrypted information says — they looked at the inputs and the outputs. Here’s some text that we want to encrypt; here’s the encrypted text that comes out.

At some point, someone figured out, what about all other aspects of the system? What about the system’s temperature? What about its electricity usage? What about the noise it makes as it runs? What about the time that it takes to complete the actual encryption? And we tend to think of computation as abstract. That’s what digital systems are meant to do: to abstract out. But there’s always some physical mechanism that actually is running that computation — and physical mechanisms have physical effects on the world.

So it turns out that all these physical effects are super informative. There’s a lot going on here with, as I mentioned, temperature and electricity and things like that, but let me give a very simple example. RSA is a famous type of encryption. As part of its operation, it takes one number to the power of another number. Pretty simple — I’m not trying to say anything too exotic here. An efficient way of calculating that primarily uses multiplication and squaring.

It turns out that the operation to multiply two numbers and the operation to square a number use different amounts of electricity. So if you track the electricity usage over time, you can literally identify exactly which numbers it’s working with, and break the encryption within minutes. That’s an example of a side-channel attack.
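To see why the operation sequence leaks the key, here is a toy Python sketch (an illustration only, nothing like production RSA code) of the square-and-multiply loop Sella describes. A squaring happens for every bit of the exponent, but a multiplication happens only for the 1-bits — so a power trace that can tell the two apart reads the secret exponent off bit by bit.

```python
def square_and_multiply(base: int, exponent: int, modulus: int) -> int:
    # Left-to-right binary exponentiation, as in naive RSA implementations.
    result = 1
    for bit in bin(exponent)[2:]:  # exponent's bits, most significant first
        result = (result * result) % modulus    # square: done for every bit
        if bit == "1":
            result = (result * base) % modulus  # multiply: only for 1-bits
    return result

# Sanity check against Python's built-in modular exponentiation.
assert square_and_multiply(7, 13, 61) == pow(7, 13, 61)
```

Modern implementations defend against this by making the sequence of operations independent of the key — for example, via constant-time “Montgomery ladder” exponentiation.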

This is a bit of an old one — it’s been known for, I don’t remember exactly how long, but well over a decade. To give a more modern one: just a year ago, there was a new paper showing that you can run malware on a cell phone. Cell phones often are not our work devices; we think of them as our personal devices. With the cell phone’s microphone, the malware can listen to someone typing in their password and identify what that password is, because each key sounds slightly different when tapped.

Luisa Rodriguez: And do you mean physical keys on a slightly older cell phone, or do you mean like my iPhone, which has digital keys on a touchscreen?

Sella Nevo: Oh sorry, just to clarify: I actually don’t mean them typing their password on their phone. I mean that their phone is on the desk. They are typing their password into their work computer. You identify what the password is on the work computer.

Luisa Rodriguez: Holy crap. That is wild.

Sella Nevo: Let me just add one recent concern: as cloud computing becomes more common, side-channel attacks are a huge problem. Often when you run something on the cloud, you share a server with others, and therefore you’re sharing certain resources. Other tenants can directly or indirectly observe how much processing your application is using and things like that — and if they’re smart, they can infer information about your application from it. So that is a huge area of potential information leaks in cloud computing.
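The cache and scheduling attacks used against cloud tenants are involved, but the underlying principle — secret-dependent resource usage — shows up even in tiny programs. Here is a hypothetical toy example in Python of a timing side channel, alongside the standard constant-time mitigation:

```python
import hmac

def naive_check(secret: str, guess: str) -> bool:
    # Early-exit comparison: returns as soon as a character differs, so
    # response time grows with how many leading characters the guess got
    # right. An attacker who can measure timing precisely can recover the
    # secret one character at a time.
    if len(guess) != len(secret):
        return False
    for s, g in zip(secret, guess):
        if s != g:
            return False
    return True

def constant_time_check(secret: str, guess: str) -> bool:
    # Standard mitigation: comparison time does not depend on where
    # (or whether) the strings differ.
    return hmac.compare_digest(secret.encode(), guess.encode())
```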

USB cables

Sella Nevo: Let’s put highly secure air-gapped networks aside for a moment and just talk about getting a USB device connected to a network. It’s worth flagging that this is a really easy thing to do. One thing that people will do — and this is not just nation-states and whatnot; this is random hackers who want to do things for the fun of it — is just drop a bunch of USB sticks in the parking lot of an organisation, and someone will inevitably be naive enough to think, “Oh no, someone has dropped this. Let’s plug it in and see who it belongs to.”

Sella Nevo: And you’re done. Now you’re in, and you can spread through the internal network. This happens all the time. It’s happened multiple times at multiple nuclear sites in the United States. So yeah, this is a pretty big deal.

Sella Nevo: Now, I think that many people, like you, will find that surprising. But I think security folks are kind of like, “Well, no one would do that — everyone in security knows that you shouldn’t plug in a random USB stick.”

Luisa Rodriguez: Shouldn’t just pick up a USB stick. Yeah.

Sella Nevo: But let me challenge even those folks who think that this is obvious, and also in that way bring it back to the more secure networks we were talking about before. So indeed organisations with serious security know not to plug in random USB sticks. But what about USB cables? So Luisa, let me ask you, actually: if you needed a USB cable, and you just saw one in the hallway or something, would you use it?

Luisa Rodriguez: 100% I would use that. Absolutely. I actually, I’m sure I’ve literally already done that.

Sella Nevo: So here’s an interesting fact, which I think even most security folks don’t know. You could actually buy a USB cable — not a USB stick, a USB cable — for $180 that is hiding a USB stick inside and can communicate wirelessly back home.

So once you stick that cable in, an attacker can control your system from afar — not even in the mode I mentioned before, where you wait until the USB stick is plugged in again. It can continuously communicate with and control your system. I guarantee you that if you toss that cable onto a tech organisation’s cable shelf, it’ll be plugged in.

Luisa Rodriguez: Absolutely. Yeah. That’s really crazy. Has that been used in the real world?

Sella Nevo: I don’t know. There’s a company that’s selling them. I haven’t seen reports of when it’s been used, but presumably if it’s a product on the market, someone is buying it.

Articles, books, and other media discussed in the show

RAND is currently hiring for a brand-new role very relevant to the topics discussed in this podcast: Mid/Senior AI and Information Security Analyst. Check it out as well as RAND’s other open positions in technical and policy information security if you’re interested in this field!


About the show

The 80,000 Hours Podcast features unusually in-depth conversations about the world's most pressing problems and how you can use your career to solve them. We invite guests pursuing a wide range of career paths — from academics and activists to entrepreneurs and policymakers — to analyse the case for and against working on different issues and which approaches are best for solving them.

The 80,000 Hours Podcast is produced and edited by Keiran Harris. Get in touch with feedback or guest suggestions by emailing [email protected].

What should I listen to first?

We've carefully selected 10 episodes we think it could make sense to listen to first, on a separate podcast feed:

Check out 'Effective Altruism: An Introduction'

If you're new, see the podcast homepage for ideas on where to start, or browse our full episode archive.