#195 – Sella Nevo on who’s trying to steal frontier AI models, and what they could do with them
By Luisa Rodriguez and Keiran Harris · Published August 1st, 2024
On this page:
- Introduction
- 1 Highlights
- 2 Articles, books, and other media discussed in the show
- 3 Transcript
- 3.1 Cold open [00:00:00]
- 3.2 Luisa's intro [00:00:56]
- 3.3 The interview begins [00:02:30]
- 3.4 The importance of securing the model weights of frontier AI models [00:03:01]
- 3.5 The most sophisticated and surprising security breaches [00:10:22]
- 3.6 AI models being leaked [00:25:52]
- 3.7 Researching for the RAND report [00:30:11]
- 3.8 Who tries to steal model weights? [00:32:21]
- 3.9 Malicious code and exploiting zero-days [00:42:06]
- 3.10 Human insiders [00:53:20]
- 3.11 Side-channel attacks [01:04:11]
- 3.12 Getting access to air-gapped networks [01:10:52]
- 3.13 Model extraction [01:19:47]
- 3.14 Reducing and hardening authorised access [01:38:52]
- 3.15 Confidential computing [01:48:05]
- 3.16 Red-teaming and security testing [01:53:42]
- 3.17 Careers in information security [01:59:54]
- 3.18 Sella's work on flood forecasting systems [02:01:57]
- 3.19 Luisa's outro [02:04:51]
- 4 Learn more
- 5 Related episodes
In today’s episode, host Luisa Rodriguez speaks to Sella Nevo — director of the Meselson Center at RAND — about his team’s latest report on how to protect the model weights of frontier AI models from actors who might want to steal them.
They cover:
- Real-world examples of sophisticated security breaches, and what we can learn from them.
- Why AI model weights might be such a high-value target for adversaries like hackers, rogue states, and other bad actors.
- The many ways that model weights could be stolen, from using human insiders to sophisticated supply chain hacks.
- The current best practices in cybersecurity, and why they may not be enough to keep bad actors away.
- New security measures that Sella hopes can mitigate the growing risks.
- Sella’s work using machine learning for flood forecasting, which has significantly reduced injuries and costs from floods across Africa and Asia.
- And plenty more.
Producer and editor: Keiran Harris
Audio engineering team: Ben Cordell, Simon Monsour, Milo McGuire, and Dominic Armstrong
Additional content editing: Katy Moore and Luisa Rodriguez
Transcriptions: Katy Moore
Highlights
Why protect model weights?
Sella Nevo: The work that we did over the past year focused specifically on the confidentiality of the weights, which is a way of saying we want to make sure that the model weights are not stolen. And the reason we decided to at least start there is because the model weights represent kind of a unique culmination of many different costly prerequisites for training advanced models.
So to be able to produce these model weights, you need significant compute. It was estimated that GPT-4 cost $78 million and thousands of GPU years. Gemini Ultra cost nearly $200 million. And these costs are continuing to rise rapidly. A second thing you need is enormous amounts of training data. It’s been rumoured to be more than ten terabytes of training data for GPT-4. You need all those algorithmic improvements and optimisations that are used during training that you mentioned.
So if you can access the weights directly, you bypass at least hundreds of millions of dollars — and probably in practice a lot more that comes with talent and infrastructure and things like that that are not counted in the direct training cost.
But on the other hand, as soon as you have the weights, computing inference from a large language model is usually less than half a cent per 1,000 tokens. There’s still some compute involved, but it’s negligible. There are other things you need. Maybe you need to know the exact architecture, and you can’t always fully infer that from the weights. Obviously you need to have some machine learning understanding to be able to deploy this. But these are all fairly small potatoes relative to being able to produce the weights yourself. So there’s a lot of value in getting to those weights.
Critically, once you do that, you can pretty much do whatever you want: a lot of other defences that labs may have in place no longer apply. If there’s monitoring over the API to make sure you’re not doing things you’re not supposed to, that no longer matters because you’re running it independently. If there are guardrails that are trained into the model to prevent it from doing something, we know you can fine-tune those away, and so those don’t really matter. So really, there’s almost nothing to stop an actor from being able to abuse the model once they have access to the weights.
Luisa Rodriguez: Is their value limited by the fact that once you’ve got the model weights, that model will soon be surpassed by the next generation of frontier models?
Sella Nevo: I think that really depends on what the attacker wants to use them for, or what you as the defender are worried about. If we’re thinking about this like global strategic competition considerations — which countries will have the most capable models for economic progress and things like that — then I think that’s relevant. Still, stealing the models might give an attacker years of advantage relative to where they would have been otherwise.
I’m most concerned about just the abuse of these models to do something terrible. So if we were to evaluate a model and know that you can use it to do something terrible, I don’t really care that the company a few months later is even more capable. Still someone can abuse it to do something terrible.
SolarWinds hack
Sella Nevo: One attack, often called the SolarWinds hack, began in 2019. Let’s start with the end. It installed backdoors, which are hidden ways to get into the system that the attacker can then abuse whenever they want. They installed backdoors in 18,000 organisations, and then those backdoors were used to install malware in more than 200 organisations that the attackers chose as high-value targets. These included Microsoft, Cisco, the cybersecurity firm FireEye. It included the US Departments of Commerce, Treasury, State, Homeland Security, and the Pentagon and others. It included NATO, other organisations in the UK government, the European Parliament. So they got a lot with one hack.
This is estimated to have been done by a Russian espionage group that’s sometimes referred to as Cozy Bear — there’s a lot of fun names in the cybersecurity industry — which is sponsored by Russia’s foreign intelligence service. So this is an example of what’s called a supply chain attack. In a supply chain attack, instead of directly attacking whoever you’re interested in, you can attack software or hardware or other infrastructure that they use — so they’re part of their supply chain. This is incredibly powerful, because you can attack software that lots of people use simultaneously, and that’s how you get to these kinds of scales. But also, the infrastructure that we all use is just enormous, so there are endless vulnerabilities hiding in its depths, making this more plausible and feasible.
What they did is they used SolarWinds’ update mechanism. So whenever you have software updates, really what your computer is doing is getting a message from the internet that says, “Here’s a new executable that’s better than the one you have. How about you update it?” Usually this is OK. There are signatures on the files that help you ensure that the company that you trust is the one who sent them; not anyone can send you files over the internet. But they put in a backdoor so that the attackers as well could send whatever executable they want.
So they did this for, as I mentioned before, the 18,000 organisations that downloaded updates from SolarWinds. And then they kind of cherry-picked the organisations where they wanted to use the backdoor to actually install malware.
What that malware did was, first, it kind of lurked silently for about two weeks, and then, seeing that everything was OK, it reached out to command and control servers that could tell it what to do. For example, they told this malware to copy and send various types of sensitive information — including emails, documents, and certificates — and gave it directions for expanding throughout the network beyond the original devices that were compromised. And just to sum this up, after they were caught, it was estimated they had about 14 months of unfettered and undetected access to these organisations.
Maybe one final thing that I think was interesting about this specific attack is that sometimes malware is self-propagating. It automatically tries to get wherever it can and so on. This was not an example of that. Even though they were active in 200 organisations, every single attack was manually controlled by a human on the other side, which tells you something about their interests, and what they were looking for, and how much they were willing to invest in this.
Luisa Rodriguez: Yep, yep. Holy cow. This is insane. Do we know what the outcome was? Like, what Russia was able to learn or do using this malware?
Sella Nevo: So we know which networks were compromised. We had mentioned this was quite a lot of very interesting networks. We know that they did copy and send very large amounts of information. The content of that information we don’t fully know. When it’s on the government side, it often remains confidential or classified. And even with private companies, they often try to downplay the effect. It’s a bit hard to know exactly what was taken, but we’re pretty confident it was a lot.
Zero-days
Sella Nevo: There are a lot of different components to a cyberattack: there’s getting into the network, there’s what’s called lateral movement — so moving within the network. Maybe the weights are protected in various ways: maybe they’re encrypted, maybe they’re inside a device. You would need to overcome those kinds of defences. There are other things you need to do to make sure that you’re undetected. Attackers can use vulnerabilities to achieve any of those goals. And those vulnerabilities are called zero-days if you’re the first one to find them, if they have not yet been publicly reported.
There’s a few things that are interesting here.
One is it’s just incredibly common. We know this happens all the time. We know it’s the bread and butter of information security attacks. Secondly, machine learning infrastructure in particular is shockingly insecure — more so than other types of software infrastructure. There’s two things I think driving this. One is just the fact that the industry is advancing at such a rapid pace, and everyone is rushing to go to market. And this is true from the hardware level to the software level.
GPU firmware is usually unaudited, which is not true for a lot of other firmware. The software infrastructure that people use for training and for monitoring their training runs has these enormous sprawling dependencies. We mentioned supply chain attacks before. There have already been vulnerabilities introduced to these systems. Some of this infrastructure says in its documentation, “This is not meant to be used in a secure environment” — yet it is key infrastructure that is used in all machine learning systems. So really the situation with machine learning infrastructure is particularly bad, and we’re quite far from even reaching the kind of standard practice for software systems, which in and of itself is not that great.
There’s another thing that I’m worried about, and I think more commercial companies should be worried about — again, as this moves on to not just being worried about cybercriminals, but about nation-states — which is that if you’re a nation-state, you can cheat in how you get zero-days.
For example, China has a convenient set of regulations titled “Regulations on the Management of Network Product Security Vulnerabilities.” Broadly what that regulation says is that any researcher or any organisation that has any footprint in China is required to report any vulnerabilities that they find to the government — which we know that the government then hands off to their offensive cyber organisations.
And simultaneously, they also put severe penalties if you share that information with almost any other organisation. So you should not be surprised that China then has dozens or hundreds or, I don’t know exactly how many, zero-days of their own.
And then maybe another way that you can get huge amounts of zero-days is, even if you’re not a state, but you’re a sufficiently capable actor — maybe you’re in the OC4 level or OC5 — you can hack into the channels through which zero-days are reported: if other people find vulnerabilities, they have to report to the company somehow. There’s a lot of different infrastructure in place for that. If you can get access to that infrastructure, you get a continuous stream of all new zero-days that anyone can find.
Side-channel attacks
Sella Nevo: For centuries, when people tried to understand communication systems and how they can undermine their defences — for example, encryption; you want to know what some encrypted information is — they looked at the inputs. There’s some text that we want to encrypt, and here’s the outputs. Here’s the encrypted text.
At some point, someone figured out, what about all other aspects of the system? What about the system’s temperature? What about its electricity usage? What about the noise it makes as it runs? What about the time that it takes to complete the actual encryption? And we tend to think of computation as abstract. That’s what digital systems are meant to do: to abstract out. But there’s always some physical mechanism that actually is running that computation — and physical mechanisms have physical effects on the world.
So it turns out that all these physical effects are super informative. Let me just give a very simple one; there are a lot of options here, as I mentioned, like temperature and electricity and things like that. RSA is a famous type of encryption. As part of its operation, it takes one number and raises it to the power of another number. Pretty simple; I’m not trying to say something too exotic here. An efficient way of calculating that mostly uses multiplication and squaring.
It turns out that the operation to multiply two numbers and the operation to square a number use different amounts of electricity. So if you track the electricity usage over time, you can literally identify exactly what numbers it’s working with, and break the encryption within minutes. That’s an example of a side-channel attack.
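[Illustrative note: below is a minimal Python sketch of the textbook square-and-multiply exponentiation being described here. It is a generic illustration rather than any particular RSA library's code: the point is that the multiply step only runs for the 1-bits of the secret exponent, which is exactly the pattern a power or timing trace can reveal.]

```python
# Textbook left-to-right square-and-multiply modular exponentiation.
# A power or timing trace that can tell "square" apart from "multiply"
# recovers the exponent bit by bit, because the multiply only runs on 1-bits.
def modexp_square_and_multiply(base: int, exponent: int, modulus: int) -> int:
    result = 1
    for bit in bin(exponent)[2:]:                 # exponent bits, most significant first
        result = (result * result) % modulus      # square: happens for every bit
        if bit == "1":
            result = (result * base) % modulus    # multiply: only for 1-bits (the leak)
    return result

# Sanity check against Python's built-in modular exponentiation.
assert modexp_square_and_multiply(7, 13, 97) == pow(7, 13, 97)
```

[Real implementations defend against this with constant-time or blinded exponentiation, which is part of why side-channel resistance has to be designed in deliberately.]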
This is a bit of an old one. It’s been known for, I don’t remember the exact time, but well over a decade. To give a more modern one: just a year ago, there was a new paper showing that you can run malware on a cell phone. Cell phones often are not our work devices; we think of them as our personal devices and things like that. Through the cell phone’s microphone, you can listen to someone typing in their password and identify what their password is, because different keys sound slightly different when you tap them.
Luisa Rodriguez: And do you mean physical keys on a slightly older cell phone, or do you mean like my iPhone, which has digital keys on a touchscreen?
Sella Nevo: Oh sorry, just to clarify: I actually don’t mean them typing their password on their phone. I mean that their phone is on the desk. They are typing their password into their work computer. You identify what the password is on the work computer.
Luisa Rodriguez: Holy crap. That is wild.
Sella Nevo: Let me just add that I think that one recent concern is, as cloud computing is becoming more common, side-channel attacks are a huge problem. Often when you run something on the cloud, you share a server with others, therefore you’re sharing certain resources. A lot of people can directly or indirectly see actually how much processing your application is using and things like that. And if they’re smart, they can infer information in your applications. So that is a huge area of potential information leaks in cloud computing.
USB cables
Sella Nevo: Let’s put highly secure air-gapped networks aside for a moment and just talk about getting a USB stick to connect to a network. It’s worth flagging that this is a really easy thing to do. One thing that people will do — and this is not just nation-states and whatnot; this is random hackers who want to do things for the fun of it — is just drop a bunch of USB sticks in the parking lot of an organisation, and someone will inevitably be naive enough to be like, “Oh no, someone has dropped this. Let’s plug it in and see who this belongs to.”
Sella Nevo: And you’re done. Now you’re in and you can spread in the internal network. This happens all the time. It’s happened multiple times in multiple nuclear sites in the United States. So yeah, this is a pretty big deal.
Sella Nevo: Now, I think that many people, like you, will find that surprising. I think security folks are kind of being like, “Well, no one would. Everyone knows, everyone in security knows that you shouldn’t plug in a USB stick.”
Luisa Rodriguez: Shouldn’t just pick up a USB stick. Yeah.
Sella Nevo: But let me challenge even those folks who think that this is obvious, and also in that way bring it back to the more secure networks we were talking about before. So indeed organisations with serious security know not to plug in random USB sticks. But what about USB cables? So Luisa, let me ask you, actually: if you needed a USB cable, and you just saw one in the hallway or something, would you use it?
Luisa Rodriguez: 100% I would use that. Absolutely. I actually, I’m sure I’ve literally already done that.
Sella Nevo: So here’s an interesting fact, which I think even most security folks don’t know. You could actually buy a USB cable — not a USB stick, a USB cable — for $180 that is hiding a USB stick inside and can communicate wirelessly back home.
So once you stick that cable in, an attacker can now control your system from afar — not even in the mode that I mentioned before, where you wait until the USB stick is plugged in again. It can just continuously communicate with and control your system. I guarantee you that if you toss that cable onto a tech organisation’s cable shelf, it’ll be plugged in.
Luisa Rodriguez: Absolutely. Yeah. That’s really crazy. Has that been used in the real world?
Sella Nevo: I don’t know. There’s a company that’s selling them. I haven’t seen reports of when it’s been used, but presumably if it’s a product on the market, someone is buying it.
Articles, books, and other media discussed in the show
RAND is currently hiring for a brand-new role very relevant to the topics discussed in this podcast: Mid/Senior AI and Information Security Analyst. Check it out as well as RAND’s other open positions in technical and policy information security if you’re interested in this field!
Sella’s work:
- Securing AI model weights: Preventing theft and misuse of frontier models (with coauthors)
- Supporting follow-up screening for flagged nucleic acid synthesis orders (with Tessa Alexanian)
- Flood forecasting with machine learning models in an operational framework (with coauthors) — and also check out the preprint of a randomised controlled trial of the early warning system: Forecasting fate: Experimental evaluation of a flood early warning system
- See all Sella’s work on his website
Other work in this area:
- Stealing part of a production language model by Nicholas Carlini et al.
- AI security risk assessment using Counterfit by Ram Shankar Siva Kumar
- A practical deep learning-based acoustic side channel attack on keyboards by Joshua Harrison, Ehsan Toreini, and Maryam Mehrnezhad
- China’s approach to software vulnerabilities reporting — episode on The Lawfare Podcast
80,000 Hours podcast episodes and resources:
- Career review: Information security in high-impact areas
- Nova DasSarma on why information security may be critical to the safe development of AI systems
- Lennart Heim on the compute governance era and what has to come after
- Markus Anderljung on how to regulate cutting-edge AI models
- Bruce Schneier on how insecure electronic voting could break the United States — and surveillance without tyranny
- Kevin Esvelt on cults that want to kill everyone, stealth vs wildfire pandemics, and how he felt inventing gene drives
Transcript
Cold open [00:00:00]
Sella Nevo: Let’s say that you’re the CEO of a frontier lab. I would argue there is no chance that you have 50 employees that you are at least 98% confident wouldn’t steal the weights. Remember, these things are worth at least hundreds of millions of dollars — and someone might be bribing them, extorting them, using an ideology that they believe in, and so on. Maybe you’re a great socialite, but I do not have 50 people that I know that well to know they would not do that with 98% confidence.
But let’s imagine that you do have 50 people that you’re 98% confident in. Still, if you gave all 50 of those people permission to read the weights, full-read access, that would allow them potentially to leak the weights. And 98% to the power of 50 is about 37%, which is the chance that none of them leak. So there is still almost a two-thirds chance of a leak with only 50 employees with access. And these are highly trusted employees.
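[A quick check of the arithmetic, assuming each employee's 2% risk is independent:]

```python
# Chance that none of 50 employees leaks, if each independently has a 98%
# chance of never leaking, and the resulting chance of at least one leak.
p_no_leak = 0.98 ** 50        # ~0.364, i.e. roughly the 37% Sella mentions
p_leak = 1 - p_no_leak        # ~0.636, i.e. almost a two-thirds chance of a leak
print(f"P(no leak) = {p_no_leak:.2f}, P(at least one leak) = {p_leak:.2f}")
```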
Luisa’s intro [00:00:56]
Luisa Rodriguez: Hi listeners, this is Luisa Rodriguez, one of the hosts of The 80,000 Hours Podcast.
In today’s episode, I spoke to Sella Nevo, director of the Meselson Center at RAND, about his team’s findings from interviewing about 30 cybersecurity experts – people at AI companies, cybersecurity companies, and in national security — about who might try to steal AI model weights, and how they’d do it.
For those who need a quick reminder: model weights are the specific values that a neural network uses to generate its outputs in response to prompts. In a large language model, they’re the values that enable the model to generate its responses after you’ve prompted it asking for a poem about snails or whatever.
In the interview, Sella and I talk about:
- Whether it would really be that bad for a foreign group or country to steal a leading AI lab’s model weights.
- The many ways that a bad actor might try to steal model weights — several of which genuinely blew my mind.
- The craziest security breaches that have already happened, and what we can learn from them.
- The security measures Sella is excited about leading AI labs implementing to protect themselves against attacks.
- Plus we end with some super cool work Sella did using machine learning for flood forecasting, which has significantly reduced injuries and costs from floods across Africa and Asia.
All right, without further ado, here’s Sella Nevo.
The interview begins [00:02:30]
Luisa Rodriguez: Today I’m speaking with Sella Nevo. Sella is director of the Meselson Center and a senior information scientist at RAND. He also acts as a venture partner at the climate-focused VC firm Firstime, and as an advisory board member of ALLFED, the Alliance to Feed the Earth in Disasters.
In this conversation, we’re talking about his work at RAND, where he’s been putting together a report on how to protect the model weights of frontier AI models from actors who might want to steal them. Thanks for coming on the podcast, Sella.
Sella Nevo: Yeah, it’s a pleasure to be here.
The importance of securing the model weights of frontier AI models [00:03:01]
Luisa Rodriguez: I hope to talk about what are the worst security breaches of the last few decades, and who might want to steal model weights and why. But first, why is it so important to secure the model weights of frontier AI models?
Sella Nevo: That’s a great question. AI models are rapidly becoming more capable. They already have very, very significant commercial value, which is already an important motivation. But it seems highly plausible that in the very near future they will have significant national security implications. For example, there’s a whole discussion these days on whether they can or cannot help people develop bioweapons. That’s a nascent discussion that I’m not going to dive into, but at the very least it seems very possible that in the near future they will be very useful for important national security actions.
The more capable they become, the more important it is to protect them. Sadly, we know that there are many people who would use them for harm, whether we’re thinking about terrorist organisations, rogue states, anarchistic hacker groups and others. Just to give a sense, it took less than a month from the launch of GPT-4 — which at the time was this huge deal; no one knew its actual capabilities, and there was all this talk of “sparks of artificial general intelligence” and so on — within a month, someone had run and published what’s called ChaosGPT, which was an autonomous agent based on GPT-4 that was given the goal to try and destroy humanity.
Now, just to clarify, it really didn’t get very far. AI agents aren’t currently very good at complex strategy and planning. So it’s not that I’m worried about ChaosGPT specifically, but it’s hard to imagine things going very well if AI models are able to, say, develop bioweapons that could kill millions of people, or maybe they can hack and disrupt critical infrastructure — and at the same time, organisations that believe in ideological violence, for example, can access them.
Luisa Rodriguez: Yeah, is there anything going on right now that’s a good, concrete example?
Sella Nevo: One of the things that I’m currently most concerned about is indeed the use of AI in the biosecurity context. I might be biased, because other than my work on AI security, a lot of my work is on biosecurity, so I’m very exposed and familiar with it. But I also think it’s an interesting one. Today, biological risks really are one of the most plausible ways for truly large-scale, global-scale harms. You really, right now, can’t kill the same number of people with any other technology.
And yet the influence of AI on biology seems incredibly plausible, tangible, and close. Specialised AI models are already driving the state of the art in biology. And more general models, like large language models, are really at the cusp of providing useful advice, understanding research protocols, even being able to drive certain experiments — though unclear yet if they can really be a game changer at this time. So that’s a risk that I’m particularly acutely concerned about today.
Luisa Rodriguez: Why focus on model weights in particular? I would have guessed it would have been more important to protect the code base, or maybe especially powerful algorithms.
Sella Nevo: That’s a great question. Maybe the first thing that’s important to say is that it’s not only important to protect the model weights. There really are, as you just noted, a lot of other components that are worth securing — including the code base, including specific algorithmic insights, including training data, including the model APIs that can be abused even if nothing is stolen. And maybe there’s a whole additional world of protecting a model’s integrity: making sure that a model is not maliciously changed, rather than not maliciously stolen.
But the work that we did over the past year focused specifically on the confidentiality of the weights, which is a way of saying we want to make sure that the model weights are not stolen. And the reason we decided to at least start there is because the model weights represent kind of a unique culmination of many different costly prerequisites for training advanced models.
So to be able to produce these model weights, you need significant compute. It was estimated that GPT-4 cost $78 million and thousands of GPU years. Gemini Ultra cost nearly $200 million. And these costs are continuing to rise rapidly. A second thing you need is enormous amounts of training data. It’s been rumoured to be more than ten terabytes of training data for GPT-4. You need all those algorithmic improvements and optimisations that are used during training that you mentioned.
So if you can access the weights directly, you bypass at least hundreds of millions of dollars — and probably in practice a lot more that comes with talent and infrastructure and things like that that are not counted in the direct training cost.
But on the other hand, as soon as you have the weights, computing inference from a large language model is usually less than half a cent per 1,000 tokens. There’s still some compute involved, but it’s negligible. There are other things you need. Maybe you need to know the exact architecture, and you can’t always fully infer that from the weights. Obviously you need to have some machine learning understanding to be able to deploy this. But these are all fairly small potatoes relative to being able to produce the weights yourself. So there’s a lot of value in getting to those weights.
Critically, once you do that, you can pretty much do whatever you want: a lot of other defences that labs may have in place no longer apply. If there’s monitoring over the API to make sure you’re not doing things you’re not supposed to, that no longer matters because you’re running it independently. If there are guardrails that are trained into the model to prevent it from doing something, we know you can fine-tune those away, and so those don’t really matter. So really, there’s almost nothing to stop an actor from being able to abuse the model once they have access to the weights.
Luisa Rodriguez: Is their value limited by the fact that once you’ve got the model weights, that model will soon be surpassed by the next generation of frontier models?
Sella Nevo: I think that really depends on what the attacker wants to use them for, or what you as the defender are worried about. If we’re thinking about this like global strategic competition considerations — which countries will have the most capable models for economic progress and things like that — then I think that’s relevant. Still, stealing the models might give an attacker years of advantage relative to where they would have been otherwise.
I’m most concerned about just the abuse of these models to do something terrible. So if we were to evaluate a model and know that you can use it to do something terrible, I don’t really care that the company a few months later is even more capable. Still someone can abuse it to do something terrible.
Luisa Rodriguez: Sure. OK. Yeah, that makes sense.
The most sophisticated and surprising security breaches [00:10:22]
Luisa Rodriguez: So that’s the case for being worried about the potential for model weights to be stolen. Next, I wanted to help listeners get an intuitive sense of why it’s at all plausible that it’s possible to steal them.
Before prepping for this episode, I didn’t know very much about just how many severe breaches there have been of information that I would have thought would have been extremely secure — which I feel like gave me a much better sense of how difficult information security is, even in cases where there are massive incentives to keep information secure.
You can’t talk about the most advanced hacks from recent years, but there are some you can talk about, so I want to ask you about those. What do you think is the most sophisticated or surprising security breach carried out in the last few decades that you’re allowed to talk about?
Sella Nevo: One attack, often called the SolarWinds hack, began in 2019. Let’s start with the end. It installed backdoors, which are hidden ways to get into the system that the attacker can then abuse whenever they want. They installed backdoors in 18,000 organisations, and then those backdoors were used to install malware in more than 200 organisations that the attackers chose as high-value targets. These included Microsoft, Cisco, the cybersecurity firm FireEye. It included the US Departments of Commerce, Treasury, State, Homeland Security, and the Pentagon and others. It included NATO, other organisations in the UK government, the European Parliament. So they got a lot with one hack.
This is estimated to have been done by a Russian espionage group that’s sometimes referred to as Cozy Bear — there’s a lot of fun names in the cybersecurity industry — which is sponsored by Russia’s foreign intelligence service. So this is an example of what’s called a supply chain attack. In a supply chain attack, instead of directly attacking whoever you’re interested in, you can attack software or hardware or other infrastructure that they use — so they’re part of their supply chain. This is incredibly powerful, because you can attack software that lots of people use simultaneously, and that’s how you get to these kinds of scales. But also, the infrastructure that we all use is just enormous, so there are endless vulnerabilities hiding in its depths, making this more plausible and feasible.
This example is a very deep supply chain attack. We actually begin not even with SolarWinds, which is where the name comes from, but with Microsoft. They started by identifying two interesting things about the way that Microsoft’s products work.
The first is that Microsoft sells a lot of its products through what’s called third-party resellers. So the person you directly buy, let’s say, Windows from is not Microsoft itself; it’s someone in your country that has a licence from Microsoft. Those resellers often will have continued access to their clients’ systems, at least in some sense. For example, if you pay them more, they are then authorised to install new products or to enable new users. We know that Cozy Bear attacked these resellers; instead of attacking Microsoft directly, they attacked this third party, and then used them to get access to networks of users that use Microsoft products.
Once they’re in the network, they use another vulnerability, this time on Microsoft’s authentication protocol — the vulnerability is called Zerologon for those who are familiar — and broadly, this allowed them to get all the usernames and passwords of all users in the network that they are on. Just convenient. Now, we know that this happened. We think that this is also how they originally got into the network of SolarWinds. We’re not 100% sure, but that’s currently, I think, the best estimate.
Luisa Rodriguez: And SolarWinds itself is an information security company?
Sella Nevo: Close. SolarWinds is a company that develops software that helps businesses manage their network systems and IT infrastructure. It’s not a cybersecurity company per se, but it does control the foundations of hundreds of thousands of companies’ networks, and has direct access to a lot of their cybersecurity infrastructure. So it is indeed a very convenient place to be.
Once they got into SolarWinds — this is now the second stage of the supply chain attack — they subverted their build system. A build system is a system responsible for putting together large software systems for developers. So developers will write their code in a lot of different files. You want to put them together into one big application. They subverted that. What that means is that SolarWinds’ developers were seeing completely legitimate source code that they wrote, but when they actually deployed it, a malicious version switched the files that were being used from the files they thought they were putting into the application to the files that the attackers wanted to put into that application.
Now, at this stage, the attackers are firmly entrenched into SolarWinds’ network. In fact, they stayed there and tested their malware for months on SolarWinds’ servers before doing anything to touch the clients. But eventually, after they were sufficiently pleased with how things were going, they used this capability to create a backdoor in the software.
What they did is they used SolarWinds’ update mechanism. So whenever you have software updates, really what your computer is doing is getting a message from the internet that says, “Here’s a new executable that’s better than the one you have. How about you update it?” Usually this is OK. There are signatures on the files that help you ensure that the company that you trust is the one who sent them; not anyone can send you files over the internet. But they put in a backdoor so that the attackers as well could send whatever executable they want.
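[Illustrative note: below is a minimal sketch, using the PyPI "cryptography" package, of the kind of signature check an update mechanism performs before installing a new executable; all names in it are hypothetical. The SolarWinds case shows the limit of this check: because the attackers had subverted the vendor's own build system, their code shipped inside legitimately produced and signed updates, so a check like this would still pass.]

```python
# Sketch of verifying a signed update before installing it (Ed25519 via the
# "cryptography" package). Signature checks only prove the file came from the
# holder of the signing key; they cannot tell whether the vendor's build
# pipeline was itself compromised before signing.
from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives.asymmetric import ed25519

# Vendor side: generate a signing key and sign the update payload.
vendor_private_key = ed25519.Ed25519PrivateKey.generate()
vendor_public_key = vendor_private_key.public_key()   # shipped with the client
update_payload = b"new-executable-bytes"               # hypothetical update blob
signature = vendor_private_key.sign(update_payload)

# Client side: only install updates whose signature verifies.
def is_update_trusted(payload: bytes, sig: bytes) -> bool:
    try:
        vendor_public_key.verify(sig, payload)
        return True                                    # valid signature: install
    except InvalidSignature:
        return False                                   # tampered or unsigned: reject

assert is_update_trusted(update_payload, signature)
assert not is_update_trusted(b"tampered-executable-bytes", signature)
```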
So they did this for, as I mentioned before, the 18,000 organisations that downloaded updates from SolarWinds. And then they kind of cherry-picked the organisations where they wanted to use the backdoor to actually install malware.
What that malware did was, first, it kind of lurked silently for about two weeks, and then, seeing that everything was OK, it reached out to command and control servers that could tell it what to do. For example, they told this malware to copy and send various types of sensitive information — including emails, documents, and certificates — and gave it directions for expanding throughout the network beyond the original devices that were compromised. And just to sum this up, after they were caught, it was estimated they had about 14 months of unfettered and undetected access to these organisations.
Maybe one final thing that I think was interesting about this specific attack is that sometimes malware is self-propagating. It automatically tries to get wherever it can and so on. This was not an example of that. Even though they were active in 200 organisations, every single attack was manually controlled by a human on the other side, which tells you something about their interests, and what they were looking for, and how much they were willing to invest in this.
Luisa Rodriguez: Yep, yep. Holy cow. This is insane. Do we know what the outcome was? Like, what Russia was able to learn or do using this malware?
Sella Nevo: So we know which networks were compromised. We had mentioned this was quite a lot of very interesting networks. We know that they did copy and send very large amounts of information. The content of that information we don’t fully know. When it’s on the government side, it often remains confidential or classified. And even with private companies, they often try to downplay the effect. It’s a bit hard to know exactly what was taken, but we’re pretty confident it was a lot.
Luisa Rodriguez: OK. And how was it eventually noticed and counteracted?
Sella Nevo: Eventually, I think it was FireEye that identified in some of their client networks some suspicious activity. The malware originally did a lot to remain undetected. For example, we mentioned that SolarWinds had control over a lot of the security tools in the networks they were deployed in. So they put their own executables and whitelists so that the antivirus and other software won’t detect it. They made their communications look like standard SolarWinds communications, so they kind of imitated the way the application normally works.
But eventually someone did notice suspicious activity looking not exactly the way it should. And then once the investigation started, it started snowballing. Once you have some of what’s called a signature, some identification of, when you see this kind of traffic or when you see this executable, it means something has gone wrong. Suddenly we started finding it everywhere.
Luisa Rodriguez: So my genuine reaction is just like, this is totally insane and unacceptable. Like, it should be impossible for Russian-affiliated actors to send themselves documents from the Pentagon and the Treasury and State Departments.
Sella Nevo: I think it’s pretty bad. I think it’s not that uncommon. So let’s put it this way; let’s first bound from above: when we say they were able to send documents from the Pentagon, it does not mean they were able to send all documents from the Pentagon. They didn’t have access to everywhere and so on.
I don’t mean to overstate how bad this is. Things at this level — I mean, clearly this is a big and famous event — but they’re not that rare or unheard of. Maybe just to give a sense of how common they are: while the SolarWinds hack was happening, China was also at the same time using a vulnerability in SolarWinds to access multiple government agencies. Now, no one talks about that because it was a much more targeted attack. So what’s now called the SolarWinds hack, it was more famous. But this is a bit of a coincidence. I don’t mean to claim that multiple countries are exploiting the same literal software.
Luisa Rodriguez: Right. At any given time.
Sella Nevo: At every given time. But it does give you a sense of how frequent these things are.
Luisa Rodriguez: Yeah. What should I take from this? I think maybe the thing I should take from this is just information security is extremely, extremely hard. In this case, it never would have occurred to me, honestly, that the way that documents would be uncovered and sent abroad would be through a software that itself was accessed through another software breach several links down the chain. And so maybe the key thing here to look at and notice is these technological systems that interact with information that we think of as extremely sensitive are just super complex, super interconnected, and imperfect — and they will always be imperfect, and we can try to make them safer, but this is just an extremely hard problem.
Sella Nevo: Yeah, I think I broadly agree with that. First, just to further strengthen the original point: I completely agree that these systems are enormously complex, and it is very hard to secure all of them.
Just to give another example, there have been other famous attacks using HVAC systems — the heating and ventilation systems of buildings which now connect to the wifi, and so now you can use them. So really, there’s these layers and layers of… Until now, everything we’ve talked about was software, but you can attack firmware and hardware as well. There really are just layers of things that you don’t think about day to day that could be used for this purpose. So in that sense I completely agree.
I think it’s also right that nothing will ever be perfectly safe, but I want to be careful about being overly fatalistic. There is no such thing as perfect security, but there is very much such a thing as better security. So the more you invest in security, the more resources an attacker would need to be able to overcome it. And so you can very significantly reduce the number of organisations who could attack you, or the likelihood that they will succeed.
Luisa Rodriguez: Yeah, that makes sense. And it is somewhat reassuring. In the case of SolarWinds, do you have a take on whether the thing that went wrong was like, there was lots of good security, but we can’t stop every attack because there are just so many ways in? Versus this would have been prevented with better security, and so it’s one of those examples we should learn from? As like, we shouldn’t be fatalistic: this was preventable and there were ways around it.
Sella Nevo: I think there’s definitely things to be learned from this hack, and I think the structure of the information security world is such that every specific attack is preventable. The challenge is the size of the attack surface: there are so many different things one can do. So once we’ve seen an attack, at the very least we should prevent similar attacks in the future. You then want to try and generalise from that as much as possible, so that it’s not dependent on the specific vulnerability they found, but what could we do to make our systems more robust in general? And recognising that no system will then be perfect and can never be breached again.
Luisa Rodriguez: OK, cool. That makes sense. Do you think there are other kinds of general lessons we should take from it?
Sella Nevo: One obvious lesson is that supply chain attacks are a serious, serious problem, and you need to make sure that the security you apply to your own organisation is applied to all other organisations that get any access to your systems. And that came up not only in SolarWinds, but also, for example, Microsoft resellers. I think that’s a really important one.
Then a second one is maybe the importance of what’s called defence in depth. Since we know that all defences can potentially fail, it can be really effective to stack them up and say, “Even if you breach my first line of defence, I have a second line of defence. Even if you breach my second line of defence, I have a third one.” That way, if each one has a certain probability of failing, you kind of exponentially decrease the chance of an overall success by having that.
So I think in every step of the way here, it seemed that the attackers needed to do nontrivial things to get there, but I think more defence in depth would have been very, very useful.
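[A toy illustration of the defence-in-depth arithmetic described above, assuming, optimistically, that the layers fail independently:]

```python
# Probability an attacker gets past every layer, if layer failures are independent.
def breach_probability(layer_failure_probs: list[float]) -> float:
    p = 1.0
    for layer_p in layer_failure_probs:
        p *= layer_p              # the attacker has to beat each layer in turn
    return p

print(breach_probability([0.1]))            # one layer:    0.1
print(breach_probability([0.1, 0.1, 0.1]))  # three layers: 0.001
```

[In practice layers are rarely fully independent, so the real-world reduction is smaller than this, but stacking defences still helps.]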
AI models being leaked [00:25:52]
Luisa Rodriguez: OK, nice. Bringing it back to AI, there’s already been at least one case in which a capable model was irreversibly leaked. Can you talk me through what happened there?
Sella Nevo: Yeah, so there are a couple of cases. These cases are interesting, but also not that interesting.
Luisa Rodriguez: OK.
Sella Nevo: The first is the original Llama that Meta developed in 2023. This was kind of a frontier, or close to frontier model, and it was leaked, and that made a lot of news. But it is worth noting that Facebook was kind of giving it to pretty much anyone who asked nicely, and they planned on open sourcing it anyway. So it was not their intention to open source it at that exact point in time, but clearly they were not treating this as a critical thing to secure.
Luisa Rodriguez: Right. I remember hearing about that one and being like, this doesn’t seem like a very big deal.
Sella Nevo: Yeah. I mean, it was interesting. It was an interesting experiment. We learned a lot of interesting things from it. For example, that fine-tuned guardrails don’t work.
Luisa Rodriguez: Right.
Sella Nevo: But from a security standpoint, this was not a major event.
Luisa Rodriguez: Yep. And just to clarify, fine-tuned guardrails are the kind of “rules” trained into a model, after the base model has been fully trained and works well, that make sure the model doesn’t say offensive things or give detailed instructions for how to make a bomb. So we learned that they don’t work in the sense that, if these guardrails are just kind of added on top of the trained model, people can easily get around them.
But yeah, was the Llama model leaked just in the sense that a single person who had the model weights just published them because they could and wanted to?
Sella Nevo: I think that’s pretty much it. That’s my understanding.
Luisa Rodriguez: OK. And then there was another one?
Sella Nevo: Yeah. The second one is much more recent: Mistral’s Miqu model was leaked in 2024. Mistral is a French startup and a strong supporter of open source, and what they do is open source their smaller models, but then have a paid API, similar to the larger companies, for their larger models. And Miqu is pretty similar to what they call Mistral Medium — so not their most capable model, but more capable than their open source models.
They didn’t necessarily intend to open source it, but they were also pretty lax with how they were treating it. The CEO at least claimed that they’d been distributing it quite openly to their clients, and the way he described it was that an “overenthusiastic employee” of one of their clients uploaded it anonymously to Hugging Face, which is where a lot of these models are uploaded; it’s kind of like this open source model website. And the CEO kind of tongue-in-cheek suggested a change to the open code base and said, “You might consider attribution.” So clearly they didn’t see this as a catastrophic event for them. But yeah, I think those are the two cases that we have right now. As I mentioned, I think they have interesting implications for the AI and the open source AI community in general. But I don’t think they’re the kind of thing that I’m concerned about, or what the report is focused on.
Luisa Rodriguez: Yeah, it sounds like they’re just cases where someone had access to the model weights that weren’t being super well protected, and so therefore there were many people with access to the model weights, and was just like, “I want to put these online, and I don’t think there will be super big repercussions for me.” Whereas most of what your report is dealing with is cases where all of the employees and leadership of the companies building these frontier models very much do not want the model weights of their most capable models to be open sourced or published, basically. So this would be a case where, against their will, some highly resourced actor is trying to get access to them.
Sella Nevo: Yeah, that’s right. Maybe except for the word “all.” You can never know that all of your employees are fully aligned with securing models, and that’s a big part of the problem. But conceptually, yes.
Luisa Rodriguez: Yep, yep. Great. And I think we’ll probably come back to that kind of issue in a bit.
Researching for the RAND report [00:30:11]
Luisa Rodriguez: So you coauthored — I actually think you are the primary author on — this report from RAND on how to secure model weights for frontier models. And you outlined 38 different types of attacks, and then developed a set of security levels with a set of security measures that are required to prevent attacks of different kinds of levels of sophistication.
So for context, as I was reading this report, I just had the sense of, wow, it seems like you really had to basically know all of the different ways that a motivated actor — and there are a bunch of different actors you’re talking about — could try to steal the model weights of frontier models. How did you go about researching and writing the report?
Sella Nevo: Yeah, it was a very interesting process. We worked on it for more than a year. We had, by the way, a great team of coauthors — most of whom are at RAND, but this is also a collaboration with Pattern Labs. And indeed, our team didn’t have remotely all of the information that one would need to write such a report. So one thing we did is we had multiple interviews and workshops with about 30 experts: people in the AI labs themselves, other cybersecurity experts, a bunch of national security experts in multiple countries, and folks working both in offensive and defensive cyber. And we tried to aggregate a lot of their insights.
And then secondly, we did a review of a wide range of sources. So we cite several hundred different sources, including academic papers, governmental reports, and other things we could find online. So the paper tries to kind of cram all of that into about 150 pages to try and get that across somehow.
Luisa Rodriguez: Yep. Many, many pages of which are basically outlining these attack vectors. I really was just kind of blown away by all of the different ways or all the different mechanisms someone could try to exploit.
Who tries to steal model weights? [00:32:21]
Luisa Rodriguez: But before we talk about those, who exactly are we talking about? What kinds of actors could, in theory, try to steal model weights?
Sella Nevo: Yeah, as you noted, indeed there’s just a huge variety of different actors that differ in what capabilities they have and how much resourcing they have and so on. What we wanted to do in this report is lend some nuance to the discussion — because often people talk about whether a system in general, or AI model weights in particular, are secure or insecure, and whether they can be secured at all. And as we already mentioned, nothing is ever perfectly secure.
So we wanted to help understand, for different levels of resourcing and capabilities, what is needed to make a system secure against that specific category of actors. To enable that, we defined five categories — we call them “operational capacity” categories — that try to describe, in rough terms, who we’re talking about.
So the first category we call OC1, or amateur attempts, often is a single person. They’re going to invest several days on this. Maybe they have up to $1,000 to spend on trying to get this attack to work. This might be hobbyist hackers or maybe it’s more experienced hackers, but they’re using what’s called a “spray and pray” attack: just try it on a bunch of websites and see if it works. This might also be these 15-year-old script kiddies that just download scripts on the internet and do what they want.
Maybe just a small anecdote: in 1999, a 15-year-old kid actually hacked the Department of Defense.
Luisa Rodriguez: Oh, god.
Sella Nevo: But that was 1999. The world is better today. It would be quite unlikely for a 15-year-old to, on their own in their free time, hack the Department of Defense.
Luisa Rodriguez: Jesus. OK, so that’s amateur attackers. And are they basically just trying to make money? Is that the primary goal they have?
Sella Nevo: Yeah, I think for all the lower levels, usually the motivation is financial. Or actually, either financial or for the lols, because it’s cool or something. That’s actually quite a surprisingly common motivator.
Luisa Rodriguez: Who are the other actors we’re talking about?
Sella Nevo: So the second category, OC2, we titled “professional opportunistic efforts.” This is still something you’d think of as maybe a single person, but one who’s more capable. Maybe they’re willing to invest several weeks and up to $10,000. Often this is what an attack from a professional individual hacker looks like, or from capable hacker groups — so more serious organisations, but ones executing untargeted or lower-priority attacks. They’re not going all in; they’re just trying to get many people at once.
Running forward, we have OC3. We call this “cybercrime syndicates and insider threats.” Really there’s two threat models here. One is major cyber organisations. Think of organisations like Anonymous or large Russian cybercrime groups that steal IP or install ransomware. This will often have dozens of individuals, and up to a million dollars invested in an attack.
And this is the first time we start seeing what are called APTs, or advanced persistent threats: they get into your network; they invest in not being caught; and, over time, they subvert more and more of your systems. But these are still not the most capable actors. There’s still quite a way to go.
Then the other group that we put together here is insider threats. You can imagine a researcher in a lab that decides they want to steal the model. They might be less capable from an information security standpoint, or have less resources, but they start off with a lot more access, which is very important.
Luisa Rodriguez: So the reason to group them together is because they pose similar levels of threat?
Sella Nevo: Yes, that’s a very good question. They are indeed very different. They have different capabilities. The reason we decided to group them together is that there are many overlapping defences that would be useful. One of the things that we expect at OC3 is for them to start having zero-days. Maybe I should explain what zero-days are: zero-days are vulnerabilities that the attacker has discovered, but that neither the defender nor anyone else in the world knows about yet.
In both of these cases, you expect them to be able to overcome your first lines of defence, whether because they’re already inside them or because they have zero-days or other opportunities. So a lot of the defences are overlapping, like defence in depth that we mentioned, and the costs to defend against them are kind of a comparable order of magnitude. So yeah, they’re still different, but it’s a somewhat similar level of investment to defend against them.
Moving on to OC4, which we title "standard operations by leading cyber-capable institutions." So these are operations by incredibly capable institutions, mostly nation-states. They will have hundreds of people working on a single operation. Their efforts are not limited to cybersecurity, but also include human intelligence, physical operations, and more. They might have a budget of up to $10 million for a single operation. They might have other kinds of unique capabilities: they might have legal cover — they can commit crimes with impunity — or they might have infrastructure to intercept communications across society, like access to the internet backbone and things like that.
For most countries, OC4 is the height of what they can do. But for some extremely capable nations, this is their day to day. So they can do 100 of these every single year. And when people talk about state-sponsored operations, they usually mean something like this — or at least at the higher end of what we tend to see, usually something like this.
For example, both Russia and China have many groups that aim to achieve a certain goal, like take out the electrical grid in Ukraine or try to steal commercial secrets. They’re important to them, but it’s not that terrible if they get caught. The goal is the thing that they’re trying to achieve. So these groups tend to be considered very capable in industry cybersecurity standards, but from the nation-state’s point of view, or at least a particularly capable nation state, this is the quantity-over-quality version of their operations.
Luisa Rodriguez: OK, got it.
Sella Nevo: Finally, we have a group OC5, which we call “top-priority operations by the top cyber-capable institutions.” These are the apex of the cybersecurity world. They might have thousands of people working on an operation, might have a billion dollars invested in it. There might be years or even decades of a head start over what is known externally, let’s say in academia. These are the few top-priority operations of the world’s most capable nation-states. Maybe an analogy to classic military: if the previous operations are the country’s army, this would be their special operations units.
Luisa Rodriguez: OK, so it sounds like these really range in how sophisticated they are. It also sounds like they’ve probably got a range of motivations: financial motivations for some, lols for others, and then also I guess just stuff related to national security and conflict and politics.
How confident can we even be about what these actors can do and what they have done in the past, if that makes sense?
Sella Nevo: Yeah. So I think we can be very confident for the lower capacity levels. When you talk about everything in OC1 and OC2 — all the more amateur or opportunistic stuff — we see tens of thousands of examples of this every year. We know that at least capable companies can usually detect them and stop them. So we have a very good sense of what they're up to and what they can do. There are annual reports tracking the trends — what's increasing, what's decreasing — so we have a good understanding of that.
Once we get to the third level, starting to touch on APTs, we know less. The numbers we can detect are smaller, but there are still enough to have some decent understanding. Once you get to the higher levels, there is huge uncertainty.
And when we talk to experts, there’s also just a huge variance in opinions: all the way from, at the extremes, some experts that we heard claim that even if you just use standard industry best practices, then you’re protected against even the most capable state operations; all the way to folks who said it does not matter, and there is nothing physically possible that will stop these organisations.
I think the first extreme — that standard industry practices are enough — is pretty clearly disqualified by the evidence. I think we have enough examples already to say that that's not true. And I think the other extreme also slightly overstates it: there are pieces of information that we know historically countries wanted to get to and didn't. But in between, there's still huge uncertainty on exactly what is needed, and how feasible it is in practice for someone who isn't doing everything physically possible, but also needs to run an organisation, for example.
Malicious code and exploiting zero-days [00:42:06]
Luisa Rodriguez: Let’s talk through some of the potential “attack vectors” — which basically just mean the kinds of ways that these actors could get access to model weights. If you had to name just one, which attack vector worries you most?
Sella Nevo: That’s a hard question. It’s hard to choose one, because as you noted before, there are many. Maybe before we dive into one, it’s worth noting security is roughly a weakest link kind of game, which means that if you get a lot of things right, but then there’s a type of attack you’re not protected against, an attacker can use this attack. So when we say “important,” it’s kind of a bit hard to define — because the one you’re currently least defended against is the most important to improve on.
Luisa Rodriguez: Yeah, that makes sense.
Sella Nevo: But let’s start with maybe a classic kind of category, which is just using vulnerabilities to run malicious code in a network and try to exfiltrate the weights. So what does this roughly look like? Of course, in reality it looks like many different things, but an attacker could, let’s say, send a message to an employee or have him visit some website, then have that website or that message run code on that employee’s computer, then hop over to the company’s servers. So from his computer to other servers, reach where the weights are stored and then copy them and send them out to the attacker.
Now, obviously they’re not supposed to be able to do that. We have various defences that are meant to prevent that. For example, when you go on a website or read an email, the software that interacts with it — let’s say your browser or your email client — is supposed to interact with it only in specific ways; it’s not supposed to let it run whatever code it wants on your computer. But if that attacker, for example, has access to zero-days — these are, as we mentioned, a vulnerability that you and the developer of let’s say your browser doesn’t know of — then maybe they can run code on your computer even though they’re not supposed to.
Just to get a sense, we talked before about how common breaches are. Zero-days are found every day in hundreds of products. Now, for an individual it’s pretty hard to be the one to find it, because the legitimate researchers do what’s called “responsible disclosure”: they first tell the developer, who’s supposed to fix it — and only after the developer fixes it do they tell the rest of the world, so that other people can’t exploit it. So if you’re a bad actor that wants to exploit it, you need to be the first to find it. Or at least early enough to find it so that it’s not yet fixed.
So talented-enough hackers do find zero-days and can exploit them, but it's hard. It's not something that anyone can do, but large organisations can have dozens of researchers looking for these zero-days, and that increases their chances.
Often we divide this into two types: you can try to find general vulnerabilities in software that we all use — for example, operating systems like Windows, Linux, and macOS; antiviruses; browsers; things like that — or you can specifically research which products a particular company or organisation uses, and then search for vulnerabilities in those. That's less generalisable, but if you care about that organisation specifically, there's a higher chance of finding a vulnerability in some niche product that maybe they use.
Luisa Rodriguez: That all makes lots of sense, and there are obviously a bunch more steps in the thing you’re about to describe. But I’m really curious: are all zero-days similarly valuable? Or are some like you finding an insignificant typo, and others really change the meaning or something?
Sella Nevo: So there’s a whole scoring system for the severity of zero-days. You’re absolutely right that it depends, and the severity depends both on what it allows you to do — for example, running malicious code of your choice is one of the worst things that it can do; maybe it only lets you edit a file you’re not supposed to edit, or maybe it lets you read something you’re not supposed to know, or change a specific configuration — and also what are the prerequisites for doing so? For example, a vulnerability that you already have to be logged in and in the network to use is maybe less severe than one where anyone can just interact with someone on the internet and start running code on them.
So there’s a whole severity system. Almost all the vulnerabilities that I talk about are either severity 10 out of 10, or close to that. Even those, quite a few are found pretty frequently.
Luisa Rodriguez: OK. Oh god. So yeah, you were saying that basically sophisticated, talented researchers can find these, though it’s hard to find them before the responsible researchers find them. From there, if an irresponsible or a malicious person finds the zero-day, it can run some code that’s like, let me into this other machine on the same network. And then from there, maybe it has access to the folders where the things are stored, where the model weights are stored, and then somehow it’s able to send those out without being detected. Am I understanding this vector?
Sella Nevo: Yeah, I think that’s broadly right. It indeed can be confusing. There’s a lot of different components to a cyberattack: there’s getting into the network, there’s what’s called lateral movement — so moving within the network. Maybe the weights are protected in various ways: maybe they’re encrypted, maybe they’re inside a device. You would need to overcome those kinds of defences. There’s other things to make sure that you’re undetected. All of these can use vulnerabilities to achieve all of those goals. And all of those vulnerabilities would be called zero-days if you’re the first one to find them, if they have not yet been publicly reported.
Luisa Rodriguez: OK, so why are you particularly worried about this kind of attack?
Sella Nevo: So first of all, I’m particularly worried about many attacks. But there’s a few things that are interesting here.
One is it’s just incredibly common. We know this happens all the time. We know it’s the bread and butter of information security attacks.
Secondly, machine learning infrastructure in particular is shockingly insecure — more so than other types of software infrastructure. There are two things I think driving this. One is just the fact that the industry is advancing at such a rapid pace, and everyone is rushing to go to market. And this is true from the hardware level to the software level.
GPU firmware is usually unaudited, which is not true for a lot of other firmware. The software infrastructure that people use for training and for monitoring their training runs has these enormous, sprawling dependencies. We mentioned supply chain attacks before. There have already been vulnerabilities introduced into these systems. Some of this infrastructure even says in its documentation, "This is not meant to be used in a secure environment" — but it's key infrastructure that is used in all machine learning systems. So really the situation with machine learning infrastructure is particularly bad, and we're quite far from even reaching the kind of standard practice for software systems, which in and of itself is not that great.
Luisa Rodriguez: Yep. OK.
Sella Nevo: There’s another thing that I’m worried about, and I think more commercial companies should be worried about — again, as this moves on to not just being worried about cybercriminals, but about nation-states — which is that if you’re a nation-state, you can cheat in how you get zero-days.
For example, China has a convenient set of regulations titled "Regulations on the Management of Network Product Security Vulnerabilities." Broadly, what that regulation says is that any researcher or any organisation that has any footprint in China is required to report any vulnerabilities they find to the government — which we know then hands them off to its offensive cyber organisations.
And simultaneously, they impose severe penalties if you share that information with almost any other organisation. So you should not be surprised that China then has dozens or hundreds — I don't know exactly how many — zero-days of their own.
And then maybe another way that you can get huge amounts of zero-days is, even if you’re not a state, but you’re a sufficiently capable actor — maybe you’re in the OC4 level or OC5 — you can hack into the channels through which zero-days are reported: if other people find vulnerabilities, they have to report to the company somehow. There’s a lot of different infrastructure in place for that. If you can get access to that infrastructure, you get a continuous stream of all new zero-days that anyone can find.
This is a big challenge, because we mentioned defence in depth before. The classic conception of defence in depth is really useful if someone has one or two zero-days: you know, I have now three layers of defence and so we’re good. But if someone has 50 or 100, if you expect them to have zero-days on a lot of the products and hardware that you use, that is a very big challenge to defend against.
Maybe finally, it’s worth mentioning something that I think is even more important and extreme than just having access to a lot of zero-days: historically, intelligence agencies were decades ahead of the world in even identifying conceptually new attacks. And I’ll give maybe one example of this.
Luisa Rodriguez: Yeah, great.
Sella Nevo: Folks who are very familiar with information security and specifically cryptography might be familiar with this. There’s an attack called differential cryptanalysis. That’s an attack on encryption systems if you want to decrypt something, or cryptographic systems in general. It was nominally discovered in the late ’80s, and it’s incredibly powerful: the vast majority of encryption methods that existed before it are broken by this method. And to make an encryption resistant to differential cryptanalysis, you really need to know about this specific attack, and then kind of fine-tune all of your parameters to really make sure that that can’t be done.
Now, several years after it was discovered in the '80s, folks at IBM published that they had known about it since about the mid-'70s. When they discovered it, they discussed it with the NSA. And the NSA, according to IBM, had already known about it, and apparently persuaded IBM to keep it a secret.
So everywhere between 1975 — and we have to guess how many years before that — and roughly 1989, we know that they could pretty much undermine almost any encryption. This is one of multiple examples, and we always discover them only many years afterwards. So if you want to defend against nation-states, you have to take into account that it's not just that they might have a specific vulnerability in a specific product — they might be undermining complete classes of ways you could defend yourself.
Luisa Rodriguez: Right. Systematically. And they might be able to keep that fact a secret for decades. That is really shocking.
Human insiders [00:53:20]
Luisa Rodriguez: OK. What’s another attack vector that worries you?
Sella Nevo: Maybe just to think about a very different way in which malicious organisations might be able to reach information is human intelligence collection: broadly, getting other people to do things for you. And there’s a wide range of ways that organisations can do that.
One classic way is the carrot, which is either bribes or appealing to motivation. I think the way organisations that are not familiar with the space sometimes imagine it is that someone comes up to you in a dark alley and says, "I will pay you x million dollars to betray all of your values." And then you think deeply about whether you're willing to betray all your values. And if your employees are sufficiently good and trustworthy people, then they won't.
Luisa Rodriguez: Right. And probably lots of employers, when they think about that, are like, “No, my employees are super value-aligned. They care about this work. We’re not going to be that vulnerable to that,” is my guess.
Sella Nevo: Exactly. Which even that is, I think, sometimes overly assumed to be true.
Luisa Rodriguez: Yeah, sure.
Sella Nevo: But in practice, I think that organisations that do human intelligence collection are smarter than that. What they will often do is build a story that aligns with the person’s already existing ideology. So the person will believe that they’re doing the right thing, or at least can kind of conveniently convince themselves that they’re doing the right thing, while also getting whatever benefits this person will give them.
For example, maybe there’s an employee that believes in democratisation of AI: that it has to serve the public and be free to be used by everyone. They can craft a story of how they’re helping an organisation that actually, that’s what they’re trying to achieve. Or maybe they believe that the organisation they’re in should be more transparent about exactly what they’re doing, what the model capabilities are and so on.
Maybe they’re talking to a journalist — who will, of course, not release the model or not abuse it, but they will be able to report on the capabilities of the model and actually help improve the safety and security of AI models.
Or maybe even they believe that AI progress should be going slower, because AI is very dangerous, so the only way is to make sure that the companies don’t have a financial incentive, or…
There’s an endless list of excuses you can give for why you’d actually share something like that. And that really, I think, changes the balance of how easy it is to get people to do something.
Luisa Rodriguez: Totally. And to be clear, we’re still basically talking about cases where a malicious actor probably is trying to steal model weights — so they’re pretending to be value-aligned with the person.
Sella Nevo: Exactly.
Luisa Rodriguez: That’s really unnerving. Are all of the cases of human intelligence kind of like that?
Sella Nevo: Not necessarily. Sometimes they just find people who are already aligned with the goal of the organisation. Maybe a good example of this is Ana Montes, which is a famous case. She was a Cuba analyst working in the US intelligence community, but it turned out she actually worked for Cuban intelligence themselves. Originally, she had a clerical job at the Department of Justice. She spoke openly against US policies in Central America. She just genuinely disagreed with US policy there. Cuban intelligence saw that, contacted her, and she agreed to help. There was no need to lie; she was genuinely aligned with their goal. And she ended up passing information from the US intelligence community to Cuban intelligence all the way from 1985 to 2001, including the identities of undercover intelligence officers placed in Cuba. So that was a pretty big deal.
Luisa Rodriguez: This does feel like the kind of thing that, one, only happens in movies, and when it doesn’t just happen in movies, is an exception and like a thing of the past. Is there reason to think that this is still the kind of thing that could happen readily?
Sella Nevo: Yeah, I think it does happen — pretty frequently, even. I think we usually don't hear about it. But even just taking public examples, in just the past few months two Navy officers were prosecuted for passing secret information to China.
Luisa Rodriguez: Do you know what their motivations were?
Sella Nevo: It’s much earlier in the process, so we’re less confident. But one of them, at least, has pleaded guilty, and it looks like it was mainly because they were paid to do so.
Luisa Rodriguez: OK. So bribery, value alignment. Maybe value alignment plus bribery. Are there other things in this category?
Sella Nevo: Yeah. Until now, we’ve talked about, let’s say, the carrot side of things in human intelligence. There’s the even darker version, which is more in the direction of extortion. Some countries will jail or even torture family members of people unless they cooperate. And I think that the vast majority of employees will prefer to just do what they’re being asked rather than have to face that. Russia, for example, has been accused of forcing Ukrainian intelligence officers to spy for them by threatening to kill their family.
Luisa Rodriguez: Oh god. That’s really awful.
Sella Nevo: Here as well, there are tricks that are used to make this even more likely to work. Again, the way in which it’s done is not necessarily this kind of like all-or-nothing, boom: you need to decide whether you’re giving up your values — but rather, it often starts with a more positive interaction. This is sometimes called “grooming,” which is distinct from other types of grooming. The idea is that you start with them believing in the cause; you slowly erode boundaries; you get them to do things that they’re not supposed to do, but no one could imagine would be a big deal. Once they’ve done that, you kind of take it step by step.
And at some point, whenever they start actually resisting, you now have all of the illegitimate things they have already done that you can report. You could, for example, now reveal that you are a foreign intelligence agency — so all those earlier things appear in a very different light, and if they were reported to the company or to the FBI or something like that, it would look very bad. So now you have things to extort them with, even if you didn't at the beginning of the process.
This is, by the way, a good reason to be cautious and not be too flexible with your boundaries. Because even if something is not a big deal initially, it can be turned into a big deal at a later stage.
Luisa Rodriguez: Yep, yep, yep. OK, so I guess that’s the stick. Are there other things in this category, or are those kind of the key things?
Sella Nevo: Yeah. So you could go a whole different route: instead of taking an existing employee and giving him either carrots or sticks, you could send in your own candidate. This is more limited to well-resourced organisations, but you could potentially just train exceptionally good candidates and then send them to the company. And that way you have someone who, from day one, is completely loyal to the attacker. There are not that many organisations that would do this, and it definitely requires a lot of resources, but these are things that can be done.
Luisa Rodriguez: Do we have a sense of how common that version of this is?
Sella Nevo: Not really. I think that in general in the human intelligence world, as opposed to cybersecurity, there isn’t as strong of an academic community, so a lot less happens out in the open, and a lot less is identified and discussed. So across all of these methodologies, we have a lot less clarity, and this one is more at the extreme. So yeah, it would be really hard to say.
Luisa Rodriguez: Are there any kind of interesting examples of this happening in real life? Again, I am just like, “Surely not. Surely this is in films only.”
Sella Nevo: So these cases are rare, and both sides tend not to want to talk about them, so the cases we do know of are few and far between. But maybe one interesting one — this is quite an old example — is Harold "Kim" Philby, a senior MI6 officer in the UK who was actually recruited by the Soviets while he was still a university student at Cambridge, and then directed to work his way into the most sensitive parts of the UK government. So that's maybe a good example of that.
Luisa Rodriguez: Yeah, that’s wild. And this seems just incredibly hard to defend against. It seems like it’s a person issue, and people issues feel way harder to me than things like “notice and patch zero-days or backdoors.”
Sella Nevo: Yeah, it is a challenge. I mean, we’ll talk later about solutions in general, but just to quickly say, I think there’s a few things you want to do here. One is, it’s a numbers game, right? No organisation can get to every person. So those are reasons to maybe limit the number of people who are authorised, just to reduce the risk.
A second is cultural. So these are solutions that aren’t as clean-cut as technical solutions, but there’s a lot of evidence that cultural interventions, and making sure that people are aware, know what to look out for, and know what to report are very effective. By the way, not only because the person actually being compromised might say no or might report it, but because the people around them might identify it. And there’s a lot of evidence that usually some people suspect it, so the question of whether they report it when they suspect it is really critical to whether these things are found out in time.
And then finally, I’ll say there are technical solutions. They won’t change what people’s choices are necessarily, but they can change what they have access to. For example, if we’re talking about the weights, if you have less people accessing the weights, but potentially you might have situations where no single person — or even no small group of people — could fully access the weights in the way that would allow exfiltrating them, that can be a very important component.
Luisa Rodriguez: Yeah, nice. I find that one especially reassuring. It does seem like if you just don’t let any one person have all of the relevant information, that would be really helpful.
Side-channel attacks [01:04:11]
Luisa Rodriguez: OK, another example of this, which sounds also really wild to me, is through side-channel attacks. Can you explain what those are?
Sella Nevo: Yeah, that’s a pretty fun attack vector. Maybe just to set the stage: for centuries, when people tried to understand communication systems and how they can undermine their defences — for example, encryption; you want to know what some encrypted information is — they looked at the inputs. There’s some text that we want to encrypt, and here’s the outputs. Here’s the encrypted text.
At some point, someone figured out, what about all other aspects of the system? What about the system’s temperature? What about its electricity usage? What about the noise it makes as it runs? What about the time that it takes to complete the actual encryption? And we tend to think of computation as abstract. That’s what digital systems are meant to do: to abstract out. But there’s always some physical mechanism that actually is running that computation — and physical mechanisms have physical effects on the world.
So it turns out that all these physical effects are super informative. As I mentioned, there's a lot you can do with temperature and electricity and things like that, but let me give a very simple example. RSA is a famous type of encryption. As part of its operation, it takes one number to the power of another number. Pretty simple — I'm not trying to say something too exotic here. An efficient way of calculating that relies primarily on multiplication and squaring. For whatever reason, that's an efficient way of doing it.
It turns out that the operation to multiply two numbers and the operation to square a number use different amounts of electricity. So if you track the electricity usage over time, you can literally identify exactly what numbers it's working with, and break the encryption within minutes. That's an example of a side-channel attack.
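To make the square-and-multiply point concrete, here is a minimal illustrative sketch (not real RSA code) of the exponentiation routine being described — the multiply step only runs for exponent bits that are 1, which is exactly the data-dependent behaviour a power trace can pick up:

```python
def modexp_square_and_multiply(base: int, exponent: int, modulus: int) -> int:
    """Left-to-right binary exponentiation: the classic square-and-multiply."""
    result = 1
    for bit in bin(exponent)[2:]:              # walk the exponent's bits, most significant first
        result = (result * result) % modulus   # always: square
        if bit == "1":
            result = (result * base) % modulus # only on 1-bits: multiply
            # This extra multiplication draws a different amount of power, so a
            # power trace reveals the secret exponent (the RSA private key) bit by bit.
    return result

assert modexp_square_and_multiply(7, 123, 1009) == pow(7, 123, 1009)  # sanity check
```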
This is a bit of an old one. It's been known for — I don't remember the exact time, but well over a decade. To give a more modern one: just a year ago, there was a new paper showing that you can run malware on a cell phone. Cell phones often are not our work devices; we think of them as our personal devices. But through the cell phone's microphone, you can listen to someone typing in their password and identify what that password is, because different keys sound slightly different when tapped.
Luisa Rodriguez: And do you mean physical keys on a slightly older cell phone, or do you mean like my iPhone, which has digital keys on a touchscreen?
Sella Nevo: Oh sorry, just to clarify: I actually don’t mean them typing their password on their phone. I mean that their phone is on the desk. They are typing their password into their work computer. You identify what the password is on the work computer.
Luisa Rodriguez: Holy crap. That is wild.
Sella Nevo: Let me just add that I think that one recent concern is, as cloud computing is becoming more common, side-channel attacks are a huge problem. Often when you run something on the cloud, you share a server with others, therefore you’re sharing certain resources. A lot of people can directly or indirectly see actually how much processing your application is using and things like that. And if they’re smart, they can infer information in your applications. So that is a huge area of potential information leaks in cloud computing.
Luisa Rodriguez: That’s even crazier than I expected. I read a few examples that were like, through the sounds of the humming of a machine, you can infer something. And I was like, surely you can’t infer that much, but you absolutely can.
So I think Tempest is a famous example [of a side-channel attack] from World War II. Can you explain what happened there?
Sella Nevo: Yeah. So different people use this term slightly differently. I think that usually when people say Tempest, they’re thinking about a side-channel attack through electromagnetic radiation. Nowadays, it’s actually a standard to protect against side-channel attacks from electromagnetic radiation. But we’ll focus on the attack at this stage.
This is another very interesting example of national security agencies discovering something many decades before the rest of the world. So the first non-classified discussion of any kinds of side-channel attacks was in 1985. It was [research by] van Eck, which is why this is called Van Eck phreaking sometimes in academic discussions.
But as we just noted, actually, multiple countries were aware of these kinds of side-channel attacks, even around World War II. So that’s four decades where governments were able to just kind of collect this information with impunity. You could just put recording [devices] in the room next to it, and use lasers — which they did through windows — to be able to collect various types of information. So I would say Tempest is just the use of electromagnetic radiation. It’s a bit maybe less intuitive in our day-to-day experience, but the logic there is the exact same thing you would use with sound or any other type of thing. But it’s just a wonderful example of how there could be things in just completely different dimensions you don’t think of, that the government has already been collecting for decades.
Luisa Rodriguez: Decades. Yep. That’s insane. OK. There are many more that we haven’t talked about, and they are all really interesting, and things that I would have strongly bet against being real things. Like if I’d seen them in a spy movie, I would have been like, “Ha ha. That’s a fun, made-up way to get into someone’s computer.” And if anything, they’re actually weirder than the things I’ve seen in spy movies. They’re more surprising.
Sella Nevo: Maybe before we move on, just kind of what I think is the bottom line of all of these different attack vectors. Computational systems have literally millions of physical and conceptual components, and around 98% of them are embedded into your infrastructure without you ever having heard of them. And an inordinate amount of them can lead to a catastrophic failure of your security assumptions. And because of this, the Iranian secret nuclear programme failed to prevent a breach, most US agencies failed to prevent multiple breaches, most US national security agencies failed to prevent breaches. So ensuring your system is truly secure against highly resourced and dedicated attackers is really, really hard.
Luisa Rodriguez: Yeah, I’m completely convinced.
Getting access to air-gapped networks [01:10:52]
Luisa Rodriguez: Another category involves getting access to data or networks. Wiretapping is one example of this. Another example of this is getting digital access to air-gapped networks, which are networks that are physically isolated from other unsecured networks — so, like the public internet is an unsecured network. How is it possible to get access to air-gapped networks?
Sella Nevo: That’s a great question. Let’s imagine that indeed, you care about your security a lot. As we just discussed, you know there’s a lot of zero-days out there. So you’re like, “If my computer is connected to the internet somehow, I just don’t trust it. Whatever I do, someone might be able to overcome it.” Just as you described, you set up your air-gapped network. It’s not connected to anything: air-gapped is called air-gapped because there’s literally air between your network and everything else.
There are quite a few different ways to still get into an air-gapped network. The first thing maybe that’s worth noting is that just because you’re not connected by a network doesn’t mean your system doesn’t interact with the rest of the world in other ways. So usually, if you’re not connected to the internet via a network connection, let’s say an ethernet connection, you still need to communicate in some other ways. How do you get your security updates? In the AI context, how do you bring in training data? How do you take out the models that you’ve trained after you’ve trained them in such a secure system? Often the way you do that is through USB sticks. That’s the most common way that people interact with air-gapped networks.
So what an attacker could do is run malware that copies itself from computer to computer. It originally infects a computer on the outside that is connected to the internet — maybe belonging to one of the engineers who works with the system. That malware copies itself onto the USB stick, and once you plug the USB stick into the air-gapped network, it copies itself onto a computer there, and then does whatever it wants inside the air-gapped network. Then, the next time a USB stick is plugged in, the malware uses it to send out the information it wants to exfiltrate — for example, the weights.
Luisa Rodriguez: That’s terrible! That seems like it really defeats the point of an air-gapped network. Why? It just seems like a huge flaw. Is the situation better than that?
Sella Nevo: I mean, I’d give it a slightly less harsh review, which is it does make things a lot harder. It is a pain in the ass to write malware that can do all of this. Maybe as opposed to having a continuous connection to the computer, you kind of are limited by these occasional interactions and things like that. So I think it’s definitely better to have an air-gapped network than to have your computer connected to the internet. But I mean, it’s definitely not impenetrable. So, yeah.
Luisa Rodriguez: OK. So it’s still an improvement, and so it’s good and worth doing. But this is a vulnerability. Are there real-world examples of this kind of malware being installed through a USB?
Sella Nevo: Yeah, this has happened many times. One minor thing to say: using a USB stick is not in and of itself an information security vulnerability — because in principle, plugging in a USB stick shouldn't let anyone run code. You should be able to look at the contents of the USB, and code that you never intended to run is not supposed to be able to run. But again, as with everywhere else, there are vulnerabilities.
USB sticks are actually a great candidate for attackers, because USB is a pretty complex protocol. And when you plug one in, sometimes a window will pop up asking what you want to happen — which is always an indication that some things are happening automatically. So that can potentially be abused. Again, in a perfect world, it wouldn't be able to do that. But we have to be sufficiently paranoid, because we've seen many times that it has happened.
So let’s talk about some of those times. Up until now we’ve talked about fully digitally hopping into a network, right? You do it all through this kind of internet, USB into the air-gapped network. Doing that is not easy. Usually it’ll be APTs — advanced persistent threats. Tends to be nation-states even. But we have a bunch of examples. I’ll name a few. Retro is one, USB Stealer, PlugX, Agent.BTZ. Maybe more famous than the other ones is Stuxnet, which did this. But yeah, there’s quite a few examples.
Luisa Rodriguez: Do you mind actually saying a few words about Stuxnet? That is a famous one, but it’s one that I didn’t know enough about.
Sella Nevo: Yeah, Stuxnet is an interesting one. I assume many people know this, but Stuxnet was malware that managed to infiltrate the Iranian nuclear facilities, and then caused their own centrifuges to destroy themselves. It configured the centrifuges to overwork and destroy themselves. And it was pretty impressive: we know of at least four zero-days that it used. It did have the ability to hop through USB sticks. It also had the ability to propagate through network shares, through printer vulnerabilities, and in other ways. So yeah, that's a pretty famous example of a pretty advanced piece of malware.
Luisa Rodriguez: Yeah. I just didn’t realise that Stuxnet used USBs as part of the thing. So that’s really unsettling, because I think of Stuxnet as a really terrifying example of a security breach.
So those are some examples. I find these fascinating. So if people listening heard those shorthand names for cases where this happened before, and you’re curious, I thoroughly recommend going and reading about them. They’re wild.
Sella Nevo: I think another thing that's maybe worth mentioning about USBs is… Let's put highly secure air-gapped networks aside for a moment and just talk about getting a USB to connect to a network. It's worth flagging that this is a really easy thing to do. One thing that people will do — and this is not just nation-states and whatnot; this is random hackers who want to do things for the fun of it — is just drop a bunch of USB sticks in the parking lot of an organisation, and someone will inevitably be naive enough to think, "Oh no, someone has dropped this. Let's plug it in and see who it belongs to."
And you're done. Now you're in, and you can spread through the internal network. This happens all the time. It's happened multiple times at multiple nuclear sites in the United States.
So yeah, this is a pretty big deal.
Luisa Rodriguez: That’s unreal.
Sella Nevo: Now, I think that many people, like you, will find that surprising. I think security folks are kind of being like, “Well, no one would. Everyone knows, everyone in security knows that you shouldn’t plug in a USB stick.”
Luisa Rodriguez: Shouldn’t just pick up a USB stick. Yeah.
Sella Nevo: But let me challenge even those folks who think that this is obvious, and also in that way bring it back to the more secure networks we were talking about before. So indeed organisations with serious security know not to plug in random USB sticks. But what about USB cables? So Luisa, let me ask you, actually: if you needed a USB cable, and you just saw one in the hallway or something, would you use it?
Luisa Rodriguez: 100% I would use that. Absolutely. I actually, I’m sure I’ve literally already done that.
Sella Nevo: So here’s an interesting fact, which I think even most security folks don’t know. You could actually buy a USB cable — not a USB stick, a USB cable — for $180 that is hiding a USB stick inside and can communicate wirelessly back home.
So once you plug that cable in, an attacker can control your system from afar — not even in the mode I mentioned before, where you wait until a USB stick is plugged in again; it can just continuously communicate with and control your system. I guarantee you that if you toss that cable onto a tech organisation's cable shelf, it'll be plugged in.
Luisa Rodriguez: Absolutely. Yeah. That’s really crazy. Has that been used in the real world?
Sella Nevo: I don’t know. There’s a company that’s selling them. I haven’t seen reports of when it’s been used, but presumably if it’s a product on the market, someone is buying it.
Luisa Rodriguez: That’s really, really wild.
Model extraction [01:19:47]
Luisa Rodriguez: So those are a few that you worry about. There are a few that, going through your report, I was especially interested in. So one category of attack is AI specific, and I just wanted to touch on a few examples of those. One is called “model extraction.” What is model extraction, and how would it work?
Sella Nevo: So with model extraction, annoyingly, there still isn't a standard terminology in this space that everybody uses. Sometimes people use "model inversion" to refer to this, but sometimes people use "model inversion" to refer to a very different kind of attack. So we'll stick with "model extraction" for now.
Conceptually, it’s pretty simple: it’s ways that you can interact with the model — often querying it a bunch of times — and from its answers, inferring what its weights are. At least approximately.
Luisa Rodriguez: That sounds really hard. That sounds like training a new model.
Sella Nevo: It turns out it might be a lot easier than training a new model.
Luisa Rodriguez: How so?
Sella Nevo: Let’s start with actually talking about distillation. Technically this is not model extraction, but I think it’s very relevant. In distillation, you don’t care about the specific weight values that are in the original model; you just want to learn to imitate its performance. Even if your model ends up internally looking different, it behaves similarly, and it can do the same things.
Distillation is actually extremely standard in machine learning — it's done all the time as part of standard practice. Lots of the open source models that you're familiar with are distilled from closed models, in some cases based on standard APIs that you can find online.
Luisa Rodriguez: Interesting. And just so that I understand that, is distillation less compute-intensive than training the frontier models themselves?
Sella Nevo: Yeah. So it depends on the size of the model you’re distilling. For example, you can effectively distil a very large model into a smaller model. It won’t have the same quality, but it’ll have a lot of the same capabilities. And it’s also worth highlighting that compute is only one of those constraints. So potentially you could use a similar amount of compute, but without the training data, without the algorithmic knowledge — without all of these additional requirements.
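As a rough illustration of what distillation looks like in code — a minimal sketch only, assuming hypothetical `teacher` and `student` PyTorch models, an `optimizer`, and a batch of `inputs` — the student is trained to match the teacher's temperature-softened output distribution, using nothing but the teacher's answers:

```python
import torch
import torch.nn.functional as F

def distillation_step(teacher, student, optimizer, inputs, temperature=2.0):
    """One training step of classic knowledge distillation."""
    with torch.no_grad():
        teacher_logits = teacher(inputs)   # query the model being imitated
    student_logits = student(inputs)

    # KL divergence between the softened teacher and student distributions
    loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```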
Luisa Rodriguez: Yep. OK, so that’s distillation.
Sella Nevo: Maybe before we move on from distillation, I think that a lot of people will say that distillation is not a concern because the outcome is kind of lower quality, and so we don’t need to worry. I think that’s broadly reasonable, but I think people need to be careful about being overconfident about that not being a concern. I mean, in some contexts already, researchers use distillation — or what’s sometimes called “teacher-student” training — to actually improve the results. It sometimes improves regularisation, robustness, and things like that. So I think it’s not that obvious that it’ll always be worse than the original model.
Luisa Rodriguez: Interesting.
Sella Nevo: And even more importantly, it’s just a very nascent field. We’re still kind of fumbling around in the dark, understanding how these things work. So I think it’s plausible that distillation might be a major concern.
But yeah, we can put distillation aside and focus on proper extraction, where you're literally trying to identify the original weights of the model and replicate them exactly. So this has been done. Microsoft put out a tool called Counterfit — spelled with "fit," like fitting a model — that implements a variety of attacks. It's not just model extraction, but it includes multiple implementations of model extraction attacks that work on many large language models. By the way, they call it "model inversion," if you're looking for it. It's, again, very confusing.
But one might say: OK, still, the fact that there's a tool and they show it works is not the same as a real-world attack on a real-world deployed model. A recent paper, though, titled "Stealing part of a production language model" — which had authors from DeepMind and OpenAI — successfully stole the embedding projection layer of OpenAI models just by using the OpenAI API. So we do have a real existence proof of someone successfully doing this through the OpenAI API.
Luisa Rodriguez: OK, so is it that worrying, given that’s just one specific layer and not the whole model?
Sella Nevo: Yeah, that’s a very fair point. I’m not aware of an attack that takes all of these different components and does it together — one, in the real world, and two, attacks an actual frontier model and not just a smaller model, and the whole model was stolen.
But I don’t think that should give us much confidence that it can’t be done, or even that it has not already been done. The reason for that, one, is the more the cost of the operation and the value of being able to do it becomes higher, and the more it becomes illegal to do, people just become less incentivised to tell us to do that. The things we see are the academics who just want to publish a paper, not the folks who are really trying to operationalise this and do something. That is very important.
The second thing that I think is really worth being aware of is that we have precedents of people doing much, much, much harder things that are quite similar. Again, the importance of stealing AI models is fairly novel. We really should be learning from history, not just from AI-specific examples.
Luisa Rodriguez: Sure.
Sella Nevo: Just to give maybe one example of this: cryptographic hash functions are functions built for the sole purpose of making it hard to look at their outputs and infer their inputs. And it's an analogous situation. Many hash functions were successfully inverted — successfully used to infer keys, the kind of secrets they're meant to hide — years after they were very strongly trusted by the global community.
And maybe just to give a very concrete example that I think is intuitive to people: in the late '90s and 2000s, every single movie made a big deal out of how we should not be using cell phones because the government could listen in. It's not the only reason, but one of the reasons people were worried about cell phones in the '90s and 2000s is that the hash function used in GSM — the second generation of cell phone protocols — was once thought to be secure, but turned out not to be. It turned out that, with enough research, you could look at a bunch of queries to the hash function and infer the secret key it was applied to.
And initially, it took 50,000, 70,000 queries to be able to extract that key. So then the mobile companies, or the companies that create the SIMs more specifically, blocked it — so that if you try to query it tens of thousands of times, it’ll kind of self-destruct in various ways. But then, after a few more years of iterations on research, the key could be extracted with just eight queries. So you ask eight questions and you can extract the key.
The reason I tell this story — and it's not a perfectly identical situation — is that at least some of these hash functions were the result of many, many years of top cryptographers trying to make this exact thing literally impossible, and people still succeeded in breaking them. No one has even tried to do that with these neural networks: they were not built to prevent these kinds of things from happening. So yeah, I think people should not rush too quickly to infer that something is impossible.
Luisa Rodriguez: So I don’t know really anything about cryptography or hash functions, but first, can you tell me what is happening when someone submits a query that then returns values that are part of how the hash functions were understood?
Sella Nevo: Yeah. So I was speaking very loosely to try and not get bogged down in details. If we are going to get bogged down in details — or not bogged down, hopefully engaging excitingly with the details! — a hash function has an input and an output. So you put in, let's say, a string of text, and it produces a random-looking string of, say, 256 bits. The goal is that if you see that output, you can't know what someone inputted into it, but also that in some ways it represents that original input — so if you put two different inputs into it, you won't get the same output. That's roughly what they're meant to do.
In practice, it’s used in a lot of different settings. One of them, for example, is saving passwords. You don’t want to save the original password, because if someone hacks you, they can find it. You save the hash of the password, and then when someone tries to log in with the password, you can hash that password, compare it to the hash. It should only be equal if the password is the same. But if someone has the hash, they still don’t know what password they need to give to be able to pretend to be you.
Luisa Rodriguez: Right. OK.
Sella Nevo: The way they’re used in the context that I was talking about is differently. Often what you would do is, a form of authentication to show that you have a password or a key is to put in the key plus some kind of, let’s say, random string into a hash function. And then say, look, here’s the random string — this is sometimes called the “challenge” — and here’s what the result of the hash is. I wouldn’t be able to produce this hash if I didn’t have the key. But because the hash doesn’t let you go back from the output to the inputs, you can’t find out what the key is.
And these hash functions are literally built to optimise for this: given a certain amount of compute — because you want them to be efficient and fast, just like you want neural networks to be efficient and fast — they optimise the operations they perform to make it hard to go back and work out what was put into them. And the way they do that is mainly by mixing lots and lots of different operations that are conceptually, mathematically hard to analyse together. Some of them are additions and multiplications. Some of them are bit operations — like "and" and "or." Some of them are operations in different mathematical fields. Some of them just use a kind of hard-coded lookup table: you see 13, you put back 207. These kinds of things.
So a neural network could be thought of as a hash function. If you care about people not knowing the weights, you could think of the input as a random string that the people querying it may know, and the weights as the key. It takes both of these things — the inputs and the weights — and computes a function. That function is not terribly hard to analyse. It's not easy either — it's not linear and things like that, which is good from this perspective — but it definitely hasn't been optimised to be hard. And then it produces an output that people see. And so you hope you can't go from the output back to the original weights.
People’s intuition that because you’re mixing things up a lot, you just can’t go back is not a great intuition — against at least people who know cryptography. So it’s true that you do want to mix things more to make it harder to go back, but you really need to do things quite intelligently if you want to avoid it being possible.
So I’m not claiming an actual attack on model extraction here. I’m just trying to say a lot of our hash functions took years of research to find the way to do that. Neural networks do a less good job, I’m going to guess, than hash functions — and therefore I think it’s very plausible that in a few years that will be found.
Luisa Rodriguez: That is fascinating. So we don’t have real-world examples of model extraction for entire models, but there’s some reason to think it’s possible, and I found that pretty compelling. How hard would this be to do? Or what kind of actors might be able to do it?
Sella Nevo: I think the development of a new model-extraction attack is at least not trivial. You would need to know these kinds of attacks. It’s very hard to say. It’s genuinely, in my view, hard to say whether it’ll be that you need just like a cryptographer to spend a few months on it, or if it’s like a huge challenge and you’d need years of top talent to do. I don’t know the answer to that. But it’s also worth noting that once an attack is found, it may be the case that then everyone can do it easily.
Luisa Rodriguez: Right, right.
Sella Nevo: For example, with cell phones, I was mentioning that there was a vulnerability in the hash function. Once this was published, you could buy a device that you could ask someone for their phone — “I need to make a phone call. Can you give me your phone for a moment?” — they give you their phone, you slip out the SIM, you put it into your device. The device does that in 30 seconds. You give them back their phone, and now you can, for example, duplicate their phone and make calls on their behalf. So I think we just mainly need to maintain a high level of uncertainty about what will be possible and by whom.
Luisa Rodriguez: OK. Again, I don’t know much about hash codes in particular, but maybe one difference is that large language models have an enormously large number of model weights. Is that something that could make this a kind of impossible task, or near impossible?
Sella Nevo: Yeah, that’s a great point. That is a difference. In all the hash function examples I was saying, the amount of data you eventually want to discover is smaller. So that’s a genuine difference, and a good point.
Just to step back for a moment. Experts we talked to had a huge range of views on whether this is feasible or not — from some claiming this will be trivial to do, to some claiming that this is literally mathematically impossible. And I tend to agree that the kind of information-theoretical argument is the strongest one against this being possible. Maybe to kind of rephrase that argument, I would say, if you’re trying to infer 10 trillion bits of information, you can’t do that by only seeing a billion bits of output. And therefore, we can mathematically show that you can’t do this.
So I think that is a very big challenge for extracting neural networks. I think that’s an important point. To be cautious about being overconfident, let’s maybe hash this out in a few ways that make this, I think, not provably impossible, but rather just, “here’s a challenge that would need to be overcome.”
Luisa Rodriguez: OK, great.
Sella Nevo: First: who says we need the full information in the neural network? One example arguing against that is that quantised models that only use four bits out of every 16 bits, for example, work reasonably well — and we’re still not sure how much we’ll be able to squeeze out of quantised models in the future. So that’s already reducing the size of the information you need quite a bit.
Most frontier models today use what's called a "mixture of experts" — so they actually have, say, eight different independent models, and queries are routed through them so that you're actually seeing the answer of only one model at any given time. But different answers might be given by different models. So clearly you could take only one-eighth of the total model, and that can work well and be very useful for specific use cases. That one-eighth is orthogonal to the one-quarter, for example, that we mentioned before, so that continues to reduce the size you actually need.
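As a back-of-the-envelope illustration of how these reductions compound — every number below is hypothetical, chosen only to show the arithmetic:

```python
full_bits = 1.6e12 * 16                    # hypothetical 1.6-trillion-parameter model at 16 bits per weight
after_quantisation = full_bits * (4 / 16)  # 4-bit quantisation keeps ~1/4 of the bits
one_expert = after_quantisation * (1 / 8)  # one expert out of an 8-expert mixture keeps another 1/8

print(f"{one_expert / full_bits:.1%} of the original bits")   # ~3.1%
```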
Luisa Rodriguez: Yeah, yeah.
Sella Nevo: Now, this is just the easiest example of a mixture of experts. Who’s to say that you can’t get a usable model for more specialised needs? Let’s say you want to abuse it in a specific way. Who says you couldn’t do that with a lot less? There’s been quite a bit of success on what’s called “task-specific distillation.” It’s this distillation we talked about before, but you don’t care about the whole model — you care about more narrow tasks that you want to use the model for. There has been success in using that, so that seems to be an indication that potentially you can get a much smaller component of the model and still be able to use it.
What that means is, even if we’re not talking about anything new that we’ll discover in the future — and as I assume you’re inferring by now, I think there might be a lot of things we’ll discover in the future — just with this, let’s do some simple calculations.
ChatGPT, for example, has more than 100 million daily users. Each response is usually thousands of bits in length. That means we already know it produces at least 100 billion bits of output per day, and probably a lot more, because most people who use it aren’t asking just a single question. What that means is that if you’re able to drive even a small but significant portion of queries to ChatGPT, and you do so in a distributed and undetected manner, so you pretend to be millions of users and so on, that could already get you enough information to overcome this kind of “there’s not enough information” barrier in just a couple of months.
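[Note: here is the same back-of-the-envelope calculation written out in Python. Every number is an illustrative assumption in the spirit of the figures Sella quotes, and the calculation only asks when the total output an attacker has seen matches the size of a hypothetical 10-trillion-bit model.]

```python
# All numbers are illustrative assumptions, not figures from the report.
daily_users = 100e6           # "more than 100 million daily users"
bits_per_response = 2_000     # "thousands of bits" per response
responses_per_user = 10       # most users ask more than one question

bits_per_day = daily_users * bits_per_response * responses_per_user
print(f"output produced: ~{bits_per_day:.0e} bits per day")    # ~2e12

# If an attacker quietly drove 10% of that traffic, how long until the
# output they have seen matches a hypothetical 10-trillion-bit model?
target_bits = 10e12
attacker_share = 0.10
days = target_bits / (bits_per_day * attacker_share)
print(f"~{days:.0f} days at a {attacker_share:.0%} share")     # ~50 days
```

With these particular inputs it comes out to roughly a couple of months, the ballpark Sella gives; tweak any of them and the timescale moves accordingly.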
I’m not trying to say, “…and therefore it’s easy,” right? It’s hard to do these things, to be like 10% of all of ChatGPT queries without being caught. It’s not trivial. I have tonnes of uncertainty on how the growth in model size will compare to the growth in market size in the future, and that’ll affect how easy this will be.
But let’s now take this a step further. Instead of trying to do a distributed attack across ChatGPT’s public API, what if you managed to sit on the company network? You don’t manage to get to the weights — maybe they protected the weights very well — but you are in the company network. Now you can query it a lot more, a lot faster, and maybe with a lot less monitoring, because you’re not going through the public API. People tend to trust their own company more, so you could potentially do that a lot faster.
So the point is this: I don’t know if it’s possible, but I just caution against overconfidence that it is not possible.
Luisa Rodriguez: Yeah. It does just seem much better to try to protect yourself against it and have it turn out not to be an issue, than to say, “Probably it can’t happen,” and have it turn out that it can.
Do we have ideas or maybe just specific things already that we know how to do to protect weights against this kind of attack?
Sella Nevo: There are a lot of things that people are discussing and even using. They’re still nascent. So I think there’s still not a lot of data on how reliable these things are, and how effective they are. But one classic thing that makes sense is fuzzing your results — so making sure you introduce more randomness into your process. That makes things generally harder. It’s not a proof that it will prevent an attack, but it seems to be moving in the right direction.
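[Note: as a minimal sketch of the “introduce more randomness” idea, the snippet below adds small random noise to whatever scores a serving system returns, so repeated identical queries no longer reveal exact values for an extraction attack to fit. The function name and noise scale are hypothetical illustrations, not measures taken from the report.]

```python
import numpy as np

def fuzz_scores(scores: np.ndarray, noise_scale: float = 0.05) -> np.ndarray:
    """Add small Gaussian noise to model output scores before returning them."""
    rng = np.random.default_rng()
    return scores + rng.normal(0.0, noise_scale, size=scores.shape)

scores = np.array([2.1, 0.3, -1.4])
print(fuzz_scores(scores))   # the same query now yields slightly different
print(fuzz_scores(scores))   # outputs each time, raising the cost of extraction
```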
Luisa Rodriguez: OK, cool.
Reducing and hardening authorised access [01:38:52]
Luisa Rodriguez: OK, so the report makes seven top-priority recommendations for AI companies that you argue are critical to model weight security, but also feasible within about a year. I want to talk through a few of those. What is the single most important thing AI companies should do now to secure their model weights that they might not be doing already?
Sella Nevo: So first of all, I’ll give maybe my general constant caveat, which is that security can’t be solved with a handful of isolated actions or policies. It’s sort of a weakest-link game, and therefore there really are a lot of things that need to be done. It’s important to achieve comprehensive security.
To try and help with that, just to lay some groundwork, we provide benchmarks for five security levels. Each security level is aimed at defending against an additional category of attacker capability: what we call an “operational capacity” category. Each benchmark provides recommendations for what you need to do to achieve that security level. You mentioned seven highlighted measures; in total, we give 167 recommendations for security measures that we think organisations should take.
And just to clarify, the expectation is not that everyone implements exactly these measures, but it’s a benchmark. It’s meant to say you need something comparable to this if you hope to be secure against these kinds of actors. That being said, let’s talk about some specific ones.
Let’s maybe start with what I think is a very obvious one, which is reducing and hardening authorised access. I like to think about this maybe in three steps.
The first is centralising and managing all copies of the weights in a proper access-controlled system. You should not have copies of the weights lying around on people’s laptops, just saved on their hard disks where they can do whatever they want with them. You need your system to allow for permissioning: who is allowed to access the weights and who is not. You need it to allow for monitoring: no one can interact with the weights without us knowing they’re interacting with the weights. And it needs to prevent trivial copying: Ctrl+C, Ctrl+V, and now I have it somewhere else. That seems like a very basic thing if it’s very important to you not to have the weights stolen.
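[Note: a toy sketch of those three properties, permissioning, monitoring, and no trivial copying, is below. All names are hypothetical; a real deployment would sit on an identity provider, an audit pipeline, and hardware-backed controls rather than a single Python class.]

```python
import logging

logging.basicConfig(level=logging.INFO)
AUTHORISED_READERS = {"inference-service", "eval-pipeline"}   # hypothetical

class WeightStore:
    """One managed copy of the weights, behind permissioning and monitoring."""

    def __init__(self, weights: bytes):
        self._weights = weights

    def read(self, principal: str) -> memoryview:
        if principal not in AUTHORISED_READERS:               # permissioning
            logging.warning("denied weight access: %s", principal)
            raise PermissionError(principal)
        logging.info("weight access granted: %s", principal)  # monitoring
        return memoryview(self._weights)   # a read-only view, not a new copy

store = WeightStore(b"\x00" * 1024)
view = store.read("inference-service")
# store.read("random-laptop")  # raises PermissionError, and the attempt is logged
```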
Luisa Rodriguez: And actually, because it does seem pretty basic, I’m curious, do you have a sense of whether this is something that the leading AI companies are doing already?
Sella Nevo: I can’t talk about any specific company, but I’ll say just kind of a general sense, which is that I think this is not already comprehensively implemented across AI labs. I think they’re interested in this, I think they’re working towards this, but I don’t think this is comprehensively implemented.
But that’s just step one. Let’s talk about step two. So second is you then want to reduce the number of authorised users, or at least those that have full-read access. Currently, even in the labs that I think are taking security most seriously, still hundreds of people have access to the weights. And I would argue that that is way too much.
Maybe I’ll give a simple estimate. This is somewhat simplistic, but I think still useful. Let’s say that you’re the CEO of a frontier lab. I would argue there is no chance that you have 50 employees that you are at least 98% confident wouldn’t steal the weights. Remember, these things are worth at least hundreds of millions of dollars — and someone might be bribing them, extorting them, using an ideology that they believe in, and so on. Maybe you’re a great socialite, but I certainly don’t know 50 people well enough to be 98% confident they would never do that.
But let’s imagine that you do have 50 people you’re 98% confident in. Still, if you gave all 50 of those people full-read access to the weights, each of them could potentially leak them. And 98% to the power of 50 is about 36%: that’s the chance that none of them do. So there’s still almost a two-thirds chance of a leak with only 50 employees having access. And these are highly trusted employees.
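[Note: for anyone who wants to check the arithmetic, here it is as a two-line calculation, using the same simplistic independence assumption Sella flags above.]

```python
p_trustworthy = 0.98   # confidence that any one employee would never leak
n_with_access = 50

p_no_leak = p_trustworthy ** n_with_access
print(f"chance none of the 50 leak: {p_no_leak:.0%}")        # ~36%
print(f"chance at least one leaks:  {1 - p_no_leak:.0%}")    # ~64%
```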
Maybe just to clarify: obviously this is a very painful tradeoff. The labs genuinely do in some sense need hundreds of people to have access to the weights to do their job. This is model developers, infrastructure developers, interpretability researchers. There’s a lot of different people who need this.
So the way to overcome that tradeoff is through what I’d call step three, which is hardening interfaces. You just want to make sure that just because someone has some access doesn’t mean they have a concerning type of access.
Luisa Rodriguez: OK. Yeah. What’s the difference between a concerning type and some access that’s still OK?
Sella Nevo: So there are a lot of technical details, but there are a few things you can do to improve the security of an interface so that it’s at least not obviously concerning. Before I talk about what they are, let me just quickly say we already do this for external access. In some sense, all of us have access to the weights. It’s just a very limited interface: we can all ask OpenAI or Anthropic or Google DeepMind to use the weights, but only specifically for inference. We need to do something similar for all access, including internal access.
Now, the same solution won’t necessarily work, but in the report we suggest three approaches that could be used for different types of interaction with the weights. The first is maybe most similar to how we think about external interaction, and that is preapproved hardened interfaces: there’s specific code, we’ve tested that code, we’ve had top security people looking at it, and we say, by only asking the server to run this code, you can’t abuse it to steal the weights. The classic example is inference. It’s fine: we want to use these models for inference, but not for other things. But you could do that internally as well. Maybe you want 50 other types of interfaces that your researchers and employees will be able to use, but that don’t allow them to directly access the weights.
The second option is using output rate limitations. I think this one is motivated by the fact that it can be hard to only use predefined code to interact with the weights. Many employees will complain about that. Some of them have complained to us as we did interviews for this report. But you can use an alternative. You can say, OK, you want to run your own code, whatever you want. You want to be flexible on these weights. That’s fine, actually. But you need to do this through an interface that limits how much output you can actually extract from this whole computation.
For example, you send your code into a secure server, that secure server runs it, it’s fine. That code is flexible, it does whatever it wants, but then when it sends it back, it’s rate-limited. Maybe it only allows you to produce 100 bits per second of output. And note, this is not just 100 bits of the weights, this is 100 bits of any kind of output. Because I don’t want any fishy business of, you take the weights, you then do some calculations, and it looks like it’s not the weights. It’s any kind of output.
That’s not a trivial constraint to work under, but if we had good infrastructure for doing this, it would be very effective. Right now, if you want a plot of some statistic over the weights, that plot is a picture, and that picture costs a lot of bits. That’s a pain. But if you set your infrastructure up right, it helps that humans can’t meaningfully process more than about 100 bits per second. So presumably if, for example, you didn’t send the PNG but only the underlying data, and then rendered the plot on your own computer, that would work. So this requires more infrastructure, but I think it actually allows for a lot of flexibility while really reducing the risk.
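[Note: the sketch below shows the shape of such an interface: arbitrary code runs inside, but every byte that leaves is throttled to a fixed bit rate. The class name and numbers are hypothetical, and a real system would enforce the limit in infrastructure the researcher cannot modify, not in their own process.]

```python
import time

class RateLimitedChannel:
    """Cap how many bits per second can leave the secure environment."""

    def __init__(self, bits_per_second: int = 100):
        self.bits_per_second = bits_per_second

    def send(self, payload: bytes) -> None:
        bits = len(payload) * 8
        delay = bits / self.bits_per_second
        time.sleep(delay)   # pay the time cost before anything is released
        print(f"released {bits} bits after {delay:.0f}s")

channel = RateLimitedChannel()
channel.send(b"mean=0.0123, std=0.98")   # a small summary statistic: ~2 seconds
# Pushing gigabytes of weights through the same channel would take years.
```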
Luisa Rodriguez: Cool.
Sella Nevo: Finally, maybe a third version is one that further trades off the ability to interact with the weights freely against the ability to then control how information leaves the system. And that is: you could work on an actually isolated network — for example, the air-gapped network that we mentioned before — but maybe you want something a bit better than just air-gapped. You also don’t want people sticking USBs in and out, and there’s a long list of other things you want to constrain.
But this would be useful for the rare interactions that truly need heavy input/output with the weights. For example, this is contested, but some people claim that maybe some types of interpretability research really need to read a lot of the weights, interact with a lot of the weights in a free way. That’s fine. You want to fully, freely interact with the weights? Please walk into this room. This room is secured. You can do whatever you want, but you can’t leave with that data.
So these are kind of three different ways you can constrain it that serve a lot of different types of interactions you might want to have with the weights.
Confidential computing [01:48:05]
Luisa Rodriguez: Cool. OK. I really would love to ask more questions about that, but I think we should actually talk about another recommendation that you think is high priority. What’s the next kind of most important, or would make the biggest difference?
Sella Nevo: Let’s maybe talk about another one called “confidential computing.” So this requires a bit of background. For many decades, and in some sense for multiple millennia, people knew that if you’re sending sensitive data over an untrusted connection, you need to encrypt it. That’s why we all know we’re supposed to use SSL, or today TLS, when communicating over the internet, which is famously unsafe. But it’s also why Julius Caesar, thousands of years ago, would encrypt messages before he sent them on a carrier pigeon or with a messenger boy.
These days this is called “encryption in transit”: while the data is moving around, you encrypt it. That was a good start. But then people found ways to get into your actual systems — for example, your actual computer — and read sensitive data from there. So people started encrypting data even when it’s stored on allegedly safe devices, such as your own hard disk. And this is called “encryption in storage”: while you’re storing the data, it should also be encrypted.
That’s better, but then attackers can still just wait on the system until you’re about to use the data — for example, if we’re talking about weights, you’re about to do inference — let you decrypt it, and then steal the data.
This is actually the situation right now for the vast, vast majority of systems. But we’re at the cusp of that changing. Confidential computing is one approach to solving that problem. The general goal is what’s called “encrypting data in use.” So even while you’re using the data, it should be encrypted. But that’s more challenging than encrypting in storage and in transit, because how can you keep the data encrypted while using it? The whole idea of encryption is that it’s kind of scrambled.
And it is indeed a hard problem. The most hardcore approach is called homomorphic encryption. That’s a special type of encryption that allows you to do certain calculations on the encrypted data without even decrypting it. It’s pretty cool. Mathematicians love it, but it has a lot of practicality problems. It has huge overheads. It severely limits what you can and can’t do with the system. And it doesn’t really work with large networks, at least so far. There’s progress that would need to be made before that’s possible in practice.
The most popular approach is what’s called confidential computing. Roughly, it works as follows. The weights are encrypted in storage; that part we’ve already covered. You have a separate “trusted execution environment” — that’s often a separate chip with a bunch of different defences, where you’re not supposed to be able to just do whatever you want. The decryption key, the thing that can decrypt the weights, is stored only in that trusted environment. And that trusted environment will only run specific signed code that takes the weights, decrypts them within the trusted environment, not outside it, and performs inference. It will not agree to run code that does other things — for example, decrypting and outputting the decrypted weights. So assuming everything works properly, no one can ever access the decrypted weights, even though you’ve used them. Does that make sense?
Luisa Rodriguez: Yeah, yeah, yeah.
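[Note: below is a deliberately toy sketch of that flow, just to make the moving parts concrete. The “enclave” is an ordinary Python object and the Fernet key stands in for hardware-protected key material and remote attestation, so this illustrates the logic, not the security.]

```python
from cryptography.fernet import Fernet   # pip install cryptography

key = Fernet.generate_key()              # in reality held only inside the TEE
encrypted_weights = Fernet(key).encrypt(b"raw model weights")

APPROVED_OPERATIONS = {"inference"}      # only signed, preapproved code

class ToyEnclave:
    def __init__(self, key: bytes):
        self._key = key                  # never leaves the enclave

    def run(self, operation: str, prompt: str) -> str:
        if operation not in APPROVED_OPERATIONS:
            raise PermissionError("unapproved code refused")
        weights = Fernet(self._key).decrypt(encrypted_weights)   # only in here
        return f"ran inference over {len(weights)}-byte weights for {prompt!r}"

enclave = ToyEnclave(key)
print(enclave.run("inference", "hello"))
# enclave.run("dump_weights", "") would raise PermissionError: the enclave
# refuses to run code that could output the decrypted weights.
```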
Sella Nevo: Now, it’s worth mentioning that this requires some infrastructure, and the hardware is not really ready yet. The first GPU that supports confidential computing came out recently, but it’s not really ready for prime time. But we’re close, and once the infrastructure is there, the overheads of doing this are pretty minor.
So a lot of people are excited about this. There’s actually a shocking consensus, really. We were quite surprised by how everyone seems to be excited about confidential computing. This is just like, across the industry, people are pretty uniformly excited about it.
Luisa Rodriguez: Cool.
Sella Nevo: So yeah, as a result, we strongly recommend that hardware companies like Nvidia prioritise better support for this, and that AI companies prioritise the deployment of this.
I do, though, want to quickly say that they have their limitations. Everyone is super excited about this, and I am too, but sometimes people go a bit too far and start treating it as a silver bullet, as if it’s perfectly secure. It’s not. We recommend it at security level 4: so it protects against certain actors, but I don’t think it’s sufficient against some actors.
I’m not saying this to pour cold water on the solution. I’m very excited about moving forward with it. I still want people to be aware that it does not fully protect you, and a lot of other things are needed. Just to flag quickly: there’s a bunch of attack vectors that we discussed in the report that are simply unaddressed by confidential computing, like the distillation and model-extraction attacks we discussed. Abusing the model in place obviously isn’t prevented either, because you’re still providing an interface for doing inference and so on.
They’re not meant to protect against sophisticated physical attacks — so if an attacker has long-term physical access or uses invasive techniques, they do not protect you against those things. The only current GPU that supports confidential computing, the H100, doesn’t even include physical attacks in its threat model. It doesn’t protect you against hardware supply chain attacks. They’re not really secure against side-channel attacks without extra measures that are not currently incorporated. And they don’t protect you against the people setting up the system itself, whom you have to trust to sign the right code.
So just to kind of flag that it’s an incredible step forward. I’m very supportive. It’s one of the most important things we should be doing. But do not fool yourselves that you are now fully secure.
Luisa Rodriguez: OK, great. I think that does seem like a really important caveat.
Red-teaming and security testing [01:53:42]
Luisa Rodriguez: Let’s talk through one more recommendation. If there were a single additional thing that you could get everyone to do, what would it be?
Sella Nevo: I think as this is the last one we’ll talk about, maybe I’ll go meta, which is rather than talk about a specific security measure, I want to talk about red-teaming and security testing. I think that’s really important, and really important to do properly. Just to maybe motivate this as well, we don’t have a good first-principles way of identifying that a system is secure. The information security field advances primarily by people identifying new vulnerabilities and then finding solutions to address them. Similarly, it’s very hard to define a set of predefined tests or questions that ensure that a system is secure. One of the best things we can do is just let a really talented team try to reach the weights and exfiltrate them, and see if they succeed.
In practice, there’s a bunch of different ways of doing this. Each one has advantages and disadvantages. Red-teaming is kind of closest to what I described as literally having someone try to break in.
Very small side note: recently, folks thinking about AI have been using the word “red-teaming” to mean something like capability evaluations. That’s a different thing; here we’re talking about red-teaming of the security system.
There’s also stuff like blue-teaming, which is when the security team itself tries to identify opportunities for improvement and flaws and then addresses them. There’s purple-teaming, when the red team and the blue team collaborate. There’s a whole world of colours and words and things. I’m going to put all that aside.
I’m going to talk about red-teaming, though. There’s value in doing a lot of different things. I think red-teaming is really important, because it’s a really efficient way to try and get top experts to find where the flaws are and then flag them — as opposed to listing all the things that one needs to do, and then even a single line of code that’s written wrong could undermine the whole security of the system. It’s hard to do that without letting someone say, “You’re an expert on this, go find that line of code.”
Luisa Rodriguez: Go try it.
Sella Nevo: But it’s really important to ensure that this red-teaming effort is effective rather than just providing a false sense of security. It’s very easy to have a red-teaming effort that isn’t well placed to find what it needs to find, and as a result just makes you feel good but isn’t really reliable.
Luisa Rodriguez: How exactly does that go wrong? Is it just that the list of things they’re trying is some sensible things, but they’re not actually being super creative and trying to come up with new approaches to problems they hadn’t previously guessed might be problems?
Sella Nevo: Yeah, I think that’s a great example. Let me maybe generalise from that and say the team that you use needs to at least reasonably simulate the attacker you want to test for. Maybe the most obvious thing is you can’t say to someone, “Hey, try to hack my system for 30 minutes” — and now I know I’m secure over the next five years against, you know…
Luisa Rodriguez: Russia.
Sella Nevo: Exactly. So there are a lot of different things you need for that to happen. One is that you need the team to be sufficiently talented; they need to be capable. Second, you need them to have very significant resourcing: the amount of time and money and tools. You also need — and this maybe touches on what you were just saying — a diverse skill set. Our attack vectors aim to show that there’s cybersecurity involved, there’s physical security involved, there’s human intelligence — there are a lot of different things that your team needs to be able to do.
Now, as you move towards the highest security levels, it just becomes implausible to literally simulate the amount of resources that these organisations have. But there are other ways to help simulate this in a more cost-effective way. One of the ways to do that is to give your red team various “privileges.” Maybe they start off with the various credentials that we think a capable actor would be able to get. Maybe we give them information about how the security system works so that instead of spending months trying to reverse engineer it, they get a head start. Maybe we give them some kind of zero-day privilege where they can just say, “I’ve reached a system that’s hard. I have three cards that say, ‘I overcome this system.'”
Luisa Rodriguez: “I get to play a zero-day.”
Sella Nevo: Then you need to remember that if they succeed with all of these privileges, you can’t go back and say, “But an actor without the privileges wouldn’t succeed” — because the only reason they got these is because we do think that a more capable actor would be able to do that.
Luisa Rodriguez: Right. They’re representing real things.
Sella Nevo: Exactly. Just in a more cost-effective way. So that’s a really big component. A lot of it is just allowing them to do what they need to do: giving them permission to engage in a wide range of attacks, being allowed to physically try to break into things, which is nontrivial, being allowed to engage with employees. Of course, all of this needs to be done within reason. There are things that would just be immoral to do, but it’s important to allow them as much as we can.
Also, do this without the security team getting reports on the red team’s activity. If the security team knows that they’re coming and how, it’s no surprise that they then catch them. We want to check if they’d be able to go undetected.
And maybe finally — and this also touches on your point about creativity — making sure you have aligned incentives. Many red teams just don’t have sufficient incentives for everything we’d want them to do. They’re often highly motivated to do some things, but I would argue that their compensation should be influenced by whether they succeed or not, which is not classically the case. Sometimes what they’re aligned to is not exactly equivalent to the actual concern. I’ve sometimes heard organisations say, “It’s true that they always find vulnerabilities and get in, but we also always detect them.” But if you’re not incentivising them not to get detected — often they’re judged on how far they got in — then I think you’re not testing the right thing.
And finally I’ll just say, if you want the results of the red team to be not just a useful tool for the company itself, which is an important reason to do them, but also some kind of externally reliable signal on whether the company is living up to their security goals — let’s say if the government is interested in this, or if the public wants to be reassured — then you also need to use a third-party team. You can’t have only company employees saying “we are secure” if that statement has external implications for the company, and how trusted they are, and what they’re allowed to do.
Careers in information security [01:59:54]
Luisa Rodriguez: Yep. That makes tonnes of sense. It also just sounds like such a fun job, which is a great segue. We’re not going to talk too much about this, but I did want to ask you a little bit about this field, and the kinds of careers people might be interested in. How much need is there for more people to be working on this? And what kinds of things might those people do?
Sella Nevo: Oof, yeah, there’s an enormous need for more people who understand cyber and physical security well — and specifically ones who are willing to work towards preparing for the models of tomorrow, not just the kind of business-as-usual security that already a lot of cybersecurity experts work on.
There are a lot of different roles. I’ll maybe highlight two categories. One is that there’s a lot of work on the technical side — so actually developing secure systems, doing cyber evaluations. This work exists in the labs themselves; in other organisations that are not the labs, but are doing important research and development; and in governments as well: there are government agencies that try to do cyber evaluations, and they have a lot of work on building those systems and so on.
Then there’s also a lot of work on the policy side: developing policies that are informed by a deep understanding of both the challenge and the potential solutions to be able to create policies that actually make sense. This is a really key bottleneck at the moment for a lot of different organisations. This is true for government, this is true for think tanks like RAND, this is true for the AI labs themselves.
I’ll just flag that at RAND, we are hiring for both of these types of work — both technical work on this and people who want to develop policy solutions. But yes, I think it’s pretty common to hear people in AI security saying their number one wish on their wishlist is to have more people who understand information security and are willing to work on these problems.
Luisa Rodriguez: OK, yeah. If you’ve gotten to this point in the episode and you found this stuff interesting, it sounds like there’s lots of opportunity here for impact.
Sella’s work on flood forecasting systems [02:01:57]
Luisa Rodriguez: We’ve got time for one final question. This is a huge topic change, but you led the development of flood forecasting systems that now cover more than 450 million people across 80 countries. What exactly does that system do, and how does it work?
Sella Nevo: Yeah, that was a really great and fun project. It’s the shared effort of about 30 great people in Google Research.
It’s a complex system, but at a high level, what it does is the following. First, it collects lots of data from across the globe. This includes satellite data (optical imagery, hyperspectral, microwave, and more); topographic maps; weather models that include things like precipitation and temperature; on-the-ground measurements from rivers across the world; historical records of floods; and other things.
We then trained a machine learning model to predict the amount of water in rivers up to a week ahead. We used a model called an LSTM — a type of neural network for sequence prediction that predates transformers. Our five-day forecast is about as accurate as the previous state of the art’s zero-day forecast. So this was a huge leap forward in our ability to predict what will happen with rivers.
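[Note: for readers who want a concrete picture, here is a minimal, generic sketch of that kind of LSTM forecaster in PyTorch. The framework choice, feature set, window length, and layer sizes are all illustrative placeholders, not details of the published Google system.]

```python
import torch
import torch.nn as nn

class StreamflowLSTM(nn.Module):
    """Map a window of past basin features to river discharge several days ahead."""

    def __init__(self, n_features: int = 8, hidden: int = 64, horizon_days: int = 7):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, batch_first=True)
        self.head = nn.Linear(hidden, horizon_days)   # one value per lead day

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, days_of_history, n_features), e.g. a year of daily
        # precipitation, temperature, soil moisture, past discharge, ...
        _, (h_n, _) = self.lstm(x)
        return self.head(h_n[-1])                     # (batch, horizon_days)

model = StreamflowLSTM()
history = torch.randn(4, 365, 8)    # 4 basins, 365 days, 8 features each
print(model(history).shape)         # torch.Size([4, 7])
```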
Luisa Rodriguez: That’s incredible.
Sella Nevo: Yeah. One of the things that I’m most excited about is increasing access to high-quality warnings across the world, especially in areas where that doesn’t currently exist. And we were also able to show that our average accuracy across Africa and Asia is similar to the previous state of the art in Europe.
So that was a very exciting one. That tells us whether a river will flood, based on how much water is in it, but not exactly which areas will be affected. So we trained a separate physics/machine learning hybrid model to predict exactly how water will flow across the floodplain, and that allowed us to improve the spatial accuracy of warnings from about one-kilometre resolution to about 50-metre by 50-metre resolution.
And then finally, we use this information to directly notify individuals through Android notifications — so we send Android notifications to people who are affected — and we also collaborate with the Red Cross and other humanitarian organisations to support broader alerting and preparation, and we alert relevant governmental authorities for more serious efforts, like evacuations, that require the support of the government.
Yeah, it’s a really exciting project. It took quite a few years, but we now have randomised controlled trials showing that this significantly reduces injuries and costs from floods — though most of these results are not yet published [though a preprint is available!]. But yeah, it was a very exciting project.
Luisa Rodriguez: That sounds like an incredible project. Thank you very much for working on it, and thank you so much for coming on the show. My guest today has been Sella Nevo. Thank you.
Sella Nevo: Thank you.
Luisa’s outro [02:04:51]
Luisa Rodriguez: I hope you really enjoyed my conversation with Sella. If you did, it could be that the right next step for you is speaking to our one-on-one advising team.
We’ve mentioned them on the show before and gone over all the things they can do for you — from helping you craft a plan, to introducing you to experts in fields, to giving you feedback on the plans you do have.
What we haven’t told you much about before is who our advisors are, and I’m excited to share that in April and May of this year — 2024 — we brought on two new advisors who’ve really expanded the scope of perspectives on the team.
One is Laura González Salmerón, who I think would broaden just about any team. Her immediate role before 80k was in impact investing, but before that she was variously a journalist, a literature PhD, a community builder in the Spanish-speaking world, and even the author of a children’s book.
Our other new hire is Daniel Dewey, who, after a brief stint as a software engineer at Google, saw the urgency of getting transformative AI right and became one of the earliest people working to coordinate technical and policy talent on these questions we cover so often here, as well as doing his own research in both technical AI alignment and governance. Across these issues, he’s worked as a grantmaker; an independent, grant-funded researcher; and in academia — so there’s lots of his own experience he can speak to in helping you figure out how you might contribute.
If you’re excited to talk to Daniel or Laura — or any of our longer-serving advisors, with backgrounds spread across law, consulting, finance, machine learning, maths, philosophy, and neuroscience — I strongly encourage you to head to 80000hours.org/speak or just navigate to our advising application from the home page. The service is free, the application takes 10 minutes, and you now have so many interesting advisors to choose from — so don’t put off applying for a call any longer.
Finally, as a reminder, we’re hiring for two new senior roles, a head of video and head of marketing. You can learn more about both at 80000hours.org/latest.
The first of these would be someone in charge of setting up a new video product for 80,000 Hours. People are spending a larger and larger fraction of their time online watching videos on video-specific platforms, and we want to explain our ideas there in a compelling way that can reach people who care.
We’re also looking for a new head of marketing to lead our efforts to reach our target audience at scale by setting and executing on a strategy, managing and building a team, and deploying our yearly budget of $3 million.
Applications will close in late August, so if you think you might be a good fit, apply soon!
All right, The 80,000 Hours Podcast is produced and edited by Keiran Harris.
The audio engineering team is led by Ben Cordell, with mastering and technical editing by Milo McGuire, Simon Monsour, and Dominic Armstrong.
Full transcripts and an extensive collection of links to learn more are available on our site, and put together as always by Katy Moore.
Thanks for joining, talk to you again soon.
Related episodes