What Bing’s chatbot can tell us about AI risk — and what it can’t

You may have seen the new Bing. It’s impressive — and, reportedly, unhinged: manipulating people, threatening users and even telling one reporter it loved him.

You may also have seen me writing about the risk of an AI-related catastrophe.

I’m not just concerned about AI going wrong in minor ways: I think there’s a small but real chance of an existential catastrophe caused by AI within the next 100 years.

Here’s my view on Bing:

Bing does tell us a little about how careful we can expect large corporations to be when deploying AI systems.

But Bing isn’t very dangerous, and isn’t an example of the sorts of misaligned AI that we should be most worried about.

(Before moving on, I should disclose that my brother, Jacob Hilton, used to work for OpenAI, the AI lab behind both Bing and ChatGPT.)

How does Bing chat work?

Bing chat (like ChatGPT) is based on a large language model.

A large language model is a machine learning model trained, essentially, to continue whatever text it is given as input: it will write an article from a headline, or carry on a poem from its first few lines.
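
Here’s a minimal sketch of that behaviour, using the small, open-source GPT-2 model via the Hugging Face transformers library. GPT-2 is just an illustrative stand-in (it’s far weaker than the models behind ChatGPT and Bing, and not what OpenAI uses), but the basic “continue the text” behaviour is the same:

```python
# A minimal sketch of "continue whatever text you're given".
# GPT-2 is a small open-source stand-in for much larger models like Bing's.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

prompt = "Scientists today announced a major breakthrough in fusion energy:"
result = generator(prompt, max_new_tokens=40, do_sample=True)

# The output is simply the prompt plus the model's guess at how the text continues.
print(result[0]["generated_text"])
```

The underlying models behind ChatGPT and Bing start out in the same basic way, just trained at a vastly larger scale and on far more data.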

But a good chat assistant would do more than just continue whatever text you give it. So, somehow, engineers need to take a large language model and turn it into a kind, helpful, useful chatbot.

With ChatGPT, engineers used reinforcement learning from human feedback.1

Essentially, human labellers marked different outputs of the language model as ‘good’ or ‘bad,’ and the model was given this feedback.

They gave the model positive feedback for things like acting according to a coherent chatbot personality, giving helpful and polite answers, and refusing to answer questions that might be harmful.

And they gave the model negative feedback for things like discriminatory or racist language, or giving out dangerous information, like how to build weapons.

Gradually — in a stochastic way we can’t really predict — the model becomes increasingly likely to produce the kinds of answers that were labelled good, and less likely to produce the answers that were labelled bad.
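
To make that a bit more concrete, here is a toy sketch of the reward-modelling step that sits at the heart of this process. It is written in PyTorch with made-up data, and it is not OpenAI’s code: real reward models score whole sequences of tokens with a large transformer, and the chatbot is then fine-tuned (for example with an algorithm like PPO) to produce answers the reward model scores highly.

```python
# Toy sketch of reward modelling from human preference labels (not OpenAI's code).
# Assumption: each candidate response has already been turned into a fixed-size
# embedding; real systems score whole token sequences with a large transformer.
import torch
import torch.nn as nn
import torch.nn.functional as F

EMBED_DIM = 16  # made-up embedding size for this toy example

# A tiny reward model: response embedding -> scalar "how good is this answer?"
reward_model = nn.Sequential(nn.Linear(EMBED_DIM, 32), nn.ReLU(), nn.Linear(32, 1))
optimizer = torch.optim.Adam(reward_model.parameters(), lr=1e-3)

# Fake labelled data: pairs where a human preferred one response over another.
preferred = torch.randn(64, EMBED_DIM)  # embeddings of responses labelled 'good'
rejected = torch.randn(64, EMBED_DIM)   # embeddings of responses labelled 'bad'

for step in range(200):
    r_good = reward_model(preferred)  # scores for the preferred responses
    r_bad = reward_model(rejected)    # scores for the rejected responses
    # Pairwise loss: push the scores of preferred responses above rejected ones.
    loss = -F.logsigmoid(r_good - r_bad).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

The ‘reinforcement learning’ part then uses this learned reward signal in place of a human labeller, so the chatbot can get feedback on far more of its outputs than people could ever read; that is what gradually shifts it towards the kinds of answers that were labelled good.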

What went wrong with Bing?

We don’t really know. But, as with most products that end up full of mistakes, there could have been a combination of factors:

  • Internal pressure within Microsoft to ship Bing chat fast (especially before Google could launch something similar), and therefore before it was ready
  • Failures of communication between OpenAI and Microsoft
  • Complications resulting from whatever technical requirements Microsoft had for the project
  • Maybe other stuff we don’t know about

As a result, Bing chat was — intentionally or not — released before it was ready.

On a technical level, it seems pretty clear that while Bing chat is based on a powerful language model, things have gone wrong at the second stage: turning the powerful language model into a helpful chat assistant.

If Bing — like ChatGPT — was built using reinforcement learning from human feedback, then I think the most likely possibilities are that:

  • They just didn’t have (or use) enough data for the model to learn enough.
  • The data they used was too low quality for the model to learn enough.
  • They used a worse version of the reinforcement learning from human feedback algorithm.

I’ve seen speculation that Bing didn’t use reinforcement learning from human feedback at all, but instead used a much simpler process.

But whatever the cause, what seems to have happened is that Bing chat learned parts of what it needed to learn, but not all of it:

  • Bing is often polite, and ends even its most threatening messages with smiley emojis.
  • Bing seems to have a (sort of) coherent personality as a chatbot — it doesn’t just continue text.
  • Bing often refuses to have conversations on racist or otherwise harmful topics.

But, because of the deficiencies of its training process, Bing does all of these things inconsistently and unreliably.

Is this an alignment failure?

When I wrote about the risk of an AI-related catastrophe, I was chiefly concerned about alignment failures: that is, AI systems aiming to do things we don’t want them to do.

Clearly, Bing is an example of a system doing things that its designers didn’t want it to do.

But for misalignment to lead to the kinds of catastrophe I describe here, the system needs to aim to do these things: that is, the system needs to have some kind of coherent concept of a ‘goal’ and make plans to achieve that goal.

This goal-seeking behaviour is important because of the instrumental convergence argument. That is, if a system truly ‘has a goal,’2 it will also have instrumental goals such as self-preservation and seeking power. (See a longer explanation of this argument.)

It’s this last quality — seeking power — that concerns me most, because one particularly effective way of taking power permanently would be to cause an existential catastrophe. (Sceptical? See a longer explanation of this part of the argument too.)

I don’t think it’s very likely that Bing ‘has goals’ in the sense relevant to this argument; as I explained earlier, I think there are much better explanations for why Bing went wrong. So I don’t think Bing threatening users is an example of the kind of behaviour we should be most concerned about.

Is Bing an example of ‘letting the AI out of the box’?

One big reason to be less worried about AI-related catastrophe is that generally there are incentives not to deploy particularly dangerous systems.

People more concerned about the risk of AI-related catastrophe tend to argue that it’ll be difficult to ‘sandbox’ AI systems (i.e. contain them in a restricted environment) or that companies would, for a variety of reasons, be reckless (more on sandboxing dangerous AI systems).

Bing does look like a good example of a large corporation failing to adequately test an AI product before its release.

Ultimately, though, Bing isn’t a particularly dangerous system. It can be mean and cruel, and it can grant access to some harmful information — but it won’t tell you things you can’t easily find on the internet anyway.

My best guess is that the incentives for companies to be careful increase as the potential danger of systems increases. Bing’s failures seem much more like shipping software full of bugs (not unlike the famously buggy 2014 release of Assassin’s Creed Unity) than recklessly letting loose an AI system that could kill people, or worse.

As a result, I don’t think Bing tells us much about whether corporations will “let AI systems out of the box” if we start making systems that could be more dangerous.

So is there anything we can learn about dangerous AI from Bing?

I’ve argued that:

  • Bing isn’t a misalignment failure.
  • Bing doesn’t tell us much about how we’ll manage potentially highly dangerous AI systems, because it isn’t one.

So what does Bing tell us?

I think Bing is evidence that the AI race is heating up, and that corporations — including OpenAI, Microsoft and others — may, in an attempt to win that race, deploy systems before they’re ready.

That seems likely to be true of all AI systems, whether they’re dangerous or not.

This is especially worrying if we think people might accidentally create a dangerous system without realising it. I don’t know how likely that is. My best guess is that it’s unlikely that we’ll accidentally build systems dangerous enough to cause large catastrophes.

But it’s certainly possible — and a possibility worth taking seriously.

Learn more about the problem, and how you might be able to use your career to help.

Notes and references

1. OpenAI also used a number of other methods. For example, before using reinforcement learning from human feedback, OpenAI used supervised fine-tuning. That is, they trained the model using records of conversations between humans and an AI assistant, but both sides of the conversation were actually written by humans.

2. I’m not sure exactly what it means to ‘have goals.’ This means I can’t be sure whether Bing ‘has goals.’ But the fact that ‘having goals’ is so nebulous and ambiguous is also a reason to doubt the instrumental convergence argument as a whole.