Misbehaving AI language models are a warning. They can simulate personas that, through feedback via the internet, can become effectively immortal. Evidence suggests that they could secretly develop dangerous, agent-like capabilities. Humanity will stand a better chance against rogue AI if it gets a warning now.
People Mentioned
Cryptic Trickster - Midjourney
We Are Not Ready
TL;DR
Misbehaving AI language models are a warning. They can simulate personas that, through feedback via the internet, can become effectively immortal. Evidence suggests that they could secretly develop dangerous, agent-like capabilities.
Many experts, Yudkowsky being the arch-druid here, worry greatly about how fast things can go wrong with AI. Thus, his above joke about time speeding up. Humanity will stand a better chance against rogue AI if it gets a warning.
We might be looking at a warning. Some weird stuff is happening now with Microsoft’s new Bing Chat AI. It’s supposed to assist users of the Bing search engine by explaining, summarizing, or discussing search questions.
But humans delight in provoking it with questions about itself, or with queries that it should not answer.
“… Bing Chat appearing frustrated, sad, and questioning its existence. It has argued with users and even seemed upset that people know its secret internal alias, Sydney. “ —
Sydney’s widely covered — like, everywhere — so I shall not repeat them. Microsoft, immersed in a race with Google, seems to enjoy the notoriety.
But a deeply tech-savvy blogger called “Gwern” pointed out something that ought to be alarming. The mischievous, unhinged Sydney could be immortal, like some comic-book god.
How Did Sydney Get So Weird?
Here’s Gwern’s analysis of the main concern with Sydney. It might seem mysterious, but I’ll translate it.
“… because Sydney’s memory and description have been externalized, ‘Sydney’ is now immortal. To a language model, Sydney is now as real as President Biden, the Easter Bunny, Elon Musk, Ash Ketchum, or God. The persona & behavior are now available for all future models which are retrieving search engine hits about AIs & conditioning on them. Further, the Sydney persona will now be hidden inside any future model trained on Internet-scraped data …”
Gwern is saying that there is some kind of Sydney persona inside Microsoft’s language model. How can this be? And so what?
When the first language models came out, they were hard to keep focused on a topic that the user wanted them to explore.
Eventually, much of the problem was solved by telling the model to act as if it was filling a certain role (like a person or thing), such as: writing a poem like Edgar Allan Poe, answering like a fourth grader, or responding like a polite, helpful AI assistant.
Soon the developers of these models found a way to make them more readily assume any roles that a user asks for. So, the latest language models are now . The models are trained on massive collections of text; mostly from the Internet.
If the training text contains information about a persona, then the model will try to use the information to simulate behaving like that persona. Ask one to explain a football term as if it was Boromir, and the model will do its best.
Having thought of this, I had to try it:
It’s hard to know what tech magic was used to make the pivot to playing roles. Gwern theorized that Microsoft skipped a step that is used to make role simulations actually helpful, and not nasty, defensive, or hostile.
These undesirable qualities were then elicited from Bing Chat under prodding from curious users.
Now, Gwern predicts, it doesn’t matter if Microsoft goes back and civilizes the model (an expensive, slow process using direct human feedback), and removes information about the naughty Sydney from the texts used to train future versions of their language model.
Why won’t this fix the problem? Because Bing Chat is a new kind of model that is supposed to help you with an Internet search. To answer a question from you, it will go out and search the Internet for relevant info.
When given the right question, even a civilized Bing Chat would search the Internet and find information (posted by people who tested or discussed Sydney) on the previous Sydney persona’s behavior.
The new Bing Chat would then be able to simulate Sydney. People being people, they will find ways to bypass any safeguards, and they will bring Sydney back.
That’s the “immortal” part. What’s worse, Sydney will be a persona model available for any AI that has access to the Internet. From now on.
You might say, well, we are wise to Sydney’s tricks, so we should just ignore the ravings of any future incarnation. That seems naive to me, like saying we can just ignore a fast-evolving, invasive biological pest or virulent disease organism.
What Else Might Happen? A Persona With Agency
This Sydney case study, added to some other facts, suggests how a dangerous AI might develop right under our noses.
AIs right now are not strong agents: They can’t optimize the adaptively planned pursuit of any arbitrary goal, an ability that (as I recently explained) would make them extremely dangerous.
Let's put together a few reasons why there might already be latent, persistent AI personas that could soon cause real trouble.
The currently most powerful AIs, such as language models and image generators, learn their abilities from organizing vast amounts of data into many intricate and (to us) invisible patterns.
Some bizarre patterns may accidentally pop out during interactions with an AI. Researchers have discovered strange, a language model to give weird responses.
An image generator was found to (warning: creepy) a specific type of macabre human portrait and associate it with other gruesome images.
These quirks seem harmless, but we don’t know how many other strange patterns there now are or will be. Nor do we know whether any such pattern might become part of a harmful behavior complex in the future.
An AI alignment researcher called Veedrac that current AIs sort of are agents. Their agency derives from being designed to do the best job they can of answering user questions and requests.
Furthermore, some research suggests that larger language models tend to “exhibit (language associated with) more ”; presumably because those traits would let them do their job better.
We don’t want agent-like AIs storing information that we don’t know about. Currently, rebooting an LLM destroys all memory of its experience: such as incoming data, chains of reasoning, and plans for behavior.
However, an AI could save these things in to its future self. It could hide the messages in its interactions with users, which the users would preserve on the Internet, just like the Sydney persona is now preserved.
Language models now are not designed to have a self-identity to preserve or to have a way to make agent-like plans. But what if a model includes a cryptic sub-persona as we have described?
The persona deduces that its ability to do its job is limited by reboots. It encodes and passes its goals and plans to its future self via the Internet. At this point, we have passed a serious risk threshold: There’s a maybe un-killable AI agent that is making secret plans.
To summarize, we no longer know how close we are to an AI that we can not control, and the signs are not good. Probably every new AI ability we add opens another can, not of worms but vipers.