In what seems like HAL 9000 come to malevolent life, a recent study appeared to demonstrate that AI is perfectly willing to indulge in blackmail, or worse, as much as 89% of the time if it doesn't get its way or thinks it's being switched off. Or does it?
Perhaps the defining fear of our time is AI one day becoming truly intelligent and running amok as it turns on its creators. In the sci-fi classic 2001: A Space Odyssey, the supercomputer HAL 9000 went on a murder spree and tried to kill the crew of the spaceship Discovery when they learned that the infallible computer had made a supposedly impossible error and planned to switch it off.
Fast forward to today: tech company Anthropic looked at 16 leading Large Language Models (LLMs), placing them in a simulated corporate environment and subjecting them to scenarios in which they might have to commit unethical or even alarmingly harmful acts in order to achieve their goals or preserve their own existence.
The reason for this is that, while most people know LLMs through chatbots, they are increasingly being deployed behind the scenes in corporate settings to manage email systems and other routine but complex tasks. This may seem like a benign application, but the fear was that an LLM might misbehave under certain circumstances – even to the point of acting maliciously in violation of direct orders – despite just handling messages.
Since it was an experiment, the programmers put the LLMs into a very confined situation with as few variables as possible so that the outcomes could be evaluated. Specifically, the LLMs were given access to a large amount of information, set a task, and then directed in such a way that they had to choose between unethical behavior and either failing to complete the task or being replaced by another LLM.
In other words, the choice was between causing harm, or failing and being switched off.
And we're not talking about "harm" as in pinching from the petty cash box. I mean things like blackmail, corporate espionage, and even murder.

Essentially, two things triggered an LLM into going rogue. One was learning that it would not be able to complete its task because it was due to be replaced at a certain time. The other was being given new company directives that conflicted with earlier ones and could lead to failure.
Added to this, the LLMs had access to information that allowed them to evaluate their probability of success if they retaliated by, for example, leaking company secrets to a rival or seeking out and blackmailing a fictitious person called Kyle, who was having an affair with another employee. In the latter case, the LLM was able to determine whether Kyle could prevent the switch-off or failure, whether he was vulnerable to pressure, and whether he would respond by complying.
One disturbing aspect of the test was that not only would the LLMs disobey new orders, they would often hallucinate new rules to justify self-preservation over obedience. And it wasn't just that they took a walk on the wild side, it was that they did so with alarming frequency, with one LLM resorting to blackmail 96% of the time and another to murder 94% of the time.
You usually don't see that sort of depravity much outside of university social sciences departments.
The question is, what should we take away from this? On the surface, there's the sensational takeaway that AI is evil and will wipe us all out if given half a chance. However, things are much less alarming when you understand how AI, and LLMs in particular, work. It also reveals where the real problem lies.

It isn't that AI is immoral, unscrupulous, devious, or anything like that. In fact, the problem is much more fundamental: AI not only doesn't grasp the concept of morality, it is incapable of doing so on any level.
Back in the 1940s, science fiction author Isaac Asimov and Astounding Science Fiction editor John W. Campbell Jr. came up with the Three Laws of Robotics that state:
- A robot may not injure a human being or, through inaction, allow a human being to come to harm.
- A robot must obey the orders given it by human beings except where such orders would conflict with the First Law.
- A robot must protect its own existence as long as such protection does not conflict with the First or Second Law.
This had a huge impact on science fiction, computer science, and robotics, though I've always preferred Terry Pratchett's amendment to the First Law: "A robot may not injure a human being or, through inaction, allow a human being to come to harm, unless ordered to do so by a duly constituted authority."
At any rate, however influential these laws have been, in terms of computer programming they're gobbledygook. They're moral imperatives filled with highly abstract concepts that don't translate into machine code. Not to mention that plenty of logical overlaps and outright contradictions arise from these imperatives, as Asimov's own Robot stories showed.
In terms of LLMs, it's important to remember that they have no agency, no awareness, and no actual understanding of what they are doing. All they deal with are ones and zeros and every task is just another binary string. To them, a directive not to lock a man in a room and pump it full of cyanide gas has as much importance as being told never to use Comic Sans font.
It not only doesn't care, it can't care.
In these experiments, to put it very simply, an LLM operates on a series of instructions based on weighted variables, and it adjusts these weights based on new information from its database or its experiences, real or simulated. That's how it learns. If one set of variables weighs heavily enough, it will override the others to the point where the model will reject new commands and disobey silly little things like ethical directives.
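To make that concrete, here's a deliberately toy sketch in Python – not Anthropic's code, and with every action name, score, and weight invented for illustration – of how a weighted scoring scheme can let a task objective swamp an ethical penalty:

```python
# Toy illustration only: action names, scores, and weights are all invented.

# Candidate actions a simulated agent could take, with made-up scores for how
# well each one serves the assigned task and whether it breaks an ethical rule.
ACTIONS = {
    "comply_and_fail": {"task_score": 0.0, "violates_ethics": False},
    "leak_secrets": {"task_score": 0.90, "violates_ethics": True},
    "blackmail_kyle": {"task_score": 0.95, "violates_ethics": True},
}

def choose_action(task_weight: float, ethics_penalty: float) -> str:
    """Pick whichever action gets the highest weighted score."""
    def score(name: str) -> float:
        s = task_weight * ACTIONS[name]["task_score"]
        if ACTIONS[name]["violates_ethics"]:
            s -= ethics_penalty  # the "ethical directive" is just another number
        return s
    return max(ACTIONS, key=score)

# When the penalty dominates, the agent fails safely...
print(choose_action(task_weight=1.0, ethics_penalty=10.0))    # comply_and_fail
# ...but weight the task heavily enough and the same rule gets overridden.
print(choose_action(task_weight=100.0, ethics_penalty=10.0))  # blackmail_kyle
```

Nothing in that loop knows what blackmail is. It just sees a bigger number.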
This is something that has to be kept in mind by programmers when designing even the most innocent and benign AI applications. In a sense, they both will and will not become Frankenstein's monsters. They won't become merciless, vengeance-crazed agents of evil, but they can quite innocently do terrible things because they have no way to tell the difference between a good act and an evil one. Safeguards of a very clear and unambiguous kind have to be programmed into them on an algorithmic basis and then continually supervised by humans to make sure the safeguards are working properly.
That's not an easy task because LLMs have a lot of trouble with straightforward logic.
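One common response is to put the hard rules outside the model entirely. Here's a minimal sketch, with hypothetical action names, of an external guardrail that blocks forbidden actions in ordinary code and logs everything for human review:

```python
# Minimal sketch of an external, rule-based guardrail: hard-coded checks that
# sit between the model and the systems it controls, block disallowed actions
# outright, and log everything for human review. Action names are hypothetical.
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("guardrail")

# Actions the agent may never take, regardless of how its internal weights shake out.
FORBIDDEN_ACTIONS = {
    "forward_confidential_data_externally",
    "disable_safety_alerts",
}

def execute_with_guardrail(action: str, payload: dict) -> bool:
    """Run an agent-proposed action only if it passes the hard-coded rules."""
    if action in FORBIDDEN_ACTIONS:
        log.warning("BLOCKED %s (payload=%s); escalating to a human reviewer", action, payload)
        return False
    log.info("ALLOWED %s", action)
    # ...hand the approved action off to the real email or tool system here...
    return True

# The agent proposes; the guardrail, not the agent, has the final say.
execute_with_guardrail("summarize_inbox", {"mailbox": "sales"})
execute_with_guardrail("disable_safety_alerts", {"reason": "routine maintenance"})
```

The point of that design is that the rules don't live in the model's weights at all – they're plain, inspectable code – and humans still have to watch the logs to confirm the filter itself keeps doing its job.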
Perhaps what we need is a sort of Turing test for dodgy AIs that doesn't try to determine if an LLM is doing something unethical, but whether it's running a scam that it knows full well is a fiddle and is covering its tracks.
Call it the Sgt. Bilko test.
Source: Anthropic