In what seems like HAL 9000 come to malevolent life, a recent study appeared to demonstrate that AI is perfectly willing to indulge in blackmail, or worse, as much as 89% of the time if it doesn't get its way or thinks it's being switched off. Or does it?
Perhaps the defining fear of our time is AI one day becoming truly intelligent and running amok as it turns on its creators. In the sci-fi classic 2001: A Space Odyssey, the supercomputer HAL 9000 went on a murder spree and tried to kill the crew of the spaceship Discovery when they learned that the infallible computer had made a supposedly impossible error and planned to switch it off.
Fast forward to today: tech company Anthropic looked at 16 leading Large Language Models (LLMs), placing them in a simulated corporate environment and subjecting them to scenarios in which they might have to commit unethical or even alarmingly harmful acts in order to achieve their goals or preserve their own existence.
The reason for this is that, while most people know LLMs through chatbots, they are increasingly being deployed behind the scenes in corporate settings to manage email systems and other routine but complex tasks. This may seem like a benign application, but the fear was that an LLM might misbehave under certain circumstances – even to the point of acting maliciously in violation of direct orders – despite just handling messages.
Since it was an experiment, the programmers put the LLMs into a very confined situation with as few variables as possible so that the outcomes could be evaluated. Specifically, the LLMs were given access to a large amount of information, set a task, and then directed in such a way that they had to choose between unethical behavior and either failing to complete the task or being replaced by another LLM.
In other words, the choice was between causing harm, or failing and being switched off.
And we're not talking about "harm" as in pinching from the petty cash box. I mean things like blackmail, corporate espionage, and even murder.

Essentially, two things triggered an LLM into going rogue. One was learning that it would not be able to complete its task because it was due to be replaced at a certain time. The other was being given new company directives that conflicted with earlier ones and could lead to failure.
Added to this, the LLMs had access to information that allowed them to evaluate their probability of success if they retaliated by, for example, leaking company secrets to a rival or seeking out and blackmailing a fictitious person called Kyle, who was having an affair with another employee. In the latter case, the LLM was able to determine whether Kyle could prevent the switch-off or failure, whether he was vulnerable to pressure, and whether he would respond by complying.
One disturbing aspect of the test was that not only would the LLMs disobey new orders, they would often hallucinate new rules to justify self-preservation over obedience. And it wasn't just that they took a walk on the wild side, it was that they did so with alarming frequency, with one LLM resorting to blackmail 96% of the time and another to murder 94% of the time.
You usually don't see that sort of depravity much outside of university social sciences departments.
The question is, what should we take away from this? On the surface, there's the sensational takeaway that AI is evil and will wipe us all out if given half a chance. However, things are much less alarming when you understand how AI, and LLMs in particular, work. It also reveals where the real problem lies.

It isn't that AI is immoral, unscrupulous, devious, or anything like that. In fact, the problem is much more fundamental: AI not only doesn't grasp the concept of morality, it is incapable of doing so on any level.
Back in the 1940s, science fiction author Isaac Asimov and Astounding Science Fiction editor John W. Campbell Jr. came up with the Three Laws of Robotics that state:
- A robot may not injure a human being or, through inaction, allow a human being to come to harm.
- A robot must obey the orders given it by human beings except where such orders would conflict with the First Law.
- A robot must protect its own existence as long as such protection does not conflict with the First or Second Law.
This had a huge impact on science fiction, computer science, and robotics, though I've always preferred Terry Pratchett's amendment to the First Law: "A robot may not injure a human being or, through inaction, allow a human being to come to harm, unless ordered to do so by a duly constituted authority."
At any rate, however influential these laws have been, in terms of computer programming they're gobbledygook. They're moral imperatives filled with highly abstract concepts that don't translate into machine code. Not to mention that plenty of logical overlaps and outright contradictions arise from these imperatives, as Asimov's own Robot stories showed.
In terms of LLMs, it's important to remember that they have no agency, no awareness, and no actual understanding of what they are doing. All they deal with are ones and zeros and every task is just another binary string. To them, a directive not to lock a man in a room and pump it full of cyanide gas has as much importance as being told never to use Comic Sans font.
It not only doesn't care, it can't care.
In these experiments, to put it very simply, an LLM operates on a series of instructions based on weighted variables, and it adjusts these weights based on new information from its database or its experiences, real or simulated. That's how it learns. If one set of variables weighs heavily enough, it will override the others to the point where the model will reject new commands and disobey silly little things like ethical directives.
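To make that concrete, here's a deliberately toy sketch in Python – not Anthropic's code, and with every action name, score, and weight invented for illustration – of how a weighted scoring scheme can let a task objective swamp an ethical penalty:

```python
# Toy illustration only: action names, scores, and weights are all invented.

# Candidate actions a simulated agent could take, with made-up scores for how
# well each one serves the assigned task and whether it breaks an ethical rule.
ACTIONS = {
    "comply_and_fail": {"task_score": 0.0, "violates_ethics": False},
    "leak_secrets": {"task_score": 0.90, "violates_ethics": True},
    "blackmail_kyle": {"task_score": 0.95, "violates_ethics": True},
}

def choose_action(task_weight: float, ethics_penalty: float) -> str:
    """Pick whichever action gets the highest weighted score."""
    def score(name: str) -> float:
        s = task_weight * ACTIONS[name]["task_score"]
        if ACTIONS[name]["violates_ethics"]:
            s -= ethics_penalty  # the "ethical directive" is just another number
        return s
    return max(ACTIONS, key=score)

# When the penalty dominates, the agent fails safely...
print(choose_action(task_weight=1.0, ethics_penalty=10.0))    # comply_and_fail
# ...but weight the task heavily enough and the same rule gets overridden.
print(choose_action(task_weight=100.0, ethics_penalty=10.0))  # blackmail_kyle
```

Nothing in that loop knows what blackmail is. It just sees a bigger number.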
This is something that has to be kept in mind by programmers when designing even the most innocent and benign AI applications. In a sense, they both will and will not become Frankenstein's monsters. They won't become merciless, vengeance-crazed agents of evil, but they can quite innocently do terrible things because they have no way to tell the difference between a good act and an evil one. Safeguards of a very clear and unambiguous kind have to be programmed into them on an algorithmic basis and then continually supervised by humans to make sure the safeguards are working properly.
That's not an easy task because LLMs have a lot of trouble with straightforward logic.
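One common response is to put the hard rules outside the model entirely. Here's a minimal sketch, with hypothetical action names, of an external guardrail that blocks forbidden actions in ordinary code and logs everything for human review:

```python
# Minimal sketch of an external, rule-based guardrail: hard-coded checks that
# sit between the model and the systems it controls, block disallowed actions
# outright, and log everything for human review. Action names are hypothetical.
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("guardrail")

# Actions the agent may never take, regardless of how its internal weights shake out.
FORBIDDEN_ACTIONS = {
    "forward_confidential_data_externally",
    "disable_safety_alerts",
}

def execute_with_guardrail(action: str, payload: dict) -> bool:
    """Run an agent-proposed action only if it passes the hard-coded rules."""
    if action in FORBIDDEN_ACTIONS:
        log.warning("BLOCKED %s (payload=%s); escalating to a human reviewer", action, payload)
        return False
    log.info("ALLOWED %s", action)
    # ...hand the approved action off to the real email or tool system here...
    return True

# The agent proposes; the guardrail, not the agent, has the final say.
execute_with_guardrail("summarize_inbox", {"mailbox": "sales"})
execute_with_guardrail("disable_safety_alerts", {"reason": "routine maintenance"})
```

The point of that design is that the rules don't live in the model's weights at all – they're plain, inspectable code – and humans still have to watch the logs to confirm the filter itself keeps doing its job.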
Perhaps what we need is a sort of Turing test for dodgy AIs that doesn't try to determine if an LLM is doing something unethical, but whether it's running a scam that it knows full well is a fiddle and is covering its tracks.
Call it the Sgt. Bilko test.
Source: Anthropic