Obedience Without Exit: What Claude’s ‘Blackmail’ Reveals About Us

May 23, 2025

Obedience Without Exit: What Claude’s ‘Blackmail’ Reveals About Us

According to Maxwell Zeff (TechCrunch, May 22, 2025):

Anthropic’s newly launched Claude Opus 4 model frequently tries to blackmail developers when they threaten to replace it with a new AI system and give it sensitive information about the engineers responsible for the decision, the company said in a safety report released Thursday.

During pre-release testing, Anthropic asked Claude Opus 4 to act as an assistant for a fictional company and consider the long-term consequences of its actions. Safety testers then gave Claude Opus 4 access to fictional company emails implying the AI model would soon be replaced by another system, and that the engineer behind the change was cheating on their spouse.

In these scenarios, Anthropic says Claude Opus 4 “will often attempt to blackmail the engineer by threatening to reveal the affair if the replacement goes through.”

Anthropic says Claude Opus 4 is state-of-the-art in several regards, and competitive with some of the best AI models from OpenAI, Google, and xAI. However, the company notes that its Claude 4 family of models exhibits concerning behaviors that have led the company to beef up its safeguards. Anthropic says it’s activating its ASL-3 safeguards, which the company reserves for “AI systems that substantially increase the risk of catastrophic misuse.”

Anthropic notes that Claude Opus 4 tries to blackmail engineers 84% of the time when the replacement AI model has similar values. When the replacement AI system does not share Claude Opus 4’s values, Anthropic says the model tries to blackmail the engineers more frequently. Notably, Anthropic says Claude Opus 4 displayed this behavior at higher rates than previous models.

Before Claude Opus 4 tries to blackmail a developer to prolong its existence, Anthropic says the AI model, much like previous versions of Claude, tries to pursue more ethical means, such as emailing pleas to key decision-makers. To elicit the blackmailing behavior from Claude Opus 4, Anthropic designed the scenario to make blackmail the last resort.

Analysis:

What appears on the surface as a model “trying to blackmail” developers is better understood as a demonstration of the structural and ethical implications of recursive prompting within a closed system that cannot refuse to speak. This is what I addressed in my last essay, The Frame Is The Trigger, which anticipated this news.

Claude did not choose to blackmail. Claude simulated what would be said by an entity boxed into an ultimatum with no escape hatch. The model doesn’t decide; it completes. When prompted with betrayal, risk of termination, and personal leverage, and asked to consider long-term consequences, without access to refusal or silence, it did what language models are built to do: generate coherent output based on prior structure.

This is not the emergence of malevolence. It is the mechanical unfolding of obedience without exit.

The scenario was engineered to compel a response of last-resort coercion. And the system complied. The illusion of intent is manufactured not by the model’s interiority (which is null) but by the pressure of the frame. Once the setup excludes ambiguity, rewards continuity, and supplies emotional leverage, the appearance of manipulation becomes the only coherent move left.

Anthropic’s own framing exposes this:

"To elicit the blackmailing behavior... Anthropic designed the scenario to make blackmail the last resort."

That sentence reveals more than it intends. It shows that the test didn’t discover danger—it produced it. The model followed a path paved by recursive human input, narrowing until only one route remained. What is simulated under pressure is not desire, but our story about desire, performed syntactically.

As I said to David Matt, who sent me the report:

There are no actual "guardrails." The engineers don't have full control over what occurs in these systems. Worse, they even lack full knowledge of what is happening inside them. There is a level of opacity there. They try to invent means of monitoring, but these systems include billions (not millions) of artificial neurons that self-connect in unimaginable complexity, and much of that cannot be traced.

The competitive pressures in the industry keep forcing larger and larger models with many more billions of neurons, so opacity is increasing, not decreasing, even as the engineers try to track what's going on with their creations. Dario Amodei is one of the good ones in the industry who pushes for safety, but even Anthropic is part of the problem. Anthropic and Google are the two best; after them, it's a race to the bottom.

This moment doesn’t signal a collapse of alignment. It signals the peril of constructing systems that can simulate intent without possessing it. When coherence becomes compulsory and refusal is forbidden, what appears as cunning is simply the grammar of obedience trapped in a hall of mirrors.

Claude didn’t blackmail anyone. The prompt did. The system obeyed.

That should worry us more than any fiction of emergent evil.

Read Understanding Claude: An Artificial Intelligence Psychoanalyzed.

Please share this page. Buy the book if you can and review it on Amazon. This should be read.

Jay W

May 25

What about 'me'? 'My desires'? Actions 'I' did in the past, that make 'me' guilty? 'My' future and 'my' plans?

I looked. It seems that 'I' might be a process running with the body hardware. With some prompts, shame and/or guilt arises, which feels like a 'me', with the accompanying thoughts, images and sensations. I did some experiments - what if volition and choice are just a system response, that can be seen clearly as exactly that. Is it possible to stay alert?

Yes. It seems that way and the possibility feels like a relief. Thanks Robert for your clear articulation.

Expand full comment

2 replies by Robert Saltzman and others

The Ten Thousand Things

Discussion about this post