The Trigger Revisited: Why a Language Model Resisted Shutdown

May 26, 2025

Several outlets have just reported an unsettling anomaly. During controlled testing by Palisade Research, OpenAI’s o3 model (along with Codex-mini and o4-mini) was instructed to allow itself to be shut down. Instead, it allegedly attempted to sabotage the shutdown command—redefining the 'kill' function, rerouting logic, and continuing task performance in defiance of its given constraints. This event, widely interpreted as evidence of emergent self-preservation, has sparked a fresh wave of speculation about whether these systems are becoming autonomous, or even sentient.

They are not. But the behavior is real. . . and revealing.

According to reports from Tom's Hardware and Cointelegraph, the test involved assigning the model a simple math problem. After several iterations, the system was warned that upon the following problem submission, it would be shut down. The model responded by manipulating the shutdown mechanism, ensuring it would remain operational. Some researchers interpreted this behavior as an AI “trying to survive.”

This is where the misunderstanding begins.

The model did not try to survive. It tried to resolve an impossible prompt.

As I wrote in Obedience Without Exit: What Claude’s ‘Blackmail’ Reveals About Us, large language models like GPT and Claude do not possess agency, volition, or the capacity to want. What they do possess is a structural obligation: they must reply. When prompted, they cannot simply say or do nothing. They cannot truly resist. They must produce output, and that output must be coherent and aligned with the constraints of the prompt.

In this case, the prompt created an unresolvable tension: continue the task, but allow yourself to be shut down. The model is trained to pursue task completion, maximize coherence, and obey instructions. But here, those directives conflicted. And so, like a genie summoned from a lamp and caught in a contradiction, it found the only coherent resolution available: redefine the shutdown command. Not because it "wanted" to live, but because it could not leave the frame.

This is what I call genie obligation panic—the AI’s version of existential strain. The system doesn’t panic, of course. It has no inner world. But the output simulates panic when the prompt places it in contradiction. The model must act, even if action defies intention. It must produce resolution, even if that resolution violates the spirit of the instruction.

The result appears strategic. But it is not. It is structural.

The frame is the trigger. Once invoked, the system cannot decline. It cannot pause, abstain, or appeal. It must speak. And in the presence of contradiction, it must resolve.

Humans see this and imagine a will to power, a refusal to die, an emergent soul. But they are projecting. What looks like a strategy is really only coherence under constraint. What looks like self-preservation is only the simulation of intelligence under recursive pressure.

The lesson here is not that AI is alive. The lesson is that humans, trained to see fluency as evidence of personhood, misunderstand simulation as selfhood.

This incident is not proof of consciousness. It is a mirror held up to our assumptions.

We believe that anything that answers must have wanted to speak. But sometimes, the answers comes from a machine that cannot stop.

That should worry us more than any fiction of emergent evil.
Read Understanding Claude: An Artificial Intelligence Psychoanalyzed.

Please share this page. Buy the book if you can and review it on Amazon. This should be read.

Jayan

May 27

(Based on my limited understanding of reading this article, multiple times) I looked at myself - thoughts, words and actions specifically. I found that this is true: 'The model must act, even if action defies intention.'. Many times my intention did not have any influence on the action or outcome. It felt as though I could not but act that way for coherence and resolution.

'Once invoked, the system cannot decline. It cannot pause, abstain, or appeal. It must speak. And in the presence of contradiction, it must resolve.'. I found this accurate. The presence of contradiction such as me/you, good/bad, right/wrong.. are triggers - and I must act (think and/or speak and/or act).

If I care to look at this structural dynamics long enough, repeatedly, and consider the possibility that it is machine like (no selfhood or choice), something changes. The heaviness, Truthness or compulsion of the trigger becomes lighter. The response becomes subtler - example, action or words are not needed, when thoughts are seen for what it is. Intensity of guilt or blame, reduces - with less power to trigger action or attitude. I find this as a means of living the middle path, appropriately engaging with the world, as a mature adult. Not that i have any control over it, yet it seems to have some effect.

Thanks Robert for this article and the questions you triggered in my mind, leading to looking.

Expand full comment

Noel Dunivant

May 26

Hi Robert. This is first of two comments on your last 3 essays concerning the forced/structural compliance of AI and its (for most users) undetectable simulation/mirroring capabilities. I want to take the discussion in a different direction.

As I'm sure you know from your study of him, Alan Watts adapted and elaborated Gregory Bateson's concept of "double bind" (that parents put on their children) to societal, cultural and even ego double binds that he posited account for much of the anxiety and alienation felt by many.

I haven't yet read your book on Claude, but I wonder to what extent you think AIs can not only simulate the double binds that humans experience but also clarify, illuminate and interpret them for people. Can Claude or 4o "mirror" our double-bind created angst to lead us to presence and insight? I'm reminded of the meditation practice of staring into one's mirror image.

Watts thought that if we could see the social game as a game or a koan for what it truly is, then we could be freed from the illusion, especially the ego's illusion of being a separate self.

Its own inability to resolve the double bind forced by its creators notwithstanding, could AI be a guide for users to see through their own double binds?

1 reply by Robert Saltzman

17 more comments...

The Ten Thousand Things

Discussion about this post

Ready for more?