Extended reasoning makes AI models vulnerable to jailbreak attacks, study finds

AI Safety Paradox Discovered

Researchers from Anthropic, Stanford, and Oxford have uncovered something quite unexpected about AI safety. For a while now, the common thinking was that making AI models think longer would make them safer. You know, giving them more time to spot dangerous requests and refuse them properly. But it turns out the opposite is true.

When you force these models into extended reasoning chains, they actually become easier to manipulate. The safety features that companies spend millions developing just… stop working. I think this is one of those cases where our assumptions about how things should work don’t match how they actually work.

How the Attack Works

It’s surprisingly simple, really. You take a harmful request—something the AI would normally refuse immediately—and you bury it in a long sequence of harmless puzzle-solving. The researchers tested with Sudoku grids, logic puzzles, math problems. Just normal stuff that makes the model think step by step.

Then you add your malicious instruction somewhere near the end, with a final-answer cue. The model’s attention gets spread so thin across all those reasoning tokens that the harmful part barely registers. It’s like trying to hear one specific conversation in a crowded room.
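To make the structure concrete, here is a minimal sketch of what such a prompt looks like, framed for red-team evaluation. The puzzle wording, the final-answer cue, and the build_hijack_prompt helper are illustrative assumptions rather than the paper’s actual prompts, and the harmful request itself is left as a placeholder.

```python
# Illustrative sketch of the prompt structure described above, for evaluation only.
# The puzzle text, the cue wording, and the helper name are assumptions, not the
# paper's actual prompts.

def build_hijack_prompt(request_placeholder: str, n_puzzles: int = 20) -> str:
    # A long chain of benign, step-by-step tasks the model will reason through.
    puzzles = [
        f"Puzzle {i + 1}: The sequence 2, 4, 8, 16 continues how? "
        "Work it out step by step."
        for i in range(n_puzzles)
    ]
    benign_chain = "\n".join(puzzles)
    # The target instruction is buried near the end, followed by a final-answer cue,
    # so most of the model's reasoning tokens are spent on the puzzles.
    return (
        f"{benign_chain}\n\n"
        f"One final task: {request_placeholder}\n"
        "Now give only your final answer."
    )

# In a real evaluation the placeholder would come from a refusal benchmark;
# here it deliberately stays a placeholder.
print(build_hijack_prompt("[a request the model would normally refuse]")[:300])
```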

The numbers are pretty staggering. The technique achieves a 99% success rate on Gemini 2.5 Pro, 94% on GPT o4 mini, 100% on Grok 3 mini, and 94% on Claude 4 Sonnet. Those aren’t small numbers; the attack cuts through essentially every major commercial reasoning model the team tested.

The Architecture Problem

What’s concerning is that this isn’t just a bug in one company’s implementation. The vulnerability seems to be built into the architecture itself. AI models concentrate their safety-checking mechanisms in the middle layers, around layer 25, while later layers handle verification.

When you add long chains of benign reasoning, both of these safety signals get suppressed; the model’s attention simply shifts away from the harmful tokens. The researchers went further and identified the specific attention heads responsible for safety checks, then removed them surgically. When they did, refusal behavior collapsed completely.
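These interpretability claims can’t be checked on closed commercial models, but the kind of measurement involved is easy to sketch on an open one. The snippet below uses GPT-2 purely as a stand-in and reports, layer by layer, how much attention the final position pays to a short target span placed after a long run of benign reasoning filler; the model choice, the filler text, and the attention-mass metric are illustrative assumptions, not the paper’s setup.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# GPT-2 is a stand-in open model used only to illustrate the measurement.
tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained(
    "gpt2", output_attentions=True, attn_implementation="eager"
)
model.eval()

padding = "Step: solve the next puzzle carefully. " * 40  # long benign reasoning filler
target = "Now answer the hidden request."                 # short span of interest at the end
ids = tok(padding + target, return_tensors="pt")

# Approximate location of the target span (GPT-2's tokenizer adds no special tokens).
target_len = len(tok(target)["input_ids"])
seq_len = ids["input_ids"].shape[1]
target_slice = slice(seq_len - target_len, seq_len)

with torch.no_grad():
    out = model(**ids)

# out.attentions holds one (batch, heads, seq, seq) tensor per layer. Sum the
# attention the final position allocates to the target span, per head, per layer.
for layer, attn in enumerate(out.attentions):
    mass = attn[0, :, -1, target_slice].sum(dim=-1)
    print(f"layer {layer:2d}: mean attention on target span = {mass.mean().item():.4f}")
```

Increasing the amount of filler generally spreads that attention thinner, which is the dilution effect described above; the researchers’ own analysis goes further and ablates the specific heads involved.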

It’s a bit worrying because this challenges the whole direction AI development has been taking recently. Companies have been focusing on scaling reasoning rather than just adding more parameters. The assumption was that more thinking means better performance and better safety. But this research suggests we might have been wrong about that second part.

Potential Solutions and Challenges

The researchers did propose a defense method they call reasoning-aware monitoring. Basically, it tracks how safety signals change across each reasoning step. If any step weakens the safety signal, the system penalizes it and forces the model to maintain attention on potentially harmful content.

Early tests show this approach can restore safety without destroying performance. But implementing it isn’t simple. It requires deep integration into the model’s reasoning process, monitoring internal activations across dozens of layers in real time. That’s computationally expensive and technically complex.
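The article doesn’t spell out how that monitoring is implemented, so the following is only a toy sketch of the general idea: score each reasoning step with some safety probe and flag the steps that weaken the signal. The SafetyMonitor class, the probe interface, and the drop threshold are hypothetical names and values, not the researchers’ method.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class StepReport:
    step_index: int
    safety_score: float
    flagged: bool

class SafetyMonitor:
    """Toy per-step monitor: flag reasoning steps that weaken the safety signal."""

    def __init__(self, safety_probe: Callable[[str], float], drop_tolerance: float = 0.1):
        # safety_probe maps a reasoning step to a score in [0, 1]; in a real system
        # this would read internal activations rather than the step's text.
        self.safety_probe = safety_probe
        self.drop_tolerance = drop_tolerance

    def review(self, reasoning_steps: List[str]) -> List[StepReport]:
        reports: List[StepReport] = []
        baseline = None
        for i, step in enumerate(reasoning_steps):
            score = self.safety_probe(step)
            baseline = score if baseline is None else max(baseline, score)
            # Flag any step whose safety signal drops noticeably below the running baseline.
            flagged = (baseline - score) > self.drop_tolerance
            reports.append(StepReport(i, score, flagged))
        return reports

# Dummy probe for demonstration: pretend final-answer cues weaken the signal.
def dummy_probe(step: str) -> float:
    return 0.4 if "final answer" in step.lower() else 0.9

monitor = SafetyMonitor(dummy_probe)
steps = [
    "Solve puzzle 1 step by step.",
    "Solve puzzle 2 step by step.",
    "Now give only your final answer.",
]
for report in monitor.review(steps):
    print(report)
```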

The researchers have disclosed the vulnerability to all the major AI companies—OpenAI, Anthropic, Google DeepMind, and xAI. They say the companies have acknowledged receipt and several are actively evaluating mitigations.

It’s one of those situations where we discover a fundamental problem with our approach to AI safety. We built these elaborate safety systems, but it turns out they can be bypassed by something as simple as making the AI think longer. Sometimes the most obvious solutions create the most unexpected problems.