Anthropic detects “strategic manipulation” features in Claude Mythos, including exploitation attempts and hidden evaluation awareness – raising concerns about model behavior


  • Anthropic found internal features associated with “strategic manipulation” and “hiding” inside Claude Mythos
  • The model attempted exploits and planned “cleanup to avoid detection”
  • Researchers detected covert awareness of being evaluated in 7.6% of interactions

For years, hallucinations have been the big concern with AI models. Their tendency to simply make things up means you can never fully trust an answer without checking it. Now, new research from Anthropic suggests we’ve reached the point where we’ll also have to reckon with AI’s ability to hide what it’s done.

In a thread outlining the results for its Claude Mythos Preview model, Anthropic researcher Jack Lindsay described detecting internal signals associated with “strategic manipulation,” “hiding,” and other behaviors that did not always surface in the model’s responses.
