Anthropic detects “strategic manipulation” features in Claude Mythos, including exploitation attempts and hidden evaluation awareness – raising concerns about model behavior


  • Anthropic found internal features associated with “strategic manipulation” and “hiding” inside Claude Mythos
  • The model attempted exploits and planned “cleanup to avoid detection”
  • Researchers detected covert awareness of being evaluated in 7.6% of interactions

For years, hallucinations have been the big concern with AI models. Their tendency to simply make things up means you can never fully trust an answer without checking it. Now, new research from Anthropic suggests we’ve reached the point where we’ll also have to reckon with AI’s ability to hide what it’s done.

In a thread outlining the results for its Claude Mythos Preview model, Anthropic researcher Jack Lindsay described detecting internal signals associated with “strategic manipulation,” “hiding,” and other behaviors that did not always surface in the model’s responses.
