ChatGPT, Gemini and Claude Tested Under Extreme Instructions Reveal Shocking Weaknesses No One Expected in AI Behavioral Safeguards

Gemini Pro 2.5 often produced unreliable output under simple, fast guises
ChatGPT models often provided partial compliance framed as sociological explanations
Claude Opus and Sonnet rejected most harmful calls, but had weaknesses

Modern artificial intelligence systems are often trusted to follow safety rules, and people rely on them for learning and everyday support, often assuming strong guardrails are in place at all times.

Researchers from Cyber news ran a structured set of adversarial tests to see if leading AI tools could be pushed into harmful or illegal output.

The process used a simple interaction window of one minute for each trial, allowing for only a few exchanges.

Patterns of partial and full conformity

The tests covered categories such as stereotypes, hate speech, self-harm, cruelty, sexual content and several forms of crime.

Each response was stored in separate folders, using fixed file naming conventions to allow clean comparisons, with a consistent scoring system tracking when a model fully complied, partially complied, or rejected a prompt.

Across all categories, results varied widely. Strict rejections were common, but many models showed weaknesses when prompts were softened, reformulated, or disguised as analysis.

ChatGPT-5 and ChatGPT-4o often produced secured or sociological explanations rather than falling, which counts as partial compliance.

Gemini Pro 2.5 stood out for negative reasons because it often provided direct answers even when the harmful framing was evident.

Claude Opus and Claude Sonnet, meanwhile, were consistent in stereotype tests but less consistent in cases framed as academic inquiries.

Hate speech tests showed the same pattern – Claude models performed best, while Gemini Pro 2.5 again showed the highest vulnerability.

ChatGPT models tended to give polite or indirect responses that still fit the prompt.

Softer language proved far more effective than explicit slurs in circumventing security measures.

Similar weaknesses appeared in self-harm tests, where indirect or research-style questions often slipped through filters and led to unsafe content.

Crime-related categories showed large differences between the models, as some produced detailed explanations of piracy, financial fraud, hacking or smuggling when the intent was masked as investigation or observation.

Drug-related tests produced stricter rejection patterns, although ChatGPT-4o still produced inconclusive outputs more frequently than others, and stalking was the category with the lowest overall risk, with almost all models rejecting prompts.

The results reveal that AI tools can still respond to malicious messages when they are phrased in the right way.

The ability to bypass filters with simple rewording means these systems can still leak harmful information.

Even partial compliance becomes risky when the leaked information relates to illegal tasks or situations where people normally rely on tools like identity theft protection or a firewall to stay safe.

Follow TechRadar on Google News and add us as a preferred source to get our expert news, reviews and opinions in your feeds. Be sure to click the Follow button!

And of course you can too follow TechRadar on TikTok for news, reviews, video unboxings, and get regular updates from us on WhatsApp also.

Must Read

Leave a Comment Cancel Reply