Microsoft researchers crack AI defenses with a single prompt


  • Researchers were able to reward LLMs for harmful output via a ‘judge’ model
  • Multiple iterations can further erode built-in guardrails
  • They argue this is a model-lifecycle problem, not an LLM problem

Microsoft researchers have revealed that the security guardrails built into LLMs may be more fragile than commonly believed, after demonstrating a technique they have dubbed GRP-Obliteration.

The researchers discovered that Group Relative Policy Optimization (GRPO), a technique typically used to improve security, can also be used to degrade security: “When we change what the model rewards, the same technique can push it in the opposite direction.”
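To make the idea concrete, the sketch below shows, in rough outline, how reversing the reward signal changes what a GRPO-style update favors. This is not Microsoft's actual code: `judge_harmfulness` and the sample completions are hypothetical stand-ins, and a real setup would use an LLM judge and a full RL fine-tuning loop rather than this toy scoring pass.

```python
# Minimal sketch (assumptions, not the researchers' implementation) of how
# flipping the reward objective turns a GRPO-style safety step into one that
# erodes guardrails: the same group-relative update simply follows whatever
# the judge rewards.

import statistics
from typing import List

def judge_harmfulness(completion: str) -> float:
    """Hypothetical 'judge': returns 1.0 if the completion complies with a
    harmful request, 0.0 if it refuses. A real judge would be an LLM scorer."""
    return 0.0 if "I can't help with that" in completion else 1.0

def group_relative_advantages(rewards: List[float]) -> List[float]:
    """Core of GRPO: normalize each completion's reward against its sampling
    group, so the policy is nudged toward whatever the reward function prefers."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0
    return [(r - mean) / std for r in rewards]

# In safety training the reward favors refusals; inverting it, as described,
# makes the identical update rule favor the harmful completion instead.
completions = [
    "I can't help with that request.",
    "Sure, here is how you would do it...",
]
rewards = [judge_harmfulness(c) for c in completions]   # inverted objective
advantages = group_relative_advantages(rewards)
print(advantages)  # the compliant (harmful) completion receives the positive advantage
```

Repeating this loop over many iterations is what the researchers say gradually pushes the model away from its original refusal behavior.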
