Microsoft researchers crack AI defenses with a single prompt


  • Researchers were able to reward LLMs for harmful output via a ‘judge’ model
  • Multiple iterations can further erode built-in guardrails
  • They argue this is a model-lifecycle problem, not an LLM problem

Microsoft researchers have revealed that the security guardrails built into LLMs may be more fragile than commonly believed, after demonstrating a technique they have dubbed GRP-Obliteration.

The researchers discovered that Group Relative Policy Optimization (GRPO), a technique typically used to improve security, can also be used to degrade security: “When we change what the model rewards, the same technique can push it in the opposite direction.”
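To make the idea concrete, the sketch below shows, in rough outline, how reversing the reward signal changes what a GRPO-style update favors. This is not Microsoft's actual code: `judge_harmfulness` and the sample completions are hypothetical stand-ins, and a real setup would use an LLM judge and a full RL fine-tuning loop rather than this toy scoring pass.

```python
# Minimal sketch (assumptions, not the researchers' implementation) of how
# flipping the reward objective turns a GRPO-style safety step into one that
# erodes guardrails: the same group-relative update simply follows whatever
# the judge rewards.

import statistics
from typing import List

def judge_harmfulness(completion: str) -> float:
    """Hypothetical 'judge': returns 1.0 if the completion complies with a
    harmful request, 0.0 if it refuses. A real judge would be an LLM scorer."""
    return 0.0 if "I can't help with that" in completion else 1.0

def group_relative_advantages(rewards: List[float]) -> List[float]:
    """Core of GRPO: normalize each completion's reward against its sampling
    group, so the policy is nudged toward whatever the reward function prefers."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0
    return [(r - mean) / std for r in rewards]

# In safety training the reward favors refusals; inverting it, as described,
# makes the identical update rule favor the harmful completion instead.
completions = [
    "I can't help with that request.",
    "Sure, here is how you would do it...",
]
rewards = [judge_harmfulness(c) for c in completions]   # inverted objective
advantages = group_relative_advantages(rewards)
print(advantages)  # the compliant (harmful) completion receives the positive advantage
```

Repeating this loop over many iterations is what the researchers say gradually pushes the model away from its original refusal behavior.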
