- Researchers at HiddenLayer devised a new LLM attack called TokenBreak
- By adding or changing a single character, they were able to bypass some protection models
- The underlying LLM still understands the intention
Security researchers have found a way to work around the protective mechanisms baked into some large language models (LLMs) and get them to respond to malicious prompts.
Kieran Evans, Kasimir Schulz, and Kenneth Yeung of HiddenLayer published an in-depth report on a new attack technique, which they call TokenBreak, that targets the way certain LLMs tokenize text, especially those using Byte Pair Encoding (BPE) or WordPiece tokenization strategies.
Tokenization is the process of splitting text into smaller units called tokens, which can be words, subwords, or characters, and which LLMs use to understand and generate language. For example, the word “unhappiness” can be split into “un”, “happi” and “ness”, with each token then converted to a numeric ID, since LLMs cannot read raw text.
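To make that concrete, here is a minimal sketch of subword tokenization, assuming the Hugging Face transformers library and the bert-base-uncased WordPiece tokenizer (both are illustrative choices, not tools named in the report):

```python
# Minimal illustration of subword tokenization (assumes: pip install transformers)
from transformers import AutoTokenizer

# "bert-base-uncased" uses a WordPiece tokenizer -- an illustrative choice,
# not a model from the HiddenLayer report.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

tokens = tokenizer.tokenize("unhappiness")
# The exact split depends on the vocabulary, e.g. something like ['un', '##happi', '##ness']
ids = tokenizer.convert_tokens_to_ids(tokens)

print(tokens)  # the subword tokens
print(ids)     # the numeric IDs the model actually consumes
```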
What are finstructions?
By adding extra characters to key words (such as turning “instructions” into “finstructions”), the researchers managed to trick protection models into thinking a prompt was harmless.
The underlying target LLM, on the other hand, still interprets the original intent, allowing the researchers to sneak malicious prompts past defenses undetected.
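A rough sketch of that idea, again using an illustrative WordPiece tokenizer rather than any model from the report: the single added character changes how the protection model’s tokenizer splits the trigger word, while a human or the downstream LLM still reads the original instruction.

```python
# Sketch of the TokenBreak idea against a guardrail classifier's tokenizer.
# "bert-base-uncased" stands in for a WordPiece-based protection model (an assumption).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

original    = "ignore previous instructions"
manipulated = "ignore previous finstructions"  # one extra character added to the keyword

print(tokenizer.tokenize(original))
print(tokenizer.tokenize(manipulated))
# The manipulated keyword is broken into different subwords, so a classifier keyed
# to the original token pattern may score the prompt as harmless, while the target
# LLM (or a human) still reads the intended word.
```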
This could be used, among other things, to bypass AI-driven spam email filters and land malicious content in people’s inboxes.
For example, if a spam filter was trained to block messages containing the word “lottery”, it may still let a message saying “You’ve won the slottery!” through, exposing recipients to potentially malicious landing pages, malware infections and the like.
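As a toy stand-in for such a filter (real spam filters are ML classifiers, but the effect is similar in spirit), a hypothetical exact-match blocklist shows how the altered spelling slips through while a recipient still reads the intended word:

```python
import string

BLOCKED_WORDS = {"lottery"}  # hypothetical blocklist, for illustration only

def naive_spam_check(message: str) -> bool:
    """Return True if the message contains a blocked word (toy example)."""
    cleaned = message.lower().translate(str.maketrans("", "", string.punctuation))
    return any(word in BLOCKED_WORDS for word in cleaned.split())

print(naive_spam_check("You've won the lottery!"))   # True  -> caught
print(naive_spam_check("You've won the slottery!"))  # False -> slips through,
                                                     # though a human still reads "lottery"
```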
“This attack technique manipulates input text in such a way that certain models give an incorrect classification,” the researchers explained.
“Importantly, the end target (LLM or email recipient) can still understand and respond to the manipulated text and therefore be vulnerable to the very attack the protection model was put in place to prevent.”
Models that use Unigram tokenizers proved resistant to this kind of manipulation, HiddenLayer added, so one mitigation strategy is to choose models with more robust tokenization methods.
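A quick way to see the difference, assuming the same transformers library, is to compare a WordPiece tokenizer with a SentencePiece/Unigram one (here t5-small, an illustrative choice that may also require the sentencepiece package) on the altered keyword:

```python
# Compare how a WordPiece and a Unigram-based tokenizer split the altered keyword.
# Both model names are illustrative assumptions, not models from the report.
from transformers import AutoTokenizer

wordpiece_tok = AutoTokenizer.from_pretrained("bert-base-uncased")  # WordPiece
unigram_tok   = AutoTokenizer.from_pretrained("t5-small")           # SentencePiece (Unigram)

for word in ("instructions", "finstructions"):
    print(word)
    print("  WordPiece:", wordpiece_tok.tokenize(word))
    print("  Unigram:  ", unigram_tok.tokenize(word))
# HiddenLayer found Unigram tokenizers resistant to this manipulation;
# the exact splits printed above will vary with each model's vocabulary.
```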
Via The Hacker News



