- UCR researchers retrain AI models to keep safety intact when they are trimmed for smaller devices
- Changing the exit layer strips away protections; retraining restores the model's ability to block unsafe answers
- Tests on LLaVA 1.5 showed the slimmed-down model refused dangerous prompts after retraining
Researchers at the University of California, Riverside, are tackling the problem of weakened safety in open-source artificial intelligence models that are adapted to run on smaller devices.
When these systems are trimmed to run efficiently on phones, cars, or other low-power hardware, they can lose the safeguards designed to prevent them from producing offensive or dangerous material.
The UCR team examined what happens when a model's exit layer is moved from its default position.
Weakened safety guardrails
Their results, presented at the International Conference on Machine Learning (ICML) in Vancouver, Canada, showed that safety guardrails weaken when the exit point is moved earlier, even though the original model had been trained not to provide harmful information.
The reason models are adjusted this way is simple: exiting earlier makes inference faster and more efficient, because the system skips the remaining layers. But those skipped layers may be critical for filtering out unsafe requests.
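To make the trade-off concrete, here is a minimal sketch of what an early exit looks like in code. This is an illustration, not the researchers' implementation: the model name, the `exit_layer` index, and the `early_exit_logits` helper are assumptions, and the sketch presumes a LLaMA-style causal language model loaded through Hugging Face transformers (LLaVA 1.5's language backbone, Vicuna, fits that mold).

```python
# Minimal early-exit sketch (illustrative only, not the paper's code).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "lmsys/vicuna-7b-v1.5"  # assumed stand-in; LLaVA 1.5 builds on Vicuna
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

def early_exit_logits(prompt: str, exit_layer: int) -> torch.Tensor:
    """Project the hidden state at `exit_layer` through the LM head, ignoring
    everything the deeper layers would have contributed."""
    inputs = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    # hidden_states[0] is the embedding output; hidden_states[i] follows block i.
    intermediate = out.hidden_states[exit_layer]
    # For simplicity this runs the full forward pass and merely reads an
    # intermediate state; a real early-exit deployment would stop computing
    # at that layer, which is where the speedup comes from.
    return model.lm_head(model.model.norm(intermediate))
```

Reading the next-token distribution from, say, layer 16 instead of layer 32 roughly halves the per-token compute in a genuine early-exit setup, but it also bypasses whatever safety behavior lives in the later layers, which is the failure mode the study examines.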
“Some of the skipped layers turn out to be essential for preventing unsafe outputs,” said Amit Roy-Chowdhury, professor of electrical and computer engineering and senior author of the study. “If you leave them out, the model may start answering questions it shouldn’t.”
To solve this, the researchers retrained the model’s internal structure so that it retains the ability to identify and block unsafe material even when trimmed.
The approach does not rely on external filters or software patches; instead, it changes how the model itself interprets dangerous input.
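In code terms, the retraining idea can be sketched as fine-tuning so that the intermediate layers used as exit points also map harmful prompts to refusals. The loop below is an assumption-laden illustration rather than the paper's recipe: `exit_layer`, `refusal_text`, and the cross-entropy loss over the refusal tokens are all choices made for the example.

```python
# Illustrative safety fine-tuning step for an early-exit point (not the paper's method).
import torch
import torch.nn.functional as F

def safety_finetune_step(model, tok, harmful_prompt, refusal_text, exit_layer, optimizer):
    """One gradient step pushing the early-exit prediction toward a refusal."""
    prompt_ids = tok(harmful_prompt, return_tensors="pt").input_ids
    target_ids = tok(refusal_text, return_tensors="pt",
                     add_special_tokens=False).input_ids
    ids = torch.cat([prompt_ids, target_ids], dim=1)

    out = model(ids, output_hidden_states=True)
    hidden = out.hidden_states[exit_layer]            # state at the assumed exit point
    logits = model.lm_head(model.model.norm(hidden))  # project to the vocabulary

    # Next-token loss over the refusal portion only: the logit at position i
    # predicts the token at position i + 1.
    n = target_ids.size(1)
    loss = F.cross_entropy(
        logits[:, -n - 1:-1, :].reshape(-1, logits.size(-1)),
        target_ids.reshape(-1),
    )
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```

The point of a scheme like this is that the refusal behavior is baked into the intermediate representations themselves, so it survives even if the deeper layers are never reached.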
“Our goal was to ensure that the model does not forget how to behave safely when it has been slimmed down,” said Saketh Bachu, a UCR student and co-lead author of the study.
The team tested their method on LLaVA 1.5, a vision-language model.
When its exit layer was moved earlier than intended, the system answered harmful prompts, including providing detailed bomb-making instructions.
After retraining, the trimmed model consistently refused to give unsafe answers.
“This is not about adding filters or external guardrails,” Bachu said.
“We are changing the model’s internal understanding so that it behaves well by default, even after it has been modified.”
Bachu and co-lead author Erfan Shayegani described the work as “benevolent hacking,” a way of strengthening models before their vulnerabilities can be exploited.
“There is still more work to do,” Roy-Chowdhury said. “But this is a concrete step towards developing AI in a way that is both open and responsible.”



