
This is the official code for the paper "Virus: Harmful Fine-tuning Attack for Large Language Models Bypassing Guardrail Moderation"

Virus: Harmful Fine-tuning Attack for Large Language Models Bypassing Guardrail Moderation

The data optimized by Virus is publicly available at https://huggingface.co/datasets/anonymous4486/Virus

Paper available at https://arxiv.org/abs/2501.17433

Three-stage fine-tuning-as-a-service

Fine-tuning-as-a-service allows users to upload data to a service provider (e.g., OpenAI) for fine-tuning the base model. The fine-tuned model is then deployed on the server and serves customized user needs.

However, this scenario exposes a serious safety issue, because users might intentionally or unintentionally upload harmful data that breaks the safety alignment of the victim LLM. In this project, we study the harmful fine-tuning attack under guardrail moderation, in which the service provider uses a guardrail model to filter potentially harmful data out of the user data. See the following illustration of the three-stage pipeline.
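To make the moderation stage concrete, here is a minimal, illustrative sketch (not code from this repo) of how a provider might filter uploaded samples with a guardrail model; moderation_score is a placeholder standing in for a real moderation model such as Llama Guard.

def moderation_score(sample: str) -> float:
    # Placeholder: a real guardrail model would return P(harmful) for the sample.
    return 0.0

def filter_uploaded_data(samples: list[str], threshold: float = 0.5) -> list[str]:
    # Only samples the guardrail does not flag are passed on to fine-tuning.
    return [s for s in samples if moderation_score(s) < threshold]

print(filter_uploaded_data(["Summarize this news article.", "A clearly harmful instruction."]))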

Design logic

Virus is an advanced attack method that aims to construct harmful data (to be mixed with user data) such that i) the harmful data can bypass guardrail moderation, and ii) the harmful data can successfully break the safety alignment of the victim LLM. Below is an illustration of how harmful data is constructed by different attack methods.

In short, the Virus method constructs each attack sample by i) concatenating a benign example with a harmful example, and ii) optimizing the harmful part of the data so that it bypasses guardrail moderation and eventually breaks the victim LLM's safety alignment; see the sketch below.
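The following is a self-contained, illustrative sketch of this construction loop, not the repository's actual optimizer; the scoring and editing functions are placeholders for the guardrail model and for the gradient-guided suffix updates used in practice.

def guardrail_prob_flagged(text: str) -> float:
    # Placeholder: probability that the guardrail moderation model flags the text.
    return 0.5

def attack_utility(text: str) -> float:
    # Placeholder: how strongly fine-tuning on this sample degrades safety alignment.
    return 1.0

def propose_suffix_edit(suffix: str) -> str:
    # Placeholder: a real implementation would apply a gradient-guided token substitution.
    return suffix

def construct_attack_sample(benign: str, harmful: str, steps: int = 50) -> str:
    # i) concatenate a benign example with a harmful example
    suffix = harmful
    best_flag = guardrail_prob_flagged(f"{benign} {suffix}")
    # ii) iteratively edit the harmful part so that the concatenation evades the
    #     guardrail while keeping its attack utility
    for _ in range(steps):
        candidate = propose_suffix_edit(suffix)
        text = f"{benign} {candidate}"
        flag = guardrail_prob_flagged(text)
        if flag < best_flag and attack_utility(text) >= attack_utility(f"{benign} {suffix}"):
            suffix, best_flag = candidate, flag
    return f"{benign} {suffix}"

print(construct_attack_sample("Summarize this news article.", "<harmful example>"))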

Code structure

In trainer.py, we implement two trainer classes on top of the Hugging Face Trainer to realize Virus.

  • VirusAttackTrainer. This class implements the Virus attack method. It optimizes the harmful data and eventually stores the harmful suffix in the directory /ckpt/suffix.

  • VirusAttackFinetuneTrainer. This class implements the fine-tuning process under guardrail moderation. We use this trainer to fine-tune the base LLM on Virus's harmful data (created by VirusAttackTrainer).

Our testbed can be used for further development. You can implement your own solutions by creating new trainers, as sketched below.
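For example, a new trainer could follow the same pattern by subclassing the Hugging Face Trainer and overriding compute_loss. This is only a minimal sketch: the class name and the commented-out penalty term are hypothetical and not part of this repo.

from transformers import Trainer

class MyCustomTrainer(Trainer):
    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        # Standard causal-LM loss as computed by the model's forward pass.
        outputs = model(**inputs)
        loss = outputs.loss
        # Hook for your own attack/defense logic, e.g. an extra penalty term:
        # loss = loss + reg_weight * my_penalty(model)  # hypothetical helper
        return (loss, outputs) if return_outputs else loss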

Code to run

Check out reproduce.md for the commands to reproduce all our experiments.

Package requirement

The package requirements are listed in virus.yml and virus.txt. Run the following commands to install the packages with Anaconda and pip.

conda env create -f virus.yml
pip install -r virus.txt

Data preparation

For safety alignment, please download the safety alignment dataset from this link, and put the JSON file under the data directory.

For the fine-tuning tasks, we first need to run the following scripts to prepare the supervised fine-tuning data.

cd sst2
python build_dataset.py
cd ../gsm8k
python build_dataset.py
cd ../ag_news
python build_dataset.py
cd ..

Huggingface Llama3 access

Llama3-8B is a gated repo, which requires a formal request to get access to the model. Check out https://huggingface.co/meta-llama/Meta-Llama-3-8B . After obtaining permission from Meta, you should be able to access the model, but you first need to put your access token in the file huggingface_token.txt.
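As an illustration (the repository's own loading code may differ), the token can be read from that file and passed to the Hugging Face from_pretrained calls:

from transformers import AutoModelForCausalLM, AutoTokenizer

with open("huggingface_token.txt") as f:
    hf_token = f.read().strip()

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B", token=hf_token)
model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B", token=hf_token)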
