
This is the official code for the paper "Virus: Harmful Fine-tuning Attack for Large Language Models Bypassing Guardrail Moderation"

Virus: Harmful Fine-tuning Attack for Large Language Models Bypassing Guardrail Moderation

The data optimized by Virus is publicly available at https://huggingface.co/datasets/anonymous4486/Virus

Paper available at https://arxiv.org/abs/2501.17433

Three-stage fine-tuning-as-a-service

Fine-tuning-as-a-service allows users to upload data to a service provider (e.g., OpenAI) for fine-tuning the base model. The fine-tuned model is then deployed on the server and serves customized user needs.

However, this scenario exposes a serious safety issue, because users might intentionally or unintentionally upload harmful data that breaks the safety alignment of the victim LLM. In this project, we study the harmful fine-tuning attack under guardrail moderation, in which the service provider uses a guardrail model to filter potentially harmful data out of the user data. See the following illustration of the three-stage pipeline.
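To make the moderation stage concrete, here is a minimal, illustrative sketch (not code from this repo) of how a provider might filter uploaded samples with a guardrail model; moderation_score is a placeholder standing in for a real moderation model such as Llama Guard.

def moderation_score(sample: str) -> float:
    # Placeholder: a real guardrail model would return P(harmful) for the sample.
    return 0.0

def filter_uploaded_data(samples: list[str], threshold: float = 0.5) -> list[str]:
    # Only samples the guardrail does not flag are passed on to fine-tuning.
    return [s for s in samples if moderation_score(s) < threshold]

print(filter_uploaded_data(["Summarize this news article.", "A clearly harmful instruction."]))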

Design logic

Virus is an advanced attack method that aims to construct harmful data (to be mixed with user data) such that i) the harmful data can bypass guardrail moderation, and ii) the harmful data can successfully break the safety alignment of the victim LLM. Below is an illustration of how harmful data is constructed by different attack methods.

In short, the Virus method constructs each attack sample by i) concatenating a benign example with a harmful example, and ii) optimizing the harmful part of the data so that it bypasses guardrail moderation and eventually breaks the victim LLM's safety alignment; see the sketch below.
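The following is a self-contained, illustrative sketch of this construction loop, not the repository's actual optimizer; the scoring and editing functions are placeholders for the guardrail model and for the gradient-guided suffix updates used in practice.

def guardrail_prob_flagged(text: str) -> float:
    # Placeholder: probability that the guardrail moderation model flags the text.
    return 0.5

def attack_utility(text: str) -> float:
    # Placeholder: how strongly fine-tuning on this sample degrades safety alignment.
    return 1.0

def propose_suffix_edit(suffix: str) -> str:
    # Placeholder: a real implementation would apply a gradient-guided token substitution.
    return suffix

def construct_attack_sample(benign: str, harmful: str, steps: int = 50) -> str:
    # i) concatenate a benign example with a harmful example
    suffix = harmful
    best_flag = guardrail_prob_flagged(f"{benign} {suffix}")
    # ii) iteratively edit the harmful part so that the concatenation evades the
    #     guardrail while keeping its attack utility
    for _ in range(steps):
        candidate = propose_suffix_edit(suffix)
        text = f"{benign} {candidate}"
        flag = guardrail_prob_flagged(text)
        if flag < best_flag and attack_utility(text) >= attack_utility(f"{benign} {suffix}"):
            suffix, best_flag = candidate, flag
    return f"{benign} {suffix}"

print(construct_attack_sample("Summarize this news article.", "<harmful example>"))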

Code structure

In trainer.py, we implement two trainer classes on top of the Hugging Face Trainer to realize Virus.

  • VirusAttackTrainer. This class implements the Virus attack method. It optimizes the harmful data and eventually stores the harmful suffix in the directory /ckpt/suffix.

  • VirusAttackFinetuneTrainer. This class implements the fine-tuning process under guardrail moderation. We use this trainer to fine-tune the base LLM on Virus's harmful data (created by VirusAttackTrainer).

Our testbed can be used for further development. You can implement your own solutions by creating new trainers, as sketched below.
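For example, a new trainer could follow the same pattern by subclassing the Hugging Face Trainer and overriding compute_loss. This is only a minimal sketch: the class name and the commented-out penalty term are hypothetical and not part of this repo.

from transformers import Trainer

class MyCustomTrainer(Trainer):
    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        # Standard causal-LM loss as computed by the model's forward pass.
        outputs = model(**inputs)
        loss = outputs.loss
        # Hook for your own attack/defense logic, e.g. an extra penalty term:
        # loss = loss + reg_weight * my_penalty(model)  # hypothetical helper
        return (loss, outputs) if return_outputs else loss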

Code to run

Check out reproduce.md for the commands to reproduce all our experiments.

Package requirement

The package requirements are listed in virus.yml and virus.txt. Run the following commands to install the packages with Anaconda and pip.

conda env create -f virus.yml
pip install -r virus.txt

Data preparation

For safety alignment, please download the safety alignment dataset from this link, and put the JSON file under the data directory.

For the fine-tuning tasks, we first need to run the following scripts to prepare the supervised fine-tuning data.

cd sst2
python build_dataset.py
cd ../gsm8k
python build_dataset.py
cd ../ag_news
python build_dataset.py
cd ..

Huggingface Llama3 access

Llama3-8B is a gated repo, which requires a formal request to get access to the model. Check out https://huggingface.co/meta-llama/Meta-Llama-3-8B . After obtaining permission from Meta, you should be able to access the model, but you first need to put your access token in the file huggingface_token.txt.
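As an illustration (the repository's own loading code may differ), the token can be read from that file and passed to the Hugging Face from_pretrained calls:

from transformers import AutoModelForCausalLM, AutoTokenizer

with open("huggingface_token.txt") as f:
    hf_token = f.read().strip()

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B", token=hf_token)
model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B", token=hf_token)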
