Compose RL is a framework for Reinforcement Learning with Human Feedback (RLHF), designed to streamline the integration of various RLHF techniques. By leveraging the flexibility of MosaicML's Composer and LLM Foundry, Compose RL enables modular and composable experimentation, catering to researchers and practitioners exploring RLHF methodologies.
Note: This repository is currently in alpha. Expect frequent changes and potential instabilities. Use at your own risk!
- Composable RLHF Training:
  - Experiment with Proximal Policy Optimization (PPO), Direct Preference Optimization (DPO) and its variants, and reward model training.
  - Seamlessly integrate custom components or replace default implementations.
- Powered by Databricks MosaicML:
  - This repo utilizes Composer, an open-source deep learning training library optimized for scalability and usability. Composer simplifies the implementation of distributed training workflows on large-scale clusters, abstracting complexities like parallelism techniques, distributed data loading, and memory optimization. This repo also leverages LLM Foundry, an open-source repository containing code for training large language models (LLMs), to help enable rapid experimentation with the latest techniques.
- Modularity:
  - Decoupled components for policy optimization, preference modeling, and reward model training.
  - Flexible configurations to adapt to a variety of RLHF workflows.
You'll find in this repo:
- compose-rl - source code for models, datasets, and utilities
- scripts - scripts for generating data
- yamls - yamls for training runs
Clone the repository and install the dependencies:
git clone https://github.com/databricks/Compose-RL.git
cd Compose-RL
pip install -e .[gpu]
python3 -m spacy download en_core_web_sm
We highly recommend using the llm-foundry Docker images and installing Compose RL, LLM Foundry, and their dependencies on top of them.
Note: when using the LLM Foundry images, please install LLM Foundry yourself, as it is not included in the images by default.
Here is an end-to-end workflow covering data preparation and launching model training.
Below are the commands to prepare datasets in the Mosaic Data Shard (MDS) format -- a pre-tokenized version of the data -- which we will use for training.
Below is the command to prepare preference data -- which can be used for reward model or offline RL (e.g. DPO) training:
cd scripts
python data/unified_tokenize_dataset.py --dataset_name allenai/ultrafeedback_binarized_cleaned \
--local_dir pref_data \
--dataset_type preference \
--tokenizer_name meta-llama/Llama-3.1-8B-Instruct \
--split train_prefs
Below is the command to prepare prompt data -- which can be used for online RL (e.g. PPO) training:
cd scripts
python data/unified_tokenize_dataset.py --dataset_name allenai/ultrafeedback_binarized_cleaned \
--local_dir prompt_data \
--dataset_type single_prompt \
--tokenizer_name meta-llama/Llama-3.1-8B-Instruct \
--split train_prefs
Below are the commands to launch training runs, assuming you ran the data preparation scripts above. These commands also assume that we are in the root directory where both Compose RL and LLM Foundry were cloned, since we utilize LLM Foundry's Registry System to take advantage of existing features in LLM Foundry.
Reward Model Training
Below is the command to run reward model training:
composer llm-foundry/scripts/train/train.py \
compose-rl/yamls/local_reward.yaml \
train_loader.dataset.local=/compose-rl/scripts/pref_data/ \
train_loader.dataset.split=train_prefs
DPO Training
Below is the command to run for DPO training (along with its variants):
composer llm-foundry/scripts/train/train.py \
compose-rl/yamls/local_dpo.yaml \
train_loader.dataset.local=/compose-rl/scripts/pref_data/ \
train_loader.dataset.split=train_prefs
Beyond DPO itself, we support several of its variants, including: Reward Aware Preference Optimization (RPO), REgression to RElative REward Based RL (REBEL), Identity-PO (IPO), and Kahneman-Tversky Optimization (KTO).
PPO Training
Below is the command to run Online PPO training:
composer llm-foundry/scripts/train/train.py \
compose-rl/yamls/local_ppo.yaml \
train_loader.dataset.local=/compose-rl/scripts/prompt_data/ \
train_loader.dataset.split=train_prefs
Adding new data processing
In the unified_tokenize_dataset.py script, we can add new data processing capabilities by modifying the __iter__ function. Typically, all datasets are pre-tokenized before model training.
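As a rough illustration, a new dataset type can be handled with an extra branch inside __iter__. The sketch below is hypothetical: the class name, dataset_type value, column names, and output schema are assumptions for illustration and are not the script's actual API; the real output should match whatever schema the existing preference / single_prompt branches produce.

# Hypothetical sketch of a new dataset_type branch; names are illustrative only.
import numpy as np

class ExampleTokenizedDataset:  # stand-in for the script's real dataset class
    def __init__(self, dataset, tokenizer, dataset_type, max_length=2048):
        self.dataset = dataset
        self.tokenizer = tokenizer
        self.dataset_type = dataset_type
        self.max_length = max_length

    def __iter__(self):
        for sample in self.dataset:
            if self.dataset_type == 'my_new_type':  # the new branch you would add
                tokens = self.tokenizer(
                    sample['text'],  # assumed column name
                    truncation=True,
                    max_length=self.max_length,
                )['input_ids']
                # Pre-tokenized ids, ready to be written out as MDS shards
                yield {'tokens': np.asarray(tokens, dtype=np.int64)}
            else:
                # Existing preference / single_prompt handling would live here
                raise NotImplementedError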
Creating new reward models
In order to modify the training loss for reward models, we need to define a new Enum here and update the pairwise_loss function. Modifying the loss will result in a different loss_type in the model yaml here.
Additionally, in order to modify the reward model architecture for Hugging Face models, we need to update AutoModelForCausalLMWithRM here. For any model config updates, we can update the config here; these config updates will apply to the model yaml here.
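To show the shape of such a change, here is a minimal, hypothetical sketch of a pairwise loss dispatch with a new Enum member; the enum name, function signature, and the margin variant itself are assumptions for illustration, not Compose RL's actual code.

# Hypothetical sketch of wiring a new pairwise reward loss into an enum dispatch.
from enum import Enum

import torch.nn.functional as F

class RewardLossType(Enum):   # assumed enum; add the new member here
    BT = 'bt'                 # Bradley-Terry style pairwise loss
    BT_MARGIN = 'bt_margin'   # hypothetical new loss_type

def pairwise_loss(chosen_scores, rejected_scores, loss_type, margin=1.0):
    """Illustrative pairwise loss dispatch keyed on loss_type."""
    if loss_type == RewardLossType.BT:
        # Standard Bradley-Terry negative log-likelihood
        return -F.logsigmoid(chosen_scores - rejected_scores).mean()
    if loss_type == RewardLossType.BT_MARGIN:
        # New branch: require chosen to beat rejected by a margin
        return -F.logsigmoid(chosen_scores - rejected_scores - margin).mean()
    raise ValueError(f'Unsupported loss_type: {loss_type}')

The new enum value is then what you would set as the loss_type in the model yaml.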
Creating new Offline RL Variants
In order to add new offline RL algorithms, we need to add a new Enum value to DPOEnum here, and then add the new loss function in the dpo_loss method. The loss function field is then set in the model section of the DPO yaml here.
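As a minimal sketch (assuming a chosen-minus-rejected log-ratio formulation), a new variant would look roughly like the branch below; DPOEnum's real members and dpo_loss's real signature live in Compose RL, so everything here is illustrative.

# Hypothetical sketch of adding an offline RL variant to a DPO-style loss dispatch.
from enum import Enum

import torch.nn.functional as F

class DPOEnum(Enum):              # stand-in for the repo's DPOEnum
    DPO = 'dpo'
    MY_VARIANT = 'my_variant'     # the new algorithm you are adding

def dpo_loss(policy_logratio, ref_logratio, loss_type, beta=0.1):
    """Illustrative loss dispatch over (chosen - rejected) log-ratios."""
    logits = beta * (policy_logratio - ref_logratio)
    if loss_type == DPOEnum.DPO:
        return -F.logsigmoid(logits).mean()
    if loss_type == DPOEnum.MY_VARIANT:
        # Toy placeholder objective for the new variant
        return ((logits - 1.0) ** 2).mean()
    raise ValueError(f'Unsupported loss_type: {loss_type}')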
Creating new Online RL Variants
In order to create new online RL algorithms, we need to add a new forward and loss function in ppo/modeling_utils.py. From there, we need to define a new model in ppo/model.py.
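Below is a rough, hypothetical sketch of what such a forward/loss pair could look like; the batch keys, function names, and the clipped-surrogate loss shown here are assumptions chosen for illustration, not the repo's actual implementation.

# Hypothetical sketch of a forward/loss pair for a new online RL variant.
import torch

def my_variant_forward(model, batch):
    """Return per-token log-probs of the sampled actions under the policy."""
    # Assumes a Hugging Face-style causal LM that returns an object with .logits
    logits = model(batch['input_ids'], attention_mask=batch['attention_mask']).logits
    logprobs = torch.log_softmax(logits, dim=-1)
    # Gather the log-prob of each generated token (assumed batch['actions'] key)
    return logprobs.gather(-1, batch['actions'].unsqueeze(-1)).squeeze(-1)

def my_variant_loss(new_logprobs, old_logprobs, advantages, clip_range=0.2):
    """A PPO-style clipped surrogate objective as an illustrative loss."""
    ratio = torch.exp(new_logprobs - old_logprobs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_range, 1 + clip_range) * advantages
    return -torch.minimum(unclipped, clipped).mean()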
A high-level overview of LLM Foundry Plugins
LLM Foundry plugins allow us to take advantage of many of the functions within LLM Foundry while augmenting the code with our own models and methods. Plugins require us to define registry entry points into LLM Foundry, which is done in the pyproject.toml file in this repo. See the commented code around entry points in the file here, where we define various entry points for models, dataloaders, callbacks, and metrics. For more details on plugins and the registry system, see here.
How to write HuggingFace checkpoints
Since we use LLM Foundry's plugins, we can export our models as HuggingFace models and checkpoints. We do this with the hf_checkpointer callback from LLM Foundry, which can be defined in a yaml as follows:
callbacks:
  hf_checkpointer:
    # Specify an appropriate save interval (e.g. 1 epoch)
    save_interval: 1ep
    # Specify the appropriate save path (this can be a local checkpoint, s3, oci, or any other
    # saving mechanism supported by Composer.)
    save_folder:
    # Specify the precision to save the checkpoint in.
    precision: bfloat16