databricks/Compose-RL

Compose RL

Compose RL is a framework for Reinforcement Learning from Human Feedback (RLHF), designed to streamline the integration of various RLHF techniques. Built on MosaicML's Composer and LLM Foundry, Compose RL enables modular and composable experimentation for researchers and practitioners exploring RLHF methodologies.

Note: This repository is currently in alpha. Expect frequent changes and potential instabilities. Use at your own risk!

Key Features

  • Composable RLHF Training:

    • Experiment with Proximal Policy Optimization (PPO), Direct Preference Optimization (DPO) and its variants, and reward model training.
    • Seamlessly integrate custom components or replace default implementations.
  • Powered by Databricks MosaicML:

    This repo uses Composer, an open-source deep learning training library optimized for scalability and usability. Composer simplifies distributed training workflows on large-scale clusters by abstracting complexities such as parallelism techniques, distributed data loading, and memory optimization. This repo also uses LLM Foundry, an open-source repository containing code for training large language models (LLMs), to enable rapid experimentation with the latest techniques.

  • Modularity:

    • Decoupled components for policy optimization, preference modeling, and reward model training.
    • Flexible configurations to adapt to a variety of RLHF workflows.

You'll find in this repo:

  • compose-rl - source code for models, datasets, utilities
  • scripts - scripts for generating data
  • yamls - yamls for training runs

Installation

Clone the repository and install the dependencies:

git clone https://github.com/databricks/Compose-RL.git
cd Compose-RL
pip install -e ".[gpu]"
python3 -m spacy download en_core_web_sm

We highly recommend starting from the llm-foundry images and installing LLM Foundry, Compose RL, and their dependencies on top of them.

Note: when using the LLM Foundry images, please install LLM Foundry as it's not natively included with the images.

Quickstart

This section walks through an end-to-end workflow: preparing data for training, then launching model training.

Data preparation

The commands below convert datasets into the Mosaic Data Shard (MDS) format that we use for training. The resulting MDS files hold a pre-tokenized version of the data.

Below is the command to prepare preference data -- which can be used for reward model or offline RL (e.g. DPO) training:

cd scripts
python data/unified_tokenize_dataset.py --dataset_name allenai/ultrafeedback_binarized_cleaned \
--local_dir pref_data \
--dataset_type preference \
--tokenizer_name meta-llama/Llama-3.1-8B-Instruct \
--split train_prefs

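Below is the command to prepare prompt data -- which can be used for online RL (e.g. PPO) training. The command starts from the repository root; if you are still in the scripts directory from the previous step, skip the cd line: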

cd scripts
python data/unified_tokenize_dataset.py --dataset_name allenai/ultrafeedback_binarized_cleaned \
--local_dir prompt_data \
--dataset_type single_prompt \
--tokenizer_name meta-llama/Llama-3.1-8B-Instruct \
--split train_prefs
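
The two data preparation commands above differ only in --dataset_type and --local_dir. As a minimal sketch, they could be assembled from Python like this. The helper name build_tokenize_command is hypothetical and not part of this repo; the flags and default values are copied from the commands above, and the helper only accepts the two dataset types shown in this README (the script itself may accept others).

```python
from typing import List


def build_tokenize_command(
    dataset_type: str,
    local_dir: str,
    dataset_name: str = "allenai/ultrafeedback_binarized_cleaned",
    tokenizer_name: str = "meta-llama/Llama-3.1-8B-Instruct",
    split: str = "train_prefs",
) -> List[str]:
    """Return the argument list for one data preparation run (hypothetical helper)."""
    # Only the two dataset types documented above are accepted by this sketch.
    if dataset_type not in ("preference", "single_prompt"):
        raise ValueError(f"unsupported dataset_type: {dataset_type}")
    return [
        "python", "data/unified_tokenize_dataset.py",
        "--dataset_name", dataset_name,
        "--local_dir", local_dir,
        "--dataset_type", dataset_type,
        "--tokenizer_name", tokenizer_name,
        "--split", split,
    ]


# The two runs from this section: preference data and prompt data.
commands = [
    build_tokenize_command("preference", "pref_data"),
    build_tokenize_command("single_prompt", "prompt_data"),
]
for cmd in commands:
    print(" ".join(cmd))
```

To execute a command rather than print it, pass the list to subprocess.run with cwd set to the scripts directory, since the script path is relative to it.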

Model training

Below are the commands to launch training runs, assuming you ran the data preparation scripts above. They also assume that you are in the root directory where Compose RL and LLM Foundry were cloned, because we rely on LLM Foundry's registry system to reuse existing LLM Foundry features.

Reward Model Training

Below is the command to run reward model training:

composer llm-foundry/scripts/train/train.py \
compose-rl/yamls/local_reward.yaml \
train_loader.dataset.local=/compose-rl/scripts/pref_data/ \
train_loader.dataset.split=train_prefs

DPO Training

Below is the command to run for DPO training (along with its variants):

composer llm-foundry/scripts/train/train.py \
compose-rl/yamls/local_dpo.yaml \
train_loader.dataset.local=/compose-rl/scripts/pref_data/ \
train_loader.dataset.split=train_prefs

In addition to standard DPO, we support several variants: Reward Aware Preference Optimization (RPO), REgression to RElative REward Based RL (REBEL), Identity Preference Optimization (IPO), and Kahneman-Tversky Optimization (KTO).

PPO Training

Below is the command to run Online PPO training:

composer llm-foundry/scripts/train/train.py \
compose-rl/yamls/local_ppo.yaml \
train_loader.dataset.local=/compose-rl/scripts/prompt_data/ \
train_loader.dataset.split=train_prefs
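
The reward model, DPO, and PPO launch commands above share one shape: the composer launcher, LLM Foundry's train.py, a YAML, and two dataset overrides. A minimal sketch of that pattern follows. The names RUNS and build_train_command are hypothetical; the paths and override keys are copied verbatim from the commands above.

```python
from typing import Dict, List

# YAML files and data folders exactly as they appear in the commands above.
RUNS: Dict[str, Dict[str, str]] = {
    "reward": {
        "yaml": "compose-rl/yamls/local_reward.yaml",
        "data": "/compose-rl/scripts/pref_data/",
    },
    "dpo": {
        "yaml": "compose-rl/yamls/local_dpo.yaml",
        "data": "/compose-rl/scripts/pref_data/",
    },
    "ppo": {
        "yaml": "compose-rl/yamls/local_ppo.yaml",
        "data": "/compose-rl/scripts/prompt_data/",
    },
}


def build_train_command(run: str, split: str = "train_prefs") -> List[str]:
    """Return the launch command for one of the three runs (hypothetical helper)."""
    cfg = RUNS[run]  # raises KeyError for an unknown run name
    return [
        "composer", "llm-foundry/scripts/train/train.py",
        cfg["yaml"],
        f"train_loader.dataset.local={cfg['data']}",
        f"train_loader.dataset.split={split}",
    ]


for name in RUNS:
    print(" ".join(build_train_command(name)))
```

Reward model and DPO training read the preference data, while PPO reads the prompt data, which is why only the YAML and the data folder change between runs.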

Helpful code pointers

Adding new data processing: To support new kinds of data, modify the __iter__ function in the unified_tokenize_dataset.py script. Typically, all datasets are pre-tokenized before model training.

Creating new reward models: To modify the training loss for reward models, define a new Enum here and update the pairwise_loss function. The new loss is then selected through a different loss_type in the model yaml here.

To modify the reward model architecture for HuggingFace models, update AutoModelForCausalLMWithRM here. Any model config changes can be made in the config here, and they apply to the model yaml here.
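
To make the loss step above concrete, here is a toy, scalar sketch of the "new Enum member plus new branch" pattern. It is not the repo's implementation: the Enum name PairwiseLossType, the HINGE variant, and the margin parameter are assumptions made for illustration, and the real pairwise_loss presumably operates on tensors with a different signature. The Bradley-Terry branch is the standard pairwise preference loss, -log(sigmoid(r_chosen - r_rejected)).

```python
import math
from enum import Enum


class PairwiseLossType(Enum):  # hypothetical name
    BRADLEY_TERRY = "bradley_terry"
    HINGE = "hinge"  # hypothetical new variant added for illustration


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def pairwise_loss(
    chosen_reward: float,
    rejected_reward: float,
    loss_type: PairwiseLossType,
    margin: float = 1.0,
) -> float:
    """Toy scalar loss for one (chosen, rejected) pair."""
    diff = chosen_reward - rejected_reward
    if loss_type is PairwiseLossType.BRADLEY_TERRY:
        # -log(sigmoid(diff)) equals softplus(-diff).
        return softplus(-diff)
    if loss_type is PairwiseLossType.HINGE:
        # Zero loss once the chosen reward beats the rejected one by `margin`.
        return max(0.0, margin - diff)
    raise ValueError(f"unknown loss type: {loss_type}")


print(pairwise_loss(2.0, 0.5, PairwiseLossType.BRADLEY_TERRY))
print(pairwise_loss(2.0, 0.5, PairwiseLossType.HINGE))
```

Dispatching on an Enum keeps the choice of loss a single string in the YAML, which matches the loss_type field described above.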

Creating new Offline RL Variants: To add a new offline RL algorithm, add a new member to DPOEnum here and then add the new loss function in the dpo_loss method. The loss is then selected through the loss function field in the model section of the DPO yaml here.
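
The same Enum-dispatch idea can be sketched for offline RL losses. The code below uses the textbook per-example formulas for DPO and IPO on scalar log-probabilities; it is an illustration under stated assumptions, not the repo's dpo_loss. The names OfflineLoss and preference_loss and the argument names are hypothetical.

```python
import math
from enum import Enum


class OfflineLoss(Enum):  # hypothetical stand-in for the repo's DPOEnum
    DPO = "dpo"
    IPO = "ipo"


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def preference_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float,
    loss_type: OfflineLoss,
) -> float:
    """Toy scalar loss for one preference pair."""
    # How much more the policy prefers chosen over rejected, relative to the reference.
    margin = (policy_chosen_logp - ref_chosen_logp) - (
        policy_rejected_logp - ref_rejected_logp
    )
    if loss_type is OfflineLoss.DPO:
        # -log(sigmoid(beta * margin))
        return softplus(-beta * margin)
    if loss_type is OfflineLoss.IPO:
        # Squared distance between the margin and the target 1 / (2 * beta).
        return (margin - 1.0 / (2.0 * beta)) ** 2
    raise ValueError(f"unknown loss type: {loss_type}")


print(preference_loss(-1.0, -2.0, -1.5, -1.5, beta=0.1, loss_type=OfflineLoss.DPO))
print(preference_loss(-1.0, -2.0, -1.5, -1.5, beta=0.1, loss_type=OfflineLoss.IPO))
```

Adding another variant in this sketch means one new Enum member and one new branch, which mirrors the two steps described above.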

Creating new Online RL Variants: To create a new online RL algorithm, add a new forward function and a new loss function to ppo/modeling_utils.py, then define a new model in ppo/model.py that uses them.
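
For orientation, the kind of loss function such a variant would replace is the standard PPO clipped surrogate objective. Below is a minimal scalar sketch of that textbook formula; the function name ppo_clip_loss and its arguments are hypothetical and do not reflect the signatures in ppo/modeling_utils.py.

```python
import math


def ppo_clip_loss(
    new_logp: float,
    old_logp: float,
    advantage: float,
    clip_eps: float = 0.2,
) -> float:
    """Negative clipped surrogate objective for a single action (toy version)."""
    # Probability ratio between the current policy and the policy that sampled the action.
    ratio = math.exp(new_logp - old_logp)
    # Keep the ratio inside [1 - clip_eps, 1 + clip_eps].
    clipped_ratio = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
    # Take the more pessimistic of the two objectives, then negate to get a loss.
    return -min(ratio * advantage, clipped_ratio * advantage)


print(ppo_clip_loss(new_logp=-0.9, old_logp=-1.0, advantage=1.0))
print(ppo_clip_loss(new_logp=0.0, old_logp=-1.0, advantage=1.0))
```

In the second call the ratio exceeds 1 + clip_eps with a positive advantage, so the clipped term is used and further increases in the ratio no longer reduce the loss.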

A high-level overview of LLM Foundry plugins: LLM Foundry plugins allow us to use many of the functions within LLM Foundry while augmenting the code with other models and methods. Plugins require us to define registry entry points into LLM Foundry, which is done in the pyproject.toml file in this repo. See the commented code around entry points in the file here, where we define entry points for models, dataloaders, callbacks, and metrics. For more details on plugins and the registry system, see here.
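
Entry points declared in a pyproject.toml become discoverable through the Python standard library once the package is installed. The sketch below lists the entry point names registered under a given group. The helper name list_entry_points is hypothetical, and the group used in the example, console_scripts, is a generic standard group chosen for illustration; the actual group names that LLM Foundry reads are the ones written in this repo's pyproject.toml and are not reproduced here.

```python
from importlib.metadata import entry_points
from typing import List


def list_entry_points(group: str) -> List[str]:
    """Return the sorted names of all installed entry points in `group`."""
    eps = entry_points()
    if hasattr(eps, "select"):
        # Python 3.10 and later expose a select() API.
        selected = eps.select(group=group)
    else:
        # Older Python versions return a dict keyed by group name.
        selected = eps.get(group, [])
    return sorted(ep.name for ep in selected)


# Example with a standard group; substitute a group name from pyproject.toml.
print(list_entry_points("console_scripts"))
```

After installing Compose RL with pip, running this helper with one of the groups from pyproject.toml is a quick way to check that the plugin registration took effect.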

How to write HuggingFace checkpoints: Because we use LLM Foundry's plugins, we can export our models as HuggingFace models and checkpoints. This is done with the hf_checkpointer callback from LLM Foundry, which can be configured in a yaml as follows:

callbacks:
    hf_checkpointer:
        # Specify an appropriate save interval (e.g. 1 epoch)
        save_interval: 1ep
        # Specify the appropriate save path (this can be a local path, s3, oci, or any other
        # saving mechanism supported by Composer.)
        save_folder:
        # Specify the precision to save the checkpoint.
        precision: bfloat16
