Changes made in Unsloth and openInstruct to get a successful Online DPO run #1494
Thank you so much for this, we'll take a look and review it :)
Yeah, apologies, I should mention this change too in the
Ok, so I will go over where some changes should be made in this file from the allenAI/openInstruct repo that I edited. I will try to go through the changes in the order that I listed in the above issue. Any changes that I made in Unsloth do not need to be transferred into TRL, but anything from 3-5 in the OP does need to be transferred to TRL. I will also discuss what could potentially be implemented natively in Unsloth. I may try to implement these changes myself sometime soon and get a successful run (apologies, my semester is starting soon and I have quite a few other obligations).

1: This is more of a general note for anyone else reading this besides @shimmyshimmer and @danielhanchen. It is important that you train your own reward model, preferably DPO-style like the RM trainer on Hugging Face already does, and that you test whether its classification accuracy is good. You should also SFT your model on a dataset and check that it produces EOS tokens correctly and produces responses similar to the answers for the prompts in the SFT dataset. Online DPO will not work with arbitrary policies and reward models.

2: One big change I would make is in the args: pass an arg and put something like
in here: Although if you want to use the
Around these lines: https://github.com/huggingface/trl/blob/763738f457f283270772ac9bd5b3e4027fd424d5/trl/trainer/online_dpo_trainer.py#L130-L216

3: Just make sure to; I would also recommend setting the model and ref policy to

4: Ok so, it does not seem like they do batch generation here, but if you were to apply batch generation, you would just copy and paste my code into these lines, and make sure that you also set
Otherwise, if we aren't doing batch generation, we leave it the same.

5: Just redo the forward logits with the forward function instead of the outputs from the generations; copy from here
to here: (It seems like the online DPO trainer in Hugging Face already does this, but for its own reasons.)

6: I would just put this function into the online_dpo_trainer.py file in TRL and call it whenever you get rewards. The above replaces or overwrites this function here:
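(The linked function isn't reproduced above, so purely as a hedged sketch, a reward-scoring helper along these lines could look like the following; the function name and pooling details are assumptions, not code from the fork or TRL.)

```python
import torch

def get_reward_sketch(reward_model, query_responses, pad_token_id):
    # Hypothetical helper: score full query+response sequences with a
    # sequence-classification reward model trained with num_labels == 1.
    attention_mask = (query_responses != pad_token_id).long()
    with torch.no_grad():
        out = reward_model(input_ids=query_responses, attention_mask=attention_mask)
    # For causal-LM classification heads (e.g. LlamaForSequenceClassification),
    # HF pools the logit at the last non-pad token, giving one score per sequence.
    return out.logits.squeeze(-1)
```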
Apologies, I should mention there are a ton of hard-coded model and dataset names in the script, as I honestly got a bit lazy with figuring out the API. Regardless, if you pull the allenAI/OpenInstruct repo and also integrate the changes from Unsloth, I ran the script with:
My reward model saved weirdly, so you have to load it like this:
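(The original loading snippet isn't shown above. Purely as a hypothetical illustration, assuming the reward model was saved as a LoRA adapter on top of a base model with a classification head, loading could look something like this; the paths and base model name are placeholders.)

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from peft import PeftModel

# Placeholder names/paths -- the actual checkpoint layout in the run may differ.
base = AutoModelForSequenceClassification.from_pretrained(
    "meta-llama/Llama-3.2-1B", num_labels=1, torch_dtype=torch.bfloat16
)
reward_model = PeftModel.from_pretrained(base, "path/to/reward_model_adapter")
reward_model = reward_model.merge_and_unload()  # optionally fold the LoRA weights in

tokenizer = AutoTokenizer.from_pretrained("path/to/reward_model_adapter")
# The classification head needs to know the pad token to pool the right position.
reward_model.config.pad_token_id = tokenizer.pad_token_id or tokenizer.eos_token_id
```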
Hope this helps.
Thanks a lot @pluesclues, we'll actually probably be putting out a release just for Reward Modelling/Reinforcement Learning soon, and this addition would be stellar!
That's great! I will try to get working on the TRL notebook when my semester starts. However, I am not sure whether a lot of these changes could be added in Unsloth rather than by editing TRL itself; I will do it regardless, but I think there may be merit in making an RL-specific generate function that includes the padding changes as well as the specific generation changes I made. However, for the KL divergence issue I mentioned above, I think you may have to edit TRL regardless, to use the forward logits of the sequences produced by batch generation.
Alright, so as promised in the Unsloth reddit post https://www.reddit.com/r/LocalLLaMA/comments/1hqkeyn/comment/m4rbtto/?utm_source=share&utm_medium=web3x&utm_name=web3xcss&utm_term=1&utm_content=share_button, I will highlight the changes I made in the allenAI open-instruct repo that I forked (https://github.com/pluesclues/us_open-instruct) and the unsloth repo that I forked (https://github.com/pluesclues/unsloth/tree/main) in order to get things working. The changes were overall minimal, and I tried my best to make as few code changes as possible so they would be easy to integrate. Let's start with the changes I made in Unsloth, as they are quite simple compared to the ones I made in the open-instruct repo. I am going to highlight mainly three different things I had to focus on in order to get Unsloth to be compatible with the openInstruct repo.
DISCLAIMER: I GOT THIS WORKING MAINLY WITH THE LLAMA MODELS; THESE CHANGES CAN ALSO BE APPLIED TO THE OTHER MODELS AS WELL (although I should write better code to do this).
DISCLAIMER 2: Apologies, the TL;DR reddit dataset has inappropriate text at times; I will try to censor it.
Let's start with the changes I made in Unsloth:
https://github.com/pluesclues/unsloth/blob/main/unsloth/kernels/fast_lora.py
In fast_lora.py, I actually addressed this issue:
Lora downcasting issue #320
It can be fixed by adding

`with torch.amp.autocast('cuda', dtype=torch.bfloat16):`

(or `torch.float16`, depending on your system) above all of the matrix multiplication computations to enable mixed precision. I do not know why it doesn't work when you do

The changes were applied to:
LoRA_MLP:
https://github.com/pluesclues/unsloth/blob/389b98f4860ab007f02af27258bd68d594749a66/unsloth/kernels/fast_lora.py#L116C8-L116C63
LoRA_QKV:
https://github.com/pluesclues/unsloth/blob/389b98f4860ab007f02af27258bd68d594749a66/unsloth/kernels/fast_lora.py#L275
LoRA_W:
https://github.com/pluesclues/unsloth/blob/389b98f4860ab007f02af27258bd68d594749a66/unsloth/kernels/fast_lora.py#L392
But that solves the LoRA downcasting issue, at least when you call `loss.backward()` on a custom loss calculated with torch functions.
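To make the pattern concrete, here is a minimal sketch; this is not the actual Unsloth kernel code, just an illustrative custom autograd Function with its matmuls wrapped in an autocast region as described above.

```python
import torch

class LoRAMatmulSketch(torch.autograd.Function):
    """Simplified stand-in for a fast_lora-style fused op (2-D X for brevity)."""

    @staticmethod
    def forward(ctx, X, W, A, B, scale):
        ctx.save_for_backward(X, W, A, B)
        ctx.scale = scale
        # Mixed-precision region around the matrix multiplications.
        with torch.amp.autocast('cuda', dtype=torch.bfloat16):  # or torch.float16
            return X @ W.t() + scale * ((X @ A.t()) @ B.t())

    @staticmethod
    def backward(ctx, dY):
        X, W, A, B = ctx.saved_tensors
        s = ctx.scale
        with torch.amp.autocast('cuda', dtype=torch.bfloat16):
            dX = dY @ W + s * ((dY @ B) @ A)
            dA = s * (dY @ B).t() @ X          # grad for LoRA A, shape (r, in)
            dB = s * dY.t() @ (X @ A.t())      # grad for LoRA B, shape (out, r)
        # Cast grads back to the parameter dtypes; W is frozen, scale is a float.
        return dX.to(X.dtype), None, dA.to(A.dtype), dB.to(B.dtype), None
```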
The second change I want to highlight is quite simple and it is in https://github.com/pluesclues/unsloth/blob/main/unsloth/models/llama.py
So in these lines https://github.com/pluesclues/unsloth/blob/4705906536f8aa1a10143a3cfa814ddd50f05bdc/unsloth/models/llama.py#L1507-L1539 changes are made in order to preserve the original forward functions for the llama models. This is because, if you use AutoModelForSequenceClassification, it is not compatible with the Unsloth-overwritten forward functions, so they need to be kept, and set and reset via variables when you are calculating the rewards during your RL updates, given that your models must generate responses and get rewards during training. (I will highlight these changes when going into the allenAI repo.)

OK, THAT'S ALL THE UNSLOTH CHANGES. Next will be all of the changes that were made in AllenAI openInstruct but will need to be transferred to TRL. We first will start with the initialization of the models; it mostly stays the same, except for the reward model.
I also had to use the reset and set functions from the unsloth changes I made to initialize my reward model.
https://github.com/pluesclues/us_open-instruct/blob/5375f58e2b893554da018c9c6be472ce0d1ed220/open_instruct/unsloth_online_dpo.py#L347-L357
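As a hypothetical sketch of that kind of set/reset pattern (the helper names and dictionary are illustrative, not the actual Unsloth or fork API):

```python
from transformers.models.llama import modeling_llama

_SAVED = {}

def save_forwards(key):
    # Record whichever forward functions are currently installed, under a label.
    _SAVED[key] = (modeling_llama.LlamaModel.forward,
                   modeling_llama.LlamaForCausalLM.forward)

def restore_forwards(key):
    # Re-install a previously recorded pair of forward functions.
    (modeling_llama.LlamaModel.forward,
     modeling_llama.LlamaForCausalLM.forward) = _SAVED[key]

# Usage idea:
#   save_forwards("hf")          # before Unsloth patches the llama classes
#   ...FastLanguageModel.from_pretrained(...) patches them...
#   save_forwards("unsloth")
#   restore_forwards("hf")       # while building/scoring the reward model
#   restore_forwards("unsloth")  # back to the patched policy for generation/updates
```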
I also initialized the policy and reference policy for training before going into the loop.
https://github.com/pluesclues/us_open-instruct/blob/5375f58e2b893554da018c9c6be472ce0d1ed220/open_instruct/unsloth_online_dpo.py#L456-L457
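A hedged sketch of that step (function and variable names here are illustrative):

```python
from unsloth import FastLanguageModel

def prepare_policies(policy, ref_policy):
    # Put both models into Unsloth's training mode before entering the RL loop;
    # generation later toggles for_inference()/for_training() around each batch.
    FastLanguageModel.for_training(policy)
    FastLanguageModel.for_training(ref_policy)
    return policy, ref_policy
```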
4: Ok, this is where the most important changes are, and they have to do with generation. I will try to highlight all of the functions that are linked together, as well as where it all starts, which is in this file: https://github.com/pluesclues/us_open-instruct/blob/5375f58e2b893554da018c9c6be472ce0d1ed220/open_instruct/unsloth_online_dpo.py#L459-L467. TRL also unwraps the model for generation, and that function remains the same. I am going to go over `unsloth_batch_generation` and its dependencies: https://github.com/pluesclues/us_open-instruct/blob/5375f58e2b893554da018c9c6be472ce0d1ed220/open_instruct/model_utils.py#L473-L504
Ok, so I initialize `FastLanguageModel.for_inference(model)` before it generates from the batches and set it back to `FastLanguageModel.for_training(model)` after the function is done generating.

I will go into the logistics of the generation function: https://github.com/pluesclues/us_open-instruct/blob/5375f58e2b893554da018c9c6be472ce0d1ed220/open_instruct/model_utils.py#L447-L470. Notice here I only set `max_new_tokens=53`. This is because you will generate weird responses if you only set `min_new_tokens=53`, and if you set both to 53 the generation will not produce an EOS token.

Ok so, one problem with only setting `max_new_tokens=53` is that the Unsloth model will pad any tokens after the first EOS token with more EOS tokens, which is actually fine, but for batch generation the query_responses lengths won't match up when you have to do `return torch.cat(query_responses, 0)` when returning the batch generations.

Alright, also note that I should not have hard-coded 53 for `max_new_tokens`, but essentially what these lines of code https://github.com/pluesclues/us_open-instruct/blob/5375f58e2b893554da018c9c6be472ce0d1ed220/open_instruct/model_utils.py#L497-L502 do is: if the response was not 53 tokens long, they pad it with more EOS tokens up until 53 (you can change 53 to whatever `max_new_tokens` you need for your specific dataset; 53 is used for the TL;DR dataset https://huggingface.co/datasets/trl-internal-testing/tldr-preference-sft-trl-style). Essentially this makes sure all the shapes are the same and the padding makes sense according to how Unsloth's generate function does it.

I'll explain what happens if you do not do this more clearly. Typically, when you generate for a batch of 4, you will see the following shapes for the 4 samples (note, assume they are all tensors):
[64, 308]
[64, 299]
[64, 304]
[64, 308]
Essentially, to concatenate the batches they have to be the same shape, so we just pad with EOS tokens until the response portion reaches 53 tokens. I am not sure, however, whether Unsloth supports batch generation natively with this generation function, in which case this may not be an issue. It is also imperative that there is an EOS token in each of the responses, as that accounts for most of the reward given for the response; Online DPO will not work unless there is at least an EOS token in the generation.
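To make the shape/padding issue concrete, here is a simplified sketch of the batch-generation pattern described above; it is a reconstruction under stated assumptions (`local_batch_size`, sampling settings), not the exact code from the fork.

```python
import torch
from unsloth import FastLanguageModel

def batch_generate_padded(model, tokenizer, queries, local_batch_size, max_new_tokens=53):
    # queries: LongTensor of left-padded prompts, all the same length.
    FastLanguageModel.for_inference(model)            # switch to inference mode
    query_responses = []
    for i in range(0, queries.shape[0], local_batch_size):
        query = queries[i : i + local_batch_size]
        out = model.generate(
            input_ids=query,
            attention_mask=(query != tokenizer.pad_token_id).long(),
            max_new_tokens=max_new_tokens,
            do_sample=True,
        )
        # Pad every mini-batch out to query_len + max_new_tokens with EOS tokens
        # so the per-mini-batch tensors can be concatenated along dim 0.
        target_len = query.shape[1] + max_new_tokens
        if out.shape[1] < target_len:
            pad = torch.full(
                (out.shape[0], target_len - out.shape[1]),
                tokenizer.eos_token_id,
                dtype=out.dtype, device=out.device,
            )
            out = torch.cat([out, pad], dim=1)
        query_responses.append(out)
    FastLanguageModel.for_training(model)             # back to training mode
    return torch.cat(query_responses, 0)
```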
I used this generation function (the one linked above). Here are some example outputs:
Example of `max_new_tokens` and `min_new_tokens` set to 53:

TL;DR: I have to cut contact with my ex's friends or it'll hurt me. How to do it without hurting them? Is it the right thing to do? Is it healthy? Or am I being a b***? Thanks! :) :) :)
Example of `min_new_tokens` set to 53:

I feel like I have to break contact with these girls because I'm not sure if I want to keep up the friendship. But also because I don't want to hurt them. Will they accept or will it hurt? How do I make it work? Thanks for reading. :) -f/22. :D ^_^ :D<|end_of_text|>
Example of `max_new_tokens` set to 53 (apologies, this is not generated from the same prompt, but this is what happens during training loops; either way it does not change the logic I have implemented):

TL;DR: I don't like Halloween and I don't allow my son to trick-or-treat, but everyone else insists that I'm forcing him to miss out on something and I don't feel like I'm doing anything wrong.<|end_of_text|><|end_of_text|><|end_of_text|><|end_of_text|><|end_of_text|><|end_of_text|><|end_of_text|><|end_of_text|><|end_of_text|>
`logitss` are returned from the generate function, and the thing is that Unsloth, given the same prompt, produces different logits, due to 4-bit precision I believe? I had talked with @danielhanchen about this in this reddit post: https://www.reddit.com/r/unsloth/comments/1f90cgo/generation_instability_between_the_forward_probs/. This, however, is fixed by just using the forward function and not storing the output logits: https://github.com/pluesclues/us_open-instruct/blob/5375f58e2b893554da018c9c6be472ce0d1ed220/open_instruct/unsloth_online_dpo.py#L473-L476
This is so the KL is stable and actually starts at 0 since the ref_policy and policy should be the same when starting the DPO run.
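A hedged sketch of that forward-pass logprob computation (variable names are illustrative; TRL's online DPO trainer has its own version of this logic):

```python
import torch
import torch.nn.functional as F

def sequence_logprobs(model, query_responses, context_length, pad_token_id):
    attention_mask = (query_responses != pad_token_id).long()
    logits = model(input_ids=query_responses, attention_mask=attention_mask).logits
    # Logits at position t predict token t+1, so shift before gathering.
    logits = logits[:, context_length - 1 : -1]
    responses = query_responses[:, context_length:]
    logprobs = F.log_softmax(logits, dim=-1)
    # In practice you would also mask out positions after the first EOS token.
    return torch.gather(logprobs, 2, responses.unsqueeze(-1)).squeeze(-1)
```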
I tried my best to highlight all the changes; please let me know if anything is confusing. I will write about where to put the openInstruct changes into TRL in a comment below. I look forward to getting this integrated into Unsloth as soon as possible and possibly making a notebook for it.