Thank you for open sourcing the code!
We are looking for guidance on further training DiffuLLaMA on a code-specific dataset, in order to evaluate whether coding tasks perform better on the DiffuLLaMA architecture.
Current Attempts and Results
Full Pre-training Approach:
Dataset: 120M token code pretraining dataset
Base model: Nous Hermes LLaMA 2
Infrastructure: 8x A100 GPUs
Training duration: 1 week
Using CPU Offloading
Result: Model unable to produce legible code output
Plausible failure cause: undertrained; the loss showed no clearly visible pattern because the compute was insufficient
LoRA Fine-tuning Approach:
Dataset: 175M token code question answer pair dataset
Base model: diffufamily/diffullama
Method: LoRA rank 16
Result: No significant improvement in inference quality
Plausible failure cause: the LoRA activations may not sufficiently overlap with the path the model uses at inference, so LoRA makes no dent in the model's evaluation metrics (the LoRA-adapted model and the base model generate exactly the same outputs)
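When an adapted model and its base model generate exactly the same outputs, one thing worth ruling out is that the adapter's effective weight update is zero or is not being applied. The sketch below is a toy, pure-Python illustration of the standard LoRA update (alpha / r) * B @ A; it assumes the usual LoRA initialization, where B starts at zero. The function names `lora_delta` and `max_abs` and the tiny matrices are hypothetical and are not taken from the DiffuLLaMA repository.

```python
def lora_delta(A, B, alpha, rank):
    """Return the effective LoRA weight update (alpha / rank) * B @ A.

    A is a rank x d_in matrix and B is a d_out x rank matrix,
    both given as nested lists (toy stand-ins for real tensors).
    """
    scale = alpha / rank
    rows, inner, cols = len(B), len(A), len(A[0])
    return [
        [scale * sum(B[i][k] * A[k][j] for k in range(inner)) for j in range(cols)]
        for i in range(rows)
    ]


def max_abs(matrix):
    """Largest absolute entry; 0.0 means the adapter changes nothing."""
    return max(abs(x) for row in matrix for x in row)


# Toy rank-2 example (hypothetical values).
A = [[1.0, 2.0], [3.0, 4.0]]
B_untrained = [[0.0, 0.0], [0.0, 0.0]]  # B is zero at initialization
B_trained = [[1.0, 0.0], [0.0, 1.0]]

untrained_update = max_abs(lora_delta(A, B_untrained, alpha=4, rank=2))
trained_update = max_abs(lora_delta(A, B_trained, alpha=4, rank=2))
```

If the same check on the real adapter weights gives a value at or very near zero, the adapter was effectively never trained; if it is clearly non-zero but the outputs still match the base model, the adapter is probably not being loaded or applied at inference.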
It would be great if you could comment on the questions below regarding the direction we are taking for this training.
Are there recommended hyperparameters or training configurations specifically for code generation using DiffuLLaMA?
What would be an ideal dataset size/composition for code-specific training?
(We were initially thinking of swapping the ratio of the dataset mix to get a code-specific dataset: 30% from SlimPajama and 70% from StarCoder)
Are there known limitations or considerations when adapting DiffuLLaMA specifically for code generation?
Would you recommend any architectural modifications for code-specific tasks?
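The 30% SlimPajama / 70% StarCoder mix mentioned above could be realized by choosing the source of each training document at random with those weights. This is only a minimal sketch using the Python standard library; `sample_sources` is a hypothetical helper, and it samples per document, so the resulting token ratio will also depend on document lengths in each corpus.

```python
import random


def sample_sources(weights, n, seed=0):
    """Draw n source names, each chosen with probability proportional to its weight."""
    rng = random.Random(seed)  # fixed seed so the mix is reproducible
    names = list(weights)
    return rng.choices(names, weights=[weights[name] for name in names], k=n)


# Mixing ratio from the question above (source names used as plain labels).
mix = {"SlimPajama": 0.3, "StarCoder": 0.7}

draws = sample_sources(mix, 10_000)
starcoder_share = draws.count("StarCoder") / len(draws)
```

Over many draws `starcoder_share` should land close to 0.7; to hit a 70% share of tokens rather than documents, the weights would need to be adjusted for the average document length of each source.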
Additional Information
We are using the default configurations from the repository for both the full fine-tune and the LoRA run.
Any guidance or suggestions would be greatly appreciated.
Hi, thanks for your interest in DiffuLLaMA.
My empirical findings include:
(1) it's easier to learn from supervised (pair-wise) data than from unsupervised data (a pre-training corpus);
(2) even when the loss has converged, more training FLOPs might be required;
(3) the generation algorithm is important. Please try different generation algorithms, including different temperatures.
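To illustrate what trying different temperatures changes, here is a generic sketch of temperature-scaled sampling from a vector of logits. It is not DiffuLLaMA's decoding procedure; how the repository's generation code exposes temperature is not shown in this thread, and the function names and example logits are hypothetical.

```python
import math
import random


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities; lower temperature sharpens the distribution."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_token(logits, temperature, rng):
    """Sample one token index from the temperature-scaled distribution."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]


logits = [2.0, 1.0, 0.0]  # toy scores for three candidate tokens
sharp = softmax_with_temperature(logits, 0.5)
neutral = softmax_with_temperature(logits, 1.0)
flat = softmax_with_temperature(logits, 2.0)
token = sample_token(logits, 1.0, random.Random(0))
```

Low temperatures concentrate probability on the top-scoring token, which tends to help when exact syntax matters, as in code, while higher temperatures spread probability over more candidates.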