Training code-specific diffusion model using DiffuLLaMA #4

theycallmeloki · 2024-12-31T10:47:01Z

Hi!

Thank you for open sourcing the code!
I am looking for guidance on adapting DiffuLLaMA to a code-specific dataset, in order to evaluate whether coding tasks perform better on the DiffuLLaMA architecture.

Current Attempts and Results

Full Pre-training Approach:
- Dataset: 120M token code pretraining dataset
- Base model: Nous Hermes LLaMA 2
- Infrastructure: 8x A100 GPUs
- Training duration: 1 week
- Using CPU Offloading
- Result: Model unable to produce legible code output
- Plausible failure condition: Undertrained; no clear loss pattern was visible because the compute wasn't sufficient

LoRA Fine-tuning Approach:
- Dataset: 175M token code question answer pair dataset
- Base model: diffufamily/diffullama
- Method: LoRA rank 16
- Result: No significant improvement in inference quality
- Plausible failure condition: LoRA activations don't sufficiently overlap with the way the model performs inference, so LoRA doesn't seem to make a dent in the model's evaluation metrics (the LoRA adapter version and the base version generate exactly the same outputs)
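
One sanity check for the "exactly the same outputs" symptom is to confirm that the adapter contributes a non-zero weight update at all. In the standard LoRA formulation the update is (alpha / rank) * B @ A, and B is commonly initialized to zero, so an untrained or unloaded adapter leaves the base model unchanged. The sketch below uses toy matrices; the shapes, the `alpha` value, and the helper names are hypothetical and are not taken from the DiffuLLaMA repository.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_delta(lora_a, lora_b, alpha, rank):
    """Return the weight update (alpha / rank) * B @ A contributed by one adapter."""
    scale = alpha / rank
    return [[scale * v for v in row] for row in matmul(lora_b, lora_a)]


def max_abs(matrix):
    """Largest absolute entry; 0.0 means the adapter changes nothing."""
    return max(abs(v) for row in matrix for v in row)


# Hypothetical toy shapes: rank 2, input dim 3, output dim 2.
lora_a = [[0.1, -0.2, 0.3], [0.0, 0.5, -0.1]]  # A: rank x in
untrained_b = [[0.0, 0.0], [0.0, 0.0]]         # B: out x rank, zero at init
trained_b = [[0.2, -0.1], [0.0, 0.4]]          # B after some (made-up) training

untrained_delta = max_abs(lora_delta(lora_a, untrained_b, alpha=32, rank=2))
trained_delta = max_abs(lora_delta(lora_a, trained_b, alpha=32, rank=2))
print(untrained_delta, trained_delta)
```

If the same check on the real adapter tensors gives a maximum of zero, or the adapter was never attached at inference time, identical outputs are expected regardless of the dataset.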

It would be great if you could comment on the questions below regarding the direction we are taking for this training.

Are there recommended hyperparameters or training configurations specifically for code generation using DiffuLLaMA?
What would be an ideal dataset size/composition for code-specific training?
(We were thinking initially to swap the ratio of the dataset mix to achieve a coding specific dataset, 30% from SlimPajama and 70% from StarCoder)
Are there known limitations or considerations when adapting DiffuLLaMA specifically for code generation?
Would you recommend any architectural modifications for code-specific tasks?
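
For reference, the 30% SlimPajama / 70% StarCoder mix mentioned above could be drawn per training example roughly as follows. This is only a sketch: the `MIX` table and `sample_sources` helper are hypothetical names, and it assumes the ratio is applied per sampled document rather than per token, which may not match how the repository builds its data mix.

```python
import random
from collections import Counter

# Proposed mix from the question above (treated as per-document sampling weights).
MIX = {"SlimPajama": 0.3, "StarCoder": 0.7}


def sample_sources(mix, n, seed=0):
    """Pick the source dataset for each of n training examples according to mix."""
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    names = list(mix)
    weights = [mix[name] for name in names]
    return rng.choices(names, weights=weights, k=n)


counts = Counter(sample_sources(MIX, 10_000))
for name, count in counts.items():
    print(f"{name}: {count / 10_000:.1%}")
```

Because documents in the two corpora differ in length, the token-level ratio can drift from 30/70 unless the weights are adjusted for average document length.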

Additional Information

Using default configurations from the repository for both the full fine-tune and the LoRA run.

Any guidance or suggestions would be greatly appreciated.

summmeer · 2025-01-15T07:18:27Z

Hi, thanks for your interest in DiffuLLaMA.
My empirical findings include:
(1) it's easier to learn supervised data (pair-wise) than un-supervised data (pre-training corpus);
(2) even if the loss has converged, more training FLOPs might be required;
(3) generation algorithm is important. Please try different generation algorithms including different temperatures.
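
To illustrate point (3), here is a generic sketch of how temperature reshapes a token distribution before sampling. It is not the DiffuLLaMA generation code; the function names and the example logits are made up, and how the repository applies temperature within its denoising steps is not shown in this thread.

```python
import math
import random


def softmax_with_temperature(logits, temperature):
    """Turn logits into probabilities; lower temperature sharpens the distribution."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_token(logits, temperature, rng):
    """Sample one token index from the temperature-scaled distribution."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]


logits = [2.0, 1.0, 0.1]  # made-up scores for three candidate tokens
rng = random.Random(0)
for temperature in (0.2, 1.0, 2.0):
    probs = softmax_with_temperature(logits, temperature)
    token = sample_token(logits, temperature, rng)
    print(temperature, [round(p, 3) for p in probs], token)
```

Low temperatures concentrate probability on the top-scoring token, while high temperatures flatten the distribution, so sweeping a few values is a cheap way to see how sensitive output quality is to the sampling setup.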
