Training code-specific diffusion model using DiffuLLaMA #4

Open
theycallmeloki opened this issue Dec 31, 2024 · 1 comment
@theycallmeloki

Hi!

Thank you for open sourcing the code!
I'm looking for guidance on adapting DiffuLLaMA with a code-specific dataset, in order to evaluate whether coding tasks perform better on the DiffuLLaMA architecture.

Current Attempts and Results

  1. Full Pre-training Approach:

    • Dataset: 120M token code pretraining dataset

    • Base model: Nous Hermes LLaMA 2

    • Infrastructure: 8x A100 GPUs

    • Training duration: 1 week

    • Using CPU Offloading

    • Result: Model unable to produce legible code output

    • Plausible failure condition: Undertrained; no clear loss pattern was visible because the compute was insufficient

  2. LoRA Fine-tuning Approach:

    • Dataset: 175M token code question answer pair dataset

    • Base model: diffufamily/diffullama

    • Method: LoRA rank 16

    • Result: No significant improvement in inference quality

    • Plausible failure condition: The LoRA adapter's activations don't seem to overlap sufficiently with how the model performs inference, so LoRA makes no dent in the model's evaluation metrics (the LoRA adapter version and the base version generate exactly the same outputs)
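
Exactly identical outputs can also be checked mechanically before blaming the method. In the standard LoRA formulation the effective weight is W + (alpha / r) · B · A, and B is initialized to zero, so an adapter that was never trained, or never attached at inference, reproduces the base model exactly. The following is a minimal pure-Python sketch of that check; the matrices, values, and function names are hypothetical toy examples, not code from the DiffuLLaMA repository.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def lora_delta(lora_a, lora_b, alpha, rank):
    """Return the LoRA weight update (alpha / rank) * B @ A."""
    scale = alpha / rank
    return [[scale * v for v in row] for row in matmul(lora_b, lora_a)]

def max_abs(matrix):
    """Largest absolute entry of a matrix."""
    return max(abs(v) for row in matrix for v in row)

# Toy shapes: A is rank x d_in, B is d_out x rank (rank = 2 here for brevity).
lora_a = [[0.1, -0.2, 0.3], [0.0, 0.5, -0.1]]
untrained_b = [[0.0, 0.0], [0.0, 0.0]]   # standard LoRA init: B = 0
trained_b = [[0.2, -0.1], [0.05, 0.3]]   # hypothetical values after training

untrained_delta = max_abs(lora_delta(lora_a, untrained_b, alpha=32, rank=2))
trained_delta = max_abs(lora_delta(lora_a, trained_b, alpha=32, rank=2))
```

If the largest delta is exactly zero for every adapted layer of a real checkpoint, identical outputs are expected, which would point to a loading or training-setup problem rather than a limit of LoRA itself.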

It would be great if you could comment on the questions below regarding the direction we are taking for this training.

  1. Are there recommended hyperparameters or training configurations specifically for code generation using DiffuLLaMA?
  2. What would be an ideal dataset size/composition for code-specific training?
    (Our initial idea was to swap the ratio of the dataset mix to get a code-specific dataset: 30% from SlimPajama and 70% from StarCoder)
  3. Are there known limitations or considerations when adapting DiffuLLaMA specifically for code generation?
  4. Would you recommend any architectural modifications for code-specific tasks?
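
The 30% SlimPajama / 70% StarCoder mix from question 2 could be realized by weighted sampling over sources. This is a minimal sketch, assuming sampling happens per example; `sample_sources` and the seed are hypothetical and not part of the repository.

```python
import random

def sample_sources(weights, n, seed=0):
    """Draw n source names with probability proportional to weights."""
    rng = random.Random(seed)
    names = list(weights)
    return rng.choices(names, weights=[weights[k] for k in names], k=n)

# Proposed mix from the question: 30% SlimPajama, 70% StarCoder.
mix = {"SlimPajama": 0.3, "StarCoder": 0.7}
draws = sample_sources(mix, n=10_000)
starcoder_share = draws.count("StarCoder") / len(draws)
```

Note that this fixes the ratio of examples, not tokens. If the two sources differ in average example length, the weights would need adjusting to hit a 30/70 split by token count.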

Additional Information

  • Using default configurations from the repository for both the full fine-tune and the LoRA run

Any guidance or suggestions would be greatly appreciated.

@summmeer
Contributor

Hi, thanks for your interest in DiffuLLaMA.
My empirical findings include:
(1) it's easier to learn from supervised (pair-wise) data than from unsupervised data (pre-training corpus);
(2) even if the loss has converged, more training FLOPs might be required;
(3) the generation algorithm is important. Please try different generation algorithms, including different temperatures.
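
To illustrate what varying the temperature does, here is a minimal sketch of temperature-scaled softmax over token logits. It is a generic illustration, not DiffuLLaMA's actual sampler; the logits and the temperature grid are hypothetical.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities; lower temperature sharpens the distribution."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]  # hypothetical scores for three candidate tokens
sweep = {t: softmax_with_temperature(logits, t) for t in (0.2, 0.7, 1.0, 1.5)}
```

Lower temperatures concentrate probability on the top-scoring token and higher ones flatten the distribution, so comparing generations across such a grid is one simple way to follow the suggestion above.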
