
Grad overflow on iteration and loss "nan" when using One-bit Adam #1472

Answered by conglongli
Vladimir-Bayes asked this question in Q&A


Hi, thanks for trying 1-bit Adam. Unfortunately, the information you provided is not sufficient to determine the root cause, but here are some suggestions/questions:

  1. Have you tried the same configs but with Adam? If so, does that work?
  2. Did the NaN loss happen at the very beginning, or only after "freeze_step" steps? If it happened before reaching "freeze_step" steps, then it was actually still running baseline Adam.
  3. How many total steps does your training have? As mentioned in our tutorial, we recommend setting "freeze_step" to 15-25% of the total training steps for a given model in the first try, because baseline Adam needs to run long enough for the variance to become stable (see the config sketch after this list). If y…
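For reference, here is a minimal sketch of where "freeze_step" fits in a DeepSpeed config for 1-bit Adam. The total step count, learning rate, and batch size below are illustrative placeholders, not recommendations for your model:

```python
# Minimal sketch: enabling 1-bit Adam via a DeepSpeed config dict.
# All numeric values are placeholders for illustration only.
import deepspeed

TOTAL_STEPS = 100_000  # assumed total training steps for this example

ds_config = {
    "train_batch_size": 32,
    "fp16": {"enabled": True},
    "optimizer": {
        "type": "OneBitAdam",
        "params": {
            "lr": 1e-4,
            # Run baseline (uncompressed) Adam for roughly 15-25% of the
            # total steps so the variance term stabilizes before the
            # compression stage starts.
            "freeze_step": int(0.20 * TOTAL_STEPS),
            "cuda_aware": False,
            "comm_backend_name": "nccl",
        },
    },
}

# `model` and its parameters come from your own training script:
# engine, optimizer, _, _ = deepspeed.initialize(
#     model=model, model_parameters=model.parameters(), config=ds_config
# )
```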
