
Grad overflow on iteration and loss "nan" when using One-bit Adam #1472

Answered by conglongli
Vladimir-Bayes asked this question in Q&A


Hi, thanks for trying 1-bit Adam. Unfortunately, the information you provided is not sufficient to determine the root cause, but here are some suggestions/questions:

  1. Have you tried the same configs but with Adam? If so, does that work?
  2. Did the NaN loss happen at the very beginning, or only after "freeze_step" steps? If it happened before reaching "freeze_step" steps, then it was actually still running baseline Adam.
  3. How many total steps does your training have? As mentioned in our tutorial, we recommend setting "freeze_step" to 15-25% of the total training steps for a given model in the first try, because baseline Adam needs to run long enough for the variance to become stable (see the config sketch after this list). If y…
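For reference, here is a minimal sketch of where "freeze_step" fits in a DeepSpeed config for 1-bit Adam. The total step count, learning rate, and batch size below are illustrative placeholders, not recommendations for your model:

```python
# Minimal sketch: enabling 1-bit Adam via a DeepSpeed config dict.
# All numeric values are placeholders for illustration only.
import deepspeed

TOTAL_STEPS = 100_000  # assumed total training steps for this example

ds_config = {
    "train_batch_size": 32,
    "fp16": {"enabled": True},
    "optimizer": {
        "type": "OneBitAdam",
        "params": {
            "lr": 1e-4,
            # Run baseline (uncompressed) Adam for roughly 15-25% of the
            # total steps so the variance term stabilizes before the
            # compression stage starts.
            "freeze_step": int(0.20 * TOTAL_STEPS),
            "cuda_aware": False,
            "comm_backend_name": "nccl",
        },
    },
}

# `model` and its parameters come from your own training script:
# engine, optimizer, _, _ = deepspeed.initialize(
#     model=model, model_parameters=model.parameters(), config=ds_config
# )
```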
