
Train with weight decay and momentum #4

Open
milliema opened this issue Feb 26, 2021 · 2 comments

Comments

@milliema

I'm using SLS to train my own model, but I found that the behavior differs between plain SGD and SGD + weight decay + momentum.
When I use plain SGD, the step size increases at first, following an exponential trend, which is consistent with your published work.
However, if I use SGD + weight decay + momentum, the step size is very stable (0.02~0.03) for most of training.
Can you explain why? Is SPS incompatible with momentum and weight decay in the optimizer?

@IssamLaradji
Owner

We have noticed the same behaviour with the step size when incorporating momentum. I am not sure why it happens, but our team is investigating, as it is an interesting phenomenon.

@milliema
Author

milliema commented Feb 26, 2021

Thanks for your reply.

We have noticed the same behavior with the step size when incorporating momentum.

So the behavior is related only to momentum? Did you also test with weight decay? My guess is that when momentum is adopted, the weight norm increases, the gradient norm may increase as well, and so the computed step size decreases. But with weight decay + momentum, the weight norm is normally stable, which is what confuses me about the results I get.
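For reference, the Polyak-style step size used by SPS has the form η_k = (f_i(x_k) − f_i*) / (c·‖∇f_i(x_k)‖²). A minimal sketch of this rule (the values of `c`, `eta_max`, and the assumption f_i* = 0 under interpolation are illustrative, not the repository's defaults) shows why a larger gradient norm at a comparable loss shrinks the computed step:

```python
def sps_step_size(loss, grad_norm_sq, c=0.5, loss_star=0.0, eta_max=10.0):
    """Polyak step size eta = (loss - loss*) / (c * ||grad||^2), capped at eta_max."""
    eta = (loss - loss_star) / (c * grad_norm_sq)
    return min(eta, eta_max)

# If momentum inflates the weight/gradient norms while the loss stays
# comparable, the denominator grows and the computed step size shrinks:
small_grad = sps_step_size(loss=0.5, grad_norm_sq=1.0)   # -> 1.0
large_grad = sps_step_size(loss=0.5, grad_norm_sq=25.0)  # much smaller
print(small_grad, large_grad)
```

This is only the scalar rule, not the full optimizer; it is meant to make the "grad norm up → step size down" mechanism in the paragraph above concrete.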
BTW, have you tested SPS or SLS on larger datasets (e.g., ImageNet)? The idea seems very interesting and promising for diverse applications.
