
Train with weight decay and momentum #4

Open
milliema opened this issue Feb 26, 2021 · 2 comments

Comments

@milliema

I'm using SLS to train my own model, but I found that the behavior differs between plain SGD and SGD + weight decay + momentum.
When I use plain SGD, the step size increases at first, following an exponential trend, which is consistent with your published work.
However, if I use SGD + weight decay + momentum, the step size is very stable (0.02~0.03) for most of training.
Can you explain why? Is SPS incompatible with momentum and weight decay in the optimizer?

@IssamLaradji
Owner

We have noticed the same behaviour with the step size when incorporating momentum. I am not sure why it happens, but our team is investigating, as it is an interesting phenomenon.

@milliema
Author

milliema commented Feb 26, 2021

Thanks for your reply.

We have noticed the same behavior with the step size when incorporating momentum.

So the behavior is related only to momentum? Did you also test with weight decay? My guess is that when momentum is adopted, the weight norm increases, the gradient norm may increase as well, and so the computed step size decreases. But with weight decay + momentum, the weight norm is normally stable, which is what confuses me about the results I get.
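For reference, the Polyak-style step size used by SPS has the form η_k = (f_i(x_k) − f_i*) / (c·‖∇f_i(x_k)‖²). A minimal sketch of this rule (the values of `c`, `eta_max`, and the assumption f_i* = 0 under interpolation are illustrative, not the repository's defaults) shows why a larger gradient norm at a comparable loss shrinks the computed step:

```python
def sps_step_size(loss, grad_norm_sq, c=0.5, loss_star=0.0, eta_max=10.0):
    """Polyak step size eta = (loss - loss*) / (c * ||grad||^2), capped at eta_max."""
    eta = (loss - loss_star) / (c * grad_norm_sq)
    return min(eta, eta_max)

# If momentum inflates the weight/gradient norms while the loss stays
# comparable, the denominator grows and the computed step size shrinks:
small_grad = sps_step_size(loss=0.5, grad_norm_sq=1.0)   # -> 1.0
large_grad = sps_step_size(loss=0.5, grad_norm_sq=25.0)  # much smaller
print(small_grad, large_grad)
```

This is only the scalar rule, not the full optimizer; it is meant to make the "grad norm up → step size down" mechanism in the paragraph above concrete.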
BTW, have you tested SPS or SLS on larger datasets (e.g., ImageNet)? The idea seems very interesting and promising for diverse applications.
