I'm using SLS to train my own model, and I found that it behaves differently with plain SGD than with SGD+wd+mom.
When I use plain SGD, the step size increases at first, following an exponential trend, which is consistent with your published work.
However, if I use SGD+weight decay+momentum, the step size stays very stable (0.02~0.03) most of the time.
Can you explain why? Is SPS incompatible with optimizer momentum and weight decay?
We have noticed the same behavior with the step size when incorporating momentum. I am not sure why it happens, but our team is investigating this phenomenon because it is interesting.
> We have noticed the same behavior with the step size when incorporating momentum.
So the behavior is only related to momentum? Did you also test with weight decay? My guess is that the weight norm increases when momentum is adopted, so the gradient norm may increase as well, and the computed step size therefore decreases. But with weight decay+momentum the weight norm is normally stable, so the results I get confuse me.
BTW, have you ever tested SPS or SLS on larger datasets (e.g. ImageNet)? The idea seems very interesting and promising for diverse applications.
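To make the gradient-norm guess above concrete, here is a minimal sketch, assuming the step size follows the commonly stated capped Polyak rule, (loss - f_star) / (c * ||grad||^2), clipped at an upper bound. The function name `sps_step_size` and the defaults for `f_star`, `c`, and `max_step` are assumptions for illustration, not taken from this repository, and the sketch does not model how momentum or weight decay enter the update.

```python
def sps_step_size(loss, grad_sq_norm, f_star=0.0, c=0.5, max_step=float("inf")):
    """Capped Polyak-style step size (illustrative, not the repo's code).

    loss         -- mini-batch loss at the current weights
    grad_sq_norm -- squared L2 norm of the mini-batch gradient
    f_star       -- assumed optimal mini-batch loss (0 under interpolation)
    c            -- assumed scaling constant
    max_step     -- assumed upper bound on the step size
    """
    if grad_sq_norm <= 0.0:
        # A zero gradient produces no update, so the step size is irrelevant.
        return 0.0
    return min((loss - f_star) / (c * grad_sq_norm), max_step)


# Same loss, gradient norm doubled (squared norm 4 -> 16):
small_grad_step = sps_step_size(loss=1.0, grad_sq_norm=4.0)
large_grad_step = sps_step_size(loss=1.0, grad_sq_norm=16.0)
print(small_grad_step, large_grad_step)
```

Under this rule, doubling the gradient norm at a fixed loss divides the step size by four, which is the mechanism the guess relies on. Whether it explains the flat 0.02~0.03 step size with weight decay+momentum remains open in this thread.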