Move scale_weight_norms inside sync_gradients #1908
Draft
`scale_weight_norms` was being applied on every step instead of only on gradient-sync steps, so with gradient accumulation a larger `gradient_accumulation_steps` ran the scaling many more times without any changes to the weights in between. I also moved the other `sync_gradients` check, the one for sampling images and saving per step, from outside the accumulation block to inside it and merged the two checks. Possibly not the correct approach, but it felt appropriate for them to happen inside that accumulation block; the end result may be the same either way. A sketch of the pattern is below.
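For illustration only, here is a minimal sketch of the idea, not the actual sd-scripts code: a toy Accelerate loop where the max-norm scaling and the per-step sampling/saving hooks only run when `accelerator.sync_gradients` is true. The model, data, and the inline max-norm logic are placeholders (sd-scripts calls into the network's own regularization method instead).

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator

accelerator = Accelerator(gradient_accumulation_steps=4)

# Toy model and data, standing in for the real network and dataset.
model = torch.nn.Linear(8, 8)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
dataset = TensorDataset(torch.randn(64, 8), torch.randn(64, 8))
loader = DataLoader(dataset, batch_size=4)
model, optimizer, loader = accelerator.prepare(model, optimizer, loader)

scale_weight_norms = 2.5  # max-norm threshold, matching the test below

for x, y in loader:
    with accelerator.accumulate(model):
        loss = torch.nn.functional.mse_loss(model(x), y)
        accelerator.backward(loss)
        optimizer.step()  # no-op on non-sync micro-steps under accumulate()
        optimizer.zero_grad()

        if accelerator.sync_gradients:
            # Weights only change on sync (optimizer) steps, so the
            # max-norm scaling runs here instead of on every micro-step.
            with torch.no_grad():
                for p in model.parameters():
                    norm = p.norm()
                    if norm > scale_weight_norms:
                        p.mul_(scale_weight_norms / norm)
            # Per-step image sampling and checkpoint saving are merged
            # into the same sync_gradients check (elided here).
```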
This PR shares some formatting changes with the validation loss improvements PR #1903. I will probably wait for that PR to go through before this one, to align the formatting changes from black.
A limited test with `scale_weight_norms = 2.5`. Timings were measured over 1 epoch, so a longer run might show different results.
Epoch times (mm:ss per epoch), this PR vs the sd3 branch:

| Dataset | GA steps | This PR | sd3 branch |
| --- | --- | --- | --- |
| Dataset 1 | 4 | 2:35 | 2:51 |
| Dataset 1 | 8 | 2:26 | 2:44 |
| Dataset 2 | 64 | 5:08 | 5:52 |
| Dataset 2 | 114 (half batch) | 5:04 | 5:50 |