
Val loss improvement #1903

Open · wants to merge 7 commits into base: sd3
Conversation

kohya-ss
Owner

@kohya-ss kohya-ss commented Jan 27, 2025

  • train/eval state for the network and the optimizer.
  • stable timesteps
  • stable noise
  • support block swap

@stepfunction83

stepfunction83 commented Jan 27, 2025

I love the approach of holding the rng_state aside, setting the validation state with the validation seed, and then restoring the rng_state afterwards. It's much more elegant than tracking the state separately and has no overhead.
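The save/seed/restore pattern could be sketched like this (a minimal sketch; `validate_fn` and `validation_seed` are illustrative names, not the actual sd-scripts API):

```python
import torch

def run_validation(validate_fn, validation_seed=1234):
    """Run validation under a fixed seed, then restore the training RNG state.

    Sketch of the save/seed/restore pattern: validation always sees the same
    timesteps and noise, while training's RNG stream is left untouched.
    """
    # Save the current (training) RNG states.
    cpu_state = torch.get_rng_state()
    cuda_states = torch.cuda.get_rng_state_all() if torch.cuda.is_available() else None

    try:
        # Seed deterministically so every validation run is reproducible.
        torch.manual_seed(validation_seed)
        return validate_fn()
    finally:
        # Restore the training RNG states so training continues unperturbed.
        torch.set_rng_state(cpu_state)
        if cuda_states is not None:
            torch.cuda.set_rng_state_all(cuda_states)
```

Because the restore happens in `finally`, the training stream is recovered even if validation raises.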

@stepfunction83

I would also add that once this is in place, there won't be a need for a moving average to track the validation loss. Using consistent timesteps and noise will make it almost entirely stable, so displaying the mean of the validation loss for each validation run should be all that's needed.

Since the validation set is subject to change if the core dataset changes, I've found that tracking the validation loss relative to the initial loss also helps make progress comparable across different training runs.
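The relative tracking could be as simple as dividing each run's mean validation loss by the first run's (a sketch; `relative_val_loss` is a hypothetical helper, not sd-scripts code):

```python
def relative_val_loss(mean_losses):
    """Express each validation run's mean loss relative to the first run,
    so progress stays comparable even when the validation set changes
    between training runs. Hypothetical helper for illustration."""
    if not mean_losses:
        return []
    initial = mean_losses[0]
    return [loss / initial for loss in mean_losses]
```

For example, `relative_val_loss([2.0, 1.0, 0.5])` gives `[1.0, 0.5, 0.25]`: the loss halved, then quartered, regardless of its absolute scale.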

@rockerBOO
Contributor

This looks great!

What are you using to format the code? I've been formatting manually, but it might be easier to stay aligned if I use the same formatting tool.

@kohya-ss
Owner Author

I would also add that once this is in place, there won't be a need for a moving average to track the validation loss. Using consistent timesteps and noise will make it almost entirely stable, so displaying the mean of the validation loss for each validation run should be all that's needed.

That makes sense. Currently, there is a problem viewing logs in TensorBoard, but I would like to at least get the mean of the validation loss to be displayed correctly.

What are you using to format the code? I've been formatting manually, but it might be easier to stay aligned if I use the same formatting tool.

For formatting, I use black with the --line-length=132 option. I would like to at least provide a guideline on this.
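For anyone wanting to match this from an editor, the same setting can presumably be pinned in pyproject.toml (assuming black's standard configuration key):

```toml
[tool.black]
line-length = 132
```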

@gesen2egee
Contributor

gesen2egee commented Jan 28, 2025

It seems that a correction for timestep sampling works better (I previously used debiased 1/√SNR, which is similar in spirit).
Perhaps averaging wouldn't be necessary in that case.

Additionally, I have some thoughts on the args.
For validation_split, how about interpreting values greater than 1 as the number of validation samples?
This would be more convenient.
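A sketch of the proposed dual interpretation (hypothetical helper, not the actual sd-scripts argument handling):

```python
def resolve_validation_split(validation_split, dataset_size):
    """Interpret validation_split as proposed: values in (0, 1) are a
    fraction of the dataset, values >= 1 are an absolute sample count.
    Hypothetical helper for illustration."""
    if validation_split <= 0:
        return 0
    if validation_split < 1:
        # Fractional split, e.g. 0.1 -> 10% of the dataset.
        return int(dataset_size * validation_split)
    # Absolute count, capped at the dataset size.
    return min(int(validation_split), dataset_size)
```

So `resolve_validation_split(0.1, 100)` yields 10 samples, while `resolve_validation_split(16, 100)` yields exactly 16.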

@gesen2egee
Contributor

gesen2egee commented Jan 28, 2025

[attached image]

https://github.com/spacepxl/demystifying-sd-finetuning
Here's a suggestion for a function that, while not perfect, can normalize the losses across different timesteps to the same magnitude. I believe this approach is more reliable.
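As a rough illustration of the idea (not the exact function from that repo, which fits a model-specific curve), one could fit a baseline E[loss | t] from logged (timestep, loss) pairs and divide each loss by it:

```python
import numpy as np

def fit_baseline(timesteps, losses, degree=4):
    """Fit a polynomial baseline E[loss | t] from logged (timestep, loss)
    pairs. A simple stand-in for the per-timestep normalizer discussed
    above; the referenced repo's fitted curve differs in details."""
    coeffs = np.polyfit(timesteps, losses, degree)
    return np.poly1d(coeffs)

def normalize_losses(timesteps, losses, baseline):
    """Scale each loss by the expected loss at its timestep, bringing
    losses from different timesteps to a comparable magnitude."""
    return np.asarray(losses) / baseline(np.asarray(timesteps))
```

After normalization, a well-fit baseline maps the typical loss at every timestep to roughly 1.0, so deviations become visible regardless of where in the schedule they occur.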

@kohya-ss
Owner Author

Here's a suggestion for a function that, while not perfect, can normalize the losses across different timesteps to the same magnitude. I believe this approach is more reliable.

That makes some sense.
However, I believe that users already apply timestep weighting where necessary, for example Min-SNR gamma or debiased estimation.
Also, the validation loss should be computed the same way as the training loss, so I think no additional correction should be necessary.
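For context, the Min-SNR gamma weighting mentioned here (Hang et al. 2023, for epsilon prediction) is just a clamp on the SNR-derived weight; a sketch, not sd-scripts' actual implementation:

```python
def min_snr_gamma_weight(snr, gamma=5.0):
    """Min-SNR-gamma loss weight for epsilon prediction: clamp the SNR so
    high-SNR (low-noise) timesteps don't dominate the loss. Sketch for
    illustration; the real implementation handles batched tensors."""
    return min(snr, gamma) / snr
```

Timesteps with SNR below gamma keep weight 1.0; above gamma the weight falls off as gamma/SNR.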

For validation_split, how about interpreting values greater than 1 as the number of validation samples?

Although it gives a single setting value multiple meanings, it is worth considering.

@spacepxl

spacepxl commented Jan 28, 2025

@gesen2egee you would need a different fit equation for each new model, and it's not really relevant when you make validation fully deterministic. I've tried applying it to training loss and it was extremely harmful.

You can also visualize the raw training loss by plotting it like so:

[attached image: scatter plot of training loss vs. timestep, colored by training step]

That was done by storing all loss and timestep values and coloring the points by training step. Not sure if there's a way to do that natively in tensorboard/wandb; I did this with matplotlib and just logged it as an image.
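A sketch of that plotting approach (assumed helper, not code from this PR): scatter raw per-sample losses against their timesteps, color by training step, and render to PNG bytes that can be logged as an image.

```python
import io

import matplotlib
matplotlib.use("Agg")  # headless backend, render to a buffer instead of a window
import matplotlib.pyplot as plt

def plot_loss_vs_timestep(timesteps, losses, steps):
    """Scatter raw per-sample training loss against timestep, colored by
    training step, and return the figure as PNG bytes (e.g. for logging
    to tensorboard/wandb as an image)."""
    fig, ax = plt.subplots(figsize=(6, 4))
    sc = ax.scatter(timesteps, losses, c=steps, cmap="viridis", s=4, alpha=0.5)
    fig.colorbar(sc, ax=ax, label="training step")
    ax.set_xlabel("timestep")
    ax.set_ylabel("loss")
    buf = io.BytesIO()
    fig.savefig(buf, format="png")
    plt.close(fig)
    return buf.getvalue()
```

The returned bytes are a complete PNG, so they can be passed straight to an image logger or written to disk.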

@rockerBOO
Contributor

File "/mnt/900/builds/sd-scripts/library/train_util.py", line 5968, in get_timesteps
timesteps = torch.randint(min_timestep, max_timestep, (b_size,), device="cpu")
RuntimeError: random_ expects 'from' to be less than 'to', but got from=200 >= to=200

In get_timesteps, maybe:

if min_timestep < max_timestep:
    timesteps = torch.randint(min_timestep, max_timestep, (b_size,), device="cpu")
else:
    # randint requires from < to; when the range is empty, use the fixed value.
    # torch.full keeps the integer dtype, matching randint's int64 output.
    timesteps = torch.full((b_size,), min_timestep, dtype=torch.long, device="cpu")

I know this isn't complete, but I tried it anyway.
