Previously, the code had a bug: I used `alpha_bars = alphas[:]`, and since slicing a tensor returns a view rather than a copy, modifying `alpha_bars` in place also changed `alphas`. I have now fixed it.
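For the record, a minimal sketch of the pitfall, assuming PyTorch tensors (NumPy arrays behave the same way): in-place edits to the slice propagate back to the original schedule.

```python
import torch

alphas = 1.0 - torch.linspace(1e-4, 0.02, 1000)  # a linear beta schedule, just for illustration
alpha_bars = alphas[:]       # BUG: slicing returns a view, not a copy
alpha_bars[0] = 0.5          # this silently mutates alphas[0] too
assert alphas[0] == 0.5      # the original schedule is now corrupted

# Fix: build alpha_bars as an independent tensor
alphas = 1.0 - torch.linspace(1e-4, 0.02, 1000)
alpha_bars = torch.cumprod(alphas, dim=0)  # alphas is left untouched
```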
I found that the keys to implementing a U-Net for diffusion are:
- a proper noise schedule (the cosine schedule is a good choice; see the first sketch after this list)
- a large model (in contrast to VAEs, which can be small), since the model is learning a harder task
- residual connections and attention (itself wrapped in a residual connection). Remember to initialize each residual branch so the block starts as an identity function (see the second sketch after this list)
- a small learning rate and a long training run (~100 epochs)
- in case of training instabilities, try LayerNorm
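First, a sketch of the cosine schedule from Nichol & Dhariwal's "Improved Denoising Diffusion Probabilistic Models" (the offset `s` and the 0.999 clipping of beta follow the paper; the function name is my own):

```python
import math
import torch

def cosine_schedule(T: int, s: float = 0.008):
    """Return (alphas, alpha_bars, betas) for T steps of the cosine schedule."""
    t = torch.arange(T + 1, dtype=torch.float64)
    f = torch.cos((t / T + s) / (1 + s) * math.pi / 2) ** 2
    alpha_bars = f / f[0]                                  # alpha_bar_0 = 1 by construction
    betas = (1 - alpha_bars[1:] / alpha_bars[:-1]).clamp(max=0.999)  # clip to avoid a singularity at the end
    alphas = 1.0 - betas
    return alphas, alpha_bars[1:], betas
```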
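Second, for the identity initialization, one common trick (used, e.g., in OpenAI's guided-diffusion code) is to zero the weights of the last layer in each residual branch, so the branch outputs zero at initialization and the block reduces to `x + 0 = x`. A sketch assuming PyTorch; the normalization and channel choices here are illustrative, not my exact architecture:

```python
import torch
import torch.nn as nn

def zero_init(module: nn.Module) -> nn.Module:
    """Zero a module's parameters so its residual branch starts as a no-op."""
    for p in module.parameters():
        nn.init.zeros_(p)
    return module

class ResBlock(nn.Module):
    """Residual block whose output equals its input at initialization."""
    def __init__(self, channels: int):   # channels assumed divisible by 8
        super().__init__()
        self.norm1 = nn.GroupNorm(8, channels)
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.norm2 = nn.GroupNorm(8, channels)
        # Zero-initializing the final conv makes the whole block an identity map.
        self.conv2 = zero_init(nn.Conv2d(channels, channels, 3, padding=1))

    def forward(self, x):
        h = self.conv1(torch.relu(self.norm1(x)))
        h = self.conv2(torch.relu(self.norm2(h)))
        return x + h
```

The same idea applies to the attention blocks: zero-initialize their output projection so attention also starts as an identity.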
The final model isn't that good, with an unconditional FID of 33. It has 2M parameters and takes a morning to train for 90 epochs on a V100 GPU. The loss was still decreasing, but due to computational constraints I didn't want to train it further.
Diffusion:
Samples:
The loss curve (loss vs. diffusion time step) unfortunately got lost on the remote server. Generally, with the cosine schedule, you should expect the loss to be high only in the first ~30 steps, stay stable in the middle, and get near zero at the end (steps 900-1000).
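To reproduce that plot, here is a hypothetical sketch of binning the training loss by timestep (all names are mine; it assumes T = 1000 and a per-sample MSE loss):

```python
import torch

T, num_buckets = 1000, 20
loss_sum = torch.zeros(num_buckets)
count = torch.zeros(num_buckets)

def log_per_timestep_loss(t: torch.Tensor, per_sample_loss: torch.Tensor):
    """t: (B,) integer timesteps in [0, T); per_sample_loss: (B,) MSE values."""
    bucket = t * num_buckets // T                 # map each timestep to a bucket
    loss_sum.index_add_(0, bucket, per_sample_loss.detach().float())
    count.index_add_(0, bucket, torch.ones(len(t)))

# After an epoch, loss_sum / count is the mean loss per timestep bucket.
```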