[BUG] FSDP error: Could not find the transformer layer class SanaBlock in the model #129

Open
Pevernow opened this issue Jan 5, 2025 · 8 comments
Labels
Answered Answered the question

Comments

@Pevernow
Contributor

Pevernow commented Jan 5, 2025

2025-01-05 17:35:47 - [Sana] - INFO - Load checkpoint from /home/linjl/zzc/Sana_1600M_1024px_MultiLing.pth. Load ema: False.
2025-01-05 17:35:47 - [Sana] - WARNING - Missing keys: ['pos_embed']
[rank0]: Traceback (most recent call last):
[rank0]:   File "/home/linjl/zzc/Sana-pev/train_scripts/train_local.py", line 958, in <module>
[rank0]:     main()
[rank0]:   File "/home/linjl/anaconda3/envs/sana/lib/python3.10/site-packages/pyrallis/argparsing.py", line 158, in wrapper_inner
[rank0]:     response = fn(cfg, *args, **kwargs)
[rank0]:   File "/home/linjl/zzc/Sana-pev/train_scripts/train_local.py", line 942, in main
[rank0]:     model = accelerator.prepare(model)
[rank0]:   File "/home/linjl/anaconda3/envs/sana/lib/python3.10/site-packages/accelerate/accelerator.py", line 1339, in prepare
[rank0]:     result = tuple(
[rank0]:   File "/home/linjl/anaconda3/envs/sana/lib/python3.10/site-packages/accelerate/accelerator.py", line 1340, in <genexpr>
[rank0]:     self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
[rank0]:   File "/home/linjl/anaconda3/envs/sana/lib/python3.10/site-packages/accelerate/accelerator.py", line 1215, in _prepare_one
[rank0]:     return self.prepare_model(obj, device_placement=device_placement)
[rank0]:   File "/home/linjl/anaconda3/envs/sana/lib/python3.10/site-packages/accelerate/accelerator.py", line 1487, in prepare_model
[rank0]:     self.state.fsdp_plugin.set_auto_wrap_policy(model)
[rank0]:   File "/home/linjl/anaconda3/envs/sana/lib/python3.10/site-packages/accelerate/utils/dataclasses.py", line 1754, in set_auto_wrap_policy
[rank0]:     raise ValueError(f"Could not find the transformer layer class {layer_class} in the model.")
[rank0]: ValueError: Could not find the transformer layer class SanaBlock in the model.
[rank0]:[W105 17:35:48.866070663 ProcessGroupNCCL.cpp:1168] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present,  but this warning has only been added since PyTorch 2.4 (function operator())
[rank3]: Traceback (most recent call last):
[rank3]:   File "/home/linjl/zzc/Sana-pev/train_scripts/train_local.py", line 958, in <module>
[rank3]:     main()
[rank3]:   File "/home/linjl/anaconda3/envs/sana/lib/python3.10/site-packages/pyrallis/argparsing.py", line 158, in wrapper_inner
[rank3]:     response = fn(cfg, *args, **kwargs)
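
One way to narrow down this ValueError: the message comes from accelerate's set_auto_wrap_policy, which appears to look for a submodule whose class name equals the configured transformer layer class (here SanaBlock). The sketch below lists the class names actually present in a model so they can be compared with the configured name. It is only an illustration: `_FakeModule`, `OtherBlock`, and `ToyModel` are made-up stand-ins so the snippet runs without PyTorch, and `module_class_names` is a hypothetical helper, not part of accelerate or Sana. It should work on a real torch.nn.Module as well, since it only relies on `children()`.

```python
from collections import Counter


def module_class_names(model):
    """Count the class names of `model` and all of its submodules.

    Works on any object that exposes `children()` the way torch.nn.Module does.
    """
    counts = Counter()
    stack = [model]
    while stack:
        module = stack.pop()
        counts[type(module).__name__] += 1
        stack.extend(module.children())
    return counts


class _FakeModule:
    """Hypothetical stand-in for torch.nn.Module (assumption, for illustration)."""

    def __init__(self, *children):
        self._children = list(children)

    def children(self):
        return iter(self._children)


class OtherBlock(_FakeModule):  # made-up block class, not a real Sana class
    pass


class ToyModel(_FakeModule):  # made-up top-level model
    pass


model = ToyModel(OtherBlock(), OtherBlock())
names = module_class_names(model)

# If the configured layer class is missing from this mapping, a lookup by
# class name cannot succeed and an error like the one above is expected.
configured = "SanaBlock"
print(dict(names))
print(f"{configured} present: {configured in names}")
```

If the configured name is missing from the printed mapping, the FSDP wrap setting and the model's block classes disagree. Whether fixing that name alone is enough to make FSDP work with Sana is not something this thread confirms.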

The accelerate test command already passes.

The environment was prepared following this project's official setup steps.

Hardware: 4x RTX 3090.

How to reproduce:

bash train_scripts/train.sh   configs/sana_config/1024ms/Sana_1600M_img1024.yaml   --data.data_dir="[asset/example_data]"   --data.type=SanaImgDataset   --model.load_from=/home/linjl/zzc/Sana_1600M_1024px_MultiLing.pth   --model.multi_scale=false   --train.train_batch_size=4 --train.use_fsdp=true
@lawrence-cj
Collaborator

Unfortunately, FSDP is not supported for now.

@lawrence-cj added the Answered label Jan 5, 2025
@Pevernow
Contributor Author

Pevernow commented Jan 5, 2025

@lawrence-cj
Can FSDP be added to the to-do list?
There already seems to be an FSDP implementation in the repository, although it looks incomplete.
I am looking forward to this feature.

It is hard to imagine that 4x RTX 3090 cannot train Sana, which only needs 32 GB of video memory.

Thank you very much.
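
A rough back-of-the-envelope sketch of why sharding matters here. The assumptions are mine, not from the maintainers: about 1.6e9 parameters (read off the checkpoint name Sana_1600M), plain fp32 AdamW at roughly 16 bytes per parameter (4 for weights, 4 for gradients, 8 for the two optimizer moments), and no accounting for activations, the VAE, or the text encoder. `training_state_gb` is a hypothetical helper used only for this estimate.

```python
def training_state_gb(num_params, bytes_per_param=16, num_shards=1):
    """Estimate per-GPU memory (in GB, 1 GB = 1e9 bytes) for weights,
    gradients, and AdamW optimizer state, split evenly over `num_shards`."""
    return num_params * bytes_per_param / num_shards / 1e9


NUM_PARAMS = 1.6e9  # assumption: taken from the "1600M" in the checkpoint name
GPU_MEMORY_GB = 24  # RTX 3090

# DDP keeps a full replica of this state on every GPU.
ddp_gb = training_state_gb(NUM_PARAMS)
# Full sharding across 4 GPUs splits the same state four ways.
fsdp_gb = training_state_gb(NUM_PARAMS, num_shards=4)

print(f"DDP, per GPU:          {ddp_gb:.1f} GB (fits in 24 GB: {ddp_gb < GPU_MEMORY_GB})")
print(f"FSDP x4, per GPU:      {fsdp_gb:.1f} GB (fits in 24 GB: {fsdp_gb < GPU_MEMORY_GB})")
```

Under these assumptions a full replica per GPU already exceeds a single 24 GB card before activations, while a four-way shard leaves headroom. Mixed precision or an 8-bit optimizer would change the bytes-per-parameter figure, so treat the numbers as an order-of-magnitude guide only.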

@lawrence-cj
Collaborator

We aim to use FSDP for the Sana-Video version, but we don't have a specific timeline for it. Will try our best.

@Pevernow
Contributor Author

Pevernow commented Jan 5, 2025

> We aim to use FSDP for the Sana-Video version, but we don't have a specific timeline for it. Will try our best.

@lawrence-cj Thanks. Does Sana support offloading to CPU while training with DDP?

(No matter what, I have to rescue these poor RTX 3090s; they have been idle for a week.)

@lawrence-cj
Collaborator

Understood. Which part of Sana are you talking about offloading to CPU?

@Pevernow
Contributor Author

Pevernow commented Jan 5, 2025

> Understood. Which part of Sana are you talking about offloading to CPU?

@lawrence-cj
Perhaps it could be user-selectable: the VAE or the text encoder?
This partly depends on which part you think would cost less performance when offloaded.
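
To make the offloading idea concrete, here is a minimal sketch of keeping a frozen component (such as the VAE or the text encoder) on the CPU and moving it to the GPU only while it is needed. This is not Sana's API and nothing in this thread confirms that the training scripts support it. `temporarily_on` is a hypothetical helper, and `_FakeEncoder` is a made-up stand-in so the snippet runs without PyTorch; a real torch.nn.Module exposes the same `.to(device)` call.

```python
from contextlib import contextmanager


@contextmanager
def temporarily_on(module, device, home="cpu"):
    """Move `module` to `device` for the duration of the block, then back to `home`."""
    module.to(device)
    try:
        yield module
    finally:
        # Always move back, even if the body raised, so GPU memory is released.
        module.to(home)


class _FakeEncoder:
    """Hypothetical stand-in for a torch.nn.Module; it only records device moves."""

    def __init__(self):
        self.device = "cpu"
        self.history = []

    def to(self, device):
        self.device = device
        self.history.append(device)
        return self


text_encoder = _FakeEncoder()

with temporarily_on(text_encoder, "cuda:0") as enc:
    # In real training, the prompt batch would be encoded here.
    device_during = enc.device

print("inside block:", device_during)
print("after block: ", text_encoder.device)
```

The trade-off is transfer time on every use. For frozen encoders, an alternative is to precompute and cache their outputs once, which avoids the repeated host-to-device copies.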

@Pevernow
Contributor Author

Pevernow commented Jan 6, 2025

@lawrence-cj
https://github.com/frutiemax92/YAT
This repo provides an example of training Sana with offloading.

@Pevernow
Contributor Author

@lawrence-cj It's urgent, thank you.


2 participants