OOM when running python predict_i2v.py #2

Open
397688551 opened this issue Dec 18, 2024 · 12 comments

Comments

@397688551

The error message is as follows:
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 109.01 GiB. GPU 0 has a total capacity of 47.45 GiB of which 27.51 GiB is free. Process 3256 has 19.94 GiB memory in use. Of the allocated memory 18.77 GiB is allocated by PyTorch, and 421.82 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management
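The allocator hint in the message can be applied at launch time. Note this only mitigates fragmentation; it will not help when a single 109 GiB allocation is genuinely being requested:

```shell
# Opt in to PyTorch's expandable-segments CUDA allocator to reduce
# fragmentation, then run the script as usual.
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True python predict_i2v.py
```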

@cellzero
Collaborator

Could you tell me what resolution you are generating? "Tried to allocate 109.01 GiB" seems far too large.

The predict_i2v.py in this repository should run within 24 GiB of VRAM.

@397688551
Author


I didn't change anything; I just downloaded the models and ran predict_i2v.py directly, so it should be using the defaults, right?

@cellzero
Collaborator

cellzero commented Dec 18, 2024

Yes, predict_i2v.py uses the default settings.

That's quite strange. Normally each allocation is only around 500 MiB to 1.2 GiB; I can't imagine where 109 GiB would be requested.

Could you share your runtime environment details? It might be a PyTorch version issue; it runs fine on my side with 2.5.1.

@397688551
Author


Here is the conda environment:
(ruyi) root@aistudio-121620-prod-0:/aistudio# pip list
Package Version
------------------------ -----------
absl-py 2.1.0
accelerate 1.2.1
aiohappyeyeballs 2.4.4
aiohttp 3.11.10
aiosignal 1.3.2
albucore 0.0.21
albumentations 1.4.22
annotated-types 0.7.0
antlr4-python3-runtime 4.9.3
async-timeout 5.0.1
attrs 24.3.0
av 14.0.1
beautifulsoup4 4.12.3
certifi 2024.12.14
charset-normalizer 3.4.0
datasets 3.2.0
decord 0.6.0
diffusers 0.31.0
dill 0.3.8
einops 0.8.0
eval_type_backport 0.2.0
filelock 3.16.1
frozenlist 1.5.0
fsspec 2024.9.0
ftfy 6.3.1
func_timeout 4.3.5
grpcio 1.68.1
huggingface-hub 0.27.0
idna 3.10
imageio 2.36.1
imageio-ffmpeg 0.5.1
importlib_metadata 8.5.0
Jinja2 3.1.4
lazy_loader 0.4
Markdown 3.7
MarkupSafe 3.0.2
mpmath 1.3.0
multidict 6.1.0
multiprocess 0.70.16
networkx 3.4.2
numpy 2.2.0
nvidia-cublas-cu12 12.4.5.8
nvidia-cuda-cupti-cu12 12.4.127
nvidia-cuda-nvrtc-cu12 12.4.127
nvidia-cuda-runtime-cu12 12.4.127
nvidia-cudnn-cu12 9.1.0.70
nvidia-cufft-cu12 11.2.1.3
nvidia-curand-cu12 10.3.5.147
nvidia-cusolver-cu12 11.6.1.9
nvidia-cusparse-cu12 12.3.1.170
nvidia-nccl-cu12 2.21.5
nvidia-nvjitlink-cu12 12.4.127
nvidia-nvtx-cu12 12.4.127
omegaconf 2.3.0
opencv-python 4.10.0.84
opencv-python-headless 4.10.0.84
packaging 24.2
pandas 2.2.3
pillow 11.0.0
pip 24.2
propcache 0.2.1
protobuf 5.29.1
psutil 6.1.0
pyarrow 18.1.0
pydantic 2.10.3
pydantic_core 2.27.1
python-dateutil 2.9.0.post0
pytz 2024.2
PyYAML 6.0.2
regex 2024.11.6
requests 2.32.3
safetensors 0.4.5
scikit-image 0.25.0
scipy 1.14.1
sentencepiece 0.2.0
setuptools 75.1.0
simsimd 6.2.1
six 1.17.0
soupsieve 2.6
stringzilla 3.11.1
sympy 1.13.1
tensorboard 2.18.0
tensorboard-data-server 0.7.2
tifffile 2024.12.12
timm 1.0.12
tokenizers 0.21.0
tomesd 0.1.3
torch 2.5.1
torchdiffeq 0.2.5
torchsde 0.2.6
torchvision 0.20.1
tqdm 4.67.1
trampoline 0.1.2
transformers 4.47.0
triton 3.1.0
typing_extensions 4.12.2
tzdata 2024.2
urllib3 2.2.3
wcwidth 0.2.13
Werkzeug 3.1.3
wheel 0.44.0
xxhash 3.5.0
yarl 1.18.3
zipp 3.21.0

[screenshot attachment]

@397688551
Author


Here is the full error:

python predict_i2v.py

Vae loaded ...
Transformer loaded ...
Loading pipeline components...: 0it [00:00, ?it/s]
Pipeline loaded ...
0%| | 0/25 [00:00<?, ?it/s]
Traceback (most recent call last):
File "/aistudio/workspace/task/Ruyi-Models/predict_i2v.py", line 230, in <module>
sample = pipeline(
File "/aistudio/workspace/system-default/envs/ruyi/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
File "/aistudio/workspace/task/Ruyi-Models/ruyi/pipeline/pipeline_ruyi_inpaint.py", line 984, in __call__
noise_pred = self.transformer(
File "/aistudio/workspace/system-default/envs/ruyi/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/aistudio/workspace/system-default/envs/ruyi/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
File "/aistudio/workspace/system-default/envs/ruyi/lib/python3.10/site-packages/accelerate/hooks.py", line 170, in new_forward
output = module._old_forward(*args, **kwargs)
File "/aistudio/workspace/task/Ruyi-Models/ruyi/models/transformer3d.py", line 1147, in forward
hidden_states = block(
File "/aistudio/workspace/system-default/envs/ruyi/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/aistudio/workspace/system-default/envs/ruyi/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
File "/aistudio/workspace/task/Ruyi-Models/ruyi/models/attention.py", line 1595, in forward
attn_output = self.attn1(
File "/aistudio/workspace/system-default/envs/ruyi/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/aistudio/workspace/system-default/envs/ruyi/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
File "/aistudio/workspace/system-default/envs/ruyi/lib/python3.10/site-packages/diffusers/models/attention_processor.py", line 495, in forward
return self.processor(
File "/aistudio/workspace/system-default/envs/ruyi/lib/python3.10/site-packages/diffusers/models/attention_processor.py", line 2612, in __call__
hidden_states = F.scaled_dot_product_attention(
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 109.01 GiB. GPU 0 has a total capacity of 47.45 GiB of which 27.52 GiB is free. Process 26992 has 19.94 GiB memory in use. Of the allocated memory 18.77 GiB is allocated by PyTorch, and 421.82 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

@cellzero
Collaborator

cellzero commented Dec 18, 2024

Thanks for the information. Apart from the CUDA version, our environments are basically identical. The traceback shows that PyTorch's scaled_dot_product_attention is where it fails, so my guess is that one of its accelerated backends is incompatible with your CUDA version or GPU.

Could you try running with the acceleration disabled? These calls turn off the accelerated backends:

torch.backends.cuda.enable_flash_sdp(False)
torch.backends.cuda.enable_mem_efficient_sdp(False)
torch.backends.cuda.enable_math_sdp(False)

You can disable one at a time to see which configuration gets through.

@397688551
Author


I gave it a try. This is how I added them; the placement should be fine, right?
[screenshot attachment]

The result so far: the first two make no difference, still OOM:
torch.backends.cuda.enable_flash_sdp(False)
torch.backends.cuda.enable_mem_efficient_sdp(False)

The last one raises a different error:
torch.backends.cuda.enable_math_sdp(False)
The error is:
Vae loaded ...
Transformer loaded ...
Loading pipeline components...: 0it [00:00, ?it/s]
Pipeline loaded ...
/aistudio/workspace/system-default/envs/ruyi/lib/python3.10/site-packages/transformers/models/clip/modeling_clip.py:540: UserWarning: Memory efficient kernel not used because: (Triggered internally at ../aten/src/ATen/native/transformers/cuda/sdp_utils.cpp:773.)
attn_output = torch.nn.functional.scaled_dot_product_attention(
/aistudio/workspace/system-default/envs/ruyi/lib/python3.10/site-packages/transformers/models/clip/modeling_clip.py:540: UserWarning: Expected query, key and value to all be of dtype: {Half, Float}. Got Query dtype:c10::BFloat16, Key dtype: c10::BFloat16, and Value dtype: c10::BFloat16 instead. (Triggered internally at ../aten/src/ATen/native/transformers/sdp_utils_cpp.h:100.)
attn_output = torch.nn.functional.scaled_dot_product_attention(
/aistudio/workspace/system-default/envs/ruyi/lib/python3.10/site-packages/transformers/models/clip/modeling_clip.py:540: UserWarning: Flash attention kernel not used because: (Triggered internally at ../aten/src/ATen/native/transformers/cuda/sdp_utils.cpp:775.)
attn_output = torch.nn.functional.scaled_dot_product_attention(
/aistudio/workspace/system-default/envs/ruyi/lib/python3.10/site-packages/transformers/models/clip/modeling_clip.py:540: UserWarning: Flash attention only supports gpu architectures in the range [sm80, sm90]. Attempting to run on a sm 7.5 gpu. (Triggered internally at ../aten/src/ATen/native/transformers/cuda/sdp_utils.cpp:235.)
attn_output = torch.nn.functional.scaled_dot_product_attention(
/aistudio/workspace/system-default/envs/ruyi/lib/python3.10/site-packages/transformers/models/clip/modeling_clip.py:540: UserWarning: CuDNN attention kernel not used because: (Triggered internally at ../aten/src/ATen/native/transformers/cuda/sdp_utils.cpp:777.)
attn_output = torch.nn.functional.scaled_dot_product_attention(
/aistudio/workspace/system-default/envs/ruyi/lib/python3.10/site-packages/transformers/models/clip/modeling_clip.py:540: UserWarning: Expected query, key and value to all be of dtype: {Half}. Got Query dtype: c10::BFloat16, Key dtype: c10::BFloat16, and Value dtype: c10::BFloat16 instead. (Triggered internally at ../aten/src/ATen/native/transformers/sdp_utils_cpp.h:100.)
attn_output = torch.nn.functional.scaled_dot_product_attention(
Traceback (most recent call last):
File "/aistudio/workspace/task/Ruyi-Models/predict_i2v.py", line 237, in <module>
sample = pipeline(
File "/aistudio/workspace/system-default/envs/ruyi/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
File "/aistudio/workspace/task/Ruyi-Models/ruyi/pipeline/pipeline_ruyi_inpaint.py", line 785, in __call__
clip_encoder_hidden_states = self.clip_image_encoder(**inputs).last_hidden_state[:, 1:]
File "/aistudio/workspace/system-default/envs/ruyi/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/aistudio/workspace/system-default/envs/ruyi/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
File "/aistudio/workspace/system-default/envs/ruyi/lib/python3.10/site-packages/accelerate/hooks.py", line 170, in new_forward
output = module._old_forward(*args, **kwargs)
File "/aistudio/workspace/system-default/envs/ruyi/lib/python3.10/site-packages/transformers/models/clip/modeling_clip.py", line 1555, in forward
vision_outputs = self.vision_model(
File "/aistudio/workspace/system-default/envs/ruyi/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/aistudio/workspace/system-default/envs/ruyi/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
File "/aistudio/workspace/system-default/envs/ruyi/lib/python3.10/site-packages/transformers/models/clip/modeling_clip.py", line 1097, in forward
encoder_outputs = self.encoder(
File "/aistudio/workspace/system-default/envs/ruyi/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/aistudio/workspace/system-default/envs/ruyi/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
File "/aistudio/workspace/system-default/envs/ruyi/lib/python3.10/site-packages/transformers/models/clip/modeling_clip.py", line 877, in forward
layer_outputs = encoder_layer(
File "/aistudio/workspace/system-default/envs/ruyi/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/aistudio/workspace/system-default/envs/ruyi/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
File "/aistudio/workspace/system-default/envs/ruyi/lib/python3.10/site-packages/transformers/models/clip/modeling_clip.py", line 608, in forward
hidden_states, attn_weights = self.self_attn(
File "/aistudio/workspace/system-default/envs/ruyi/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/aistudio/workspace/system-default/envs/ruyi/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
File "/aistudio/workspace/system-default/envs/ruyi/lib/python3.10/site-packages/transformers/models/clip/modeling_clip.py", line 540, in forward
attn_output = torch.nn.functional.scaled_dot_product_attention(
RuntimeError: No available kernel. Aborting execution.

With all three set at the same time, the error is the same as above. So it looks like the last one resolves the OOM but introduces a new problem.

@cellzero
Collaborator

cellzero commented Dec 18, 2024

The placement is fine. It looks like scaled_dot_product_attention can only fall back to math_sdp on your machine; the other backends are unavailable because of unsupported data types or GPU architecture, and math_sdp requests an enormous amount of memory (109 GiB), which causes the OOM.
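The math backend materializes the full attention-score matrix, so its memory grows with the square of the sequence length. A back-of-the-envelope estimate shows how a video-sized token sequence reaches this scale; every number below is an illustrative guess, not a value taken from predict_i2v.py:

```python
# Rough memory estimate for math-backend SDPA, which materializes the
# full (seq_len x seq_len) attention-score matrix for every head.
# All values here are hypothetical, chosen only to illustrate the scale.
batch = 2         # e.g. classifier-free guidance doubles the batch
heads = 24        # hypothetical attention head count
seq_len = 24_000  # video latents: frames * height * width tokens
bytes_per = 4     # fp32 attention scores

attn_bytes = batch * heads * seq_len ** 2 * bytes_per
gib = attn_bytes / 2 ** 30
print(f"{gib:.1f} GiB")  # on the order of 100 GiB, matching the OOM scale
```

The flash and memory-efficient backends avoid this by computing attention in tiles, which is why they normally only allocate a few hundred MiB per step.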

It seems the Quadro RTX 8000 does not support bfloat16 ( pytorch/pytorch#67682 (comment) ), while the current model only supports bfloat16.
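The architecture cutoff can be checked without running the whole pipeline. A small sketch (the helper name is mine; the (major, minor) tuple is the shape returned by torch.cuda.get_device_capability(), e.g. (7, 5) on a Quadro RTX 8000 or T4):

```python
def supports_bf16(major: int, minor: int) -> bool:
    """bfloat16 CUDA kernels generally require Ampere (sm80) or newer."""
    return (major, minor) >= (8, 0)

# Turing (Quadro RTX 8000 / T4) is sm75, Volta V100 is sm70: no bf16.
print(supports_bf16(7, 5))  # False
print(supports_bf16(8, 0))  # True
```

On a live machine, torch.cuda.is_bf16_supported() gives the same answer directly.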

Do you have access to a newer GPU that could run the model?

@397688551
Author


V100-PCIE-16GB and Tesla T4 both have 16 GB of VRAM; I just tried them and neither can run it. [facepalm]

@cellzero
Collaborator

cellzero commented Dec 18, 2024

Is it a GPU OOM?

The default predict_i2v.py needs about 22 GB of VRAM. You could try setting low_gpu_memory_mode = True; generation will be slower, but it should be able to run.
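Presumably this is a top-level variable in predict_i2v.py; the flag name comes from the maintainer's comment, and the exact location in the script is my assumption:

```python
# In predict_i2v.py: trade generation speed for memory by keeping idle
# submodules off the GPU, lowering peak VRAM during sampling.
low_gpu_memory_mode = True
```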

I'm currently testing whether the model can run in float16, but I've hit some issues that may need fixing. If float16 works out, I'll leave another comment.

@397688551
Author


Thanks for all your help! I'll try low_gpu_memory_mode = True tomorrow.

@cellzero
Copy link
Collaborator

I tested float16 and it appears to run into floating-point overflow/precision issues, so float16 doesn't seem usable for now.

I'd suggest trying low_gpu_memory_mode = True first and seeing whether it runs.
