add hipBlas support #94
This is the performance of an AMD 7900 XTX on Windows:
ggml_init_cublas: GGML_CUDA_FORCE_MMQ: no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 ROCm devices:
Device 0: AMD Radeon RX 7900 XTX, compute capability 11.0
[INFO] stable-diffusion.cpp:5386 - loading model from './models/miniSD.ckpt'
[INFO] model.cpp:641 - load ./models/miniSD.ckpt using checkpoint format
[INFO] stable-diffusion.cpp:5412 - Stable Diffusion 1.x
[INFO] stable-diffusion.cpp:5418 - Stable Diffusion weight type: f32
[INFO] stable-diffusion.cpp:5577 - total memory buffer size = 2731.37MB (clip 470.66MB, unet 2165.24MB, vae 95.47MB)
[INFO] stable-diffusion.cpp:5579 - loading model from './models/miniSD.ckpt' completed, taking 10.19s
[INFO] stable-diffusion.cpp:5593 - running in eps-prediction mode
[INFO] stable-diffusion.cpp:6486 - apply_loras completed, taking 0.00s
[INFO] stable-diffusion.cpp:6525 - get_learned_condition completed, taking 59 ms
[INFO] stable-diffusion.cpp:6535 - sampling using Euler A method
[INFO] stable-diffusion.cpp:6539 - generating image: 1/1 - seed 42
|==================================================| 20/20 - 8.84it/s
[INFO] stable-diffusion.cpp:6551 - sampling completed, taking 2.35s
[INFO] stable-diffusion.cpp:6559 - generating 1 latent images completed, taking 2.39s
[INFO] stable-diffusion.cpp:6561 - decoding 1 latents
[INFO] stable-diffusion.cpp:6571 - latent 1 decoded, taking 0.13s
[INFO] stable-diffusion.cpp:6575 - decode_first_stage completed, taking 0.13s
[INFO] stable-diffusion.cpp:6592 - txt2img completed in 2.58s
[INFO] main.cpp:538 - save result image to 'output.png'
I'll attempt to adapt to the latest version of ggml.
There is |
# Conflicts:
#	ggml
#	stable-diffusion.cpp
@leejet I think everything is ready, but unfortunately I don't have an AMD GPU at the moment. I have to wait until I get home this weekend to test it.
By the way, I have replaced ggml with the latest upstream version. I hope that people who use CUDA can also test whether there are any problems.
Thank you for your contribution. I'll find time to review and merge this PR.
Please wait while I test it; I don't have time right now.
It works fine now. When I generate a smaller image (256x256), twenty sampling steps take only 1.77 seconds, and the whole txt2img run takes 1.99 seconds:
./bin/sd -m "../models/miniSD.ckpt" -p "a cat" -H 256 -W 256
ggml_init_cublas: GGML_CUDA_FORCE_MMQ: no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 ROCm devices:
Device 0: AMD Radeon RX 7900 XTX, compute capability 11.0, VMM: no
[INFO ] stable-diffusion.cpp:137 - loading model from '../models/miniSD.ckpt'
[INFO ] model.cpp:642 - load ../models/miniSD.ckpt using checkpoint format
[INFO ] stable-diffusion.cpp:163 - Stable Diffusion 1.x
[INFO ] stable-diffusion.cpp:169 - Stable Diffusion weight type: f32
[INFO ] stable-diffusion.cpp:268 - total memory buffer size = 2749.37MB (clip 479.66MB, unet 2165.24MB, vae 104.47MB)
[INFO ] stable-diffusion.cpp:270 - loading model from '../models/miniSD.ckpt' completed, taking 9.26s
[INFO ] stable-diffusion.cpp:284 - running in eps-prediction mode
[INFO ] stable-diffusion.cpp:1182 - apply_loras completed, taking 0.00s
[INFO ] stable-diffusion.cpp:1221 - get_learned_condition completed, taking 46 ms
[INFO ] stable-diffusion.cpp:1231 - sampling using Euler A method
[INFO ] stable-diffusion.cpp:1235 - generating image: 1/1 - seed 42
|==================================================| 20/20 - 12.18it/s
[INFO ] stable-diffusion.cpp:1247 - sampling completed, taking 1.77s
[INFO ] stable-diffusion.cpp:1255 - generating 1 latent images completed, taking 1.80s
[INFO ] stable-diffusion.cpp:1257 - decoding 1 latents
[INFO ] stable-diffusion.cpp:1267 - latent 1 decoded, taking 0.14s
[INFO ] stable-diffusion.cpp:1271 - decode_first_stage completed, taking 0.14s
[INFO ] stable-diffusion.cpp:1290 - txt2img completed in 1.99s
save result image to 'output.png'
There are more features that I haven't tested yet, due to well-known network issues :)
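For anyone who wants to compare runs like the two logs in this thread, here is a minimal sketch (not part of this PR) that pulls the per-stage timings and the sampling rate out of the console output. The names `parse_timings` and `SAMPLE_LOG` are hypothetical, and the regular expressions simply assume the log format shown above; they may need adjusting if the log messages change.

```python
import re

# Assumed format: "... - <stage> completed, taking 1.77s" or "... - <stage> completed in 1.99s".
# Durations may be given in seconds ("1.77s") or milliseconds ("46 ms").
STAGE = re.compile(
    r" - (?P<stage>.+?) completed(?:, taking| in) (?P<value>[\d.]+) ?(?P<unit>ms|s)\s*$"
)
# Assumed format of the progress bar line: "|=====| 20/20 - 12.18it/s".
PROGRESS = re.compile(r"(?P<done>\d+)/(?P<total>\d+) - (?P<rate>[\d.]+)it/s")

# A few lines copied from the 256x256 run above.
SAMPLE_LOG = """\
[INFO ] stable-diffusion.cpp:1221 - get_learned_condition completed, taking 46 ms
[INFO ] stable-diffusion.cpp:1231 - sampling using Euler A method
|==================================================| 20/20 - 12.18it/s
[INFO ] stable-diffusion.cpp:1247 - sampling completed, taking 1.77s
[INFO ] stable-diffusion.cpp:1271 - decode_first_stage completed, taking 0.14s
[INFO ] stable-diffusion.cpp:1290 - txt2img completed in 1.99s
"""


def parse_timings(log_text):
    """Return ({stage: seconds}, (total_steps, iterations_per_second) or None)."""
    stages = {}
    progress = None
    for line in log_text.splitlines():
        match = STAGE.search(line)
        if match:
            seconds = float(match["value"])
            if match["unit"] == "ms":
                seconds /= 1000.0  # normalize milliseconds to seconds
            stages[match["stage"]] = seconds
            continue
        match = PROGRESS.search(line)
        if match:
            progress = (int(match["total"]), float(match["rate"]))
    return stages, progress


if __name__ == "__main__":
    stages, progress = parse_timings(SAMPLE_LOG)
    for stage, seconds in stages.items():
        print(f"{stage}: {seconds:.3f} s")
    if progress is not None:
        steps, rate = progress
        # Time implied by the progress bar rate alone; the "sampling completed"
        # figure can be larger because it covers more than the step loop.
        print(f"{steps} steps at {rate} it/s is roughly {steps / rate:.2f} s")
```

Saving the output of two runs to files and feeding each through `parse_timings` gives two dictionaries that can be compared stage by stage.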
This feature is ready to be merged. @leejet
Update: Thanks to leejet's efforts, we can use the master branch directly. I think I need to wait for the following PRs to be merged:
ggerganov/ggml#683
ggerganov/ggml#682