State of video generation in Diffusers #2592
Conversation
_blog.yml
thumbnail: /blog/assets/video_gen/thumbnail.png
date: Jan 23, 2025
Needs to be updated.
Very happy to see this in action cc: @LysandreJik for vis (when you come back)
Did a lightweight review, sorry if it's a bit premature for the state of the blog - very excited to see this finally taking shape!
- user: dn6
---

# State of open video generation models in Diffusers
It might be worth opening with a video from one of the open video models, maybe even drawing comparisons between where video generation models were a year or two back vs. now!
A good example could be the Will Smith benchmark!
- Fine-tuning
- Looking ahead

## Today’s Video Generation Models and their Limitations
Feel free to disagree, but IMO we should only keep the table here; the limitations can potentially go toward the end of the blog post, which makes it easier to read.
I do disagree. I think it's common to start with limitations to give readers fuller context.
It depends on the vibe you're going for - up to you since you're the author. To me it just feels odd to start with limitations, since even a survey paper conveys the limitations towards the end.
- Several open models suffer from limited generalization capabilities and underperform user expectations. Models may require prompting in a certain way, or LLM-like prompts, or fail to generalize to out-of-distribution data, which are hurdles for widespread user adoption. For example, models like LTX-Video often need to be prompted in a very detailed and specific way to obtain good-quality generations.
- The high computational and memory demands of video generation result in significant generation latency. For local usage, this is often a roadblock. Most new open video models are inaccessible to community hardware without extensive memory optimizations and quantization approaches that affect both inference latency and the quality of the generated videos.

## Why is Video Generation Hard?
Same for this, it would be nice to keep a positive outlook going in and then ground it towards the end.
Same as above.
video_gen.md
| [`tencent/HunyuanVideo`](https://huggingface.co/tencent/HunyuanVideo) | [Link](https://huggingface.co/tencent/HunyuanVideo/blob/main/LICENSE) |
| [`Lightricks/LTX-Video`](https://huggingface.co/Lightricks/LTX-Video) | [Link](https://huggingface.co/Lightricks/LTX-Video/blob/main/License.txt) |

### **Memory requirements**
It might be beneficial to add inference examples for all/some models that you mention here, to ground that diffusers is the place to go for inference.
Maybe even with video snippets embedded from those as well - so that people can visually experience them.
It might be beneficial to add inference examples for all/some models that you mention here, to ground that diffusers is the place to go for inference.
It will make it unnecessarily verbose. We will do some snippets but will keep it for only one model as we're already citing the docs for the other models. This is a TODO and will be addressed by @DN6.
It will make it unnecessarily verbose. We will do some snippets but will keep it for only one model as we're already citing the docs for the other models.
Not really, you can just wrap them up in `<details>` so that they are collapsed by default.
It will make it a bit redundant IMO, as the code doesn't change much. So showing it for a single model is sufficient, I think.
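For reference, a minimal sketch of what such a single-model snippet could look like, reusing the `hunyuanvideo-community/HunyuanVideo` checkpoint and the memory options from the benchmark script shared later in this thread (frame count and step count here are illustrative; the final snippet for the post is still a TODO):

```python
import torch
from diffusers import HunyuanVideoTransformer3DModel, HunyuanVideoPipeline
from diffusers.utils import export_to_video

# Load the transformer in BF16 and the rest of the pipeline in FP16,
# mirroring the benchmark script in this thread.
transformer = HunyuanVideoTransformer3DModel.from_pretrained(
    "hunyuanvideo-community/HunyuanVideo", subfolder="transformer", torch_dtype=torch.bfloat16
)
pipe = HunyuanVideoPipeline.from_pretrained(
    "hunyuanvideo-community/HunyuanVideo", transformer=transformer, torch_dtype=torch.float16
)
pipe.enable_model_cpu_offload()  # trade some speed for a much lower VRAM peak
pipe.vae.enable_tiling()         # decode the video latents tile by tile

video = pipe(
    prompt="A cat walks on the grass, realistic.",
    height=512,
    width=768,
    num_frames=61,
    num_inference_steps=30,
).frames[0]
export_to_video(video, "output.mp4", fps=24)
```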
video_gen.md
We used the same settings as above to obtain these numbers. Quantization was performed with the [`bitsandbytes` library](https://huggingface.co/docs/bitsandbytes/main/en/index) (Diffusers [supports three different quantization backends](https://huggingface.co/docs/diffusers/main/en/quantization/overview) as of now). Also note that due to numerical precision loss, quantization can impact the quality of the outputs, the effects of which are more prominent in videos than in images.
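As a rough sketch of how those numbers were produced (condensed from the HunyuanVideo benchmark script shared later in this thread; the 8-bit variant only swaps the `BitsAndBytesConfig`), the 4-bit quantized transformer is loaded like this:

```python
import torch
from diffusers import BitsAndBytesConfig, HunyuanVideoTransformer3DModel, HunyuanVideoPipeline

# 4-bit NF4 quantization of the transformer via bitsandbytes; for 8-bit,
# use BitsAndBytesConfig(load_in_8bit=True) instead.
quant_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4")
transformer = HunyuanVideoTransformer3DModel.from_pretrained(
    "hunyuanvideo-community/HunyuanVideo",
    subfolder="transformer",
    quantization_config=quant_config,
    torch_dtype=torch.bfloat16,
)
pipe = HunyuanVideoPipeline.from_pretrained(
    "hunyuanvideo-community/HunyuanVideo", transformer=transformer, torch_dtype=torch.float16
)
```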
## Video Generation with Diffusers
More in line with the suggestion above, I'd recommend moving this above optimisations/memory etc.
Resolved.
Super cool! Happy to help and/or review as needed.
@pcuenca thanks! If you could help generate some of the videos @Vaibhavs10 mentioned, that would be very helpful. This is the script I used for the optims:

Code

```python
from diffusers import HunyuanVideoTransformer3DModel, HunyuanVideoPipeline
from diffusers import BitsAndBytesConfig as BitsAndBytesConfig
import argparse
import json
import torch

prompt = "A cat walks on the grass, realistic. The scene resembles a real-life footage and should look as if it was shot in a sunny day."


def load_pipeline(args):
    if args.bit4_bnb:
        quant_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4")
        transformer = HunyuanVideoTransformer3DModel.from_pretrained(
            "hunyuanvideo-community/HunyuanVideo",
            subfolder="transformer",
            quantization_config=quant_config,
            torch_dtype=torch.bfloat16,
        )
    elif args.bit8_bnb:
        quant_config = BitsAndBytesConfig(load_in_8bit=True)
        transformer = HunyuanVideoTransformer3DModel.from_pretrained(
            "hunyuanvideo-community/HunyuanVideo",
            subfolder="transformer",
            quantization_config=quant_config,
            torch_dtype=torch.bfloat16,
        )
    else:
        transformer = HunyuanVideoTransformer3DModel.from_pretrained(
            "hunyuanvideo-community/HunyuanVideo", subfolder="transformer", torch_dtype=torch.bfloat16
        )

    pipe = HunyuanVideoPipeline.from_pretrained(
        "hunyuanvideo-community/HunyuanVideo", transformer=transformer, torch_dtype=torch.float16
    )
    if not args.enable_model_cpu_offload:
        pipe = pipe.to("cuda")
    else:
        pipe.enable_model_cpu_offload()
    if args.vae_tiling:
        pipe.vae.enable_tiling()
    return pipe


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--enable_model_cpu_offload", type=int, choices=[0, 1])
    parser.add_argument("--vae_tiling", type=int, choices=[0, 1])
    parser.add_argument("--bit4_bnb", type=int, choices=[0, 1])
    parser.add_argument("--bit8_bnb", type=int, choices=[0, 1])
    args = parser.parse_args()

    # Construct output path based on argument values
    output_path = f"4bit@{args.bit4_bnb}_8bit@{args.bit8_bnb}_tiling@{args.vae_tiling}_offload@{args.enable_model_cpu_offload}.json"

    pipe = load_pipeline(args)
    _ = pipe(
        prompt,
        height=512,
        width=768,
        num_frames=121,
        generator=torch.manual_seed(0),
        num_inference_steps=50,
    )
    memory = torch.cuda.max_memory_allocated() / (1024 ** 3)

    # Serialize memory usage info to JSON
    memory_data = {
        "prompt": prompt,
        "height": 512,
        "width": 768,
        "num_frames": 121,
        "num_inference_steps": 50,
        "gpu_memory_usage_gb": memory,
        "enable_model_cpu_offload": args.enable_model_cpu_offload,
        "vae_tiling": args.vae_tiling,
        "bit4_bnb": args.bit4_bnb,
        "bit8_bnb": args.bit8_bnb,
    }
    with open(output_path, "w") as json_file:
        json.dump(memory_data, json_file, indent=4)
    print(f"Serialized to {output_path=}")
```

Completely fine if you don't have time. My plate is a bit full, too. So, it will take time.
Thanks for leading the initiative @sayakpaul. Is there anything specific you'd like me to address? At the moment, I see some TODOs regarding the feature PRs that are not merged yet, but we are very close to merging them (just needs a final look from @DN6), so we can directly mention them.
Left some other comments as well about how we could nicely showcase memory reduction with/without quantization or other optimizations with upcoming features.
video_gen.md
Note that, of the above four options, we only support the first two as of now. Support for the other two will be merged soon. If you're interested in following the progress, here are the PRs:

- TODO:
- TODO:
We are very close to merging PAB, which should cover attention & MLP state re-use. For chunked inference, slicing/tiling/FreeNoise-split-inference are great examples already.
For offloading, we currently only have group offloading pending (which might take a while to review and merge), but the PR is 90% ready IMO, so we can mention it -- especially because it has no speed overheads while drastically reducing memory requirements.
So, IMO we should not mention these few lines ("..., we only support the first two")
Feel free to perform those changes directly here. I would go with:
So, IMO we should not mention these few lines ("..., we only support the first two")
And make it clear what's upcoming (the ones you have opened PRs for).
@a-r-r-o-w I have taken care of the edits. LMK if that works for you.
video_gen.md
| VAE tiling | 43.58 GB |
| CPU offloading | 28.87 GB |
| 8Bit | 49.9 GB |
| 8Bit + CPU offloading* | 35.66 GB |
| 8Bit + VAE tiling | 36.92 GB |
| 8Bit + CPU offloading + VAE tiling | 26.18 GB |
| 4Bit | 42.96 GB |
| 4Bit + CPU offloading | 21.99 GB |
| 4Bit + VAE tiling | 26.42 GB |
| 4Bit + CPU offloading + VAE tiling | 14.15 GB |
@sayakpaul Have we made note of the time required for each of these methods? IMO it would be helpful for users to understand the tradeoffs that come with each and the expected slowdown.
It would also set the stage to tease the new banger feature, prefetched offloading, coming soon, which uses the memory of sequential CPU offloading (so around ~3 GB) without compromising speed. CPU RAM requirements are the same as for any other offloading method. LMK what you think.
The reasons why I didn't:
- Video generation is time-consuming, especially HunyuanVideo. Not sure if most users care about the inference latency taking a hit because of memory optims.
- We don't have the other features merged yet, so I didn't feel comfortable benchmarking them.
If you feel strongly about the timing note, feel free to add the changes.
From the Comfy community side at least, I know that people do care about the time required and try to work with settings that reduce the overall time (lower resolution/frames + latent upscaling, sage attention, fp8 matmul, etc., because they have support for some good memory optims already). So, I think it will be beneficial to mention time here, because if we only cared about reducing memory, everyone would just default to something like sequential CPU offloading.
Could you provide me with the benchmark script you got the current numbers from? I'll run the same and measure time as well.
Here:
Code
```python
from diffusers import HunyuanVideoTransformer3DModel, HunyuanVideoPipeline
from diffusers import BitsAndBytesConfig as BitsAndBytesConfig
import argparse
import json
import torch

prompt = "A cat walks on the grass, realistic. The scene resembles a real-life footage and should look as if it was shot in a sunny day."


def load_pipeline(args):
    if args.bit4_bnb:
        quant_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4")
        transformer = HunyuanVideoTransformer3DModel.from_pretrained(
            "hunyuanvideo-community/HunyuanVideo",
            subfolder="transformer",
            quantization_config=quant_config,
            torch_dtype=torch.bfloat16,
        )
    elif args.bit8_bnb:
        quant_config = BitsAndBytesConfig(load_in_8bit=True)
        transformer = HunyuanVideoTransformer3DModel.from_pretrained(
            "hunyuanvideo-community/HunyuanVideo",
            subfolder="transformer",
            quantization_config=quant_config,
            torch_dtype=torch.bfloat16,
        )
    else:
        transformer = HunyuanVideoTransformer3DModel.from_pretrained(
            "hunyuanvideo-community/HunyuanVideo", subfolder="transformer", torch_dtype=torch.bfloat16
        )

    pipe = HunyuanVideoPipeline.from_pretrained(
        "hunyuanvideo-community/HunyuanVideo", transformer=transformer, torch_dtype=torch.float16
    )
    if not args.enable_model_cpu_offload:
        pipe = pipe.to("cuda")
    else:
        pipe.enable_model_cpu_offload()
    if args.vae_tiling:
        pipe.vae.enable_tiling()
    return pipe


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--enable_model_cpu_offload", type=int, choices=[0, 1])
    parser.add_argument("--vae_tiling", type=int, choices=[0, 1])
    parser.add_argument("--bit4_bnb", type=int, choices=[0, 1])
    parser.add_argument("--bit8_bnb", type=int, choices=[0, 1])
    args = parser.parse_args()

    # Construct output path based on argument values
    output_path = f"4bit@{args.bit4_bnb}_8bit@{args.bit8_bnb}_tiling@{args.vae_tiling}_offload@{args.enable_model_cpu_offload}.json"

    pipe = load_pipeline(args)
    _ = pipe(
        prompt,
        height=512,
        width=768,
        num_frames=121,
        generator=torch.manual_seed(0),
        num_inference_steps=50,
    )
    memory = torch.cuda.max_memory_allocated() / (1024 ** 3)

    # Serialize memory usage info to JSON
    memory_data = {
        "prompt": prompt,
        "height": 512,
        "width": 768,
        "num_frames": 121,
        "num_inference_steps": 50,
        "gpu_memory_usage_gb": memory,
        "enable_model_cpu_offload": args.enable_model_cpu_offload,
        "vae_tiling": args.vae_tiling,
        "bit4_bnb": args.bit4_bnb,
        "bit8_bnb": args.bit8_bnb,
    }
    with open(output_path, "w") as json_file:
        json.dump(memory_data, json_file, indent=4)
    print(f"Serialized to {output_path=}")
```
I would keep the settings similar, though. If we have to reduce the number of frames, resolution, etc., I'd make a separate note and not change the settings during benchmarking.
Here are the results with the time required for each method + FP8 layerwise upcasting, since that PR was merged.
| **Setting** | **Memory** | **Time** |
|:--------------------------------------------------:|:-------------:|:--------:|
| BF16 Base | 60.10 GB | 863s |
| BF16 + CPU offloading | 28.87 GB | 917s |
| BF16 + VAE tiling | 43.58 GB | 870s |
| 8-bit BnB | 49.90 GB | 983s |
| 8-bit BnB + CPU offloading* | 35.66 GB | 1041s |
| 8-bit BnB + VAE tiling | 36.92 GB | 997s |
| 8-bit BnB + CPU offloading + VAE tiling | 26.18 GB | 1260s |
| 4-bit BnB | 42.96 GB | 867s |
| 4-bit BnB + CPU offloading | 21.99 GB | 953s |
| 4-bit BnB + VAE tiling | 26.42 GB | 889s |
| 4-bit BnB + CPU offloading + VAE tiling | 14.15 GB | 995s |
| FP8 Upcasting | 51.70 GB | 856s |
| FP8 Upcasting + CPU offloading | 21.99 GB | 983s |
| FP8 Upcasting + VAE tiling | 35.17 GB | 867s |
| FP8 Upcasting + CPU offloading + VAE tiling | 20.44 GB | 1013s |
| BF16 + Group offload (blocks=8) + VAE tiling | 15.67 GB | 925s |
| BF16 + Group offload (blocks=1) + VAE tiling | 7.72 GB | 881s |
| BF16 + Group offload (leaf) + VAE tiling | 6.66 GB | 887s |
| FP8 Upcasting + Group offload (leaf) + VAE tiling | 6.56 GB | 885s |
Still haven't added Groupwise-offloading yet since I had another idea about optimizing it for further reducing memory. I will for sure be able to send the numbers for it by later today. Will push the changes directly EOD
Thanks Aryan!
Here's the updated benchmark code (I did not modify the original parts and only added the fp8 and group offloading bits)
code
```python
from diffusers import HunyuanVideoTransformer3DModel, HunyuanVideoPipeline
from diffusers import BitsAndBytesConfig
import argparse
import json
import torch
import time
from diffusers.utils import export_to_video
from diffusers.hooks.group_offloading import apply_group_offloading
from diffusers.utils.logging import set_verbosity_debug

set_verbosity_debug()

prompt = "A cat walks on the grass, realistic. The scene resembles a real-life footage and should look as if it was shot in a sunny day."


def load_pipeline(args):
    if args.bit4_bnb:
        quant_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4")
        transformer = HunyuanVideoTransformer3DModel.from_pretrained(
            "hunyuanvideo-community/HunyuanVideo",
            subfolder="transformer",
            quantization_config=quant_config,
            torch_dtype=torch.bfloat16,
        )
    elif args.bit8_bnb:
        quant_config = BitsAndBytesConfig(load_in_8bit=True)
        transformer = HunyuanVideoTransformer3DModel.from_pretrained(
            "hunyuanvideo-community/HunyuanVideo",
            subfolder="transformer",
            quantization_config=quant_config,
            torch_dtype=torch.bfloat16,
        )
    else:
        transformer = HunyuanVideoTransformer3DModel.from_pretrained(
            "hunyuanvideo-community/HunyuanVideo", subfolder="transformer", torch_dtype=torch.bfloat16
        )

    if args.layerwise_casting:
        transformer.enable_layerwise_casting(storage_dtype=torch.float8_e4m3fn, compute_dtype=torch.bfloat16)

    pipe = HunyuanVideoPipeline.from_pretrained(
        "hunyuanvideo-community/HunyuanVideo", transformer=transformer, torch_dtype=torch.float16
    )
    if not args.enable_model_cpu_offload:
        if args.group_offloading == "0":
            pipe = pipe.to("cuda")
    else:
        pipe.enable_model_cpu_offload()
    if args.vae_tiling:
        pipe.vae.enable_tiling()
    return pipe


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--enable_model_cpu_offload", type=int, choices=[0, 1])
    parser.add_argument("--vae_tiling", type=int, choices=[0, 1])
    parser.add_argument("--bit4_bnb", type=int, choices=[0, 1])
    parser.add_argument("--bit8_bnb", type=int, choices=[0, 1])
    parser.add_argument("--layerwise_casting", type=int, choices=[0, 1])
    parser.add_argument("--group_offloading", type=str, choices=["0", "1", "8", "leaf_level"])
    args = parser.parse_args()

    # Construct output path based on argument values
    output_path = f"group_offloading@{args.group_offloading}_4bit@{args.bit4_bnb}_8bit@{args.bit8_bnb}_tiling@{args.vae_tiling}_offload@{args.enable_model_cpu_offload}_layerwise@{args.layerwise_casting}.json"

    pipe = load_pipeline(args)

    if args.group_offloading != "0":
        apply_group_offloading(
            pipe.text_encoder,
            offload_type="leaf_level",
            offload_device=torch.device("cpu"),
            onload_device=torch.device("cuda"),
            force_offload=True,
            non_blocking=True,
            use_stream=True,
        )
        apply_group_offloading(
            pipe.text_encoder_2,
            offload_type="leaf_level",
            offload_device=torch.device("cpu"),
            onload_device=torch.device("cuda"),
            force_offload=True,
            non_blocking=True,
            use_stream=True,
        )
        apply_group_offloading(
            pipe.transformer,
            offload_type="block_level" if args.group_offloading in ["1", "8"] else "leaf_level",
            num_blocks_per_group=8 if args.group_offloading == "8" else 1 if args.group_offloading == "1" else None,
            offload_device=torch.device("cpu"),
            onload_device=torch.device("cuda"),
            force_offload=True,
            non_blocking=True,
            use_stream=True,
        )
        pipe.vae.to("cuda")

    # warmup for prefetch hooks to figure out layer execution order
    _ = pipe(prompt, height=64, width=64, num_frames=9, num_inference_steps=2)

    t1 = time.time()
    video = pipe(
        prompt,
        height=512,
        width=768,
        num_frames=121,
        generator=torch.manual_seed(0),
        num_inference_steps=30,
    )
    t2 = time.time()
    video = video.frames[0]
    export_to_video(video, output_path[:-5] + ".mp4", fps=30)
    memory = torch.cuda.max_memory_allocated() / (1024 ** 3)

    # Serialize memory usage info to JSON
    memory_data = {
        "prompt": prompt,
        "height": 512,
        "width": 768,
        "num_frames": 121,
        "num_inference_steps": 30,
        "gpu_memory_usage_gb": memory,
        "inference_time": round(t2 - t1, 2),
        "enable_model_cpu_offload": args.enable_model_cpu_offload,
        "vae_tiling": args.vae_tiling,
        "bit4_bnb": args.bit4_bnb,
        "bit8_bnb": args.bit8_bnb,
    }
    with open(output_path, "w") as json_file:
        json.dump(memory_data, json_file, indent=4)
    print(f"Serialized to {output_path=}")
```
BF16, 121 frames, 512x768 resolution in under 7 GB (further reduced to under 5 GB with flash attention and an optimized feed-forward, huggingface/diffusers#10623). Did we cook or did we cook? 👨🍳
Thanks @a-r-r-o-w!
I think you can take care of the comments you added and maybe make some changes to address them? I will try to address VB's comments. I will let @DN6 take care of the code examples.
Co-authored-by: Aryan <[email protected]>
```python
export_to_video(video, "output.mp4", fps=24)
```

### Memory requirements
@Vaibhavs10 this has been adjusted FYI.
video_gen.md
* [Layerwise upcasting](https://github.com/huggingface/diffusers/pull/10347): Lets users store the params and layer outputs in a lower precision such as `torch.float8_e4m3fn` and run computations in a higher precision such as `torch.bfloat16`.
* [Overlapped offloading](https://github.com/huggingface/diffusers/pull/10503): Lets users overlap data transfer with computation using CUDA streams.
@a-r-r-o-w if you could help by providing your best memory-saving numbers here, that would be nice. For example, we could say:
Layerwise upcasting enables us to save XYZ memory.
Same for overlapped offloading.
Btw, I would not call this overlapped offloading for two reasons:
- Overlapping is opt-in. It may also be imperfectly overlapped if the computation is much faster than the module transfer (however, the synchronizations put in place make sure no operation starts unexpectedly).
- The PR's original intention is to allow groups of internal modules to be offloaded together. This helps reduce the memory peaks caused by loading the entire model to the GPU, by only partially loading the required modules at a time, performing computation, and then offloading.
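For context, here is a rough sketch of how these two features are exercised, based on the APIs used in the updated benchmark script earlier in this thread; both come from not-yet-merged PRs, so names and defaults may still change:

```python
import torch
from diffusers import HunyuanVideoTransformer3DModel
from diffusers.hooks.group_offloading import apply_group_offloading

transformer = HunyuanVideoTransformer3DModel.from_pretrained(
    "hunyuanvideo-community/HunyuanVideo", subfolder="transformer", torch_dtype=torch.bfloat16
)

# Layerwise upcasting: store the params in FP8 and upcast to BF16 for compute.
transformer.enable_layerwise_casting(
    storage_dtype=torch.float8_e4m3fn, compute_dtype=torch.bfloat16
)

# Group offloading: keep only small groups of layers on the GPU at a time,
# optionally overlapping CPU<->GPU transfers with compute via CUDA streams.
apply_group_offloading(
    transformer,
    offload_type="leaf_level",
    offload_device=torch.device("cpu"),
    onload_device=torch.device("cuda"),
    force_offload=True,
    non_blocking=True,
    use_stream=True,
)
```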
video_gen.md
```
# (Full training command removed for brevity)
```

For more details, check out the repository [here](https://github.com/a-r-r-o-w/finetrainers). We used `finetrainers` to emulate the "dissolve" effect and obtained
@Vaibhavs10 provided a fine-tuned model and a result.
We provide more details about these optimizations in the sections below, along with some code snippets. But if you're already feeling excited, we encourage you to check out [our guide](https://huggingface.co/docs/diffusers/main/en/using-diffusers/text-img2vid).

### Suite of optimizations
@DN6 if you could take care of the code, that would be helpful!
The draft still has some TODOs, but that won't prevent a first pass through the content. @DN6 @a-r-r-o-w could you please fill in the TODOs when you have a moment?
This is why the PR is in WIP mode.
TODOs