
State of video generation in Diffusers #2592

Merged: 13 commits merged into main on Jan 27, 2025

Conversation

Member

@sayakpaul sayakpaul commented Jan 13, 2025

The draft still has some TODOs, but that won't prevent a first pass through the content. @DN6 @a-r-r-o-w could you please fill in the TODOs when you have a moment?

This is why the PR is in WIP mode.

TODOs

  • Post preview
  • Thumbnail

@sayakpaul sayakpaul requested a review from pcuenca January 13, 2025 12:44
_blog.yml Outdated
Comment on lines 5303 to 5304
thumbnail: /blog/assets/video_gen/thumbnail.png
date: Jan 23, 2025
Member Author

Need to be updated.

Member

@Vaibhavs10 Vaibhavs10 left a comment

Very happy to see this in action. cc @LysandreJik for visibility (when you come back).

Member

@Vaibhavs10 Vaibhavs10 left a comment

Did a lightweight review, sorry if it's a bit premature for the state of the blog - very excited to see this finally taking shape!

- user: dn6
---

# State of open video generation models in Diffusers
Member

It might be worth opening with a video from one of the open video models, and maybe even drawing comparisons between where video generation models were a year or two back vs. now!

A good example could be the Will Smith benchmark!

- Fine-tuning
- Looking ahead

## Today’s Video Generation Models and their Limitations
Member

Feel free to disagree, but IMO we should only keep the table here; the limitations could potentially go toward the end of the blog post, which makes it easier to read.

Member Author

I do disagree. I think it's common to start with limitations to give readers fuller context.

Member

It depends on the vibe you are going for - up to you since you're the author. To me it just feels odd to start with limitations, since even a survey paper conveys the limitations towards the end.

- Several open models suffer from limited generalization and underperform user expectations. Models may require prompting in a certain way, or LLM-like prompts, or may fail to generalize to out-of-distribution data, all of which are hurdles for widespread user adoption. For example, models like LTX-Video often need to be prompted in a very detailed and specific way to obtain good-quality generations.
- The high computational and memory demands of video generation result in significant generation latency. For local usage, this is often a roadblock. Most new open video models are inaccessible to community hardware without extensive memory optimizations and quantization approaches that affect both inference latency and the quality of the generated videos.

## Why is Video Generation Hard?
Member

Same for this, it would be nice to keep a positive outlook going in and then ground it towards the end.

Member Author

Same as above.

video_gen.md Outdated
| [`tencent/HunyuanVideo`](https://huggingface.co/tencent/HunyuanVideo) | [Link](https://huggingface.co/tencent/HunyuanVideo/blob/main/LICENSE) |
| [`Lightricks/LTX-Video`](https://huggingface.co/Lightricks/LTX-Video) | [Link](https://huggingface.co/Lightricks/LTX-Video/blob/main/License.txt) |

### **Memory requirements**
Member

It might be beneficial to add inference examples for all/some of the models you mention here, to drive home that Diffusers is the place to go for inference.

Member

Maybe even with video snippets embedded from those, so that people can experience them visually as well.

Member Author

> It might be beneficial to add inference examples for all/some of the models you mention here, to drive home that Diffusers is the place to go for inference.

It will make it unnecessarily verbose. We will do some snippets but will keep it for only one model as we're already citing the docs for the other models. This is a TODO and will be addressed by @DN6.

Member

> It will make it unnecessarily verbose. We will do some snippets but will keep it for only one model as we're already citing the docs for the other models.

Not really, you can just wrap them in `<details>` so that they are collapsed by default.
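For instance, something like the following renders as a collapsed, expandable block on the published page (a minimal sketch of the `<details>` pattern; the summary text is illustrative):

```html
<details>
<summary>Inference code for HunyuanVideo (click to expand)</summary>

<!-- the Python snippet for this model would go here -->

</details>
```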

Member Author

It will make it a bit redundant IMO, as the code doesn't change much. So, showing it for a single model is sufficient, I think.
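For context, a minimal single-model snippet of the kind being discussed could look like the following (a sketch assembled from the benchmark script shared later in this thread; the checkpoint, dtypes, and call arguments mirror that script and are illustrative rather than final blog content):

```python
import torch
from diffusers import HunyuanVideoTransformer3DModel, HunyuanVideoPipeline
from diffusers.utils import export_to_video

# Load the transformer in bfloat16 and the rest of the pipeline in float16,
# mirroring the benchmark script below.
transformer = HunyuanVideoTransformer3DModel.from_pretrained(
    "hunyuanvideo-community/HunyuanVideo", subfolder="transformer", torch_dtype=torch.bfloat16
)
pipe = HunyuanVideoPipeline.from_pretrained(
    "hunyuanvideo-community/HunyuanVideo", transformer=transformer, torch_dtype=torch.float16
)
pipe.enable_model_cpu_offload()  # keep idle components on the CPU to fit smaller GPUs
pipe.vae.enable_tiling()         # decode latents in tiles to cap VAE memory

video = pipe(
    "A cat walks on the grass, realistic.",
    height=512,
    width=768,
    num_frames=61,
    num_inference_steps=30,
).frames[0]
export_to_video(video, "output.mp4", fps=24)
```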

video_gen.md Outdated

We used the same settings as above to obtain these numbers. Quantization was performed with the [`bitsandbytes` library](https://huggingface.co/docs/bitsandbytes/main/en/index) (Diffusers [supports three different quantization backends](https://huggingface.co/docs/diffusers/main/en/quantization/overview) as of now). Also note that, due to numerical precision loss, quantization can impact the quality of the outputs, the effects of which are more prominent in videos than in images.

## Video Generation with Diffusers
Member

More in line with the suggestion above, I'd recommend moving this above optimizations/memory, etc.

Member Author

Resolved.

Member

pcuenca commented Jan 14, 2025

Super cool! Happy to help and/or review as needed.

Member Author

sayakpaul commented Jan 14, 2025

@pcuenca thanks! If you could help generate some of the videos @Vaibhavs10 mentioned, that would be very helpful. This is the script I used for the optims:

Code
from diffusers import HunyuanVideoTransformer3DModel, HunyuanVideoPipeline
from diffusers import BitsAndBytesConfig as BitsAndBytesConfig
import argparse
import json
import torch 

prompt = "A cat walks on the grass, realistic. The scene resembles a real-life footage and should look as if it was shot in a sunny day."

def load_pipeline(args):
    if args.bit4_bnb:
        quant_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4")
        transformer = HunyuanVideoTransformer3DModel.from_pretrained(
            "hunyuanvideo-community/HunyuanVideo",
            subfolder="transformer",
            quantization_config=quant_config,
            torch_dtype=torch.bfloat16,
        )
    elif args.bit8_bnb:
        quant_config = BitsAndBytesConfig(load_in_8bit=True)
        transformer = HunyuanVideoTransformer3DModel.from_pretrained(
            "hunyuanvideo-community/HunyuanVideo",
            subfolder="transformer",
            quantization_config=quant_config,
            torch_dtype=torch.bfloat16,
        )
    else:
        transformer = HunyuanVideoTransformer3DModel.from_pretrained(
            "hunyuanvideo-community/HunyuanVideo", subfolder="transformer", torch_dtype=torch.bfloat16
        )
    
    pipe = HunyuanVideoPipeline.from_pretrained(
        "hunyuanvideo-community/HunyuanVideo", transformer=transformer, torch_dtype=torch.float16
    )
    
    if not args.enable_model_cpu_offload:
        pipe = pipe.to("cuda")
    else:
        pipe.enable_model_cpu_offload()
    
    if args.vae_tiling:
        pipe.vae.enable_tiling()
    return pipe


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--enable_model_cpu_offload", type=int, choices=[0, 1])
    parser.add_argument("--vae_tiling", type=int, choices=[0, 1])
    parser.add_argument("--bit4_bnb", type=int, choices=[0, 1])
    parser.add_argument("--bit8_bnb", type=int, choices=[0, 1])
    args = parser.parse_args()

    # Construct output path based on argument values
    output_path = f"4bit@{args.bit4_bnb}_8bit@{args.bit8_bnb}_tiling@{args.vae_tiling}_offload@{args.enable_model_cpu_offload}.json"

    pipe = load_pipeline(args)

    _ = pipe(
        prompt, 
        height=512, 
        width=768, 
        num_frames=121, 
        generator=torch.manual_seed(0),
        num_inference_steps=50
    )

    memory = torch.cuda.max_memory_allocated() / (1024 ** 3)

    # Serialize memory usage info to JSON
    memory_data = {
        "prompt": prompt,
        "height": 512,
        "width": 768,
        "num_frames": 121,
        "num_inference_steps": 50,
        "gpu_memory_usage_gb": memory,
        "enable_model_cpu_offload": args.enable_model_cpu_offload,
        "vae_tiling": args.vae_tiling,
        "bit4_bnb": args.bit4_bnb,
        "bit8_bnb": args.bit8_bnb
    }

    with open(output_path, "w") as json_file:
        json.dump(memory_data, json_file, indent=4)

    print(f"Serialized to {output_path=}")

Completely fine if you don't have time. My plate is a bit full, too. So, it will take time.

Member

@a-r-r-o-w a-r-r-o-w left a comment

Thanks for leading the initiative @sayakpaul. Is there anything specific you'd like me to address? At the moment, I see some TODOs regarding feature PRs that are not merged yet, but we are very close to merging (just needs a final look from @DN6), so we can mention them directly.

Left some other comments as well about how we could nicely showcase memory reduction with/without quantization or other optimizations with upcoming features.

video_gen.md Outdated
Comment on lines 166 to 169
Note that of the above four options, as of now, we only support the first two. Support for the remaining two will be merged soon. If you're interested in following the progress, here are the PRs:

- TODO:
- TODO:
Member

We are very close to merging PAB, which should cover attention & MLP state re-use. For chunked inference, slicing/tiling/FreeNoise-split-inference are great examples already.

For offloading, we currently only have group offloading pending (which might take a while to review and merge), but the PR is 90% ready IMO, so we can mention it -- especially because it has no speed overheads while drastically reducing memory requirements.

So, IMO we should not mention these few lines ("..., we only support the first two")
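For illustration, the slicing/tiling flavour of chunked inference is already a one-liner on the pipeline's VAE (a sketch; only `enable_tiling` is exercised in the benchmark scripts in this thread, and `enable_slicing` is assumed to be available on this VAE as the batch-wise counterpart):

```python
# Chunked decoding: split the VAE workload instead of decoding everything at once.
pipe.vae.enable_tiling()   # decode the latent video in spatial tiles (used in the benchmarks here)
pipe.vae.enable_slicing()  # decode one batch element at a time (assumed available; batch-wise analogue)
```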

Member Author

@sayakpaul sayakpaul Jan 20, 2025

Feel free to perform those changes directly here. I would go with:

> So, IMO we should not mention these few lines ("..., we only support the first two")

And make it clear what's upcoming (the ones you have opened PRs for).

Member Author

@a-r-r-o-w I have taken care of the edits. LMK if that works for you.

video_gen.md Outdated
Comment on lines 130 to 139
| VAE tiling | 43.58 GB |
| CPU offloading | 28.87 GB |
| 8Bit | 49.9 GB |
| 8Bit + CPU offloading* | 35.66 GB |
| 8Bit + VAE tiling | 36.92 GB |
| 8Bit + CPU offloading + VAE tiling | 26.18 GB |
| 4Bit | 42.96 GB |
| 4Bit + CPU offloading | 21.99 GB |
| 4Bit + VAE tiling | 26.42 GB |
| 4Bit + CPU offloading + VAE tiling | 14.15 GB |
Member

@sayakpaul Have we made a note of the time required for each of these methods? IMO it would be helpful for users to understand the tradeoffs that come with each, and the expected slowdown.

It would also set the stage to tease the new banger feature, prefetched offloading, coming soon, which uses the memory of sequential CPU offloading (so around ~3 GB) without compromising speed. CPU RAM requirements are the same as for any other offloading method. LMK what you think.

Member Author

The reason I didn't is:

  1. Video generation is time-consuming, especially HunyuanVideo. Not sure if most users care about inference latency taking a hit because of memory optims.
  2. We don't have the other features merged yet, so I didn't feel comfortable benchmarking them.

If you feel strongly about the timing note, feel free to add the changes.

Member

From the Comfy community side at least, I know that people do care about the time required and try to work with settings that reduce the overall time (lower resolution/frames + latent upscaling, sage attention, fp8 matmul, etc., because they already have support for some good memory optims). So, I think it will be beneficial to mention time here, because if we only cared about reducing memory, everyone would just default to something like sequential CPU offloading.

Could you provide me with the benchmark script you got the current numbers from? I'll run the same and measure time as well.

Member Author

Here:

Code
from diffusers import HunyuanVideoTransformer3DModel, HunyuanVideoPipeline
from diffusers import BitsAndBytesConfig as BitsAndBytesConfig
import argparse
import json
import torch 

prompt = "A cat walks on the grass, realistic. The scene resembles a real-life footage and should look as if it was shot in a sunny day."

def load_pipeline(args):
    if args.bit4_bnb:
        quant_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4")
        transformer = HunyuanVideoTransformer3DModel.from_pretrained(
            "hunyuanvideo-community/HunyuanVideo",
            subfolder="transformer",
            quantization_config=quant_config,
            torch_dtype=torch.bfloat16,
        )
    elif args.bit8_bnb:
        quant_config = BitsAndBytesConfig(load_in_8bit=True)
        transformer = HunyuanVideoTransformer3DModel.from_pretrained(
            "hunyuanvideo-community/HunyuanVideo",
            subfolder="transformer",
            quantization_config=quant_config,
            torch_dtype=torch.bfloat16,
        )
    else:
        transformer = HunyuanVideoTransformer3DModel.from_pretrained(
            "hunyuanvideo-community/HunyuanVideo", subfolder="transformer", torch_dtype=torch.bfloat16
        )
    
    pipe = HunyuanVideoPipeline.from_pretrained(
        "hunyuanvideo-community/HunyuanVideo", transformer=transformer, torch_dtype=torch.float16
    )
    
    if not args.enable_model_cpu_offload:
        pipe = pipe.to("cuda")
    else:
        pipe.enable_model_cpu_offload()
    
    if args.vae_tiling:
        pipe.vae.enable_tiling()
    return pipe


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--enable_model_cpu_offload", type=int, choices=[0, 1])
    parser.add_argument("--vae_tiling", type=int, choices=[0, 1])
    parser.add_argument("--bit4_bnb", type=int, choices=[0, 1])
    parser.add_argument("--bit8_bnb", type=int, choices=[0, 1])
    args = parser.parse_args()

    # Construct output path based on argument values
    output_path = f"4bit@{args.bit4_bnb}_8bit@{args.bit8_bnb}_tiling@{args.vae_tiling}_offload@{args.enable_model_cpu_offload}.json"

    pipe = load_pipeline(args)

    _ = pipe(
        prompt, 
        height=512, 
        width=768, 
        num_frames=121, 
        generator=torch.manual_seed(0),
        num_inference_steps=50
    )

    memory = torch.cuda.max_memory_allocated() / (1024 ** 3)

    # Serialize memory usage info to JSON
    memory_data = {
        "prompt": prompt,
        "height": 512,
        "width": 768,
        "num_frames": 121,
        "num_inference_steps": 50,
        "gpu_memory_usage_gb": memory,
        "enable_model_cpu_offload": args.enable_model_cpu_offload,
        "vae_tiling": args.vae_tiling,
        "bit4_bnb": args.bit4_bnb,
        "bit8_bnb": args.bit8_bnb
    }

    with open(output_path, "w") as json_file:
        json.dump(memory_data, json_file, indent=4)

    print(f"Serialized to {output_path=}")

Member Author

I would keep the settings similar, though. If we have to reduce the number of frames, resolution, etc., I'd make a separate note and not change the settings during benchmarking.

Member

@a-r-r-o-w a-r-r-o-w Jan 23, 2025

Here are the results with the time required for each method + FP8 layerwise upcasting, since that PR was merged.

| **Setting**                                        | **Memory**    | **Time** |
|:--------------------------------------------------:|:-------------:|:--------:|
| BF16 Base                                          | 60.10 GB      |  863s    |
| BF16 + CPU offloading                              | 28.87 GB      |  917s    |
| BF16 + VAE tiling                                  | 43.58 GB      |  870s    |
| 8-bit BnB                                          | 49.90 GB      |  983s    |
| 8-bit BnB + CPU offloading*                        | 35.66 GB      | 1041s    |
| 8-bit BnB + VAE tiling                             | 36.92 GB      |  997s    |
| 8-bit BnB + CPU offloading + VAE tiling            | 26.18 GB      | 1260s    |
| 4-bit BnB                                          | 42.96 GB      |  867s    |
| 4-bit BnB + CPU offloading                         | 21.99 GB      |  953s    |
| 4-bit BnB + VAE tiling                             | 26.42 GB      |  889s    |
| 4-bit BnB + CPU offloading + VAE tiling            | 14.15 GB      |  995s    |
| FP8 Upcasting                                      | 51.70 GB      |  856s    |
| FP8 Upcasting + CPU offloading                     | 21.99 GB      |  983s    |
| FP8 Upcasting + VAE tiling                         | 35.17 GB      |  867s    |
| FP8 Upcasting + CPU offloading + VAE tiling        | 20.44 GB      | 1013s    |
| BF16 + Group offload (blocks=8) + VAE tiling       | 15.67 GB      |  925s    |
| BF16 + Group offload (blocks=1) + VAE tiling       |  7.72 GB      |  881s    |
| BF16 + Group offload (leaf) + VAE tiling           |  6.66 GB      |  887s    | 
| FP8 Upcasting + Group offload (leaf) + VAE tiling  |  6.56 GB      |  885s    |

Still haven't added group offloading yet, since I had another idea about optimizing it to further reduce memory. I will for sure be able to send the numbers for it later today. Will push the changes directly EOD.

Member Author

Thanks Aryan!

Member

Here's the updated benchmark code (did not modify the original parts and just kept to the fp8 and group offloading additions)

Code
from diffusers import HunyuanVideoTransformer3DModel, HunyuanVideoPipeline
from diffusers import BitsAndBytesConfig
import argparse
import json
import torch 
import time
from diffusers.utils import export_to_video
from diffusers.hooks.group_offloading import apply_group_offloading
from diffusers.utils.logging import set_verbosity_debug

set_verbosity_debug()

prompt = "A cat walks on the grass, realistic. The scene resembles a real-life footage and should look as if it was shot in a sunny day."

def load_pipeline(args):
    if args.bit4_bnb:
        quant_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4")
        transformer = HunyuanVideoTransformer3DModel.from_pretrained(
            "hunyuanvideo-community/HunyuanVideo",
            subfolder="transformer",
            quantization_config=quant_config,
            torch_dtype=torch.bfloat16,
        )
    elif args.bit8_bnb:
        quant_config = BitsAndBytesConfig(load_in_8bit=True)
        transformer = HunyuanVideoTransformer3DModel.from_pretrained(
            "hunyuanvideo-community/HunyuanVideo",
            subfolder="transformer",
            quantization_config=quant_config,
            torch_dtype=torch.bfloat16,
        )
    else:
        transformer = HunyuanVideoTransformer3DModel.from_pretrained(
            "hunyuanvideo-community/HunyuanVideo", subfolder="transformer", torch_dtype=torch.bfloat16
        )
    
    if args.layerwise_casting:
        transformer.enable_layerwise_casting(storage_dtype=torch.float8_e4m3fn, compute_dtype=torch.bfloat16)
    
    pipe = HunyuanVideoPipeline.from_pretrained(
        "hunyuanvideo-community/HunyuanVideo", transformer=transformer, torch_dtype=torch.float16
    )
    
    if not args.enable_model_cpu_offload:
        if args.group_offloading == "0":
            pipe = pipe.to("cuda")
    else:
        pipe.enable_model_cpu_offload()
    
    if args.vae_tiling:
        pipe.vae.enable_tiling()
    return pipe


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--enable_model_cpu_offload", type=int, choices=[0, 1])
    parser.add_argument("--vae_tiling", type=int, choices=[0, 1])
    parser.add_argument("--bit4_bnb", type=int, choices=[0, 1])
    parser.add_argument("--bit8_bnb", type=int, choices=[0, 1])
    parser.add_argument("--layerwise_casting", type=int, choices=[0, 1])
    parser.add_argument("--group_offloading", type=str, choices=["0", "1", "8", "leaf_level"])
    args = parser.parse_args()

    # Construct output path based on argument values
    output_path = f"group_offloading@{args.group_offloading}_4bit@{args.bit4_bnb}_8bit@{args.bit8_bnb}_tiling@{args.vae_tiling}_offload@{args.enable_model_cpu_offload}_layerwise@{args.layerwise_casting}.json"

    pipe = load_pipeline(args)

    if args.group_offloading != "0":
        apply_group_offloading(
            pipe.text_encoder,
            offload_type="leaf_level",
            offload_device=torch.device("cpu"),
            onload_device=torch.device("cuda"),
            force_offload=True,
            non_blocking=True,
            use_stream=True,
        )
        apply_group_offloading(
            pipe.text_encoder_2,
            offload_type="leaf_level",
            offload_device=torch.device("cpu"),
            onload_device=torch.device("cuda"),
            force_offload=True,
            non_blocking=True,
            use_stream=True,
        )
        apply_group_offloading(
            pipe.transformer,
            offload_type="block_level" if args.group_offloading in ["1", "8"] else "leaf_level",
            num_blocks_per_group=8 if args.group_offloading == "8" else 1 if args.group_offloading == "1" else None,
            offload_device=torch.device("cpu"),
            onload_device=torch.device("cuda"),
            force_offload=True,
            non_blocking=True,
            use_stream=True,
        )
        pipe.vae.to("cuda")
    
        # warmup for prefetch hooks to figure out layer execution order
        _ = pipe(prompt, height=64, width=64, num_frames=9, num_inference_steps=2)

    t1 = time.time()
    video = pipe(
        prompt, 
        height=512, 
        width=768, 
        num_frames=121,
        generator=torch.manual_seed(0),
        num_inference_steps=30,
    )
    t2 = time.time()

    video = video.frames[0]
    export_to_video(video, output_path[:-5] + ".mp4", fps=30)

    memory = torch.cuda.max_memory_allocated() / (1024 ** 3)

    # Serialize memory usage info to JSON
    memory_data = {
        "prompt": prompt,
        "height": 512,
        "width": 768,
        "num_frames": 121,
        "num_inference_steps": 50,
        "gpu_memory_usage_gb": memory,
        "inference_time": round(t2 - t1, 2),
        "enable_model_cpu_offload": args.enable_model_cpu_offload,
        "vae_tiling": args.vae_tiling,
        "bit4_bnb": args.bit4_bnb,
        "bit8_bnb": args.bit8_bnb
    }

    with open(output_path, "w") as json_file:
        json.dump(memory_data, json_file, indent=4)

    print(f"Serialized to {output_path=}")

Member

BF16, 121 frames, 512x768 resolution in under 7 GB (further reduced to under 5 GB with flash attention and an optimized feed-forward, huggingface/diffusers#10623). Did we cook or did we cook? 👨‍🍳

Member Author

sayakpaul commented Jan 20, 2025

Thanks @a-r-r-o-w!

> Is there anything specific you'd like me to address? At the moment, I see some TODOs regarding feature PRs that are not merged yet, but we are very close to merging (just needs a final look from @DN6), so we can mention them directly.

I think you can take care of the comments you added and maybe make some changes to address them? I will try to address VB's comments.

I will let @DN6 take care of the code examples.

export_to_video(video, "output.mp4", fps=24)
```

### Memory requirements
Member Author

@Vaibhavs10 this has been adjusted FYI.

video_gen.md Outdated
Comment on lines 177 to 178
* [Layerwise upcasting](https://github.com/huggingface/diffusers/pull/10347): Lets users store the params and layer outputs in a lower precision such as `torch.float8_e4m3fn` and run computations in a higher precision such as `torch.bfloat16`.
* [Overlapped offloading](https://github.com/huggingface/diffusers/pull/10503): Lets users overlap data transfer with computation using CUDA streams.
Member Author

@a-r-r-o-w if you could help provide your best savings numbers here, that would be nice. For example, we could say:

Layerwise upcasting enables us to save XYZ memory.

Same for overlapped offloading.
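For reference, the layerwise-upcasting path boils down to a single call on the transformer, as used in the updated benchmark script later in this thread (a sketch; the savings numbers themselves are what's being asked for above):

```python
import torch

# Store parameters in FP8 and upcast them to BF16 on the fly during computation.
transformer.enable_layerwise_casting(
    storage_dtype=torch.float8_e4m3fn, compute_dtype=torch.bfloat16
)
```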

Member

@a-r-r-o-w a-r-r-o-w Jan 23, 2025

Btw, I would not call this overlapped offloading, for two reasons:

  • Overlapping is opt-in. It may also be imperfectly overlapped if the computation is much faster than the module transfer (however, the synchronizations put in place make sure no operation starts unexpectedly).
  • The PR's original intention is to allow groups of internal modules to be offloaded together. This helps reduce the memory peaks caused by loading the entire model onto the GPU, by only loading the required modules at a time, performing the computation, and then offloading.
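A sketch of what this looks like in practice, taken from the group-offloading calls in the updated benchmark script below (the exact keyword arguments mirror that script):

```python
import torch
from diffusers.hooks.group_offloading import apply_group_offloading

# Offload groups of transformer blocks together, onloading each group to the GPU only when needed.
apply_group_offloading(
    pipe.transformer,
    offload_type="block_level",
    num_blocks_per_group=1,
    offload_device=torch.device("cpu"),
    onload_device=torch.device("cuda"),
    force_offload=True,
    non_blocking=True,
    use_stream=True,  # opt-in: overlap transfers with compute via CUDA streams
)
```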

video_gen.md Outdated
# (Full training command removed for brevity)
```

For more details, check out the repository [here](https://github.com/a-r-r-o-w/finetrainers). We used `finetrainers` to emulate the "dissolve" effect and obtained
Member Author

@Vaibhavs10 provided a fine-tuned model and a result.

We provide more details about these optimizations in the sections below, along with some code snippets. But if you're already feeling excited,
we encourage you to check out [our guide](https://huggingface.co/docs/diffusers/main/en/using-diffusers/text-img2vid).

### Suite of optimizations
Member Author

@DN6 if you could take care of the code, that would be helpful!

@sayakpaul sayakpaul marked this pull request as ready for review January 27, 2025 14:02
@sayakpaul sayakpaul merged commit 4e9fd55 into main Jan 27, 2025
1 check passed
@sayakpaul sayakpaul deleted the video-gen branch January 27, 2025 14:10