
Model does not support Flash Attention 2.0 yet #146

Open
DominikVincent opened this issue Jan 10, 2025 · 4 comments

Comments

@DominikVincent

Running gradio_web_server_adhoc.py or the inference script with different models always fails with the following error:

    inference()
  File "/home/dominik/Documents/repos/VideoLLaMA2/own_scripts/inference.py", line 27, in inference
    model, processor, tokenizer = model_init(model_path)
  File "/home/dominik/Documents/repos/VideoLLaMA2/videollama2/__init__.py", line 17, in model_init
    tokenizer, model, processor, context_len = load_pretrained_model(model_path, None, model_name, **kwargs)
  File "/home/dominik/Documents/repos/VideoLLaMA2/videollama2/model/__init__.py", line 174, in load_pretrained_model
    model = Videollama2Qwen2ForCausalLM.from_pretrained(model_path, low_cpu_mem_usage=True, config=config, **kwargs)
  File "/home/dominik/miniconda3/envs/vllama/lib/python3.9/site-packages/transformers/modeling_utils.py", line 3550, in from_pretrained
    model = cls(config, *model_args, **model_kwargs)
  File "/home/dominik/Documents/repos/VideoLLaMA2/videollama2/model/videollama2_qwen2.py", line 50, in __init__
    self.model = Videollama2Qwen2Model(config)
  File "/home/dominik/Documents/repos/VideoLLaMA2/videollama2/model/videollama2_qwen2.py", line 42, in __init__
    super(Videollama2Qwen2Model, self).__init__(config)
  File "/home/dominik/Documents/repos/VideoLLaMA2/videollama2/model/videollama2_arch.py", line 34, in __init__
    self.vision_tower = build_vision_tower(config)
  File "/home/dominik/Documents/repos/VideoLLaMA2/videollama2/model/encoder.py", line 160, in build_vision_tower
    vision_tower = SiglipVisionTower(vision_tower, args=vision_tower_cfg, **kwargs)
  File "/home/dominik/Documents/repos/VideoLLaMA2/videollama2/model/encoder.py", line 99, in __init__
    self.vision_tower = SiglipVisionModel(config=config)
  File "/home/dominik/miniconda3/envs/vllama/lib/python3.9/site-packages/transformers/models/siglip/modeling_siglip.py", line 915, in __init__
    super().__init__(config)
  File "/home/dominik/miniconda3/envs/vllama/lib/python3.9/site-packages/transformers/modeling_utils.py", line 1307, in __init__
    config = self._autoset_attn_implementation(
  File "/home/dominik/miniconda3/envs/vllama/lib/python3.9/site-packages/transformers/modeling_utils.py", line 1454, in _autoset_attn_implementation
    cls._check_and_enable_flash_attn_2(
  File "/home/dominik/miniconda3/envs/vllama/lib/python3.9/site-packages/transformers/modeling_utils.py", line 1535, in _check_and_enable_flash_attn_2
    raise ValueError(
ValueError: SiglipVisionModel does not support Flash Attention 2.0 yet. Please request to add support where the model is hosted, on its model hub page: https://huggingface.co//discussions/new or in the Transformers GitHub repo: https://github.com/huggingface/transformers/issues/new

Tried with:

  • DAMO-NLP-SG/VideoLLaMA2.1-7B-16F
  • DAMO-NLP-SG/VideoLLaMA2.1-7B-16F-Base
  • DAMO-NLP-SG/VideoLLaMA2-7B-Base

What is the issue? Why is Flash Attention 2.0 not supported despite being the default _attn_implementation?

@MengHao666

Same problem here. How can I solve it?

@clownrat6
Member

This is caused by an outdated transformers version. Try upgrading your transformers version to >4.45.0.
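
For reference, a quick way to confirm which transformers version is actually installed in the environment (a minimal sketch):

import transformers

# Print the installed version; per the comment above it should be >4.45.0
# for SiglipVisionModel to pass the Flash Attention 2.0 check.
# If it is older, upgrading (e.g. pip install -U transformers) should fix it.
print(transformers.__version__)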

@MengHao666

4.45.0

How can I modify the inference code on a machine without FlashAttention support?

import sys
sys.path.append('./')
from videollama2 import model_init, mm_infer
from videollama2.utils import disable_torch_init


def inference():
    disable_torch_init()

    # Video Inference
    modal = 'video'
    modal_path = 'assets/cat_and_chicken.mp4' 
    instruct = 'What animals are in the video, what are they doing, and how does the video feel?'
    # Reply:
    # The video features a kitten and a baby chick playing together. The kitten is seen laying on the floor while the baby chick hops around. The two animals interact playfully with each other, and the video has a cute and heartwarming feel to it.

    # Image Inference (these settings override the video example above, so only the image example runs)
    modal = 'image'
    modal_path = 'assets/sora.png'
    instruct = 'What is the woman wearing, what is she doing, and how does the image feel?'
    # Reply:
    # The woman in the image is wearing a black coat and sunglasses, and she is walking down a rain-soaked city street. The image feels vibrant and lively, with the bright city lights reflecting off the wet pavement, creating a visually appealing atmosphere. The woman's presence adds a sense of style and confidence to the scene, as she navigates the bustling urban environment.

    model_path = 'DAMO-NLP-SG/VideoLLaMA2.1-7B-16F'
    # Base model inference (only need to replace model_path)
    # model_path = 'DAMO-NLP-SG/VideoLLaMA2.1-7B-16F-Base'
    model, processor, tokenizer = model_init(model_path)
    output = mm_infer(processor[modal](modal_path), instruct, model=model, tokenizer=tokenizer, do_sample=False, modal=modal)

    print(output)

if __name__ == "__main__":
    inference()

@xl-liu

xl-liu commented Jan 17, 2025

4.45.0

How can I modify the inference code on a machine without FlashAttention support?

In model/encoder.py, you can change the attention implementation from config._attn_implementation = 'flash_attention_2' to config._attn_implementation = 'sdpa'.
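
A minimal sketch of what that edit could look like in videollama2/model/encoder.py (the exact surrounding code may differ, and the try/except fallback is an assumption on top of the original suggestion, which simply hardcodes 'sdpa'):

# Where the vision tower config is prepared in videollama2/model/encoder.py:
try:
    import flash_attn  # noqa: F401  # only importable when FlashAttention is installed
    config._attn_implementation = 'flash_attention_2'
except ImportError:
    # Fall back to PyTorch's built-in scaled-dot-product attention on
    # machines without FlashAttention support.
    config._attn_implementation = 'sdpa'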
