
Model does not support Flash Attention 2.0 yet #146

Open
DominikVincent opened this issue Jan 10, 2025 · 4 comments

Comments

@DominikVincent

Running gradio_web_server_adhoc.py or the inference script with different models always fails with the following error:

    inference()
  File "/home/dominik/Documents/repos/VideoLLaMA2/own_scripts/inference.py", line 27, in inference
    model, processor, tokenizer = model_init(model_path)
  File "/home/dominik/Documents/repos/VideoLLaMA2/videollama2/__init__.py", line 17, in model_init
    tokenizer, model, processor, context_len = load_pretrained_model(model_path, None, model_name, **kwargs)
  File "/home/dominik/Documents/repos/VideoLLaMA2/videollama2/model/__init__.py", line 174, in load_pretrained_model
    model = Videollama2Qwen2ForCausalLM.from_pretrained(model_path, low_cpu_mem_usage=True, config=config, **kwargs)
  File "/home/dominik/miniconda3/envs/vllama/lib/python3.9/site-packages/transformers/modeling_utils.py", line 3550, in from_pretrained
    model = cls(config, *model_args, **model_kwargs)
  File "/home/dominik/Documents/repos/VideoLLaMA2/videollama2/model/videollama2_qwen2.py", line 50, in __init__
    self.model = Videollama2Qwen2Model(config)
  File "/home/dominik/Documents/repos/VideoLLaMA2/videollama2/model/videollama2_qwen2.py", line 42, in __init__
    super(Videollama2Qwen2Model, self).__init__(config)
  File "/home/dominik/Documents/repos/VideoLLaMA2/videollama2/model/videollama2_arch.py", line 34, in __init__
    self.vision_tower = build_vision_tower(config)
  File "/home/dominik/Documents/repos/VideoLLaMA2/videollama2/model/encoder.py", line 160, in build_vision_tower
    vision_tower = SiglipVisionTower(vision_tower, args=vision_tower_cfg, **kwargs)
  File "/home/dominik/Documents/repos/VideoLLaMA2/videollama2/model/encoder.py", line 99, in __init__
    self.vision_tower = SiglipVisionModel(config=config)
  File "/home/dominik/miniconda3/envs/vllama/lib/python3.9/site-packages/transformers/models/siglip/modeling_siglip.py", line 915, in __init__
    super().__init__(config)
  File "/home/dominik/miniconda3/envs/vllama/lib/python3.9/site-packages/transformers/modeling_utils.py", line 1307, in __init__
    config = self._autoset_attn_implementation(
  File "/home/dominik/miniconda3/envs/vllama/lib/python3.9/site-packages/transformers/modeling_utils.py", line 1454, in _autoset_attn_implementation
    cls._check_and_enable_flash_attn_2(
  File "/home/dominik/miniconda3/envs/vllama/lib/python3.9/site-packages/transformers/modeling_utils.py", line 1535, in _check_and_enable_flash_attn_2
    raise ValueError(
ValueError: SiglipVisionModel does not support Flash Attention 2.0 yet. Please request to add support where the model is hosted, on its model hub page: https://huggingface.co//discussions/new or in the Transformers GitHub repo: https://github.com/huggingface/transformers/issues/new

Tried with:

  • DAMO-NLP-SG/VideoLLaMA2.1-7B-16F
  • DAMO-NLP-SG/VideoLLaMA2.1-7B-16F-Base
  • DAMO-NLP-SG/VideoLLaMA2-7B-Base

What is the issue? Why is Flash Attention 2.0 not supported despite being the default _attn_implementation?

@MengHao666

Same problem here. How can I solve it?

@clownrat6
Member

This is caused by an outdated transformers version. Try upgrading your transformers version to >4.45.0.
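
For reference, a quick way to confirm which transformers version is actually installed in the environment (a minimal sketch):

import transformers

# Print the installed version; per the comment above it should be >4.45.0
# for SiglipVisionModel to pass the Flash Attention 2.0 check.
# If it is older, upgrading (e.g. pip install -U transformers) should fix it.
print(transformers.__version__)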

@MengHao666

4.45.0

How can I modify the inference code on a machine without FlashAttention support?

import sys
sys.path.append('./')
from videollama2 import model_init, mm_infer
from videollama2.utils import disable_torch_init


def inference():
    disable_torch_init()

    # Video Inference
    modal = 'video'
    modal_path = 'assets/cat_and_chicken.mp4' 
    instruct = 'What animals are in the video, what are they doing, and how does the video feel?'
    # Reply:
    # The video features a kitten and a baby chick playing together. The kitten is seen laying on the floor while the baby chick hops around. The two animals interact playfully with each other, and the video has a cute and heartwarming feel to it.

    # Image Inference (these settings override the video example above, so only the image example runs)
    modal = 'image'
    modal_path = 'assets/sora.png'
    instruct = 'What is the woman wearing, what is she doing, and how does the image feel?'
    # Reply:
    # The woman in the image is wearing a black coat and sunglasses, and she is walking down a rain-soaked city street. The image feels vibrant and lively, with the bright city lights reflecting off the wet pavement, creating a visually appealing atmosphere. The woman's presence adds a sense of style and confidence to the scene, as she navigates the bustling urban environment.

    model_path = 'DAMO-NLP-SG/VideoLLaMA2.1-7B-16F'
    # Base model inference (only need to replace model_path)
    # model_path = 'DAMO-NLP-SG/VideoLLaMA2.1-7B-16F-Base'
    model, processor, tokenizer = model_init(model_path)
    output = mm_infer(processor[modal](modal_path), instruct, model=model, tokenizer=tokenizer, do_sample=False, modal=modal)

    print(output)

if __name__ == "__main__":
    inference()

@xl-liu

xl-liu commented Jan 17, 2025

4.45.0

How can I modify the inference code on a machine without FlashAttention support?

In model/encoder.py, you can change the attention implementation from config._attn_implementation = 'flash_attention_2' to config._attn_implementation = 'sdpa'.
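
A minimal sketch of what that edit could look like in videollama2/model/encoder.py (the exact surrounding code may differ, and the try/except fallback is an assumption on top of the original suggestion, which simply hardcodes 'sdpa'):

# Where the vision tower config is prepared in videollama2/model/encoder.py:
try:
    import flash_attn  # noqa: F401  # only importable when FlashAttention is installed
    config._attn_implementation = 'flash_attention_2'
except ImportError:
    # Fall back to PyTorch's built-in scaled-dot-product attention on
    # machines without FlashAttention support.
    config._attn_implementation = 'sdpa'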
