
VLM: TraceableChatGLMForConditionalGeneration #1039

Draft
wants to merge 348 commits into base: main

Conversation

@kylesayrs (Collaborator) commented Jan 6, 2025

Purpose

  • Support GLM architecture

Related issues

TODO

  • Investigate whether we can dynamically import ChatGLMForConditionalGeneration when TraceableChatGLMForConditionalGeneration is imported (see the sketch below)
  • Check copyright by @jeanniefinks and @markurtz
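One possible mechanism for the dynamic import above is a module-level `__getattr__` (PEP 562) on the tracing package. A minimal sketch, assuming the traceable definition lives in a sibling `chatglm` module; the mapping and module path are illustrative, not the repository's actual layout:

```python
# Hypothetical src/llmcompressor/transformers/tracing/__init__.py
import importlib

# Map lazily-exported names to the (assumed) submodules that define them
_LAZY_IMPORTS = {
    "TraceableChatGLMForConditionalGeneration": ".chatglm",
}


def __getattr__(name: str):
    # PEP 562: invoked only when `name` is not already in the module
    # namespace, so the ChatGLM definition is imported on first access
    # rather than at package import time
    if name in _LAZY_IMPORTS:
        module = importlib.import_module(_LAZY_IMPORTS[name], __package__)
        return getattr(module, name)
    raise AttributeError(f"module {__name__!r} has no attribute {name!r}")
```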

Signed-off-by: Kyle Sayers <[email protected]>
…tokenized datasets should not be given labels
github-actions bot commented Jan 6, 2025

👋 Hi! Thank you for contributing to llm-compressor. Please add the ready label when the PR is ready for review.

@kylesayrs kylesayrs changed the base branch from main to kylesayrs/gptq-partition January 6, 2025 23:57
dsikka added a commit that referenced this pull request Jan 8, 2025
## Purpose ##
* Enable oneshot quantization of vision-language models

![VLM Banner](https://github.com/user-attachments/assets/0d748714-b524-44f4-b850-a721f35d5543)
[Llama_3.2-Vision Graphviz](https://github.com/user-attachments/assets/6b371ccc-f9f6-4bf2-b4cd-24ed75a3cad0)

## Related Issues ##
* Fixes #91
* Fixes #961
* Fixes #990

## Prerequisites ##
* neuralmagic/compressed-tensors#193
* #917
* #943
  * #955
    * #950
* #998
* #1014

## Changes ##
### VLM Support ###
* Add multimodal examples in `examples/multimodal_vision`
* Modify `custom_offload_device_map` to support models which are not
`XForCausalLM`
* Add custom data collators for VLM models in `src/llmcompressor/transformers/utils/data_collator.py` (a sketch of the pattern follows this list)
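The collators exist because multimodal processor outputs carry extra keys (pixel values, image grids, ...) that a default text collator drops or mis-batches. A minimal sketch of the pattern, assuming oneshot calibration with batch size 1; the function name and key handling are illustrative rather than the file's exact contents:

```python
import torch


def vlm_data_collator(batch):
    # Calibration runs one sample at a time, so no cross-sample padding is
    # needed; just tensorize every field the processor produced
    # (input_ids, attention_mask, pixel_values, ...) for the single sample
    assert len(batch) == 1, "sketch assumes calibration batch_size=1"
    return {key: torch.tensor(value) for key, value in batch[0].items()}
```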

### GPTQModifier ###
* Implement hooks-based compression in `GPTQModifier`
  * This replaces layer-compressor, which made many assumptions about model architecture
  * This also enables finer-grained sequential compression such as [true_sequential](https://huggingface.co/docs/transformers/main_classes/quantization#transformers.GPTQConfig.true_sequential)
  * Functions previously implemented in `gptq_wrapper.py` are now implemented in `gptq_quantize.py`
* Implement `offload_hessians` parameter in `GPTQModifier`
* Implement data-pipelines-based calibration in `GPTQModifier` (a sketch of the fallback chain follows this subsection)
  * First, an attempt is made to trace the model and run the `sequential` pipeline
  * If that fails, assumptions are made about the model architecture and an attempt is made to run the `layer_sequential` pipeline
    * This ensures backwards compatibility with any previously supported models
  * If that fails, the `basic` pipeline is used, which is guaranteed to run but may require `offload_hessians`
* Change hessian instability from a `ValueError` to a `_LinAlgError` so
it can be ignored by the gptq pipeline fallback mechanism
* Add support for conv2d as indicated by
[AutoGPTQ](https://github.com/AutoGPTQ/AutoGPTQ/blob/6689349625de973b9ee3016c28c11f32acf7f02c/auto_gptq/quantization/gptq.py#L45-L54)
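To make the fallback concrete, here is a schematic of the chain described above. The function, its signature, and the pipeline callables are illustrative stand-ins, not the modifier's actual entry points:

```python
from loguru import logger  # llm-compressor logs via loguru; illustrative here


def calibrate_with_fallback(model, dataloader, pipelines):
    """Try calibration pipelines in order of decreasing fidelity.

    pipelines: ordered list of (name, callable) pairs, e.g.
    [("sequential", ...), ("layer_sequential", ...), ("basic", ...)],
    where each callable runs calibration forward passes over the dataloader.
    """
    for name, pipeline in pipelines[:-1]:
        try:
            return pipeline(model, dataloader)
        except Exception as exc:
            # Tracing failures and hessian instability (_LinAlgError) both
            # land here, triggering the next, more permissive pipeline
            logger.warning(f"{name} pipeline failed ({exc}); falling back")

    # The last pipeline ("basic") is guaranteed to run, though it may need
    # offload_hessians=True on the modifier to fit hessians in memory
    name, pipeline = pipelines[-1]
    return pipeline(model, dataloader)
```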

### Data Pipelines ###
* Implement the basic skeletons of data pipelines, which are subject to change when data pipelines are pulled out of modifiers
* Basic Pipeline
  * Performs standard forward passes through the model with the provided dataloader
  * Used as a fallback, as well as, in the future, for basic calibration passes
* Layer Sequential Pipeline
  * Refactor of `LayerCompressor` as a straightforward data pipeline
  * Uses `IntermediatesCache` to handle activation offloading
* Sequential Pipeline
  * Uses graph tracing implemented by `torch.fx` to determine where sequential targets (layers) exist in the graph and what their inputs and outputs are
  * Implements a BFS algorithm to assign nodes to partitions
    * An ideal implementation consolidates partition indices to assign each node to the latest possible partition, delaying execution. The current implementation addresses the most common case (`node.op == get_attr`)
  * Each partition (`Subgraph`) is compiled as an executable python function with the proper inputs and outputs
  * Uses `IntermediatesCache` to handle activation offloading
* Implement `IntermediatesCache`, which automagically handles the offloading and onloading of activations from batches (see the sketch below)
  * This class is capable of offloading many non-standard activation types, such as `Tuple`s and dataclasses such as `BaseModelOutputWithPast`
  * For convenience, the class also handles masking padding
  * The class is tested in `tests/llmcompressor/pipelines/test_cache.py`
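A simplified sketch of the offload/onload idea behind `IntermediatesCache`; the real class also handles dataclasses and padding masks, and the class and method names here are assumptions for illustration:

```python
import torch


class SimpleIntermediatesCache:
    """Store per-batch intermediate activations on CPU, onloading on demand."""

    def __init__(self, offload_device="cpu", onload_device="cuda"):
        self.offload_device = offload_device
        self.onload_device = onload_device
        self.batches = {}  # batch_index -> {name: offloaded value}

    def _move(self, value, device):
        # Recurse into tuples so nested activation structures (e.g. KV
        # caches) survive the round trip; non-tensors pass through unchanged
        if isinstance(value, torch.Tensor):
            return value.to(device)
        if isinstance(value, tuple):
            return tuple(self._move(v, device) for v in value)
        return value

    def update(self, batch_index, values):
        # Offload a batch's named activations to the offload device
        self.batches[batch_index] = {
            name: self._move(value, self.offload_device)
            for name, value in values.items()
        }

    def fetch(self, batch_index, names):
        # Onload the requested activations back to the execution device
        stored = self.batches[batch_index]
        return {name: self._move(stored[name], self.onload_device) for name in names}
```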

### Tracing ###
* In order to support sequential quantization of the large variety of different multimodal model architectures, some model definitions have to be altered to support tracing
  * If the calibration dataset is text-only, most LLMs and VLMs are traceable without additional work. Multimodal calibration datasets are more likely to require additional work to make traceable
  * For many VLMs (but not all), the vision tower is not traceable without significant work. However, this only affects sequential error propagation and (minimal?) increased memory usage, which leaves the door open for future support for quantizing modules in the vision tower
* Add traceable model definitions for llava, mistral, mllama, and glm (a usage sketch follows this list)
  * All copyright licenses allow for alteration and redistribution; the line `# vllm-project: no copyright` was added in a similar style to [text_generation.py](https://github.com/vllm-project/llm-compressor/blob/main/src/llmcompressor/transformers/finetune/text_generation.py#L18)
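For context, the multimodal examples swap the traceable definition in for the stock `transformers` class and hand it to `oneshot`. A rough sketch for llava; treat the recipe file and argument values as assumptions based on the examples this PR series adds:

```python
import torch
from transformers import AutoProcessor

from llmcompressor.transformers import oneshot
from llmcompressor.transformers.tracing import TraceableLlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"
# The traceable class only rewrites untraceable control flow; weights and
# config are loaded exactly as with the stock class
model = TraceableLlavaForConditionalGeneration.from_pretrained(
    model_id, device_map="auto", torch_dtype="auto"
)
processor = AutoProcessor.from_pretrained(model_id)


def data_collator(batch):
    # Same pattern as the VLM collator sketch above
    assert len(batch) == 1
    return {key: torch.tensor(value) for key, value in batch[0].items()}


oneshot(
    model=model,
    dataset="flickr30k",              # multimodal calibration dataset
    recipe="recipe.yaml",             # e.g. a GPTQModifier W4A16 recipe
    max_seq_length=2048,
    num_calibration_samples=512,
    data_collator=data_collator,
)
```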

## Future Work / Follow-ups ##
* #1027
* #1032
* #1039
* #1030
* Create better data collators capable of handling larger batch sizes in
order to support VLM fine tuning
* Better support prompt masking for multimodal processors in order to
support VLM fine tuning

## Winogrande Evaluations ##

Model | Dataset | Scheme | Runtime | Winogrande |
-- | -- | -- | -- | --
Llama-3-8B | ultrachat | W4A16 | 43m, 2xA4000 | 0.7545 
Llama-3-70B | ultrachat | W4A16 | 303m, 1xH100 | 0.8216 
Mixtral-8x7B | ultrachat | W4A16 | 317m, 1xA100 | 0.8200 
openbmb/MiniCPM3-4B | ultrachat | W4A16 | 63m, 1xA100 | 0.6701 
Qwen2-VL-2B-Instruct | ultrachat | W8A8 | 12m, 2xA4000 | 0.6188 
Qwen2-VL-2B-Instruct | flickr | W8A8 | 24m, 2xA4000 | 0.6093 
Llama-3.2-11B-Vision-Instruct | flickr | W8A8 | 75m, 1xA100 | 0.7837 
Pixtral-12B-2409 | flickr | W8A8 | 52m, 1xA100 | 0.7924 
llava-1.5-7b-hf | flickr | W8A8 | 15m, 1xH100 | 0.7214 
Phi-3-vision-128k-instruct | flickr | W4A16 | 51m, 1xA100 | 0.7151 

`lm_eval --model vllm --model_args pretrained="path/to/model",dtype=auto,max_model_len=4096,tensor_parallel_size=1,gpu_memory_utilization=0.8,enforce_eager=True,add_bos_token=True --tasks winogrande --num_fewshot 5 --batch_size 32`

`lm_eval --model vllm --model_args pretrained="path/to/model",dtype=bfloat16,max_model_len=4096,tensor_parallel_size=1,gpu_memory_utilization=0.8,enforce_eager=True,add_bos_token=True,max_num_seqs=1 --tasks winogrande --num_fewshot 5 --batch_size 1`

## MMMU Evaluations ##
Credit to @shubhra 

Model | Dataset | Scheme | MMMU
-- | -- | -- | --
Llama-3.2-11B-Vision | N/A | Dense | 0.4144
Llama-3.2-11B-Vision | N/A | FP8-dynamic | 0.4300
Llama-3.2-11B-Vision | flickr | W4A16 | 0.4377
Llama-3.2-11B-Vision | flickr | W4A16-group | 0.4211

Model | Dataset | Scheme | MMMU
-- | -- | -- | --
Llama-3.2-90B-Vision | N/A | Dense | 0.5388
Llama-3.2-90B-Vision | N/A | FP8-dynamic | 0.5278
Llama-3.2-90B-Vision | flickr | W4A16 | 0.5111
Llama-3.2-90B-Vision | flickr | W4A16-group | 0.5477

Model | Dataset | Scheme | MMMU
-- | -- | -- | --
Pixtral-12B-2409 | N/A | Dense | 0.5022
Pixtral-12B-2409 | N/A | FP8-dynamic | 0.5322
Pixtral-12B-2409 | flickr | W4A16 | 0.4500
Pixtral-12B-2409 | flickr | W4A16-group | 0.4689

## Testing ##
* [Nightly](https://github.com/neuralmagic/llm-compressor-testing/actions/runs/12640439996)

---------

Signed-off-by: Kyle Sayers <[email protected]>
Co-authored-by: Dipika Sikka <[email protected]>
Base automatically changed from kylesayrs/gptq-partition to main January 8, 2025 22:15