
Add options for configuring quantization in CachedCausalLM.from_pretrained() #26

Closed · gabegrand wants to merge 3 commits

Conversation

gabegrand (Collaborator) commented on Jan 22, 2025

Adds several kwargs to CachedCausalLM.from_pretrained() to make quantization more configurable. Preserves the default behavior of load_in_8bit=True.

Motivation: It turns out that NVIDIA Hopper removed int8 support (bitsandbytes-foundation/bitsandbytes#599) in favor of float8 quantization. This is an issue for running hfppl on H100 GPUs, which use the Hopper architecture. More generally, with the space of LLMs and quantization schemes evolving quickly, the existing CachedCausalLM.from_pretrained() should give the user more control over quantization configuration.

As a quick fix, this PR adds the ability to pass load_in_4bit as well as to specify a custom bnb_config. It's also separately useful to be able to pass a torch_dtype.
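As a rough illustration of how these options might be used (a sketch only: the exact signature, defaults, and the model id are assumptions based on this description, not the merged API):

```python
import torch
from transformers import BitsAndBytesConfig
from hfppl import CachedCausalLM

# Default behavior is preserved: 8-bit quantization via bitsandbytes.
lm_8bit = CachedCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

# On Hopper GPUs (e.g. H100), fall back to 4-bit quantization instead.
lm_4bit = CachedCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    load_in_8bit=False,
    load_in_4bit=True,
)

# Or supply a custom bitsandbytes config and an explicit dtype.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
lm_custom = CachedCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    bnb_config=bnb_config,
    torch_dtype=torch.bfloat16,
)
```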

Future steps: Certain Llama models now use torch.bfloat16. However, this dtype isn't supported by numpy, so it's currently incompatible with hfppl; there are several workarounds we should explore that extend numpy to support it.
EDIT: It turns out the only issue with bfloat16 arises when we try to store logprobs in the Trie without first converting them to a numpy-friendly format. I've added calls to .float() in two places, and these seem to be sufficient to support bfloat16 models.
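For context, here is a minimal stand-alone illustration of the bfloat16/numpy incompatibility and the cast that works around it (this is not the actual hfppl code, just the failure mode and fix):

```python
import torch

logprobs = torch.randn(32, dtype=torch.bfloat16)  # e.g. next-token logprobs

# logprobs.numpy() raises a TypeError, since numpy has no bfloat16 dtype.
# Casting to float32 first makes the conversion safe:
logprobs_np = logprobs.float().cpu().numpy()
```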

gabegrand requested a review from alex-lew on Jan 22, 2025 at 19:16
gabegrand (Collaborator, Author) commented:

On closer inspection, this PR seems to be largely subsumed by the changes in #24, so I'm going to close it for now in favor of Ben's PR. We should try to merge that one in soon.

@benlebrun What do you think about updating the Trie code in the GenLM backend to make sure it casts p_llm to float before calling .numpy()? This would ensure that models returning bfloat16 don't cause errors when we move the logprobs to CPU.
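A minimal sketch of what such a cast might look like (the helper name is hypothetical; p_llm is the tensor of model logprobs referenced above, and the surrounding Trie code is assumed):

```python
import torch

def logprobs_to_numpy(p_llm: torch.Tensor):
    # Cast to float32 before .numpy(), since numpy cannot represent bfloat16.
    return p_llm.float().cpu().numpy()
```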
