I've made an implementation here of context parallelism for Mamba 2. It uses a sequential step at the state transfer stage, but otherwise functions in parallel. I've validated that the results are numerically within floating-point error between a single-GPU context and a multi-GPU context, for both forward and backward pass calculations.
It uses a hack of the causal_conv1d function: the number of tokens equivalent to the convolution window is transferred between GPUs, and the results for the few prepended tokens on each GPU are then discarded. This requires a new ContextMixer layer to be inserted before each Mamba2 layer, which is inserted automatically by a modification to the Mamba2 class. The actual GPU-to-GPU transfer is done in a loop in the ssd_combined function.
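For readers who want the gist of the sequential step, here is a minimal sketch of the state hand-off between ranks, assuming one `torch.distributed` rank per context chunk and exploiting the linearity of the SSM state recurrence. The function and argument names (`sequential_state_transfer`, `local_state`, `chunk_decay`) are illustrative only; in this PR the actual loop lives inside `ssd_combined`.

```python
import torch
import torch.distributed as dist

def sequential_state_transfer(local_state: torch.Tensor,
                              chunk_decay: torch.Tensor) -> torch.Tensor:
    """Hand the SSM state from rank r to rank r+1, one hop at a time.

    local_state: this rank's final state computed as if its chunk started
                 from a zero initial state.
    chunk_decay: cumulative decay an incoming state experiences across this
                 rank's chunk (broadcastable to local_state).
    Returns the initial state this rank should resume from (zeros on rank 0).
    """
    rank = dist.get_rank()
    world = dist.get_world_size()

    init_state = torch.zeros_like(local_state)
    if rank > 0:
        dist.recv(init_state, src=rank - 1)        # wait for the upstream chunk
    if rank < world - 1:
        outgoing = chunk_decay * init_state + local_state
        dist.send(outgoing, dst=rank + 1)          # forward to the downstream chunk
    return init_state
```

Only this hand-off is sequential across GPUs; everything before and after it runs fully in parallel.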
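A rough sketch of the boundary-token exchange around the convolution, assuming the `causal_conv1d_fn` interface from the causal-conv1d package and channels-first `(batch, dim, seqlen)` inputs. `context_mixer_conv` and its arguments are illustrative names, not the PR's ContextMixer implementation.

```python
import torch
import torch.distributed as dist
from causal_conv1d import causal_conv1d_fn

def context_mixer_conv(x, conv_weight, conv_bias, d_conv):
    """Prepend the last (d_conv - 1) tokens from the previous rank so the
    causal conv sees its true left context, then discard those positions.

    x:           (batch, dim, seqlen) local chunk, channels first
    conv_weight: (dim, d_conv) depthwise conv weights
    """
    rank = dist.get_rank()
    world = dist.get_world_size()
    pad = d_conv - 1

    # Exchange boundary tokens: send our tail downstream, receive the
    # previous rank's tail to use as left context (zeros on rank 0).
    left_ctx = x.new_zeros(x.shape[0], x.shape[1], pad)
    reqs = []
    if rank < world - 1:
        reqs.append(dist.isend(x[..., -pad:].contiguous(), dst=rank + 1))
    if rank > 0:
        reqs.append(dist.irecv(left_ctx, src=rank - 1))
    for req in reqs:
        req.wait()

    x_ext = torch.cat([left_ctx, x], dim=-1)
    y = causal_conv1d_fn(x_ext, conv_weight, conv_bias, activation="silu")
    return y[..., pad:]  # drop outputs for the prepended context tokens
```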
Please let me know how I can further improve the PR to make it a mergeable contribution. Also feel free to reach out if you'd like help setting up a multi-GPU context parallel run.
N.B. this PR does not include splitting the initial input sequence across GPUs or aggregating gradients after the loss; both would need to be handled by the training loop code.
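For anyone wiring this into a trainer, a hypothetical sketch of those two training-loop responsibilities (not part of this PR, names are illustrative):

```python
import torch
import torch.distributed as dist

def split_sequence_for_cp(input_ids: torch.Tensor) -> torch.Tensor:
    """Give each rank a contiguous slice along the sequence dimension.
    input_ids: (batch, seqlen); seqlen assumed divisible by the world size."""
    rank, world = dist.get_rank(), dist.get_world_size()
    chunk = input_ids.shape[1] // world
    return input_ids[:, rank * chunk:(rank + 1) * chunk].contiguous()

def allreduce_gradients(model: torch.nn.Module) -> None:
    """Average parameter gradients across context-parallel ranks after backward()."""
    world = dist.get_world_size()
    for p in model.parameters():
        if p.grad is not None:
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
            p.grad.div_(world)
```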