Basic Tutorial: Adding a New LLM Inference/Serving Backend #21

Open
PeterSH6 opened this issue Nov 22, 2024 · 1 comment
Labels
enhancement New feature or request generation

Comments

@PeterSH6
Collaborator

  1. Prerequisite: Make sure the LLM inference framework can be launched in SPMD style. For example, the inference script should be launchable with torchrun --standalone --nproc_per_node=8 offline_inference.py
  2. A Rollout class: Create an xxx_rollout.py script modeled on vllm_rollout.py. In this file, define an xxxRollout class that inherits from BaseRollout.
    1. This class should expose a generate_sequence API that takes a batch of input_ids, response_masks, and position_ids from the DataProto as input. The self.inference_engine (e.g., LLMEngine in vLLM) performs auto-regressive generation and returns a batch of responses. Concatenate these responses with input_ids, then extend response_masks and position_ids to cover the generated tokens.
  3. ShardingManager classes for weight synchronization with training frameworks: Create files named fsdp_xxx.py and megatron_xxx.py, modeled on fsdp_vllm.py and megatron_vllm.py. These files should define XXXShardingManager classes (i.e., HybridEngine) that reshard weights between the training and inference frameworks.
    1. In megatron_vllm.py, we define an AllGatherPPModel class to collect weights across the pipeline parallelism dimension. The parameters stored in the memory_buffers of AllGatherPPModel will be used to synchronize the weights with the models in the vLLM rollout.
  4. Weight loading issues: You may need to provide model-specific weight loaders to transfer weights between a given pair of LLM inference and training backends, as dtensor_weight_loader.py and megatron_weight_loader.py do for vLLM.
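
The steps above can be sketched in plain Python. This is only an illustration under several assumptions, not the project's implementation: `BaseRollout` is a simplified stand-in, `ToyEngine`, `ToyRollout`, `ToyShardingManager`, and `shard_for_rank` are hypothetical names, a dict of lists stands in for `DataProto` and its tensors, prompts are assumed to be left-padded, and the sharding manager is modeled as a context manager that loads weights on entry and releases them on exit.

```python
import os


def shard_for_rank(items, rank=None, world_size=None):
    """Return this process's slice of `items` (SPMD: same script, different data).

    torchrun exports RANK and WORLD_SIZE; the defaults allow a single-process run.
    """
    rank = int(os.environ.get("RANK", 0)) if rank is None else rank
    world_size = int(os.environ.get("WORLD_SIZE", 1)) if world_size is None else world_size
    return items[rank::world_size]


class BaseRollout:
    """Simplified stand-in for the framework's BaseRollout (assumption)."""

    def generate_sequence(self, prompts):
        raise NotImplementedError


class ToyEngine:
    """Hypothetical engine: 'generates' by repeating the last prompt token."""

    def __init__(self, max_new_tokens=2):
        self.max_new_tokens = max_new_tokens
        self.weights = {}  # filled in by the sharding manager

    def generate(self, prompt_ids):
        if not self.weights:
            raise RuntimeError("weights have not been synchronized")
        return [[p[-1]] * self.max_new_tokens for p in prompt_ids]


class ToyRollout(BaseRollout):
    def __init__(self, engine):
        self.inference_engine = engine

    def generate_sequence(self, prompts):
        input_ids = prompts["input_ids"]
        masks = prompts["response_masks"]
        position_ids = prompts["position_ids"]
        # Drop left padding before handing the prompts to the engine.
        raw = [[t for t, m in zip(ids, mask) if m] for ids, mask in zip(input_ids, masks)]
        responses = self.inference_engine.generate(raw)
        out = {"input_ids": [], "response_masks": [], "position_ids": []}
        for ids, mask, pos, resp in zip(input_ids, masks, position_ids, responses):
            out["input_ids"].append(ids + resp)                  # prompt + response
            out["response_masks"].append(mask + [1] * len(resp))  # new tokens are valid
            out["position_ids"].append(pos + [pos[-1] + 1 + i for i in range(len(resp))])
        return out


class ToyShardingManager:
    """Hypothetical manager: copies training weights into the engine while active."""

    def __init__(self, train_weights, engine):
        self.train_weights = train_weights
        self.engine = engine

    def __enter__(self):
        self.engine.weights = dict(self.train_weights)
        return self

    def __exit__(self, exc_type, exc, tb):
        self.engine.weights = {}  # release the inference-side copy
        return False


def run_demo():
    engine = ToyEngine(max_new_tokens=2)
    rollout = ToyRollout(engine)
    batch = {
        "input_ids": [[0, 5, 6], [7, 8, 9]],       # 0 is left padding
        "response_masks": [[0, 1, 1], [1, 1, 1]],
        "position_ids": [[0, 0, 1], [0, 1, 2]],
    }
    with ToyShardingManager({"w": 1.0}, engine):
        return rollout.generate_sequence(batch)


if __name__ == "__main__":
    print(run_demo())
```

Under torchrun, each rank would first take its own slice of the prompts with something like `shard_for_rank(all_prompts)` and then run the same rollout code. A real backend would replace the dict copy in `ToyShardingManager` with the gather and reshard logic for its training framework, plus the model-specific weight loaders mentioned in step 4.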
PeterSH6 added the enhancement and generation labels on Nov 22, 2024
PeterSH6 changed the title from "Basic Tutorial of Adding New LLM Inference/Serving Backend" to "Basic Tutorial: Adding a New LLM Inference/Serving Backend" on Nov 22, 2024
@eelxpeng

Could you describe the primary changes you had to make in verl/third_party/vllm/, assuming that most of the code in that directory comes from vLLM? If we could somehow simplify the dependency on vLLM, it would be a lot easier to upgrade to newer versions of vLLM.
