
Use CUDA_VISIBLE_DEVICES instead of gpu_id variables everywhere. #2824

Open · wants to merge 5 commits into main
Conversation

@heiner (Contributor) commented Jan 10, 2025

Motivation

This is recommended by PyTorch:

> In most cases it’s better to use CUDA_VISIBLE_DEVICES environmental variable.

(https://pytorch.org/docs/stable/generated/torch.cuda.set_device.html)

It also helps avoid certain classes of errors (e.g., initializing CUDA on the wrong device and using extra memory there).

Modifications

  • Set CUDA_VISIBLE_DEVICES at the start of run_scheduler_process (annoyingly, Python's multiprocessing module gives no way of setting the env variables of the new child process); see the sketch after this list.
  • Drop many of the gpu_id variables throughout the code.
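A minimal sketch of that child-side pattern, assuming the signature used when spawning the scheduler elsewhere in this PR (the real function body is elided):

```python
import os


def run_scheduler_process(server_args, port_args, gpu_id, tp_rank, dp_rank, writer):
    # Must run before anything initializes CUDA in this process.
    os.environ["CUDA_VISIBLE_DEVICES"] = str(gpu_id)
    # From here on the process sees exactly one GPU, always as "cuda:0",
    # so gpu_id no longer needs to be threaded through the scheduler code.
    ...
```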

@merrymercy (Contributor) reviewed this hunk on Jan 10, 2025:

```python
proc = mp.Process(
    target=run_scheduler_process,
    args=(server_args, port_args, gpu_id, tp_rank, dp_rank, writer),
)
proc.start()
os.environ["CUDA_VISIBLE_DEVICES"]
```

`del os.environ["CUDA_VISIBLE_DEVICES"]`?

@heiner (Contributor, author) replied:

Fixed.
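For context, a sketch of the set-then-clean-up pattern being suggested here (parent side, before spawning); the helper name is made up for illustration and this is not the exact change in the PR:

```python
import multiprocessing as mp
import os


def start_scheduler(run_scheduler_process, args, gpu_id):
    # The child inherits the parent's environment at spawn/fork time,
    # so the selected GPU is the only device it will ever see.
    os.environ["CUDA_VISIBLE_DEVICES"] = str(gpu_id)
    proc = mp.Process(target=run_scheduler_process, args=args)
    proc.start()
    # Clean up so the parent process (and any later children launched
    # with a different gpu_id) are not affected by this setting.
    del os.environ["CUDA_VISIBLE_DEVICES"]
    return proc
```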

@merrymercy (Contributor) commented on this hunk:

```diff
@@ -450,13 +450,15 @@ def launch_engine(
     for tp_rank in tp_rank_range:
         reader, writer = mp.Pipe(duplex=False)
         gpu_id = server_args.base_gpu_id + tp_rank % tp_size_per_node
+        os.environ["CUDA_VISIBLE_DEVICES"] = str(gpu_id)
```

Does this work for AMD? We might need to find the correct env vars for AMD as well.
cc @HaiShaw

@heiner (Contributor, author) replied:

Perhaps not; the tests pass for me locally but don't seem to pass in CI.
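As an aside on the AMD question, a hedged sketch of picking the masking variable per backend; the helper name is hypothetical, and HIP_VISIBLE_DEVICES is the ROCm-side variable (whether CUDA_VISIBLE_DEVICES also suffices there would need to be verified):

```python
import os

import torch


def set_visible_device_env(gpu_id: int) -> None:
    # Hypothetical helper: torch.version.hip is a string on ROCm builds
    # of PyTorch and None on CUDA builds.
    if torch.version.hip is not None:
        os.environ["HIP_VISIBLE_DEVICES"] = str(gpu_id)
    else:
        os.environ["CUDA_VISIBLE_DEVICES"] = str(gpu_id)
```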

@merrymercy (Contributor) commented:

The tests are still failing.
