
Use CUDA_VISIBLE_DEVICES instead of gpu_id variables everywhere. #2824

Open · wants to merge 5 commits into main
Conversation

@heiner (Contributor) commented Jan 10, 2025

Motivation

This is recommended by PyTorch:

> In most cases it’s better to use CUDA_VISIBLE_DEVICES environmental variable.

(https://pytorch.org/docs/stable/generated/torch.cuda.set_device.html)

It also helps avoid certain classes of errors (e.g., initializing CUDA on the wrong device and using extra memory there).

Modifications

  • Set CUDA_VISIBLE_DEVICES at the start of run_scheduler_process (annoyingly, Python's multiprocessing module gives no way of setting the env variables of the new child process); see the sketch after this list.
  • Drop many of the gpu_id variables throughout the code.
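A minimal sketch of that child-side pattern, assuming the signature used when spawning the scheduler elsewhere in this PR (the real function body is elided):

```python
import os


def run_scheduler_process(server_args, port_args, gpu_id, tp_rank, dp_rank, writer):
    # Must run before anything initializes CUDA in this process.
    os.environ["CUDA_VISIBLE_DEVICES"] = str(gpu_id)
    # From here on the process sees exactly one GPU, always as "cuda:0",
    # so gpu_id no longer needs to be threaded through the scheduler code.
    ...
```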

@merrymercy (Contributor) reviewed this hunk on Jan 10, 2025:

```python
proc = mp.Process(
    target=run_scheduler_process,
    args=(server_args, port_args, gpu_id, tp_rank, dp_rank, writer),
)
proc.start()
os.environ["CUDA_VISIBLE_DEVICES"]
```

`del os.environ["CUDA_VISIBLE_DEVICES"]`?

@heiner (Contributor, author) replied:

Fixed.
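For context, a sketch of the set-then-clean-up pattern being suggested here (parent side, before spawning); the helper name is made up for illustration and this is not the exact change in the PR:

```python
import multiprocessing as mp
import os


def start_scheduler(run_scheduler_process, args, gpu_id):
    # The child inherits the parent's environment at spawn/fork time,
    # so the selected GPU is the only device it will ever see.
    os.environ["CUDA_VISIBLE_DEVICES"] = str(gpu_id)
    proc = mp.Process(target=run_scheduler_process, args=args)
    proc.start()
    # Clean up so the parent process (and any later children launched
    # with a different gpu_id) are not affected by this setting.
    del os.environ["CUDA_VISIBLE_DEVICES"]
    return proc
```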

@merrymercy (Contributor) commented on this hunk:

```diff
@@ -450,13 +450,15 @@ def launch_engine(
     for tp_rank in tp_rank_range:
         reader, writer = mp.Pipe(duplex=False)
         gpu_id = server_args.base_gpu_id + tp_rank % tp_size_per_node
+        os.environ["CUDA_VISIBLE_DEVICES"] = str(gpu_id)
```

Does this work for AMD? We might need to find the correct env vars for AMD as well.
cc @HaiShaw

@heiner (Contributor, author) replied:

Perhaps not; the tests pass for me locally but don't seem to pass in CI.
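As an aside on the AMD question, a hedged sketch of picking the masking variable per backend; the helper name is hypothetical, and HIP_VISIBLE_DEVICES is the ROCm-side variable (whether CUDA_VISIBLE_DEVICES also suffices there would need to be verified):

```python
import os

import torch


def set_visible_device_env(gpu_id: int) -> None:
    # Hypothetical helper: torch.version.hip is a string on ROCm builds
    # of PyTorch and None on CUDA builds.
    if torch.version.hip is not None:
        os.environ["HIP_VISIBLE_DEVICES"] = str(gpu_id)
    else:
        os.environ["CUDA_VISIBLE_DEVICES"] = str(gpu_id)
```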

@merrymercy (Contributor) commented:

The tests are still failing.
