Use CUDA_VISIBLE_DEVICES instead of gpu_id variables everywhere. #2824
base: main
Conversation
This is recommended by PyTorch:

> In most cases it’s better to use CUDA_VISIBLE_DEVICES environmental variable.

(https://pytorch.org/docs/stable/generated/torch.cuda.set_device.html)

It also helps avoid certain classes of errors (e.g., initializing CUDA on the wrong device and using extra memory there).
Something at process start seems to already initialize CUDA.
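A minimal sketch contrasting the two approaches, with hypothetical worker functions and a made-up `gpu_id` (not SGLang's actual code): `torch.cuda.set_device` still lets anything that touches CUDA earlier create a context on device 0, whereas exporting `CUDA_VISIBLE_DEVICES` before CUDA is initialized means the worker only ever sees its own GPU.

```python
# Minimal sketch; the worker functions are hypothetical and only illustrate
# the PyTorch recommendation quoted above.
import os
import multiprocessing as mp

import torch


def worker_with_set_device(gpu_id: int):
    # Anything that initializes CUDA before this call may still create a
    # context on device 0 and waste memory there.
    torch.cuda.set_device(gpu_id)
    torch.ones(1, device="cuda")


def worker_with_visible_devices():
    # CUDA_VISIBLE_DEVICES was exported before this process initialized CUDA,
    # so exactly one GPU is visible and it is always addressed as cuda:0.
    torch.ones(1, device="cuda")


if __name__ == "__main__":
    gpu_id = 1  # hypothetical target device
    os.environ["CUDA_VISIBLE_DEVICES"] = str(gpu_id)
    proc = mp.Process(target=worker_with_visible_devices)
    proc.start()  # the child inherits the parent's environment at this point
    proc.join()
```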
```python
proc = mp.Process(
    target=run_scheduler_process,
    args=(server_args, port_args, gpu_id, tp_rank, dp_rank, writer),
)
proc.start()
os.environ["CUDA_VISIBLE_DEVICES"]
```
del os.environ["CUDA_VISIBLE_DEVICES"]?
Fixed.
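A sketch of the set-spawn-cleanup pattern the thread converges on; the helper name is made up and this is not the actual patch. Since `multiprocessing.Process` has no `env=` parameter, the parent temporarily mutates its own environment, and the child captures its copy of it on `start()`.

```python
# Sketch of the set-spawn-cleanup pattern discussed above (helper name is
# hypothetical). The env var only needs to be present while the child is
# being started; afterwards the parent restores its previous state.
import os
import multiprocessing as mp


def start_pinned(target, gpu_id, args=()):
    previous = os.environ.get("CUDA_VISIBLE_DEVICES")
    os.environ["CUDA_VISIBLE_DEVICES"] = str(gpu_id)
    try:
        proc = mp.Process(target=target, args=args)
        proc.start()  # the child captures its copy of the environment here
    finally:
        # Undo the change in the parent, restoring any pre-existing value.
        if previous is None:
            del os.environ["CUDA_VISIBLE_DEVICES"]
        else:
            os.environ["CUDA_VISIBLE_DEVICES"] = previous
    return proc
```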
```diff
@@ -450,13 +450,15 @@ def launch_engine(
     for tp_rank in tp_rank_range:
         reader, writer = mp.Pipe(duplex=False)
         gpu_id = server_args.base_gpu_id + tp_rank % tp_size_per_node
+        os.environ["CUDA_VISIBLE_DEVICES"] = str(gpu_id)
```
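One consequence worth noting, illustrated with made-up values rather than anything from the diff: once a single device is exported, the child addresses it as `cuda:0` regardless of its physical index, which is what makes carrying a `gpu_id` around unnecessary.

```python
# Illustration with made-up values: the exported device is re-indexed to 0
# inside the process, so code no longer needs a gpu_id.
import os

os.environ["CUDA_VISIBLE_DEVICES"] = "3"  # physical GPU 3 on the host

import torch  # CUDA must not have been initialized before this point

assert torch.cuda.device_count() == 1  # only one device is visible
x = torch.zeros(1, device="cuda:0")    # allocated on physical GPU 3
```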
Does this work for AMD? We might need to find the correct env vars for AMD as well.
cc @HaiShaw
Perhaps not; the tests pass for me locally but don't seem to pass in CI.
According to these docs, this should work: https://rocm.docs.amd.com/en/latest/conceptual/gpu-isolation.html#cuda-visible-devices
The tests are still failing.
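If the HIP runtime in CI does not honor CUDA_VISIBLE_DEVICES despite the linked docs, a hypothetical fallback (not part of this PR) would be to also export the ROCm-native variable on ROCm builds:

```python
# Hypothetical fallback, not part of this PR: also export the HIP-level
# variable when running a ROCm build of PyTorch, in case the HIP runtime
# ignores CUDA_VISIBLE_DEVICES in the CI environment.
import os

import torch


def pin_device(gpu_id: int) -> None:
    os.environ["CUDA_VISIBLE_DEVICES"] = str(gpu_id)
    if torch.version.hip is not None:  # ROCm build of PyTorch
        # HIP_VISIBLE_DEVICES is the HIP-level equivalent described in the
        # GPU isolation docs linked above. (ROCR_VISIBLE_DEVICES masks one
        # level lower and should not be combined with the same index.)
        os.environ["HIP_VISIBLE_DEVICES"] = str(gpu_id)
```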
Motivation
This is recommended by PyTorch:

> In most cases it’s better to use CUDA_VISIBLE_DEVICES environmental variable.

(https://pytorch.org/docs/stable/generated/torch.cuda.set_device.html)

It also helps avoid certain classes of errors (e.g., initializing CUDA on the wrong device and using extra memory there).
Modifications
- Set `CUDA_VISIBLE_DEVICES` at the start of `run_scheduler_process` (annoyingly, Python's `multiprocessing` module gives no way of setting the env variables of the new child process); see the sketch after this list.
- Removed `gpu_id` variables throughout the code.
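A hedged sketch of what the child-side export might look like. The signature is copied from the excerpt earlier in the thread and may not match the final patch; the body is purely illustrative.

```python
# Illustrative only: export the variable at the very top of the child's entry
# point, before anything in the scheduler can initialize CUDA. The signature
# mirrors the diff excerpt above; the real function body is omitted.
import os


def run_scheduler_process(server_args, port_args, gpu_id, tp_rank, dp_rank, writer):
    os.environ["CUDA_VISIBLE_DEVICES"] = str(gpu_id)  # must precede any CUDA init
    ...  # scheduler setup continues here
```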
Checklist