Discussion about Slurm queue policies #4
I mostly agree with Aurelien's description. Additionally, keeping Slurm would be highly desirable, as people are already familiar with it and it is used on Frontier, Perlmutter, and other places. For mode 3: GitHub Actions/backfill, is the suggestion that GitHub Actions would never run (or at least never start) on a node when users are logged into that node (via either mode 1 or 2)? In some ways this is nice, but if nodes were very busy with users, it could mean that GitHub Actions face starvation. We could see how it worked and adjust if there were issues. I have been blocked from merging PRs in the past because someone was using 100% of the GPU memory overnight, so checks could not run (or actually failed), but that was a rare occurrence that was resolved by email. SLATE is moving towards using 4 GPUs in its CI testing, so that needs to be feasible.
I have tried to process both this discussion and the description of usage patterns that were sent to me by Piotr, Natalie, and Ahmad. There are some corner cases and desired usage patterns which are not possible to implement in a single queue system, either due to limitations in the queue implementation (Slurm) or limitations in the scope of use that job queue systems in general are meant to address. We must decide which usage patterns to support and which need to be changed to adapt to the queue policy that we can implement. As a first step, I would like to propose a minimal configuration that we can use/refine over time. Given this starting point, I should be able to say if/how a particular usage pattern would be accomplished. Here is the config:
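As a rough illustration only (this is a sketch of the kind of sub-node, GRES-based setup described in the comments below, not the actual configuration being proposed, and every node name, core count, and GPU count is a placeholder), such a slurm.conf fragment might look like:

```
# Illustration only -- not the configuration referenced above.
# Sub-node scheduling: CPUs, memory, and GPUs are tracked and allocated
# individually instead of whole nodes.
SelectType=select/cons_tres
SelectTypeParameters=CR_Core_Memory
GresTypes=gpu
# Enforce the allocation with cgroups so jobs only see what they requested.
ProctrackType=proctrack/cgroup
TaskPlugin=task/cgroup,task/affinity
# Placeholder node/partition definitions (names and counts are guesses);
# a matching gres.conf mapping "gpu" to the /dev/nvidia* devices is also needed.
NodeName=bezout[1-6] CPUs=32 RealMemory=192000 Gres=gpu:4
PartitionName=bezout Nodes=bezout[1-6] Default=YES State=UP
```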
It is important to note one usage scenario where this is relevant: submitting many small jobs that each need access to a GPU. The jobs may fail if two jobs try to use the same GPU device at once, so this config allows each job to request one GPU, and Slurm will make sure that each job gets a dedicated GPU device (which may or may not be on different nodes). If it were only possible to guarantee exclusive access at the whole-node level, then each of these jobs would have to reserve a whole node. Not only is this wasteful of the unused resources on the node, but it would also cause these jobs to sometimes take much longer to schedule, because they must wait until there are no other jobs running on the selected nodes.
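For instance (a sketch only; the GRES name `gpu`, the CPU/memory numbers, and the `run_one_test` script are placeholders, and the exact flags depend on the local configuration), the batch of small GPU tests could be submitted like this:

```bash
# Each job asks for exactly one GPU plus a small slice of CPU and memory;
# Slurm places it on whichever node has a free GPU, and cgroups keep the
# job confined to the resources it was given.
for i in $(seq 1 100); do
  sbatch --gres=gpu:1 --ntasks=1 --cpus-per-task=4 --mem=8G \
         --wrap="./run_one_test $i"
done
```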
My question is: with allocation of sub-node resources on, is it still possible to distinguish between shared and exclusive jobs, or not? E.g., Mode 1 in Aurelien's post above. If I request all the GPUs on a node, but I'm just doing debugging/testing and could share with other users who also just want to do debugging/testing, is that possible? Or does telling the resource manager that I want all the GPUs automatically give them to me exclusively? Also, what about wanting to have the full CPU threading capability available to my job? I've noticed on guyot that, with the resource manager on, I have to specify a number of CPUs (--cpus-per-task) that will let me use the CPU threading capabilities (vs. just specifying the node). Finally, related to this need to specify CPU resources as well (unless I am doing it wrong?), it would be nice to provide some easy way to know how to specify these for each node without having to remember the number of cores on each node. It was, of course, much simpler when we could just specify the node and have automatic access to all the cores.
There are only two mutually exclusive operating modes that Slurm can use to control "sharing" of resources: whole-node control and sub-node control. Slurm can only run under one or the other. Under whole-node control, if you need exclusive access to anything on the node, you must allocate the whole node exclusively. Under sub-node allocation, you can exclusively allocate CPUs, memory, or devices while allowing others to use the remaining sub-node resources on the node. Essentially, under sub-node allocation, all sub-node resources are allocated exclusively, so if you request all CPUs on a node you ensure that no one else is using the node. If users are conservative with their requests, sub-node resource allocation will work well to share nodes between users most of the time.
This would cause the GPUs to be unavailable to other users, but other users could still run on the CPUs that are not allocated to your job. I would encourage people not to request more resources than they really need if they are doing debugging/testing. In this case, if you really need all the GPUs, it is better to share the resources across time rather than simultaneously (i.e., run "srun --gres nvidia test_application" for each short-term test, thus allowing other users to also run short-term tests).
Yes, generally each resource that you request is allocated to you exclusively.
You should request all the CPU capability that you need. The system is set up to use taskset or cgroups to allocate the requested CPU cores to you, and you can decide how you want to distribute your processes/threads across these cores. This extends up to requesting all CPU cores on the node. It is possible to set up a "shared" queue and an "exclusive" queue which would allow the sharing of CPU cores, but this wouldn't extend to the sharing of GPUs.
You would request the whole node's worth of CPU cores by using "salloc -N 1".
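To illustrate both answers (the core count, environment variable choice, and application name below are placeholders, and the exact behavior depends on the local configuration):

```bash
# Request only the cores you need; taskset/cgroups confine the job to them,
# and you decide how your threads are laid out inside that allocation.
srun --ntasks=1 --cpus-per-task=16 --export=ALL,OMP_NUM_THREADS=16 ./my_threaded_app

# Request a whole node's worth of cores, as described above, then run
# inside that allocation:
salloc -N 1
srun ./my_threaded_app
```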
I envision 3 operation modes.

In italic are 'nice to have' features that may or may not be difficult to achieve.

Mode 1: shared usage for debugging (default mode for human users)

- `srun` or `salloc` (without any other qualifier), `-wleconte`, `-N6 -pbezout`
- if user A calls `srun -N3 -pbezout` and user B calls `srun -N3 -pbezout`, the workload should spread on all 6 Bezout nodes before we reuse nodes
- Not needed: fine grain allocation of resources
- Difficulty: a number of "access tokens" may still be required for load balancing purposes (3), and using core allocation is a poor substitute, because it has an effect on `cgroups` and other actual access policy to the hardware resources within the allocation.

Mode 2: exclusive usage for production runs (human user requested)

- (`srun --exclusive` may or may not do what we want based on requirements for mode 3: backfill), so maybe `srun --reservation=exclusive`, `srun --reservation=exclusive-nightly`, etc.
- `srun` and `ssh` cannot login while the job is active (`slurm_pam` should do that out-of-the-box; see the PAM sketch after this list)
- exclusive nightly mode may terminate existing `srun` and `ssh` accesses (using the `slurm_pam` module should be able to do both prevention and termination for ssh access, but ssh termination may require some customization).
- Not needed (actually problematic): fine grain allocation of resources; I want a guarantee that I have a full node and no other stuff is running at the same time (including Jenkins, GH actions, ...)
- Difficulty: if we have the fine grain allocation scheduler active, we can simply reserve all resources, but users may still want to execute multiple `srun` inside a given `salloc`/`sbatch` and spread subjobs however they want; I think that should work out-of-the-box but needs to be verified.

Mode 3: GH actions/backfill

- Difficulty: using fine-grain allocation in one mode forces us to use the fine-grain scheduler in all modes, which we don't care much about and may complicate how we allocate shared jobs.
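Regarding the `slurm_pam` point under Mode 2: a minimal sketch of the commonly used approach, assuming the `pam_slurm_adopt` module that ships with Slurm's contribs (exact file paths and options vary by distribution):

```
# /etc/pam.d/sshd (fragment): deny ssh logins to users who have no running
# job on the node, and adopt allowed ssh sessions into the job's cgroup so
# they are cleaned up together with the job.
account    required     pam_slurm_adopt.so

# slurm.conf must also enable job containment for adoption to work:
#   PrologFlags=Contain
```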