Discussion about Slurm queue policies #4

Open
abouteiller opened this issue Apr 12, 2024 · 5 comments

abouteiller commented Apr 12, 2024

I envision 3 operation modes

Some of the items below are 'nice to have' features that may or may not be difficult to achieve.

Mode 1: shared usage for debugging (default mode for human users)

  1. Resources are allocated in non-exclusive mode by default (that is, when running srun or salloc without any other qualifier).
  2. Multiple users can coexist on the same node at the same time, especially if they requested the same resource explicitly (e.g., -w leconte, or -N 6 -p bezout).
  3. Prefer not sharing when possible: if user A calls srun -N 3 -p bezout and user B calls srun -N 3 -p bezout, the workload should spread across all 6 Bezout nodes before any node is reused (example commands are sketched after this section).

Not needed: fine-grained allocation of resources.

Difficulty: some number of "access tokens" may still be required for load-balancing purposes (item 3), and using core allocation as a substitute is a poor fit, because it affects cgroups and the actual access policy to the hardware resources within the allocation.
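As a rough sketch of the intended behavior, a few example command lines (the commands and placement behavior are illustrative assumptions, not an agreed configuration; my_debug_app is a placeholder):

# Shared/debug usage: no exclusivity requested, so jobs may land on the same nodes
srun -N 3 -p bezout ./my_debug_app     # user A
srun -N 3 -p bezout ./my_debug_app     # user B, ideally placed on the other 3 bezout nodes first
srun -w leconte ./my_debug_app         # pin to a specific node, still shared with others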

Mode 2: exclusive usage for production runs (requested by a human user)

  1. Resources are allocated in exclusive mode if the user so specifies. How that gets specified is not completely clear yet: srun --exclusive may or may not do what we want, given the requirements of Mode 3 (backfill), so maybe srun --reservation=exclusive, srun --reservation=exclusive-nightly, etc. (a rough partition-based alternative is sketched after this section).
  2. A single user can use the resource; that is, other sharing-mode srun invocations and ssh logins cannot get in while the job is active (slurm_pam should do that out of the box).
  3. Exclusive jobs during the day have a short time limit (e.g., 1 hour) to prevent resource hoarding; exclusive-nightly jobs have a longer limit (e.g., until 7 am the next business day).
  4. The exclusive-nightly mode may terminate existing srun and ssh accesses (the slurm_pam module should be able to do both prevention and termination for ssh access, but ssh termination may require some customization).
  5. Exclusive-nightly jobs are uninterruptible until 7 am the next business day, but may overstay until a competing shared or exclusive job that would use these resources is actually submitted to the queue.

Not needed (actually problematic): fine-grained allocation of resources. I want a guarantee that I have a full node with nothing else running at the same time (including Jenkins, GH Actions, ...).

Difficulty: if we have the fine-grained allocation scheduler active, we can simply reserve all resources, but users may still want to execute multiple srun invocations inside a given salloc/sbatch and spread the sub-jobs however they want. I think that should work out of the box, but it needs to be verified.
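One possible way to express the two exclusive flavors, sketched as slurm.conf partitions (the partition names, node list, and time limits are assumptions for illustration only; a reservation-based approach as suggested above would look different):

# Daytime exclusive: whole-node jobs with a short limit to prevent hoarding
PartitionName=exclusive         Nodes=ALL OverSubscribe=EXCLUSIVE MaxTime=0-1  State=UP
# Nightly exclusive: longer limit, intended to end by the next business morning
PartitionName=exclusive-nightly Nodes=ALL OverSubscribe=EXCLUSIVE MaxTime=0-12 State=UP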

Mode 3: GH Actions/backfill

  1. GitHub Actions, Jenkins, and other automations use a backfill scheduler.
  2. Backfill jobs can be interrupted by the arrival of user-created jobs, and that does not cause the CI pipeline to report an error; the pipeline is just rescheduled for later (not sure how difficult that is to actually do; a rough preemption sketch follows after this section).
  3. Backfill uses the fine-grained allocation policy (so that we can run more actions at the same time; for example, if we know each needs only 1 GPU and we have 8, we may run 8 ctest jobs simultaneously).

Difficulty: using fine-grained allocation in one mode forces us to use the fine-grained scheduler in all modes, which we don't otherwise care much about and which may complicate how we allocate shared jobs.
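A hedged sketch of how preemptable backfill could be expressed in slurm.conf (the partition names, priorities, and the choice of partition-priority preemption are assumptions, not a tested setup; requeueing applies to batch jobs that are submitted as requeueable):

# Jobs in higher-priority partitions may preempt backfill jobs; REQUEUE puts the
# preempted CI job back in the queue instead of failing it
PreemptType=preempt/partition_prio
PreemptMode=REQUEUE
# Low-priority partition for GitHub Actions / Jenkins runners
PartitionName=backfill Nodes=ALL PriorityTier=1  PreemptMode=REQUEUE State=UP
# Human users submit to higher-priority partitions and can displace backfill jobs
PartitionName=shared   Nodes=ALL PriorityTier=10 Default=YES State=UP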

mgates3 commented Apr 12, 2024

I mostly agree with Aurelien's description. Additionally, keeping Slurm would be highly desirable, as people are already familiar with it and it is used on Frontier, Perlmutter, and other places.

For mode 3: GitHub Actions/backfill, is the suggestion that GitHub actions would never run (or at least never start) on a node when users are logged into that node (via either mode 1 or 2)? In some ways this is nice, but if nodes were very busy with users, it could mean that GitHub Actions face starvation. We could see how it worked and adjust if there were issues. I have been blocked from merging PRs in the past because someone was using 100% of the GPU memory overnight, so checks could not run (or actually failed) — but that was a rare occurrence that was resolved by email.

SLATE is moving towards using 4 GPUs in its CI testing, so that needs to be feasible.

G-Ragghianti commented Jul 29, 2024

I have tried to process both this discussion and the descriptions of usage patterns that were sent to me by Piotr, Natalie, and Ahmad. There are some corner cases and desired usage patterns that are not possible to implement in a single queue system, either due to limitations in the queue implementation (Slurm) or limitations in the scope of use that job queue systems in general are meant to address. We must decide which usage patterns to support and which need to be changed to adapt to the queue policy that we can implement.

As a first step, I would like to propose a minimal configuration that we can use and refine over time. Given this starting point, I should be able to say if/how a particular usage pattern would be accomplished.

Here is the config:

# slurm.conf relevant configuration lines
# Enable the tracking/control of access to GPU devices
GresTypes=gpu
# Enable tracking and control of user processes
ProctrackType=proctrack/cgroup
# Enable binding of processes to CPU cores
TaskPlugin=task/cgroup,task/affinity
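# Use the backfill scheduler so queued jobs can start out of order when resources allow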
SchedulerType=sched/backfill
# Provides ability to track/control access to CPU, memory, and device files (for GPUs)
SelectType=select/cons_tres

# This partition controls/allocates CPU cores to each job 
# and has a default runtime limit of 1 hour and maximum runtime limit of 12 hours
PartitionName=shared    Nodes=ALL Default=YES SelectTypeParameters=CR_Core DefaultTime=0-1 MaxTime=0-12 State=UP


# cgroup.conf
# Control access to CPU cores and GPU devices but not memory
ConstrainCores=yes
ConstrainRAMSpace=no
ConstrainSwapSpace=no
ConstrainDevices=yes


# Example gres.conf (created at startup of each node)
# This node has one GPU (nvidia) that is allocatable by users
Name=gpu Type=nvidia File=/dev/nvidia0 Flags=nvidia_gpu_env
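For illustration, a couple of request forms this configuration should support (the binary name is a placeholder):

# Request one GPU (of the nvidia type defined in gres.conf) and four cores for a test
srun --gres=gpu:1 --cpus-per-task=4 ./test_app
# Equivalent request naming the GPU type explicitly
srun --gres=gpu:nvidia:1 ./test_app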

@nbeams @abdelfattah83 @luszczek

@G-Ragghianti

It is important to note that SelectType=select/cons_tres turns on the tracking of sub-node resources. The decision whether to track sub-node resources must be made for the whole Slurm-controlled environment (cluster): you cannot have some nodes/partitions with this on and some without it. Without control of sub-node resources, the only way to grant exclusive access is at the whole-node level, which would be very wasteful if only a small part of the node is required.

One usage scenario where this is relevant is submitting many small jobs that each need access to a GPU. The jobs may fail if two of them try to use the same GPU device at once, so this config allows each job to request one GPU, and Slurm will make sure that each job gets a dedicated GPU device (which may or may not be on different nodes). If it were only possible to guarantee exclusive access at the whole-node level, each of these jobs would have to reserve a whole node. Not only is this wasteful of the unused resources on the node, but it would also cause these jobs to sometimes take much longer to schedule, because they must wait until there are no other jobs running on the selected nodes.
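As a concrete (hypothetical) illustration of that scenario, assuming a test script run_one_test.sh that uses exactly one GPU:

# Submit eight independent single-GPU test jobs; Slurm places each one on whatever
# node currently has a free GPU, instead of serializing them on whole-node locks
for i in $(seq 1 8); do
    sbatch --ntasks=1 --gres=gpu:1 ./run_one_test.sh "$i"
done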

nbeams commented Jul 30, 2024

My question is: with allocation of sub-node resources on, is it still possible to distinguish between shared and exclusive jobs, or not? E.g., Mode 1 in Aurelien's post above. If I request all the GPUs on a node, but I'm just doing debugging/testing and could share with other users just wanting to do debugging/testing, is that possible? Or does specifying that I want all the GPUs to the resource manager automatically give them to me exclusively?

Also, what about wanting to have the full CPU threading capability available to my job? I've noticed on guyot that with the resource manager on, I have to specify a number of CPUs (--cpus-per-task) that will let me use the CPU threading capabilities (vs just -w guyot in the past, which would automatically let a job use all the cores/threading on the node). Does such a request also give me exclusive access to the required number of CPU cores -- perhaps effectively blocking anyone else from using the node at all, even if that wasn't my intention -- or does it work differently than specifying the GPU resources?

Finally, related to this need to specify CPU resources as well (unless I am doing it wrong?), it would be nice to provide some easy way to know how to specify these for each node without having to remember the number of cores on each node. It was, of course, much simpler when we could just specify the node and have automatic access to all the cores.

@G-Ragghianti

> My question is: with allocation of sub-node resources on, is it still possible to distinguish between shared and exclusive jobs, or not?

There are only two mutually exclusive operating modes that Slurm can use to control the sharing of resources: whole-node control and sub-node control. Slurm can only run under one mode or the other. Under whole-node control, if you need exclusive access to anything on the node, you must allocate the whole node exclusively. Under sub-node allocation, you can exclusively allocate CPU, memory, or devices while allowing others to use the remaining sub-node resources on the node. Essentially, under sub-node allocation, all sub-node resources are allocated exclusively, and if you request all CPUs on a node you ensure that no one else is using the node. If users are conservative with their requests, sub-node resource allocation will work well to share nodes between users most of the time.
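A small hedged illustration of what that means in practice under sub-node allocation (the core count and binary name are placeholders):

srun --gres=gpu:1 --cpus-per-task=4 ./test_app    # takes 1 GPU and 4 cores; the rest of the node stays shareable
srun --exclusive -N 1 ./test_app                  # claims the node outright; no other job can share it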

> E.g., Mode 1 in Aurelien's post above. If I request all the GPUs on a node, but I'm just doing debugging/testing and could share with other users just wanting to do debugging/testing, is that possible?

This would cause the GPUs to be unavailable to other users, but other users could still run on the CPUs that are not allocated to your job. I would encourage people not to request more resources than they really need when they are doing debugging/testing. In this case, if you really need all GPUs, it is better to share the resources across time rather than simultaneously (i.e., run a short "srun --gres=gpu:nvidia:1 test_application" for each short-term test, thus allowing other users to also run short-term tests).

> Or does specifying that I want all the GPUs to the resource manager automatically give them to me exclusively?

Yes, generally each resource that you request is allocated to you exclusively.

> Also, what about wanting to have the full CPU threading capability available to my job? I've noticed on guyot that with the resource manager on, I have to specify a number of CPUs (--cpus-per-task) that will let me use the CPU threading capabilities (vs just -w guyot in the past, which would automatically let a job use all the cores/threading on the node). Does such a request also give me exclusive access to the required number of CPU cores -- perhaps effectively blocking anyone else from using the node at all, even if that wasn't my intention -- or does it work differently than specifying the GPU resources?

You should request all the CPU capability that you need. The system is set up to use taskset or cgroups to allocate the requested CPU cores to you. You can decide how you want to distribute your processes/threads across these cores. This extends up to requesting all CPU cores on the node. It is possible to set up a "shared" queue and an "exclusive" queue which would allow the sharing of CPU cores, but this wouldn't extend to the sharing of GPUs.
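For example (a hedged sketch; the core count and OpenMP usage are illustrative assumptions, and the binary name is a placeholder):

# Ask for 8 cores for a single task and spread OpenMP threads across them;
# Slurm/cgroups confine the job to exactly the cores it was granted
OMP_NUM_THREADS=8 srun --ntasks=1 --cpus-per-task=8 ./my_threaded_app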

> Finally, related to this need to specify CPU resources as well (unless I am doing it wrong?), it would be nice to provide some easy way to know how to specify these for each node without having to remember the number of cores on each node. It was, of course, much simpler when we could just specify the node and have automatic access to all the cores.

You would request a whole node's worth of CPU cores by using "salloc -N 1".
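A couple of hedged request forms for reference (the node name and core count are placeholders; whether a bare salloc -N 1 hands over every core under the cons_tres setup, or needs --exclusive or an explicit core count, is worth verifying on the actual install):

# One full node, not shared with any other job
salloc -N 1 --exclusive
# A specific node with an explicit core count for one multithreaded task
salloc -w guyot --ntasks=1 --cpus-per-task=16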
