Discussion about Slurm queue policies #4

Open
abouteiller opened this issue Apr 12, 2024 · 5 comments

abouteiller commented Apr 12, 2024

I envision 3 operation modes

Some of the items below are 'nice to have' features that may or may not be difficult to achieve.

Mode 1: shared usage for debugging (default mode for human users)

  1. Resources are allocated in non-exclusive mode by default (that is, when running srun or salloc without any other qualifier).
  2. Multiple users can coexist on the same node at the same time, especially if they requested the same resource explicitly (e.g., -w leconte, or -N 6 -p bezout).
  3. Prefer not sharing when possible: if user A calls srun -N 3 -p bezout and user B calls srun -N 3 -p bezout, the workload should spread across all 6 Bezout nodes before any node is reused (example commands are sketched after this section).

Not needed: fine-grained allocation of resources.

Difficulty: some number of "access tokens" may still be required for load-balancing purposes (item 3), and using core allocation as a substitute is a poor fit, because it affects cgroups and the actual access policy to the hardware resources within the allocation.
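As a rough sketch of the intended behavior, a few example command lines (the commands and placement behavior are illustrative assumptions, not an agreed configuration; my_debug_app is a placeholder):

# Shared/debug usage: no exclusivity requested, so jobs may land on the same nodes
srun -N 3 -p bezout ./my_debug_app     # user A
srun -N 3 -p bezout ./my_debug_app     # user B, ideally placed on the other 3 bezout nodes first
srun -w leconte ./my_debug_app         # pin to a specific node, still shared with others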

Mode 2: exclusive usage for production runs (requested by a human user)

  1. Resources are allocated in exclusive mode if the user so specifies. How that gets specified is not completely clear yet: srun --exclusive may or may not do what we want, given the requirements of Mode 3 (backfill), so maybe srun --reservation=exclusive, srun --reservation=exclusive-nightly, etc. (a rough partition-based alternative is sketched after this section).
  2. A single user can use the resource; that is, other sharing-mode srun invocations and ssh logins cannot get in while the job is active (slurm_pam should do that out of the box).
  3. Exclusive jobs during the day have a short time limit (e.g., 1 hour) to prevent resource hoarding; exclusive-nightly jobs have a longer limit (e.g., until 7 am the next business day).
  4. The exclusive-nightly mode may terminate existing srun and ssh accesses (the slurm_pam module should be able to do both prevention and termination for ssh access, but ssh termination may require some customization).
  5. Exclusive-nightly jobs are uninterruptible until 7 am the next business day, but may overstay until a competing shared or exclusive job that would use these resources is actually submitted to the queue.

Not needed (actually problematic): fine-grained allocation of resources. I want a guarantee that I have a full node with nothing else running at the same time (including Jenkins, GH Actions, ...).

Difficulty: if we have the fine-grained allocation scheduler active, we can simply reserve all resources, but users may still want to execute multiple srun invocations inside a given salloc/sbatch and spread the sub-jobs however they want. I think that should work out of the box, but it needs to be verified.
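One possible way to express the two exclusive flavors, sketched as slurm.conf partitions (the partition names, node list, and time limits are assumptions for illustration only; a reservation-based approach as suggested above would look different):

# Daytime exclusive: whole-node jobs with a short limit to prevent hoarding
PartitionName=exclusive         Nodes=ALL OverSubscribe=EXCLUSIVE MaxTime=0-1  State=UP
# Nightly exclusive: longer limit, intended to end by the next business morning
PartitionName=exclusive-nightly Nodes=ALL OverSubscribe=EXCLUSIVE MaxTime=0-12 State=UP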

Mode 3: GH Actions/backfill

  1. GitHub Actions, Jenkins, and other automations use a backfill scheduler.
  2. Backfill jobs can be interrupted by the arrival of user-created jobs, and that does not cause the CI pipeline to report an error; the pipeline is just rescheduled for later (not sure how difficult that is to actually do; a rough preemption sketch follows after this section).
  3. Backfill uses the fine-grained allocation policy (so that we can run more actions at the same time; for example, if we know each needs only 1 GPU and we have 8, we may run 8 ctest jobs simultaneously).

Difficulty: using fine-grained allocation in one mode forces us to use the fine-grained scheduler in all modes, which we don't otherwise care much about and which may complicate how we allocate shared jobs.
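A hedged sketch of how preemptable backfill could be expressed in slurm.conf (the partition names, priorities, and the choice of partition-priority preemption are assumptions, not a tested setup; requeueing applies to batch jobs that are submitted as requeueable):

# Jobs in higher-priority partitions may preempt backfill jobs; REQUEUE puts the
# preempted CI job back in the queue instead of failing it
PreemptType=preempt/partition_prio
PreemptMode=REQUEUE
# Low-priority partition for GitHub Actions / Jenkins runners
PartitionName=backfill Nodes=ALL PriorityTier=1  PreemptMode=REQUEUE State=UP
# Human users submit to higher-priority partitions and can displace backfill jobs
PartitionName=shared   Nodes=ALL PriorityTier=10 Default=YES State=UP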

mgates3 commented Apr 12, 2024

I mostly agree with Aurelien's description. Additionally, keeping Slurm would be highly desirable, as people are already familiar with it and it is used on Frontier, Perlmutter, and other places.

For mode 3: GitHub Actions/backfill, is the suggestion that GitHub actions would never run (or at least never start) on a node when users are logged into that node (via either mode 1 or 2)? In some ways this is nice, but if nodes were very busy with users, it could mean that GitHub Actions face starvation. We could see how it worked and adjust if there were issues. I have been blocked from merging PRs in the past because someone was using 100% of the GPU memory overnight, so checks could not run (or actually failed) — but that was a rare occurrence that was resolved by email.

SLATE is moving towards using 4 GPUs in its CI testing, so that needs to be feasible.

G-Ragghianti commented Jul 29, 2024

I have tried to process both this discussion and the descriptions of usage patterns that were sent to me by Piotr, Natalie, and Ahmad. There are some corner cases and desired usage patterns that are not possible to implement in a single queue system, either due to limitations in the queue implementation (Slurm) or limitations in the scope of use that job queue systems in general are meant to address. We must decide which usage patterns to support and which need to be changed to adapt to the queue policy that we can implement.

As a first step, I would like to propose a minimal configuration that we can use and refine over time. Given this starting point, I should be able to say if/how a particular usage pattern would be accomplished.

Here is the config:

# slurm.conf relevant configuration lines
# Enable the tracking/control of access to GPU devices
GresTypes=gpu
# Enable tracking and control of user processes
ProctrackType=proctrack/cgroup
# Enable binding of processes to CPU cores
TaskPlugin=task/cgroup,task/affinity
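# Use the backfill scheduler so queued jobs can start out of order when resources allow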
SchedulerType=sched/backfill
# Provides ability to track/control access to CPU, memory, and device files (for GPUs)
SelectType=select/cons_tres

# This partition controls/allocates CPU cores to each job 
# and has a default runtime limit of 1 hour and maximum runtime limit of 12 hours
PartitionName=shared    Nodes=ALL Default=YES SelectTypeParameters=CR_Core DefaultTime=0-1 MaxTime=0-12 State=UP


# cgroup.conf
# Control access to CPU cores and GPU devices but not memory
ConstrainCores=yes
ConstrainRAMSpace=no
ConstrainSwapSpace=no
ConstrainDevices=yes


# Example gres.conf (created at startup of each node)
# This node has one GPU (nvidia) that is allocatable by users
Name=gpu Type=nvidia File=/dev/nvidia0 Flags=nvidia_gpu_env
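For illustration, a couple of request forms this configuration should support (the binary name is a placeholder):

# Request one GPU (of the nvidia type defined in gres.conf) and four cores for a test
srun --gres=gpu:1 --cpus-per-task=4 ./test_app
# Equivalent request naming the GPU type explicitly
srun --gres=gpu:nvidia:1 ./test_app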

@nbeams @abdelfattah83 @luszczek

@G-Ragghianti

It is important to note that SelectType=select/cons_tres turns on the tracking of sub-node resources. The decision whether to track sub-node resources must be made for the whole Slurm-controlled environment (cluster): you cannot have some nodes/partitions with this on and some without it. Without control of sub-node resources, the only way to grant exclusive access is at the whole-node level, which would be very wasteful if only a small part of the node is required.

One usage scenario where this is relevant is submitting many small jobs that each need access to a GPU. The jobs may fail if two of them try to use the same GPU device at once, so this config allows each job to request one GPU, and Slurm will make sure that each job gets a dedicated GPU device (which may or may not be on different nodes). If it were only possible to guarantee exclusive access at the whole-node level, each of these jobs would have to reserve a whole node. Not only is this wasteful of the unused resources on the node, but it would also cause these jobs to sometimes take much longer to schedule, because they must wait until there are no other jobs running on the selected nodes.
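As a concrete (hypothetical) illustration of that scenario, assuming a test script run_one_test.sh that uses exactly one GPU:

# Submit eight independent single-GPU test jobs; Slurm places each one on whatever
# node currently has a free GPU, instead of serializing them on whole-node locks
for i in $(seq 1 8); do
    sbatch --ntasks=1 --gres=gpu:1 ./run_one_test.sh "$i"
done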

nbeams commented Jul 30, 2024

My question is: with allocation of sub-node resources on, is it still possible to distinguish between shared and exclusive jobs, or not? E.g., Mode 1 in Aurelien's post above. If I request all the GPUs on a node, but I'm just doing debugging/testing and could share with other users just wanting to do debugging/testing, is that possible? Or does specifying that I want all the GPUs to the resource manager automatically give them to me exclusively?

Also, what about wanting to have the full CPU threading capability available to my job? I've noticed on guyot that with the resource manager on, I have to specify a number of CPUs (--cpus-per-task) that will let me use the CPU threading capabilities (vs just -w guyot in the past, which would automatically let a job use all the cores/threading on the node). Does such a request also give me exclusive access to the required number of CPU cores -- perhaps effectively blocking anyone else from using the node at all, even if that wasn't my intention -- or does it work differently than specifying the GPU resources?

Finally, related to this need to specify CPU resources as well (unless I am doing it wrong?), it would be nice to provide some easy way to know how to specify these for each node without having to remember the number of cores on each node. It was, of course, much simpler when we could just specify the node and have automatic access to all the cores.

@G-Ragghianti

> My question is: with allocation of sub-node resources on, is it still possible to distinguish between shared and exclusive jobs, or not?

There are only two mutually exclusive operating modes that Slurm can use to control the sharing of resources: whole-node control and sub-node control. Slurm can only run under one mode or the other. Under whole-node control, if you need exclusive access to anything on the node, you must allocate the whole node exclusively. Under sub-node allocation, you can exclusively allocate CPU, memory, or devices while allowing others to use the remaining sub-node resources on the node. Essentially, under sub-node allocation, all sub-node resources are allocated exclusively, and if you request all CPUs on a node you ensure that no one else is using the node. If users are conservative with their requests, sub-node resource allocation will work well to share nodes between users most of the time.
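A small hedged illustration of what that means in practice under sub-node allocation (the core count and binary name are placeholders):

srun --gres=gpu:1 --cpus-per-task=4 ./test_app    # takes 1 GPU and 4 cores; the rest of the node stays shareable
srun --exclusive -N 1 ./test_app                  # claims the node outright; no other job can share it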

> E.g., Mode 1 in Aurelien's post above. If I request all the GPUs on a node, but I'm just doing debugging/testing and could share with other users just wanting to do debugging/testing, is that possible?

This would cause the GPUs to be unavailable to other users, but other users could still run on the CPUs that are not allocated to your job. I would encourage people not to request more resources than they really need when they are doing debugging/testing. In this case, if you really need all GPUs, it is better to share the resources across time rather than simultaneously (i.e., run a short "srun --gres=gpu:nvidia:1 test_application" for each short-term test, thus allowing other users to also run short-term tests).

> Or does specifying that I want all the GPUs to the resource manager automatically give them to me exclusively?

Yes, generally each resource that you request is allocated to you exclusively.

> Also, what about wanting to have the full CPU threading capability available to my job? I've noticed on guyot that with the resource manager on, I have to specify a number of CPUs (--cpus-per-task) that will let me use the CPU threading capabilities (vs just -w guyot in the past, which would automatically let a job use all the cores/threading on the node). Does such a request also give me exclusive access to the required number of CPU cores -- perhaps effectively blocking anyone else from using the node at all, even if that wasn't my intention -- or does it work differently than specifying the GPU resources?

You should request all the CPU capability that you need. The system is set up to use taskset or cgroups to allocate the requested CPU cores to you. You can decide how you want to distribute your processes/threads across these cores. This extends up to requesting all CPU cores on the node. It is possible to set up a "shared" queue and an "exclusive" queue which would allow the sharing of CPU cores, but this wouldn't extend to the sharing of GPUs.
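For example (a hedged sketch; the core count and OpenMP usage are illustrative assumptions, and the binary name is a placeholder):

# Ask for 8 cores for a single task and spread OpenMP threads across them;
# Slurm/cgroups confine the job to exactly the cores it was granted
OMP_NUM_THREADS=8 srun --ntasks=1 --cpus-per-task=8 ./my_threaded_app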

> Finally, related to this need to specify CPU resources as well (unless I am doing it wrong?), it would be nice to provide some easy way to know how to specify these for each node without having to remember the number of cores on each node. It was, of course, much simpler when we could just specify the node and have automatic access to all the cores.

You would request a whole node's worth of CPU cores by using "salloc -N 1".
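A couple of hedged request forms for reference (the node name and core count are placeholders; whether a bare salloc -N 1 hands over every core under the cons_tres setup, or needs --exclusive or an explicit core count, is worth verifying on the actual install):

# One full node, not shared with any other job
salloc -N 1 --exclusive
# A specific node with an explicit core count for one multithreaded task
salloc -w guyot --ntasks=1 --cpus-per-task=16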
