dws: crudely enforce storage count constraints
Problem: as described in issue #171, creating many MDTs is bad
for performance, and usually goes against what is explicitly
required by directivebreakdown resources. However, there is not
yet a good way to get Fluxion to handle MDT allocation.

Bypass Fluxion allocation completely, and tell DWS to create
exactly the number of allocations requested in the
.constraints.count field (which is usually found on MDTs).

Place the allocations on the rabbits which have the most compute
nodes allocated to the job.
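
As a rough illustration of that placement order, here is a minimal
sketch using hypothetical rabbit names and per-rabbit compute-node
counts (none of these values come from a real system):

    import collections

    # Hypothetical inputs: rabbit name -> number of the job's compute
    # nodes attached to that rabbit, plus the required allocation count.
    nodes_per_nnf = {"rabbit-a": 4, "rabbit-b": 2, "rabbit-c": 1}
    count = 5  # .constraints.count from the directivebreakdown

    placements = []
    while count > 0:
        # Visit rabbits in descending order of compute-node count; if
        # count exceeds the number of rabbits, wrap around and place
        # extra allocations on the busiest rabbits again.
        for name, _ in collections.Counter(nodes_per_nnf).most_common(count):
            placements.append(name)
            count -= 1
            if count == 0:
                break

    print(placements)
    # ['rabbit-a', 'rabbit-b', 'rabbit-c', 'rabbit-a', 'rabbit-b']

Each name in the list corresponds to one storage entry with an
allocationCount of 1 in the resulting allocation set.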

This is intended only as a temporary solution, since it
introduces a new potential problem: some rabbit storage is
consumed without being tracked by Fluxion. This could lead to
overallocation of resources, causing jobs to fail. However,
this seems unlikely to occur in practice, since MDTs are small
and Fluxion always gives jobs more storage than they asked for,
so there should usually be some spare storage.
jameshcorbett committed Jul 15, 2024
1 parent adb4408 commit e80fe39
Showing 1 changed file with 39 additions and 12 deletions.
51 changes: 39 additions & 12 deletions src/python/flux_k8s/directivebreakdown.py
@@ -4,6 +4,7 @@
 import copy
 import functools
 import math
+import collections
 
 from flux_k8s.crd import DIRECTIVEBREAKDOWN_CRD
 
@@ -41,19 +42,45 @@ def build_allocation_sets(breakdown_alloc_sets, nodes_per_nnf, hlist, min_alloc_
                     }
                 )
         elif alloc_set["allocationStrategy"] == AllocationStrategy.ACROSS_SERVERS.value:
-            nodecount_gcd = functools.reduce(math.gcd, nodes_per_nnf.values())
-            server_alloc_set["allocationSize"] = math.ceil(
-                nodecount_gcd * alloc_set["minimumCapacity"] / len(hlist)
-            )
-            # split lustre across every rabbit, weighting the split based on
-            # the number of the job's nodes associated with each rabbit
-            for rabbit_name in nodes_per_nnf:
-                storage_field.append(
-                    {
-                        "allocationCount": nodes_per_nnf[rabbit_name] / nodecount_gcd,
-                        "name": rabbit_name,
-                    }
if "count" in alloc_set.get("constraints", {}):
# a specific number of allocations is required (generally for MDTs)
count = alloc_set["constraints"]["count"]
server_alloc_set["allocationSize"] = math.ceil(
alloc_set["minimumCapacity"] / count
)
# place the allocations on the rabbits with the most nodes allocated
# to this job (and therefore the largest storage allocations)
while count > 0:
# count may be greater than the rabbits available, so we may need
# to place multiple on a single rabbit (hence the outer while-loop)
for name, _ in collections.Counter(nodes_per_nnf).most_common(
count
):
storage_field.append(
{
"allocationCount": 1,
"name": name,
}
)
count -= 1
if count == 0:
break
else:
nodecount_gcd = functools.reduce(math.gcd, nodes_per_nnf.values())
server_alloc_set["allocationSize"] = math.ceil(
nodecount_gcd * alloc_set["minimumCapacity"] / len(hlist)
)
# split lustre across every rabbit, weighting the split based on
# the number of the job's nodes associated with each rabbit
for rabbit_name in nodes_per_nnf:
storage_field.append(
{
"allocationCount": int(
nodes_per_nnf[rabbit_name] / nodecount_gcd
),
"name": rabbit_name,
}
)
             # enforce the minimum allocation size
             server_alloc_set["allocationSize"] = max(
                 server_alloc_set["allocationSize"], min_alloc_size * 1024**3
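
For the new count-constrained branch, the per-allocation size works out
to ceil(minimumCapacity / count), clamped by the minimum allocation size
enforced just above. A hedged worked example, with made-up capacity and
minimum values:

    import math

    minimum_capacity = 10 * 1024**3  # hypothetical 10 GiB for the MDT allocation set
    count = 3                        # hypothetical .constraints.count
    min_alloc_size = 5               # hypothetical minimum size, in GiB

    allocation_size = math.ceil(minimum_capacity / count)             # ~3.34 GiB
    allocation_size = max(allocation_size, min_alloc_size * 1024**3)  # floor wins: 5 GiB

    print(allocation_size == 5 * 1024**3)  # True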
