Post refactor changes needed #71
This set of changes includes the following:

1. Renaming short variable names to be longer and more understandable.
2. Not using `Status.ScheduleStartTime` for the pod start time, and instead adding a new field; the previous field was there for a different purpose.
3. Creating named identifiers for resource types that can be shared in the jgf module along with others that use the same relations / vertex types.
4. Removing comments that are not necessary.
5. Changing JGF types from int to int64 where warranted.
6. Fixing spelling mistakes, etc.
7. Removing the need to write the jobspec to a temporary file (we just need the string).

The JGF and utils modules need some additional looking over. Specifically, I am worried that paths->containment is not set, and that sometimes the name reflects the index of the overall graph (global) while other times it reflects the index of the resource type. I think we likely want the latter for the inner name, but I am not sure fluxion is actually using it internally. I am pushing these changes to assess testing, etc., and will update the PR as needed. There could also have been changes upstream since the PR was opened that warrant additional fixes.

Signed-off-by: vsoch <[email protected]>
I agree with these items. Here's more detail on the containment / JGF format issues from PR 69: #69 (comment)
Adding a note for myself: see `flux-k8s/src/fluence/utils/utils.go`, lines 144 to 149 at 33ab097.
OK, going to try this:
@cmisale @milroy I'm working on the second bullet above, and wanted to have a discussion about the format that we want. We currently do something like this (and please correct me if I'm wrong - I get this confused with jobspec nextgen):

```yaml
version: 1
resources:
- type: slot
  label: default
  count: 2
  with:
  - type: core
    count: 16
tasks:
- command: [ "app" ]
  slot: default
  count:
    per_slot: 1
```

with memory / GPU added if defined for the pod. That is done by parsing one container and then having the slot count be the number of nodes (I think). If we parse each container individually (which assumes they might be different), what should that look like? The only thing that made sense to me was to move the count down to the node, and then be able to say how many of each node type is requested:

```yaml
version: 1
resources:
- type: slot
  label: default
  with:
  - type: node
    count: 1
    with:
    - type: core
      count: 4
    - type: gpu
      count: 1
  - type: node
    count: 4
    with:
    - type: core
      count: 16
tasks:
- command: [ "app" ]
  slot: default
  count:
    per_slot: 1
```

But I remember there was a reason for having the slot directly above the core (and not including the nodes), so I think that might be wrong. That said, I don't know how to enforce a design here with nodes of different types, because the approach that asks for "this many CPU across whatever resources you need" doesn't capture multiple different containers well. If possible, let's have some discussion on the above! I have the next PR well underway, but I paused here because I wasn't sure.
Hm, I have to say I don't remember that well how to define jobspecs... I was much better at this before, lol.
I think if we want to ask fluxion for the right resources, and the pods vary slightly, we might need to customize the request for the pods that we see. For example, let's say the group has two pods that request 32 cores each and 1 GPU (some ML thing), and then 2 more pods that just need 16 cores and no GPU (some service). Up until this point we have used a "representative pod" and multiplied it by the count (maybe all 4 pods require 32 CPU), and in practice that is the most likely use case (homogeneous clusters). But we could very easily have this "potpourri of pods" that need different resources for the applications running within. That use case is more of an orchestrated idea, maybe 2 pods running an application and 2 running a service; the homogeneous use case is more for different ranks of an MPI application. At least that is my understanding - I think @milroy can probably comment better!
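To make the heterogeneous case above concrete, here is one purely illustrative shape for it, with two labeled slots (the labels `ml` / `service` and the commands `app` / `svc` are made up, and I am not even sure jobspec V1 allows multiple labeled slots in one request - so treat this as a question, not a proposal):

```yaml
version: 1
resources:
- type: slot
  label: ml
  count: 2
  with:
  - type: core
    count: 32
  - type: gpu
    count: 1
- type: slot
  label: service
  count: 2
  with:
  - type: core
    count: 16
tasks:
- command: [ "app" ]
  slot: ml
  count:
    per_slot: 1
- command: [ "svc" ]
  slot: service
  count:
    per_slot: 1
```

If multiple slots are not allowed, the alternative is presumably two separate match requests, one per pod "shape," which changes how we group pods before talking to fluxion.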
@cmisale I forget almost everything in every period between working on them, and have to remind myself what the heck a "slot" is... 😆 This is me reading through my vast library of books about Flux trying to answer that question... I started reading in my late 30s, and I'm still not sure what "total" vs "per_slot" is, but likely someone will figure it out eventually. I humbly request it on my tombstone: "Here lies Vanessa, she wanted a `per_slot` task for her test run of lammps."
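For posterity, my recollection from the Flux jobspec RFC (someone please correct me if this is wrong): `per_slot: N` launches N copies of the task in *each* slot, while `total: N` spreads N tasks across all the slots together. So with `count: 2` slots, the two forms below would both end up with 2 tasks, but they mean different things once the slot count changes:

```yaml
# assuming 2 slots are allocated:
tasks:
- command: [ "lmp" ]   # lammps binary, just as an example
  slot: default
  count:
    per_slot: 1        # 1 task per slot -> 2 tasks; scales with slot count
# versus:
#   count:
#     total: 2         # 2 tasks overall, distributed across the slots
```

With `per_slot`, doubling the slots doubles the tasks; with `total`, the task count stays fixed and only the placement changes.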