refactor: testing idea to wrap coscheduling #69
Conversation
Also tweak some docstrings for fluence. Signed-off-by: vsoch <[email protected]>
Problem: the local upstream might get out of date Solution: provide an easy make update to pull from it. Signed-off-by: vsoch <[email protected]>
Signed-off-by: vsoch <[email protected]> There are too many edge cases / too much complexity and behavior that I do not understand to pursue having the pod group information cached with fluence. For now I am nuking it and testing the initial design as a sanity check.
Problem: the podgroups with second precision have interleaving Solution: try to create an internal representation with better precision. This looks promising with early testing, but I need to consider the edge cases and how to clean up the groups, otherwise a pod group might be re-created later and still in the cache. Signed-off-by: vsoch <[email protected]>
Signed-off-by: vsoch <[email protected]>
Signed-off-by: vsoch <[email protected]>
We install with the helm manifests, and the old fluence manifests might be confusing (they have changed). This commit will remove the old manifests, and also change some of the fmt.Print logging to use klog to be easier to parse. Signed-off-by: vsoch <[email protected]>
This adds a prototype support for an extra helm flag that dually enables adding an extra grpc set of endpoints, and then the configs (ingress and service) necessary to expose them. I next need to figure out how to interact with grpc from a local client, likely built from the same codebase and grpc spec. This is super cool!! Signed-off-by: vsoch <[email protected]>
Problem: we want to be able to persist PodGroup if upstream removes it Solution: build our own controller image, also allowing us to tweak it to enhance fluence. This commit also renames the helm install to be "fluence" so it is easier for the developer workflow Signed-off-by: vsoch <[email protected]>
Signed-off-by: vsoch <[email protected]>
Signed-off-by: vsoch <[email protected]>
Problem: we want to try a design where a mutating admission webhook can handle receiving and creating PodGroup from labels. We are choosing mutating with the expectation that, at some point, we might be able to change the size (min/max/desired) either for the PodGroup or some other watcher to jobs. Note that this is an empty skeleton - the webhook is added and running but basically doing nothing. I am also fixing a bug that I noticed while running kind that fluence was assigning work to the control plane. I think there may have been logic (a commented out worker label) that was anticipating doing a check for a control plane, but it looks like on production clusters we do not always have access and it was never finished. Note that this addition does not guarantee this design will work, but it is just one step. Since the helm charts are manually generated for the scheduler-plugin (as far as I can tell) this took me almost 6 hours to figure out and get working. I am really starting to think there is no skill behind software engineering beyond absolute patience. Signed-off-by: vsoch <[email protected]>
Problem: we need every pod object coming into the cluster to be part of a group. Solution: This change adds logic to the mutating webhook to add the labels that indicate the group name and size. We can eventually add flexibility here. I also realize that we can easily watch for job objects first, and add the group size/name to the pod template; this is much more efficient because we then do not have to add labels to each individual pod that is part of a larger job. With this approach I was able to create a fluence-scheduled pod, and then see my labels added! It does not do anything beyond that. I am also adding a nice script that makes it easy to build, load, and install fluence freshly, otherwise you will use all your internet data for the month in like, two days. Do not do that :P Signed-off-by: vsoch <[email protected]>
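As a rough sketch of the label-defaulting step this commit describes (the label keys, default size, and function name below are hypothetical, not fluence's actual identifiers):

package main

import "fmt"

// Hypothetical label keys for this sketch; the real fluence labels may differ.
const (
    groupNameLabel = "fluence.group-name"
    groupSizeLabel = "fluence.group-size"
)

// ensureGroupLabels fills in a group name and size when the user has not set
// them, so every pod admitted to the cluster belongs to some group.
func ensureGroupLabels(labels map[string]string, podName string) map[string]string {
    if labels == nil {
        labels = map[string]string{}
    }
    if _, ok := labels[groupNameLabel]; !ok {
        // A pod without an explicit group becomes a group of one, named after itself.
        labels[groupNameLabel] = podName
    }
    if _, ok := labels[groupSizeLabel]; !ok {
        labels[groupSizeLabel] = "1"
    }
    return labels
}

func main() {
    fmt.Println(ensureGroupLabels(nil, "lammps-0"))
}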
Problem: we want the labels (size and name) that are explicitly set to lead to the creation of the pod group so the user does not need to. This is done by way of a watcher on pod, which will trigger after the webhook that ensures that every pod (in a job or single pod) has the proper label. Likely we want to do this for other abstractions that hold pods as well, because it ensures that no matter how the pods go into pending, we have the correct size and name. The only case where a pod can come in without the label is when it was not scheduled by fluence. The user is directed not to do this, but it is not impossible (e.g., fluence sees itself show up here actually). So after this addition we have the full steps to add the labels and create the pod group, and next steps are (finally) to integrate this into fluence (and remove the old abstraction to store it in memory). Signed-off-by: vsoch <[email protected]>
Problem: fluence should only be storing state of jobid and presence of a group name in a map to indicate node assignment. Solution: update the code here. Note that this is not working yet, and I am pushing / opening the PR to not lose the work (and will update accordingly, using this PR to test). Signed-off-by: vsoch <[email protected]>
Force-pushed af71f88 to a250385.
I'm testing another idea to only allow scheduling the first in the group, so that the next pods in the group get seen right after it. It's too bad we can't selectively grab pods from the queue (or maybe we can?). Otherwise we could just get any pod scheduled and then assign (actually assign, not just declare the node and wait for the pod to show up again) the pod to the node.
It's running now (and been running). I can't back this up with data, but it seems to not be doing the "freeze for a while and then kick in" like an old car thing. Will update if/when it finishes. I doubled the size of the experiment so it's size chonkmaster. Side note - we need a chonker scale for experiments. 🐱 🐈 LAMMPS? A Fine Boi!
AMD? A Heckin' Chonker
MuMMI? OH LAWD HE COMIN'!
Problem: Since the PodGroup controller creates the PodGroup, it should delete it as well. Solution: Ideally I wanted to attach an owner reference, meaning that the top level job (that also owns the pod) would be owner to the PodGroup. But that does not seem to take - either because the controller is the owner or the field is read only for k8s. For the time being, I decided to delete the PodGroup when the group is determined to be Finished/Failed, which happens when that number of pods equals or exceeds the MinimumSize. I think, granted that MinimumSize == size, this should be OK with fluence, and we might need to consider other approaches if/when the min size is smaller than the total size (because fluence might still see a pod in the queue and try to schedule again). I think what we might do in that case is just update the MinSize for the group, so if fluence schedules again it will be for the smaller size. But not sure about that either! TBA. The important thing now is that the pod group cleans itself up! Signed-off-by: vsoch <[email protected]>
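A minimal sketch of that cleanup condition, assuming succeeded/failed pod counts and a MinMember-style minimum size (illustrative only, not the controller's actual code):

package main

import "fmt"

// groupIsDone mirrors the rule described above: once the number of finished
// (succeeded or failed) pods reaches the group's minimum size, the PodGroup
// can be deleted by the controller.
func groupIsDone(succeeded, failed, minMember int32) bool {
    return succeeded+failed >= minMember
}

func main() {
    fmt.Println(groupIsDone(4, 0, 4)) // true: all pods of a size-4 group finished
    fmt.Println(groupIsDone(2, 1, 4)) // false: still waiting on one pod
}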
Problem: we need to be able to run deployments, stateful/replica sets and have them handled by fluence. Solution: allow the webhook to create pod groups for them. In the case they are not targeted for fluence (any abstraction) and get into the PreFilter, allow creation of a FauxPodGroup that will simply schedule one job for the pod. We do this twice - in PreFilter and in the events for update/delete. Signed-off-by: vsoch <[email protected]>
Problem: I noticed in testing that the time only had granularity down to the second. Solution: It appears that when we do a create of the PodGroup from the reconciler watch, the metadata (beyond name and namespace) does not stick. I am not sure why, but the labels are still retrievable from the pods (via the mutating webhook) after. So instead, we need to get the size and creation timestamp at the first hit in reconcile, which (given how that works) should still somewhat honor the order. I did try adding the timestamp to a label but it got hairy really quickly (kept me up about 3 hours longer than I intended to!) The good news now is that I see the microseconds in the Schedule Start Time, so we should be almost ready to test this on a GCP cluster. I also had lots of time waiting for the containers to rebuild so I made a diagram of how it is currently working. I have some concerns about the internal state of fluxion (my kind cluster stopped working after some hours and I do not know why) but we can address them later. We mostly need to see if there are jobs that are being forgotten, etc. Signed-off-by: vsoch <[email protected]>
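A tiny standalone illustration of the precision difference at play here, using plain Go time formatting rather than the actual PodGroup fields:

package main

import (
    "fmt"
    "time"
)

func main() {
    now := time.Now()
    // Second-precision timestamps interleave when many groups are created
    // within the same second...
    fmt.Println(now.Format(time.RFC3339))
    // ...while microsecond precision preserves the creation order.
    fmt.Println(now.Format("2006-01-02T15:04:05.000000Z07:00"))
}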
Problem: the design description did not correspond with the numbers Solution: fix them up, also fix some bugs in the controller and fluence that assume we have pods / pod group (we do not always) Signed-off-by: vsoch <[email protected]>
I am making small changes as I test on GKE and EKS. My first tests on GKE had me creating / deleting jobs, and I think the state of fluence (fluxion) got out of sync with the jobs, meaning that fluxion thought jobs were running that were not and then was unable to allocate new ones. To adjust for that we can add back in the cancel response, but this will only work given that fluence has not lost memory of the job id. We likely need an approach that can either save the jobids to the state data (that could be reloaded) or a way to inspect jobs explicitly and purge, OR (better) a way to look up a job not based on the id, but based on the group id (the command in the jobspec). That way, regardless of a jobid, we could lose all of our state and still find the old (stale) job to delete. With a fresh state and larger cluster I am able to run jobs on GKE, but they are enormously slow - lammps size 2 2 2 is taking over 20 minutes. This is not the fault of fluence - GKE networking sucks. To keep debugging I likely need to move over to AWS with EFA, of course that introduces more things to figure out like EFA, etc. Signed-off-by: vsoch <[email protected]>
This is the "skeleton" of a new idea to wrap coscheduling, adding in the logic for fluence only where it is needed, likely in the PodGroup (in the new fluence/core/core that wraps the same in coscheduling). This is just a skeleton because we are deploying the sidecar with the wrapped scheduling and absolutely no logic ported over to AskFlux. I think I have a sense of where to put this, but wanted to save this vanilla/skeleton state in case we need to go back to it. Note that it did not work to have fluence inherit the functions from coscheduler, so I opted for a strategy of adding it as a helper field, and then just using it when necessary. Signed-off-by: vsoch <[email protected]>
Problem: fluence is missing! Solution: add back fluence. This is a different design in that we do the asking from the perspective of the pod group, meaning that we get back a full set of nodes, and save them (assigned exactly) to specific pods. This could also be more lenient - e.g., creating a cache of the list and then taking off the cache, but I like the finer granularity of 1:1 mapping for future issues that might arise (where one pod needs a new node). This design also introduces a nice feature that we can ask for the resources (meaning creating a jobspec) for exactly what we need across pods for the group, because we are listing all pods for the group before we generate the jobspec. I left it as it currently was before (using one representative pod) to not incur too many changes, but this definitely can be tried. There is likely more work to be done to test edge cases and account for resources when fluence starts (and be able to load a state if it restarts) but this is pretty great for a first shot! The local lammps experiment ran without clogging and I am testing on GKE as a next step. Finally, I think there is a lot of potential error in allowing a ton of other PreFilter plugins to exist, each of which could return their own set of nodes to consider that might mismatch what fluence has decided on. For this reason I have done aggressive pruning and we can add things back as we see fit. Signed-off-by: vsoch <[email protected]>
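A small sketch of the 1:1 pod-to-node mapping described in this commit, with illustrative pod and node names (not the actual AskFlux types):

package main

import "fmt"

// assignNodes pins each pod in the group to exactly one of the nodes that
// fluxion returned for the group, in order.
func assignNodes(podNames, nodes []string) (map[string]string, error) {
    if len(podNames) != len(nodes) {
        return nil, fmt.Errorf("group has %d pods but received %d nodes", len(podNames), len(nodes))
    }
    assignment := make(map[string]string, len(podNames))
    for i, pod := range podNames {
        assignment[pod] = nodes[i]
    }
    return assignment, nil
}

func main() {
    assignment, err := assignNodes([]string{"lammps-0", "lammps-1"}, []string{"node-3", "node-7"})
    if err != nil {
        panic(err)
    }
    fmt.Println(assignment)
}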
Problem: it is really hard using klog and parsing through messy multi-threaded logs. Solution: make a little (likely temporary) filesystem logger for a single place of truth! Signed-off-by: vsoch <[email protected]>
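A minimal sketch of such a single-file logger using only the Go standard library (the path and prefix are illustrative):

package main

import (
    "log"
    "os"
)

func main() {
    // Append everything to one file so there is a single place of truth,
    // instead of interleaved multi-threaded klog output.
    f, err := os.OpenFile("/tmp/fluence.log", os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0644)
    if err != nil {
        log.Fatal(err)
    }
    defer f.Close()

    logger := log.New(f, "fluence ", log.LstdFlags|log.Lmicroseconds)
    logger.Println("PreFilter: asking flux for nodes")
}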
Problem: we currently allow any pod in the group to make the request Solution: Making a BIG assumption that might be wrong, I am adding logic that only allows scheduling (meaning going through PreFilter with AskFlux) given that we see the first pod in the listing. In practice this is the first index (e.g., index 0) which based on our sorting strategy (timestamp then name) I think might work. But I am not 100% on that. The reason we want to do that is so the nodes are chosen for the first pod, and then the group can quickly follow and be actually assigned. Before I did this I kept seeing huge delays in waiting for the queue to move (e.g., 5/6 pods Running and the last one waiting, and then kicking in much later like an old car) and I think with this tweak that is fixed. But this is my subjective evaluation. I am also adding in the hack script for deploying to gke, which requires a push instead of a kind load. Signed-off-by: vsoch <[email protected]>
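A standalone sketch of that gate: sort the group listing by timestamp and then name, and only let the first pod trigger AskFlux (simplified types, timestamps shown as strings):

package main

import (
    "fmt"
    "sort"
)

// podEntry is a simplified stand-in for a pod in the group listing.
type podEntry struct {
    name      string
    timestamp string // creation time, comparable as a string here for brevity
}

// isFirstInGroup reports whether the named pod is index 0 after sorting the
// group by timestamp and then name, which is the condition for scheduling.
func isFirstInGroup(pod string, group []podEntry) bool {
    sort.Slice(group, func(i, j int) bool {
        if group[i].timestamp != group[j].timestamp {
            return group[i].timestamp < group[j].timestamp
        }
        return group[i].name < group[j].name
    })
    return len(group) > 0 && group[0].name == pod
}

func main() {
    group := []podEntry{
        {"lammps-1", "2024-03-01T10:00:00.000002Z"},
        {"lammps-0", "2024-03-01T10:00:00.000001Z"},
    }
    fmt.Println(isFirstInGroup("lammps-0", group)) // true: first in the sorted listing
    fmt.Println(isFirstInGroup("lammps-1", group)) // false: a sibling, so it waits
}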
Force-pushed a839b3e to 8c99f10.
I'm not sure why there are conflicts here now - going to try working on a separate branch and see what's up.
Ah, no need - it was rebased with main and the PR was open to another (now deprecated) branch. Updating the PR branch to be against main resolved that (and showed the other large amount of work that preceded this one).
Problem: the submit of the first index works for more controlled lengths (e.g., lammps takes a while) but was having issues with really quick jobs. Solution: try restoring the queue that allows for enabling sibling pods so any group can be scheduled. Signed-off-by: vsoch <[email protected]>
Problem: we need to update to a newer go to keep up with the sig-scheduler upstream, and also the rainbow scheduler integration. Solution: upgrade to 1.21. This also required some refactor of the main.go and fluence due to changes in function signatures. This is a test to see if tests are passing - the fluxion-go bindings used here are from a branch (not merged yet) that can be used for the PR this one is going into, and before merging that final one we should merge and release the bindings more properly. Signed-off-by: vsoch <[email protected]>
@milroy I went through and changed most of the short variables (e.g., pgMgr) to be fully written out. I left a few that are convention for operators (reconcilers) like "mgr" (manager) and "req" (request). The one I didn't change from the original code was "ps" primarily because I don't know what that means. Let me know if there are others you'd like expanded. Thanks for starting to take a look!
I'm in the process of reviewing this PR, but it is taking quite a while. There's a lot of churn in the commits, e.g.,
I'm really sorry it's so big - it's added up over many months, and understandably that sucks to review. I won't push anything until you are finished, and we can talk about the best way to work on it to minimize reviewer pain.
Can you focus on reviewing the final state (not each commit)?
I think if I were to push, the comments would say outdated, but they wouldn't go away (we could still see them).
I made quite a few comments and am afraid some of them may not be relevant for the final file states of the PR. I'll try to find those and close them.
In general, this looks really good and won't require much to merge. I think the main issues to address are single-character variable names (not including files that were automatically generated), and choices for primitive integer types.
This PR represents a ton of excellent work; thank you!
// 2. Compare the initialization timestamps of fluence pod groups
// 3. Fall back, sort by namespace/name
// See https://kubernetes.io/docs/concepts/scheduling-eviction/scheduling-framework/
func (f *Fluence) Less(podInfo1, podInfo2 *framework.QueuedPodInfo) bool { |
Double-checking the expected behavior given the syntax: Less(pod1, pod2) should return true if pod1 < pod2 by one of the three criteria. return prio1 > prio2 returns true if pod1's priority is greater than pod2's. That means pod1 < pod2, which should mean pod1 comes before pod2.
I think the function needs to return true if:
- for priority, pod 1 has higher priority (larger number) than pod 2, so pod1 > pod2
- for timestamp, pod 1 has a group timestamp that is before the group timestamp for pod 2 (there is a Before function for that)
We can see an example with priority and falling back to timestamp when they are equal here.
Here is the current docstring for our function.
// Less is used to sort pods in the scheduling queue in the following order.
// 1. Compare the priorities of Pods.
// 2. Compare the initialization timestamps of PodGroups or Pods.
// 3. Compare the keys of PodGroups/Pods: <namespace>/<podname>.
In summary we do the same - first looking at priority (returning true if pod 1 has a higher number) and then considering timestamp and returning true if pod group 1 was created before group 2. If the timestamps are equal, then we sort based on the group name, which should provide consistent ordering (and there is likely no "right" answer for which one should come first).
I was checking because we have pod1 < pod2 if pod1.priority > pod2.priority, which indeed makes sense.
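To make the agreed-upon ordering concrete, here is a small standalone sketch with simplified stand-in types (not the plugin's actual QueuedPodInfo handling):

package main

import (
    "fmt"
    "time"
)

// podInfo is a simplified stand-in for the scheduler's queued pod info.
type podInfo struct {
    priority  int32
    groupTime time.Time // pod group creation timestamp
    key       string    // <namespace>/<name>
}

// less mirrors the ordering discussed above: higher priority first, then the
// earlier pod group timestamp, then the namespace/name key as a tie-breaker.
func less(a, b podInfo) bool {
    if a.priority != b.priority {
        return a.priority > b.priority
    }
    if !a.groupTime.Equal(b.groupTime) {
        return a.groupTime.Before(b.groupTime)
    }
    return a.key < b.key
}

func main() {
    now := time.Now()
    pod1 := podInfo{priority: 0, groupTime: now, key: "default/group-a"}
    pod2 := podInfo{priority: 0, groupTime: now.Add(time.Second), key: "default/group-b"}
    fmt.Println(less(pod1, pod2)) // true: group-a was created first
}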
src/fluence/fluxion/fluxion.go (outdated)
// Create a temporary file to write and read the jobspec
// The first parameter here as the empty string creates in /tmp
file, err := os.CreateTemp("", "jobspec.*.yaml") |
Is it worth creating a temporary file just to turn it back into a string?
Absolutely not! I refactored this in another variant of this API, but (when I was working on it here) I didn't have enough confidence to question the original design decision. I thought there must be a reason, so I left it. I'll try removing it (it simplifies the code a ton and also we don't write a file we don't need) and see if tests pass (or if there is something I'm missing).
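A minimal sketch of the in-memory alternative, assuming a placeholder jobSpec struct (not fluence's actual jobspec type) and the gopkg.in/yaml.v3 library:

package main

import (
    "fmt"

    "gopkg.in/yaml.v3"
)

// jobSpec is a placeholder standing in for the real jobspec type.
type jobSpec struct {
    Version int `yaml:"version"`
}

// jobspecToString marshals directly to a string, avoiding the temporary
// file round trip questioned above.
func jobspecToString(spec jobSpec) (string, error) {
    out, err := yaml.Marshal(&spec)
    if err != nil {
        return "", err
    }
    return string(out), nil
}

func main() {
    spec, err := jobspecToString(jobSpec{Version: 1})
    if err != nil {
        panic(err)
    }
    fmt.Print(spec)
}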
src/fluence/jgf/types.go (outdated)
Id      int `json:"id"`
Uniq_id int `json:"uniq_id"`
Rank    int `json:"rank,omitempty"`
I think these should be int64s. They are in Fluxion, and there will likely be cases where the number of vertices approaches the max expressible int32.
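A minimal sketch of that change, keeping only the three quoted fields and showing an illustrative round trip past the int32 limit (the real struct has more fields):

package main

import (
    "encoding/json"
    "fmt"
)

// nodeMetadata widens the quoted identifier fields to int64 so very large
// graphs cannot overflow int32; the rest of the struct is omitted here.
type nodeMetadata struct {
    Id      int64 `json:"id"`
    Uniq_id int64 `json:"uniq_id"`
    Rank    int64 `json:"rank,omitempty"`
}

func main() {
    // A vertex count beyond the int32 maximum (2147483647) still round-trips.
    meta := nodeMetadata{Id: 3000000000, Uniq_id: 3000000000, Rank: -1}
    out, _ := json.Marshal(meta)
    fmt.Println(string(out))
}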
Working on it now. One question I have is about how the Name is set here:
flux-k8s/src/fluence/jgf/jgf.go
Lines 116 to 158 in 9f24f36
func (g *Fluxjgf) MakeSubnet(index int, ip string) string {
    newnode := node{
        Id: strconv.Itoa(g.Elements),
        Metadata: nodeMetadata{
            Type:      "subnet",
            Basename:  ip,
            Name:      ip + strconv.Itoa(g.Elements),
            Id:        index,
            Uniq_id:   g.Elements,
            Rank:      -1,
            Exclusive: false,
            Unit:      "",
            Size:      1,
            Paths: map[string]string{
                "containment": "",
            },
        },
    }
    g.addNode(newnode)
    return newnode.Id
}

func (g *Fluxjgf) MakeNode(index int, exclusive bool, subnet string) string {
    newnode := node{
        Id: strconv.Itoa(g.Elements),
        Metadata: nodeMetadata{
            Type:      "node",
            Basename:  subnet,
            Name:      subnet + strconv.Itoa(g.Elements),
            Id:        g.Elements,
            Uniq_id:   g.Elements,
            Rank:      -1,
            Exclusive: exclusive,
            Unit:      "",
            Size:      1,
            Paths: map[string]string{
                "containment": "",
            },
        },
    }
    g.addNode(newnode)
    return newnode.Id
}
flux-k8s/src/fluence/jgf/jgf.go
Lines 160 to 180 in 9f24f36
func (g *Fluxjgf) MakeSocket(index int, name string) string {
    newnode := node{
        Id: strconv.Itoa(g.Elements),
        Metadata: nodeMetadata{
            Type:      "socket",
            Basename:  name,
            Name:      name + strconv.Itoa(index),
            Id:        index,
            Uniq_id:   g.Elements,
            Rank:      -1,
            Exclusive: false,
            Unit:      "",
            Size:      1,
            Paths: map[string]string{
                "containment": "",
            },
        },
    }
    g.addNode(newnode)
    return newnode.Id
}
See "Name" in each of the above. I think this would do good to have a comment to explain. My thinking is that maybe there is a subtle distinction between 'socket id in the scope of other sockets" vs "id in the scope of the entire graph" but I think the Name should be the first.
The containment paths should also be a path from the cluster root to the named resource. I'm not sure if fluxion needs to use it, if not there is no issue that it's empty. If yes, this might be a source of error.
flux-k8s/src/fluence/jgf/jgf.go
Lines 173 to 175 in 9f24f36
Paths: map[string]string{
    "containment": "",
},
Okay, I'm making a meta issue for us to tackle after this one - I think this entire function needs some refactoring, but that should be a scoped PR after this one.
My thinking is that maybe there is a subtle distinction between "socket id in the scope of other sockets" vs "id in the scope of the entire graph", but I think the Name should be the first.
You are correct. See an example here: https://github.com/flux-framework/flux-sched/blob/master/t/data/resource/jgfs/tiny.json.
The containment paths should also be a path from the cluster root to the named resource. I'm not sure if fluxion needs to use it, if not there is no issue that it's empty. If yes, this might be a source of error.
Fluxion hasn't needed to use it in Fluence yet because paths are used for subsystem checking, and more importantly establishing existence and uniqueness when mutating the resource graph (i.e., elasticity). We should add correct path generation in a follow-up PR.
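For that follow-up, a tiny sketch of what path generation might look like, assuming the containment path is simply the chain of resource names from the cluster root down to the vertex (names here are illustrative):

package main

import (
    "fmt"
    "strings"
)

// containmentPath joins the resource names from the cluster root down to a
// vertex, in the style of flux-sched's JGF examples.
func containmentPath(ancestors ...string) string {
    return "/" + strings.Join(ancestors, "/")
}

func main() {
    fmt.Println(containmentPath("cluster0", "subnet0", "node1", "socket0"))
    // Output: /cluster0/subnet0/node1/socket0
}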
Problem: a lot of the variables with pg are hard to understand Solution: write out podGroup or groupName explicitly. Signed-off-by: vsoch <[email protected]>
This set of changes includes the following:
1. Renaming short variable names to be longer and more understandable.
2. Not using the Status.ScheduleStartTime for the pod start time, but instead adding a new field. The previous field was there for a different purpose.
3. Creating named identifiers for resource types that can be shared in the jgf module, along with others that use the same relations / vertex types.
4. Removing comments that are not necessary.
5. Changing JGF types from int to int64 where warranted.
6. Fixing spelling mistakes, etc.
7. Removing the need to write the jobspec to a temporary file (we just need the string).
The JGF and utils modules need some additional looking at - specifically, I am worried that the paths->containment is not set, and sometimes the name reflects the index of the overall graph (global) and other times the index of the resource type. I think we likely want the latter for the inner name, but I am not sure in practice that fluxion is using it (internally). I am pushing these changes to assess testing, etc., and will update the PR as needed. There could also have been changes to upstream since the PR was opened that warrant additional fixes. Signed-off-by: vsoch <[email protected]>
Force-pushed 14d5652 to cbeffce.
@milroy commit with response to review is here: cbeffce. I left unresolved the comments that need further conversation, or just the ones that were more fun that you should see. ⚾ 🪝 I do see some issues with the design of the JGF, but I'm not sure if there are functional implications. I updated an issue here to reflect those thoughts: #71 (and please add other bullets to it). Let's discuss the comments here as much as we can, and (if we can meet this Wednesday) finish up discussion then. I will do one final test of experiments I've run before using fluence to ensure that everything is still working before we merge.
Thank you very much for the quick and detailed work addressing my comments. This is a tremendous improvement, and makes Fluence far more useful and stable. I'm approving the PR, but let's discuss a few points and test the changes as you suggested before merging.
To update from my post in slack (which will be lost into the black hole of segfaults): Fluence experiments are done - I did three sizes, each one batch with varying iterations, where one iteration == 5 jobs ranging in size from 2 to 6. Small was 3 of those, medium was 5, and large was 10. No clogging in any case, and the one flaw we have (that we know about) is that when the queue gets much longer, you have to wait longer to cycle through the entire thing and schedule the next job. So the large size took about 38 minutes. But the default scheduler would fail after a few minutes (clog), so we are optimizing for that - no clogging. And a follow-up question I had this morning to think about: why do we use the Kubernetes queue, period? Why can't we do what flux does with the planner and schedule the job for a set of nodes in the future? Then we wouldn't need to process through the same jobs continuously. It would improve things quite a bit I think. I suspect we are missing a flux component. I think we are good to merge if someone wants to do the honors, or I can. It's not perfect and there is more to do, but this is a first step. 👣
We've started discussing improvements and changes to Fluence along these lines, and let's continue in a separate issue. For now, I'm merging this PR.
This is the "skeleton" of a new idea to wrap coscheduling, where we will add in the logic for fluence only where it is needed, likely in the PodGroup (in the new fluence/core/core that wraps the same in coscheduling). This is just a skeleton because we are deploying the sidecar with the wrapped scheduling and absolutely no logic ported over to AskFlux. I think I have a sense of where to put this, but wanted to save this vanilla/skeleton state in case we need to go back to it. Note that it did not work to have fluence inherit the functions from coscheduler, so I opted for a strategy of adding it as a helper field, and then just using it when necessary (a call to
f.cosched
).Note that this has no logic to work yet, so the tests will obviously fail. I want to open it so I can clearly see changes here (and update in logical, incremental steps).
Update: this doesn't seem to work, so we will need to duplicate a lot. I pushed again and have included all functions, and also restored functionality for the pod group to be cleaned up by the controller (easier development for me).
Second update: this is now working without clogging for these experiments: https://github.com/converged-computing/operator-experiments/tree/main/google/scheduler/run7
Update: this has been updated with #74 which:
Before we do a final merge here, we will want to merge flux-framework/fluxion-go#8 and then do a release that we update the digest for here. Note that fluence seems to be working OK so we likely can do this soon. My next step is working on other tools to supplement the experiments. But I'm not sure fluence in master (as a new build) would work now with upstream (actually I am sure it will not unless an older version is used).
This will close #71, #73, and #59.