
refactor: testing idea to wrap coscheduling #69

Merged
merged 28 commits into main from rebase-on-coscheduling on May 24, 2024

Conversation

vsoch
Member

@vsoch vsoch commented Mar 7, 2024

This is the "skeleton" of a new idea to wrap coscheduling, where we will add in the logic for fluence only where it is needed, likely in the PodGroup (in the new fluence/core/core that wraps the same in coscheduling). This is just a skeleton because we are deploying the sidecar with the wrapped scheduling and absolutely no logic ported over to AskFlux. I think I have a sense of where to put this, but wanted to save this vanilla/skeleton state in case we need to go back to it. Note that it did not work to have fluence inherit the functions from coscheduler, so I opted for a strategy of adding it as a helper field, and then just using it when necessary (a call to f.cosched).

Note that this has no logic to work yet, so the tests will obviously fail. I want to open it so I can clearly see changes here (and update in logical, incremental steps).

Update: this doesn't seem to work, so we will need to duplicate a lot. I pushed again and have included all functions, and also restored functionality for the pod group to be cleaned up by the controller (easier development for me).

Second update: this is now working without clogging for these experiments: https://github.com/converged-computing/operator-experiments/tree/main/google/scheduler/run7

Update: this has been updated with #74 which:

  • has support to re-enable activating siblings
  • uses a branch of fluxion-go updated for go 1.21 and kubernetes 1.29
  • updates go to 1.21 here

Before we do a final merge here, we will want to merge flux-framework/fluxion-go#8 and then do a release so we can update the digest here. Note that fluence seems to be working OK, so we can likely do this soon. My next step is working on other tools to supplement the experiments. But I'm not sure fluence in master (as a new build) would work now with upstream (actually I am sure it will not, unless an older version is used).

This will close #71, #73, and #59.

vsoch added 15 commits January 7, 2024 21:37
Also tweak some docstrings for fluence.

Signed-off-by: vsoch <[email protected]>
Problem: the local upstream might get out of date
Solution: provide an easy make update to pull from it.
Signed-off-by: vsoch <[email protected]>
Signed-off-by: vsoch <[email protected]>

There are too many edge cases / too much complexity and behavior
that I do not understand to pursue having the pod group information
cached with fluence. For now I am nuking it and testing the initial
design as a sanity check.
Problem: the podgroups with second precision have interleaving
Solution: try to create an internal representation with better
precision. This looks promising with early testing, but I need
to consider the edge cases and how to clean up the groups, otherwise
a pod group might be re-created later and still in the cache.

Signed-off-by: vsoch <[email protected]>
We install with the helm manifests, and the old fluence
manifests might be confusing (they have changed). This commit
will remove the old manifests, and also change some of the fmt.Print
logging to use klog to be easier to parse.

Signed-off-by: vsoch <[email protected]>
This adds prototype support for an extra helm flag that
dually enables adding an extra grpc set of endpoints, and then
the configs (ingress and service) necessary to expose them.
I next need to figure out how to interact with grpc from
a local client, likely built from the same codebase and grpc
spec. This is super cool!!

Signed-off-by: vsoch <[email protected]>
Problem: we want to be able to persist PodGroup if upstream removes it
Solution: build our own controller image, also allowing us to
tweak it to enhance fluence. This commit also renames the helm
install to be "fluence" so it is easier for the developer
workflow

Signed-off-by: vsoch <[email protected]>
Problem: we want to try a design where a mutating admission webhook can handle
receiving and creating PodGroup from labels. We are choosing mutating with
the expectation that, at some point, we might be able to change the size
(min/max/desired) either for the PodGroup or some other watcher to jobs.
Note that this is an empty skeleton - the webhook is added and running
but basically doing nothing. I am also fixing a bug that I noticed while
running kind that fluence was assigning work to the control plane. I think
there may once have been logic (a commented-out worker label) that was
anticipating doing a check for a control plane, but it looks like on
production clusters we do not always have access and it was never
finished. Note that this addition does not guarantee this design will
work, but it is just one step. Since the helm charts are manually
generated for the scheduler-plugin (as far as I can tell) this
took me almost 6 hours to figure out and get working. I am really
starting to think there is no skill behind software engineering
beyond absolute patience.

Signed-off-by: vsoch <[email protected]>
Problem: we need every pod object coming into the cluster to be
part of a group.
Solution: This change adds logic to the mutating webhook to
add the labels that indicate the group name and size. We
can eventually add flexibility here. I also realize that
we can easily watch for job objects first, and add the
group size/name to the pod template. This will be much
more efficient to then not have to add to the individual
pods that are part of a larger job. With this approach
I was able to create a fluence scheduled pod, and then
see my labels added! It does not do anything beyond that.
I am also adding a nice script that makes it easy to
build, load, and install fluence freshly, otherwise you
will use all your internet data for the month in like,
two days. Do not do that :P

Signed-off-by: vsoch <[email protected]>
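
As an illustration of the labeling step this commit describes, here is a minimal sketch; it is not the actual fluence webhook code, and the label keys and the default-to-size-1 behavior are assumptions for the example:

package main

import (
    "fmt"

    corev1 "k8s.io/api/core/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// Hypothetical label keys for illustration; the real keys are defined in fluence.
const (
    groupNameLabel = "fluence.flux-framework.org/pod-group"
    groupSizeLabel = "fluence.flux-framework.org/group-size"
)

// ensureGroupLabels mimics what the mutating webhook is described as doing:
// any pod arriving without group labels is defaulted to a group named after
// the pod, with size 1, so every pod belongs to some group.
func ensureGroupLabels(pod *corev1.Pod) {
    if pod.Labels == nil {
        pod.Labels = map[string]string{}
    }
    if _, ok := pod.Labels[groupNameLabel]; !ok {
        pod.Labels[groupNameLabel] = pod.Name
    }
    if _, ok := pod.Labels[groupSizeLabel]; !ok {
        pod.Labels[groupSizeLabel] = "1"
    }
}

func main() {
    pod := &corev1.Pod{ObjectMeta: metav1.ObjectMeta{Name: "lammps-0"}}
    ensureGroupLabels(pod)
    fmt.Println(pod.Labels)
}
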
Problem: we want the explicitly set labels (size and name)
to lead to the creation of the pod group so the user does not need to create it.
This is done by way of a watcher on pods, which triggers after
the webhook that ensures that every pod (in a job or a single pod)
has the proper label. Likely we want to do this for other abstractions
that hold pods as well, because it ensures that no matter how the pods
go into pending, we have the correct size and name. The only case in which
a pod can come in without the label is when it was not scheduled
by fluence. The user is directed not to do this, but it is not
impossible (e.g., fluence sees itself show up here actually). So after
this addition we have the full steps to add the labels and create
the pod group, and next steps are (finally) to integrate this into
fluence (and remove the old abstraction to store it in memory).

Signed-off-by: vsoch <[email protected]>
Problem: fluence should only be storing state of jobid and
presence of a group name in a map to indicate node assignment.
Solution: update the code here. Note that this is not working
yet, and I am pushing / opening the PR so as not to lose the work
(and will update accordingly, using this PR to test).

Signed-off-by: vsoch <[email protected]>
@vsoch vsoch force-pushed the rebase-on-coscheduling branch 2 times, most recently from af71f88 to a250385 on March 7, 2024 22:54
@vsoch vsoch marked this pull request as ready for review April 2, 2024 23:21
@vsoch
Member Author

vsoch commented Apr 4, 2024

I'm testing another idea to only allow scheduling the first pod in the group, so that the next pods in the group get seen right after it. It's too bad we can't selectively grab pods from the queue (or maybe we can?). Otherwise we could just get any pod scheduled and then assign (actually assign, not just declare the node and wait for the pod to show up again) the pod to the node.

@vsoch
Member Author

vsoch commented Apr 5, 2024

It's running now (and has been running). I can't back this up with data, but it seems not to be doing the "freeze for a while and then kick in like an old car" thing. Will update if/when it finishes. I doubled the size of the experiment so it's size chonkmaster.

Side note - we need a chonker scale for experiments. 🐱 🐈

LAMMPS?   A Fine Boi!
AMD?      A Heckin' Chonker
MuMMI?    OH LAWD HE COMIN'!

vsoch added 9 commits April 4, 2024 19:42
Problem: Since the PodGroup controller creates the PodGroup, it
should delete it as well.
Solution: Ideally I wanted to attach an owner reference, meaning
that the top level job (that also owns the pod) would be owner
to the PodGroup. But that does not seem to take - either because
the controller is the owner or the field is read only for k8s.
For the time being, I decided to delete the PodGroup when the
group is determined to be Finished/Failed, which happens when
that number of pods equals or exceeds the MinimumSize. I think,
granted that MinimumSize == size, this should be OK with fluence,
and we might need to consider other approaches if/when the
min size is smaller than the total size (because fluence might
still see a pod in the queue and try to schedule again). I think
what we might do in that case is just update the MinSize for
the group, so if fluence schedules again it will be for the
smaller size. But I am not sure about that either! TBA. The
important thing now is that the pod group cleans itself up!

Signed-off-by: vsoch <[email protected]>
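
A tiny sketch of the cleanup rule described above; the function and counts are illustrative, not the controller's actual API:

package main

import "fmt"

// shouldCleanUp captures the rule described above: remove the PodGroup once
// the number of Finished plus Failed pods reaches MinMember, which for
// fluence currently equals the full group size.
func shouldCleanUp(finished, failed, minMember int32) bool {
    return finished+failed >= minMember
}

func main() {
    fmt.Println(shouldCleanUp(3, 1, 4)) // true: the group is done, delete it
    fmt.Println(shouldCleanUp(2, 0, 4)) // false: keep waiting
}
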
Problem: we need to be able to run deployments, stateful/replica sets
and have them handled by fluence.
Solution: allow the webhook to create pod groups for them. In the case
they are not targeted for fluence (any abstraction) and get into the
PreFilter, allow creation of a FauxPodGroup that will simply schedule one
job for the pod. We do this twice - in PreFilter and in the events
for update/delete.

Signed-off-by: vsoch <[email protected]>
Problem: I noticed in testing that the time only had granularity down to the second.
Solution: It appears that when we do a create of the PodGroup from the reconciler watch,
the metadata (beyond name and namespace) does not stick. I am not sure why, but the labels
are still retrievable from the pods (via the mutating webhook) after. So instead, we need
to get the size and creation timestamp at the first hit in reconcile, which (given how that
works) should still somewhat honor the order. I did try adding the timestamp to a label
but it got hairy really quickly (kept me up about 3 hours longer than I intended!). The
good news now is that I see the microseconds in the Schedule Start Time, so we should be
almost ready to test this on a GCP cluster. I also had lots of time waiting for the containers
to rebuild so I made a diagram of how it is currently working. I have some concerns about
the internal state of fluxion (my kind cluster stopped working after some hours and I do not
know why) but we can address them later. We mostly need to see if there are jobs that are being
forgotten, etc.

Signed-off-by: vsoch <[email protected]>
Problem: the design description did not correspond with the numbers
Solution: fix them up, also fix some bugs in the controller and
fluence that assume we have pods / pod group (we do not always)

Signed-off-by: vsoch <[email protected]>
I am making small changes as I test on GKE and EKS. My first tests on GKE had
me creating / deleting jobs, and I think the state of fluence (fluxion) got out
of sync with the jobs, meaning that fluxion thought jobs were running that were not
and then was unable to allocate new ones. To adjust for that we can add back in the
cancel response, but this will only work given that fluence has not lost memory
of the job id. We likely need an approach that can either save the jobids to the
state data (that could be reloaded) or a way to inspect jobs explicitly and purge,
OR (better) a way to look up a job not based on the id, but based on the group id
(the command in the jobspec). That way, regardless of a jobid, we could lose all
of our state and still find the old (stale) job to delete. With a fresh state
and larger cluster I am able to run jobs on GKE, but they are enormously slow -
lammps size 2 2 2 is taking over 20 minutes. This is not the fault of fluence -
GKE networking sucks. To keep debugging I likely need to move over to AWS with
EFA; of course, that introduces more things to figure out, like EFA itself, etc.

Signed-off-by: vsoch <[email protected]>
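
A sketch of the group-keyed bookkeeping suggested above; the names are hypothetical and this is not code that exists in fluence yet:

package main

import "fmt"

// jobTracker sketches the state described above: remember the fluxion
// jobid per pod group so a stale allocation can be cancelled by group
// name even if the pod-level bookkeeping gets out of sync.
type jobTracker struct {
    groupToJobid map[string]int64
}

func newJobTracker() *jobTracker {
    return &jobTracker{groupToJobid: map[string]int64{}}
}

func (t *jobTracker) record(group string, jobid int64) {
    t.groupToJobid[group] = jobid
}

// cancelStale returns the jobid to cancel for a group that no longer has
// running pods, and forgets it so we do not cancel twice.
func (t *jobTracker) cancelStale(group string) (int64, bool) {
    jobid, ok := t.groupToJobid[group]
    if ok {
        delete(t.groupToJobid, group)
    }
    return jobid, ok
}

func main() {
    tracker := newJobTracker()
    tracker.record("lammps", 42)
    fmt.Println(tracker.cancelStale("lammps")) // 42 true
    fmt.Println(tracker.cancelStale("lammps")) // 0 false
}
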
This is the "skeleton" of a new idea to wrap coscheduling, adding
in the logic for fluence only where it is needed, likely in
the PodGroup (in the new fluence/core/core that wraps the same
in coscheduling). This is just a skeleton because we are deploying
the sidecar with the wrapped scheduling and absolutely no logic
ported over to AskFlux. I think I have a sense of where to put
this, but wanted to save this vanilla/skeleton state in case
we need to go back to it. Note that it did not work to have
fluence inherit the functions from coscheduler, so I opted for
a strategy of adding it as a helper field, and then just
using it when necessary.

Signed-off-by: vsoch <[email protected]>
Problem: fluence is missing!
Solution: add back fluence. This is a different design in that we do the asking
from the perspective of the pod group, meaning that we get back a full set of nodes,
and save them (assigned exactly) to specific pods. This could also be more lenient -
e.g., creating a cache of the list and then taking off the cache, but I like the
finer granularity of 1:1 mapping for future issues that might arise (where one
pod needs a new node). This design also introduces a nice feature that we can ask
for the resources (meaning creating a jobspec) for exactly what we need across pods
for the group because we are listing all pods for the group before we generate
the jobspec. I left it as it currently was before (using one representative pod)
to not incur too many changes but this definitely can be tried. There is likely
more work to be done to test edge cases and account for resources when fluence
starts (and be able to load a state if it restarts) but this is pretty great for
a first shot! The local lammps experiment ran without clogging and I am testing
on GKE as a next step. Finally, I think there is a lot of potential error in
allowing a ton of other PreFilter plugins to exist, each of which could return
their own set of nodes to consider that might mismatch what fluence has decided
on. For this reason I have done aggressive pruning and we can add things back as
we see fit.

Signed-off-by: vsoch <[email protected]>
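
A rough sketch of the 1:1 node-to-pod mapping described in this commit; all names are hypothetical, and the real code works on pod and node objects rather than strings:

package main

import (
    "fmt"
    "sort"
)

// assignNodes pins each pod in a group to exactly one of the nodes that
// fluxion returned for the group.
func assignNodes(nodes []string, pods []string) (map[string]string, error) {
    if len(nodes) < len(pods) {
        return nil, fmt.Errorf("fluxion returned %d nodes for %d pods", len(nodes), len(pods))
    }
    // Sort a copy of the pod names so the mapping is deterministic
    // if the group is seen again before all pods are bound.
    sorted := append([]string{}, pods...)
    sort.Strings(sorted)
    assignment := make(map[string]string, len(sorted))
    for i, pod := range sorted {
        assignment[pod] = nodes[i]
    }
    return assignment, nil
}

func main() {
    assignment, err := assignNodes(
        []string{"gke-node-1", "gke-node-2"},
        []string{"lammps-1", "lammps-0"},
    )
    if err != nil {
        panic(err)
    }
    fmt.Println(assignment) // map[lammps-0:gke-node-1 lammps-1:gke-node-2]
}
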
Problem: it is really hard using klog and parsing through messy multi-threaded logs
Solution: make a little (likely temporary) filesystem logger for a single place
of truth!

Signed-off-by: vsoch <[email protected]>
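
A minimal sketch of such a filesystem logger using only the standard library; the path and prefix are assumptions, not what fluence actually uses:

package main

import (
    "fmt"
    "log"
    "os"
)

// newFileLogger sketches the idea of the temporary filesystem logger:
// every fluence component writes to one file so scheduling decisions can
// be read in order, instead of interleaved klog output.
func newFileLogger(path string) (*log.Logger, *os.File, error) {
    f, err := os.OpenFile(path, os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0o644)
    if err != nil {
        return nil, nil, err
    }
    return log.New(f, "[fluence] ", log.LstdFlags|log.Lmicroseconds), f, nil
}

func main() {
    logger, f, err := newFileLogger("/tmp/fluence.log")
    if err != nil {
        fmt.Println("could not open log file:", err)
        return
    }
    defer f.Close()
    logger.Println("asking fluxion for nodes for group lammps")
}
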
Problem: we currently allow any pod in the group to make the request
Solution: Making a BIG assumption that might be wrong, I am adding logic
that only allows scheduling (meaning going through PreFilter with AskFlux)
given that we see the first pod in the listing. In practice this is the first
index (e.g., index 0) which based on our sorting strategy (timestamp then name)
I think might work. But I am not 100% on that. The reason we want to do that is
so the nodes are chosen for the first pod, and then the group can quickly
follow and be actually assigned. Before I did this I kept seeing huge delays
in waiting for the queue to move (e.g., 5/6 pods Running and the last one
waiting, and then kicking in much later like an old car) and I think with
this tweak that is fixed. But this is my subjective evaluation. I am
also adding in the hack script for deploying to gke, which requires a
push instead of a kind load.

Signed-off-by: vsoch <[email protected]>
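
A simplified sketch of the gate described above; real fluence sorts the group by PodGroup timestamp and then name, which plain string sorting stands in for here:

package main

import (
    "fmt"
    "sort"
)

// isGroupLeader returns true only for the pod that sorts first in its
// group listing, which is the only pod allowed to trigger AskFlux.
func isGroupLeader(podName string, groupPods []string) bool {
    if len(groupPods) == 0 {
        return false
    }
    sorted := append([]string{}, groupPods...)
    sort.Strings(sorted)
    return sorted[0] == podName
}

func main() {
    group := []string{"lammps-2", "lammps-0", "lammps-1"}
    fmt.Println(isGroupLeader("lammps-0", group)) // true: asks fluxion
    fmt.Println(isGroupLeader("lammps-2", group)) // false: waits for the leader
}
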
@vsoch vsoch force-pushed the rebase-on-coscheduling branch from a839b3e to 8c99f10 on April 5, 2024 01:42
@vsoch
Member Author

vsoch commented Apr 9, 2024

I'm not sure why there are conflicts here now - going to try working on a separate branch and see what's up.

@vsoch vsoch changed the base branch from fluence-controller to main April 9, 2024 01:08
@vsoch
Member Author

vsoch commented Apr 9, 2024

Ah, no need - it was rebased with main and the PR was opened against another (now deprecated) branch. Updating the PR branch to be against main resolved that (and showed the other large amount of work that preceded this one).

vsoch added 2 commits April 17, 2024 00:54
Problem: the submit of the first index works for more controlled lengths (e.g., lammps takes a while)
but was having issues with really quick jobs.
Solution: try restoring the queue that allows for enabling sibling pods so any group
can be scheduled.

Signed-off-by: vsoch <[email protected]>
Problem: we need to update to a newer go to keep up
with the sig-scheduler upstream, and also the rainbow
scheduler integration.
Solution: upgrade to 1.21. This also required some
refactor of the main.go and fluence due to changes
in function signatures. This is a test to see if
tests are passing - the fluxion-go bindings used
here are from a branch (not merged yet) that can
be used for the PR this one is going into, and
before merging that final one we should merge
and release the bindings more properly.

Signed-off-by: vsoch <[email protected]>
@vsoch
Member Author

vsoch commented May 3, 2024

@milroy I went through and changed most of the short variables (e.g., pgMgr) to be fully written out. I left a few that are convention for operators (reconcilers) like "mgr" (manager) and "req" (request). The one I didn't change from the original code was "ps" primarily because I don't know what that means. Let me know if there are others you'd like expanded. Thanks for starting to take a look!

@milroy
Member

milroy commented May 8, 2024

I'm in the process of reviewing this PR, but it is taking quite a while. There's a lot of churn in the commits, e.g., fluence.go is updated 17 times across commits. That makes it difficult to track the desired final state and also causes me to wonder if my comments will be obviated by a later commit.

@vsoch
Member Author

vsoch commented May 8, 2024

I'm in the process of reviewing this PR, but it is taking quite a while. There's a lot of churn in the commits, e.g., fluence.go is updated 17 times across commits.

I'm really sorry it's so big - it's added up over many months, and understandably that sucks to review. I won't push anything until you are finished, and we can talk about the best way to work on it to minimize reviewer pain.

That makes it difficult to track the desired final state

Can you focus on reviewing the final state (not each commit)?

and also causes me to wonder if my comments will be obviated by a later commit.

I think if I were to push, the comments would say outdated, but they wouldn't go away (we could still see them).

Member

@milroy milroy left a comment

I made quite a few comments and am afraid some of them may not be relevant for the final file states of the PR. I'll try to find those and close them.

In general, this looks really good and won't require much to merge. I think the main issues to address are single-character variable names (not including files that were automatically generated), and choices for primitive integer types.

This PR represents a ton of excellent work; thank you!

sig-scheduler-plugins/pkg/fluence/fluence.go (3 outdated review threads, resolved)
Comment on lines 148 to 151
// 2. Compare the initialization timestamps of fluence pod groups
// 3. Fall back, sort by namespace/name
// See https://kubernetes.io/docs/concepts/scheduling-eviction/scheduling-framework/
func (f *Fluence) Less(podInfo1, podInfo2 *framework.QueuedPodInfo) bool {
Member

Double-checking the expected behavior given the syntax:

Less(pod1, pod2) should return true if pod1 < pod2 by one of the three criteria. The expression return prio1 > prio2 returns true if pod1's priority is greater than pod2's. That means pod1 < pod2, which should mean pod1 comes before pod2.

Member Author

I think the function needs to return true if:

  • for priority, pod 1 has higher priority (larger number) than pod 2, so pod1 > pod2.
  • for timestamp, if pod 1 has a group timestamp that is before the group timestamp for pod 2 (there is a function Before for that)

We can see an example with priority, falling back to timestamp when the priorities are equal, here.

Here is the current docstring for our function.

// Less is used to sort pods in the scheduling queue in the following order.
// 1. Compare the priorities of Pods.
// 2. Compare the initialization timestamps of PodGroups or Pods.
// 3. Compare the keys of PodGroups/Pods: <namespace>/<podname>.

In summary we do the same - first looking at priority (returning true if pod 1 has a higher number) and then considering timestamp and returning true if pod group 1 was created before group 2. If the timestamps are equal, then we sort based on the group name, which should provide consistent ordering (and there is likely no "right" answer for which one should come first).
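
To make the ordering concrete, here is a self-contained sketch with stand-in types; the real function takes *framework.QueuedPodInfo and reads the PodGroup creation time, so the types below are assumptions made only to keep the example runnable:

package main

import (
    "fmt"
    "time"
)

// Simplified stand-ins for the scheduler framework and PodGroup data,
// just to illustrate the ordering logic discussed above.
type queuedPod struct {
    Priority  int32     // pod .spec.priority
    GroupTime time.Time // creation time of the pod's PodGroup
    Key       string    // "<namespace>/<podname>" (or group name)
}

// less returns true when pod1 should be scheduled before pod2:
// 1. higher priority first,
// 2. earlier PodGroup creation timestamp first,
// 3. fall back to lexical order on the namespace/name (or group) key.
func less(pod1, pod2 queuedPod) bool {
    if pod1.Priority != pod2.Priority {
        return pod1.Priority > pod2.Priority
    }
    if !pod1.GroupTime.Equal(pod2.GroupTime) {
        return pod1.GroupTime.Before(pod2.GroupTime)
    }
    return pod1.Key < pod2.Key
}

func main() {
    now := time.Now()
    a := queuedPod{Priority: 10, GroupTime: now, Key: "default/lammps-0"}
    b := queuedPod{Priority: 10, GroupTime: now.Add(time.Second), Key: "default/lammps-1"}
    fmt.Println(less(a, b)) // true: same priority, a's group was created first
}
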

Member

I was checking because we have pod1 < pod2 if pod1.priority > pod2.priority, which indeed makes sense.

sig-scheduler-plugins/pkg/fluence/fluence.go (outdated review thread, resolved)

// Create a temporary file to write and read the jobspec
// The first parameter here as the empty string creates in /tmp
file, err := os.CreateTemp("", "jobspec.*.yaml")
Member

Is it worth creating a temporary file just to turn it back into a string?

Member Author

Absolutely not! I refactored this in another variant of this API, but (when I was working on it here) I didn't have enough confidence to question the original design decision. I thought there must be a reason, so I left it. I'll try removing it (it simplifies the code a ton and also we don't write a file we don't need) and see if tests pass (or if there is something I'm missing).
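
A small sketch of the simplification, with a made-up buildJobspec stand-in, showing that the temp-file round trip only reproduces bytes we already have in memory:

package main

import (
    "fmt"
    "os"
)

// buildJobspec is a hypothetical stand-in for whatever produces the
// serialized jobspec (in fluence it is generated from the pod group).
func buildJobspec() []byte {
    return []byte("version: 9999\nresources: []\n")
}

func main() {
    spec := buildJobspec()

    // Before: round-trip through a temp file just to get a string back.
    file, err := os.CreateTemp("", "jobspec.*.yaml")
    if err != nil {
        panic(err)
    }
    defer os.Remove(file.Name())
    if _, err := file.Write(spec); err != nil {
        panic(err)
    }
    file.Close()
    raw, _ := os.ReadFile(file.Name())

    // After: the in-memory bytes are already the string we need.
    fmt.Println(string(raw) == string(spec)) // true
}
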

Comment on lines 41 to 43
Id int `json:"id"`
Uniq_id int `json:"uniq_id"`
Rank int `json:"rank,omitempty"`
Member

I think these should be int64s. They are in Fluxion, and there will likely be cases where the number of vertices approaches the max expressible int32.

Member Author

Working on it now. One question I have is why, for the Name, here

func (g *Fluxjgf) MakeSubnet(index int, ip string) string {
    newnode := node{
        Id: strconv.Itoa(g.Elements),
        Metadata: nodeMetadata{
            Type:      "subnet",
            Basename:  ip,
            Name:      ip + strconv.Itoa(g.Elements),
            Id:        index,
            Uniq_id:   g.Elements,
            Rank:      -1,
            Exclusive: false,
            Unit:      "",
            Size:      1,
            Paths: map[string]string{
                "containment": "",
            },
        },
    }
    g.addNode(newnode)
    return newnode.Id
}

func (g *Fluxjgf) MakeNode(index int, exclusive bool, subnet string) string {
    newnode := node{
        Id: strconv.Itoa(g.Elements),
        Metadata: nodeMetadata{
            Type:      "node",
            Basename:  subnet,
            Name:      subnet + strconv.Itoa(g.Elements),
            Id:        g.Elements,
            Uniq_id:   g.Elements,
            Rank:      -1,
            Exclusive: exclusive,
            Unit:      "",
            Size:      1,
            Paths: map[string]string{
                "containment": "",
            },
        },
    }
    g.addNode(newnode)
    return newnode.Id
}
we use the g.Elements length, but then for others we use an index?

func (g *Fluxjgf) MakeSocket(index int, name string) string {
    newnode := node{
        Id: strconv.Itoa(g.Elements),
        Metadata: nodeMetadata{
            Type:      "socket",
            Basename:  name,
            Name:      name + strconv.Itoa(index),
            Id:        index,
            Uniq_id:   g.Elements,
            Rank:      -1,
            Exclusive: false,
            Unit:      "",
            Size:      1,
            Paths: map[string]string{
                "containment": "",
            },
        },
    }
    g.addNode(newnode)
    return newnode.Id
}

See "Name" in each of the above. I think this would do good to have a comment to explain. My thinking is that maybe there is a subtle distinction between 'socket id in the scope of other sockets" vs "id in the scope of the entire graph" but I think the Name should be the first.

Member Author

The containment paths should also be paths from the cluster root to the named resource. I'm not sure if fluxion needs to use them; if not, there is no issue with them being empty. If yes, this might be a source of error.

Paths: map[string]string{
"containment": "",
},

Member Author

Okay, I'm making a meta issue for us to tackle after this one - I think this entire function needs some refactoring, but it should be a scoped PR after this one.

Member

My thinking is that maybe there is a subtle distinction between "socket id in the scope of other sockets" vs "id in the scope of the entire graph", but I think the Name should use the first.

You are correct. See an example here: https://github.com/flux-framework/flux-sched/blob/master/t/data/resource/jgfs/tiny.json.

The containment paths should also be paths from the cluster root to the named resource. I'm not sure if fluxion needs to use them; if not, there is no issue with them being empty. If yes, this might be a source of error.

Fluxion hasn't needed to use it in Fluence yet because paths are used for subsystem checking, and more importantly establishing existence and uniqueness when mutating the resource graph (i.e., elasticity). We should add correct path generation in a follow-up PR.
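
For reference, a containment path is the chain of resource names from the cluster root down to the vertex, as in the tiny.json example linked above; a trivial sketch of generating one, with illustrative names:

package main

import (
    "fmt"
    "path"
)

// containmentPath joins the ancestor names from the cluster root down to
// the vertex, e.g. "/cluster0/subnet0/node1" (names are illustrative only).
func containmentPath(ancestors ...string) string {
    return "/" + path.Join(ancestors...)
}

func main() {
    fmt.Println(containmentPath("cluster0", "subnet0", "node1"))
    // Output: /cluster0/subnet0/node1
}
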

src/fluence/jgf/types.go (outdated review thread, resolved)
src/fluence/utils/utils.go (2 outdated review threads, resolved)
vsoch added 2 commits May 14, 2024 00:36
Problem: a lot of the variables with pg are hard to understand
Solution: write out podGroup or groupName explicitly.
Signed-off-by: vsoch <[email protected]>
This set of changes includes the following:

1. Renaming short variable names to be longer and more understandable.
2. Not using the Status.ScheduleStartTime for the pod start time, but instead
adding a new field. This previous field was there for a different purpose.
3. Creating named identifiers for resource types that can be shared in the
jgf module along with others that use the same relations / vertex types.
4. Removing comments that are not necessary.
5. Changing JGF types from int to int64 that warrant it.
6. Spelling mistakes, etc.
7. Removing the need to write the jobspec to a temporary file (we just need the string)

The JGF and utils modules need some additional review - specifically I am worried
that the paths->containment is not set, and sometimes the name reflects the index
of the overall graph (global) and other times the index of the resource type. I
think we likely want the latter for the inner name, but I am not sure in practice
that fluxion is using it (internally). I am pushing these changes to assess testing,
etc., and will update the PR as needed. There could also have been changes to
upstream since the PR was opened that warrant additional fixes.

Signed-off-by: vsoch <[email protected]>
@vsoch vsoch force-pushed the rebase-on-coscheduling branch from 14d5652 to cbeffce on May 14, 2024 06:36
@vsoch
Member Author

vsoch commented May 14, 2024

@milroy commit with response to review is here: cbeffce

I left unresolved the comments that need further conversation, plus a few that were just more fun and that you should see. ⚾ 🪝 I do see some issues with the design of the JGF, but I'm not sure if there are functional implications. I updated an issue here to reflect those thoughts: #71 (and please add other bullets to it). Let's discuss the comments here as much as we can, and (if we can meet this Wednesday) finish up discussion then. I will do one final test of the experiments I've run before using fluence to ensure that everything is still working before we merge.

@milroy
Member

milroy commented May 22, 2024

Thank you very much for the quick and detailed work addressing my comments. This is a tremendous improvement, and makes Fluence far more useful and stable.

I'm approving the PR, but let's discuss a few points and test the changes as you suggested before merging.

@vsoch
Member Author

vsoch commented May 23, 2024

To update from my post in slack (which will be lost into the black hole of segfaults):

Fluence experiments are done - I did three sizes, each one batch with varying iterations, where one iteration == 5 jobs ranging in size from 2 to 6. Small was 3 of those, medium was 5, and large was 10. No clogging in any case, and the one flaw we have (that we know about) is that when the queue gets much longer, you have to wait longer to cycle through the entire thing and schedule the next job. So the large size took about 38 minutes. But the default scheduler would fail (clog) after a few minutes, so we are optimizing for that - no clogging.

And a follow up question I had this morning to think about: why do we use the Kubernetes queue, period? Why can't we do what flux does with the planner and schedule the job for a set of nodes in the future? Then we wouldn't need to process through the same jobs continuously. It would improve things quite a bit I think. I suspect we are missing a flux component.

I think we are good to merge if someone wants to do the honors, or I can. It's not perfect and there is more to do, but this is a first step. 👣

@milroy
Member

milroy commented May 24, 2024

And a follow up question I had this morning to think about: why do we use the Kubernetes queue, period? Why can't we do what flux does with the planner and schedule the job for a set of nodes in the future? Then we wouldn't need to process through the same jobs continuously. It would improve things quite a bit I think. I suspect we are missing a flux component.

We've started discussing improvements and changes to Fluence along these lines, and let's continue in a separate issue. For now, I'm merging this PR.

@milroy milroy merged commit 33ab097 into main May 24, 2024
7 checks passed