Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Updated failure estimator #3427

Merged
merged 15 commits into from
Feb 28, 2024
Merged

Conversation

severinson
Copy link
Contributor

@severinson severinson commented Feb 28, 2024

This PR introduces a new learning algorithm for node and queue failure probability estimation. Simulations indicate this new method is much more accurate than the one we use now.

Adding the gonum package as a dependency resulted in a version increase of the slices package we use. As a result, I had to update all calls to slices.SortFunc as the API had changed.

The snyk error reported seemingly only affects image decoding. It shouldn't matter to us.

type Equaler[T any] interface {
// Returns true if both objects are equal.
Equal(T) bool
}

// Number represents any integer or floating-point number.
type Number interface {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Curiously this interface isn't in the default collection.

@@ -0,0 +1,352 @@
using Random
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've included my simulator here for anyone working on this in the future.

@@ -61,8 +61,14 @@ func TestExecutorRepository_LoadAndSave(t *testing.T) {
}
retrievedExecutors, err := repo.GetExecutors(ctx)
require.NoError(t, err)
executorSort := func(a *schedulerobjects.Executor, b *schedulerobjects.Executor) bool {
return a.Id > b.Id
executorSort := func(a *schedulerobjects.Executor, b *schedulerobjects.Executor) int {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have several uses of SortFunc where the sorting is opposite that of what I would think the intuitive one.

d80tb7
d80tb7 previously approved these changes Feb 28, 2024
import "gonum.org/v1/gonum/mat"

type Optimiser interface {
Update(out, p *mat.VecDense, g mat.Vector) *mat.VecDense
Copy link
Collaborator

@robertdavidsmith robertdavidsmith Feb 28, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Better names for parameters? It's not clear what these do from the signature.
Comment at the top to make it clear how this interface works?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed. I left p and g as-is in the implementations as I think the long names make the code a bit less readable there.

// - {node, queue}SuccessProbabilityCordonThreshold: Success probability below which nodes (queues) are considered unhealthy.
// - {node, queue}CordonTimeout: Amount of time for which nodes (queues) remain unhealthy in the absence of any job successes or failures for that node (queue).
// - {node, queue}EquilibriumFailureRate: Job failures per second necessary for a node (queue) to remain unhealthy in the absence of any successes for that node (queue).
type Update struct {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

better names/comment?

@severinson severinson merged commit 8e52353 into master Feb 28, 2024
18 of 20 checks passed
@severinson severinson deleted the severinson/node-queue-failure-metrics-2 branch February 28, 2024 11:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants