Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

KEP-2170: Initial Implementations for v2 Manager #2236

Conversation

tenzen-y
Copy link
Member

@tenzen-y tenzen-y commented Aug 27, 2024

What this PR does / why we need it:
I implemented the skeleton v2 manager. So, this PR introduces only the following things:

  • Main manager setupper.
  • The skeleton controllers for TrainJob, TrainingRuntime, and ClusterTrainingRuntime.
  • The minimum integration tests using envtest so that we can be confident that this manager can be launched successfully.

These skeleton implementations allow us to unblock webhooks or other implementations for other contributors.

This depends on #2237, and #2237 blocks this PR.

Which issue(s) this PR fixes (optional, in Fixes #<issue number>, #<issue number>, ... format, will close the issue(s) when PR gets merged):
Part-of #2207

Checklist:

  • Docs included if any changes are user facing

@tenzen-y
Copy link
Member Author

/hold

@tenzen-y tenzen-y marked this pull request as ready for review August 27, 2024 23:38
@coveralls
Copy link

coveralls commented Aug 27, 2024

Pull Request Test Coverage Report for Build 10705754285

Details

  • 46 of 46 (100.0%) changed or added relevant lines in 1 file are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage increased (+68.2%) to 100.0%

Totals Coverage Status
Change from base Build 10701428188: 68.2%
Covered Lines: 46
Relevant Lines: 46

💛 - Coveralls

@tenzen-y tenzen-y force-pushed the first-implementaton-for-traininig-v2 branch 4 times, most recently from 4780722 to d52692c Compare August 30, 2024 21:58
@tenzen-y tenzen-y force-pushed the first-implementaton-for-traininig-v2 branch from d52692c to dafd032 Compare August 30, 2024 22:24
@tenzen-y
Copy link
Member Author

Ready for review.
/assign @kubeflow/wg-training-leads

Copy link
Member

@andreyvelich andreyvelich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for this @tenzen-y!

Makefile Outdated
@@ -75,6 +75,10 @@ testall: manifests generate fmt vet golangci-lint test ## Run tests.
test: envtest
KUBEBUILDER_ASSETS="$(shell setup-envtest use $(ENVTEST_K8S_VERSION) -p path)" go test ./... -coverprofile cover.out

.PHONY: test-integrationv2
test-integrationv2:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to add envtest instructions, similar to v1 ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, that's a good point.
You're right.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Comment on lines 79 to 80
flag.StringVar(&webhookServiceName, "webhook-service-name", "training-operator", "Name of the Service used as part of the DNSName")
flag.StringVar(&webhookSecretName, "webhook-secret-name", "training-operator-webhook-cert", "Name of the Secret to store CA and server certs")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we separate resources names between V1 and V2 ? In that case, we can give user an option to deploy two versions of Training Operator.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That makes sense.
How about "webhook-secret-name=training-operator-v2-webhook-cert" and "webhook-service-name=training-operator-v2"?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I think it looks good.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

}
}

func (r *ClusterTrainingRuntimeReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to have reconcilers for Training Runtimes since we don't do any orchestration for them initially ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need this reconciler since the TrainingRuntime reconcilers have only event handlers and predicators.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will implement event handlers and predicators in the third phase.

1st phase: this PR.
2nd phase: JobSet builder interface.
3rd phase: actual reconciler and event handlers.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What kind of event handlers and predicators are you planning to implement for Training Runtime ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure. I'm planning the following handlers, but I may add more event handling mechanisms in the actual implementation steps.

Create: Trigger the TrainJob reconcilers by triggering the generic event of the TrainJob reconcilers.
Update: Ignore the Update event since the webhook can be disabled by webhookConfiguration failurePolicy.
Delete: Ignore the Delete event when the Runtime is still used by any TrainJob since the webhook can be disabled by webhookConfiguration failurePolicy.
Generic: None

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tenzen-y For the Create event do you mean: Trigger the TrainingRuntime reconcilers by triggering the generic event of the TrainingRuntime reconcilers ?
For the Delete, do you mean that user can't delete TrainingRuntime if it used by any TrainJob ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the Create event do you mean: Trigger the TrainingRuntime reconcilers by triggering the generic event of the TrainingRuntime reconcilers ?

No, we do not trigger the TrainingRuntime reconciler; we just trigger the TrainJob reconciler by the TrainingRuntime event handler.

For the Delete, do you mean that user can't delete TrainingRuntime if it used by any TrainJob ?

Yes, that's right. We should prevent accidently deleting the TrainingRuntime when the TrainingRuntime is referred by any TranJob.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, we do not trigger the TrainingRuntime reconciler; we just trigger the TrainJob reconciler by the TrainingRuntime event handler.

@tenzen-y What orchestration for the TrainJob are we going to do when we trigger such event ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me show the actual step in the following:

  1. During setting up controllers, we pass the go channel for TrainJob Generic EventHandler to the TrainigRuntime controller.
  2. TrainingRuntime Controller: Get TrainingRuntime creation event
  3. TrainingRuntime Controller: Send the object to the TrainJob controller using go channel
  4. TrainJob Controller: Start reconciling

This is a generic technique to start reconciling by out-of-reconcilers. I will implement the mechanism in the 3rd step.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I synced with @andreyvelich offline.
I was supposed to provide functionality creation of TraininigRuntime after the TrainJob was created.
So, I added TrainingRuntime reconcilers. But, we decided not to allow user to create TrainJob before the TrainingRuntime is created in the first iteration. So, users need to create the TrainigRuntime before the TrainJob is created.
Additionally, we decided to handle the deletion of TrainingRuntim by finalizers, and then we never used to reconciler for deletion event for TrainingRuntime.

}
if err := NewTrainJobReconciler(
mgr.GetClient(),
mgr.GetEventRecorderFor("training-operator-trainjob-controller"),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to add the training-operator- name in the event recorder ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, we should add the prefix since this event recorder name is used across the entire cluster.


func NewTrainJobReconciler(client client.Client, recorder record.EventRecorder) *TrainJobReconciler {
return &TrainJobReconciler{
log: ctrl.Log.WithName("trainjob-reconciler"),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we call this logger: trainjob-controller ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the "train job-reconciler" would be better since the controller contains reconcilers, event handlers, and predicators.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, but from the logging perspective we are not going to separate it, isn't ?
It might be easier from user perspective to see those logs under trainjob-controller name.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Uhm, it seems that you have a point.
So, I will replace trainjob-reconciler with trainjob-controller. Thanks

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@@ -0,0 +1,74 @@
/*
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do you want to separate those integration tests into separate folder (not into the /pkg/controller.v2/trainjob_controller_test.go)?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe similar to Kueue and JobSet, we should create subfolder for e2e and integration tests under /test directory.

Copy link
Member Author

@tenzen-y tenzen-y Sep 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do you want to separate those integration tests into separate folder (not into the /pkg/controller.v2/trainjob_controller_test.go)?

This separate test directory allows us to avoid fragmentation of the place for testing.
Because, indeed, the integration testing depends on the multiple directories. So, I think locating integration testing in each directory would not be appropriate.

Maybe similar to Kueue and JobSet, we should create subfolder for e2e and integration tests under /test directory.

Yes, I assumed the directory structure. In the first implementation, we want to add unit and integration (envtest) tests. After that we want to implement E2E testing. How do you think?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, but can we keep the folder structure correctly initially ?
E.g. add the integration tests under:

/test/integration/controller.v2/trainjob_controller_test.go
/test/integration/controller.v2/trainingruntime_controller_test.go
/test/integration/controller.v2/suite_test.go

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Proposed directory structure looks good.
Thank you for proposing that!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@tenzen-y tenzen-y force-pushed the first-implementaton-for-traininig-v2 branch from dafd032 to 470512e Compare September 3, 2024 08:52
@andreyvelich
Copy link
Member

/assign @kubeflow/wg-training-leads for review

@tenzen-y tenzen-y force-pushed the first-implementaton-for-traininig-v2 branch from 470512e to 0120280 Compare September 3, 2024 20:47
@@ -0,0 +1,72 @@
/*
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to remove these test as well?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, we can remove this.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

"k8s.io/client-go/rest"
"sigs.k8s.io/controller-runtime/pkg/client"

"github.com/kubeflow/training-operator/test/integration/framework"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you see a value to separate framework logic into separate file ?
E.g. we can add this logic in the suite_test.go , similar to JobSet: https://github.com/kubernetes-sigs/jobset/blob/main/test/integration/controller/suite_test.go

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is needed since we will add webhook server integration testing and we will use this framework to set up testing environment.

- /test/integration/controller.v2/
- /test/integration/webhook.v2/trainjob_test.go
- /test/integration/webhook.v2/trainingruntime_test.go
- /test/integration/webhook.v2/clustertrainingruntime_test.go

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using integration testing framework would be better between testing for trainjob_controller, trainjob_wwbhook, trainingruntime_webhook, and clustertrainingruntime_webhook.

@@ -0,0 +1,74 @@
/*
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can remove it for now.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, we can.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@tenzen-y tenzen-y force-pushed the first-implementaton-for-traininig-v2 branch from 0120280 to c9b5a9c Compare September 3, 2024 21:34
@google-oss-prow google-oss-prow bot added size/L and removed size/XL labels Sep 3, 2024
@@ -0,0 +1,29 @@
/*
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we name this file setup.go to make it more descriptive ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

import (
"context"

"github.com/go-logr/logr"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think, we agree before to use controller-runtime logger everywhere to be consistent: #2048
Should we make that change @tenzen-y ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, this is the controller-runtime logger. The controller-runtime logger implement this logr interface, so we need to import this library to use the controller-runtime logger. You can see the actual controller-runtime logger in the Reconcile function.

@tenzen-y tenzen-y force-pushed the first-implementaton-for-traininig-v2 branch from c9b5a9c to 9f8e37f Compare September 4, 2024 16:29
if err := r.client.Get(ctx, req.NamespacedName, &trainJob); err != nil {
return ctrl.Result{}, client.IgnoreNotFound(err)
}
log := ctrl.LoggerFrom(ctx).WithValues("trainJob", klog.KObj(&trainJob))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the value to configure controller runtime logger with klog.KObj(&trainJob) ?
Also, since we add logger to the TrainJob reconciler struct, why do you create a new logger here ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the value to configure controller runtime logger with klog.KObj(&trainJob) ?
Also, since we add logger to the TrainJob reconciler struct, why do you create a new logger here ?

This is not new logger, actually this logger is propagated from Reconciler and we reuse it.
Here, we add information for which objects are reconciling.

Copy link
Member Author

@tenzen-y tenzen-y Sep 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed, we do not generate new logger, and not add a logger name (WithName) here. That is WithValue as you can see here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, that makes sense!

Copy link
Member

@andreyvelich andreyvelich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for doing this @tenzen-y!
/lgtm
/assign @kubeflow/wg-training-leads @shravan-achar @kannon92

Copy link
Member

@andreyvelich andreyvelich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think, we are ready to merge it!
/approve

Copy link

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andreyvelich

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@tenzen-y
Copy link
Member Author

tenzen-y commented Sep 6, 2024

/hold cancel

@google-oss-prow google-oss-prow bot merged commit 6ddeb2b into kubeflow:master Sep 6, 2024
39 checks passed
@tenzen-y tenzen-y deleted the first-implementaton-for-traininig-v2 branch September 6, 2024 14:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants