-
Notifications
You must be signed in to change notification settings - Fork 715
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
KEP-2170: Initial Implementations for v2 Manager #2236
KEP-2170: Initial Implementations for v2 Manager #2236
Conversation
9265a1c
to
25e6763
Compare
25e6763
to
5828054
Compare
/hold |
Pull Request Test Coverage Report for Build 10705754285Details
💛 - Coveralls |
4780722
to
d52692c
Compare
d52692c
to
dafd032
Compare
Ready for review. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for this @tenzen-y!
Makefile
Outdated
@@ -75,6 +75,10 @@ testall: manifests generate fmt vet golangci-lint test ## Run tests. | |||
test: envtest | |||
KUBEBUILDER_ASSETS="$(shell setup-envtest use $(ENVTEST_K8S_VERSION) -p path)" go test ./... -coverprofile cover.out | |||
|
|||
.PHONY: test-integrationv2 | |||
test-integrationv2: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to add envtest
instructions, similar to v1 ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, that's a good point.
You're right.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
flag.StringVar(&webhookServiceName, "webhook-service-name", "training-operator", "Name of the Service used as part of the DNSName") | ||
flag.StringVar(&webhookSecretName, "webhook-secret-name", "training-operator-webhook-cert", "Name of the Secret to store CA and server certs") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we separate resources names between V1 and V2 ? In that case, we can give user an option to deploy two versions of Training Operator.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That makes sense.
How about "webhook-secret-name=training-operator-v2-webhook-cert" and "webhook-service-name=training-operator-v2"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I think it looks good.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
} | ||
} | ||
|
||
func (r *ClusterTrainingRuntimeReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to have reconcilers for Training Runtimes since we don't do any orchestration for them initially ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need this reconciler since the TrainingRuntime reconcilers have only event handlers and predicators.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will implement event handlers and predicators in the third phase.
1st phase: this PR.
2nd phase: JobSet builder interface.
3rd phase: actual reconciler and event handlers.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What kind of event handlers and predicators are you planning to implement for Training Runtime ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure. I'm planning the following handlers, but I may add more event handling mechanisms in the actual implementation steps.
Create: Trigger the TrainJob reconcilers by triggering the generic event of the TrainJob reconcilers.
Update: Ignore the Update event since the webhook can be disabled by webhookConfiguration failurePolicy.
Delete: Ignore the Delete event when the Runtime is still used by any TrainJob since the webhook can be disabled by webhookConfiguration failurePolicy.
Generic: None
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tenzen-y For the Create event do you mean: Trigger the TrainingRuntime
reconcilers by triggering the generic event of the TrainingRuntime
reconcilers ?
For the Delete, do you mean that user can't delete TrainingRuntime if it used by any TrainJob ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the Create event do you mean: Trigger the TrainingRuntime reconcilers by triggering the generic event of the TrainingRuntime reconcilers ?
No, we do not trigger the TrainingRuntime reconciler; we just trigger the TrainJob reconciler by the TrainingRuntime event handler.
For the Delete, do you mean that user can't delete TrainingRuntime if it used by any TrainJob ?
Yes, that's right. We should prevent accidently deleting the TrainingRuntime when the TrainingRuntime is referred by any TranJob.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, we do not trigger the TrainingRuntime reconciler; we just trigger the TrainJob reconciler by the TrainingRuntime event handler.
@tenzen-y What orchestration for the TrainJob are we going to do when we trigger such event ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let me show the actual step in the following:
- During setting up controllers, we pass the go channel for TrainJob Generic EventHandler to the TrainigRuntime controller.
- TrainingRuntime Controller: Get TrainingRuntime creation event
- TrainingRuntime Controller: Send the object to the TrainJob controller using go channel
- TrainJob Controller: Start reconciling
This is a generic technique to start reconciling by out-of-reconcilers. I will implement the mechanism in the 3rd step.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I synced with @andreyvelich offline.
I was supposed to provide functionality creation of TraininigRuntime after the TrainJob was created.
So, I added TrainingRuntime reconcilers. But, we decided not to allow user to create TrainJob before the TrainingRuntime is created in the first iteration. So, users need to create the TrainigRuntime before the TrainJob is created.
Additionally, we decided to handle the deletion of TrainingRuntim by finalizers, and then we never used to reconciler for deletion event for TrainingRuntime.
pkg/controller.v2/controller.go
Outdated
} | ||
if err := NewTrainJobReconciler( | ||
mgr.GetClient(), | ||
mgr.GetEventRecorderFor("training-operator-trainjob-controller"), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to add the training-operator-
name in the event recorder ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, we should add the prefix since this event recorder name is used across the entire cluster.
|
||
func NewTrainJobReconciler(client client.Client, recorder record.EventRecorder) *TrainJobReconciler { | ||
return &TrainJobReconciler{ | ||
log: ctrl.Log.WithName("trainjob-reconciler"), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we call this logger: trainjob-controller
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the "train job-reconciler" would be better since the controller contains reconcilers, event handlers, and predicators.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, but from the logging perspective we are not going to separate it, isn't ?
It might be easier from user perspective to see those logs under trainjob-controller
name.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Uhm, it seems that you have a point.
So, I will replace trainjob-reconciler
with trainjob-controller
. Thanks
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
@@ -0,0 +1,74 @@ | |||
/* |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do you want to separate those integration tests into separate folder (not into the /pkg/controller.v2/trainjob_controller_test.go
)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do you want to separate those integration tests into separate folder (not into the /pkg/controller.v2/trainjob_controller_test.go)?
This separate test directory allows us to avoid fragmentation of the place for testing.
Because, indeed, the integration testing depends on the multiple directories. So, I think locating integration testing in each directory would not be appropriate.
Maybe similar to Kueue and JobSet, we should create subfolder for e2e and integration tests under /test directory.
Yes, I assumed the directory structure. In the first implementation, we want to add unit and integration (envtest) tests. After that we want to implement E2E testing. How do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, but can we keep the folder structure correctly initially ?
E.g. add the integration tests under:
/test/integration/controller.v2/trainjob_controller_test.go
/test/integration/controller.v2/trainingruntime_controller_test.go
/test/integration/controller.v2/suite_test.go
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Proposed directory structure looks good.
Thank you for proposing that!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
dafd032
to
470512e
Compare
/assign @kubeflow/wg-training-leads for review |
470512e
to
0120280
Compare
@@ -0,0 +1,72 @@ | |||
/* |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to remove these test as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, we can remove this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
"k8s.io/client-go/rest" | ||
"sigs.k8s.io/controller-runtime/pkg/client" | ||
|
||
"github.com/kubeflow/training-operator/test/integration/framework" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you see a value to separate framework logic into separate file ?
E.g. we can add this logic in the suite_test.go
, similar to JobSet: https://github.com/kubernetes-sigs/jobset/blob/main/test/integration/controller/suite_test.go
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is needed since we will add webhook server integration testing and we will use this framework to set up testing environment.
- /test/integration/controller.v2/
- /test/integration/webhook.v2/trainjob_test.go
- /test/integration/webhook.v2/trainingruntime_test.go
- /test/integration/webhook.v2/clustertrainingruntime_test.go
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using integration testing framework would be better between testing for trainjob_controller, trainjob_wwbhook, trainingruntime_webhook, and clustertrainingruntime_webhook.
@@ -0,0 +1,74 @@ | |||
/* |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can remove it for now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, we can.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
0120280
to
c9b5a9c
Compare
pkg/controller.v2/controller.go
Outdated
@@ -0,0 +1,29 @@ | |||
/* |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we name this file setup.go
to make it more descriptive ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
import ( | ||
"context" | ||
|
||
"github.com/go-logr/logr" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, this is the controller-runtime logger. The controller-runtime logger implement this logr interface, so we need to import this library to use the controller-runtime logger. You can see the actual controller-runtime logger in the Reconcile function.
Signed-off-by: Yuki Iwai <[email protected]>
c9b5a9c
to
9f8e37f
Compare
if err := r.client.Get(ctx, req.NamespacedName, &trainJob); err != nil { | ||
return ctrl.Result{}, client.IgnoreNotFound(err) | ||
} | ||
log := ctrl.LoggerFrom(ctx).WithValues("trainJob", klog.KObj(&trainJob)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the value to configure controller runtime logger with klog.KObj(&trainJob)
?
Also, since we add logger to the TrainJob reconciler struct, why do you create a new logger here ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the value to configure controller runtime logger with klog.KObj(&trainJob) ?
Also, since we add logger to the TrainJob reconciler struct, why do you create a new logger here ?
This is not new logger, actually this logger is propagated from Reconciler and we reuse it.
Here, we add information for which objects are reconciling.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed, we do not generate new logger, and not add a logger name (WithName
) here. That is WithValue
as you can see here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see, that makes sense!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for doing this @tenzen-y!
/lgtm
/assign @kubeflow/wg-training-leads @shravan-achar @kannon92
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think, we are ready to merge it!
/approve
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: andreyvelich The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
/hold cancel |
What this PR does / why we need it:
I implemented the skeleton v2 manager. So, this PR introduces only the following things:
These skeleton implementations allow us to unblock webhooks or other implementations for other contributors.
This depends on #2237, and #2237 blocks this PR.
Which issue(s) this PR fixes (optional, in
Fixes #<issue number>, #<issue number>, ...
format, will close the issue(s) when PR gets merged):Part-of #2207
Checklist: