KEP-2170: Add the TrainJob state transition design #2298

Merged

Conversation

tenzen-y
Member

What this PR does / why we need it:
I added the TrainJob state machine design.

Which issue(s) this PR fixes (optional, in Fixes #<issue number>, #<issue number>, ... format, will close the issue(s) when PR gets merged):
Part-of #2207
Relates to: #2170

Checklist:

  • Docs included if any changes are user facing

@tenzen-y
Member Author

/hold
/assign @kubeflow/wg-training-leads

@coveralls
coveralls commented Oct 20, 2024

Pull Request Test Coverage Report for Build 11645733536

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage remained the same at 100.0%

Totals Coverage Status
Change from base Build 11636357716: 0.0%
Covered Lines: 77
Relevant Lines: 77

💛 - Coveralls

@tenzen-y tenzen-y force-pushed the add-trainjob-state-transition branch from 851738c to 069345f Compare October 20, 2024 22:00
@google-oss-prow google-oss-prow bot added size/M and removed size/L labels Oct 20, 2024
@tenzen-y tenzen-y force-pushed the add-trainjob-state-transition branch 2 times, most recently from 39bfab2 to 8ba2206 Compare October 20, 2024 22:28
Member

@andreyvelich andreyvelich left a comment

Thank you for this @tenzen-y!
Please take a look @kubeflow/wg-training-leads @kannon92 @akshaychitneni @varshaprasad96 @ahg-g @vsoch @danielvegamyhre

// TrainJobSuspended means the TrainJob is suspended.
TrainJobSuspended string = "Suspended"

// TrainJobCompleted means that the actual jobs have completed its execution.

Member

Should this say TrainJob, and similarly for the other statuses?

Suggested change
- // TrainJobCompleted means that the actual jobs have completed its execution.
+ // TrainJobCompleted means that the TrainJob has completed its execution.

Member Author

Done.

@@ -279,6 +279,43 @@ type TrainJob struct {
Status TrainJobStatus `json:"status,omitempty"`
}

const (

Member

@andreyvelich andreyvelich Oct 21, 2024

Can we define those consts as a TrainJobConditionType, similar to the batch/v1 Job: https://github.com/kubernetes/api/blob/master/batch/v1/types.go#L617?

Member Author

I don't think so, since our conditions are typed as metav1.Condition, whose Type field is a plain string.
The batch/v1 Job uses the typed JobConditionType because Job conditions use the dedicated JobCondition type: https://github.com/kubernetes/api/blob/2be23f6c5a7184af9846ff458a11751765ea2bde/batch/v1/types.go#L662

If we used a dedicated TrainJobConditionType, we would need to cast it to string everywhere, which is not ideal.
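
To illustrate the trade-off described above, here is a minimal sketch. It assumes a local `Condition` struct as a stand-in for `metav1.Condition` (so the example compiles without Kubernetes dependencies); `TrainJobConditionType` and `TypedSuspended` are hypothetical names for the dedicated type that was suggested and not adopted.

```go
package main

import "fmt"

// Condition is a simplified local stand-in for metav1.Condition.
// As in metav1.Condition, the Type field is a plain string.
type Condition struct {
	Type   string
	Status string
	Reason string
}

// String-typed constants, in the style proposed in this PR.
const (
	TrainJobSuspended     string = "Suspended"
	TrainJobResumedReason string = "Resumed"
)

// TrainJobConditionType is the hypothetical dedicated type discussed above.
type TrainJobConditionType string

// TypedSuspended is a hypothetical constant of the dedicated type.
const TypedSuspended TrainJobConditionType = "Suspended"

// newSuspendedFromString needs no conversion: the constant is already a string.
func newSuspendedFromString() Condition {
	return Condition{Type: TrainJobSuspended, Status: "False", Reason: TrainJobResumedReason}
}

// newSuspendedFromTyped must convert explicitly at every use site.
func newSuspendedFromTyped() Condition {
	return Condition{Type: string(TypedSuspended), Status: "False", Reason: TrainJobResumedReason}
}

func main() {
	// Both constructors produce the same condition; only the call sites differ.
	fmt.Println(newSuspendedFromString() == newSuspendedFromTyped())
}
```

The resulting values are identical; the difference is only the `string(...)` conversion that the dedicated type forces on each call site.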

Member

Oh, you are right.


// TrainJobResumedReason is the "Suspended" condition reason.
// When the TrainJob suspension is changed from True to False, this is added.
TrainJobResumedReason string = "Resumed"

Member

@tenzen-y @kannon92 Do we have that status reason in JobSet or Job?

Member Author

@tenzen-y tenzen-y Oct 21, 2024

Yes, we do. At the batch/v1 Job level, JobResumed is used as a reason for the Suspended condition type.
At the JobSet level, ResumeJobs is used as a reason for the Suspended condition type.

Comment on lines +305 to +316
// TrainJobJobsCreationSucceededReason is the "Created" condition reason.
// When the creating objects succeeded after building succeeded, this is added.
TrainJobJobsCreationSucceededReason string = "JobsCreationSucceeded"

// TrainJobJobsBuildFailedReason is the "Created" condition reason.
// When the building objects based on the TrainJob and the specified runtime failed,
// this is added.
TrainJobJobsBuildFailedReason string = "JobsBuildFailed"

// TrainJobJobsCreationFailedReason is "Created" condition reason.
// When the creating objects failed even though building succeeded, this is added.
TrainJobJobsCreationFailedReason string = "JobsCreationFailed"

Member

Should we introduce those conditions in the second iteration?
I think we should discuss whether users really want separate conditions for when the TrainJob's object creation fails and when the TrainJob itself fails.

Member Author

@tenzen-y tenzen-y Oct 21, 2024

These are reasons, not condition types, so they are used as shown below. Additionally, if we do not distinguish JobsCreationFailed from JobsBuildFailed, it is hard to tell which part failed. So these reasons must be included in the first phase.

status:
  conditions:
  - type: Created
    status: false
    reason: JobsCreationFailed
status:
  conditions:
  - type: Created
    status: false
    reason: JobsBuildFailed
status:
  conditions:
  - type: Created
    status: true
    reason: JobsCreationSucceeded
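
The three pseudo statuses above could be derived along these lines. This is only a sketch: `Condition` is a local stand-in for `metav1.Condition`, the `TrainJobCreated` constant name and the `createdCondition` helper are assumptions for illustration, and the two example errors are invented.

```go
package main

import (
	"errors"
	"fmt"
)

// Condition is a simplified local stand-in for metav1.Condition.
type Condition struct {
	Type    string
	Status  string
	Reason  string
	Message string
}

// Reason constants follow the names proposed in this PR.
// TrainJobCreated is an assumed name for the "Created" condition type.
const (
	TrainJobCreated                     = "Created"
	TrainJobJobsCreationSucceededReason = "JobsCreationSucceeded"
	TrainJobJobsBuildFailedReason       = "JobsBuildFailed"
	TrainJobJobsCreationFailedReason    = "JobsCreationFailed"
)

// Example errors, used only for illustration.
var (
	errBuild  = errors.New("a runtime plugin failed while building objects")
	errCreate = errors.New("the API server rejected the built objects")
)

// createdCondition (hypothetical helper) maps the outcome of the build step
// and the creation step to a single "Created" condition.
func createdCondition(buildErr, createErr error) Condition {
	switch {
	case buildErr != nil:
		// Building failed, so creation was never attempted.
		return Condition{Type: TrainJobCreated, Status: "False",
			Reason: TrainJobJobsBuildFailedReason, Message: buildErr.Error()}
	case createErr != nil:
		// Building succeeded, but creating the objects failed.
		return Condition{Type: TrainJobCreated, Status: "False",
			Reason: TrainJobJobsCreationFailedReason, Message: createErr.Error()}
	default:
		return Condition{Type: TrainJobCreated, Status: "True",
			Reason: TrainJobJobsCreationSucceededReason}
	}
}

func main() {
	for _, c := range []Condition{
		createdCondition(errBuild, nil),
		createdCondition(nil, errCreate),
		createdCondition(nil, nil),
	} {
		fmt.Printf("%s=%s reason=%s\n", c.Type, c.Status, c.Reason)
	}
}
```

Keeping a single condition type and varying only the reason is what lets a reader tell which step failed without adding more condition types.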

Member

Since we have a validation webhook, in what use case do you see us hitting the JobsBuildFailed reason?

Member Author

We cannot validate in advance whether NewObject will succeed, because we cannot know beforehand what happens during NewObject.

We can only learn the actual error during reconciliation.

Member

So I guess one example could be when NewObject calls the Kubernetes API server and that call fails.
In that case, we want to transition the TrainJob to the JobsBuildFailed status.

Member Author

In that case, we want to transition the TrainJob to the JobsBuildFailed status.

Yes, that's right. In the current JobSet plugin, a build error could happen in the following:

Additionally, every plugin could return errors during the Kubeflow Job Pipeline Framework.

Comment on lines 1000 to 1003
- type: JobSetSuspended
status: false
- type: JobSetCompleted
status: true

Member

Why do we need to propagate JobSet conditions to the TrainJob?

Member Author

I propagated these JobSet conditions so that runtimes can propagate arbitrary conditions.
But in the JobSet case, these seem slightly redundant. So I'm wondering whether we should only extend the runtime interface so that a plugin can propagate arbitrary conditions, while the JobSet plugin does not propagate its own conditions; the TrainJob would then have no JobSet conditions.

That said, I guess the JobSet StartupPolicy conditions could help users understand what is happening in the actual Jobs.
So we could propagate only StartupPolicyInProgress and StartupPolicyCompleted.

https://github.com/kubernetes-sigs/jobset/blob/440f53e13ed1db15d4f1d5a04c2450e74df4d1e8/api/jobset/v1alpha2/jobset_types.go#L75-L78

@andreyvelich WDYT?

Member

So I'm wondering whether we should only extend the runtime interface so that a plugin can propagate arbitrary conditions, while the JobSet plugin does not propagate its own conditions; the TrainJob would then have no JobSet conditions.

I would suggest that we add those once we have an initial implementation of the TrainJob status.
I think this is a valid option, but we should discuss which conditions should be propagated by runtime plugins.

Any thoughts, @tenzen-y?

Member Author

I would suggest that we add those once we have an initial implementation of the TrainJob status.
I think this is a valid option, but we should discuss which conditions should be propagated by runtime plugins.

I think we need to implement the condition propagation mechanism, but it is not clear how it applies to JobSet, in particular which conditions should be propagated to the TrainJob. So I would propose that we extend the runtime interface so that plugins can propagate conditions to the TrainJob, but in the first status implementation we do not propagate any JobSet conditions to the TrainJob.

What do you think? @andreyvelich
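
One possible shape for the proposal above, as a rough sketch only: the `ConditionPropagator` interface, the `collect` helper, and both plugin types are hypothetical names invented for illustration and are not part of the runtime framework described in this PR. `Condition` is a local stand-in for `metav1.Condition`.

```go
package main

import "fmt"

// Condition is a simplified local stand-in for metav1.Condition.
type Condition struct {
	Type   string
	Status string
}

// ConditionPropagator is a hypothetical extension point: a plugin that
// implements it may contribute extra conditions to the TrainJob status.
type ConditionPropagator interface {
	PropagatedConditions() []Condition
}

// jobSetPlugin models the first status implementation discussed above:
// the interface exists, but no JobSet conditions are propagated.
type jobSetPlugin struct{}

func (jobSetPlugin) PropagatedConditions() []Condition { return nil }

// startupPolicyPlugin is a hypothetical plugin that propagates only a
// StartupPolicy condition, as floated earlier in this thread.
type startupPolicyPlugin struct{}

func (startupPolicyPlugin) PropagatedConditions() []Condition {
	return []Condition{{Type: "StartupPolicyCompleted", Status: "True"}}
}

// collect gathers the conditions contributed by every plugin, in order.
func collect(plugins ...ConditionPropagator) []Condition {
	var out []Condition
	for _, p := range plugins {
		out = append(out, p.PropagatedConditions()...)
	}
	return out
}

func main() {
	fmt.Println(len(collect(jobSetPlugin{})))
	fmt.Println(collect(jobSetPlugin{}, startupPolicyPlugin{}))
}
```

With this split, the mechanism can ship first while the question of which JobSet conditions to propagate stays open.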

Member

So I would propose that we extend the runtime interface so that plugins can propagate conditions to the TrainJob.

Sure, we can have this. Maybe we should create a tracking issue to implement it?
I don't think it has a high priority compared to other tasks.

Member Author

Yes, sure. I will open the issue.
But, let me mention the propagation mechanism in this KEP.

Member Author

Done.

[...]
status:
conditions:
- type: Created

Contributor

Would it be helpful to add the transition time and the reason for failure within the conditions? I believe it would help with debugging failures.

Member Author

@tenzen-y tenzen-y Oct 22, 2024

These are pseudo conditions. We actually support all fields of metav1.Condition, including the transition time, as you can see here:

// Conditions for the TrainJob.
Conditions []metav1.Condition `json:"conditions,omitempty"`
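
As a sketch of how the transition time and reason could be maintained on such a conditions slice: the `Condition` struct below is a reduced local stand-in for `metav1.Condition`, and `setCondition` is a hypothetical helper written for this example. It follows the convention, used by the apimachinery condition helpers as far as I know, that the transition time changes only when the status changes.

```go
package main

import (
	"fmt"
	"time"
)

// Condition is a reduced local stand-in for metav1.Condition.
type Condition struct {
	Type               string
	Status             string
	Reason             string
	Message            string
	LastTransitionTime time.Time
}

// Fixed timestamps so the example is deterministic.
var (
	t0 = time.Date(2024, time.October, 22, 0, 0, 0, 0, time.UTC)
	t1 = t0.Add(time.Minute)
)

// setCondition (hypothetical helper) adds or updates a condition by Type.
// Reason and Message are always refreshed; LastTransitionTime is touched
// only when Status actually changes.
func setCondition(conds []Condition, c Condition, now time.Time) []Condition {
	for i := range conds {
		if conds[i].Type != c.Type {
			continue
		}
		if conds[i].Status != c.Status {
			conds[i].Status = c.Status
			conds[i].LastTransitionTime = now
		}
		conds[i].Reason = c.Reason
		conds[i].Message = c.Message
		return conds
	}
	c.LastTransitionTime = now
	return append(conds, c)
}

func main() {
	conds := setCondition(nil, Condition{Type: "Created", Status: "False", Reason: "JobsBuildFailed"}, t0)
	// Same status, different reason: the transition time stays at t0.
	conds = setCondition(conds, Condition{Type: "Created", Status: "False", Reason: "JobsCreationFailed"}, t1)
	fmt.Println(conds[0].Reason, conds[0].LastTransitionTime.Equal(t0))
	// Status flips to True: the transition time moves to t1.
	conds = setCondition(conds, Condition{Type: "Created", Status: "True", Reason: "JobsCreationSucceeded"}, t1)
	fmt.Println(conds[0].Reason, conds[0].LastTransitionTime.Equal(t1))
}
```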

@tenzen-y
Member Author

@andreyvelich I addressed all your comments. PTAL, thanks.

Comment on lines +963 to +970
created_choice --> Created=True: Succeeded to build and deploy Jobs.
created_choice --> Created=False: Failed to build and deploy Jobs.

Member

Would it be helpful to add a short message clarifying what "Failed to build Jobs" means?

Member Author

Sure.

Member Author

Done.

[*] --> created_choice: TrainJob is submitted.
created_choice --> Created=True: Succeeded to build and deploy Jobs.
created_choice --> Created=False: Failed to build and deploy Jobs.
Created=False --> Created=False: Wait for updated appropriate TrainJob.

Member

What do you mean by "Wait for updated appropriate TrainJob"?
I thought that if building or deploying Jobs fails, we transition the TrainJob to the Failed condition?

Member Author

TrainJob is a mutable object, so here we wait for the TrainJob to be updated with proper fields.

Member

@andreyvelich andreyvelich Oct 23, 2024

Some APIs in TrainJob are immutable, like managedBy, but the Trainer API is mutable, right?

Member Author

Yes, that's right. We allow users to modify the TrainJob except for some fields like managedBy.

Member

Makes sense. In that case, my question is: why do we transition the TrainJob to the Failed state if the underlying JobSet is Failed? Since the TrainJob image is mutable, the user can override it.
Or is the main motivation that once the TrainJob is in the Created state, it can only transition to Failed or Succeeded?

Member Author

In that case, my question is: why do we transition the TrainJob to the Failed state if the underlying JobSet is Failed? Since the TrainJob image is mutable, the user can override it.

AFAIK, neither a batch/v1 Job nor a JobSet can be restarted once it has reached a terminal phase (complete or failed).
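
The distinction drawn in this thread, between a TrainJob waiting in Created=False (which can still be fixed by an update) and a TrainJob in a terminal condition (which cannot be restarted), can be sketched as follows. `Condition` is a local stand-in for `metav1.Condition`, and `isTerminal` and `waitsForUpdate` are hypothetical helpers, not functions from the controller.

```go
package main

import "fmt"

// Condition is a simplified local stand-in for metav1.Condition.
type Condition struct {
	Type   string
	Status string
}

// hasStatus reports whether a condition of the given type has the given status.
func hasStatus(conds []Condition, condType, status string) bool {
	for _, c := range conds {
		if c.Type == condType && c.Status == status {
			return true
		}
	}
	return false
}

// isTerminal reports whether the TrainJob reached Complete=True or Failed=True.
// Per the discussion above, the underlying Jobs cannot be restarted from here.
func isTerminal(conds []Condition) bool {
	return hasStatus(conds, "Complete", "True") || hasStatus(conds, "Failed", "True")
}

// waitsForUpdate reports whether the TrainJob sits in Created=False without
// being terminal, i.e. an update to the TrainJob may still lead to success.
func waitsForUpdate(conds []Condition) bool {
	return !isTerminal(conds) && hasStatus(conds, "Created", "False")
}

func main() {
	buildFailed := []Condition{{Type: "Created", Status: "False"}}
	failed := []Condition{{Type: "Created", Status: "True"}, {Type: "Failed", Status: "True"}}

	fmt.Println(waitsForUpdate(buildFailed), isTerminal(buildFailed))
	fmt.Println(waitsForUpdate(failed), isTerminal(failed))
}
```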

@tenzen-y tenzen-y force-pushed the add-trainjob-state-transition branch from cdf9267 to 1cca202 Compare November 1, 2024 19:40
@tenzen-y
Member Author

tenzen-y commented Nov 1, 2024

@andreyvelich I addressed all your feedback. PTAL, thanks.

Failed=True --> [*]

#COMPLETION
terminal_choice --> Completed=True: Actual Jobs (e.g., JobSet) completed.

Member

Should we actually call this condition Succeeded to be consistent with JobSet: https://github.com/kubernetes-sigs/jobset/blob/main/api/jobset/v1alpha2/jobset_types.go#L182-L183?

From my understanding, Completed: True means that the TrainJob is in either the Failed or the Succeeded state.

Member Author

@tenzen-y tenzen-y Nov 1, 2024

Basically, the Job and JobSet succeeded (.replicatedJobsStatus.succeed) and failed (.replicatedJobsStatus.failed) counts do not represent a terminal phase; they are intermediate values. Additionally, they are not guaranteed to be consistent when the success and failure criteria conflict.

So, instead of the success count, we should rely on .status.terminalState or the conditions.
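
A small sketch of that rule, under stated assumptions: the structs below are local stand-ins rather than the real JobSet API types, the terminal state values "Completed" and "Failed" are assumed, and `terminalConditionType` is a hypothetical helper. The point is only that the per-ReplicatedJob counts are ignored when deciding whether the TrainJob is done.

```go
package main

import "fmt"

// ReplicatedJobStatus is a local stand-in holding intermediate counters.
type ReplicatedJobStatus struct {
	Succeeded int32
	Failed    int32
}

// JobSetStatus is a local stand-in for the relevant part of a JobSet status.
// TerminalState is assumed to be empty until the JobSet finishes.
type JobSetStatus struct {
	TerminalState        string
	ReplicatedJobsStatus []ReplicatedJobStatus
}

// terminalConditionType (hypothetical helper) returns the TrainJob condition
// type to set, and false while the JobSet has not reached a terminal state.
// The counters are deliberately not consulted.
func terminalConditionType(s JobSetStatus) (string, bool) {
	switch s.TerminalState {
	case "Completed":
		return "Complete", true
	case "Failed":
		return "Failed", true
	default:
		return "", false
	}
}

func main() {
	// Some Jobs already succeeded, but the JobSet is not terminal yet.
	running := JobSetStatus{ReplicatedJobsStatus: []ReplicatedJobStatus{{Succeeded: 2}}}
	fmt.Println(terminalConditionType(running))

	done := JobSetStatus{TerminalState: "Completed"}
	fmt.Println(terminalConditionType(done))
}
```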

Member

@andreyvelich andreyvelich Nov 1, 2024

Are we going to have a Success state among the TrainJob conditions?

Member Author

TrainJob has only the Suspended, Completed, Failed, and Created conditions.

@tenzen-y tenzen-y force-pushed the add-trainjob-state-transition branch from 58d8587 to b81a633 Compare November 2, 2024 20:51
Comment on lines +286 to +287
// TrainJobComplete means that the TrainJob has completed its execution.
TrainJobComplete string = "Complete"

Member Author

We use the "Complete" condition type name to align with the batch/v1 Job for now; we will then open an issue to rename "Completed" to "Complete" on the JobSet side.

@andreyvelich
Member

Thanks for the update @tenzen-y!
/lgtm
/approve

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andreyvelich

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@tenzen-y
Member Author

tenzen-y commented Nov 2, 2024

All green, thanks!

/hold cancel

@google-oss-prow google-oss-prow bot merged commit 9e46f9d into kubeflow:master Nov 2, 2024
43 checks passed
@tenzen-y tenzen-y deleted the add-trainjob-state-transition branch November 2, 2024 21:31
saileshd1402 pushed a commit to saileshd1402/training-operator that referenced this pull request Dec 2, 2024
* KEP-2170: Add the TrainJob state transition design

Signed-off-by: Yuki Iwai <[email protected]>

* Replace actual jobs with TrainJob

Signed-off-by: Yuki Iwai <[email protected]>

* Remove the JobSet conditions propagation and Add expanding runtime framework interfaces for each plugin

Signed-off-by: Yuki Iwai <[email protected]>

* Expand the Creation Failed reasons

Signed-off-by: Yuki Iwai <[email protected]>

* Rename Completed to Complete

Signed-off-by: Yuki Iwai <[email protected]>

---------

Signed-off-by: Yuki Iwai <[email protected]>
Signed-off-by: sailesh duddupudi <[email protected]>
google-oss-prow bot pushed a commit that referenced this pull request Dec 9, 2024
* Added test for create-pytorchjob.ipynb

Signed-off-by: sailesh duddupudi <[email protected]>

* fix yaml syntax

Signed-off-by: sailesh duddupudi <[email protected]>

* Fix uses path

Signed-off-by: sailesh duddupudi <[email protected]>

* Add actions/checkout

Signed-off-by: sailesh duddupudi <[email protected]>

* Add bash to action.yaml

Signed-off-by: sailesh duddupudi <[email protected]>

* Install pip dependencies step

Signed-off-by: sailesh duddupudi <[email protected]>

* Add quotes for args

Signed-off-by: sailesh duddupudi <[email protected]>

* Add jupyter

Signed-off-by: sailesh duddupudi <[email protected]>

* Add nbformat_minor: 5 to fix invalid format error

Signed-off-by: sailesh duddupudi <[email protected]>

* Fix job name

Signed-off-by: sailesh duddupudi <[email protected]>

* test papermill-args-yaml

Signed-off-by: sailesh duddupudi <[email protected]>

* testing multi line args

Signed-off-by: sailesh duddupudi <[email protected]>

* testing multi line args1

Signed-off-by: sailesh duddupudi <[email protected]>

* testing multi line args2

Signed-off-by: sailesh duddupudi <[email protected]>

* testing multi line args3

Signed-off-by: sailesh duddupudi <[email protected]>

* Parameterize sdk install

Signed-off-by: sailesh duddupudi <[email protected]>

* Remove unnecessary output

Signed-off-by: sailesh duddupudi <[email protected]>

* nbformat normailze

Signed-off-by: sailesh duddupudi <[email protected]>

* [SDK] Training Client Conditions related unit tests (#2253)

* test: add unit test for get_job_conditions function of training client

Signed-off-by: Bobbins228 <[email protected]>

* test: add unit test for is_job_created function of training client

Signed-off-by: Bobbins228 <[email protected]>

* test: add unit test for is_job_running function of training client

Signed-off-by: Bobbins228 <[email protected]>

* test: add unit test for is_job_restarting function of training client

Signed-off-by: Bobbins228 <[email protected]>

* test: add unit test for is_job_failed function of training client

Signed-off-by: Bobbins228 <[email protected]>

* test: add unit test for is_job_succeded function of training client

Signed-off-by: Bobbins228 <[email protected]>

* test: improve job condition unit tests efficiency

Signed-off-by: Bobbins228 <[email protected]>

---------

Signed-off-by: Bobbins228 <[email protected]>
Signed-off-by: sailesh duddupudi <[email protected]>

* [SDK] test: add unit test for list_jobs method of the training_client (#2267)

Signed-off-by: wei-chenglai <[email protected]>
Signed-off-by: sailesh duddupudi <[email protected]>

* KEP-2170: Generate clientset, openapi spec for the V2 APIs (#2273)

Generate clientset, informers, listers and open api spec
for v2alpha1 APIs.

Signed-off-by: Varsha Prasad Narsing <[email protected]>
Signed-off-by: sailesh duddupudi <[email protected]>

* [SDK] Use torchrun to create PyTorchJob from function (#2276)

* [SDK] Use torchrun to create PyTorchJob from function

Signed-off-by: Andrey Velichkevich <[email protected]>

* Update PyTorchJob SDK example

Signed-off-by: Andrey Velichkevich <[email protected]>

* Add consts for entrypoint

Signed-off-by: Andrey Velichkevich <[email protected]>

* Add check for num procs per worker

Signed-off-by: Andrey Velichkevich <[email protected]>

---------

Signed-off-by: Andrey Velichkevich <[email protected]>
Signed-off-by: sailesh duddupudi <[email protected]>

* [SDK] test: add unit test for get_job_logs method of the training_client (#2275)

Signed-off-by: wei-chenglai <[email protected]>
Signed-off-by: sailesh duddupudi <[email protected]>

* [v2alpha] Move GV related codebase (#2281)

Move GV related codebase in v2alpha

Signed-off-by: Varsha Prasad Narsing <[email protected]>
Signed-off-by: sailesh duddupudi <[email protected]>

* KEP-2170: Implement runtime framework (#2248)

* KEP-2170: Implement runtime framework interfaces

Signed-off-by: Yuki Iwai <[email protected]>

* Remove grep dependency

Signed-off-by: Yuki Iwai <[email protected]>

* KEP-2170: Implement ValidateObjects interface to the runtime framework

Signed-off-by: Yuki Iwai <[email protected]>

* KEP-2170: Expose the TrainingRuntime and ClusterTrainingRuntime Kind

Signed-off-by: Yuki Iwai <[email protected]>

* KEP-2170: Remove unneeded scheme field from the internal TrainingRuntime

Signed-off-by: Yuki Iwai <[email protected]>

* Rephrase the error message

Signed-off-by: Yuki Iwai <[email protected]>

* Distinguish TrainingRuntime and ClusterTrainingRuntime when creating indexes for the TrainJobs

Signed-off-by: Yuki Iwai <[email protected]>

* Propagate the TrainJob labels and annotations to the JobSet

Signed-off-by: Yuki Iwai <[email protected]>

* Remove PodAnnotations from the runtime info

Signed-off-by: Yuki Iwai <[email protected]>

* Implement TrainingRuntime ReplicatedJob validation

Signed-off-by: Yuki Iwai <[email protected]>

* Add TODO comments

Signed-off-by: Yuki Iwai <[email protected]>

* Replace queueSuspendedTrainJob with queueSuspendedTrainJobs

Signed-off-by: Yuki Iwai <[email protected]>

---------

Signed-off-by: Yuki Iwai <[email protected]>
Signed-off-by: sailesh duddupudi <[email protected]>

* Add DeepSpeed Example with Pytorch Operator (#2235)

Signed-off-by: Syulin7 <[email protected]>
Signed-off-by: sailesh duddupudi <[email protected]>

* KEP-2170: Rename TrainingRuntimeRef to RuntimeRef API (#2283)

* KEP-2170: Rename TrainingRuntimeRef to RuntimeRef API

Signed-off-by: Andrey Velichkevich <[email protected]>

* Rename RuntimeRef in runtime framework

Signed-off-by: Andrey Velichkevich <[email protected]>

---------

Signed-off-by: Andrey Velichkevich <[email protected]>
Signed-off-by: sailesh duddupudi <[email protected]>

* KEP-2170: Adding CEL validations on v2 TrainJob CRD (#2260)

Signed-off-by: Akshay Chitneni <[email protected]>
Co-authored-by: Akshay Chitneni <[email protected]>
Signed-off-by: sailesh duddupudi <[email protected]>

* Upgrade Deepspeed demo dependencies (#2294)

Signed-off-by: Syulin7 <[email protected]>
Signed-off-by: sailesh duddupudi <[email protected]>

* KEP-2170: Add manifests for Kubeflow Training V2 (#2289)

* KEP-2170: Add manifests for Kubeflow Training V2

Signed-off-by: Andrey Velichkevich <[email protected]>

* Fix invalid name for webhook config in cert

Signed-off-by: Andrey Velichkevich <[email protected]>

* Fix integration tests

Signed-off-by: Andrey Velichkevich <[email protected]>

* Move kubebuilder markers to runtime framework

Signed-off-by: Andrey Velichkevich <[email protected]>

* Use Kubernetes recommended labels

Signed-off-by: Andrey Velichkevich <[email protected]>

---------

Signed-off-by: Andrey Velichkevich <[email protected]>
Signed-off-by: sailesh duddupudi <[email protected]>

* FSDP Example for T5 Fine-Tuning and PyTorchJob (#2286)

* FSDP Example with PyTorchJob and T5 Fine-Tuning

Signed-off-by: Andrey Velichkevich <[email protected]>

* Modify text

Signed-off-by: Andrey Velichkevich <[email protected]>

---------

Signed-off-by: Andrey Velichkevich <[email protected]>
Signed-off-by: sailesh duddupudi <[email protected]>

* KEP-2170: Implement TrainJob Reconciler to manage objects (#2295)

* KEP-2170: Implement TrainJob Reconciler to manage objects

Signed-off-by: Yuki Iwai <[email protected]>

* Mode dep-crds to manifests/external-crds

Signed-off-by: Yuki Iwai <[email protected]>

* Rename run with runtime

Signed-off-by: Yuki Iwai <[email protected]>

---------

Signed-off-by: Yuki Iwai <[email protected]>
Signed-off-by: sailesh duddupudi <[email protected]>

* Remove Prometheus Monitoring doc (#2301)

Signed-off-by: Sophie <[email protected]>
Signed-off-by: sailesh duddupudi <[email protected]>

* KEP-2170: Decouple JobSet from TrainJob (#2296)

Signed-off-by: Yuki Iwai <[email protected]>
Signed-off-by: sailesh duddupudi <[email protected]>

* KEP-2170: Strictly verify the CRD marker validation and defaulting in the integration testings (#2304)

Signed-off-by: Yuki Iwai <[email protected]>
Signed-off-by: sailesh duddupudi <[email protected]>

* KEP-2170: Initialize runtimes before the manager starts (#2306)

Signed-off-by: Yuki Iwai <[email protected]>
Signed-off-by: sailesh duddupudi <[email protected]>

* KEP-2170: Generate Python SDK for Kubeflow Training V2 (#2310)

* Generate SDK models for the Training V2 APIs

Signed-off-by: Andrey Velichkevich <[email protected]>

* Create pyproject.toml config

Signed-off-by: Andrey Velichkevich <[email protected]>

* Remove comments

Signed-off-by: Andrey Velichkevich <[email protected]>

* Fix pre-commit

Signed-off-by: Andrey Velichkevich <[email protected]>

---------

Signed-off-by: Andrey Velichkevich <[email protected]>
Signed-off-by: sailesh duddupudi <[email protected]>

* KEP-2170: Create model and dataset initializers (#2303)

* KEP-2170: Create model and dataset initializers

Signed-off-by: Andrey Velichkevich <[email protected]>

* Add abstract classes

Signed-off-by: Andrey Velichkevich <[email protected]>

* Add storage URI to config

Signed-off-by: Andrey Velichkevich <[email protected]>

* Update .gitignore

Co-authored-by: Kevin Hannon <[email protected]>
Signed-off-by: Andrey Velichkevich <[email protected]>

* Fix the misspelling for initializer

Signed-off-by: Andrey Velichkevich <[email protected]>

* Add .pt and .pth to ignore_patterns

Signed-off-by: Andrey Velichkevich <[email protected]>

---------

Signed-off-by: Andrey Velichkevich <[email protected]>
Co-authored-by: Kevin Hannon <[email protected]>
Signed-off-by: sailesh duddupudi <[email protected]>

* KEP-2170: Implement JobSet, PlainML, and Torch Plugins (#2308)

* KEP-2170: Implement JobSet and PlainML Plugins

Signed-off-by: Andrey Velichkevich <[email protected]>

* Fix nil pointer exception for Trainer

Signed-off-by: Andrey Velichkevich <[email protected]>

* Fix unit tests in runtime package

Signed-off-by: Andrey Velichkevich <[email protected]>

* Fix unit tests

Signed-off-by: Andrey Velichkevich <[email protected]>

* Fix integration tests

Signed-off-by: Andrey Velichkevich <[email protected]>

* Fix lint

Signed-off-by: Andrey Velichkevich <[email protected]>

* Implement Torch Plugin

Signed-off-by: Andrey Velichkevich <[email protected]>

* Use list for the Info envs

Signed-off-by: Andrey Velichkevich <[email protected]>

* Fix golang ci

Signed-off-by: Andrey Velichkevich <[email protected]>

* Fix Torch plugin

Signed-off-by: Andrey Velichkevich <[email protected]>

* Use K8s sets
Update error return
Use ptr.Deref() for nil values

Signed-off-by: Andrey Velichkevich <[email protected]>

* Use client.Object for Build() call

Signed-off-by: Andrey Velichkevich <[email protected]>

* Remove DeepCopy

Signed-off-by: Andrey Velichkevich <[email protected]>

* Remove MLPolicy and PodGroupPolicy from the Info object

Signed-off-by: Andrey Velichkevich <[email protected]>

* Inline error

Signed-off-by: Andrey Velichkevich <[email protected]>

* Remove SDK jar file

Signed-off-by: Andrey Velichkevich <[email protected]>

* Add integration test for Torch plugin

Signed-off-by: Andrey Velichkevich <[email protected]>

* Add TODO to calculate PodGroup values in unit tests

Signed-off-by: Andrey Velichkevich <[email protected]>

* Revert the change to add original Runtime Policies to Info

Signed-off-by: Andrey Velichkevich <[email protected]>

* Create const for the DefaultJobReplicas

Signed-off-by: Andrey Velichkevich <[email protected]>

* Check if PodLabels is empty

Signed-off-by: Andrey Velichkevich <[email protected]>

---------

Signed-off-by: Andrey Velichkevich <[email protected]>
Signed-off-by: sailesh duddupudi <[email protected]>

* KEP-2170: Implement Initializer builders in the JobSet plugin  (#2316)

* KEP-2170: Implement Initializer builder in the JobSet plugin

Signed-off-by: Andrey Velichkevich <[email protected]>

* Update the SDK models

Signed-off-by: Andrey Velichkevich <[email protected]>

* Remove Info from Initializer builder

Signed-off-by: Andrey Velichkevich <[email protected]>

* Update manifests

Signed-off-by: Andrey Velichkevich <[email protected]>

* Update pkg/constants/constants.go

Co-authored-by: Yuki Iwai <[email protected]>
Signed-off-by: Andrey Velichkevich <[email protected]>

* Use var for envs

Signed-off-by: Andrey Velichkevich <[email protected]>

* Remove check manifests from GitHub actions

Signed-off-by: Andrey Velichkevich <[email protected]>

* Move consts to JobSet plugin

Signed-off-by: Andrey Velichkevich <[email protected]>

---------

Signed-off-by: Andrey Velichkevich <[email protected]>
Co-authored-by: Yuki Iwai <[email protected]>
Signed-off-by: sailesh duddupudi <[email protected]>

* KEP-2170: Add the TrainJob state transition design (#2298)

* KEP-2170: Add the TrainJob state transition design

Signed-off-by: Yuki Iwai <[email protected]>

* Replace actual jobs with TrainJob

Signed-off-by: Yuki Iwai <[email protected]>

* Remove the JobSet conditions propagation and Add expanding runtime framework interfaces for each plugin

Signed-off-by: Yuki Iwai <[email protected]>

* Expand the Creation Failed reasons

Signed-off-by: Yuki Iwai <[email protected]>

* Rename Completed to Complete

Signed-off-by: Yuki Iwai <[email protected]>

---------

Signed-off-by: Yuki Iwai <[email protected]>
Signed-off-by: sailesh duddupudi <[email protected]>

* Update tf job examples to tf v2 (#2270)

* mnist with summaries updaetd to TF v2

Signed-off-by: yelias <[email protected]>

* tf_sample updaetd to TF v2

Signed-off-by: yelias <[email protected]>

* Add mnist_utils and update dist-mnist

Signed-off-by: yelias <[email protected]>

* Add mnist_utils and update dist-mnist

Signed-off-by: yelias <[email protected]>

* Remove old estimator-API example; it has been replaced by distribution_strategy

Signed-off-by: yelias <[email protected]>

* Small fix

Signed-off-by: yelias <[email protected]>

* Remove unsupported powerPC dockerfiles

Signed-off-by: yelias <[email protected]>

* Fix typo in copyright

Signed-off-by: yelias <[email protected]>

---------

Signed-off-by: yelias <[email protected]>
Co-authored-by: yelias <[email protected]>
Signed-off-by: sailesh duddupudi <[email protected]>

* KEP-2170: Add TrainJob conditions (#2322)

* KEP-2170: Implement TrainJob conditions

Signed-off-by: Yuki Iwai <[email protected]>

* Fix API comments

Signed-off-by: Yuki Iwai <[email protected]>

* Make condition message constants

Signed-off-by: Yuki Iwai <[email protected]>

* Stop connecting condition type and reason in JobSet plugin

Signed-off-by: Yuki Iwai <[email protected]>

---------

Signed-off-by: Yuki Iwai <[email protected]>
Signed-off-by: sailesh duddupudi <[email protected]>

* Pin Gloo repository in JAX Dockerfile to a specific commit (#2329)

This commit pins the Gloo repository to a specific commit (43b7acbf) in
the JAX Dockerfile to prevent build failures caused by a recent bug
introduced in the Gloo codebase. By locking the version of Gloo to
a known working commit, we ensure that the JAX build remains stable and
functional until the issue is resolved upstream.

The build failure occurs when compiling the gloo/transport/tcp/buffer.cc
file due to an undefined __NR_gettid constant, which was introduced
after the pinned commit. By using this commit, we bypass the issue and
allow the build to complete successfully.

Signed-off-by: Sandipan Panda <[email protected]>
Signed-off-by: sailesh duddupudi <[email protected]>

* [fix] Resolve v2alpha API exceptions (#2317)

Resolve v2alpha API exceptions by adding necessary listType validations.

Signed-off-by: Varsha Prasad Narsing <[email protected]>
Signed-off-by: sailesh duddupudi <[email protected]>

* Upgrade Kubernetes to v1.30.7 (#2332)

* Upgrade Kubernetes to v1.30.7

Signed-off-by: Antonin Stefanutti <[email protected]>

* Use typed event handlers and predicates in job controllers

Signed-off-by: Antonin Stefanutti <[email protected]>

* Re-organize pkg/common/util/reconciler.go

Signed-off-by: Antonin Stefanutti <[email protected]>

* Update installation instructions in README

Signed-off-by: Antonin Stefanutti <[email protected]>

---------

Signed-off-by: Antonin Stefanutti <[email protected]>
Signed-off-by: sailesh duddupudi <[email protected]>

* Ignore cache exporting errors in the image building workflows (#2336)

Signed-off-by: Yuki Iwai <[email protected]>
Signed-off-by: sailesh duddupudi <[email protected]>

* KEP-2170: Add Torch Distributed Runtime (#2328)

* KEP-2170: Add Torch Distributed Runtime

Signed-off-by: Andrey Velichkevich <[email protected]>

* Add pip list

Signed-off-by: Andrey Velichkevich <[email protected]>

---------

Signed-off-by: Andrey Velichkevich <[email protected]>
Signed-off-by: sailesh duddupudi <[email protected]>

* Refine the server-side apply installation args (#2337)

Signed-off-by: Yuki Iwai <[email protected]>
Signed-off-by: sailesh duddupudi <[email protected]>

* Add openapi-generator CLI option to skip SDK v2 test generation (#2338)

Signed-off-by: Antonin Stefanutti <[email protected]>
Signed-off-by: sailesh duddupudi <[email protected]>

* Upgrade kustomization files to Kustomize v5 (#2326)

Signed-off-by: oksanabaza <[email protected]>
Signed-off-by: sailesh duddupudi <[email protected]>

* Pin accelerate package version in trainer (#2340)

* Pin accelerate package version in trainer

Signed-off-by: Gavrish Prabhu <[email protected]>

* include new line to pass pre-commit hook

Signed-off-by: Gavrish Prabhu <[email protected]>

---------

Signed-off-by: Gavrish Prabhu <[email protected]>
Signed-off-by: sailesh duddupudi <[email protected]>

* Replace papermill command with bash script

Signed-off-by: sailesh duddupudi <[email protected]>

* Typo fix

Signed-off-by: sailesh duddupudi <[email protected]>

* Move Checkout step outside action.yaml file

Signed-off-by: sailesh duddupudi <[email protected]>

* Add newline EOF in script

Signed-off-by: sailesh duddupudi <[email protected]>

* Pass python dependencies as args and pin versions

Signed-off-by: sailesh duddupudi <[email protected]>

* Update Usage

Signed-off-by: sailesh duddupudi <[email protected]>

* Install dependencies in yaml

Signed-off-by: sailesh duddupudi <[email protected]>

* fix ipynb

Signed-off-by: sailesh duddupudi <[email protected]>

* set bash flags

Signed-off-by: sailesh duddupudi <[email protected]>

* Update script args and add more kubernetes versions for tests

Signed-off-by: sailesh duddupudi <[email protected]>

* add gang-scheduler-name to template

Signed-off-by: sailesh duddupudi <[email protected]>

* move go setup to template

Signed-off-by: sailesh duddupudi <[email protected]>

* remove -p parameter from script

Signed-off-by: sailesh duddupudi <[email protected]>

---------

Signed-off-by: sailesh duddupudi <[email protected]>
Signed-off-by: Bobbins228 <[email protected]>
Signed-off-by: wei-chenglai <[email protected]>
Signed-off-by: Varsha Prasad Narsing <[email protected]>
Signed-off-by: Andrey Velichkevich <[email protected]>
Signed-off-by: Yuki Iwai <[email protected]>
Signed-off-by: Syulin7 <[email protected]>
Signed-off-by: Akshay Chitneni <[email protected]>
Signed-off-by: Sophie <[email protected]>
Signed-off-by: yelias <[email protected]>
Signed-off-by: Sandipan Panda <[email protected]>
Signed-off-by: Antonin Stefanutti <[email protected]>
Signed-off-by: oksanabaza <[email protected]>
Signed-off-by: Gavrish Prabhu <[email protected]>
Co-authored-by: Mark Campbell <[email protected]>
Co-authored-by: Wei-Cheng Lai <[email protected]>
Co-authored-by: Varsha <[email protected]>
Co-authored-by: Andrey Velichkevich <[email protected]>
Co-authored-by: Yuki Iwai <[email protected]>
Co-authored-by: yu lin <[email protected]>
Co-authored-by: Akshay Chitneni <[email protected]>
Co-authored-by: Akshay Chitneni <[email protected]>
Co-authored-by: Sophie Hsu <[email protected]>
Co-authored-by: Kevin Hannon <[email protected]>
Co-authored-by: YosiElias <[email protected]>
Co-authored-by: yelias <[email protected]>
Co-authored-by: Sandipan Panda <[email protected]>
Co-authored-by: Antonin Stefanutti <[email protected]>
Co-authored-by: Oksana Bazylieva <[email protected]>
Co-authored-by: Gavrish Prabhu <[email protected]>