-
Notifications
You must be signed in to change notification settings - Fork 715
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
KEP-2170: Add the TrainJob state transition design #2298
Changes from 4 commits
950f384
01628fc
e184fea
1cca202
b81a633
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change | ||||||||
---|---|---|---|---|---|---|---|---|---|---|
|
@@ -279,6 +279,43 @@ type TrainJob struct { | |||||||||
Status TrainJobStatus `json:"status,omitempty"` | ||||||||||
} | ||||||||||
|
||||||||||
const ( | ||||||||||
// TrainJobSuspended means the TrainJob is suspended. | ||||||||||
TrainJobSuspended string = "Suspended" | ||||||||||
|
||||||||||
// TrainJobCompleted means that the TrainJob has completed its execution. | ||||||||||
TrainJobCompleted string = "Completed" | ||||||||||
|
||||||||||
// TrainJobFailed means that the actual jobs have failed its execution. | ||||||||||
TrainJobFailed string = "Failed" | ||||||||||
|
||||||||||
// TrainJobCreated means that the actual jobs creation has succeeded. | ||||||||||
TrainJobCreated string = "Created" | ||||||||||
) | ||||||||||
|
||||||||||
const ( | ||||||||||
// TrainJobSuspendedReason is the "Suspended" condition reason. | ||||||||||
// When the TrainJob is suspended, this is added. | ||||||||||
TrainJobSuspendedReason string = "Suspended" | ||||||||||
|
||||||||||
// TrainJobResumedReason is the "Suspended" condition reason. | ||||||||||
// When the TrainJob suspension is changed from True to False, this is added. | ||||||||||
TrainJobResumedReason string = "Resumed" | ||||||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yes, we have. In the batch/v1 Job level, |
||||||||||
|
||||||||||
// TrainJobJobsCreationSucceededReason is the "Created" condition reason. | ||||||||||
// When the creating objects succeeded after building succeeded, this is added. | ||||||||||
TrainJobJobsCreationSucceededReason string = "JobsCreationSucceeded" | ||||||||||
|
||||||||||
// TrainJobJobsBuildFailedReason is the "Created" condition reason. | ||||||||||
// When the building objects based on the TrainJob and the specified runtime failed, | ||||||||||
// this is added. | ||||||||||
TrainJobJobsBuildFailedReason string = "JobsBuildFailed" | ||||||||||
|
||||||||||
// TrainJobJobsCreationFailedReason is "Created" condition reason. | ||||||||||
// When the creating objects failed even though building succeeded, this is added. | ||||||||||
TrainJobJobsCreationFailedReason string = "JobsCreationFailed" | ||||||||||
Comment on lines
+305
to
+316
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Should we introduce those conditions in the 2nd iteration ? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. These are reasons, so we use these in the following. Additionally, if we do not distinguish
status:
conditions:
- type: Created
status: false
reason: JobsCreationFailed status:
conditions:
- type: Created
status: false
reason: JobsBuildFailed status:
conditions:
- type: Created
status: true
reason: JobsCreationSucceeded There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Since we have validation webhook, what is the use-case you see when we can hit the JobsBuildFailed reason ? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. We can not validate in advance if NewObject succeed since we can not know in advance what happens during the NewObject. Only during reconciling, we can know the actual error. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. So, I guess one of the examples could be if NewObject calls Kubernetes API server and this call fails. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Yes, that's right. In the current JobSet plugin, the building error could happen in the following:
Additionally, every plugin could return errors during Kubeflow Job Pipeline Framerowk. |
||||||||||
) | ||||||||||
|
||||||||||
type TrainJobSpec struct { | ||||||||||
// Reference to the training runtime. | ||||||||||
// The field is immutable. | ||||||||||
|
@@ -916,6 +953,52 @@ spec: | |||||||||
claimName: user-123-volume | ||||||||||
``` | ||||||||||
|
||||||||||
### State Transition | ||||||||||
|
||||||||||
In this section, we're explaining the TrainJob state transition (`.status.conditions`). | ||||||||||
The basic TrainJob state machine is the below. | ||||||||||
Especially, if we specify the TrainingRuntime or ClusterTrainingRuntime as a runtime, | ||||||||||
the TrainJob terminal condition (`Failed` or `Completed`) is decided based on the JobSet terminal state (`status.terminalState`) | ||||||||||
instead of computing from JobSet conditions. | ||||||||||
|
||||||||||
```mermaid | ||||||||||
stateDiagram-v2 | ||||||||||
#CREATION | ||||||||||
state created_choice <<choice>> | ||||||||||
[*] --> created_choice: TrainJob is submitted. | ||||||||||
created_choice --> Created=True: Succeeded to build and deploy Jobs. | ||||||||||
created_choice --> Created=False: Failed to build and deploy Jobs. | ||||||||||
Comment on lines
+969
to
+970
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Would it be helpful to add small clarified message what does Failed to build Jobs mean ? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Sure. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Done. |
||||||||||
Created=False --> Created=False: Wait for updated appropriate TrainJob. | ||||||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. What do you mean by Wait for updated appropriate TrainJob ? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. TrainJob is the mutable object. So, we wait for updating TrainJob with proper fields here. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Some API in TrainJob is immutable, like managedBy, but the Trainer API is mutable, right ? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yes, that's right. We allow users to modify the TrainJob except for some fields like managedBy. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Make sense, in that case my question is why do we transition TrainJob in the Failed state if the underlying JobSet is Failed ? Since TrainJob image is mutable, user can override it. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
AFAIK, both batch/v1 Job, and JobSet can not be restarted once those have reached the terminal phase (complete or failed). |
||||||||||
Created=False --> Created=True: Succeeded to build and deploy Jobs. | ||||||||||
|
||||||||||
#SUSPENSION | ||||||||||
state suspended_choice <<choice>> | ||||||||||
Created=True --> suspended_choice: Handle TrainJob suspension. | ||||||||||
suspended_choice --> Suspended=True: TrainJob is suspended. | ||||||||||
Suspended=True --> Suspended=True: Wait for unsuspending. | ||||||||||
Suspended=True --> Suspended=False: TrainJob is unsuspended. | ||||||||||
suspended_choice --> Suspended=False: TrainJob is not suspended. | ||||||||||
|
||||||||||
#FAILURE | ||||||||||
state terminal_choice <<choice>> | ||||||||||
Suspended=False --> terminal_choice: Actual Jobs go to terminal phase. | ||||||||||
terminal_choice --> Failed=True: Actual Jobs (e.g., JobSet) failed. | ||||||||||
Failed=True --> [*] | ||||||||||
|
||||||||||
#COMPLETION | ||||||||||
terminal_choice --> Completed=True: Actual Jobs (e.g., JobSet) completed. | ||||||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Should we actually call this condition Succeeded to be consistent with JobSet: https://github.com/kubernetes-sigs/jobset/blob/main/api/jobset/v1alpha2/jobset_types.go#L182-L183 ? From my understanding Completed: True mens that TrainJob is in Failed or Succeeded state. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Basically, Job and JobSet succeed (.replicatedJobsStatus.succeed), and failed (.replicatedJobsStatus.failed) count are not terminal phase, and those are intermediate one. Additionally those are not guaranteed consistency when Success and Failure criteria are conflicts. So, instead of success count, we should rely on the .status.terminalState or conditions. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Are we going to have Success state for the TrainJob conditions ? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. TrainJob has only Suspend, Completed, Failed, and Created conditions. |
||||||||||
Completed=True --> [*] | ||||||||||
``` | ||||||||||
|
||||||||||
In the above state transition, the `Created=False` will happen in the following situations and | ||||||||||
those different situations can be identified by the condition reasons (`.status.conditions.[type="Created"].reason`). | ||||||||||
|
||||||||||
- `JobsBuildFailed`: When the TrainJob controller failed to construct objects (resources) using the [runtime framework interfaces](../../../pkg/runtime.v2/framework/interface.go) | ||||||||||
- `JobsCreationFailed`: When the TrainJob controller succeeded to construct objects, but it failed to deploy objects to the cluster. | ||||||||||
|
||||||||||
Additionally, we extend the [runtime framework interfaces](../../../pkg/runtime.v2/framework/interface.go) | ||||||||||
to allow each plugin to propagate the arbitrary conditions to the TrainJob. | ||||||||||
|
||||||||||
## The Training Runtime API | ||||||||||
|
||||||||||
The `TrainingRuntime` is the pre-created configurations of model training on the cluster, | ||||||||||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we define those consts as
TrainJobConditionType
similar to Batch/Job: https://github.com/kubernetes/api/blob/master/batch/v1/types.go#L617 ?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think so since our conditions typed are
metav1.Condition
and theType
is typed string.The batch/v1 Job uses the typed
JobConditionType
since Job conditions typed are dedicatedJobCondition
: https://github.com/kubernetes/api/blob/2be23f6c5a7184af9846ff458a11751765ea2bde/batch/v1/types.go#L662When we use the dedicated typed
TrainJobConditionType
, we need to cast theTrainJobConditionType
to string everywhere. That is not ideal.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, you are right.