refactor: Multi-Node Jobs ended_at time not being set and Complete status#184
refactor: Multi-Node Jobs ended_at time not being set and Complete status#184yuechao-qin wants to merge 1 commit intomasterfrom
ended_at time not being set and Complete status#184Conversation
This stack of pull requests is managed by Graphite. Learn more about stacking. |
| return datetime.datetime.now(tz=datetime.timezone.utc) | ||
|
|
||
|
|
||
| def _set_terminal_state( |
There was a problem hiding this comment.
I'm not sure this function adds any value. It might also be adding confusion as it's not obvious (from the sall site) what exactly it is doing. Let's revert it.
There was a problem hiding this comment.
I messed up on this function. I meant to make the ended_at also required, not optional. So that it's clear for any terminal state transitions it requires 1) status and 2) ended time changes. I renamed the function to be _record_terminal_state and ended_at to be required for more clarity.
Do you think this function is helpful then?
Else, do you have any suggestions on how to 1) make it clear to the developers where the terminal transitions are and 2) what required changes (status, ended_at) are needed when transitioning to a terminal state?
|
Nit: This looks like a fix combined with a refactoring. This increases the diff and makes it harder to see what has changed and what hasn't. The fix changes are hidden among refactoring changes. |
c4b7a93 to
28d034c
Compare
|
Can you please extract the Job completion fix, so that we can merge it ASAP? |
28d034c to
ec6564b
Compare
ec6564b to
610c157
Compare
ended_at time not being setended_at time not being set and Complete status
|
Created another branch that has the fix https://app.graphite.com/github/pr/TangleML/tangle/200 |
8f74583 to
75bda80
Compare
610c157 to
1831a39
Compare
1831a39 to
2b839ff
Compare

Background
Kubernetes Job conditions use
"Complete"for successful jobs and"Failed"for failed jobs. TheLaunchedKubernetesJob.ended_atproperty was checking for"Succeeded"instead of"Complete", which meantended_atwas alwaysNonefor successful K8s Jobs. This caused the API to return no end time for completed executions.Separately, when the orchestrator hit an internal error processing a running execution (SYSTEM_ERROR), it set the status but never recorded
ended_at, making it impossible to tell when the failure occurred.Refactor
JobConditionTypeandConditionStatusstatus,ended_at,exit_code,started_at) are set through a single helper to prevent future inconsistenciesChanges
cloud_pipelines_backend/launchers/kubernetes_launchers.pyJobConditionTypeandConditionStatusenums with K8s API doc references, replacing magic strings throughout the fileended_atto check("Complete", "Failed")instead of("Succeeded", "Failed")cloud_pipelines_backend/orchestrator_sql.py_set_terminal_state()helper that sets all terminal fields on aContainerExecutionin one placeended_at(was previously missing)Test