Skip to content

refactor: Multi-Node Jobs ended_at time not being set and Complete status#184

Open
yuechao-qin wants to merge 1 commit intomasterfrom
ycq/fix-multinode-ended-at
Open

refactor: Multi-Node Jobs ended_at time not being set and Complete status#184
yuechao-qin wants to merge 1 commit intomasterfrom
ycq/fix-multinode-ended-at

Conversation

@yuechao-qin
Copy link
Copy Markdown
Collaborator

@yuechao-qin yuechao-qin commented Mar 21, 2026

Background

Kubernetes Job conditions use "Complete" for successful jobs and "Failed" for failed jobs. The LaunchedKubernetesJob.ended_at property was checking for "Succeeded" instead of "Complete", which meant ended_at was always None for successful K8s Jobs. This caused the API to return no end time for completed executions.

Separately, when the orchestrator hit an internal error processing a running execution (SYSTEM_ERROR), it set the status but never recorded ended_at, making it impossible to tell when the failure occurred.

Refactor

  • Created enums to set the JobConditionType and ConditionStatus
  • All terminal state fields (status, ended_at, exit_code, started_at) are set through a single helper to prevent future inconsistencies

Changes

cloud_pipelines_backend/launchers/kubernetes_launchers.py

  • Added JobConditionType and ConditionStatus enums with K8s API doc references, replacing magic strings throughout the file
  • Fixed ended_at to check ("Complete", "Failed") instead of ("Succeeded", "Failed")

cloud_pipelines_backend/orchestrator_sql.py

  • Added _set_terminal_state() helper that sets all terminal fields on a ContainerExecution in one place
  • Refactored SUCCEEDED, FAILED, CANCELLED, and SYSTEM_ERROR branches to use the helper
  • SYSTEM_ERROR branch now sets ended_at (was previously missing)

Test

uv run pytest tests/test_kubernetes_launchers.py tests/test_orchestrator_terminal_state.py

Copy link
Copy Markdown
Collaborator Author

yuechao-qin commented Mar 21, 2026

@yuechao-qin yuechao-qin marked this pull request as ready for review March 21, 2026 03:47
@yuechao-qin yuechao-qin requested a review from Ark-kun as a code owner March 21, 2026 03:47
return datetime.datetime.now(tz=datetime.timezone.utc)


def _set_terminal_state(
Copy link
Copy Markdown
Contributor

@Ark-kun Ark-kun Mar 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure this function adds any value. It might also be adding confusion as it's not obvious (from the sall site) what exactly it is doing. Let's revert it.

Copy link
Copy Markdown
Collaborator Author

@yuechao-qin yuechao-qin Mar 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I messed up on this function. I meant to make the ended_at also required, not optional. So that it's clear for any terminal state transitions it requires 1) status and 2) ended time changes. I renamed the function to be _record_terminal_state and ended_at to be required for more clarity.

Do you think this function is helpful then?

Else, do you have any suggestions on how to 1) make it clear to the developers where the terminal transitions are and 2) what required changes (status, ended_at) are needed when transitioning to a terminal state?

Copy link
Copy Markdown
Contributor

Ark-kun commented Mar 25, 2026

Nit: This looks like a fix combined with a refactoring. This increases the diff and makes it harder to see what has changed and what hasn't. The fix changes are hidden among refactoring changes.

@yuechao-qin yuechao-qin force-pushed the ycq/fix-multinode-ended-at branch from c4b7a93 to 28d034c Compare March 28, 2026 00:38
@yuechao-qin yuechao-qin requested a review from Ark-kun March 28, 2026 01:12
@yuechao-qin yuechao-qin changed the title fix: Multi-Node Jobs ended_at time not being set d Apr 7, 2026
@Ark-kun
Copy link
Copy Markdown
Contributor

Ark-kun commented Apr 7, 2026

Can you please extract the Job completion fix, so that we can merge it ASAP?

@yuechao-qin yuechao-qin changed the title d fix: Multi-Node Jobs ended_at time not being set Apr 7, 2026
@yuechao-qin yuechao-qin force-pushed the ycq/fix-multinode-ended-at branch from 28d034c to ec6564b Compare April 7, 2026 22:12
@yuechao-qin yuechao-qin changed the base branch from master to graphite-base/184 April 7, 2026 22:44
@yuechao-qin yuechao-qin force-pushed the ycq/fix-multinode-ended-at branch from ec6564b to 610c157 Compare April 7, 2026 22:44
@yuechao-qin yuechao-qin changed the base branch from graphite-base/184 to ycq/fix-only-multinode-issues April 7, 2026 22:45
@yuechao-qin yuechao-qin changed the title fix: Multi-Node Jobs ended_at time not being set refactor: Multi-Node Jobs ended_at time not being set and Complete status Apr 7, 2026
Copy link
Copy Markdown
Collaborator Author

Created another branch that has the fix https://app.graphite.com/github/pr/TangleML/tangle/200
This branch is for refactoring.

@yuechao-qin yuechao-qin changed the base branch from ycq/fix-only-multinode-issues to graphite-base/184 April 9, 2026 22:55
@yuechao-qin yuechao-qin force-pushed the ycq/fix-multinode-ended-at branch from 610c157 to 1831a39 Compare April 9, 2026 22:55
@graphite-app graphite-app bot changed the base branch from graphite-base/184 to master April 9, 2026 22:55
@yuechao-qin yuechao-qin force-pushed the ycq/fix-multinode-ended-at branch from 1831a39 to 2b839ff Compare April 9, 2026 22:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants