Skip to content

🐛 fix: (boxcutter) Enable archival and spec changes for ProgressDeadlineExceeded revisions#2610

Open
camilamacedo86 wants to merge 1 commit intooperator-framework:mainfrom
camilamacedo86:fix-transitions
Open

🐛 fix: (boxcutter) Enable archival and spec changes for ProgressDeadlineExceeded revisions#2610
camilamacedo86 wants to merge 1 commit intooperator-framework:mainfrom
camilamacedo86:fix-transitions

Conversation

@camilamacedo86
Copy link
Copy Markdown
Contributor

@camilamacedo86 camilamacedo86 commented Mar 31, 2026

How it works on main

The COS reconciler checks if a revision took too long to roll out. If it did, it sets
Progressing=False/ProgressDeadlineExceeded.

To avoid a reconcile loop after that, main has a watch predicate that blocks all
updates to COS objects with ProgressDeadlineExceeded. The predicate only allows
deletions through.

What is the problem?

The predicate blocks too much. It blocks all updates, not just status updates.

Example: a new revision rolls out and tries to archive the old stuck one by patching
lifecycleState: Archived. The predicate sees ProgressDeadlineExceeded and drops the
event. The old revision never gets archived.

Example scenario

  1. User installs a ClusterExtension. The CE controller creates COS-rev-1.
  2. COS-rev-1 gets stuck (e.g. a Deployment never becomes ready). After
    ProgressDeadlineMinutes, the reconciler sets Progressing=False/ProgressDeadlineExceeded.
  3. User updates the ClusterExtension. The CE controller creates COS-rev-2.
  4. COS-rev-2 rolls out successfully. It patches COS-rev-1 with lifecycleState: Archived
    so the old revision gets cleaned up.
  5. The watch predicate sees COS-rev-1 has ProgressDeadlineExceeded and drops the event.
    COS-rev-1 never reconciles, never processes the archival, and stays stuck forever.

Why does the reconcile loop happen?

Every time the reconciler runs on a deadline-exceeded COS, it calls markAsProgressing()
which sets Progressing=True. Then the deadline check immediately sets it back to
Progressing=False/ProgressDeadlineExceeded. This back-and-forth produces a status
update, which triggers another reconcile, and so on.

The predicate was added to break this loop. But it also breaks archiving.

What is the fix?

1. Make ProgressDeadlineExceeded sticky (prevents reconcile loop)

Once ProgressDeadlineExceeded is set, markAsProgressing() will not overwrite
it with RollingOut for Active revisions. This prevents the status flip-flop
that caused the reconcile loop.

The guard only blocks RollingOut. Other reasons (Succeeded, Retrying) can
still update the condition because they represent meaningful state changes.

With no flip-flop, there are no unnecessary status updates, so no reconcile
loop. The watch predicate is removed.

2. Skip progress deadline enforcement for Archived revisions

The deadline check now skips Archived revisions:

if pd > 0 && lifecycleState != Archived {
    // deadline check logic
}

This allows Archived revisions to retry teardown operations and update their
status without being overridden by the deadline check.

3. Detect spec changes during rollout (resolution digest)

A SHA256 digest of all catalog filter inputs (packageName, version, channels,
selector, upgradeConstraintPolicy) is stored in the revision's
olm.operatorframework.io/resolution-digest annotation.

Before reusing a rolling-out revision, the controller compares the current
spec's digest with the revision's digest. If they differ, a new revision is
created.

This ensures spec changes are not ignored when a revision has
ProgressDeadlineExceeded.

How it works

  1. Archival: A new successful revision can archive old revisions with
    ProgressDeadlineExceeded
  2. Deletion: Users can delete ClusterExtensions with stuck revisions
  3. Spec changes: Users can update catalog filters even while a revision is rolling out
  4. Error visibility: Retrying conditions are visible even after deadline exceeded
  5. No reconcile loops: Status updates stabilize once deadline is exceeded

Motivation

Addresses feedback from:
openshift/operator-framework-operator-controller#687 (comment)

Copilot AI review requested due to automatic review settings March 31, 2026 05:30
@netlify
Copy link
Copy Markdown

netlify bot commented Mar 31, 2026

Deploy Preview for olmv1 ready!

Name Link
🔨 Latest commit a4ce4b3
🔍 Latest deploy log https://app.netlify.com/projects/olmv1/deploys/69d68a3af8553c0009fe97d9
😎 Deploy Preview https://deploy-preview-2610--olmv1.netlify.app
📱 Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify project configuration.

@openshift-ci
Copy link
Copy Markdown

openshift-ci bot commented Mar 31, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign oceanc80 for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@camilamacedo86 camilamacedo86 changed the title 🐛 fix: Allow spec changes after ProgressDeadlineExceeded 🐛 fix: (boxcutter) Allow spec changes after ProgressDeadlineExceeded Mar 31, 2026
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Updates the ClusterObjectSet controller’s update-event predicate so that once a ClusterObjectSet hits ProgressDeadlineExceeded, spec-driven updates (identified via metadata.generation changes) are still reconciled while status-only updates remain suppressed—preventing resources from getting stuck and unable to transition (e.g., to Archived).

Changes:

  • Adjust skipProgressDeadlineExceededPredicate to allow updates when ObjectOld.Generation != ObjectNew.Generation.
  • Preserve the existing behavior of suppressing status-only update churn after ProgressDeadlineExceeded.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@codecov
Copy link
Copy Markdown

codecov bot commented Mar 31, 2026

Codecov Report

❌ Patch coverage is 81.53846% with 12 lines in your changes missing coverage. Please review.
✅ Project coverage is 68.94%. Comparing base (4510b1b) to head (a4ce4b3).
⚠️ Report is 6 commits behind head on main.

Files with missing lines Patch % Lines
...er/controllers/clusterextension_reconcile_steps.go 72.09% 6 Missing and 6 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2610      +/-   ##
==========================================
- Coverage   68.95%   68.94%   -0.01%     
==========================================
  Files         139      139              
  Lines        9891     9922      +31     
==========================================
+ Hits         6820     6841      +21     
- Misses       2562     2567       +5     
- Partials      509      514       +5     
Flag Coverage Δ
e2e 37.62% <40.00%> (+0.07%) ⬆️
experimental-e2e 52.24% <63.07%> (-0.02%) ⬇️
unit 53.64% <60.00%> (+0.04%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

func (c *ClusterObjectSetReconciler) SetupWithManager(mgr ctrl.Manager) error {
skipProgressDeadlineExceededPredicate := predicate.Funcs{
UpdateFunc: func(e event.UpdateEvent) bool {
rev, ok := e.ObjectNew.(*ocv1.ClusterObjectSet)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would extract the whole update func into standalone function, not just the logic under shouldAllowProgressDeadlineExceededUpdate

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pedjak
I think we might can remove the predicate.
See the current proposal now.
WDYT? I tried to update the PR desc as well

Copilot AI review requested due to automatic review settings March 31, 2026 07:32
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated no new comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@joelanford
Copy link
Copy Markdown
Member

Maybe I misunderstand the progress deadline exceeded stuff. But I thought exceeding the deadline literally only meant that we set Progressing=False?

But otherwise all other functionality and decision making around reconciliation of the CE works the same as it does prior to the deadline exceeding.

If that's not the case, why?

@camilamacedo86 camilamacedo86 changed the title 🐛 fix: (boxcutter) Allow spec changes after ProgressDeadlineExceeded WIP 🐛 fix: (boxcutter) Allow spec changes after ProgressDeadlineExceeded Apr 1, 2026
@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 1, 2026
@camilamacedo86 camilamacedo86 changed the title WIP 🐛 fix: (boxcutter) Allow spec changes after ProgressDeadlineExceeded 🐛 fix: (boxcutter) Allow spec changes after ProgressDeadlineExceeded Apr 1, 2026
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 1, 2026
Copilot AI review requested due to automatic review settings April 1, 2026 06:16
@camilamacedo86
Copy link
Copy Markdown
Contributor Author

Hi @joelanford

I changed the proposal here. I think it might be a better way to solve it.
Also, with the Claude I update the description with a description of this.
I tried to answer your questions. Thank you for your time.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copy link
Copy Markdown
Contributor

@pedjak pedjak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Three suggestions:

  1. Add a comment explaining the ReasonSucceeded carve-out. The guard uses reason != ReasonSucceeded, which means any new reason added in the future is silently blocked by default. That's the safe default, but it's a hidden constraint — a short comment explaining why only Succeeded is exempt would prevent a future contributor from accidentally breaking this.

  2. Add a debug log in the early return. When the guard fires, it silently swallows a state transition with zero signal. A V(1) log line like "skipping markAsProgressing: ProgressDeadlineExceeded is sticky" would help debugging when someone is troubleshooting why a COS condition isn't updating.

  3. Consider an integration test for the archival scenario (follow-up). The unit tests cover markAsProgressing well, but the motivating bug (archival blocked after deadline exceeded) isn't tested e2e. A test simulating deadline exceeded → lifecycleState: Archived → verify reconcile proceeds would cover the actual user-facing scenario.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@camilamacedo86 camilamacedo86 changed the title 🐛 fix: (boxcutter) Allow spec changes after ProgressDeadlineExceeded WIP 🐛 fix: (boxcutter) Allow spec changes after ProgressDeadlineExceeded Apr 8, 2026
@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 8, 2026
@camilamacedo86 camilamacedo86 requested a review from Copilot April 8, 2026 09:44
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 1 comment.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated no new comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@camilamacedo86 camilamacedo86 changed the title WIP 🐛 fix: (boxcutter) Allow spec changes after ProgressDeadlineExceeded 🐛 fix: (boxcutter) Allow spec changes after ProgressDeadlineExceeded Apr 8, 2026
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 8, 2026
@camilamacedo86 camilamacedo86 requested review from Copilot and pedjak April 8, 2026 11:56
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 1 comment.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 9 out of 9 changed files in this pull request and generated 2 comments.

Comments suppressed due to low confidence (1)

internal/operator-controller/controllers/clusterextension_reconcile_steps.go:1

  • The digest is sensitive to ordering in slices/arrays (e.g., Channels, MatchExpressions) even when different orderings may be semantically equivalent. This can cause unnecessary “spec changed → re-resolve” behavior. Consider canonicalizing inputs before hashing (e.g., sort Channels, and convert metav1.LabelSelector to a normalized string form via metav1.LabelSelectorAsSelector(...).String() or explicit sorting of requirements) to reduce unintended churn.
/*

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 9 out of 9 changed files in this pull request and generated 3 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@camilamacedo86 camilamacedo86 changed the title 🐛 fix: (boxcutter) Allow spec changes after ProgressDeadlineExceeded 🐛 fix: (boxcutter) Enable archival and spec changes for ProgressDeadlineExceeded revisions Apr 8, 2026
This commit addresses the reconcile loop and archival timeout issues when
ProgressDeadlineExceeded is set on a ClusterObjectSet.

Generated-By: Claude
@tmshort tmshort changed the title 🐛 fix: (boxcutter) Enable archival and spec changes for ProgressDeadlineExceeded revisions 🐛 fix: (boxcutter) Enable archival and spec changes for ProgressDeadlineExceeded revisions Apr 9, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants