Fix export task not being killed during s3 outage #1564

Open
arthurpassos wants to merge 2 commits into antalya-26.1 from fix_s3_outage_preventing_export_from_being_cancelled

@arthurpassos
Collaborator

Changelog category (leave one):

  • Bug Fix (user-visible misbehavior in an official stable release)

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

The drop table operation must signal cancellation to all background tasks and wait until they acknowledge it. Tasks do this by checking the is_cancelled flag at each pipeline iteration. If S3 is unreachable and s3_retry_attempts is large (500 by default), the pipeline gets stuck deep in the AWS SDK retry loop and never gets a chance to check the flag, making the task "unkillable".

This PR fixes it in a hackish way by overriding the query_is_cancelled_predicate, which the S3 client retry strategy checks in ShouldRetry.
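
To illustrate the mechanism described above, here is a minimal, self-contained sketch of a retry loop that consults a cancellation predicate before each retry. It is not the AWS SDK or the ClickHouse code: `RetryStrategySketch`, `shouldRetry`, and `runFailingRequest` are hypothetical names made up for this example, and the request is simulated as always failing to mimic an unreachable S3.

```cpp
#include <atomic>
#include <cassert>
#include <functional>

// Hypothetical stand-in for an S3 retry strategy (not the AWS SDK or ClickHouse API).
struct RetryStrategySketch
{
    int max_attempts;                    // plays the role of the retry-attempts setting
    std::function<bool()> is_cancelled;  // cancellation predicate, may be empty

    bool shouldRetry(int attempts_so_far) const
    {
        // Stop retrying as soon as cancellation has been requested.
        if (is_cancelled && is_cancelled())
            return false;
        return attempts_so_far < max_attempts;
    }
};

// Simulates a request that always fails, as if S3 were unreachable.
// The cancel flag is set after `cancel_after` attempts, mimicking a
// drop table that arrives in the middle of the outage.
// Returns the number of attempts made before giving up.
inline int runFailingRequest(
    const RetryStrategySketch & strategy, std::atomic<bool> & cancel_flag, int cancel_after)
{
    int attempts = 0;
    do
    {
        ++attempts;  // the attempt itself fails
        if (attempts == cancel_after)
            cancel_flag.store(true);
    } while (strategy.shouldRetry(attempts));
    return attempts;
}
```

With a predicate wired to the flag, the loop exits on the first retry check after cancellation. Without one, it runs until `max_attempts` is exhausted, which is the "unkillable" behaviour the description refers to.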

Documentation entry for user-facing changes

...

CI/CD Options

Exclude tests:

  • Fast test
  • Integration Tests
  • Stateless tests
  • Stateful tests
  • Performance tests
  • All with ASAN
  • All with TSAN
  • All with MSAN
  • All with UBSAN
  • All with Coverage
  • All with Aarch64
  • All Regression
  • Disable CI Cache

Regression jobs to run:

  • Fast suites (mostly <1h)
  • Aggregate Functions (2h)
  • Alter (1.5h)
  • Benchmark (30m)
  • ClickHouse Keeper (1h)
  • Iceberg (2h)
  • LDAP (1h)
  • Parquet (1.5h)
  • RBAC (1.5h)
  • SSL Server (1h)
  • S3 (2h)
  • S3 Export (2h)
  • Swarms (30m)
  • Tiered Storage (2h)

@arthurpassos
Collaborator Author

@codex review

@github-actions
github-actions bot commented Mar 20, 2026

Workflow [PR], commit [f15590e]

@arthurpassos arthurpassos added antalya port-antalya PRs to be ported to all new Antalya releases antalya-26.1 labels Mar 20, 2026
@arthurpassos
Collaborator Author

AI audit note: This review comment was generated by AI (gpt-5.3-codex).

Audit update for PR #1564 (export-task cancellation during S3 outage):
Reviewed diff: PR page, raw diff

No confirmed defects in reviewed scope.

Coverage summary:

Scope reviewed: src/Common/ThreadStatus.h and src/Storages/MergeTree/ExportPartTask.cpp, including call path into CurrentThread::get().isQueryCanceled() used by S3 retry logic.
Categories failed: none.
Categories passed: call-graph consistency, transition invariants, branch outcomes (success/cancel/exception), cancellation propagation to S3 retry, multithreaded interleavings (task thread vs cancel caller), rollback/partial-update safety in exception path, C++ bug classes (lifetime/race/deadlock/exception-safety/UB) within changed code.
Assumptions/limits: static audit only (no runtime fault injection execution), constrained to the provided PR diff and directly affected call paths.

@chatgpt-codex-connector

Codex Review: Didn't find any major issues. Chef's kiss.

@arthurpassos
Collaborator Author

I'll see if I can add tests (I actually already have those, but for some reason they were not failing :))

@arthurpassos
Collaborator Author

> I'll see if I can add tests (I actually already have those, but for some reason they were not failing :))

I think I know why. Blocking S3 communication with iptables was probably throwing a non-retryable exception, so the export failed fast and the issue never showed up.

Comment on lines +189 to +191
(*exports_list_entry)->thread_group->setCancelPredicate(
[this]() -> bool { return isCancelled(); });

Member

Are you sure that the lifetime of the task exceeds the lifetime of the thread group at all times? Maybe capture a weak pointer instead of `this`?

Collaborator Author

So... this came to mind when I was writing this code. ThreadGroup is a member of ExportListEntry, which is tied to the lifetime of this task, so I assumed it would be valid. At the same time, ThreadGroupPtr is a shared_ptr, and it gets passed down to the pipeline and ThreadGroupSwitcher, so I am not actually sure about it... Too much wizardry.

Maybe a weak_ptr would indeed be safer.
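
For reference, a rough sketch of what the weak-pointer capture suggested in this thread could look like. It assumes the task is owned by a `shared_ptr` and can inherit from `std::enable_shared_from_this`, which the diff does not confirm. `TaskSketch` and `makeCancelPredicate` are hypothetical names; only `isCancelled()` mirrors the reviewed code. Treating a destroyed task as cancelled is also an assumption of this sketch.

```cpp
#include <atomic>
#include <cassert>
#include <functional>
#include <memory>

// Hypothetical task type; only isCancelled() mirrors a name from the diff.
class TaskSketch : public std::enable_shared_from_this<TaskSketch>
{
public:
    void cancel() { cancelled.store(true); }
    bool isCancelled() const { return cancelled.load(); }

    // Builds a predicate that does not extend the task's lifetime and
    // remains safe to call after the task has been destroyed.
    // Must be called on an object that is owned by a shared_ptr.
    std::function<bool()> makeCancelPredicate()
    {
        std::weak_ptr<TaskSketch> weak = weak_from_this();
        return [weak]() -> bool
        {
            if (auto task = weak.lock())
                return task->isCancelled();
            return true;  // assumption: a destroyed task counts as cancelled
        };
    }

private:
    std::atomic<bool> cancelled{false};
};
```

The predicate can then outlive the task without dangling: if the thread group keeps the callback alive longer than the task, `lock()` fails and the callback reports cancellation instead of dereferencing a dead `this`.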

Labels

antalya antalya-26.1 port-antalya PRs to be ported to all new Antalya releases

3 participants