Fix export task not being killed during s3 outage#1564
arthurpassos wants to merge 2 commits into antalya-26.1
Conversation
@codex review
AI audit note: This review comment was generated by AI (gpt-5.3-codex). Audit update for PR #1564 (export-task cancellation during S3 outage): No confirmed defects in reviewed scope. Coverage summary: Scope reviewed: src/Common/ThreadStatus.h and src/Storages/MergeTree/ExportPartTask.cpp, including the call path into CurrentThread::get().isQueryCanceled() used by the S3 retry logic.
Codex Review: Didn't find any major issues. Chef's kiss.
I'll see if I can add tests (I actually already have some, but for some reason they were not failing :))
I think I know why. Blocking S3 communication with iptables was probably throwing a non-retryable exception, so the export failed fast and no issue appeared at all.
(*exports_list_entry)->thread_group->setCancelPredicate(
    [this]() -> bool { return isCancelled(); });
|
Are you sure that lifetime of the task exceeds lifetime of the thread group at all times? Maybe capture a weak pointer instead of this?
So... when writing this code, this came to mind. The thing is that ThreadGroup is a member of ExportListEntry, which is tied to the lifetime of this task, so I assumed it would stay valid. At the same time, ThreadGroupPtr is a shared_ptr, and it gets passed down to the pipeline and ThreadGroupSwitcher, so I am not actually sure about it... Too much wizardry.
Maybe a weak_ptr would indeed be safer.
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
The drop table operation must signal cancellation to all background tasks and wait until they acknowledge it. This is done by checking the is_cancelled flag at each pipeline iteration. If S3 is unreachable and s3_retries_attempt is large (500 by default), the pipeline gets stuck deep in the AWS SDK and never gets a chance to check the signal/flag, making the task "unkillable". This PR fixes it in a hackish way by overwriting the query_is_cancelled_predicate, which is checked by the S3 client retry strategy upon ShouldRetry.

Documentation entry for user-facing changes
...
CI/CD Options
Exclude tests:
Regression jobs to run: