Abandon alarms that are deleted due too many retries#6396
Abandon alarms that are deleted due too many retries#6396
Conversation
This method should be called when abandoning an alarm, without having it successfuly run
Abandons alarms when the maximum amount of retries is reached
Abandons sqlite alarms that hit the maximum retries
4fa841e to
f7af926
Compare
There was a problem hiding this comment.
I'm Bonk, and I've done a quick review of your PR.
This PR adds an abandonAlarm RPC call and implementations in ActorCache/ActorSqlite to clear stale alarm state after AlarmManager gives up retrying, gated behind an autogate.
Issues found (ranked by severity):
-
[HIGH] Race condition:
scheduledTimeis accepted but never checked -- BothActorCache::abandonAlarmandActorSqlite::abandonAlarmmarkscheduledTimeas[[maybe_unused]]. If the user sets a new alarm between the last failed retry and theabandonAlarmcall, and that new alarm has already flushed to CLEAN status,abandonAlarmwill incorrectly clear it. The time should be compared before clearing. -
[MEDIUM]
AlarmScheduler(localworkerd servepath) does not callabandonAlarm--AlarmScheduler::makeAlarmTaskinalarm-scheduler.c++callsdeleteAlarmwhencountedRetry >= RETRY_MAX_TRIES(line 239-240) but never callsabandonAlarmon the actor. The same stale-cache bug this PR fixes for production will still occur in local development. Consider adding anabandonAlarmcall beforedeleteAlarmin that code path. -
[LOW] Missing
KJ_DEFERfor autogate cleanup in "without fix" tests -- The tests that demonstrate the bug calldeinitAutogate()at the top but don't restore state on exit withKJ_DEFER, unlike the other tests.
These comments were generated by an AI code review assistant and may not be perfect. Please verify the suggestions before applying them.
| kj::Promise<void> ActorCache::abandonAlarm([[maybe_unused]] kj::Date scheduledTime) { | ||
| // Called when AlarmManager has given up retrying an alarm after too many counted failures. | ||
| // Clear the in-memory alarm state so getAlarm() returns null instead of a stale time. | ||
| // AlarmManager already deleted the CRDB row directly; we only need to update our cache. | ||
| if (!util::Autogate::isEnabled(util::AutogateKey::ACTOR_ALARM_ABANDONED_CLEANUP)) { | ||
| return kj::READY_NOW; | ||
| } | ||
| // Only clear if we still have a stale KnownAlarmTime. If the user has already set a new | ||
| // alarm (KnownAlarmTime{DIRTY}) or if we're mid-handler (DeferredAlarmDelete), leave it. | ||
| KJ_IF_SOME(t, currentAlarmTime.tryGet<KnownAlarmTime>()) { | ||
| if (t.status == KnownAlarmTime::Status::CLEAN) { | ||
| currentAlarmTime = KnownAlarmTime{ | ||
| .status = KnownAlarmTime::Status::CLEAN, .time = kj::none, .noCache = t.noCache}; | ||
| } | ||
| } | ||
| return kj::READY_NOW; |
There was a problem hiding this comment.
[HIGH] scheduledTime is [[maybe_unused]] but should be checked against the cached alarm time to prevent a race condition.
Scenario: (1) alarm at time T fails 6 times → KnownAlarmTime{CLEAN, T}, (2) user sets new alarm at T2 → DIRTY → flushes → KnownAlarmTime{CLEAN, T2}, (3) abandonAlarm(T) arrives → sees CLEAN → clears it → user's new alarm at T2 is silently lost.
The fix is to compare t.time against scheduledTime before clearing:
| kj::Promise<void> ActorCache::abandonAlarm([[maybe_unused]] kj::Date scheduledTime) { | |
| // Called when AlarmManager has given up retrying an alarm after too many counted failures. | |
| // Clear the in-memory alarm state so getAlarm() returns null instead of a stale time. | |
| // AlarmManager already deleted the CRDB row directly; we only need to update our cache. | |
| if (!util::Autogate::isEnabled(util::AutogateKey::ACTOR_ALARM_ABANDONED_CLEANUP)) { | |
| return kj::READY_NOW; | |
| } | |
| // Only clear if we still have a stale KnownAlarmTime. If the user has already set a new | |
| // alarm (KnownAlarmTime{DIRTY}) or if we're mid-handler (DeferredAlarmDelete), leave it. | |
| KJ_IF_SOME(t, currentAlarmTime.tryGet<KnownAlarmTime>()) { | |
| if (t.status == KnownAlarmTime::Status::CLEAN) { | |
| currentAlarmTime = KnownAlarmTime{ | |
| .status = KnownAlarmTime::Status::CLEAN, .time = kj::none, .noCache = t.noCache}; | |
| } | |
| } | |
| return kj::READY_NOW; | |
| kj::Promise<void> ActorCache::abandonAlarm(kj::Date scheduledTime) { | |
| // Called when AlarmManager has given up retrying an alarm after too many counted failures. | |
| // Clear the in-memory alarm state so getAlarm() returns null instead of a stale time. | |
| // AlarmManager already deleted the CRDB row directly; we only need to update our cache. | |
| if (!util::Autogate::isEnabled(util::AutogateKey::ACTOR_ALARM_ABANDONED_CLEANUP)) { | |
| return kj::READY_NOW; | |
| } | |
| // Only clear if we still have a stale KnownAlarmTime matching the abandoned alarm. | |
| // If the user has already set a new alarm (different time, DIRTY status), or if | |
| // we're mid-handler (DeferredAlarmDelete), leave it alone. | |
| KJ_IF_SOME(t, currentAlarmTime.tryGet<KnownAlarmTime>()) { | |
| if (t.status == KnownAlarmTime::Status::CLEAN && t.time == scheduledTime) { | |
| currentAlarmTime = KnownAlarmTime{ | |
| .status = KnownAlarmTime::Status::CLEAN, .time = kj::none, .noCache = t.noCache}; | |
| } | |
| } | |
| return kj::READY_NOW; |
| kj::Promise<void> ActorSqlite::abandonAlarm([[maybe_unused]] kj::Date scheduledTime) { | ||
| // Called when AlarmManager has given up retrying an alarm after too many counted failures. | ||
| // Clear the alarm from SQLite so getAlarm() returns null instead of a stale time. | ||
| // AlarmManager already deleted the CRDB row; we only need to update SQLite. | ||
| if (!util::Autogate::isEnabled(util::AutogateKey::ACTOR_ALARM_ABANDONED_CLEANUP)) { | ||
| return kj::READY_NOW; | ||
| } | ||
| // Only clear if SQLite currently has an alarm and we're not mid-handler. | ||
| // If the user has set a new alarm since the handler last ran, leave it alone. | ||
| if (inAlarmHandler) { | ||
| // Shouldn't happen -- AlarmManager shouldn't call abandonAlarm while a handler is running. | ||
| LOG_WARNING_ONCE("abandonAlarm called while alarm handler is still running"); | ||
| return kj::READY_NOW; | ||
| } | ||
| if (metadata.getAlarm() != kj::none) { | ||
| setAlarm(kj::none, {}, nullptr); | ||
| } | ||
| return kj::READY_NOW; | ||
| } |
There was a problem hiding this comment.
[HIGH] Same race condition as in ActorCache::abandonAlarm -- scheduledTime is [[maybe_unused]] but should be compared against the current alarm time before clearing.
If the user called setAlarm(T2) after the last retry failure and before abandonAlarm(T) arrives, this will incorrectly delete the new alarm.
| kj::Promise<void> ActorSqlite::abandonAlarm([[maybe_unused]] kj::Date scheduledTime) { | |
| // Called when AlarmManager has given up retrying an alarm after too many counted failures. | |
| // Clear the alarm from SQLite so getAlarm() returns null instead of a stale time. | |
| // AlarmManager already deleted the CRDB row; we only need to update SQLite. | |
| if (!util::Autogate::isEnabled(util::AutogateKey::ACTOR_ALARM_ABANDONED_CLEANUP)) { | |
| return kj::READY_NOW; | |
| } | |
| // Only clear if SQLite currently has an alarm and we're not mid-handler. | |
| // If the user has set a new alarm since the handler last ran, leave it alone. | |
| if (inAlarmHandler) { | |
| // Shouldn't happen -- AlarmManager shouldn't call abandonAlarm while a handler is running. | |
| LOG_WARNING_ONCE("abandonAlarm called while alarm handler is still running"); | |
| return kj::READY_NOW; | |
| } | |
| if (metadata.getAlarm() != kj::none) { | |
| setAlarm(kj::none, {}, nullptr); | |
| } | |
| return kj::READY_NOW; | |
| } | |
| kj::Promise<void> ActorSqlite::abandonAlarm(kj::Date scheduledTime) { | |
| // Called when AlarmManager has given up retrying an alarm after too many counted failures. | |
| // Clear the alarm from SQLite so getAlarm() returns null instead of a stale time. | |
| // AlarmManager already deleted the CRDB row; we only need to update SQLite. | |
| if (!util::Autogate::isEnabled(util::AutogateKey::ACTOR_ALARM_ABANDONED_CLEANUP)) { | |
| return kj::READY_NOW; | |
| } | |
| // Only clear if SQLite currently has an alarm matching the abandoned time | |
| // and we're not mid-handler. | |
| // If the user has set a new alarm since the handler last ran, leave it alone. | |
| if (inAlarmHandler) { | |
| // Shouldn't happen -- AlarmManager shouldn't call abandonAlarm while a handler is running. | |
| LOG_WARNING_ONCE("abandonAlarm called while alarm handler is still running"); | |
| return kj::READY_NOW; | |
| } | |
| KJ_IF_SOME(currentTime, metadata.getAlarm()) { | |
| if (currentTime == scheduledTime) { | |
| setAlarm(kj::none, {}, nullptr); | |
| } | |
| } | |
| return kj::READY_NOW; |
| // KnownAlarmTime{CLEAN, T} and getAlarm() returns T from cache even after | ||
| // AlarmManager has deleted the CRDB row. | ||
|
|
||
| util::Autogate::deinitAutogate(); |
There was a problem hiding this comment.
[LOW] Unlike the other two tests which use KJ_DEFER(util::Autogate::deinitAutogate()), this test calls deinitAutogate() at the top without restoring state on exit. If tests run in the same process and a later test depends on autogate state, this could cause issues.
Consider adding:
| util::Autogate::deinitAutogate(); | |
| util::Autogate::deinitAutogate(); | |
| KJ_DEFER(util::Autogate::deinitAutogate()); |
| KJ_TEST("ActorSqlite alarm state is stale after max counted retry failures without fix") { | ||
| // Regression test for STOR-3654 -- demonstrates the BUG (gate OFF). | ||
|
|
||
| util::Autogate::deinitAutogate(); |
There was a problem hiding this comment.
[LOW] Same as in actor-cache-test.c++ -- deinitAutogate() without KJ_DEFER cleanup. Consider adding KJ_DEFER(util::Autogate::deinitAutogate()); for consistency with the other tests.
|
Review posted successfully on PR #6396. Here's a summary of the findings:
|
This is still a draft.
When we delete an alarm because it retried too many times, we were not removing it from actor-cache or actor-sqlite.
This PR adds a new rpc call to abandonAlarm. This can be used to clear the actor-cache and actor-sqlite state when we delete an alarm that never successfully ran.