Abandon alarms that are deleted due too many retries by jqmmes · Pull Request #6396 · cloudflare/workerd

jqmmes · 2026-03-24T11:54:16Z

This is still a draft.

When we deleted an alarm because it had retried too many times, we were not removing it from actor-cache or actor-sqlite.
This PR adds a new RPC call, abandonAlarm, which clears the actor-cache and actor-sqlite state when we delete an alarm that never ran successfully.

This method should be called when abandoning an alarm that never ran successfully.

Abandons alarms when the maximum number of retries is reached

Abandons sqlite alarms that hit the maximum retries

ask-bonk

I'm Bonk, and I've done a quick review of your PR.

This PR adds an abandonAlarm RPC call and implementations in ActorCache/ActorSqlite to clear stale alarm state after AlarmManager gives up retrying, gated behind an autogate.

Issues found (ranked by severity):

[HIGH] Race condition: scheduledTime is accepted but never checked -- Both ActorCache::abandonAlarm and ActorSqlite::abandonAlarm mark scheduledTime as [[maybe_unused]]. If the user sets a new alarm between the last failed retry and the abandonAlarm call, and that new alarm has already flushed to CLEAN status, abandonAlarm will incorrectly clear it. The time should be compared before clearing.
[MEDIUM] AlarmScheduler (local workerd serve path) does not call abandonAlarm -- AlarmScheduler::makeAlarmTask in alarm-scheduler.c++ calls deleteAlarm when countedRetry >= RETRY_MAX_TRIES (lines 239-240) but never calls abandonAlarm on the actor. The same stale-cache bug that this PR fixes for production will still occur in local development. Consider adding an abandonAlarm call before deleteAlarm in that code path.
[LOW] Missing KJ_DEFER for autogate cleanup in "without fix" tests -- The tests that demonstrate the bug call deinitAutogate() at the top but don't restore state on exit with KJ_DEFER, unlike the other tests.
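
To illustrate the [MEDIUM] item, here is a minimal standalone sketch of a retry loop that abandons the alarm before deleting it once retries are exhausted. It is not the AlarmScheduler code: the AlarmHooks struct, runWithRetries, and the kRetryMaxTries value are hypothetical names and assumptions made up for this example.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

// Hypothetical callbacks standing in for the actor and the alarm store.
// None of these names come from workerd.
struct AlarmHooks {
  std::function<bool()> runAlarm;      // returns true if the handler succeeded
  std::function<void()> abandonAlarm;  // clears the actor's cached alarm state
  std::function<void()> deleteAlarm;   // removes the scheduled alarm itself
};

// Assumed retry limit for the sketch; the real constant may differ.
constexpr int kRetryMaxTries = 6;

// Runs the alarm up to kRetryMaxTries times. Returns true on success.
// When every attempt fails, the actor is told to abandon its cached alarm
// before the alarm is deleted, so the cache does not keep a stale time.
bool runWithRetries(const AlarmHooks& hooks) {
  for (int countedRetry = 0; countedRetry < kRetryMaxTries; ++countedRetry) {
    if (hooks.runAlarm()) {
      return true;
    }
  }
  hooks.abandonAlarm();
  hooks.deleteAlarm();
  return false;
}
```

With a handler that always fails, the hooks would fire as kRetryMaxTries runs followed by one abandon and one delete, in that order.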

These comments were generated by an AI code review assistant and may not be perfect. Please verify the suggestions before applying them.

ask-bonk · 2026-03-24T11:59:30Z

src/workerd/io/actor-cache.c++

+kj::Promise<void> ActorCache::abandonAlarm([[maybe_unused]] kj::Date scheduledTime) {
+  // Called when AlarmManager has given up retrying an alarm after too many counted failures.
+  // Clear the in-memory alarm state so getAlarm() returns null instead of a stale time.
+  // AlarmManager already deleted the CRDB row directly; we only need to update our cache.
+  if (!util::Autogate::isEnabled(util::AutogateKey::ACTOR_ALARM_ABANDONED_CLEANUP)) {
+    return kj::READY_NOW;
+  }
+  // Only clear if we still have a stale KnownAlarmTime. If the user has already set a new
+  // alarm (KnownAlarmTime{DIRTY}) or if we're mid-handler (DeferredAlarmDelete), leave it.
+  KJ_IF_SOME(t, currentAlarmTime.tryGet<KnownAlarmTime>()) {
+    if (t.status == KnownAlarmTime::Status::CLEAN) {
+      currentAlarmTime = KnownAlarmTime{
+        .status = KnownAlarmTime::Status::CLEAN, .time = kj::none, .noCache = t.noCache};
+    }
+  }
+  return kj::READY_NOW;


[HIGH] scheduledTime is [[maybe_unused]] but should be checked against the cached alarm time to prevent a race condition.

Scenario: (1) alarm at time T fails 6 times → KnownAlarmTime{CLEAN, T}, (2) user sets new alarm at T2 → DIRTY → flushes → KnownAlarmTime{CLEAN, T2}, (3) abandonAlarm(T) arrives → sees CLEAN → clears it → user's new alarm at T2 is silently lost.

The fix is to compare t.time against scheduledTime before clearing:

Suggested change

-kj::Promise<void> ActorCache::abandonAlarm([[maybe_unused]] kj::Date scheduledTime) {
+kj::Promise<void> ActorCache::abandonAlarm(kj::Date scheduledTime) {
   // Called when AlarmManager has given up retrying an alarm after too many counted failures.
   // Clear the in-memory alarm state so getAlarm() returns null instead of a stale time.
   // AlarmManager already deleted the CRDB row directly; we only need to update our cache.
   if (!util::Autogate::isEnabled(util::AutogateKey::ACTOR_ALARM_ABANDONED_CLEANUP)) {
     return kj::READY_NOW;
   }
-  // Only clear if we still have a stale KnownAlarmTime. If the user has already set a new
-  // alarm (KnownAlarmTime{DIRTY}) or if we're mid-handler (DeferredAlarmDelete), leave it.
+  // Only clear if we still have a stale KnownAlarmTime matching the abandoned alarm.
+  // If the user has already set a new alarm (different time, DIRTY status), or if
+  // we're mid-handler (DeferredAlarmDelete), leave it alone.
   KJ_IF_SOME(t, currentAlarmTime.tryGet<KnownAlarmTime>()) {
-    if (t.status == KnownAlarmTime::Status::CLEAN) {
+    if (t.status == KnownAlarmTime::Status::CLEAN && t.time == scheduledTime) {
       currentAlarmTime = KnownAlarmTime{
         .status = KnownAlarmTime::Status::CLEAN, .time = kj::none, .noCache = t.noCache};
     }
   }
   return kj::READY_NOW;
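
The race described above can be replayed in a small standalone model. This is only a sketch under simplifying assumptions: Date, Status, KnownAlarmTime, and CacheModel are stand-ins written for this example (std::optional replaces kj::Maybe, an integer replaces kj::Date), and they do not reproduce the real ActorCache.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

// Stand-in for kj::Date (assumption for this sketch only).
using Date = std::int64_t;

enum class Status { CLEAN, DIRTY };

// Simplified model of the cached alarm state.
struct KnownAlarmTime {
  Status status;
  std::optional<Date> time;  // std::nullopt means "no alarm set"
};

struct CacheModel {
  KnownAlarmTime current{Status::CLEAN, std::nullopt};

  // The user sets an alarm: the state is DIRTY until it is flushed.
  void setAlarm(Date t) { current = {Status::DIRTY, t}; }

  // The flush completes: the state becomes CLEAN.
  void flush() { current.status = Status::CLEAN; }

  // Variant that ignores scheduledTime: clears any CLEAN alarm.
  void abandonUnchecked(Date /*scheduledTime*/) {
    if (current.status == Status::CLEAN) {
      current.time = std::nullopt;
    }
  }

  // Variant that only clears the alarm it was asked to abandon.
  void abandonChecked(Date scheduledTime) {
    if (current.status == Status::CLEAN && current.time == scheduledTime) {
      current.time = std::nullopt;
    }
  }
};

// Replays steps (1) to (3) of the scenario and returns the cached alarm time
// that remains after the abandon call for the old alarm arrives.
std::optional<Date> replayRace(bool checked) {
  constexpr Date T = 1000;
  constexpr Date T2 = 2000;
  CacheModel cache;
  cache.setAlarm(T);   // (1) alarm at T, which later exhausts its retries
  cache.flush();
  cache.setAlarm(T2);  // (2) user sets a new alarm at T2, flushed to CLEAN
  cache.flush();
  if (checked) {
    cache.abandonChecked(T);  // (3) late abandon for T, with the time check
  } else {
    cache.abandonUnchecked(T);  // (3) late abandon for T, without the check
  }
  return cache.current.time;
}
```

In this model, replayRace(false) loses the alarm at T2, while replayRace(true) keeps it because the cached time no longer matches the abandoned one.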

ask-bonk · 2026-03-24T11:59:30Z

src/workerd/io/actor-sqlite.c++

+kj::Promise<void> ActorSqlite::abandonAlarm([[maybe_unused]] kj::Date scheduledTime) {
+  // Called when AlarmManager has given up retrying an alarm after too many counted failures.
+  // Clear the alarm from SQLite so getAlarm() returns null instead of a stale time.
+  // AlarmManager already deleted the CRDB row; we only need to update SQLite.
+  if (!util::Autogate::isEnabled(util::AutogateKey::ACTOR_ALARM_ABANDONED_CLEANUP)) {
+    return kj::READY_NOW;
+  }
+  // Only clear if SQLite currently has an alarm and we're not mid-handler.
+  // If the user has set a new alarm since the handler last ran, leave it alone.
+  if (inAlarmHandler) {
+    // Shouldn't happen -- AlarmManager shouldn't call abandonAlarm while a handler is running.
+    LOG_WARNING_ONCE("abandonAlarm called while alarm handler is still running");
+    return kj::READY_NOW;
+  }
+  if (metadata.getAlarm() != kj::none) {
+    setAlarm(kj::none, {}, nullptr);
+  }
+  return kj::READY_NOW;
+}


[HIGH] Same race condition as in ActorCache::abandonAlarm -- scheduledTime is [[maybe_unused]] but should be compared against the current alarm time before clearing.

If the user called setAlarm(T2) after the last retry failure and before abandonAlarm(T) arrives, this will incorrectly delete the new alarm.

Suggested change

-kj::Promise<void> ActorSqlite::abandonAlarm([[maybe_unused]] kj::Date scheduledTime) {
+kj::Promise<void> ActorSqlite::abandonAlarm(kj::Date scheduledTime) {
   // Called when AlarmManager has given up retrying an alarm after too many counted failures.
   // Clear the alarm from SQLite so getAlarm() returns null instead of a stale time.
   // AlarmManager already deleted the CRDB row; we only need to update SQLite.
   if (!util::Autogate::isEnabled(util::AutogateKey::ACTOR_ALARM_ABANDONED_CLEANUP)) {
     return kj::READY_NOW;
   }
-  // Only clear if SQLite currently has an alarm and we're not mid-handler.
+  // Only clear if SQLite currently has an alarm matching the abandoned time
+  // and we're not mid-handler.
   // If the user has set a new alarm since the handler last ran, leave it alone.
   if (inAlarmHandler) {
     // Shouldn't happen -- AlarmManager shouldn't call abandonAlarm while a handler is running.
     LOG_WARNING_ONCE("abandonAlarm called while alarm handler is still running");
     return kj::READY_NOW;
   }
-  if (metadata.getAlarm() != kj::none) {
-    setAlarm(kj::none, {}, nullptr);
-  }
+  KJ_IF_SOME(currentTime, metadata.getAlarm()) {
+    if (currentTime == scheduledTime) {
+      setAlarm(kj::none, {}, nullptr);
+    }
+  }
   return kj::READY_NOW;
 }
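
The two guards in the suggested ActorSqlite change (skip while a handler is running, and clear only when the stored time matches) can be shown in isolation. The sketch below is a simplified model and not the real class: SqliteAlarmModel, AlarmTime, and the boolean return value are assumptions introduced for this example.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

// Stand-in for kj::Date (assumption for this sketch only).
using AlarmTime = std::int64_t;

// Simplified model of the alarm state kept in SQLite metadata.
struct SqliteAlarmModel {
  std::optional<AlarmTime> storedAlarm;  // std::nullopt means "no alarm"
  bool inAlarmHandler = false;

  // Returns true if the stored alarm was cleared.
  bool abandonAlarm(AlarmTime scheduledTime) {
    // Guard 1: never touch alarm state while a handler is running.
    if (inAlarmHandler) {
      return false;
    }
    // Guard 2: only clear the alarm that was actually abandoned; a newer
    // alarm set by the user has a different time and is left alone.
    if (storedAlarm.has_value() && *storedAlarm == scheduledTime) {
      storedAlarm.reset();
      return true;
    }
    return false;
  }
};
```

Returning a bool is a choice made for the sketch so that callers and tests can observe whether anything was cleared; the method under review returns kj::Promise<void>.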

ask-bonk · 2026-03-24T11:59:30Z

src/workerd/io/actor-cache-test.c++

+  // KnownAlarmTime{CLEAN, T} and getAlarm() returns T from cache even after
+  // AlarmManager has deleted the CRDB row.
+
+  util::Autogate::deinitAutogate();


[LOW] Unlike the other two tests which use KJ_DEFER(util::Autogate::deinitAutogate()), this test calls deinitAutogate() at the top without restoring state on exit. If tests run in the same process and a later test depends on autogate state, this could cause issues.

Consider adding:

Suggested change

   util::Autogate::deinitAutogate();
+  KJ_DEFER(util::Autogate::deinitAutogate());
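
For readers unfamiliar with the pattern, the idea behind deferring cleanup can be sketched with a plain C++ scope guard. This is not how KJ_DEFER is implemented: ScopeGuard, g_gateInitialized, initGate, deinitGate, and testBodyWithGuard are hypothetical names for this example, with a global flag standing in for process-wide autogate state.

```cpp
#include <cassert>
#include <utility>

// Minimal scope guard: runs the stored callable when it goes out of scope.
template <typename F>
class ScopeGuard {
public:
  explicit ScopeGuard(F f) : f_(std::move(f)) {}
  ~ScopeGuard() { f_(); }
  ScopeGuard(const ScopeGuard&) = delete;
  ScopeGuard& operator=(const ScopeGuard&) = delete;

private:
  F f_;
};

// Hypothetical process-wide state shared by all tests in the binary.
inline bool g_gateInitialized = false;
inline void initGate() { g_gateInitialized = true; }
inline void deinitGate() { g_gateInitialized = false; }

// A test body that changes global state and registers its cleanup up front,
// so the state is reset on every exit path, including early returns.
inline bool testBodyWithGuard(bool exitEarly) {
  initGate();
  ScopeGuard cleanup([] { deinitGate(); });
  if (exitEarly) {
    return false;  // cleanup still runs here
  }
  return g_gateInitialized;  // true while the test body is running
}
```

Registering the cleanup immediately after the setup means later tests see the same starting state regardless of how the test body exits.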

ask-bonk · 2026-03-24T11:59:30Z

src/workerd/io/actor-sqlite-test.c++

+KJ_TEST("ActorSqlite alarm state is stale after max counted retry failures without fix") {
+  // Regression test for STOR-3654 -- demonstrates the BUG (gate OFF).
+
+  util::Autogate::deinitAutogate();


[LOW] Same as in actor-cache-test.c++ -- deinitAutogate() without KJ_DEFER cleanup. Consider adding KJ_DEFER(util::Autogate::deinitAutogate()); for consistency with the other tests.

ask-bonk · 2026-03-24T11:59:48Z

Review posted successfully on PR #6396. Here's a summary of the findings:

[HIGH] Race condition in both ActorCache::abandonAlarm and ActorSqlite::abandonAlarm -- The scheduledTime parameter is marked [[maybe_unused]] and never compared against the current alarm time. If a user sets a new alarm between the last failed retry and the abandonAlarm call, and that alarm has flushed to CLEAN status, it will be incorrectly cleared. Suggested fix: compare the cached time against scheduledTime before clearing.
[MEDIUM] AlarmScheduler not updated -- The local workerd serve path in alarm-scheduler.c++ calls deleteAlarm when retries are exhausted but never calls abandonAlarm on the actor, so the same stale-cache bug will persist in local dev.
[LOW] Missing KJ_DEFER cleanup -- The "without fix" tests call deinitAutogate() without KJ_DEFER, unlike the other tests that properly clean up.


jqmmes added 6 commits March 24, 2026 11:54

Add a new ACTOR_ALARM_ABANDONED_CLEANUP autogate

993b5d6

Add a new abandonAlarm to WorkerInterface

af6310d

This method should be called when abandoning an alarm that never ran successfully.

Add abandonAlarm to actor-cache

60076b8

Abandons alarms when the maximum number of retries is reached

Update actor-cache-test with abandonAlarm

3ac0622

Add abandonAlarm to actor-sqlite

c218d5d

Abandons sqlite alarms that hit the maximum retries

Update actor-sqlite-test with abandonAlarm

f7af926

jqmmes force-pushed the joaquim/abandon-deleted-alarm branch from 4fa841e to f7af926 Compare March 24, 2026 11:55

ask-bonk bot reviewed Mar 24, 2026

Abandon alarms that are deleted due too many retries #6396
jqmmes wants to merge 6 commits into main from joaquim/abandon-deleted-alarm
