Skip to content

Abandon alarms that are deleted due too many retries#6396

Draft
jqmmes wants to merge 6 commits intomainfrom
joaquim/abandon-deleted-alarm
Draft

Abandon alarms that are deleted due too many retries#6396
jqmmes wants to merge 6 commits intomainfrom
joaquim/abandon-deleted-alarm

Conversation

@jqmmes
Copy link
Contributor

@jqmmes jqmmes commented Mar 24, 2026

This is still a draft.

When we delete an alarm because it retried too many times, we were not removing it from actor-cache or actor-sqlite.
This PR adds a new rpc call to abandonAlarm. This can be used to clear the actor-cache and actor-sqlite state when we delete an alarm that never successfully ran.

jqmmes added 6 commits March 24, 2026 11:54
This method should be called when abandoning an alarm, without having it successfuly run
Abandons alarms when the maximum amount of retries is reached
Abandons sqlite alarms that hit the maximum retries
@jqmmes jqmmes force-pushed the joaquim/abandon-deleted-alarm branch from 4fa841e to f7af926 Compare March 24, 2026 11:55
Copy link
Contributor

@ask-bonk ask-bonk bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm Bonk, and I've done a quick review of your PR.

This PR adds an abandonAlarm RPC call and implementations in ActorCache/ActorSqlite to clear stale alarm state after AlarmManager gives up retrying, gated behind an autogate.

Issues found (ranked by severity):

  1. [HIGH] Race condition: scheduledTime is accepted but never checked -- Both ActorCache::abandonAlarm and ActorSqlite::abandonAlarm mark scheduledTime as [[maybe_unused]]. If the user sets a new alarm between the last failed retry and the abandonAlarm call, and that new alarm has already flushed to CLEAN status, abandonAlarm will incorrectly clear it. The time should be compared before clearing.

  2. [MEDIUM] AlarmScheduler (local workerd serve path) does not call abandonAlarm -- AlarmScheduler::makeAlarmTask in alarm-scheduler.c++ calls deleteAlarm when countedRetry >= RETRY_MAX_TRIES (line 239-240) but never calls abandonAlarm on the actor. The same stale-cache bug this PR fixes for production will still occur in local development. Consider adding an abandonAlarm call before deleteAlarm in that code path.

  3. [LOW] Missing KJ_DEFER for autogate cleanup in "without fix" tests -- The tests that demonstrate the bug call deinitAutogate() at the top but don't restore state on exit with KJ_DEFER, unlike the other tests.


These comments were generated by an AI code review assistant and may not be perfect. Please verify the suggestions before applying them.

Comment on lines +214 to +229
kj::Promise<void> ActorCache::abandonAlarm([[maybe_unused]] kj::Date scheduledTime) {
// Called when AlarmManager has given up retrying an alarm after too many counted failures.
// Clear the in-memory alarm state so getAlarm() returns null instead of a stale time.
// AlarmManager already deleted the CRDB row directly; we only need to update our cache.
if (!util::Autogate::isEnabled(util::AutogateKey::ACTOR_ALARM_ABANDONED_CLEANUP)) {
return kj::READY_NOW;
}
// Only clear if we still have a stale KnownAlarmTime. If the user has already set a new
// alarm (KnownAlarmTime{DIRTY}) or if we're mid-handler (DeferredAlarmDelete), leave it.
KJ_IF_SOME(t, currentAlarmTime.tryGet<KnownAlarmTime>()) {
if (t.status == KnownAlarmTime::Status::CLEAN) {
currentAlarmTime = KnownAlarmTime{
.status = KnownAlarmTime::Status::CLEAN, .time = kj::none, .noCache = t.noCache};
}
}
return kj::READY_NOW;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[HIGH] scheduledTime is [[maybe_unused]] but should be checked against the cached alarm time to prevent a race condition.

Scenario: (1) alarm at time T fails 6 times → KnownAlarmTime{CLEAN, T}, (2) user sets new alarm at T2 → DIRTY → flushes → KnownAlarmTime{CLEAN, T2}, (3) abandonAlarm(T) arrives → sees CLEAN → clears it → user's new alarm at T2 is silently lost.

The fix is to compare t.time against scheduledTime before clearing:

Suggested change
kj::Promise<void> ActorCache::abandonAlarm([[maybe_unused]] kj::Date scheduledTime) {
// Called when AlarmManager has given up retrying an alarm after too many counted failures.
// Clear the in-memory alarm state so getAlarm() returns null instead of a stale time.
// AlarmManager already deleted the CRDB row directly; we only need to update our cache.
if (!util::Autogate::isEnabled(util::AutogateKey::ACTOR_ALARM_ABANDONED_CLEANUP)) {
return kj::READY_NOW;
}
// Only clear if we still have a stale KnownAlarmTime. If the user has already set a new
// alarm (KnownAlarmTime{DIRTY}) or if we're mid-handler (DeferredAlarmDelete), leave it.
KJ_IF_SOME(t, currentAlarmTime.tryGet<KnownAlarmTime>()) {
if (t.status == KnownAlarmTime::Status::CLEAN) {
currentAlarmTime = KnownAlarmTime{
.status = KnownAlarmTime::Status::CLEAN, .time = kj::none, .noCache = t.noCache};
}
}
return kj::READY_NOW;
kj::Promise<void> ActorCache::abandonAlarm(kj::Date scheduledTime) {
// Called when AlarmManager has given up retrying an alarm after too many counted failures.
// Clear the in-memory alarm state so getAlarm() returns null instead of a stale time.
// AlarmManager already deleted the CRDB row directly; we only need to update our cache.
if (!util::Autogate::isEnabled(util::AutogateKey::ACTOR_ALARM_ABANDONED_CLEANUP)) {
return kj::READY_NOW;
}
// Only clear if we still have a stale KnownAlarmTime matching the abandoned alarm.
// If the user has already set a new alarm (different time, DIRTY status), or if
// we're mid-handler (DeferredAlarmDelete), leave it alone.
KJ_IF_SOME(t, currentAlarmTime.tryGet<KnownAlarmTime>()) {
if (t.status == KnownAlarmTime::Status::CLEAN && t.time == scheduledTime) {
currentAlarmTime = KnownAlarmTime{
.status = KnownAlarmTime::Status::CLEAN, .time = kj::none, .noCache = t.noCache};
}
}
return kj::READY_NOW;

Comment on lines +1009 to +1027
kj::Promise<void> ActorSqlite::abandonAlarm([[maybe_unused]] kj::Date scheduledTime) {
// Called when AlarmManager has given up retrying an alarm after too many counted failures.
// Clear the alarm from SQLite so getAlarm() returns null instead of a stale time.
// AlarmManager already deleted the CRDB row; we only need to update SQLite.
if (!util::Autogate::isEnabled(util::AutogateKey::ACTOR_ALARM_ABANDONED_CLEANUP)) {
return kj::READY_NOW;
}
// Only clear if SQLite currently has an alarm and we're not mid-handler.
// If the user has set a new alarm since the handler last ran, leave it alone.
if (inAlarmHandler) {
// Shouldn't happen -- AlarmManager shouldn't call abandonAlarm while a handler is running.
LOG_WARNING_ONCE("abandonAlarm called while alarm handler is still running");
return kj::READY_NOW;
}
if (metadata.getAlarm() != kj::none) {
setAlarm(kj::none, {}, nullptr);
}
return kj::READY_NOW;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[HIGH] Same race condition as in ActorCache::abandonAlarm -- scheduledTime is [[maybe_unused]] but should be compared against the current alarm time before clearing.

If the user called setAlarm(T2) after the last retry failure and before abandonAlarm(T) arrives, this will incorrectly delete the new alarm.

Suggested change
kj::Promise<void> ActorSqlite::abandonAlarm([[maybe_unused]] kj::Date scheduledTime) {
// Called when AlarmManager has given up retrying an alarm after too many counted failures.
// Clear the alarm from SQLite so getAlarm() returns null instead of a stale time.
// AlarmManager already deleted the CRDB row; we only need to update SQLite.
if (!util::Autogate::isEnabled(util::AutogateKey::ACTOR_ALARM_ABANDONED_CLEANUP)) {
return kj::READY_NOW;
}
// Only clear if SQLite currently has an alarm and we're not mid-handler.
// If the user has set a new alarm since the handler last ran, leave it alone.
if (inAlarmHandler) {
// Shouldn't happen -- AlarmManager shouldn't call abandonAlarm while a handler is running.
LOG_WARNING_ONCE("abandonAlarm called while alarm handler is still running");
return kj::READY_NOW;
}
if (metadata.getAlarm() != kj::none) {
setAlarm(kj::none, {}, nullptr);
}
return kj::READY_NOW;
}
kj::Promise<void> ActorSqlite::abandonAlarm(kj::Date scheduledTime) {
// Called when AlarmManager has given up retrying an alarm after too many counted failures.
// Clear the alarm from SQLite so getAlarm() returns null instead of a stale time.
// AlarmManager already deleted the CRDB row; we only need to update SQLite.
if (!util::Autogate::isEnabled(util::AutogateKey::ACTOR_ALARM_ABANDONED_CLEANUP)) {
return kj::READY_NOW;
}
// Only clear if SQLite currently has an alarm matching the abandoned time
// and we're not mid-handler.
// If the user has set a new alarm since the handler last ran, leave it alone.
if (inAlarmHandler) {
// Shouldn't happen -- AlarmManager shouldn't call abandonAlarm while a handler is running.
LOG_WARNING_ONCE("abandonAlarm called while alarm handler is still running");
return kj::READY_NOW;
}
KJ_IF_SOME(currentTime, metadata.getAlarm()) {
if (currentTime == scheduledTime) {
setAlarm(kj::none, {}, nullptr);
}
}
return kj::READY_NOW;

// KnownAlarmTime{CLEAN, T} and getAlarm() returns T from cache even after
// AlarmManager has deleted the CRDB row.

util::Autogate::deinitAutogate();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[LOW] Unlike the other two tests which use KJ_DEFER(util::Autogate::deinitAutogate()), this test calls deinitAutogate() at the top without restoring state on exit. If tests run in the same process and a later test depends on autogate state, this could cause issues.

Consider adding:

Suggested change
util::Autogate::deinitAutogate();
util::Autogate::deinitAutogate();
KJ_DEFER(util::Autogate::deinitAutogate());

KJ_TEST("ActorSqlite alarm state is stale after max counted retry failures without fix") {
// Regression test for STOR-3654 -- demonstrates the BUG (gate OFF).

util::Autogate::deinitAutogate();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[LOW] Same as in actor-cache-test.c++ -- deinitAutogate() without KJ_DEFER cleanup. Consider adding KJ_DEFER(util::Autogate::deinitAutogate()); for consistency with the other tests.

@ask-bonk
Copy link
Contributor

ask-bonk bot commented Mar 24, 2026

Review posted successfully on PR #6396. Here's a summary of the findings:

  1. [HIGH] Race condition in both ActorCache::abandonAlarm and ActorSqlite::abandonAlarm -- The scheduledTime parameter is marked [[maybe_unused]] and never compared against the current alarm time. If a user sets a new alarm between the last failed retry and the abandonAlarm call, and that alarm has flushed to CLEAN status, it will be incorrectly cleared. Suggested fix: compare the cached time against scheduledTime before clearing.

  2. [MEDIUM] AlarmScheduler not updated -- The local workerd serve path in alarm-scheduler.c++ calls deleteAlarm when retries are exhausted but never calls abandonAlarm on the actor, so the same stale-cache bug will persist in local dev.

  3. [LOW] Missing KJ_DEFER cleanup -- The "without fix" tests call deinitAutogate() without KJ_DEFER, unlike the other tests that properly clean up.

github run

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant