
feat(kiloclaw): proactively refresh API keys approaching expiry#1049

Open
pandemicsyn wants to merge 5 commits into main from florian/chore/proactively-refresh-token

Conversation

@pandemicsyn

@pandemicsyn pandemicsyn commented Mar 11, 2026

Summary

Instances' API keys (JWTs) have a fixed expiry. Today, if a key expires while a sandbox is running, the gateway loses API access until the next full restart re-mints the key. This PR adds proactive refresh: the reconciliation alarm checks if the key expires within 3 days (configurable via PROACTIVE_REFRESH_THRESHOLD_HOURS wrangler var) and mints a fresh one.

How the fresh key is delivered:

  1. Fly machine config is always updated with skipLaunch (durable persist). This ensures the key survives cold starts regardless of whether the live push succeeds.
  2. Live push to the controller's process.env via POST /_kilo/env/patch is attempted. If the controller supports it, SIGUSR1 triggers a graceful in-process restart in OpenClaw — it drains active tasks (up to 90s), closes the server, and restarts the server loop, which re-reads process.env and picks up the new key.
  3. If the push fails (404 from old controller, network error, not signaled), no forced restart — the Fly config already has the new key, and the machine picks it up on its next natural restart (user-initiated, crash, deploy).

No version gating — capability detection is used. The push is always attempted; a 404 from old controllers is caught and handled gracefully.
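
A minimal sketch of the capability-detection push described above, not the PR's actual code. The request body shape, the `PostFn` transport type, and the names `classifyPush` and `tryPushApiKey` are assumptions for illustration; only the endpoint path, the `KILOCODE_API_KEY` variable, the 404 behaviour of old controllers, and the `signaled` flag come from this PR.

```typescript
// Hypothetical transport shape; the real gateway client is not shown in this PR.
interface PatchResponse {
  status: number;
  json(): Promise<unknown>;
}
type PostFn = (path: string, body: unknown) => Promise<PatchResponse>;

type PushOutcome =
  | { pushed: true }
  | {
      pushed: false;
      reason: "unsupported" | "not_signaled" | "http_error" | "network_error";
    };

// Pure classification of a controller reply.
// A 404 is read as "this controller image predates /_kilo/env/patch".
function classifyPush(status: number, body: unknown): PushOutcome {
  if (status === 404) return { pushed: false, reason: "unsupported" };
  if (status < 200 || status >= 300) return { pushed: false, reason: "http_error" };
  const signaled =
    typeof body === "object" &&
    body !== null &&
    (body as { signaled?: unknown }).signaled === true;
  return signaled ? { pushed: true } : { pushed: false, reason: "not_signaled" };
}

// Always attempt the push; every failure mode degrades to "not pushed"
// so the caller never needs to restart the machine.
async function tryPushApiKey(post: PostFn, token: string): Promise<PushOutcome> {
  try {
    // Assumed payload shape: a flat map of allowlisted env vars.
    const res = await post("/_kilo/env/patch", { KILOCODE_API_KEY: token });
    const body = res.status === 404 ? null : await res.json().catch(() => null);
    return classifyPush(res.status, body);
  } catch {
    return { pushed: false, reason: "network_error" };
  }
}
```

Keeping the classification separate from the I/O makes the 404, non-2xx, and unsignaled cases testable without a running controller.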

Failure handling:

  • Both paths fail (Fly config update AND push): key/expiry is NOT persisted to DO state, so the next alarm cycle retries.
  • Push fails, Fly config succeeds: key is persisted. Next restart picks it up.
  • Fly config fails, push succeeds: key is persisted. Gateway has it live, but next cold start will need another refresh.
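
The three cases above reduce to a single predicate. The sketch below is illustrative only: `DeliveryResult`, `PersistDecision`, and `decidePersist` are hypothetical names, and mapping each branch to a log event is an assumption based on the event names listed in the reviewer notes.

```typescript
// Hypothetical result of one refresh attempt.
interface DeliveryResult {
  pushed: boolean; // live push to the controller succeeded and was signaled
  flyConfigUpdated: boolean; // Fly machine config write (skipLaunch) succeeded
}

interface PersistDecision {
  persist: boolean;
  event: "api_key_refreshed" | "api_key_refresh_failed_all_paths";
}

// Persist the new key/expiry to DO state only if the key reached
// at least one place the gateway can read it from.
function decidePersist(result: DeliveryResult): PersistDecision {
  if (result.pushed || result.flyConfigUpdated) {
    return { persist: true, event: "api_key_refreshed" };
  }
  // Nothing was delivered: leave the old expiry in place so the
  // next alarm cycle sees the key as still near expiry and retries.
  return { persist: false, event: "api_key_refresh_failed_all_paths" };
}
```

Because the old expiry stays in durable state when both paths fail, the retry needs no separate bookkeeping.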

What changed:

  • Controller endpoint (POST /_kilo/env/patch): accepts an allowlisted set of env vars (KILOCODE_API_KEY), writes them to process.env, and sends SIGUSR1 to the gateway. Bearer-auth gated same as existing /_kilo/config/* routes.
  • Fly client (updateMachine): added skipLaunch option — updates machine config without restarting.
  • Reconciliation (reconcileApiKeyExpiry): new step wired after reconcileVolume. Flow: mint → update Fly config (skipLaunch) → try push → persist only if at least one path succeeded. minSecretsVersion forwarded from ensureEnvKey() to prevent secret propagation races.
  • Config (getProactiveRefreshThresholdMs): reads PROACTIVE_REFRESH_THRESHOLD_HOURS wrangler var with fallback to 72h default. Set to a large value (e.g. 8760 = 1 year) to trigger refresh on all running instances for testing.
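
As a rough illustration of the config item, the threshold lookup and the expiry check could look like the following. The function name `getProactiveRefreshThresholdMs` and the 72h default come from this PR; the body, the handling of invalid values, and the `needsRefresh` helper are assumptions.

```typescript
const DEFAULT_THRESHOLD_HOURS = 72;
const MS_PER_HOUR = 60 * 60 * 1000;

// Wrangler vars arrive as strings; anything missing, non-numeric,
// or non-positive falls back to the default (assumed behaviour).
function getProactiveRefreshThresholdMs(env: {
  PROACTIVE_REFRESH_THRESHOLD_HOURS?: string;
}): number {
  const raw = env.PROACTIVE_REFRESH_THRESHOLD_HOURS;
  const hours = raw === undefined ? NaN : Number(raw);
  const valid = Number.isFinite(hours) && hours > 0;
  return (valid ? hours : DEFAULT_THRESHOLD_HOURS) * MS_PER_HOUR;
}

// Hypothetical helper: a key needs refreshing when its remaining
// lifetime is at or below the threshold. Unknown expiry is skipped.
function needsRefresh(
  expiresAtMs: number | null,
  nowMs: number,
  thresholdMs: number,
): boolean {
  if (expiresAtMs === null) return false;
  return expiresAtMs - nowMs <= thresholdMs;
}
```

With the default, a key with 30 days left is skipped; setting the var to 8760 makes the same key qualify, which is the staging trick described above.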

Verification

  • pnpm typecheck — passes
  • pnpm test — 566/566 tests pass (30 test files), including 26 new tests across 5 new/modified test files
  • pnpm lint — passes
  • Additional verification (manual testing, staging deploy, etc.)

Visual Changes

Old controller:

[Screenshot, 2026-03-12 at 11:52:06 AM]

New controller:

Reviewer Notes

  • No forced restarts. The refresh process never causes downtime. If the live push fails, the machine keeps running with the old key until it naturally restarts, at which point it boots with the fresh key from Fly config.
  • Capability detection replaces version gating: the push to /_kilo/env/patch is always attempted. Old controllers return 404, which is caught gracefully. No manual version constant to maintain.
  • The Fly config update always uses skipLaunch: true with minSecretsVersion from ensureEnvKey().
  • Key/expiry is only persisted to DO state when at least one delivery path succeeded (pushed || flyConfigUpdated). If both fail, the next alarm retries.
  • Promise.race for the mint timeout clears the timer on success. Chosen over AbortSignal.timeout because Hyperdrive doesn't propagate abort signals.
  • To test in staging: set PROACTIVE_REFRESH_THRESHOLD_HOURS to 8760 (1 year) in wrangler vars — every running instance with a known expiry will refresh on its next alarm cycle (within 5 min).
  • Structured log events for dashboarding: filter on tag:"reconcile" AND action:"api_key_*". Key events: api_key_refreshed (with pushed and flyConfigUpdated fields), api_key_push_error, api_key_refresh_failed_all_paths.
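
The `Promise.race` timeout mentioned in the notes can be sketched as below. This is a generic pattern under assumed names (`withTimeout`, the `label` parameter), not the PR's mint code. It clears the timer on both success and failure, and it does not cancel the underlying work, which simply gets ignored after the deadline.

```typescript
// Race a promise against a timer and always clear the timer afterwards,
// so a fast success does not leave a pending timeout behind.
async function withTimeout<T>(
  work: Promise<T>,
  ms: number,
  label: string,
): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_resolve, reject) => {
    timer = setTimeout(
      () => reject(new Error(`${label} timed out after ${ms}ms`)),
      ms,
    );
  });
  try {
    return await Promise.race([work, timeout]);
  } finally {
    // clearTimeout is a no-op for an undefined or already-fired timer.
    clearTimeout(timer);
  }
}
```

A caller would wrap the mint call, for example `await withTimeout(mintKey(), 10_000, "mint")`, where `mintKey` and the 10-second budget are placeholders.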

The reconciliation alarm now checks if the instance's API key expires
within 7 days and, if the controller supports it, mints a fresh key,
pushes it via the new /_kilo/env/patch endpoint, updates the Fly machine
config (without restart via skip_launch), and persists the new expiry.

Key changes:
- Controller: POST /_kilo/env/patch with KILOCODE_API_KEY allowlist
- Fly client: skip_launch option on updateMachine
- Reconcile: reconcileApiKeyExpiry with version gating, mint, push, persist
- Config: isCalverAtLeast helper and PROACTIVE_REFRESH_THRESHOLD_MS constant
…controllers

Restructure reconcileApiKeyExpiry so the version check only gates the
push-to-controller step, not the entire flow. When the controller is
too old for /_kilo/env/patch, we still mint a fresh key, update the
Fly machine config (triggering a restart), and persist to DO state.

Also:
- Reduce PROACTIVE_REFRESH_THRESHOLD_MS from 7 days to 3 days
- Guard against starting stopped machines (check Fly machine.state
  before deciding skipLaunch)
- Add test for stopped-machine safety guard

@kilo-code-bot

kilo-code-bot bot commented Mar 11, 2026

Code Review Summary

Status: 4 Issues Found | Recommendation: Address before merge


Overview

| Severity | Count |
| --- | --- |
| CRITICAL | 0 |
| WARNING | 4 |
| SUGGESTION | 0 |

Issue Details

WARNING

| File | Line | Issue |
| --- | --- | --- |
| kiloclaw/src/durable-objects/kiloclaw-instance/reconcile.ts | N/A | ensureEnvKey() result drops secretsVersion |
| kiloclaw/src/durable-objects/kiloclaw-instance/reconcile.ts | 185 | Persisting the refreshed key can leave durable state ahead of machine config |
| kiloclaw/src/config.ts | N/A | Hardcoded controller version gate can drift from the deployed image |
| kiloclaw/src/durable-objects/kiloclaw-instance/reconcile.ts | 178 | Persisting the new expiry after an unsignaled push can leave the live gateway on the old key |

Other Observations (not in diff)

None.

Files Reviewed (13 files)
  • kiloclaw/controller/src/index.ts - 0 issues
  • kiloclaw/controller/src/routes/env.test.ts - 0 issues
  • kiloclaw/controller/src/routes/env.ts - 0 issues
  • kiloclaw/src/config.test.ts - 0 issues
  • kiloclaw/src/config.ts - 1 issue
  • kiloclaw/src/durable-objects/gateway-controller-types.ts - 0 issues
  • kiloclaw/src/durable-objects/kiloclaw-instance.test.ts - 0 issues
  • kiloclaw/src/durable-objects/kiloclaw-instance/gateway.ts - 0 issues
  • kiloclaw/src/durable-objects/kiloclaw-instance/reconcile.ts - 3 issues
  • kiloclaw/src/fly/client.test.ts - 0 issues
  • kiloclaw/src/fly/client.ts - 0 issues
  • kiloclaw/src/types.ts - 0 issues
  • kiloclaw/wrangler.jsonc - 0 issues

Reviewed by gpt-5.4-20260305 · 1,456,988 tokens

- Reorder: update Fly config (skipLaunch) before hot patch attempt so
  the key is durably persisted before we try the live push
- Forward minSecretsVersion from ensureEnvKey() to updateMachine to
  prevent secret propagation races on restart
- Use updateMachine without skipLaunch for restart instead of
  stop+start to avoid leaving the machine stopped on partial failure
- Only persist new key/expiry to DO state when at least one delivery
  path succeeded (push or Fly config update)
- Make refresh threshold configurable via PROACTIVE_REFRESH_THRESHOLD_HOURS
  wrangler var (default 72h) for testing
- Reduce default threshold from 7 days to 3 days

Remove MIN_ENV_PATCH_CONTROLLER_VERSION, isCalverAtLeast(), and the
getControllerVersion() pre-flight check. Instead, always try the push
to /_kilo/env/patch — if the controller returns 404 (old image), the
catch block handles it and falls through to the restart path.

This eliminates a manually maintained calver constant that had to
match the controller release date. The cost is one extra HTTP call
per refresh event on old controllers (the 404), which is negligible
since refresh only triggers once per key expiry cycle.

Never force-restart the machine during key refresh. The Fly config is
updated with skipLaunch (durable persist), the push is attempted for
live delivery, and if the push fails the machine picks up the new key
on its next natural restart (user-initiated, crash, deploy).

This avoids any risk of downtime caused by the refresh process itself.

// update and push failed, the running gateway still has the old key.
// Persisting the new expiry would cause future alarms to skip refresh,
// letting the old key expire silently.
if (!pushed && !flyConfigUpdated) {

WARNING: Persisting the new expiry after an unsignaled push can strand the live gateway on the expiring key

When /_kilo/env/patch returns 404, throws, or reports signaled: false, the running process never receives freshKey.token, but this branch still advances durable state as soon as the Fly config write succeeds. Subsequent alarms will see the 30-day expiry and skip another refresh, so a long-lived machine can keep using the old in-memory key until it expires. Only persisting after the hot patch succeeds, or scheduling a retry/restart whenever pushed is false, avoids that silent failure mode.
