feat(kiloclaw): proactively refresh API keys approaching expiry #1049
pandemicsyn wants to merge 5 commits into main
The reconciliation alarm now checks if the instance's API key expires within 7 days and, if the controller supports it, mints a fresh key, pushes it via the new /_kilo/env/patch endpoint, updates the Fly machine config (without restart via skip_launch), and persists the new expiry.

Key changes:
- Controller: POST /_kilo/env/patch with KILOCODE_API_KEY allowlist
- Fly client: skip_launch option on updateMachine
- Reconcile: reconcileApiKeyExpiry with version gating, mint, push, persist
- Config: isCalverAtLeast helper and PROACTIVE_REFRESH_THRESHOLD_MS constant
…controllers

Restructure reconcileApiKeyExpiry so the version check only gates the push-to-controller step, not the entire flow. When the controller is too old for /_kilo/env/patch, we still mint a fresh key, update the Fly machine config (triggering a restart), and persist to DO state.

Also:
- Reduce PROACTIVE_REFRESH_THRESHOLD_MS from 7 days to 3 days
- Guard against starting stopped machines (check Fly machine.state before deciding skipLaunch)
- Add test for stopped-machine safety guard
Code Review Summary
Status: 4 Issues Found | Recommendation: Address before merge
Other Observations (not in diff): None.
Files Reviewed: 13 files
Reviewed by gpt-5.4-20260305 · 1,456,988 tokens
- Reorder: update Fly config (skipLaunch) before hot patch attempt so the key is durably persisted before we try the live push
- Forward minSecretsVersion from ensureEnvKey() to updateMachine to prevent secret propagation races on restart
- Use updateMachine without skipLaunch for restart instead of stop+start to avoid leaving the machine stopped on partial failure
- Only persist new key/expiry to DO state when at least one delivery path succeeded (push or Fly config update)
- Make refresh threshold configurable via PROACTIVE_REFRESH_THRESHOLD_HOURS wrangler var (default 72h) for testing
- Reduce default threshold from 7 days to 3 days
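The configurable threshold described above could look roughly like the following. This is a minimal sketch, not the PR's code: the `RefreshEnv` shape, the parsing rules for invalid values, and the `needsRefresh` helper are assumptions made for illustration. Only the variable name PROACTIVE_REFRESH_THRESHOLD_HOURS, the function name getProactiveRefreshThresholdMs, and the 72h default come from the PR text.

```typescript
// Hypothetical env shape; wrangler vars typically arrive as strings.
interface RefreshEnv {
  PROACTIVE_REFRESH_THRESHOLD_HOURS?: string;
}

const DEFAULT_THRESHOLD_HOURS = 72; // 3 days, the default stated in the PR
const MS_PER_HOUR = 60 * 60 * 1000;

// Read the wrangler var; fall back to the default when it is missing,
// non-numeric, or not positive (the fallback rules are an assumption).
function getProactiveRefreshThresholdMs(env: RefreshEnv): number {
  const raw = env.PROACTIVE_REFRESH_THRESHOLD_HOURS;
  const hours = raw === undefined ? NaN : Number(raw);
  const valid = Number.isFinite(hours) && hours > 0;
  return (valid ? hours : DEFAULT_THRESHOLD_HOURS) * MS_PER_HOUR;
}

// Hypothetical helper: true when the key expires within the threshold window.
function needsRefresh(expiresAtMs: number, nowMs: number, thresholdMs: number): boolean {
  return expiresAtMs - nowMs <= thresholdMs;
}
```

With the default, a key expiring in two days qualifies for refresh and one expiring in ten days does not; setting the var to 8760 makes any key expiring within a year qualify.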
Remove MIN_ENV_PATCH_CONTROLLER_VERSION, isCalverAtLeast(), and the getControllerVersion() pre-flight check. Instead, always try the push to /_kilo/env/patch — if the controller returns 404 (old image), the catch block handles it and falls through to the restart path. This eliminates a manually maintained calver constant that had to match the controller release date. The cost is one extra HTTP call per refresh event on old controllers (the 404), which is negligible since refresh only triggers once per key expiry cycle.
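The capability-detection approach in this commit might be sketched as follows. The `FetchLike` type, the `tryPushKey` name, the base URL and bearer token parameters, and the JSON request body shape are all assumptions for illustration; the PR text only confirms the POST /_kilo/env/patch route, bearer auth, the KILOCODE_API_KEY variable, and the 404 from old controllers.

```typescript
type PushResult = "pushed" | "unsupported" | "failed";

// Minimal fetch-like signature so the sketch can run without a network.
type FetchLike = (
  url: string,
  init: { method: string; headers: Record<string, string>; body: string },
) => Promise<{ status: number; ok: boolean }>;

async function tryPushKey(
  fetchFn: FetchLike,
  baseUrl: string, // hypothetical controller base URL
  bearerToken: string, // hypothetical auth token
  apiKey: string,
): Promise<PushResult> {
  try {
    const res = await fetchFn(`${baseUrl}/_kilo/env/patch`, {
      method: "POST",
      headers: {
        "content-type": "application/json",
        authorization: `Bearer ${bearerToken}`,
      },
      // Body shape is an assumption; the PR does not show it.
      body: JSON.stringify({ KILOCODE_API_KEY: apiKey }),
    });
    // An old controller image has no such route and answers 404.
    if (res.status === 404) return "unsupported";
    return res.ok ? "pushed" : "failed";
  } catch {
    // Network errors are treated like any other push failure.
    return "failed";
  }
}
```

Distinguishing "unsupported" from "failed" is optional here; both mean the live process did not receive the key, but the distinction can make logs easier to read.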
Never force-restart the machine during key refresh. The Fly config is updated with skipLaunch (durable persist), the push is attempted for live delivery, and if the push fails the machine picks up the new key on its next natural restart (user-initiated, crash, deploy). This avoids any risk of downtime caused by the refresh process itself.
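The resulting order of operations (mint, durable Fly config update with skipLaunch, best-effort push, then persist only if something was delivered) can be sketched as below. The `RefreshDeps` interface and every method on it are hypothetical stand-ins for the real Fly client, controller client, and DO storage; this is an illustration of the flow, not the PR's implementation.

```typescript
// Hypothetical dependencies, injected so the flow can be exercised in isolation.
interface RefreshDeps {
  mintKey(): Promise<{ token: string; expiresAt: number }>;
  updateFlyConfig(token: string, opts: { skipLaunch: true }): Promise<void>;
  pushToController(token: string): Promise<boolean>;
  persist(token: string, expiresAt: number): Promise<void>;
}

interface RefreshOutcome {
  pushed: boolean;
  flyConfigUpdated: boolean;
  persisted: boolean;
}

async function refreshKey(deps: RefreshDeps): Promise<RefreshOutcome> {
  const fresh = await deps.mintKey();

  // Step 1: durable persist in the machine config, never restarting the machine.
  let flyConfigUpdated = false;
  try {
    await deps.updateFlyConfig(fresh.token, { skipLaunch: true });
    flyConfigUpdated = true;
  } catch {
    // Leave flyConfigUpdated false; the push below may still succeed.
  }

  // Step 2: best-effort live delivery to the running process.
  let pushed = false;
  try {
    pushed = await deps.pushToController(fresh.token);
  } catch {
    // A failed push is not fatal; the key is picked up on the next natural restart.
  }

  // Step 3: only advance stored state if at least one path delivered the key.
  if (!pushed && !flyConfigUpdated) {
    return { pushed, flyConfigUpdated, persisted: false };
  }
  await deps.persist(fresh.token, fresh.expiresAt);
  return { pushed, flyConfigUpdated, persisted: true };
}
```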
// update and push failed, the running gateway still has the old key.
// Persisting the new expiry would cause future alarms to skip refresh,
// letting the old key expire silently.
if (!pushed && !flyConfigUpdated) {
WARNING: Persisting the new expiry after an unsignaled push can strand the live gateway on the expiring key
When /_kilo/env/patch returns 404, throws, or reports signaled: false, the running process never receives freshKey.token, but this branch still advances durable state as soon as the Fly config write succeeds. Subsequent alarms will see the 30-day expiry and skip another refresh, so a long-lived machine can keep using the old in-memory key until it expires. Only persisting after the hot patch succeeds, or scheduling a retry/restart whenever pushed is false, avoids that silent failure mode.
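One possible reading of the second suggestion (retry whenever pushed is false) is to store a separate flag alongside the expiry, so that later alarms keep attempting live delivery even though the stored expiry has advanced. The `KeyState` shape, the `pendingLivePush` field, and both function names below are hypothetical; nothing in the PR confirms this design.

```typescript
// Hypothetical durable state: the expiry plus a flag recording that the
// running process has not yet received the key that matches it.
interface KeyState {
  expiresAt: number;
  pendingLivePush: boolean;
}

// Compute the state to store after a refresh attempt.
function nextKeyState(
  prev: KeyState,
  freshExpiresAt: number,
  pushed: boolean,
  flyConfigUpdated: boolean,
): KeyState {
  // Nothing was delivered: keep the old state so the next alarm retries in full.
  if (!pushed && !flyConfigUpdated) return prev;
  // Something was delivered: advance the expiry, but remember a missed live push.
  return { expiresAt: freshExpiresAt, pendingLivePush: !pushed };
}

// An alarm acts when the key is near expiry or a live push is still owed.
function shouldAttemptDelivery(state: KeyState, nowMs: number, thresholdMs: number): boolean {
  return state.pendingLivePush || state.expiresAt - nowMs <= thresholdMs;
}
```

Under this sketch, a machine whose Fly config was updated but whose push failed would still be revisited on the next alarm instead of being skipped until the new expiry approaches.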
Summary
Instances' API keys (JWTs) have a fixed expiry. Today, if a key expires while a sandbox is running, the gateway loses API access until the next full restart re-mints the key. This PR adds proactive refresh: the reconciliation alarm checks if the key expires within 3 days (configurable via the PROACTIVE_REFRESH_THRESHOLD_HOURS wrangler var) and mints a fresh one.

How the fresh key is delivered:
- The Fly machine config is updated with skipLaunch (durable persist). This ensures the key survives cold starts regardless of whether the live push succeeds.
- A live push to process.env via POST /_kilo/env/patch is attempted. If the controller supports it, SIGUSR1 triggers a graceful in-process restart in OpenClaw: it drains active tasks (up to 90s), closes the server, and restarts the server loop, which re-reads process.env and picks up the new key.

No version gating: capability detection is used. The push is always attempted; a 404 from old controllers is caught and handled gracefully.
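On the controller side, the allowlist-then-signal behaviour of POST /_kilo/env/patch might be sketched like this. It is an illustration under assumptions: the `applyEnvPatch` name, the `PatchResult` shape (apart from the `signaled` field mentioned in the review comment), and the choice to skip rather than reject disallowed keys are not taken from the PR.

```typescript
// Only this variable is allowlisted according to the PR description.
const ALLOWED_ENV_KEYS: ReadonlySet<string> = new Set(["KILOCODE_API_KEY"]);

interface PatchResult {
  applied: string[];
  rejected: string[];
  signaled: boolean;
}

// Apply allowlisted string values to `env`, then ask the gateway to reload.
// `signalGateway` is injected; it returns true when the signal was delivered.
function applyEnvPatch(
  patch: Record<string, unknown>,
  env: Record<string, string | undefined>,
  signalGateway: () => boolean,
): PatchResult {
  const applied: string[] = [];
  const rejected: string[] = [];
  for (const [key, value] of Object.entries(patch)) {
    if (ALLOWED_ENV_KEYS.has(key) && typeof value === "string" && value.length > 0) {
      env[key] = value;
      applied.push(key);
    } else {
      // Assumption: unknown keys are skipped; the real handler may reject the request.
      rejected.push(key);
    }
  }
  // Only signal when something actually changed.
  const signaled = applied.length > 0 ? signalGateway() : false;
  return { applied, rejected, signaled };
}
```

In a Node-based controller this could be wired as `applyEnvPatch(body, process.env, () => gateway.kill("SIGUSR1"))`, where `gateway` is a hypothetical handle to the gateway child process.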
Failure handling:
What changed:
- Controller (POST /_kilo/env/patch): accepts an allowlisted set of env vars (KILOCODE_API_KEY), writes them to process.env, and sends SIGUSR1 to the gateway. Bearer-auth gated same as existing /_kilo/config/* routes.
- Fly client (updateMachine): added skipLaunch option, which updates machine config without restarting.
- Reconcile (reconcileApiKeyExpiry): new step wired after reconcileVolume. Flow: mint → update Fly config (skipLaunch) → try push → persist only if at least one path succeeded. minSecretsVersion is forwarded from ensureEnvKey() to prevent secret propagation races.
- Config (getProactiveRefreshThresholdMs): reads the PROACTIVE_REFRESH_THRESHOLD_HOURS wrangler var with fallback to the 72h default. Set to a large value (e.g. 8760 = 1 year) to trigger refresh on all running instances for testing.

Verification
- pnpm typecheck: passes
- pnpm test: 566/566 tests pass (30 test files), including 26 new tests across 5 new/modified test files
- pnpm lint: passes

Visual Changes
Old controller:
New controller:
Reviewer Notes
- The push to /_kilo/env/patch is always attempted. Old controllers return 404, which is caught gracefully. No manual version constant to maintain.
- The Fly config is updated with skipLaunch: true and minSecretsVersion from ensureEnvKey().
- State is persisted only when at least one delivery path succeeded (pushed || flyConfigUpdated). If both fail, the next alarm retries.
- Promise.race for the mint timeout clears the timer on success. Chosen over AbortSignal.timeout because Hyperdrive doesn't propagate abort signals.
- To test, set PROACTIVE_REFRESH_THRESHOLD_HOURS to 8760 (1 year) in wrangler vars; every running instance with a known expiry will refresh on its next alarm cycle (within 5 min).
- To find the logs, filter on tag:"reconcile" AND action:"api_key_*". Key events: api_key_refreshed (with pushed and flyConfigUpdated fields), api_key_push_error, api_key_refresh_failed_all_paths.
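The Promise.race timeout pattern mentioned in the notes, with the timer cleared once the work settles, could be sketched as follows. The `withTimeout` name and signature are assumptions; the PR only states that Promise.race is used for the mint timeout and that the timer is cleared on success.

```typescript
// Race `work` against a timer; always clear the timer so a fast success
// does not leave a pending timeout behind.
async function withTimeout<T>(work: Promise<T>, ms: number, label: string): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error(`${label} timed out after ${ms}ms`)), ms);
  });
  try {
    return await Promise.race([work, timeout]);
  } finally {
    // Runs on success, failure, and timeout alike.
    clearTimeout(timer);
  }
}
```

Note that this only stops waiting; it does not cancel the underlying work, which matches the stated reason for not relying on abort signals.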