Skip to content

feat(controller): pair with controller for channel/device pairing#1095

Open
evanjacobson wants to merge 14 commits intomainfrom
feature/kiloclaw-pair-with-controller
Open

feat(controller): pair with controller for channel/device pairing#1095
evanjacobson wants to merge 14 commits intomainfrom
feature/kiloclaw-pair-with-controller

Conversation

@evanjacobson
Copy link
Contributor

Summary

Adds a controller-side pairing cache and HTTP API so the Durable Object can resolve pairing requests via the controller's in-memory cache instead of shelling out via fly exec every time.

  • Pairing cache on the controller periodically runs kilo pairing list / kilo device-pairing list, parses stdout, and serves cached results. Refreshes reactively when the gateway logs a new pairing event (via a new onStdoutLine supervisor hook).
  • HTTP routes at /_kilo/pairing/{channels,devices} and /_kilo/pairing/{channels,devices}/approve.
  • DO pairing methods now try the controller HTTP API first and fall back to fly exec on unknown-route errors, preserving backward compatibility with older controller versions.
  • Zod schemas for typed controller pairing responses.
  • 409 machine recovery: if createMachine hits a 409 "already_exists", adopts the existing machine instead of crashing.
  • Volume existence check after machine ID recovery to prevent startups against a missing volume.

Verification

  • Pairing cache tests (749 lines) — pairing-cache.test.ts
  • Pairing route tests (410 lines) — routes/pairing.test.ts
  • Supervisor onStdoutLine tests (135 lines) — supervisor.test.ts
  • DO integration tests (519+ new lines) — kiloclaw-instance.test.ts covering controller-first pairing, fly exec fallback, 409 recovery, volume verification
  • Manual testing — paired Telegram channel

Visual Changes

N/A

Reviewer Notes

  • The controller-first / fly-exec-fallback pattern uses isErrorUnknownRoute to detect older controllers that don't have the pairing routes, so rollout is safe without version gating.
  • The pairing cache debounces rapid refresh triggers and caps concurrent CLI invocations to avoid overwhelming the gateway process.

- Add stopped guards to refresh methods preventing post-cleanup work
- Replace unsafe type casts with isRecord() helper in route handlers
- Wrap KV delete calls in try-catch to prevent cache invalidation failures
  from masking successful approvals
- Fix cross-reference comments and JSDoc accuracy
- Add tests covering empty-field filtering, post-cleanup behavior,
  detectChannels edge cases, start idempotency, post-approve refresh
  failures, and DO-level approve/refresh fallbacks
When a DO has userId/sandboxId but lost its flyMachineId (e.g. after
a DO migration or state reset), the start() flow would try to create
a new machine and hit a 409 "already_exists" from Fly.

Two fixes:
1. Run attemptMetadataRecovery in start() when flyMachineId is missing,
   not just during full Postgres restore.
2. Safety net: if createNewMachine gets a 409 "already_exists", extract
   the machine ID from the error body and adopt it via startExistingMachine.
The volume verification was guarded by !flyMachineId, so it was
skipped after metadata recovery set the machine ID. A stale
flyVolumeId would then be passed to updateMachine, causing a 400
"volume does not exist" error.
err.status === 409 &&
err.body.includes('already_exists')
) {
const match = err.body.match(/machine ID (\w+)/);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

WARNING: Brittle 409 recovery depends on one error-string format

This safety net only adopts the existing machine when the 409 body matches /machine ID (\w+)/. If Fly still returns already_exists but changes the message wording, this branch rethrows and the start path fails instead of recovering the orphaned machine. Since this code is meant to handle lost flyMachineId state, it would be safer to fall back to a deterministic recovery path (for example metadata recovery or a machine lookup by sandbox name) rather than gating recovery on one exact error string.

@kilo-code-bot
Copy link
Contributor

kilo-code-bot bot commented Mar 13, 2026

Code Review Summary

Status: 1 Issue Found | Recommendation: Address before merge

Overview

Severity Count
CRITICAL 0
WARNING 1
SUGGESTION 0

Fix these issues in Kilo Cloud

Issue Details (click to expand)

WARNING

File Line Issue
kiloclaw/src/durable-objects/kiloclaw-instance/fly-machines.ts 234 409 recovery depends on parsing one Fly error-string format, so a wording change can turn the new safety-net path back into a hard startup failure.
Other Observations (not in diff)

None.

Files Reviewed (16 files)
  • kiloclaw/controller/src/index.ts - 0 issues
  • kiloclaw/controller/src/pairing-cache.test.ts - 0 issues
  • kiloclaw/controller/src/pairing-cache.ts - 0 issues
  • kiloclaw/controller/src/routes/pairing.test.ts - 0 issues
  • kiloclaw/controller/src/routes/pairing.ts - 0 issues
  • kiloclaw/controller/src/supervisor.test.ts - 0 issues
  • kiloclaw/controller/src/supervisor.ts - 0 issues
  • kiloclaw/scripts/.dev-start.conf.example - 0 issues
  • kiloclaw/src/durable-objects/gateway-controller-types.ts - 0 issues
  • kiloclaw/src/durable-objects/kiloclaw-instance.test.ts - 0 issues
  • kiloclaw/src/durable-objects/kiloclaw-instance/fly-machines.ts - 1 issue
  • kiloclaw/src/durable-objects/kiloclaw-instance/gateway.ts - 0 issues
  • kiloclaw/src/durable-objects/kiloclaw-instance/index.ts - 0 issues
  • kiloclaw/src/durable-objects/kiloclaw-instance/pairing.ts - 0 issues
  • kiloclaw/src/routes/platform.ts - 0 issues
  • src/app/(app)/claw/components/modelSupport.ts - 0 issues

Reviewed by gpt-5.4-20260305 · 1,145,707 tokens

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant