feat(controller): pair with controller for channel/device pairing#1095
Open
evanjacobson wants to merge 14 commits intomainfrom
Open
feat(controller): pair with controller for channel/device pairing#1095evanjacobson wants to merge 14 commits intomainfrom
evanjacobson wants to merge 14 commits intomainfrom
Conversation
…iented stdout monitoring
…and initial fetch
…ng, reduce type assertions
…ntroller startup/shutdown
- Add stopped guards to refresh methods preventing post-cleanup work - Replace unsafe type casts with isRecord() helper in route handlers - Wrap KV delete calls in try-catch to prevent cache invalidation failures from masking successful approvals - Fix cross-reference comments and JSDoc accuracy - Add tests covering empty-field filtering, post-cleanup behavior, detectChannels edge cases, start idempotency, post-approve refresh failures, and DO-level approve/refresh fallbacks
When a DO has userId/sandboxId but lost its flyMachineId (e.g. after a DO migration or state reset), the start() flow would try to create a new machine and hit a 409 "already_exists" from Fly. Two fixes: 1. Run attemptMetadataRecovery in start() when flyMachineId is missing, not just during full Postgres restore. 2. Safety net: if createNewMachine gets a 409 "already_exists", extract the machine ID from the error body and adopt it via startExistingMachine.
The volume verification was guarded by !flyMachineId, so it was skipped after metadata recovery set the machine ID. A stale flyVolumeId would then be passed to updateMachine, causing a 400 "volume does not exist" error.
| err.status === 409 && | ||
| err.body.includes('already_exists') | ||
| ) { | ||
| const match = err.body.match(/machine ID (\w+)/); |
Contributor
There was a problem hiding this comment.
WARNING: Brittle 409 recovery depends on one error-string format
This safety net only adopts the existing machine when the 409 body matches /machine ID (\w+)/. If Fly still returns already_exists but changes the message wording, this branch rethrows and the start path fails instead of recovering the orphaned machine. Since this code is meant to handle lost flyMachineId state, it would be safer to fall back to a deterministic recovery path (for example metadata recovery or a machine lookup by sandbox name) rather than gating recovery on one exact error string.
Contributor
Code Review SummaryStatus: 1 Issue Found | Recommendation: Address before merge Overview
Fix these issues in Kilo Cloud Issue Details (click to expand)WARNING
Other Observations (not in diff)None. Files Reviewed (16 files)
Reviewed by gpt-5.4-20260305 · 1,145,707 tokens |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Adds a controller-side pairing cache and HTTP API so the Durable Object can resolve pairing requests via the controller's in-memory cache instead of shelling out via
fly execevery time.kilo pairing list/kilo device-pairing list, parses stdout, and serves cached results. Refreshes reactively when the gateway logs a new pairing event (via a newonStdoutLinesupervisor hook)./_kilo/pairing/{channels,devices}and/_kilo/pairing/{channels,devices}/approve.fly execon unknown-route errors, preserving backward compatibility with older controller versions.createMachinehits a 409 "already_exists", adopts the existing machine instead of crashing.Verification
pairing-cache.test.tsroutes/pairing.test.tsonStdoutLinetests (135 lines) —supervisor.test.tskiloclaw-instance.test.tscovering controller-first pairing, fly exec fallback, 409 recovery, volume verificationVisual Changes
N/A
Reviewer Notes
isErrorUnknownRouteto detect older controllers that don't have the pairing routes, so rollout is safe without version gating.