feat: add destroy() to MdkNode to fix 402 zombie node race condition #30

Merged: martinsaposnic merged 2 commits into main from fix/destroy-node-serverless on Feb 20, 2026

@martinsaposnic (Contributor)

Summary

Adds a destroy() method to MdkNode that explicitly drops the inner Rust Node and its tokio runtime, preventing zombie node race conditions on serverless platforms.

Problem

402 (machine-to-machine) payments fail intermittently with "retries exhausted" on serverless platforms (Vercel/Lambda). The root cause is a race condition between two MdkNode instances for the same wallet:

  1. create402Response() builds MdkNode #1, calls getInvoice() (which internally starts/stops the node), and returns the 402 response
  2. MdkNode #1's JavaScript object goes out of scope but is not garbage collected - the Rust Node and its tokio runtime (with peer reconnection loop) survive
  3. Agent pays instantly (<1 second)
  4. LSP intercepts HTLC, sends webhook
  5. Webhook handler builds MdkNode #2 for the same wallet, calls startReceiving()
  6. MdkNode #1's reconnection loop fires, connects to the LSP with the same node identity
  7. LSP replaces MdkNode #2's connection with MdkNode #1's (BOLT 8: same pubkey = replace old connection)
  8. LSP sends open_channel to MdkNode #1 (stopped, can't process it)
  9. The 10-second HTLC timeout expires and the payment is failed back
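
Steps 5 to 8 can be replayed with a toy model. The sketch below is illustrative only: PeerTable, replay_race, the connection ids and the wallet pubkey string are all hypothetical, and the "one connection per pubkey" rule is taken from the description above rather than from any LSP source code.

```rust
use std::collections::HashMap;

// Hypothetical connection ids for the two MdkNode instances.
const ZOMBIE_CONN: u32 = 1; // MdkNode #1, stopped but not yet dropped
const FRESH_CONN: u32 = 2; // MdkNode #2, built by the webhook handler

/// Hypothetical LSP-side peer table: at most one live connection per node pubkey.
struct PeerTable {
    by_pubkey: HashMap<String, u32>,
}

impl PeerTable {
    fn new() -> Self {
        PeerTable { by_pubkey: HashMap::new() }
    }

    /// Registers a connection and returns the one it displaced, if any.
    fn connect(&mut self, pubkey: &str, conn_id: u32) -> Option<u32> {
        self.by_pubkey.insert(pubkey.to_string(), conn_id)
    }

    /// The connection that would receive messages addressed to this pubkey.
    fn route_for(&self, pubkey: &str) -> Option<u32> {
        self.by_pubkey.get(pubkey).copied()
    }
}

/// Replays steps 5 to 8 and returns the connection that would get open_channel.
fn replay_race() -> Option<u32> {
    let mut lsp = PeerTable::new();
    let wallet_pubkey = "02aa"; // placeholder identity shared by both nodes

    lsp.connect(wallet_pubkey, FRESH_CONN); // step 5: MdkNode #2 connects
    lsp.connect(wallet_pubkey, ZOMBIE_CONN); // steps 6-7: zombie reconnects, displacing #2
    lsp.route_for(wallet_pubkey) // step 8: open_channel goes here
}
```

In this model replay_race() yields the zombie's connection id, which is the failure mode described: the message lands on a node that can no longer process it.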

In normal checkout this doesn't happen because the human delay (5-30s between invoice creation and payment) gives V8 GC time to collect the old node.

Why stop() isn't enough

Node::stop() disconnects peers and signals background tasks, but:

  • Does not clear the peer store (persisted in VSS, survives stop/start)
  • Does not shut down the tokio runtime (owned by Arc<Runtime>, only dies on Node::drop())
  • The reconnection loop reads from the persisted peer store on each tick
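
The difference between stopping and dropping can be shown with plain threads. This is a minimal sketch under strong assumptions: ToyNode is a hypothetical stand-in, a std thread plays the role of the tokio runtime, and stop() is modelled exactly as the bullets above describe it (peer disconnected, loop left alive). It is not ldk-node's implementation.

```rust
use std::sync::atomic::{AtomicBool, AtomicUsize, Ordering};
use std::sync::Arc;
use std::thread::{self, JoinHandle};
use std::time::Duration;

/// Hypothetical node whose reconnection loop lives as long as its "runtime" thread.
struct ToyNode {
    connected: Arc<AtomicBool>,
    shutdown: Arc<AtomicBool>,
    reconnects: Arc<AtomicUsize>,
    worker: Option<JoinHandle<()>>,
}

impl ToyNode {
    fn start() -> Self {
        let connected = Arc::new(AtomicBool::new(true));
        let shutdown = Arc::new(AtomicBool::new(false));
        let reconnects = Arc::new(AtomicUsize::new(0));
        let (c, s, r) = (
            Arc::clone(&connected),
            Arc::clone(&shutdown),
            Arc::clone(&reconnects),
        );
        let worker = thread::spawn(move || {
            while !s.load(Ordering::SeqCst) {
                // Reconnection tick: if the peer is down, dial it again.
                if !c.load(Ordering::SeqCst) {
                    c.store(true, Ordering::SeqCst);
                    r.fetch_add(1, Ordering::SeqCst);
                }
                thread::sleep(Duration::from_millis(1));
            }
        });
        ToyNode { connected, shutdown, reconnects, worker: Some(worker) }
    }

    /// Modelled stop(): disconnects the peer but leaves the loop running.
    fn stop(&self) {
        self.connected.store(false, Ordering::SeqCst);
    }

    /// Shared handle so callers can observe reconnects after the node is gone.
    fn reconnect_counter(&self) -> Arc<AtomicUsize> {
        Arc::clone(&self.reconnects)
    }
}

impl Drop for ToyNode {
    /// Only dropping the node shuts the loop down and joins the worker thread.
    fn drop(&mut self) {
        self.shutdown.store(true, Ordering::SeqCst);
        if let Some(worker) = self.worker.take() {
            let _ = worker.join();
        }
    }
}

/// Polls until the counter reaches `target` or roughly two seconds pass.
fn wait_for(counter: &AtomicUsize, target: usize) -> bool {
    for _ in 0..2000 {
        if counter.load(Ordering::SeqCst) >= target {
            return true;
        }
        thread::sleep(Duration::from_millis(1));
    }
    false
}
```

After stop() the counter still advances because the loop re-dials; after the node is dropped the worker has been joined, so the counter can no longer change.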

Solution

  • Wrap the inner Node in Option<Node>
  • destroy() calls node.take() + stop() + drop(), killing the tokio runtime and all background tasks immediately
  • All other methods go through a node() helper that panics with a clear message if called after destroy
  • TypeScript types updated
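
The Option<Node> pattern from the bullets above might look roughly like this. It is a sketch, not the code in this PR: the Node here is a hypothetical stand-in with an `alive` flag so the drop is observable, the napi bindings are omitted, the panic message is invented, and making a second destroy() a no-op is an assumption.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

/// Hypothetical stand-in for the inner Rust node.
struct Node {
    alive: Arc<AtomicBool>,
}

impl Node {
    /// Placeholder: the real stop() disconnects peers and signals tasks.
    fn stop(&self) {}

    fn node_id(&self) -> String {
        "node-1".to_string()
    }
}

impl Drop for Node {
    /// Stands in for the runtime and its background tasks being torn down.
    fn drop(&mut self) {
        self.alive.store(false, Ordering::SeqCst);
    }
}

/// Wrapper exposed to JS callers; `inner` becomes None after destroy().
struct MdkNode {
    inner: Option<Node>,
}

impl MdkNode {
    fn new(alive: Arc<AtomicBool>) -> Self {
        alive.store(true, Ordering::SeqCst);
        MdkNode { inner: Some(Node { alive }) }
    }

    /// Every other method goes through this helper and panics after destroy().
    fn node(&self) -> &Node {
        self.inner
            .as_ref()
            .expect("MdkNode used after destroy()") // message text is an assumption
    }

    fn get_node_id(&self) -> String {
        self.node().node_id()
    }

    /// take() + stop() + drop(); assumed to be a no-op when called twice.
    fn destroy(&mut self) {
        if let Some(node) = self.inner.take() {
            node.stop();
            drop(node);
        }
    }
}
```

Taking the value out of the Option is what lets destroy() drop the node through a mutable reference to the wrapper, without waiting for the wrapper itself to be garbage collected.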

Usage in the 402 flow

const invoice = await node.getInvoice(...);
node.destroy(); // kills tokio runtime before agent can pay
return create402Response(invoice);

This is safe because getInvoice() is fully synchronous via block_on() - by the time it returns, the LSPS4 negotiation is complete and the invoice string is in JS land. The client never persists a local SCID mapping (only the LSP does).

Test plan

  • cargo check passes
  • Verify 402 flow on regtest: create invoice, destroy node, pay from agent, confirm webhook handler claims successfully
  • Verify normal checkout still works (destroy not called in that path yet)
  • Verify calling methods after destroy() panics with clear message

On serverless platforms (Vercel/Lambda), MdkNode's inner Rust Node
and its tokio runtime survive after getInvoice() returns because V8
GC is non-deterministic. When the agent pays a 402 invoice instantly
(<1s), the webhook handler creates a second MdkNode for the same
wallet while the first is still alive. The zombie's reconnection
loop steals the LSP peer connection from the new node, preventing
the JIT channel from being established and causing "retries exhausted"
payment failures.

In normal checkout this doesn't happen because the human delay (5-30s)
gives GC time to collect the old node before the webhook fires.

destroy() wraps the inner Node in Option<Node>, allowing JS callers
to explicitly drop the Rust Node and its tokio runtime immediately
after invoice creation, eliminating the race condition.
@martinsaposnic merged commit 71c722d into main on Feb 20, 2026
12 checks passed
martinsaposnic added a commit that referenced this pull request Feb 20, 2026
…30)

* feat: add destroy() method to MdkNode for explicit cleanup

* style: fix cargo fmt