feat: add destroy() to MdkNode to fix 402 zombie node race condition by martinsaposnic · Pull Request #30 · moneydevkit/lightning-js

martinsaposnic · 2026-02-20T14:47:18Z

Summary

Adds a destroy() method to MdkNode that explicitly drops the inner Rust Node and its tokio runtime, preventing zombie node race conditions on serverless platforms.

Problem

402 (machine-to-machine) payments fail intermittently with "retries exhausted" on serverless platforms (Vercel/Lambda). The root cause is a race condition between two MdkNode instances for the same wallet:

1. create402Response() builds MdkNode #1, calls getInvoice() (which internally starts and stops the node), and returns the 402 response.
2. MdkNode #1's JavaScript object goes out of scope but is not yet garbage collected, so the Rust Node and its tokio runtime (with the peer reconnection loop) survive.
3. The agent pays instantly (<1 second).
4. The LSP intercepts the HTLC and sends a webhook.
5. The webhook handler builds MdkNode #2 for the same wallet and calls startReceiving().
6. MdkNode #1's reconnection loop fires and connects to the LSP with the same node identity.
7. The LSP replaces MdkNode #2's connection with MdkNode #1's (BOLT 8: same pubkey = replace old connection).
8. The LSP sends open_channel to MdkNode #1, which is stopped and can't process it.
9. The 10-second HTLC timeout expires and the payment is failed back.

In normal checkout this doesn't happen because the human delay (5-30s between invoice creation and payment) gives V8 GC time to collect the old node.

Why `stop()` isn't enough

Node::stop() disconnects peers and signals background tasks, but:

- It does not clear the peer store (persisted in VSS, survives stop/start).
- It does not shut down the tokio runtime (owned by Arc<Runtime>, only dies on Node::drop()).
- The reconnection loop reads from the persisted peer store on each tick.
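
The difference between stopping and dropping can be illustrated with a minimal sketch that uses only the standard library. Everything here is a hypothetical stand-in: `Runtime` is a plain thread wrapper rather than tokio, `Node` is not the ldk-node type, and the tick counter only models a reconnection loop. It assumes, as described above, that stop() merely flips a flag while the background loop is torn down only when the owning value is dropped.

```rust
use std::sync::atomic::{AtomicBool, AtomicUsize, Ordering};
use std::sync::Arc;
use std::thread::{self, JoinHandle};
use std::time::Duration;

/// Hypothetical stand-in for the runtime: owns one background "reconnection" thread.
struct Runtime {
    shutdown: Arc<AtomicBool>,
    worker: Option<JoinHandle<()>>,
}

impl Runtime {
    /// Spawns a loop that bumps `ticks` every few milliseconds until shutdown is requested.
    fn spawn(ticks: Arc<AtomicUsize>) -> Self {
        let shutdown = Arc::new(AtomicBool::new(false));
        let flag = Arc::clone(&shutdown);
        let worker = thread::spawn(move || {
            while !flag.load(Ordering::SeqCst) {
                ticks.fetch_add(1, Ordering::SeqCst);
                thread::sleep(Duration::from_millis(5));
            }
        });
        Runtime { shutdown, worker: Some(worker) }
    }
}

impl Drop for Runtime {
    /// The loop is only torn down here: request shutdown, then wait for the thread to exit.
    fn drop(&mut self) {
        self.shutdown.store(true, Ordering::SeqCst);
        if let Some(worker) = self.worker.take() {
            let _ = worker.join();
        }
    }
}

/// Hypothetical stand-in for the node: stop() changes state but keeps the runtime.
struct Node {
    running: AtomicBool,
    _runtime: Runtime,
}

impl Node {
    fn start(ticks: Arc<AtomicUsize>) -> Self {
        Node { running: AtomicBool::new(true), _runtime: Runtime::spawn(ticks) }
    }

    /// Marks the node as stopped; the background loop is untouched.
    fn stop(&self) {
        self.running.store(false, Ordering::SeqCst);
    }

    fn is_running(&self) -> bool {
        self.running.load(Ordering::SeqCst)
    }
}
```

In this model the tick counter keeps growing after stop() and freezes only once the node value is dropped, which mirrors the zombie behavior described in the Problem section.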

Solution

- Wrap the inner Node in Option<Node>.
- destroy() calls node.take() + stop() + drop(), killing the tokio runtime and all background tasks immediately.
- All other methods go through a node() helper that panics with a clear message if called after destroy.
- TypeScript types updated.
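
The Option<Node> pattern described above can be sketched roughly as follows. This is not the code from the PR: `Node` is a hypothetical stand-in for the ldk-node type, the `alive` flag exists only so the drop is observable, the panic message is invented, and making destroy() idempotent is an assumption of this sketch. The real binding also carries napi attributes that are omitted here.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

/// Hypothetical stand-in for the inner node; `alive` flips to false when it is dropped.
struct Node {
    alive: Arc<AtomicBool>,
}

impl Node {
    fn stop(&self) -> Result<(), String> {
        Ok(())
    }

    fn node_id(&self) -> String {
        "stand-in-node".to_string()
    }
}

impl Drop for Node {
    fn drop(&mut self) {
        // In the real type this is where the runtime and background tasks would go away.
        self.alive.store(false, Ordering::SeqCst);
    }
}

/// Wrapper exposed to JS callers; holds the node in an Option so it can be taken out.
struct MdkNode {
    node: Option<Node>,
}

impl MdkNode {
    fn new(alive: Arc<AtomicBool>) -> Self {
        MdkNode { node: Some(Node { alive }) }
    }

    /// Accessor used by every other method; panics with a clear message after destroy().
    fn node(&self) -> &Node {
        self.node
            .as_ref()
            .expect("MdkNode has been destroyed; create a new instance")
    }

    fn node_id(&self) -> String {
        self.node().node_id()
    }

    /// Takes the node out of the Option, stops it, and drops it right away.
    /// Calling it a second time is a no-op (an assumption of this sketch).
    fn destroy(&mut self) {
        if let Some(node) = self.node.take() {
            let _ = node.stop();
            drop(node);
        }
    }
}
```

Routing every method through node() keeps the "destroyed" check in one place, so a use-after-destroy fails loudly instead of touching a half-torn-down node.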

Usage in the 402 flow

const invoice = await node.getInvoice(...);
node.destroy(); // kills tokio runtime before agent can pay
return create402Response(invoice);

This is safe because getInvoice() is fully synchronous via block_on() - by the time it returns, the LSPS4 negotiation is complete and the invoice string is in JS land. The client never persists a local SCID mapping (only the LSP does).

Test plan

- cargo check passes
- Verify 402 flow on regtest: create invoice, destroy node, pay from agent, confirm webhook handler claims successfully
- Verify normal checkout still works (destroy not called in that path yet)
- Verify calling methods after destroy() panics with clear message

On serverless platforms (Vercel/Lambda), MdkNode's inner Rust Node and its tokio runtime survive after getInvoice() returns because V8 GC is non-deterministic. When the agent pays a 402 invoice instantly (<1s), the webhook handler creates a second MdkNode for the same wallet while the first is still alive. The zombie's reconnection loop steals the LSP peer connection from the new node, preventing the JIT channel from being established and causing "retries exhausted" payment failures. In normal checkout this doesn't happen because the human delay (5-30s) gives GC time to collect the old node before the webhook fires. destroy() wraps the inner Node in Option<Node>, allowing JS callers to explicitly drop the Rust Node and its tokio runtime immediately after invoice creation, eliminating the race condition.


martinsaposnic added 2 commits February 20, 2026 11:46

style: fix cargo fmt

fc6f422

f3r10 approved these changes Feb 20, 2026

martinsaposnic merged commit 71c722d into main Feb 20, 2026
12 checks passed

martinsaposnic merged 2 commits into main from fix/destroy-node-serverless
