Skip to content

feat: optimize nacos discovery with privileged agent and LRU cache#12867

Open
9268 wants to merge 12 commits intoapache:masterfrom
9268:master
Open

feat: optimize nacos discovery with privileged agent and LRU cache#12867
9268 wants to merge 12 commits intoapache:masterfrom
9268:master

Conversation

@9268
Copy link
Copy Markdown

@9268 9268 commented Jan 7, 2026

Description

This PR fixes a performance issue introduced by #12353 where all worker processes were fetching nacos service discovery information, causing unnecessary network requests and JSON parsing overhead per requests.

Changes made:

  1. Restricted nacos discovery to privileged agent only - Only the privileged agent process now fetches service information from nacos servers
  2. Added LRU cache optimization - Worker processes use version-controlled LRU cache to avoid repeated JSON parsing
  3. Improved performance - Reduced shared memory access and eliminated redundant JSON decoding

Performance improvements:

  • Eliminates duplicate nacos API calls from multiple worker processes
  • Reduces JSON parsing overhead through LRU caching
  • Maintains data consistency through version control mechanism

Which issue(s) this PR fixes:

Fixes #12353

Checklist

  • I have explained the need for this PR and the problem it solves
  • I have explained the changes or the new features added to this PR
  • I have added tests corresponding to this change
  • I have updated the documentation to reflect this change
  • I have verified that this change is backward compatible (If not, please discuss on the APISIX mailing list first)

@9268 9268 marked this pull request as ready for review January 7, 2026 06:39
@dosubot dosubot bot added size:M This PR changes 30-99 lines, ignoring generated files. performance generate flamegraph for the current PR labels Jan 7, 2026
Comment thread apisix/discovery/nacos/init.lua Outdated
@Baoyuantop Baoyuantop requested a review from Copilot January 26, 2026 08:27
@Baoyuantop Baoyuantop added the wait for update wait for the author's response in this issue/PR label Jan 26, 2026
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR optimizes Nacos service discovery by moving registry fetch work to the privileged agent process and introducing a version-controlled per-worker LRU cache to reduce repeated JSON decoding.

Changes:

  • Restrict Nacos registry fetching timers to the privileged agent process only.
  • Add a per-worker LRU cache for decoded node lists keyed by a shared-dict “version”.
  • Store and manage per-service #version entries in the nacos shared dict to drive cache invalidation.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread apisix/discovery/nacos/init.lua
Comment thread apisix/discovery/nacos/init.lua
Comment thread apisix/discovery/nacos/init.lua Outdated
Comment thread apisix/discovery/nacos/init.lua Outdated
Comment thread apisix/discovery/nacos/init.lua Outdated
@dosubot dosubot bot added size:L This PR changes 100-499 lines, ignoring generated files. and removed size:M This PR changes 30-99 lines, ignoring generated files. labels Jan 27, 2026
@9268 9268 requested a review from Baoyuantop January 27, 2026 11:23
@Baoyuantop
Copy link
Copy Markdown
Contributor

Hi @9268, please fix failed CI

@9268
Copy link
Copy Markdown
Author

9268 commented Feb 8, 2026

@Baoyuantop Baoyuantop added awaiting review and removed wait for update wait for the author's response in this issue/PR user responded labels Feb 9, 2026
@9268
Copy link
Copy Markdown
Author

9268 commented Feb 9, 2026

a code lint issue has been fixed;

@9268
Copy link
Copy Markdown
Author

9268 commented Feb 24, 2026

@Baoyuantop
It seems that the error in the GitHub Actions is a compilation error, and this PR does not involve that part. Could you please rerun the failing part? In my fork's branch, the GitHub Actions succeeded except for the Changelog check. https://github.com/9268/apisix

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread apisix/discovery/nacos/init.lua
Baoyuantop
Baoyuantop previously approved these changes Mar 6, 2026
@Baoyuantop
Copy link
Copy Markdown
Contributor

Hi @9268, there is a failed test that needs to be fixed.

Copy link
Copy Markdown
Member

@moonming moonming left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @9268, thank you for optimizing the Nacos discovery with privileged agent and LRU cache!

Running Nacos discovery only in the privileged agent (rather than all workers) is a smart optimization that reduces redundant API calls. With 9 reviews already, this seems well-discussed.

To move forward:

  1. Please confirm all 9 review comments have been addressed
  2. Does the LRU cache properly handle Nacos service updates? What's the cache invalidation strategy?
  3. Any performance benchmarks showing the reduction in Nacos API calls?

This looks like a solid optimization. Let's get it to merge-ready! Thank you.

Comment thread apisix/discovery/nacos/init.lua
@9268
Copy link
Copy Markdown
Author

9268 commented Mar 30, 2026

Hi @9268, thank you for optimizing the Nacos discovery with privileged agent and LRU cache!

Running Nacos discovery only in the privileged agent (rather than all workers) is a smart optimization that reduces redundant API calls. With 9 reviews already, this seems well-discussed.

To move forward:

  1. Please confirm all 9 review comments have been addressed
  2. Does the LRU cache properly handle Nacos service updates? What's the cache invalidation strategy?
  3. Any performance benchmarks showing the reduction in Nacos API calls?

This looks like a solid optimization. Let's get it to merge-ready! Thank you.

Overall, a two-level cache architecture is adopted:
Nacos API → (periodic pull) → nacos_dict (shared memory) → nodes_lrucache (worker-local)
The invalidation strategy is version-driven:

  1. On write (fetch_from_host): After each pull, calculate the CRC32 of the node list using ngx.crc32_long(content) as the version number, and write to key#version. If the node content remains unchanged, the version number stays the same.
  2. On read (_M.nodes): First retrieve key#version from nacos_dict, then pass it as the version parameter to nodes_lrucache(key, nodes_version, ...). The LRU cache internally compares the version number; if the version number changes, the local cache is automatically invalidated and reloaded from nacos_dict.
  3. On service decommissioning: At the end of fetch_from_host, compare curr_service_in_use, delete the key and key#version of services that are no longer in use. The next _M.nodes call will not find the version number and will fall back to returning nil.

LRU's own TTL: ttl = 60s, count = 1024, acts as a fallback eviction mechanism. Normally, version number changes trigger invalidation earlier.

@9268
Copy link
Copy Markdown
Author

9268 commented Mar 30, 2026

Hi @9268, thank you for optimizing the Nacos discovery with privileged agent and LRU cache!

Running Nacos discovery only in the privileged agent (rather than all workers) is a smart optimization that reduces redundant API calls. With 9 reviews already, this seems well-discussed.

To move forward:

  1. Please confirm all 9 review comments have been addressed
  2. Does the LRU cache properly handle Nacos service updates? What's the cache invalidation strategy?
  3. Any performance benchmarks showing the reduction in Nacos API calls?

This looks like a solid optimization. Let's get it to merge-ready! Thank you.

no benchmark tested,my old version apsix doesn't have such issue,after nacos alert,i noticed and fixed it

Baoyuantop
Baoyuantop previously approved these changes Apr 2, 2026
@Baoyuantop
Copy link
Copy Markdown
Contributor

Hi @9268, we recently refactored the service discovery code, and there are currently code conflicts in the PR. I was planning to help resolve them, but I’ve found the conflicts to be quite complex—it might require reimplementing the code based on the new service discovery implementation. Can you continue working on this?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

awaiting review performance generate flamegraph for the current PR size:L This PR changes 100-499 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants