feat: optimize nacos discovery with privileged agent and LRU cache#12867
feat: optimize nacos discovery with privileged agent and LRU cache#128679268 wants to merge 12 commits intoapache:masterfrom
Conversation
There was a problem hiding this comment.
Pull request overview
This PR optimizes Nacos service discovery by moving registry fetch work to the privileged agent process and introducing a version-controlled per-worker LRU cache to reduce repeated JSON decoding.
Changes:
- Restrict Nacos registry fetching timers to the privileged agent process only.
- Add a per-worker LRU cache for decoded node lists keyed by a shared-dict “version”.
- Store and manage per-service
#versionentries in thenacosshared dict to drive cache invalidation.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
Hi @9268, please fix failed CI |
|
@Baoyuantop hi,ci has been fixed https://github.com/9268/apisix/actions/runs/21792352299 |
|
a code lint issue has been fixed; |
|
@Baoyuantop |
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
Hi @9268, there is a failed test that needs to be fixed. |
moonming
left a comment
There was a problem hiding this comment.
Hi @9268, thank you for optimizing the Nacos discovery with privileged agent and LRU cache!
Running Nacos discovery only in the privileged agent (rather than all workers) is a smart optimization that reduces redundant API calls. With 9 reviews already, this seems well-discussed.
To move forward:
- Please confirm all 9 review comments have been addressed
- Does the LRU cache properly handle Nacos service updates? What's the cache invalidation strategy?
- Any performance benchmarks showing the reduction in Nacos API calls?
This looks like a solid optimization. Let's get it to merge-ready! Thank you.
Overall, a two-level cache architecture is adopted:
LRU's own TTL: ttl = 60s, count = 1024, acts as a fallback eviction mechanism. Normally, version number changes trigger invalidation earlier. |
no benchmark tested,my old version apsix doesn't have such issue,after nacos alert,i noticed and fixed it |
…es, reduce lru ttl to 60s
|
Hi @9268, we recently refactored the service discovery code, and there are currently code conflicts in the PR. I was planning to help resolve them, but I’ve found the conflicts to be quite complex—it might require reimplementing the code based on the new service discovery implementation. Can you continue working on this? |
Description
This PR fixes a performance issue introduced by #12353 where all worker processes were fetching nacos service discovery information, causing unnecessary network requests and JSON parsing overhead per requests.
Changes made:
Performance improvements:
Which issue(s) this PR fixes:
Fixes #12353
Checklist