Skip to content

Update to apache lucene 9.X#9207

Open
matthiasblaesing wants to merge 1 commit intoapache:masterfrom
matthiasblaesing:update-lucene3
Open

Update to apache lucene 9.X#9207
matthiasblaesing wants to merge 1 commit intoapache:masterfrom
matthiasblaesing:update-lucene3

Conversation

@matthiasblaesing
Copy link
Contributor

  • Additional package lucene-analysis-common required (KeywordAnalyzer, WhitespaceAnalyzer, PerFieldAnalyzerWrapper, CharTokenizer, LimitTokenCountAnalyzer)
  • FieldTypes were introduced to carry the field behavior (tokenization state, indexing options, storage settings)
  • BooleanQuery construction moved to builder
  • Collector interface was modified
  • TermEnum was replaced by TermsEnum
  • Terms are stored per field, so queries have to be specified with the target field
  • QuerySelectors were replaced by String sets
  • RAMDirecory was removed and is replaced by ByteBufferDirectories
  • Lock handling was reworked

@matthiasblaesing matthiasblaesing added this to the NB30 milestone Feb 14, 2026
@matthiasblaesing matthiasblaesing added ci:all-tests [ci] enable all tests ci:dev-build [ci] produce a dev-build zip artifact (7 days expiration, see link on workflow summary page) labels Feb 14, 2026
@matthiasblaesing matthiasblaesing marked this pull request as draft February 15, 2026 15:16
@lkishalmi
Copy link
Contributor

@matthiasblaesing This looks great! Thank you!

As far as I've tested. this generrally works, however there is a bug in the MemoryIndex that sometimes tries to search on a closed index.

@lkishalmi
Copy link
Contributor

Also is it possible to upgrade the base JDK version to Java 21 for parsing.lucene module?

@matthiasblaesing
Copy link
Contributor Author

Also is it possible to upgrade the base JDK version to Java 21 for parsing.lucene module?

I had an initial branch that did that (and bumped to lucene 10), but that will cause issues in the enterprise cluster, that runs with JDK17 for tests. @mbien gave the hint that lucene 9 allows to stay with the older java level.

As far as I've tested. this generrally works, however there is a bug in the MemoryIndex that sometimes tries to search on a closed index.

I'm seeing issues with LuceneIndex. There is a strange construction in LuceneIndex#doClear, which knocks away all locks on invocation. I think there is a race somewhere.

@lkishalmi
Copy link
Contributor

I think I can solve the issue of Java 21 on Enterprise Tests. What I see the majority of the fails comes from the Micronaut integration. I had some clash with that lately.

I also see that my PR broke your paperwork as the sig file needs to be re-generated for parsing.lucene

@matthiasblaesing
Copy link
Contributor Author

The micronaut problems are real. Real as in, I manually tested the case (started IDE with example project) and found completion broken. I'll look a bit further.

@lkishalmi
Copy link
Contributor

@matthiasblaesing please rebase on master, @mbien fixed Micronaut tests with Java 21.

@matthiasblaesing
Copy link
Contributor Author

The commits are rebased, but that does not seem to be a problem. I can reproduce the micronaut problem locally. The problem is not deterministic. I tested it by:

  • removing the test index folder used by the test rm -rf /tmp/nbcache/index/
  • run the micronaut tests
  • check the diffs from the test failures

I noticed changes in the packages there were not listed anymore. I then located the index folders for the jars and indeed the indices were at least partially broken.

That matches outputs in the unittests reporting: WARNUNG: Locked index folder: /tmp/nbcache/index/s8/java/15/refs. I'll stare a bit more at the LuceneIndex and its embedded DirCache. The lock schema feels off, but I can't point to the core problem yet.

@matthiasblaesing
Copy link
Contributor Author

I found at least one problem with the updated code: the locks move from per-factory to global and that explains the seemingly random behavior.

@matthiasblaesing
Copy link
Contributor Author

And I found the second round of problems:

  • ByteRefs are mutable and they are mutated when a TermsEnum is iterated. I.e. if the ByteRef is to be used after the TermsEnum moves to the next term, the old ByteRef might be modified. So a defensive copy is created now.
  • For class member search a case insensitive match is done. That works by transforming the token stream to lowercase. The old implementation does not work anymore and needs to be replaced by the LowerCaseFilter.

@matthiasblaesing matthiasblaesing marked this pull request as ready for review February 21, 2026 07:50
@lkishalmi
Copy link
Contributor

I've done some testing today. The old $NETBEANS_CACHE_DIR/index should be removed if there is one exists before this change, otherwise there are a bunch of IndexFormatTooOldException would be thrown. Though that should not be an issue. This is just an FYI for others reading this PR.

MemoryIndex still throws an AlreadyClosedException, after a file has been modified and auto-completion has been invoked.

Other probably biassed (not measured in any way) experience: The indexing seems to be a bit slower than before, though the index search seems to be faster. Never had a completion when it was not feeling instant. Will test this on Raspberry Pi.

Copy link
Contributor

@lkishalmi lkishalmi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm on the edge on getting this merged. Even when there are known issues, merging it allows more people to test/give feedback chance to debug.

@matthiasblaesing
Copy link
Contributor Author

I've done some testing today. The old $NETBEANS_CACHE_DIR/index should be removed if there is one exists before this change, otherwise there are a bunch of IndexFormatTooOldException would be thrown. Though that should not be an issue. This is just an FYI for others reading this PR.

Yeah. I noticed that when I switch between 10 and 9 for testing. I'll see if I can up with a method to allow a smother up/downgrade (most probably this will mean: nuke the exiting index).

MemoryIndex still throws an AlreadyClosedException, after a file has been modified and auto-completion has been invoked.

Found the issue. Somewhere between multiple iterations I move to a try-with-resource setup and this was a remaining not reverted change. The readers in MemoryIndex and LuceneIndex are shared and must not be closed at the use-sites. Should be fixed now.

Copy link
Member

@mbien mbien left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updates look good to me and some parts are also similar to some of my update attempts which I never finished since something else always seemed to show up.

I ran some smoke testing scripts which mostly test initial index creation and everything worked well. The index size and indexing times were fairly similar to NB 29 which I somewhat expected since I think to remember that the bottle neck wasn't the lucene end last time I looked (edit: index creation is also only a small piece of the puzzle, queries would be interesting to benchmark). Didn't see any exceptions and everything seemed to work after indexing - great job!

Yeah. I noticed that when I switch between 10 and 9 for testing. I'll see if I can up with a method to allow a smother up/downgrade (most probably this will mean: nuke the exiting index).

technically we don't have to worry about this too much since the NB upgrade process does not copy the index. There is also the option for adding the lucene backward-codecs dependency - this was done for the maven indexer which does actually move the index if its fresh enough.

This allows to use old indices for a bit longer before they have to be reset. (this does not cover lucene 3 but it is something we might consider when we update lucene the next time)

left one comment inline

@matthiasblaesing matthiasblaesing force-pushed the update-lucene3 branch 2 times, most recently from 029628e to d752c8f Compare February 25, 2026 17:06
@matthiasblaesing
Copy link
Contributor Author

Pushed an update:

  • bumped index versions for the java indexer that is nearer to the lucene index and the general indexing framework, that PHP, JS, CSS and the rest use (LuceneIndexFactory.java and JavaIndex.java)
  • integrated @mbien change
  • noticed a problem with JS indexing, that produces huge terms that can't be stored and implemented a work around (see LuceneIndex.java)

For comparison see:
https://github.com/apache/netbeans/compare/b9d392c3e630e3d8ddf236305e66e6f82b36d730..029628e796366e8f91835c666029bc50c1540ea8. The second force push is just a rebase to fix conflict with the merged signature files.

@mbien
Copy link
Member

mbien commented Feb 25, 2026

Ran the smoke test script again (opens and indexes ~850 NB modules) and everything worked fine without exceptions or indexer related warnings. Editor features worked after that too. Seems to be solid!

- Additional package lucene-analysis-common required
  (KeywordAnalyzer, WhitespaceAnalyzer, PerFieldAnalyzerWrapper, CharTokenizer,
  LimitTokenCountAnalyzer)
- FieldTypes were introduced to carry the field behavior (tokenization
  state, indexing options, storage settings)
- BooleanQuery construction moved to builder
- Collector interface was modified
- TermEnum was replaced by TermsEnum
- Terms are stored per field, so queries have to be specified with the
  target field
- QuerySelectors were replaced by String sets
- RAMDirecory was removed and is replaced by ByteBufferDirectories
- Lock handling was reworked
- Remove StoppableConvertor from Lucene Support
- Work around potentially excessivily large terms in index documents.
  (observed was a single `usage` entry for a JS document with about
   40_000 characters)

Co-authored-by: Laszlo Kishalmi <laszlo.kishalmi@gmail.com>
Co-authored-by: Michael Bien <mbien42@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ci:all-tests [ci] enable all tests ci:dev-build [ci] produce a dev-build zip artifact (7 days expiration, see link on workflow summary page) Editor Upgrade Library Library (Dependency) Upgrade

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Update Lucene 3.x editor dependency Apache Lucene 3.6.2 critical vulnerability issue - CVE-2017-12629

3 participants