Fix Chinese Whispers clustering to process all nodes each iteration#3133
Open
SamareshSingh wants to merge 1 commit intodavisking:masterfrom
Open
Fix Chinese Whispers clustering to process all nodes each iteration#3133SamareshSingh wants to merge 1 commit intodavisking:masterfrom
SamareshSingh wants to merge 1 commit intodavisking:masterfrom
Conversation
- Changed algorithm from random node selection to guaranteed sequential processing using Fisher-Yates shuffle per iteration - Each iteration now shuffles all node indices and processes each sequentially, ensuring complete label propagation
Owner
|
Thanks but this isn't what the algorithm is supposed to be doing. I.e. the way it's written in dlib isn't a bug. What is here in this PR is a different, but related algorithm. |
Owner
|
Although I agree this is what the original chinese whispers paper said to do. I forget at this point why the version in dlib deviates from that paper, but what's in dlib works really well for the applications it's used for so I don't want to go changing it. Might not be as good for existing users. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Fixed a critical bug in the Chinese Whispers clustering algorithm where vectors within the distance threshold were not being grouped together correctly.
The Problem
The algorithm was supposed to perform
num_iterationscomplete passes over all nodes in the graph, but instead it was using random node selection. This meant some nodes could be skipped entirely during an iteration, breaking the label propagation required for correct clustering.For example, vectors with a distance of 0.371814 (below the 0.38 threshold) were ending up in different clusters when they should have been grouped together.
The Solution
Changed the algorithm to guarantee that every node is processed at least once per iteration using a Fisher-Yates shuffle approach: