Skip to content

Fix Chinese Whispers clustering to process all nodes each iteration#3133

Open
SamareshSingh wants to merge 1 commit intodavisking:masterfrom
SamareshSingh:fix/chinese-whispers-issue-2829
Open

Fix Chinese Whispers clustering to process all nodes each iteration#3133
SamareshSingh wants to merge 1 commit intodavisking:masterfrom
SamareshSingh:fix/chinese-whispers-issue-2829

Conversation

@SamareshSingh
Copy link

@SamareshSingh SamareshSingh commented Feb 2, 2026

Summary

Fixed a critical bug in the Chinese Whispers clustering algorithm where vectors within the distance threshold were not being grouped together correctly.

The Problem

The algorithm was supposed to perform num_iterations complete passes over all nodes in the graph, but instead it was using random node selection. This meant some nodes could be skipped entirely during an iteration, breaking the label propagation required for correct clustering.

For example, vectors with a distance of 0.371814 (below the 0.38 threshold) were ending up in different clusters when they should have been grouped together.

The Solution

Changed the algorithm to guarantee that every node is processed at least once per iteration using a Fisher-Yates shuffle approach:

  • Each iteration shuffles all node indices
  • Then processes each node sequentially in the shuffled order
  • This ensures complete label propagation through the graph while maintaining randomization

- Changed algorithm from random node selection to guaranteed sequential processing using Fisher-Yates shuffle per iteration
- Each iteration now shuffles all node indices and processes each sequentially, ensuring complete label propagation
@davisking
Copy link
Owner

Thanks but this isn't what the algorithm is supposed to be doing. I.e. the way it's written in dlib isn't a bug. What is here in this PR is a different, but related algorithm.

@davisking
Copy link
Owner

Although I agree this is what the original chinese whispers paper said to do. I forget at this point why the version in dlib deviates from that paper, but what's in dlib works really well for the applications it's used for so I don't want to go changing it. Might not be as good for existing users.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants