Open source HRIBLT #94
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,12 @@ | ||
| [package] | ||
| name = "hriblt" | ||
| version = "0.1.0" | ||
| edition = "2024" | ||
| description = "Algorithm for rateless set reconciliation" | ||
| repository = "https://github.com/github/rust-gems" | ||
| license = "MIT" | ||
| keywords = ["set-reconciliation", "sync", "algorithm", "probabilistic"] | ||
| categories = ["algorithms", "data-structures", "mathematics", "science"] | ||
|
|
||
| [dependencies] | ||
| thiserror = "2" | ||
| Original file line number | Diff line number | Diff line change | ||||||||
|---|---|---|---|---|---|---|---|---|---|---|
| @@ -0,0 +1,55 @@ | ||||||||||
| # Hierarchical Rateless Bloom Lookup Tables | ||||||||||
|
|
||||||||||
| A novel algorithm for computing the symmetric difference between sets where the amount of data shared is proportional to the size of the difference in the sets rather than proportional to the overall size. | ||||||||||
|
|
||||||||||
| ## Usage | ||||||||||
|
|
||||||||||
| Add the library to your `Cargo.toml` file. | ||||||||||
|
|
||||||||||
| ```toml | ||||||||||
| [dependencies] | ||||||||||
| hriblt = "0.1" | ||||||||||
| ``` | ||||||||||
|
|
||||||||||
| Create two encoding sessions, one containing your data and another containing the counterparty's data. The counterparty's data might, for example, have been sent to you over a network. | ||||||||||
|
Collaborator
replace your and counter-party with Alice and Bob from the start... |
||||||||||
|
|
||||||||||
| The following example attempts to reconcile the differences between two sets of `u64` integers, and is done from the perspective of "Bob", who has recieved some symbols from "Alice". | ||||||||||
|
||||||||||
| The following example attempts to reconcile the differences between two sets of `u64` integers, and is done from the perspective of "Bob", who has recieved some symbols from "Alice". | |
| The following example attempts to reconcile the differences between two sets of `u64` integers, and is done from the perspective of "Bob", who has received some symbols from "Alice". |
| The following example attempts to reconcile the differences between two sets of `u64` integers, and is done from the perspective of "Bob", who has recieved some symbols from "Alice". | |
| The following example attempts to reconcile the differences between such two sets of `u64` integers, and is done from the perspective of "Bob", who has received some symbols from "Alice". |
You want to show here that it ALSO tells you whether a value was Inserted or Removed.
Maybe use the sign of the value for demonstration purposes?
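One way to picture the sign idea from this comment, as a minimal sketch: plain `BTreeSet` operations stand in for the HRIBLT decode step, and the function name `signed_diff` is hypothetical rather than part of this crate. It assumes all set members are positive, so that a negative sign can mark values present only on Bob's side.

```rust
use std::collections::BTreeSet;

/// Returns the symmetric difference of two sets as signed values:
/// `+v` for values only in `alice`, `-v` for values only in `bob`.
/// Assumes every member is positive, so the sign is free to carry
/// the "which side" information. This is NOT the HRIBLT algorithm;
/// ordinary set operations are used purely for illustration.
fn signed_diff(alice: &BTreeSet<i64>, bob: &BTreeSet<i64>) -> Vec<i64> {
    let only_alice = alice.difference(bob).copied();
    let only_bob = bob.difference(alice).map(|v| -v);
    let mut out: Vec<i64> = only_alice.chain(only_bob).collect();
    out.sort();
    out
}
```

With Alice holding `{1, 2, 3, 4}` and Bob holding `{3, 4, 5}`, the result is `[-5, 1, 2]`: 1 and 2 exist only on Alice's side and 5 only on Bob's.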
I think we also want an example showing the hierarchical aspect of this.
I.e. both sides compute an encoding session of size 1024 and convert it into a hierarchy. Then, one side transfers 128 blocks at a time until decoding succeeds.
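The transfer loop described in this comment could look roughly like the sketch below. The closure `try_decode` is a hypothetical stand-in for the receiving side; nothing here uses the crate's actual encoding-session API, which this page does not show.

```rust
/// Sends `chunk` coded symbols per round until `try_decode` reports
/// success or all `total` symbols have been sent. Returns the number of
/// symbols that had been sent when decoding succeeded, or `None` if the
/// whole stream was not enough.
/// `try_decode` is a hypothetical callback that receives the number of
/// coded symbols available to the receiver so far.
fn symbols_until_decoded(
    total: usize,
    chunk: usize,
    mut try_decode: impl FnMut(usize) -> bool,
) -> Option<usize> {
    assert!(chunk > 0, "chunk size must be positive");
    let mut sent = 0;
    while sent < total {
        // Never send more than the stream contains.
        sent = (sent + chunk).min(total);
        if try_decode(sent) {
            return Some(sent);
        }
    }
    None
}
```

For a stream of 1024 coded symbols sent 128 at a time, a receiver that needs at least 300 symbols succeeds after the third round, so `symbols_until_decoded(1024, 128, |n| n >= 300)` returns `Some(384)`.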
| Original file line number | Diff line number | Diff line change | ||||
|---|---|---|---|---|---|---|
| @@ -0,0 +1,21 @@ | ||||||
| # Hash Functions | ||||||
|
|
||||||
| This library has a trait, `HashFunctions`, which is used to create the hashes required to place your symbol into the range of coded symbols. | ||||||
|
|
||||||
| The following documentation provides more details on this trait in particular. How and why this is done is explained in the `overview.md` documentation. | ||||||
|
||||||
| The following documentation provides more details on this trait in particular. How and why this is done is explained in the `overview.md` documentation. | |
| The following documentation provides more details on this trait in particular. |
Copilot
AI
Jan 16, 2026
Spelling error: "guarenteed" should be "guaranteed".
This issue also appears in the following locations of the same file:
- line 11
| We provide a `DefaultHashFunctions` type which is a wrapper around the `DefaultHasher` type provided by the Rust standard library. Though the seed for this function is fixed, it should be noted that the hashes produces by this type are *not* guarenteed to be stable across different versions of the Rust standard library. As such, you should not use this type for any situation where clients might potentially be running on a binary built with an unspecified version of Rust. | |
| We provide a `DefaultHashFunctions` type which is a wrapper around the `DefaultHasher` type provided by the Rust standard library. Though the seed for this function is fixed, it should be noted that the hashes produces by this type are *not* guaranteed to be stable across different versions of the Rust standard library. As such, you should not use this type for any situation where clients might potentially be running on a binary built with an unspecified version of Rust. |
I think we should have a feature flag which provides the SHA1 functionality for you...
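To illustrate what a version-stable checksum might look like without pulling in a SHA1 dependency, here is a minimal sketch using FNV-1a, a swapped-in technique chosen only because it fits in a few lines of standard-library Rust. The function names are hypothetical, and the code does not implement the `HashFunctions` trait, whose full signature is not shown on this page.

```rust
/// 64-bit FNV-1a parameters (fixed by the FNV specification).
const FNV_OFFSET_BASIS: u64 = 0xcbf2_9ce4_8422_2325;
const FNV_PRIME: u64 = 0x0000_0100_0000_01b3;

/// Hashes a byte slice with 64-bit FNV-1a. The result depends only on
/// the input bytes, not on the Rust version or the platform.
fn fnv1a_64(bytes: &[u8]) -> u64 {
    let mut hash = FNV_OFFSET_BASIS;
    for &b in bytes {
        hash ^= u64::from(b);
        hash = hash.wrapping_mul(FNV_PRIME);
    }
    hash
}

/// Hypothetical stable checksum for a `u64` symbol. The value is
/// serialised as little-endian bytes so every platform hashes the
/// same byte sequence.
fn stable_checksum(value: u64) -> u64 {
    fnv1a_64(&value.to_le_bytes())
}
```

FNV-1a is not cryptographic, so it suits cooperating clients only; a SHA1-based implementation behind a feature flag, as suggested above, would be the sturdier choice.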
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,17 @@ | ||
| # Sizing your HRIBLT | ||
|
|
||
| Because the HRIBLT is rateless, it is possible to append additional data to make decoding possible. That is, it does not need to be sized in advance like a standard invertible bloom lookup table. | ||
|
|
||
| Regardless, there are some advantages to getting the size of your decoding session correct the first time. An example might be if you're performing set reconciliation over some RPC and you want to minimise the number of round trips it takes to perform a decode. | ||
|
|
||
| ## Coded Symbol Multiplier | ||
|
|
||
| The number of coded symbols required to find the difference between two sets is proportional to the difference between the two sets. The following chart shows the relationship between the number of coded symbols required to decode HRIBLT and the size of the diff. Note that the size of the base set (before diffs were added) was fixed. | ||
|
Collaborator
I don't quite understand what that sentence means :) When decoding, only the size of the diff matters, i.e. you don't necessarily need to build a base set or two sets that you actually diff. You just need to create a random diff set (i.e. with insertions and deletions). |
||
|
|
||
| `y = len(coded_symbols) / diff_size` | ||
|
|
||
|  | ||
|
Collaborator
We also need to visualize the standard deviation (or maybe percentiles like 90%-ile, 99%-ile). Max doesn't really make sense, since in theory it is infinite. Those are almost more important than the average.
Collaborator
mmm. I couldn't find how you generated this file... We should/must merge that code. |
||
|
|
||
| For small diffs, the number of coded symbols required per value is larger; after a difference of approximately 100 values, the coefficient settles at around 1.3 to 1.4. | ||
|
|
||
| You can use this chart, combined with an estimate of the diff size (perhaps from a `geo_filter`) to increase the probability that you will have a successful decode after a single round-trip while also minimising the amount of data sent. | ||
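As a rough illustration of turning a diff-size estimate into a first-round request size, the sketch below multiplies the estimate by a coefficient read off the chart. The function name and the choice to pass the coefficient as a percentage (140 for 1.4) are assumptions made here to keep the arithmetic in integers; the crate itself may size sessions differently.

```rust
/// Estimates how many coded symbols to request for an expected diff size.
/// `multiplier_percent` is the coded-symbols-per-diff-value coefficient
/// multiplied by 100 (for example 140 for a coefficient of 1.4).
/// The result is rounded up so the estimate never undershoots.
fn estimate_coded_symbols(diff_size: usize, multiplier_percent: usize) -> usize {
    (diff_size * multiplier_percent + 99) / 100
}
```

For an estimated diff of 1000 values and a coefficient of 1.4, `estimate_coded_symbols(1000, 140)` gives 1400. Since the documented coefficient is an average, a larger percentage leaves headroom for the first round to succeed, and small diffs (below roughly 100 values) need a larger coefficient still.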
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,127 @@ | ||
| use crate::{Encodable, HashFunctions, index_for_seed, indices}; | ||
|
|
||
| /// Represents a coded symbol in the invertible bloom filter table. | ||
| /// In some of the literature this is referred to as a "cell" or "bucket". | ||
| /// It includes a checksum to verify whether the instance represents a pure value. | ||
| #[derive(Debug, Clone, Copy, Eq, PartialEq, Hash)] | ||
| pub struct CodedSymbol<T: Encodable> { | ||
| /// Values aggregated by XOR operation. | ||
| pub value: T, | ||
| /// We repurpose the two least significant bits of the checksum: | ||
| /// - The least significant bit is a one bit counter which is incremented for each entity. | ||
| /// This bit must be set when there is a single entity represented by this hash. | ||
| /// - The second least significant bit indicates whether the entity is a deletion or insertion. | ||
| pub checksum: u64, | ||
| } | ||
|
|
||
| impl<T: Encodable> Default for CodedSymbol<T> { | ||
| fn default() -> Self { | ||
| CodedSymbol { | ||
| value: T::zero(), | ||
| checksum: 0, | ||
| } | ||
| } | ||
| } | ||
|
|
||
| impl<T: Encodable> From<(T, u64)> for CodedSymbol<T> { | ||
| fn from(tuple: (T, u64)) -> Self { | ||
| Self { | ||
| value: tuple.0, | ||
| checksum: tuple.1, | ||
| } | ||
| } | ||
| } | ||
|
|
||
| impl<T: Encodable> CodedSymbol<T> { | ||
| /// Creates a new coded symbol with the given hash and deletion flag. | ||
| pub(crate) fn new<S: HashFunctions<T>>(state: &S, hash: T, deletion: bool) -> Self { | ||
| let mut checksum = state.check_sum(&hash); | ||
| checksum |= 1; // Add a single bit counter | ||
| if deletion { | ||
| checksum = checksum.wrapping_neg(); | ||
| } | ||
| CodedSymbol { | ||
| value: hash, | ||
| checksum, | ||
| } | ||
| } | ||
|
|
||
| /// Merges another coded symbol into this one. | ||
| pub(crate) fn add(&mut self, other: &CodedSymbol<T>, negate: bool) { | ||
| self.value.xor(other.value); | ||
| if negate { | ||
| self.checksum = self.checksum.wrapping_sub(other.checksum); | ||
| } else { | ||
| self.checksum = self.checksum.wrapping_add(other.checksum); | ||
| } | ||
| } | ||
|
|
||
| /// Checks whether this coded symbol is pure, i.e., whether it represents a single entity. | ||
| /// A pure coded symbol must satisfy the following conditions: | ||
| /// - The 1-bit counter must be 1 or -1 (which are both represented by the bit being set) | ||
|
Collaborator
-> by the least significant bit being set |
||
| /// - The checksum must match the checksum of the value. | ||
|
Collaborator
must match the absolute value of the checksum of the value (since the sign tells you whether it is an insertion or deletion). |
||
| /// - The indices of the value must match the index of this coded symbol. | ||
|
Collaborator
(exactly) one of the indices of the value must match the index of this coded symbol. |
||
| pub(crate) fn is_pure<S: HashFunctions<T>>( | ||
| &self, | ||
| state: &S, | ||
| i: usize, | ||
| len: usize, | ||
| ) -> (bool, usize) { | ||
| if self.checksum & 1 == 0 { | ||
| return (false, 0); | ||
| } | ||
| let multiplicity = indices_contains(state, &self.value, len, i); | ||
| if multiplicity != 1 { | ||
| return (false, 0); | ||
| } | ||
| let checksum = state.check_sum(&self.value) | 1; | ||
| if checksum == self.checksum || checksum.wrapping_neg() == self.checksum { | ||
| (true, 0) | ||
| } else { | ||
| let required_bits = self | ||
| .checksum | ||
| .wrapping_sub(checksum) | ||
| .leading_zeros() | ||
| .max(self.checksum.wrapping_add(checksum).leading_zeros()) | ||
| as usize; | ||
| (false, required_bits) | ||
| } | ||
| } | ||
|
|
||
| /// Checks whether this coded symbol is zero, i.e., whether it represents no entity. | ||
| pub(crate) fn is_zero(&self) -> bool { | ||
| self.checksum == 0 && self.value == T::zero() | ||
| } | ||
|
|
||
| /// Checks whether this coded symbol represents a deletion. | ||
| pub(crate) fn is_deletion<S: HashFunctions<T>>(&self, state: &S) -> bool { | ||
| let checksum = state.check_sum(&self.value) | 1; | ||
| checksum != self.checksum | ||
| } | ||
| } | ||
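The checksum arithmetic used by `new`, `add`, and `is_pure` above can be tried in isolation. The sketch below works on bare `u64` values with hypothetical helper names; it mirrors the `| 1` and `wrapping_neg` steps from the diff but leaves out the hash functions and the index check.

```rust
/// Mirrors the checksum tagging in `CodedSymbol::new`: set the one-bit
/// counter, then negate (two's complement) if the entry is a deletion.
fn tagged_checksum(raw: u64, deletion: bool) -> u64 {
    let c = raw | 1;
    if deletion { c.wrapping_neg() } else { c }
}

/// Mirrors the checksum comparison in `is_pure`: a stored checksum
/// belongs to a single entity if it equals the expected checksum or
/// its negation. (The real method also checks the index multiplicity.)
fn matches_single_entity(raw: u64, stored: u64) -> bool {
    let expected = raw | 1;
    stored == expected || stored == expected.wrapping_neg()
}
```

Because negating an odd number yields an odd number, the least significant bit stays set for both insertions and deletions. Adding an insertion and a deletion of the same value gives 0, and the sum of two distinct single-entity checksums is even, so it fails the `checksum & 1` test immediately.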
|
|
||
| /// This function checks efficiently whether the given index is contained in the indices. | ||
| /// | ||
| /// Note: we have constructed the indices such that we can determine from the last 5 bits | ||
| /// which hash function would map to this index. Therefore, we only need to check against | ||
| /// a single hash function and not all 5! | ||
| /// The only exception is for very small indices (0..32) or if the index is a multiple of 32. | ||
| /// | ||
| /// The function returns the multiplicity, i.e. how many indices hit this particular index. | ||
| /// Thereby, it takes into account whether the value is stored negated or not. | ||
| fn indices_contains<T: std::hash::Hash>( | ||
| state: &impl HashFunctions<T>, | ||
| value: &T, | ||
| stream_len: usize, | ||
| i: usize, | ||
| ) -> i32 { | ||
| if stream_len > 32 && i % 32 != 0 { | ||
| let seed = i % 4; | ||
| let j = index_for_seed(state, value, stream_len, seed as u32); | ||
| if i == j { 1 } else { 0 } | ||
|
Collaborator
Should this be written differently in Rust? (same couple lines below) |
||
| } else { | ||
| indices(state, value, stream_len) | ||
| .map(|j| if j == i { 1 } else { 0 }) | ||
| .sum() | ||
| } | ||
| } | ||
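On the reviewer's question about `if i == j { 1 } else { 0 }`: one possible alternative is the standard `From<bool>` conversion for integers, and counting matches with `filter` instead of mapping and summing. The helper names below are hypothetical and only isolate the two expressions; behaviour is intended to be the same as in the diff.

```rust
/// Equivalent to `if i == j { 1 } else { 0 }`, using the standard
/// `From<bool>` conversion for `i32`.
fn hit(i: usize, j: usize) -> i32 {
    i32::from(i == j)
}

/// Equivalent to `.map(|j| if j == i { 1 } else { 0 }).sum()`:
/// counts how many of the given indices equal `i`.
fn multiplicity(indices: impl Iterator<Item = usize>, i: usize) -> i32 {
    indices.filter(|&j| j == i).count() as i32
}
```

The `as i32` cast is safe here under the assumption that only a handful of hash functions produce indices, so the count is tiny.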
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,31 @@ | ||
| /// A value that has been found by the set reconciliation algorithm. | ||
| #[derive(Debug, Clone, Copy, Eq, PartialEq, Hash, PartialOrd, Ord)] | ||
| pub enum DecodedValue<T> { | ||
| /// A value that has been added | ||
| Addition(T), | ||
| /// A value that has been removed | ||
| Deletion(T), | ||
| } | ||
|
|
||
| impl<T> DecodedValue<T> { | ||
| /// Consume this `DecodedValue` to return the value | ||
| pub fn into_value(self) -> T { | ||
| match self { | ||
| DecodedValue::Addition(v) => v, | ||
| DecodedValue::Deletion(v) => v, | ||
| } | ||
| } | ||
|
|
||
| /// Borrow the value within this decoded value. | ||
| pub fn value(&self) -> &T { | ||
| match self { | ||
| DecodedValue::Addition(v) => v, | ||
| DecodedValue::Deletion(v) => v, | ||
| } | ||
| } | ||
|
|
||
| /// Returns true if this decoded value is a deletion | ||
| pub fn is_deletion(&self) -> bool { | ||
| matches!(self, DecodedValue::Deletion(_)) | ||
| } | ||
| } |
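A short usage sketch for `DecodedValue`: the enum is re-declared here in trimmed form so the example compiles on its own, and the helper `split_changes` is hypothetical, not part of the crate.

```rust
/// Trimmed copy of the enum from the diff above, repeated so this
/// example is self-contained.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum DecodedValue<T> {
    Addition(T),
    Deletion(T),
}

/// Splits decoded values into (additions, deletions), preserving order.
fn split_changes<T>(decoded: Vec<DecodedValue<T>>) -> (Vec<T>, Vec<T>) {
    let mut added = Vec::new();
    let mut removed = Vec::new();
    for d in decoded {
        match d {
            DecodedValue::Addition(v) => added.push(v),
            DecodedValue::Deletion(v) => removed.push(v),
        }
    }
    (added, removed)
}
```

Given `[Addition(1), Deletion(5), Addition(2)]`, the helper returns `([1, 2], [5])`, which a caller could then apply to its local set.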
we also want some benchmarks to show throughput