Native serialization to a stream for FlatIndex #280
razdoburdin merged 9 commits into intel:dev/razdoburdin_streaming from
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
template <typename T = void> class StreamWriter : public Writer<T, StreamWriter<T>> {
  public:
    StreamWriter(std::ostream& os)
It seems that the Header structure written by FileWriter at the beginning of a file carries some important information, including:
- magic number and uuid, for versioning.
- stored data size.
Why does `StreamWriter` not populate the same header?
General question:
How are we going to handle cases where several objects have to be stored in or loaded from one stream?
E.g., in the case of the Vamana index, we have to store/load the configuration, the graph, and the data (where the data may contain 2 simple datasets for the LVQ/LeanVec cases).
For FileWriter the ostream is seekable (we know that it is an fstream), so we can insert a placeholder, write the data, calculate the size of the data written, and then replace the placeholder with the actual header. But for StreamWriter the ostream may be non-seekable, so we can't do the same trick with a placeholder Header.
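To make the placeholder trick concrete, here is a minimal sketch of it on a seekable stream. `DemoHeader`, its fields, the magic value, and `write_with_patched_header` are hypothetical names invented for illustration; this is not the actual Header layout or FileWriter code.

```cpp
#include <cassert>
#include <cstdint>
#include <ostream>
#include <sstream>
#include <string>

// Hypothetical header layout, for illustration only (not the real Header).
struct DemoHeader {
    std::uint64_t magic = 0;
    std::uint64_t payload_bytes = 0;
};

// Write a placeholder header, then the payload, then seek back and patch the
// header with the real size. Returns false if the stream is not seekable.
inline bool write_with_patched_header(std::ostream& os, const std::string& payload) {
    const std::ostream::pos_type header_pos = os.tellp();
    if (header_pos == std::ostream::pos_type(-1)) {
        return false; // tellp() failed: the placeholder trick does not apply
    }

    DemoHeader header{}; // placeholder, all zeros
    os.write(reinterpret_cast<const char*>(&header), sizeof(header));

    const std::ostream::pos_type begin = os.tellp();
    os.write(payload.data(), static_cast<std::streamsize>(payload.size()));
    const std::ostream::pos_type end = os.tellp();
    assert(end >= begin);

    // Now the size is known: go back and overwrite the placeholder.
    header.magic = 0x44454D4F; // arbitrary demo value
    header.payload_bytes = static_cast<std::uint64_t>(end - begin);
    os.seekp(header_pos);
    os.write(reinterpret_cast<const char*>(&header), sizeof(header));
    os.seekp(end); // restore the write position after the payload
    return static_cast<bool>(os);
}
```

With a `std::stringstream` or `std::ofstream` this works; with a forward-only stream `tellp()` fails and the function bails out, which is the limitation described above.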
I see two options here:
- Create a temporary seekable stringstream and use it as a buffer. But this creates a 2x memory overhead during serialization.
- Extract all required information from metadata. In this case we don't need Header.
I have used the first approach (with stringstream) for toml::table serialization, since the metadata are small and the overhead looks like an acceptable trade-off in this case.
But for the main data I am trying to implement the second option (without overhead). I haven't started work on Vamana yet, so I am not sure whether the metadata contain all the required information in that case.
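For reference, the stringstream option could look roughly like the sketch below: serialize into a temporary buffer so the size is known, then emit a size prefix followed by the bytes to a forward-only stream. `write_size_prefixed` and `read_size_prefixed` are hypothetical helpers made up for this example, not functions from this PR, and the on-wire layout is an assumption.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <istream>
#include <ostream>
#include <sstream>
#include <string>

// Serialize into an in-memory buffer first, then write [size][bytes].
// No seeking on `os` is needed, at the cost of one extra copy of the data.
inline void write_size_prefixed(
    std::ostream& os, const std::function<void(std::ostream&)>& serialize
) {
    std::ostringstream buffer;
    serialize(buffer);
    const std::string bytes = buffer.str();
    const std::uint64_t size = bytes.size();
    os.write(reinterpret_cast<const char*>(&size), sizeof(size));
    os.write(bytes.data(), static_cast<std::streamsize>(bytes.size()));
}

// Read one [size][bytes] section back from the stream.
inline std::string read_size_prefixed(std::istream& is) {
    std::uint64_t size = 0;
    is.read(reinterpret_cast<char*>(&size), sizeof(size));
    std::string bytes(static_cast<std::size_t>(size), '\0');
    is.read(&bytes[0], static_cast<std::streamsize>(size));
    assert(static_cast<bool>(is)); // fails on a truncated stream
    return bytes;
}
```

Because every section carries its own size, several sections written back to back can be read in order. That may be one way to approach the multi-object question raised above, but whether it fits the Vamana case is untested here.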
So, I would add a test for flat+LVQ/LeanVec to validate if multi-dataset cases are managed properly.
rfsaliev left a comment
LGFM
Except for the objections regarding multiple data/datasets in one stream, which are to be verified in the next steps during implementation of Vamana index support.
Reopening of #275 for developer branch