Export the filter #17
Hi Roy, thanks for dropping a note. The ability to serialize Bloom filters is increasingly requested. Unfortunately, it doesn't suffice to simply expose the bit vector internals. You also need to remember which hash function and which seed you used in order to fully reconstruct the Bloom filter. In a topic branch I will add the proposed N3980. While doing so, I plan to improve support for serialization, which is currently lacking. It will very likely be via free functions, however, and not member functions. I'd like to stay API-compatible with Boost Serialization:
template <class Processor>
void serialize(Processor& proc, basic_bloom_filter& bf) {
  // serialize members here
}
That said, I'm currently pretty backed up with other projects and cannot promise you when this will be available. |
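To make the free-function idea above concrete, here is a minimal sketch of a Boost-Serialization-style `serialize` that works for both saving and loading. Everything in it is an assumption for illustration: `toy_filter`, `word_writer`, and `word_reader` are hypothetical stand-ins, not types from this library or from Boost. The point is only that the seed and the number of hash functions travel together with the bit vector storage.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Hypothetical stand-in for a Bloom filter's state (not the library's type).
struct toy_filter {
  std::uint64_t seed = 0;             // seed of the hash function
  std::uint64_t num_hashes = 0;       // number of hash functions (k)
  std::vector<std::uint64_t> blocks;  // storage of the bit vector
};

// Hypothetical output processor: appends every member to a flat word list.
struct word_writer {
  std::vector<std::uint64_t> out;
  word_writer& operator&(std::uint64_t x) {
    out.push_back(x);
    return *this;
  }
  word_writer& operator&(const std::vector<std::uint64_t>& xs) {
    out.push_back(xs.size());  // length prefix
    out.insert(out.end(), xs.begin(), xs.end());
    return *this;
  }
};

// Hypothetical input processor: reads the members back in the same order.
struct word_reader {
  explicit word_reader(const std::vector<std::uint64_t>& words)
      : in(words), pos(0) {}
  word_reader& operator&(std::uint64_t& x) {
    x = in.at(pos++);  // at() throws std::out_of_range on truncated input
    return *this;
  }
  word_reader& operator&(std::vector<std::uint64_t>& xs) {
    const std::uint64_t n = in.at(pos++);
    if (n > in.size() - pos) throw std::out_of_range("truncated block list");
    const auto first = in.begin() + static_cast<std::ptrdiff_t>(pos);
    xs.assign(first, first + static_cast<std::ptrdiff_t>(n));
    pos += static_cast<std::size_t>(n);
    return *this;
  }
  const std::vector<std::uint64_t>& in;
  std::size_t pos;
};

// One free function describes the serializable state for both directions.
template <class Processor>
void serialize(Processor& proc, toy_filter& bf) {
  proc & bf.seed & bf.num_hashes & bf.blocks;
}

// Saves a filter, loads it into a fresh object, and compares the two.
bool serialize_round_trip() {
  toy_filter a;
  a.seed = 42;
  a.num_hashes = 3;
  a.blocks = {0xFFu, 0x1u};
  word_writer w;
  serialize(w, a);
  toy_filter b;
  word_reader r(w.out);
  serialize(r, b);
  return b.seed == a.seed && b.num_hashes == a.num_hashes &&
         b.blocks == a.blocks;
}
```

Because the same `serialize` function drives both processors, the write order and the read order cannot drift apart, which is the main attraction of this style.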
Thank you for the response. From what I have seen, if you always use the same random number in the initialization, such portability is doable, at least while remaining on the same machine. BTW, I'll try to read the docs again after some sleep! |
This would be a very useful feature of this excellent library. Any news on that? Thanks! |
Unfortunately I'm lacking the cycles to pursue this myself at the moment, but I'm happy to supervise contributions. |
@mavam do we want to use Boost::serialization here? |
Adding a Boost dependency just for serialization would be overkill. I would like to keep the dependencies as minimal as possible: CMake plus a C++11 compiler. Actually, I think we can bump the requirement to C++14, since most compilers have a solid implementation by now. C++17 would be fun, but it's too cutting edge and we don't really reap the benefits in this library.
Using shift operators is indeed the most common model:
std::istream is;
bloom_filter bf;
is >> bf; // throws exception on failure?
As hinted in the comment, the error handling is a bit awkward. So, let's take one step back and think about an introspection framework that we can then use to generate those overloads where needed. We have something really neat in CAF: http://actor-framework.readthedocs.io/en/stable/TypeInspection.html. A simple version of this (without annotations) would be a good fit, in my opinion. This would mean that all we need is to write one function per serializable Bloom filter:
template <class Inspector>
auto inspect(Inspector& f, BF& bf) {
  return f(bf.x, bf.y, bf.z); // x, y, z represent the serializable state
}
Then, we can use this introspection API to support I/O stream serialization, simple string serialization, or whatever we want. The main advantage is that we can reuse the same mechanism for the hashable concept: a type only needs to provide an |
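The inspection idea above can be sketched roughly as follows. This is not CAF's API and not code from this library: `toy_filter`, `text_writer`, and `text_reader` are hypothetical names, and the space-separated text format is just an assumption chosen to keep the example short. A trailing return type is used so the sketch also compiles as C++11.

```cpp
#include <cassert>
#include <cstdint>
#include <istream>
#include <ostream>
#include <sstream>
#include <vector>

// Hypothetical stand-in for a Bloom filter's serializable state.
struct toy_filter {
  std::uint64_t seed;
  std::uint64_t num_hashes;
  std::vector<std::uint64_t> blocks;
};

// The single function a type provides: it hands its state to the inspector.
template <class Inspector>
auto inspect(Inspector& f, toy_filter& bf)
    -> decltype(f(bf.seed, bf.num_hashes, bf.blocks)) {
  return f(bf.seed, bf.num_hashes, bf.blocks);
}

// Hypothetical inspector that writes fields as space-separated text.
struct text_writer {
  std::ostream& os;
  void put(std::uint64_t x) { os << x << ' '; }
  void put(const std::vector<std::uint64_t>& xs) {
    os << xs.size() << ' ';  // length prefix
    for (std::uint64_t x : xs) os << x << ' ';
  }
  bool operator()() { return !os.fail(); }  // end of the field list
  template <class T, class... Ts>
  bool operator()(T& x, Ts&... xs) {
    put(x);
    return (*this)(xs...);
  }
};

// Hypothetical inspector that reads the fields back in the same order.
struct text_reader {
  std::istream& is;
  void get(std::uint64_t& x) { is >> x; }
  void get(std::vector<std::uint64_t>& xs) {
    std::uint64_t n = 0;
    is >> n;
    xs.clear();
    for (std::uint64_t i = 0; i < n; ++i) {
      std::uint64_t x = 0;
      if (!(is >> x)) return;  // stop on malformed or truncated input
      xs.push_back(x);
    }
  }
  bool operator()() { return !is.fail(); }  // reports success as a bool
  template <class T, class... Ts>
  bool operator()(T& x, Ts&... xs) {
    get(x);
    return (*this)(xs...);
  }
};

// Writes a filter as text, reads it back, and compares the two.
bool inspect_round_trip() {
  toy_filter a{42, 3, {255, 1}};
  std::ostringstream out;
  text_writer w{out};
  if (!inspect(w, a)) return false;
  toy_filter b{};
  std::istringstream in(out.str());
  text_reader r{in};
  if (!inspect(r, b)) return false;
  return b.seed == a.seed && b.num_hashes == a.num_hashes &&
         b.blocks == a.blocks;
}
```

Here the inspector returns a `bool`, which sidesteps the question of whether `is >> bf` should throw: the caller decides how to react to a failed read.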
Totally agree that Boost is overkill just for this, but if you consider all the other places where we could use it, then it might make sense to have it. But let's try to proceed without it for now.
EDIT: having looked at the framework more closely, I think it is not too complicated. We can probably just go directly with that. Do you have any suggestions on how to serialize a hasher? I was thinking of serializing the values used to generate it (like k, seed and double_hashing). |
Sticking with C++11 is fine by me if we're going the simple route via overloading the shift operators. If we went for something fancier, like the introspection concept I proposed, then C++11 is a bit bulky. I agree that starting with a simple approach is the right middle-ground to get started.
Exactly. Regarding the API, we have some design options. The low-hanging fruit would be to serialize each
template <class Char, class Traits>
std::basic_ostream<Char, Traits>& operator<<(std::basic_ostream<Char, Traits>& os, const T& x) {
  serialize(x); // implementation
  return os;
}
template <class Char, class Traits>
std::basic_istream<Char, Traits>& operator>>(std::basic_istream<Char, Traits>& is, T& x) {
  deserialize(x); // implementation
  return is;
}
This is technically the way to parse and print custom types, but we would use it for binary serialization by writing to and reading from the underlying stream buffer. A downside would be that it's now possible to print a type to
bloom_filter x;
std::ofstream file{...};
file << x; |
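As a rough sketch of what "writing to and reading from the underlying stream buffer" could look like, the following overloads bypass formatted I/O and move fixed-width little-endian words through `rdbuf()`. The `toy_filter` type, the helper names `put_u64`/`get_u64`, and the byte layout are all assumptions made for this example; they are not the library's format.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <istream>
#include <ostream>
#include <sstream>
#include <streambuf>
#include <string>
#include <vector>

// Hypothetical stand-in for a Bloom filter's serializable state.
struct toy_filter {
  std::uint64_t seed = 0;
  std::vector<std::uint64_t> blocks;
};

// Writes one 64-bit word in little-endian byte order.
inline bool put_u64(std::streambuf& sb, std::uint64_t x) {
  char buf[8];
  for (int i = 0; i < 8; ++i)
    buf[i] = static_cast<char>((x >> (8 * i)) & 0xFF);
  return sb.sputn(buf, 8) == 8;
}

// Reads one 64-bit little-endian word; returns false on short input.
inline bool get_u64(std::streambuf& sb, std::uint64_t& x) {
  char buf[8];
  if (sb.sgetn(buf, 8) != 8) return false;
  x = 0;
  for (int i = 0; i < 8; ++i)
    x |= static_cast<std::uint64_t>(static_cast<unsigned char>(buf[i]))
         << (8 * i);
  return true;
}

// Binary output: seed, block count, then the blocks themselves.
std::ostream& operator<<(std::ostream& os, const toy_filter& bf) {
  std::streambuf* sb = os.rdbuf();
  bool ok = sb != nullptr && put_u64(*sb, bf.seed) &&
            put_u64(*sb, bf.blocks.size());
  for (std::size_t i = 0; ok && i < bf.blocks.size(); ++i)
    ok = put_u64(*sb, bf.blocks[i]);
  if (!ok) os.setstate(std::ios_base::failbit);
  return os;
}

// Binary input: leaves bf untouched and sets failbit if anything is missing.
std::istream& operator>>(std::istream& is, toy_filter& bf) {
  std::streambuf* sb = is.rdbuf();
  std::uint64_t seed = 0, n = 0;
  bool ok = sb != nullptr && get_u64(*sb, seed) && get_u64(*sb, n);
  std::vector<std::uint64_t> blocks;
  for (std::uint64_t i = 0; ok && i < n; ++i) {
    std::uint64_t b = 0;
    ok = get_u64(*sb, b);
    if (ok) blocks.push_back(b);
  }
  if (ok) {
    bf.seed = seed;
    bf.blocks.swap(blocks);
  } else {
    is.setstate(std::ios_base::failbit);
  }
  return is;
}

// Serializes a two-block filter and returns the raw bytes.
std::string encode_demo() {
  toy_filter a;
  a.seed = 42;
  a.blocks = {0xFFu, 0x1u};
  std::ostringstream out;
  out << a;
  return out.str();
}
```

Failure is reported through `failbit`, so a caller who prefers exceptions can opt in with `is.exceptions(std::ios_base::failbit)` before reading.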
Hi @mavam, since the last message was in 2017, do you have a design for how you would approach the serialization at the end of 2018 😄? Maybe you could write it down here, and after a discussion, I might have some time to implement it. (Please recall from the previous messages on this issue that hash functions need to be serialized as well, so it needs to be an approach that works for both the filter and the hash functions.) I personally prefer the more readable and maintainable, though also more verbose, approach to serialization used by protocol buffers (https://developers.google.com/protocol-buffers/docs/cpptutorial):
This can be written as an interface, i.e., a pure abstract class that the filter classes and hash functions implement. |
First of all, a nice and well-designed library!
I am looking to add an export() method.
From my tests, it looks like it is enough to just move bits_.data() and num_bits_.
For this "test" I have just made the following members public:
bitvector bits_;
and
size_type num_bits_;
std::vector<block_type> bits_;
If that is OK, I will prepare a patch to add serialize / unserialize functions.
They will be something like
std::string serialize();
bool unserialize(std::string raw);
The resulting "string" will be nothing more than sizeof(size_type) bytes for num_bits_, with all the remaining bytes being the content of bits_.data().
If needed, a checksum can be added. There is no need for a base64 version; I don't think anyone is going to copy and paste a filter.
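The proposed layout can be sketched like this. `toy_bitvector` is a hypothetical stand-in that merely mirrors the member names quoted above, and the sketch assumes that the number of blocks is always the bit count rounded up to whole blocks. It copies bytes in host order, so the result is only meant to be read back on the same kind of machine, and it does not store the hash function or seed.

```cpp
#include <cassert>
#include <climits>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical stand-in for the library's bit vector.
struct toy_bitvector {
  using size_type = std::size_t;
  using block_type = std::uint64_t;

  size_type num_bits_ = 0;
  std::vector<block_type> bits_;

  // Layout: sizeof(size_type) bytes of num_bits_, then the raw blocks.
  std::string serialize() const {
    const std::size_t header = sizeof(size_type);
    const std::size_t payload = bits_.size() * sizeof(block_type);
    std::string raw(header + payload, '\0');
    std::memcpy(&raw[0], &num_bits_, header);
    if (payload != 0) std::memcpy(&raw[header], bits_.data(), payload);
    return raw;
  }

  // Returns false and leaves *this unchanged if raw is malformed.
  bool unserialize(const std::string& raw) {
    const std::size_t header = sizeof(size_type);
    if (raw.size() < header) return false;
    size_type num_bits = 0;
    std::memcpy(&num_bits, raw.data(), header);
    const std::size_t payload = raw.size() - header;
    if (payload % sizeof(block_type) != 0) return false;
    // Assumption: block count equals the bit count rounded up to blocks.
    const std::size_t bits_per_block = sizeof(block_type) * CHAR_BIT;
    const std::size_t needed =
        num_bits / bits_per_block + (num_bits % bits_per_block != 0 ? 1 : 0);
    if (payload / sizeof(block_type) != needed) return false;
    std::vector<block_type> blocks(needed);
    if (payload != 0) std::memcpy(blocks.data(), raw.data() + header, payload);
    num_bits_ = num_bits;
    bits_.swap(blocks);
    return true;
  }
};

// Exports a 100-bit vector, imports it again, and compares the two.
bool export_round_trip() {
  toy_bitvector a;
  a.num_bits_ = 100;               // 100 bits need two 64-bit blocks
  a.bits_ = {0xDEADBEEFu, 0x1u};
  const std::string raw = a.serialize();
  toy_bitvector b;
  if (!b.unserialize(raw)) return false;
  return b.num_bits_ == a.num_bits_ && b.bits_ == a.bits_;
}
```

The size checks in `unserialize` catch truncated input, but not flipped bits, which is where the optional checksum mentioned above would come in.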