Export the filter #17

Open
RoyBellingan opened this issue Jun 12, 2016 · 9 comments

Comments

@RoyBellingan

RoyBellingan commented Jun 12, 2016

First of all nice and well designed library!

I am looking to add an export() method.
From my tests, it looks like it is enough to just expose

bits_.data()
and
num_bits_

For this "test" I have just made the following members public:
bitvector bits_;
and
size_type num_bits_;
std::vector<block_type> bits_;

If that is OK, I will prepare a patch to add serialize / unserialize functions.
They will be something like

std::string serialize();

bool unserialize(std::string raw);

The resulting "string" will be nothing more than sizeof(size_type) bytes for num_bits_, followed by the content of bits_.data().

A checksum could be added later. There is no need for a base64 version; I don't think anyone is going to copy and paste a filter.
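
The proposed layout can be sketched roughly as follows. This is only an illustration under stated assumptions: `raw_bits` is a hypothetical stand-in for the bit vector internals (it is not libbf's actual class), and the free-function signatures differ slightly from the member functions proposed above. Because the bytes are copied verbatim, the result is only meaningful on machines with the same endianness and the same `size_type` width.

```cpp
#include <climits>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for the bit vector internals; not libbf's real class.
struct raw_bits {
  using size_type = std::size_t;
  using block_type = std::uint64_t;
  size_type num_bits = 0;
  std::vector<block_type> blocks;
};

// Layout: sizeof(size_type) bytes holding num_bits, then the raw blocks.
std::string serialize(const raw_bits& b) {
  const std::size_t header = sizeof(raw_bits::size_type);
  const std::size_t payload = b.blocks.size() * sizeof(raw_bits::block_type);
  std::string out(header + payload, '\0');
  std::memcpy(&out[0], &b.num_bits, header);
  if (payload != 0)
    std::memcpy(&out[header], b.blocks.data(), payload);
  return out;
}

// Returns false and leaves `b` untouched if the input is malformed.
bool unserialize(const std::string& raw, raw_bits& b) {
  const std::size_t header = sizeof(raw_bits::size_type);
  const std::size_t block = sizeof(raw_bits::block_type);
  if (raw.size() < header || (raw.size() - header) % block != 0)
    return false;
  raw_bits::size_type n = 0;
  std::memcpy(&n, raw.data(), header);
  // The number of blocks must match the number of bits in the header.
  const std::size_t bits_per_block = block * CHAR_BIT;
  const std::size_t expected = n / bits_per_block + (n % bits_per_block != 0);
  const std::size_t actual = (raw.size() - header) / block;
  if (expected != actual)
    return false;
  std::vector<raw_bits::block_type> blocks(actual);
  if (actual != 0)
    std::memcpy(blocks.data(), raw.data() + header, actual * block);
  b.num_bits = n;
  b.blocks = std::move(blocks);
  return true;
}
```

Note that this covers the bits only; it records nothing about how the bits were produced.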

@mavam
Owner

mavam commented Jun 17, 2016

Hi Roy,

thanks for dropping a note. The ability to serialize Bloom filters is increasingly requested. Unfortunately, it doesn't suffice to simply expose the bit vector internals. You also need to remember which hash function and which seed you used in order to fully reconstruct the Bloom filter. In a topic branch I will add the proposed N3980 hash_append functionality. This will not only give users a choice among hash function implementations, but also improve hashing of custom types.

While doing so, I plan to improve support for serialization, which is currently lacking. It will very likely be via free functions, however, and not member functions. I'd like to stay API-compatible with Boost Serialization:

template <class Processor>
void serialize(Processor& proc, basic_bloom_filter& bf) {
  // serialize members here
}

That said, I'm currently pretty backed up with other projects and cannot promise you when this will be available.

@RoyBellingan
Author

Thank you for the response.
I have tried to read N3980, but it is way beyond my knowledge.

From what I have seen, if you always use the same random number in the initialization, such portability is doable... as long as you stay on the same machine.
If you think it is a possibility, I'll run tests between different machines.

BTW I'll try to read the doc again after some sleep!

@caetanosauer

This would be a very useful feature of this excellent library. Any news on that?

Thanks!

@mavam
Owner

mavam commented Feb 13, 2017

Unfortunately I'm lacking the cycles to pursue this myself at the moment, but I'm happy to supervise contributions.

@amallia
Contributor

amallia commented Jun 29, 2017

@mavam do we want to use Boost::serialization here?
If not, let's clarify whether we want binary serialization or human-readable serialization with operator>> and operator<< overloads.

@mavam
Owner

mavam commented Jun 30, 2017

@mavam do we want to use Boost::serialization here?

Adding a Boost dependency just for serialization would be overkill. I would like to keep the dependencies as minimal as possible: CMake plus a C++11 compiler. Actually, I think we can bump the requirement to C++14, since most compilers have a solid implementation by now. C++17 would be fun, but it's too cutting edge and we don't really reap the benefits in this library.

If not, let's clarify whether we want binary serialization or human-readable serialization with operator>> and operator<< overloads.

Using shift operators is indeed the most common model:

std::istream is;
bloom_filter bf;
is >> bf; // throws exception on failure?

As hinted in the comment, the error handling is a bit awkward. So, let's take one step back and think about an introspection framework that we can then use to generate those overloads where needed. We have something really neat in CAF: http://actor-framework.readthedocs.io/en/stable/TypeInspection.html. A simple version of this (without annotations) would be a good fit, in my opinion. This would mean that all we need is to write one function per serializable Bloom filter BF (and all dependent types, like hashers, transitively):

template <class Inspector>
auto inspect(Inspector& f, BF& bf) {
  return f(bf.x, bf.y, bf.z); // x, y, z represent the serializable state
}

Then, we can use this introspection API to support I/O stream serialization, simple string serialization, or whatever we want. The main advantage is that we can reuse the same mechanism for the hashable concept: a type only needs to provide an inspect function and becomes both serializable and hashable. (This is how I designed the concepts in VAST, FWIW)
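
A rough sketch of how a single inspect function could drive both serialization and hashing. Everything here is an assumption made for illustration: `toy_filter`, `byte_writer`, `to_bytes`, and `hash_of` are hypothetical names that are not part of libbf or CAF, and the hash is simply FNV-1a over the serialized bytes rather than a real hash_append design. It needs C++14 because of the deduced return type.

```cpp
#include <cstdint>
#include <string>
#include <type_traits>
#include <vector>

// Hypothetical filter state: hasher parameters plus the bit blocks.
struct toy_filter {
  std::uint64_t seed = 0;
  std::uint32_t k = 0;
  std::vector<std::uint64_t> blocks;
};

// The only function a type has to provide: list its serializable state.
template <class Inspector>
auto inspect(Inspector& f, toy_filter& x) {
  return f(x.seed, x.k, x.blocks);
}

// An inspector that appends the raw bytes of every field to a string.
struct byte_writer {
  std::string out;

  template <class... Ts>
  bool operator()(const Ts&... xs) {
    // Visit the fields left to right (C++14-friendly pack expansion).
    int expand[] = {0, (write(xs), 0)...};
    (void)expand;
    return true;
  }

  template <class T>
  typename std::enable_if<std::is_arithmetic<T>::value>::type
  write(const T& x) {
    out.append(reinterpret_cast<const char*>(&x), sizeof(T));
  }

  template <class T>
  void write(const std::vector<T>& xs) {
    write(static_cast<std::uint64_t>(xs.size()));  // length prefix
    for (const auto& x : xs)
      write(x);
  }
};

// Serialization built on top of inspect().
template <class T>
std::string to_bytes(T& x) {
  byte_writer w;
  inspect(w, x);
  return w.out;
}

// Hashing built on the same inspect(): FNV-1a over the inspected bytes.
template <class T>
std::uint64_t hash_of(T& x) {
  std::uint64_t h = 14695981039346656037ull;  // FNV-1a 64-bit offset basis
  for (unsigned char c : to_bytes(x)) {
    h ^= c;
    h *= 1099511628211ull;  // FNV-1a 64-bit prime
  }
  return h;
}
```

A deserializing inspector would take the fields by mutable reference and fill them in the same order.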

@amallia
Contributor

amallia commented Jun 30, 2017

Adding a Boost dependency just for serialization would be overkill. I would like to keep the dependencies as minimal as possible: CMake plus a C++11 compiler. Actually, I think we can bump the requirement to C++14, since most compilers have a solid implementation by now. C++17 would be fun, but it's too cutting edge and we don't really reap the benefits in this library.

I totally agree that Boost is overkill just for this, but if you consider all the other places where we could use it, then it might make sense to have it. But let's try to proceed without it for now.
I don't agree with C++14; there are many compilers that don't support it (Solaris and AIX are two examples). We might lose stakeholders :)

I will take a look at the framework that you pointed out, but I think that for error handling we could set ios_base::failbit, with something like:

stream.setstate(ios_base::failbit);

This way, a base bf could implement the extraction/insertion operators, and every type of bf would specialize a serialization/deserialization method. This may sound less elegant, but it is more pragmatic, postponing the introduction of the framework to the next generation of the library.

EDIT: looking more closely at the framework, I think it is not too complicated. We can probably just go directly with that. Do you have any suggestion on how to serialize a hasher? I was thinking of serializing the values used to generate it (like k, seed and double_hashing).

@mavam
Owner

mavam commented Jul 1, 2017

I don't agree with C++14, there are so many compilers that don't support it (Solaris and AIX are two examples). We might lose stakeholders :)

Sticking with C++11 is fine by me if we're going the simple route via overloading the shift operators. If we went for something fancier, like the introspection concept I proposed, then C++11 is a bit bulky. I agree that starting with a simple approach is the right middle-ground to get started.

Do you have any suggestion on how to serialize a hasher? I was thinking to serialize the values used to generate it (like k, seed and double_hashing).

Exactly.

Regarding the API, we have some design options. The low-hanging fruit would be to serialize each T with an overload of this form:

template <class Char, class Traits>
std::basic_ostream<Char, Traits>& operator<<(std::basic_ostream<Char, Traits>& os, const T& x) {
  serialize(x); // implementation
  return os;
}

template <class Char, class Traits>
std::basic_istream<Char, Traits>& operator>>(std::basic_istream<Char, Traits>& is, T& x) {
  deserialize(x); // implementation
  return is;
}

This is technically the way to parse and print custom types, but we would use it for binary serialization by writing to and reading from the underlying stream buffer. A downside would be that it's now possible to print a type to cout and get gibberish back. But that's the price we pay if we want an interface that works like this:

bloom_filter x;
std::ofstream file{...};
file << x;

@mristin
Collaborator

mristin commented Dec 4, 2018

Hi @mavam
I'd like to give my vote to serialization as well. Serialization is really necessary in order to use libbf in production with big data -- hoping that the filter can reside in memory is an option only for a very limited set of use cases.

Since the last message was in 2017, do you have a design for how you would approach serialization now, at the end of 2018 😄 ? Maybe you could write it down here, and after a discussion, I might have some time to implement it. (Please recall from the previous messages on this issue that hash functions need to be serialized as well -- so it needs to be an approach that works for both the filter and the hash functions.)

I personally prefer the more readable and maintainable, though also more verbose, approach to serialization used by protocol buffers (https://developers.google.com/protocol-buffers/docs/cpptutorial):

  • bool SerializeToString(string* output) const serializes the message and stores the bytes in the given string. Note that the bytes are binary, not text; we only use the string class as a convenient container.
  • bool ParseFromString(const string& data) parses a message from the given string.
  • bool SerializeToOstream(ostream* output) const writes the message to the given C++ ostream.
  • bool ParseFromIstream(istream* input) parses a message from the given C++ istream.

These can be written as an interface, i.e. a pure abstract class that the filter classes and hash functions implement.

@mristin mristin mentioned this issue Dec 4, 2018