Bloom Filters

Bloom-Filters are a data structure that uses a constant size in order to store information about elements in a set. It allows adding elements to the set and querying if an element is in a set. Bloom Filters work in such a way that false positives can occur, but false negatives cannot. It means that if you query it for a particular element, the result can be that it's very probably an element is in a set or it's definitely not in a set. Mathematical model describes exactly how to calculate probability of false positives, as well as gives methods to calculate size of bloom filter and number of hashes to generate in order to get proper false positive probability rate for a specified number of elements.

Bloom Filters are mainly effective in data synchronization where sending actual data to compare takes a lot of bandwidth and/or the data can be sensitive - such as synchronizing contact list in IM application. The idea is that one side (usually server) is able to determine which elements the other side has without sending the elements themselves. This algorithm works regardless of actual element size, and size of Bloom Filter required to gain assumed accuracy does not depend on element size.

The idea of a bloom filter is simple - storage of current state is a series of bits (m bits). Adding an element to a set implies enabling several bits if they are not already set, and querying if an element is in a set means checking if all bits are set. Mapping between each element and bits that represent it is done by calculating k number of hash values of that element. This means that each element is represented by an m-bit integer number with k bits set. Each bit is chosen as result of calculating a hash modulo number of bits.

For an example of implementation (requires Tcl 8.5), see [L1 ] (NEM - broken link as of 2015-03-24)

An alternative implementation is provided at A Simple Bloom Filter.

Or maybe write an interface to bloomd: https://github.com/armon/bloomd/blob/master/README.md