Sunday, October 26, 2014

Bloom Filters

A bloom filter is a data structure that can be used for checking whether an element is in a set when the set is too large to use conventional hashing techniques. It is probabilistic in the sense that false positives are possible, i.e. it might say that an element is in the set when it is really not. There are, however, never false negatives. You can not remove elements from a basic bloom filter so it is well-suited to data streaming problems where the number of elements only increases.

An example use case would be a web crawler that needs to determine whether is has already visited a URL. A small percentage of false positives will be acceptable because no significant portion of the web has only one link (read one incorrect bloom filter lookup) pointing to it.

The bloom filter is implemented as a large array of bits and several independent hash functions. Initially all bits are set to 0. An element to be added has all the hash functions applied to it. The bit corresponding to the result of each hash function is set to 1. To check whether an element has been seen before the same hashing process is used on the element. If all resulting bit positions are 1 then it has probably been seen before.

If m is the number of bits in the array and k is the number of hash functions and n is the number of elements, then the probability of a false positive is:


As an example, consider the case where m = 1 billion, n = 1 million, and k = 5. The probability of a false positive is about 0.94%. Note how each element requires more than one bit. m has to be bigger than n or else all bits would eventually be set to 1 and every lookup would be true, so the bloom filter only makes sense where the number of possible elements is large and a traditional bit array would not work.

Optimal number of hash functions also shows how to calculate m and k when n and the desired false positive rate are known.

UPDATE:

In an Ethereum JavaScript API dependency I found an implementation and they have a real life usage example.