Tuesday, October 30, 2012

Bloom Filter Implementation in Java on GitHub

A Bloom Filter is a type of Set Data Structure. For those unaware, a Set Data Structure only has one main method, contains. It's only used to determine if a specific element is included in a group of elements or not. Most Data Structures (like a Hash Map, Linked List, or Array) can create this function fairly easily. You simply need to search the data structure for the specific element.

However, these types of Data Structures can pose a problem when the number of elements in the set exceeds the amount of memory available, as these types of data structures store all of the elements in memory.

This is where the Bloom Filter becomes interesting. As the Bloom Filter doesn't actually store all of the elements of the set in memory.

Instead of placing each element into a the Data Structure, the Bloom Filter only stores an array of bytes. For each element added to the Bloom Filter, k bits are set in its array. These bits are typically determined by a hashing function.

To check if an element is within the set, you simply check if the bits that would normally be one for this item are actually one. If they all are one (instead of zero), then the item is within the set. If any of the bits are not one, then the item is not within the set.

With every Data Structure there is definitely a draw back to the Bloom Filter. By using the method above, the Bloom Filter can say an element is within the set when it actually isn't. False positives are possible in the set, and they depend on several factors, such as:
  • The size of the byte array
  • The number of bits (k) set per element
  • The number of items in the set
By tweaking the above values, you can easily get the false positive probability to respectable levels while still saving a large amount of space.

After I discovered the Bloom Filter, I went looking for an implementation in Java. Sadly, a standard implementation doesn't exist! So, I wrote a quick and simple version of the Bloom Filter for Java. You can find the source code on GitHub.

My implementation uses:
  • MD5 Hash
    • To add an Object, the set takes the value of the hashCode() method to compute the MD5 hash. For subsequent values of k, the filter uses the previously computed MD5 hash (converted to an int) to generate the new MD5 hash.
  • Backed by a simple byte array
  • Implements the Set<Object> interface, although some of the methods in the interface will not work properly.
Note that the project also used the SizeOf Library to get the number of byte used in memory.
I also did a few quick expirements to compare the filter to a standard ArrayList in Java and a few performance checks.
  • Time required to add an element to the set using different k values
  • Size of the set versus the array list at different levels
As to be expected, the larger the number of elements required to be in the set, the more useful the Bloom Filter becomes. It does get a bit tricky when determining how large the Bloom Filter should be and what the optimal k value is for a given set, especially if the set is continually growing.

For the tests, I simply added Objects (which have a size of 16 bytes) to each data structure, and I then use the SizeOf library to get the true amount of space used.


From the above graph, its easy to see that the Bloom Filter is much more efficient on size once the array becomes larger than 100 objects. That trend continues at 1500 objects, with the Bloom Filter requiring 22808 bytes less than the ArrayList to store the same amount of elements.


The above graph shows the time in seconds (on an early 2012 iMac) to add an element to the list with different numbers of bits set (k). As k increases, the time increases fairly slowly up to 10 bits. However, anything past 10 becomes very costly, with 100 bits set requiring a full second to complete.

Feel free to check out the source code for the tests and the Bloom Filter implementation itself on GitHub.

No comments:

Post a Comment