Saturday, December 3, 2011

Three papers on Bloom filters

If you should find yourself noticing that:
  • you're interested in these mysterious things called Bloom filters, which seem to be popping up over and over,
  • and you'd like to learn more about how Bloom filters work, but you're not sure where to start,
then you're roughly in the position I was in a little while ago, and maybe this article will be of interest to you.

After reading a dozen or so Bloom-filter-related papers spanning 40 years of interest in the subject, I've whittled down the list to three papers that I can recommend to anybody looking for a (fairly) quick and (somewhat) painless introduction to the world of Bloom filters:

  1. Space/Time Trade-offs in Hash Coding with Allowable Errors
  2. Summary Cache: A Scalable Wide-Area Web Cache Sharing Protocol
  3. Network Applications of Bloom Filters: A Survey

The first paper is Burton Bloom's original description of the technique he devised back in the late 1960's: a new data structure for rapidly testing set membership. The paper is remarkably clear, even though in those days we had not yet settled on standard terminology for describing computer algorithms and their behaviors.

The most striking part of Bloom's work is this description of its behavior:

In addition to the two computational factors, reject time and space (i.e. hash area size), this paper considers a third computational factor, allowable fraction of errors. It will be shown that allowing a small number of test messages to be falsely identified as members of the given set will permit a much smaller hash area to be used without increasing the reject time. In some practical applications, this reduction in hash area size may make the difference between maintaining the hash area in core, where it can be processed quickly, or having to put it on a slow access bulk storage device such as a disk.

This discussion of probabilistic, approximate answers to queries must have been truly startling and disturbing in the 1960's, when computer science was still very young: "falsely identified", "fraction of errors"? Horrors! Even nowadays, computer science has a hard time dealing with the notions of randomness and probabilistic computations, but we're getting better at reasoning about these things. Back then, Bloom felt the need to justify the approach by noting that:

In order to gain substantial reductions in hash area size, without introducing excessive reject times, the error-free performance associated with conventional methods is sacrificed. In application areas where error-free performance is a necessity, these new methods are not applicable.

Beyond suggesting the notion of a probabilistic solution to the set membership problem, Bloom's original paper is important for its description of the Bloom filter's core algorithm:

Method 2 completely gets away from the conventional concept of organizing the hash area into cells. The hash area is considered as N individual addressable bits, with addresses 0 through N - 1. It is assumed that all bits in the hash area are first set to 0. Next, each message in the set to be stored is hash coded into a number of distinct bit addresses, say a1, a2, ..., ad. Finally, all d bits addressed by a1 through ad are set to 1.

To test a new message a sequence of d bit addresses, say a1', a2', ... ad', is generated in the same manner as for storing a message. If all d bits are 1, the new message is accepted. If any of these bits is zero, the message is rejected.
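To make the method concrete, here is a minimal sketch in Python of the scheme Bloom describes above. The bit-array size, the number of hash functions, and the trick of salting SHA-256 to derive multiple bit addresses are all my own illustrative choices, not anything prescribed by the paper:

    import hashlib

    class BloomFilter:
        def __init__(self, m=1024, k=4):
            self.m = m           # number of addressable bits in the "hash area"
            self.k = k           # bit addresses per message (Bloom's d)
            self.bits = [0] * m  # all bits start at 0

        def _addresses(self, message):
            # Derive k bit addresses by salting one hash function; any
            # family of independent hash functions would do.
            for i in range(self.k):
                digest = hashlib.sha256(f"{i}:{message}".encode()).hexdigest()
                yield int(digest, 16) % self.m

        def add(self, message):
            for a in self._addresses(message):
                self.bits[a] = 1

        def __contains__(self, message):
            # "If all d bits are 1, the new message is accepted."
            return all(self.bits[a] for a in self._addresses(message))

    bf = BloomFilter()
    bf.add("apple")
    print("apple" in bf)   # True
    print("banana" in bf)  # almost certainly False (but could be a false positive)

Note that add never clears bits, which is why a membership test can return a false positive but never a false negative.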

Bloom's work was, rather quietly, adopted in various areas, particularly in database systems. I'll return to that subject in a future posting, but for now, let's spin the clock ahead nearly 30 years, to the late 1990's, when the second paper, by a team working on Web proxy implementations, brought significant popularity and a new audience to the world of Bloom filters.

The Summary Cache paper describes an intriguing problem: suppose you want to implement a set of independent proxy caches, each one physically separate from the others, and you'd like a proxy that takes a cache miss to be able to quickly determine whether the item is available in the cache of another proxy:

ICP discovers cache hits in other proxies by having the proxy multicast a query message to the neighboring caches whenever a cache miss occurs. Suppose that N proxies [are] configured in a cache mesh. The average cache hit ratio is H. The average number of requests received by one cache is R. Each cache needs to handle (N - 1) * (1 - H) * R inquiries from neighboring caches. There are a total [of] N * (N - 1) * (1 - H) * R ICP inquiries. Thus, as the number of proxies increases, both the total communication and the total CPU processing overhead increase quadratically.
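To get a feel for that quadratic growth, here is a quick back-of-the-envelope calculation using the formula from the quote; the particular values of N, H, and R below are made up purely for illustration:

    # Total ICP inquiry traffic: N * (N - 1) * (1 - H) * R
    def icp_inquiries(n_proxies, hit_ratio, requests_per_proxy):
        per_cache = (n_proxies - 1) * (1 - hit_ratio) * requests_per_proxy
        total = n_proxies * per_cache
        return per_cache, total

    for n in (4, 8, 16):
        per_cache, total = icp_inquiries(n, hit_ratio=0.4, requests_per_proxy=1000)
        print(f"{n} proxies: {per_cache:,.0f} inquiries/cache/s, {total:,.0f} total/s")
    # 4 proxies:  1,800 per cache,   7,200 total
    # 8 proxies:  4,200 per cache,  33,600 total
    # 16 proxies: 9,000 per cache, 144,000 total

Each doubling of the mesh size roughly quadruples the total inquiry traffic.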

How do they solve this problem? Well, it turns out that a Bloom filter is just the right tool for this:

We then propose a new cache sharing protocol called "summary cache." Under this protocol, each proxy keeps a compact summary of the cache directory of every other proxy. When a cache miss occurs, a proxy first probes all the summaries to see if the request might be a cache hit in other proxies, and sends a query messages [sic] only to those proxies whose summaries show promising results. The summaries do not need to be accurate at all times. If a request is not a cache hit when the summary indicates so (a false hit), the penalty is a wasted query message. If the request is a cache hit when the summary indicates otherwise (a false miss), the penalty is a higher miss ratio.
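Here is a rough sketch of how that probing step might look, assuming each proxy holds one Bloom filter summarizing each neighbor's cache directory. The names (make_summary, probe, the example URLs) are invented for illustration, and the real protocol's machinery for periodically refreshing the summaries is omitted:

    import hashlib

    def bloom_addresses(url, m=8192, k=4):
        # Same salted-hash trick as in the earlier sketch.
        for i in range(k):
            yield int(hashlib.sha256(f"{i}:{url}".encode()).hexdigest(), 16) % m

    def make_summary(cached_urls, m=8192, k=4):
        # Build one Bloom filter (as an integer bitmask) over a cache directory.
        bits = 0
        for url in cached_urls:
            for a in bloom_addresses(url, m, k):
                bits |= 1 << a
        return bits

    def probe(summaries, url):
        # On a local miss, query only the neighbors whose summary says
        # "maybe present"; a false positive costs one wasted query message.
        return [proxy for proxy, bits in summaries.items()
                if all(bits >> a & 1 for a in bloom_addresses(url))]

    summaries = {
        "proxy-b": make_summary(["http://example.com/a", "http://example.com/b"]),
        "proxy-c": make_summary(["http://example.com/c"]),
    }
    print(probe(summaries, "http://example.com/c"))  # ['proxy-c'], barring false positives

A false hit in a summary costs one wasted query message, and a stale summary that misses an entry costs a higher miss ratio: exactly the trade-off the quote describes.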

Bloom filters continued to spread in usage, and interesting varieties of Bloom filters started to emerge, such as Counting Bloom Filters, Compressed Bloom Filters, and Invertible Bloom Filters. In particular, Professor Michael Mitzenmacher of Harvard has been collecting, studying, and improving our understanding of Bloom filters and their usage for well over a decade.
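As a taste of one such variant, here is a minimal counting Bloom filter sketch. This is the extension the Summary Cache authors introduced so that entries can be removed when cached documents are evicted: replacing each bit with a small counter is the whole idea. The sizes and the salted-hash scheme are again just illustrative choices:

    import hashlib

    class CountingBloomFilter:
        def __init__(self, m=1024, k=4):
            self.m, self.k = m, k
            self.counts = [0] * m  # small (e.g. 4-bit) counters suffice in practice

        def _addresses(self, item):
            for i in range(self.k):
                yield int(hashlib.sha256(f"{i}:{item}".encode()).hexdigest(), 16) % self.m

        def add(self, item):
            for a in self._addresses(item):
                self.counts[a] += 1

        def remove(self, item):
            # Only safe for items that were actually added; otherwise
            # counters can underflow and corrupt the filter.
            for a in self._addresses(item):
                self.counts[a] -= 1

        def __contains__(self, item):
            return all(self.counts[a] > 0 for a in self._addresses(item))

    cbf = CountingBloomFilter()
    cbf.add("x")
    print("x" in cbf)  # True
    cbf.remove("x")
    print("x" in cbf)  # False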

About 10 years ago, Mitzenmacher collaborated with Andrei Broder, who is now at Yahoo! Research but was then with Digital Equipment Corporation's research organization, to write the third paper, which is the best of the three papers I mention in this article. (Note that Broder was a co-author of the second paper as well.)

The Network Applications of Bloom Filters paper accomplishes two major tasks:

  1. First, it provides a clear, modern, and complete description and analysis of the core Bloom filter algorithm, including, most importantly, the mathematics needed to understand the algorithm's probabilistic behavior and to tune its various parameters (sketched numerically just after this list).
  2. Second, it provides a wide-ranging survey of the many different areas in which Bloom filters can be used, and have successfully been used.
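To give a flavor of that analysis, the survey's central formulas are easy to play with numerically: with m bits, n stored items, and k hash functions, the false positive probability is approximately (1 - e^(-kn/m))^k, which is minimized by choosing k = (m/n) ln 2. A small sketch, with arbitrary example values of m and n:

    import math

    def false_positive_rate(m, n, k):
        # Approximate false positive probability: (1 - e^(-k*n/m))^k
        return (1 - math.exp(-k * n / m)) ** k

    m, n = 10_000, 1_000                  # 10 bits per stored item
    k_opt = round((m / n) * math.log(2))  # optimal k = (m/n) * ln 2
    print(k_opt, false_positive_rate(m, n, k_opt))  # 7, ~0.0082

At 10 bits per stored item, the optimal choice is about seven hash functions, giving a false positive rate just under 1%.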

Most importantly, for people considering future use of Bloom filters, the paper notes that:

The theme unifying these diverse applications is that a Bloom filter offers a succinct way to represent a set or a list of items. There are many places in a network where one might like to keep or send a list, but a complete list requires too much space. A Bloom filter offers a representation that can dramatically reduce space, at the cost of introducing false positives. If false positives do not cause significant problems, the Bloom filter may provide improved performance. We call this the Bloom filter principle, and we repeat it for emphasis below.

The Bloom filter principle: Whenever a list or set is used, and space is at a premium, consider using a Bloom filter if the effect of false positives can be mitigated.

So there you have it: three papers on Bloom filters. There is a lot more to talk about regarding Bloom filters, and hopefully I'll have the time to say more about these fascinating objects in the future. But this should be plenty to get you started.

If you only have time to read one paper on Bloom filters, read Broder and Mitzenmacher's Network Applications of Bloom Filters. If you have more time, also read the Summary Cache paper, and if you have gone that far I'm sure you'll take the time to dig up Bloom's original paper just for completeness (it's only 5 pages long, and once you've read Broder and Mitzenmacher, the original paper is easy to swallow).
