Pages

Monday, January 2, 2017

Topic matching and bitmaps

Some very old ideas remain new. Inverted bitmaps are a wonderful example.

In a very nice recent article, Tyler Treat surveys various search structures employed by message queueing systems to solve the Fast Topic Matching problem: Fast Topic Matching

The naive hashmap solution is generally a poor choice due to its prohibitively expensive lookup time. Inverted bitmaps offer a better solution. The standard implementation is reasonable if the search space is finite, small, and known a priori, especially if read latency is critical. The space-optimized version is a better choice for scalability, offering a good balance between read and write performance while keeping a small memory footprint. The trie is an even better choice, providing lower latency than the optimized inverted bitmap and consuming less memory. It’s particularly good if the subscription tree is sparse and topics are not known a priori. Lastly, the concurrent subscription trie is the best option if there is high concurrency and throughput matters. It offers similar performance to the trie but scales better. The only downside is an increase in implementation complexity.

Treat's survey is clear and well-illustrated, and I'm really pleased that he worked hard to evaluate the choices across many different metrics: not just search time and update time, but also memory usage, concurrency, etc.

But I was particularly pleased that he included Bitmap indexes in the survey, for they are a fascinating data structure, and they don't get the coverage they deserve, particularly for work that is now 45 years old.

Although the original Model 204 implementation was hand-coded in IBM 370 Assembly Language, there are now widely-available bitmap indexing libraries, including Concise bitmaps, EWAH bitmaps, and Roaring bitmaps.

One (small) disappointment I have with Treat's great article is that he doesn't really investigate these libraries, instead simply noting that

While inverted bitmaps allow for efficient lookups, they are quite wasteful for sparse sets, even when using highly compressed bitmaps like Roaring bitmaps.

While I have no reason to disbelieve him, it would have been great to see him explain that conclusion more deeply.

At any rate, since Model 204's bitmap indexes were what got me into the software field, some 35 years ago, it's great to see them still finding favor and fairly widespread usage, decades later.

No comments:

Post a Comment