The subject of the paper is straightforwardly expressed:
Both reconciliation and deduplication can be abstracted as the problem of efficiently computing the set difference between two sets stored at two nodes across a communication link. The set difference is the set of keys that are in one set but not the other. In reconciliation, the difference is used to compute the set union; in deduplication, it is used to compute the intersection. Efficiency is measured primarily by the bandwidth used (important when the two nodes are connected by a wide-area or mobile link), the latency in round-trip delays, and the computation used at the two hosts. We are particularly interested in optimizing the case when the set difference is small (e.g., the two nodes have almost the same set of routing updates to reconcile, or the two nodes have a large amount of duplicate data blocks) and when there is no prior communication or context between the two nodes.
The paper itself is well-written and clear, and certainly worth your time. It's been particularly rewarding for me because it's taken me down a path of investigating a lot of new algorithms that I hadn't previously been studying. My head is swimming with
- Invertible Bloom Filters (a variation on counting Bloom filters, which in turn are a variation on basic Bloom filters, an algorithm that is now 40 years old!)
- Tornado codes
- Min-wise sketches
- Characteristic Polynomial Interpolation
- Approximate Reconciliation Trees
I hope to return to discussing a number of these sub-topics in later posts, whenever I find the time (heh heh). One of the things that's challenging about a lot of this work is that it's based on probabilistic algorithms, which take some time getting used to. I first studied these sorts of algorithms as an undergraduate in the early 1980's, but they still throw me when I encounter them. When studying probabilistic algorithms, you often encounter sections like the following (from the current paper):
The corollary implies that in order to decode an IBF that uses 4 independent hash functions with high probability, then one needs an overhead of k + 1 = 5. In other words, one has to use 5d cells, where d is the set difference. Our experiments later, however, show that an overhead that is somewhat less than 2 suffices.The question always arises: what happens to the algorithm in those cases where the probabilities fail, and the algorithm gives the wrong answer (a false positive, say)? I believe, that, in general, you can often structure the overall computation so that in these cases the algorithm still gives the correct answer, but does more work. For example, in the deduplication scenario, you could perhaps structure things so that the set difference code (which is trying to compute the blocks that are identical in both datasets, so that they can be eliminated from one set as redundant and stored only in the other set) fails gracefully on a false positive. Here, a false positive would need to cause the overall algorithm to conclude that two blocks which are in fact distinct, but which collide in the data structure and hence appear to be identical, are treated as distinct and retained in both datasets.
That is, the algorithm could be designed so that it errs on the side of safety when the probabilities cause a false positive to be returned.
Alternatively, some probabilistic algorithms instead fail entirely with very low probability, but fail in such as way as to allow the higher-level code to either simply re-try the computation (if it involves random behaviors, then with high probability it will work the next time), or to vary the computation in some crucial aspect, to ensure that it will succeed (which is the case in this particular implementation).
Most treatments of probabilistic algorithms describe these details, but I still find it important to always keep them in my head, in order to satisfy myself that such a probabilistic algorithm is safe to deploy in practice.
Often, the issue in using probabilistic algorithms is to figure out how to set the parameters so that the behavior of the algorithm performs well. In this particular case, the issue involves estimating the size of the set difference:
To efficiently size our IBF, the Strata Estimator provides an estimate for d. If the Strata Estimator over-estimates, the subsequent IBF will be unnecessarily large and waste bandwidth. However if the Strata Estimator under-estimates, then the subsequent IBF may not decode and cost an expensive transmission of a larger IBF. To prevent this, the values returned by the estimator should be scaled up so that under-estimation rarely occurs.
That is, in this particular usage of the probabilistic algorithms, the data structure itself (the Invertible Bloom Filter) is powerful enough that the code can detect when it fails to be decoded. Using a larger IBF solves that problem, but we don't want to use a wastefully-large IBF, so the main effort of the paper involves techniques to compute the smallest IBF that is needed for a particular pair of sets to be diff'd.
If you're interested in studying these sorts of algorithms, the paper is well-written and straightforward to follow, and contains an excellent reference section with plenty of information on the underlying work on which it is based.