Range-Based Set Reconciliation

Doug Hoyte, December 2023

Introduction

Range-Based Set Reconciliation (RBSR) is a method for synchronising data-sets, either because they were created independently or because they have drifted apart due to downtime, network partitions, misconfigurations, etc. RBSR works well with very large data-sets, and can be safely used in open networks of un-coordinated peers. It is complementary to Conflict-free Replicated Data Types (CRDTs).

RBSR is simple to explain and implement. Compared to alternatives such as bloom filters, it is more efficient and requires less tuning of parameters. Unlike systems based on merkle-trees, it does not require storing per-connection state and is resistant to DoS attacks.

This article describes the theory behind RBSR and introduces Log Periodic's Negentropy project, which specifies a wire-protocol and includes protocol conformance tests and reference implementations in several languages.

This work is based on Aljoscha Meyer's research (overview / paper / master's thesis).

Case-study: rsync

rsync is a popular command-line application for synchronising the contents of directories between machines. When the directories are mostly similar, it does this by transmitting less data than simply copying the whole directory. Because rsync is run frequently to keep mirror servers up to date, it has saved a lot of bandwidth since its introduction in 1996.

When invoked, rsync runs through two phases:

Phase 1: Figure out which files were changed
The full list of files in a directory and each file's modification time is transferred to the remote party. If the corresponding remote file's modification time is identical, the files are assumed to be equivalent.
Phase 2: Figure out which parts of a file were changed
For each file that Phase 1 determined was different, break it up into equal sized chunks, compute a checksum for each chunk, and send this list of checksums to the remote party. By comparing these checksums against the checksums of its own file, the remote party can determine which chunks are new/modified and queue them for download.
rsync also has a mechanism to detect chunks that have moved around within a file (more on that below).

Content-Addressing

In phase 1, rsync compares a local modification timestamp against a remote timestamp. Although this works well for the mirroring use-case, meta-data timestamps should generally be avoided in distributed systems.

Consider the case where you first connect to an unknown server. Comparing this server's timestamps against your own doesn't provide useful information. Even if you trust its clock has always been set correctly, exactly when it saved a bit of data doesn't indicate if its saved data is older or newer than your version.

A common protocol for replication is for every server to keep a log of updates it has processed. Every time a client connects, it asks the server for data that has come in since it last checked. While this scheme doesn't depend on clocks to be correct (and in fact, sequence IDs can be used instead of timestamps), it is inefficient in open networks where a client synchronises with many different servers. This is because the same updates will be downloaded from many servers, and when adding a new server, all updates or the full state must be transferred.

In dynamic/chaotic distributed environments, timestamps can become unreliable for many other reasons. Servers can unexpectedly re-build their databases, resetting timestamp meta-data. Multiple servers can be put behind a load balancer and not present consistent timestamps. Due to bugs or operator intervention, modifications can bypass the log and directly update the DB. Servers can be restored from backups, rewinding the logs.

Take-away 1: In a distributed environment, it is preferable for synchronisation methods to use the contents of data elements (or their hashes) to determine equivalence, and not meta-data such as timestamps.

Divide-and-Conquer

In phase 2, rsync transmits the list of checksums for each chunk in the file. This means that every time a file is synchronised, an amount of data linear in the full file-size must be transferred. To reduce this overhead rsync uses a larger chunk size as the file size increases. However, chunks should not be too large since whenever a difference is detected an entire chunk must be transferred.

In the rsync protocol, the full list of chunks is constructed and transferred in one communication round. If we are willing to relax this, and allow back-and-forth communication, then we can improve on the overhead described above.

Imagine taking the checksum of the full file and transferring that. If it matches, great, we are done. Otherwise, split the file into two chunks, and send a checksum for each. If either of those match the remote chunk, then we have confirmed that this entire half of the file is identical. Whenever there is a difference, recurse until we have an arbitrarily small chunk that can be transferred directly. (This algorithm was in fact proposed in 2001 by John Langford as Multiround rsync, but was never incorporated into the official rsync program).

The divide-and-conquer technique is well known to computer scientists as the basis of binary search, and works well in this case because the bandwidth overhead grows logarithmically (very slowly) with the total file size.

Take-away 2: Using divide-and-conquer, shared sub-sets of records can be identified early-on so that no further synchronisation work needs to be performed on them. If the databases are large, this can save substantial resources such as bandwidth, CPU, and IO.

The Protocol

Alternating Messages

In order to avoid transferring a potentially large list of checksums like rsync, RBSR uses back-and-forth communication rounds between the protocol participants. Each round tries to find common sub-sets of elements so that processing of these sub-sets can be halted. When one side has determined that the other needs an element, it can be sent immediately (without waiting for the sync to complete).

Ranges and Fingerprints

With RBSR, all the elements to be synchronised are sorted into a particular order. Various ordering criteria can be used, but for now let's just suppose that each element has a unique timestamp and this is what we sort by.

A contiguous (no gaps) sequence of these sorted elements is called a range. Whenever we communicate about our ranges with the outside world, we cannot use the indices of the elements in our sorted list because other protocol participants may have entirely different sets. So, ranges must be specified using lower and upper bounds, which are values that can be compared against elements using the sorting criteria (ie, timestamps).

Let's walk through an example of how RBSR works.

The initiator begins by computing a fingerprint over its full range of elements (from timestamp 0 to timestamp infinity), and sending it to the other remote side. We'll talk about fingerprints in more detail soon, but for now assume they are cryptographic hashes of all the sorted elements within a range.

In our example the two parties share most of the elements in common, except that the initiator (top) is missing one element that the remote side (bottom) has, and the remote side is missing two elements the initiator has:

Once the remote side receives the fingerprint, it computes the fingerprint for its own full range. If they match, then it immediately knows that both sides have identical sets, so it can reply to the initiator that nothing needs to be done, and the protocol can stop.

But in our example this is not the case, since the two sides have different sets. To make further progress, the remote side splits its range into two smaller sub-ranges, computes the fingerprints for each, and sends them along with corresponding bounds back to the initiator:

After receiving these fingerprints, the initiator computes its own fingerprints for each sub-range specified by the remote side. It is important that the receiver uses the exact same range bounds the other side sent rather than blindly split its own range in half (for instance).

In our example, the initiator sees that the first fingerprint matches its own for this range. This is great news, since it means that we have now determined half of the elements are identical and we no longer need to do any processing for them (illustrated by colouring these elements green). However, the second fingerprint indicates that the other sub-range has one or more differences, so we need to recurse further.

Next, the initiator splits the sub-range into two sub-sub-ranges, computes their fingerprints, and sends them to the remote side:

Similar to the previous step, the remote side computes its own fingeprints and determines that one of them is identical: nothing more needs to be done for that range.

Although not strictly required, most RBSR implementations will want to stop splitting ranges once the number of elements they contain becomes sufficiently small. In our example, the remote side decides that it doesn't want to split the 6 elements in the right-most range, so it sends those elements (or their hashes), along with the range boundaries, back to the initiator:

After receiving the set of elements in the range, the initiator immediately knows which elements in this range it has that the other side needs, and vice-versa, simply by comparing these received elements against the elements in its own corresponding range.

The initiator notices that it was missing one element, and adds it to its set (illustrated by colouring blue). It also notices that it has two elements that were not sent by the remote side, so it sends these elements and the remote side adds them to its set:

At this point, the two sets are fully reconciled and the synchronisation is complete.

Batched Recursion

In the previous diagrams, each pair of fingerprints had one matching and one non-matching fingerprint. But there is a possibility that both sub-ranges will have a difference, and therefore both should be recursed into. For this reason, each message transmission may include multiple ranges and their fingerprints:

This "batching" of multiple ranges has a significant implication: The number of rounds of the protocol only depends on the sizes of the data-sets being synced, and not on their number of differences.

For example, suppose we have a data-set of size 1 million. The number of messages required to reconcile this data-set is:

log(1_000_000)/log(2) = 19.9316

This is an upper-bound: In fact, it would be fewer than this because small sub-ranges are sent in their entirety, stopping the protocol early (or if the sets are equal, of course).

In our example above, we started with a single range and split non-matching ranges into two sub-ranges. However, if we want to minimise the number of messages needed, we can use a higher branching factor at the expense of increasing the size of each message. For instance, if we used a branching factor of 16, we would then require this many messages:

log(1_000_000)/log(16) = 4.98289

Another thing you may have noticed about the protocol is that messages in both directions (initiator to remote and vice versa) help narrow down the ranges. If instead of counting messages we want to count communication round-trips, we should divide this by 2:

log(1_000_000)/log(16)/2 = 2.49145

In summary, to reconcile a set of 1 million, we expect to need 3 round-trips (rounding up because the initiator must always get a final response before terminating the protocol). Suppose our data-set was of size 1 billion:

log(1_000_000_000)/log(16)/2 = 3.73717

In this case, it would take 4 round-trips. This very slow growth is characteristic of logarithmic functions, and is precisely the reason why the divide and conquer strategy is so effective.

So far we haven't even considered how many differences there are in the sets because it would not affect our analysis. However, if there are many differences distributed throughout then the messages can grow quite large and we may wish to fragment them. This may increase the number of round-trips.

Range Choice

Implementations have considerable flexibility when choosing their ranges. Although the descriptions above assumed splitting into evenly sized groups of 2 (or 16), there are various reasons why implementations may choose different divisions:

Partial syncing: If a client is only interested in syncing particular portions of the data-set, it can create ranges only for those portions. The un-sent ranges will be considered already synced by the protocol and will be ignored. For instance, if records are sorted by timestamp, clients could use this to sync just the latest week's worth of records. Conversely, clients could choose to not synchronise the most recent few seconds to avoid redundant double-downloads if records are also being streamed in real-time.
Variable sizing: Ranges can be split into more or fewer sub-ranges as the sync progresses. This could be directed by simple heuristics, or dynamically derived according to measured network conditions in order to optimise for a particular point in the bandwidth/latency trade-off space.
Weighted splitting: In some situations, clients may have reason to believe that certain ranges are more or less likely to be in sync than others. For example, clients who sync with each other relatively often may rely on the fact that old data ranges are likely to be in sync, and therefore that most of the differences will be with recent elements. In this case, ranges that cover recent time periods can be made smaller in order to speculatively pre-partition them and reduce latency without increasing bandwidth usage.
In order to protect against some attacks on fingerprints, implementations may choose to randomise their range boundaries.
In order to keep messages acceptably small, ranges can be abbreviated to stay within a frame size limit, postponing their resolution until a later round of the protocol.

In all the cases described above, clients can implement their desired range policies with no coordination from the other side of the sync. All of these and others are naturally emergent abilities arising from the protocol itself.

Fingerprints

The purpose of a fingerprint is to provide a compact and reliable summary of a set so that it can be used to check for equality with a remote set.

Previously we suggested that fingerprints could be the cryptographic hash of the sorted elements within a range. (Practically, you'd hash each element and use the hash of these hashes, if elements are variable in size.) Although RBSR will work with such fingerprints, it becomes much more efficient when fingerprints can be computed incrementally.

Incremental Hashing

Suppose you have the elements A, B, and C and you computed a cryptographic hash (ie SHA-256) of the concatentation of these elements:

h = sha256(A + B + C)

Now you receive a new element D and you'd like to update h to include it. Unfortunately, the only way to do this is to re-compute the hash from scratch using the entire new set:

h' = sha256(A + B + C + D)

With incremental hashing, we can compute the new hash value without having to start again from scratch. A simple incremental hash can be implemented by hashing each individual element and then combining them with bit-wise XOR (⊕):

h = sha256(A) ⊕ sha256(B) ⊕ sha256(C)

Since XOR is associative, we can incrementally add in the new value:

h' = h ⊕ sha256(D)

Most incremental hashes allow you to subtract elements out of the hash too. With XOR, every element is its own inverse so it is as simple as:

h' ⊕ -sha256(D) = h' ⊕ sha256(D) = h

Hashing and then combining with XOR is an example of the randomise then combine paradigm for creating incremental hash functions.

Tree-Friendly Functions

The advantage of using incremental hash functions is that we can use them to reduce the amount of work needed to compute fingerprints by storing elements in a tree data-structure. If an incremental hash function has the properties that make this possible, it is called "tree-friendly" (see Meyer's paper for the exact algebraic requirements).

The most obvious type of tree to use is the B-tree. These trees have large branching factors (many children) because this encourages related data to be stored together, exploiting the locality of reference benefits inherent to most types of storage.

As well as pointers to its children, each node of the tree contains the incremental hash of all elements below it in the tree:

To find an item in the tree, we start at the root node, and traverse down through the levels until we get to a leaf node (path indicated in red below).

If we wanted to compute the hash of all nodes preceding the item we have found (indicated in blue), we could iterate over all the leaf nodes and incrementally hash the elements. However, this would involve loading and processing an amount of elements linear in the DB size, which could be very large:

Fortunately, because nodes store the incremental hashes of all their sub-elements, this is unnecessary. Instead, as we traverse the tree, we can accumulate the tree nodes to the left of the path we traversed, and the elements in the leaf node that precede the element we found:

The result is that only a logarithmic number of hashes need to be incrementally combined.

This operation (computing the incremental hash of preceding elements) can be used to find the incremental hash of any range of elements. Suppose we have the elements 1 through 10, and we want to find the incremental hash of elements in the range [3,5) (in computer science, ranges always include the begin index and exclude the end index). To do so, use the operation two times -- one for each end of the range -- and subtract the results:

(1 + 2 + 3 + 4) - (1 + 2) = (3 + 4)

In fact, being able to subtract incremental hashes is not strictly necessary for an RBSR tree (see Meyer's paper).

Because tree-friendly hash functions can be combined and separated, combinations of multiple trees or sub-trees can be synced in the same protocol-round. This gives implementations considerable freedom in how they store their data and allows parties that use different configurations to remain sync-compatible.

Fingerprint Threat Model

Two parties synchronising must necessarily trust one another to some degree: At the very least enough to download data from each other. Regardless of the sync protocol, if one side wants to it could always pretend that it doesn't have some data, or insert some bogus data into its data-set. The validity of the data, and the rules about its acceptance, are outside the scope of a sync protocol.

However, with RBSR there is another threat model to worry about: If a malicious third-party is able to insert specially-crafted data elements into one of the data-sets, it could choose elements that cause a range to have an equal fingerprint as some different range. During the sync, if the two parties compared the fingerprints of these two sets, they would incorrectly consider them equivalent and would stop trying to sync them.

If the specially-crafted malicious data elements themselves are not synced, nobody is likely to care. However, if the malicious elements prevent innocent victim elements from syncing, an attacker could use this to censor users by preventing their events from propagating throughout the network.

In the case of XOR, consider what would happen if you found a set of malicious elements that when hashed and XORed together were equivalent to the hash of a victim element. When computing the fingerprint of a range that includes the malicious elements and the victim element, they would "cancel out" and the two ranges would incorrectly have the same fingerprint.

Note that there are many obstacles to actually performing this censorship attack:

The two fingerprints actually need to be compared in a sync. If one side of the sync is mostly empty, then the recursion may terminate early, and the elements will be transferred.
The malicious elements need to be adjacent to or at least close-by the victim element. If a range boundary ever lands in the middle of these malicious elements, the attack will fail.
If the ranges being compared also differ in any other elements, the ranges will be synced.
If any future elements are added to the range, they may correctly sync up at that point.

Security of Incremental Hash Functions

How difficult is it to find a set of elements that when hashed and then XORed together equal a specific target value? Surprisingly easy. We built a gaussian elimination script that does it in 2 seconds for 256-bit fingerprints. This is unfortunate because XOR can be computed extremely efficiently.

Another attractively-efficient alternative is addition mod 2N. For instance, hashing each element with SHA-256, and then adding them together (mod 2256). Finding collisions in this scheme is significantly harder than with XOR, because of the propagating carry bit. Still, it is feasible. We built a k-dimensional birthday-problem solver that can find arbitrary 256-bit collisions in about 28 hours using 8 cores, 60 GB of RAM, and 1.5 TB of storage.

There are other candidates, most notably Elliptic Curve Multiset Hash. As far as we know, finding ECMH collisions is computationally infeasible. However, it is significantly slower than other methods, even when using optimisations which would not be available in many production environments. Furthermore, there only seems to be a single reference implementation, and requiring it as a dependency would go against our goals of building a simple, general-purpose RBSR protocol.

Research into finding an optimal incremental hash function is on-going.

Extra Hardening

Because of the many difficulties in executing this attack, a cryptographically secure incremental hash function is not necessarily even required.

rsync is theoretically vulnerable to the same problem because it uses known problematic hash functions such as MD4/MD5/xxhash and to our knowledge this has never been a problem (the analogy here is two trusted servers rsyncing a tarball containing attacker-created files).

Additionally, there are hardening steps that we can take to make attacks even less reliable:

Range Randomisation: Collision attacks involve creating several hundred malicious records. If a range boundary ever falls in the middle of these events, then the attack will fail. Because of this, implementations may choose to randomise their sub-ranges when recursing. Either the number of ranges can be randomised, their exact offsets, or both.
Incorporating Set Size: In addition to the incremental hash, fingerprints can also incorporate the number of elements within the range. In cases where the attacker can only write to one side of a network-split, this can categorically prevent certain attack classes, even when a collision in the incremental hash function can be rapidly found. For example, this provides protection in the important scenario where mostly-offline clients periodically sync up with a wider network.

Comparison with Alternatives

There are many alternative methods of set reconciliation. Under specific circumstances, some of them may have advantages over RBSR. However, in our opinion, all of them have serious down-sides which make them inferior to RBSR as general-purpose synchronisation protocols.

Bloom filters

Bloom filters are probabilistic data-structures that encode membership information into hash-derived bit-fields. These bit-fields are transferred to peers who then compare them against their own data-sets.

The fundamental problem with bloom filters is that they have false positives. These manifest as records that one side believes the other side to have but in fact does not, and therefore are not reconciled. To deal with this, the protocol must somehow detect false positives outside of the protocol, or be parameterised in a way to make this very unlikely (which implies impracticably large filters).

Similar to RBSR's fingerprints, bloom filter designs are susceptible to censorship attacks where specially selected inputs can prevent a target record from being synced.

If a protocol has a mechanism to detect false positives, it can be vulnerable to a DoS attack where large amounts of colliding inputs are selected. To demonstrate this, we built a proof-of-concept that causes Automerge's sync system to degrade to a 97% false-positive rate. Fortunately, Automerge's data is stored in a hash-linked DAG format which can be relied on to reduce the impact of this DoS attack.

There is a variation of bloom filters called invertible bloom filters. These suffer from a related problem where when the filter occupancy gets too high then the reconciliation fails entirely, and you have to start over with a larger filter.

Previously, our opinion was that bloom filters were non-ideal for set reconciliation and RBSR represented an entirely superior approach. However, the paper Practical Rateless Set Reconciliation and the algorithm it describes, RIBLT, has changed our perception. Although RBSR still has several advantages, rateless invertible bloom filters also represent a compelling solution to set reconciliation. To learn more about this, Log Periodic has built an implementation of this algorithm called riblet.

CPI Sketches

There is a class of algorithms called "sketches" based on characteristic polynomial interpolation (CPI), of which the pinnacle implementation is probably Minisketch.

These algorithms are extremely impressive in that they can encode sets of elements into packets so small that they approach the information-theoretic limits.

Unfortunately, they do not seem to scale well to even modest-sized data-sets. According to minisketch's docs, the largest size currently of interest to the authors is 4096. As the size grows, the CPU requirements grow rapidly. In addition, you need to predict the "capacity" of the sketch in advance, which is the largest number of elements that can differ, otherwise the reconciliation will fail. Finally, the sizes of the elements themselves are limited and larger sizes mean more CPU usage. Minisketch only supports up to 64-bit elements.

We believe that sketches have their niche, but compared to RBSR they are in an entirely different category.

Rigid Structures

A rigid data-structure uses properties of its stored data to organise its storage layout. If the history of insertions and deletions is irrelevant to the resulting organisation, these data-structures have a "path-independence" that makes them useful for set reconciliation.

For instance, a rigid tree structure can compare the higher-level nodes between the two sides of the sync and eliminate common nodes early. Unfortunately, this means that implementations have limited freedom to re-arrange or re-balance their trees.

gzip --rsyncable

To better understand the consequences of rigid data-structures, let's return to our rsync case study.

In addition to the "strong" MD5 checksums that rsync uses for content-addressing, rsync also uses a special rolling checksum to detect when blocks of a file have been moved around. For example, if you insert a few bytes at the beginning of a large file, rsync will be able to detect that all chunks of the file have been "shifted", and will not need to retransmit them.

However, suppose you would like to store this large file in a compressed format. In this case, modifying a few bytes at the beginning of the uncompressed file could cause all the later data to be changed, meaning that rsync would not find any shifted blocks and would degrade to transferring the entire file.

To avoid this, gzip supports an --rsyncable flag. When provided, gzip will flush its output stream periodically, reseting the compressor's state, and preserving chunks of compressed data so rsync can sync it more effectively.

But how does gzip know when to flush? It can't be at fixed offsets, because of the shifting issue. It can't look for patterns in the compressed stream because that is what it's trying to control. Therefore, it must look for patterns in its uncompressed input data. Specifically, gzip queues a flush whenever the last 4096 bytes of its input data sum to 0 (mod 4096). There is nothing inherently special about this pattern, except that it is easy to test for and you can expect it to come around relatively often. In cryptanalysis, these arbitrary break-locations are called distinguished points.

When the input data is random, this works quite well. On the other hand, if malicious parties have some level of control over the input data, they can specially compute some data to have an undesired effect.

To demonstrate this, we built a script called gzip-unrsyncable. This will copy data from its input to its output, while looking for the same patterns that gzip --rsyncable does. When it detects one, it modifies the last byte to break the pattern, preventing gzip --rsyncable from ever flushing its stream.

This attack is possible and 100% reliable across all rsync implementations because the pattern used to determine break-points is unchanging and publicly known. In fact, it must be unchanging and publicly known since the protocol depends on this same rigid structure being computed on both sides of the sync.

Merkle-Search/Prolly Trees

Merkle Search and Prolly trees are rigid, probabilistic data-structures roughly based on Merkle trees. In order to determine the structure of their trees, they use unchanging, publicly known patterns in the cryptographic hashes of their elements.

When used for syncing, these rigid trees have average-case behaviour similar to RBSR. However, unlike RBSR, their worst-case behaviour degrades into transferring the entire data-set (or arbitrarily large portions of it). With random elements, this worst-case behaviour is overwhelmingly unlikely to occur. However, in an open-network environment, malicious data can be submitted at any time, and this is something that rigid data-structures are fundamentally vulnerable to.

Also, in many designs, rigid data-structures will require a busy server to implement copy-on-write behaviour to preserve a previous tree structure for ongoing sync sessions while the main version of the tree is being updated with new data. Because it is not rigid, RBSR can freely modify its single source of truth without invalidating sync sessions started in the past, and servers can be entirely stateless.

For these reasons, we believe that RBSR is a superior approach to syncing, and rigid data structures like merkle trees should only be used when membership proofs are required.

Do you like what you are reading? Subscribe to receive updates.

Unsubscribe anytime