Fast Blob Propagation

Fast Blob PropagationGet a published blob to a connected peer in about a second by caching the server's reconciliation set in a monoid tree keyed by inbound demand, plus codec canonicalization on a new /0.9.3 protocol.

Problem

When you publish something in Seed, a peer that's already connected to you and looking at the doc should see it in about a second. Right now it can take up to 20.

That's because of the discovery safety break. A live discovery task only reruns every 20 seconds. We made it that slow on purpose: every rerun is expensive. For each peer, and again for each round inside that sync, both sides rebuild their whole reconciliation set from the database from scratch — scope query, dedup, sort, the works. Doing that once a second would cook the CPU and the network.

So the cooldown isn't really the problem. The problem is that there's no cheap path for the case that's true almost all the time: nothing changed.

Solution

The fix is server-side, because that's where the cost is.

When a client discovers a doc, it builds its own "what do I have" set once and reuses it for every peer it talks to. That side is already amortized. The server is the one paying the bill. Every inbound ReconcileBlobs call rebuilds that set from scratch — scope query, dedup, sort, the whole thing — and the client drives a multi-round loop, so the server pays it per round, per peer. A gateway hosting one popular doc gets hit hundreds of times for the same set of bytes that didn't change.

Make the server's set build cheap and we can drop the cooldown to 1s without anything melting.

The way to make it cheap is a monoid tree — credit to Alex Burdiyan for pointing at it. RBSR already wants fingerprints of arbitrary ranges (that's literally what SplitRange produces). A monoid tree keeps the set sorted by (ts, hash) with each node caching the combine of its children's fingerprints. Any range you ask for, you get back in O(log N). The monoid is the one RBSR already uses — the commutative 256-bit sum plus a count, associative, zero identity — so the tree is just an index over the same math.

That replaces the per-round, per-peer rebuild with a tree lookup. The whole reconciliation gets cheap, in-sync or not: the existing first RBSR round already settles the "nothing changed" case, and the mismatch case (a real new blob — the case that actually matters) is fast too. No new protocol round is needed.

The negentropy reference implementation we copied RBSR from (hoytech/negentropy) ships both a memory-backed and an LMDB-backed variant of this tree. We'd write our own SQLite-backed version eventually, but not on day one (more on that below).

Canonicalization — the only wire change

The tree's fingerprints are over CIDs, and the CID includes the codec. Same multihash under raw=85 on one peer and dag-pb=112 on another comes out as two different items in the tree, so the fingerprints differ even though the bytes are identical. On the real network this is the single biggest source of false RBSR divergence. If we don't fix it, the tree's cheapness gets wasted answering "you have X, I have Y" when really we both have the same thing.

The fix is one rule, applied only to what we advertise — not to what we store:

func canonicalCodecFor(codec uint64) uint64 {
    switch codec {
    case multicodec.DagPb: // 112
        return multicodec.Raw // 85
    default:
        return codec
    }
}

raw is the IPFS standard for chunk leaves and what most peers already use, so picking it minimizes the population of CIDs that actually change. Everything else passes through. Nothing in the database moves. We look blobs up by their hash anyway, never by codec.

Because this changes what goes on the wire, it rides a new protocol version. That's the only wire change in this project.

| Protocol | Advertised set | Effect |
|---|---|---|
| `/seed/hypermedia/0.9.2` | as-is | what we do today; old peers don't change |
| `/seed/hypermedia/0.9.3` | codec-canonical | far fewer false differences across peers |

Two peers always pick the highest version they both speak, so a 0.9.3 ↔ 0.9.3 connection is fully aligned and 0.9.3 ↔ 0.9.2 behaves exactly like today. The negotiation machinery is already there.

Where the tree lives and what triggers building it

The tree is server-side, keyed by inbound demand. A gateway hosting a doc has no idea its users are looking at it — the only thing it sees is a stream of ReconcileBlobs calls coming in. So that's the signal. The first time a peer asks the server to reconcile a particular set of filters, the server builds the tree for that scope (one full scan, same cost as today's one-shot build) and stashes it in an in-memory cache. Every subsequent round in the same client's exchange reuses it. Every other peer asking for the same scope reuses it. If nobody asks for a while, an LRU drops it.

It's tempting to key the tree off the scopes a node is itself discovering, but that's the wrong signal on the server: the server isn't discovering, it's responding. Keying off our own discovery scopes would miss the actual load. What matters is "who's asking us to reconcile what."

A client-side version of the same tree, keyed off the client's own discovery scopes, would also help — but the client already amortizes across peers in a single discovery run, so it's a much smaller win and we're deferring it.

Keeping the tree fresh without rebuilding it

Once a scope's tree exists, we need to keep it correct as new blobs come in, or the cached tree would lie. We do that incrementally: every time the indexer commits a blob, we figure out which live cached scopes that blob belongs to and patch a single root-to-leaf path in each — about O(log N) node updates per scope. New blobs almost always sort to the right edge of the tree by (ts, hash), which is the cheapest case. We're not rebuilding the tree on writes, we're patching it.

The hard part — and the part Alex called the most intimidating, which I agree with — is knowing which scopes a new blob belongs to. A blob's "home" resource is easy. But scopes also include items reached through links and citations, plus inbound Contact blobs anchored to someone else's resource (a follower record lives under the follower, not the followee). So an incoming blob can land in a scope by way of an edge that wasn't obvious from the blob alone. Worse: a brand-new link blob can retroactively pull an existing blob into a scope.

Two things make this tractable here:

We only have to answer it for the cached scopes, which is a small LRU set, not the whole corpus.
When in doubt, mark it as a member. A false positive just patches an extra tree path. A false negative would silently leave a real new blob out of the fingerprint, which is the failure we cannot have — that's a blob that never propagates. So the rule of thumb is to over-include rather than under-include, and lean on tests that pin this down: a blob that enters a scope via any of those closure edges must show up in that scope's tree fingerprint; an unrelated blob must not.

The same logic handles private vs public visibility. The monoid is additive, so a peer authorized for private space P sees public_tree + P_tree. We keep a public tree per scope and small per-space trees alongside, and compose them per requesting peer. Public dominates by far (around 99% of blobs) so this is essentially free.

v1 in memory, v2 on disk

We're keeping the tree in memory for v1, even though the natural endgame is SQLite-backed.

The reason is the indexer. reindex() wipes its list of derived tables and rebuilds them from scratch, which is where the ~1-minute cost on a normal-sized node lives. If the tree were a derived table, it'd be in that list, the reindex would have to rebuild every scope's tree, and we'd ship a schema migration with a backfill. A node coming online after being offline for a while would also patch every scope-tree for every catching-up blob.

In-memory dodges all of it. Trees start empty on boot, fill lazily as inbound reconciles arrive, drop on eviction. No schema change, no migration, no reindex impact, no catch-up CPU spike. The lazy-build cost on first inbound reconcile is exactly the cost we already pay today (one loadStore call), and after that everything is O(log N).

The on-disk version is real and we'll want it eventually — it survives restarts and lets us hold trees for the whole corpus, not just what's currently being asked about. But it needs the membership oracle to work over every scope in the system, not just the small live set, and that's a bigger commitment than we're making in v1. Land v1, measure the oracle in production, then graduate.

Dropping the cooldown

With the server-side cost flattened, the 20-second discovery cooldown stops being load-bearing. We make it a config knob (HotCooldown, default 1 second) instead of a constant. The scheduler already reruns hot tasks at the cooldown cadence — drop the value, get 1-second cadence, no frontend change.

We do this last and behind a config knob on purpose. If anything in the tree path turns out slower than expected under real load, we can bump the cooldown back up without rolling anything back.

That's the whole project: codec canonicalization on /0.9.3, a server-side in-memory monoid tree keyed off inbound reconcile demand, and the cooldown knob.

Do you like what you are reading? Subscribe to receive updates.

Unsubscribe anytime

Range-Based Set Reconciliation

Document Changed(3 Jun)