Big picture: SQLite itself is not “randomly broken”

The daemon is creating too many concurrent write attempts against a database that can only have one writer at a time.

What the debug page is showing:

  1. Only one SQLite writer can run

    • Every WithTx uses BEGIN IMMEDIATE.

    • That means: “I want the write lock now.”

    • If another writer has it, SQLite waits up to ~10s.

    • After that, it returns SQLITE_BUSY.

  2. Many callers are trying to write at once. Main write sources:

    • blob.(*Index).PutMany-range1: indexing downloaded blobs

    • syncing.(*Server).loadStore: sync/discovery SQL work

    • hmnet.(*Node).connect: peer connection bookkeeping

    • hmnet.(*Node).onLibp2pIdentification / peerWriter: peer table updates

    • blob.(*DomainStore).*: domain cache updates

  3. Some writes are heavy. PutMany and sync/indexing work can hold the writer for hundreds of ms. No single hold comes close to 10s, but during sync bursts such holds happen repeatedly.

  4. Some writes are tiny but numerous. hmnet.(*Node).connect writes only a small peer row update, but there can be many concurrent connects. Each one waits in SQLite for the writer lock, and if the queue is long enough, they time out at 10s.

  5. So connect is mostly a victim, not the original hog. Debug shows:

    • connect hold time is tiny

    • connect wait time is ~10s

    • therefore it is waiting behind other writers

    • but because there are many connect attempts, it also amplifies the storm

  6. The existing fixes help, but don’t fully solve fairness. Existing fixes reduce some heavy holders and batch identify-event peer writes. But direct connect() still does its own synchronous WithTx, and the system still lets many goroutines race into BEGIN IMMEDIATE.

  7. Global diagnosis:

    sync/indexing writes hold SQLite writer
            ↓
    many peer/domain writes queue behind them
            ↓
    SQLite busy handler waits up to 10s per goroutine
            ↓
    some callers hit SQLITE_BUSY
            ↓
    connect/domain bookkeeping creates more noise and contention

  8. Best conceptual fix:

    • Treat SQLite writes as a single-lane road.

    • Don’t let every goroutine independently race for the lane.

    • Put low-priority/best-effort writes behind a Go-level queue/batcher.

    • Make heavy writers release fairly between batches.

  9. Immediate concrete fix:

    • Move hmnet.(*Node).connect peer-row update into peerWriter, same as identify events.

  10. Durable fix:

    • Add app-level writer admission/fairness around WithTx, especially for PutMany, peer writes, and domain writes.
