Where file import plugs in: Each format parser produces HMBlockNode[], then the existing flattenToOperations() converts those into DocumentOperation[] (ReplaceBlock + MoveBlocks ops). From there, the standard signing and publishing flow takes over — no changes needed downstream.
For updates (document update --replace-body), the CLI already has matchBlockIds() and computeReplaceOps() in block-diff.ts that diff old vs new block trees and emit minimal operations. File import can use the same path.
DOCX — mammoth
npm: mammoth
Bundle size: ~160 KB gzipped
Browser support: Yes
Structural quality: Excellent
Converts .docx to clean semantic HTML. Maps Word styles to HTML elements:
| Word Style | HTML Output | |-----------|-------------| | Heading 1 | <h1> | | Heading 2 | <h2> | | Bold | <strong> | | Italic | <em> | | Lists | <ol> / <ul> | | Tables | <table> | | Images | <img> (base64) |
Custom style mappings are supported.
Repo: https://github.com/mwilliamson/mammoth.js
SDK Integration
Two possible conversion paths:
Path A — mammoth → HTML → BlockNote's HTML-to-blocks
BlockNote (used in packages/editor) already has tryParseHTMLToBlocks() which converts HTML into its internal block model. From there, the existing editor-to-HMBlockNode conversion can be reused. This path gets the richest output with minimal new code but couples the import to the editor package.
Path B — mammoth → HTML → custom HMBlockNode[] mapper
Write a standalone HTML-to-HMBlockNode converter. Walk the DOM tree and map:
<h1>–<h6> → {type: 'Heading', text, annotations} with heading-level nesting
<p> → {type: 'Paragraph', text, annotations}
<strong>, <em>, <code>, <a> → annotation spans with starts/ends character positions
<ul> / <ol> → Paragraph blocks with childrenType: 'Unordered' | 'Ordered'
<table> → Code block (current CLI behavior) or future Table block type
<img> → extract base64 data, upload via blob storage, create {type: 'Image', link: ipfsUrl}
Then call flattenToOperations(blocks) and feed into the standard document create or document update flow.
Image handling note: mammoth extracts images as base64 <img> tags. These need to be uploaded to blob storage (UnixFS chunking, ~256KB chunks, via storeBlobs with 4MB message limit) and replaced with IPFS CID links before creating the Image blocks.
CLI usage would look like:
seed-cli document create z6Mk... --title "Imported Doc" --docx report.docx --key main
seed-cli document update hm://z6Mk.../doc --replace-docx updated.docx --key main
Alternative: officeparser
npm: officeparser
Produces a hierarchical AST (paragraphs, headings, tables, lists, bold/italic)
Also handles .pptx, .xlsx, .odt, .pdf, .rtf
Newer and less battle-tested than mammoth
XLSX — read-excel-file or xlsx (SheetJS)
Option A: read-excel-file (lightweight, recommended)
npm: read-excel-file
Bundle size: ~37 KB gzipped
Browser support: Yes (browser-first design)
Returns rows as arrays of cells (string/number/Date/boolean)
Supports schema-based parsing for typed JSON objects
Does not handle formulas
Option B: xlsx / SheetJS (full-featured)
npm: xlsx
Bundle size: ~300–500 KB gzipped
Browser support: Yes
Handles merged cells, formulas, number formatting
Caveat: npm version stuck at 0.18.5; newer versions distributed via cdn.sheetjs.com
Docs: https://docs.sheetjs.com/
SDK Integration
Spreadsheets don't map to the existing markdown parser at all — they need a dedicated converter.
Conversion flow:
.xlsx → read-excel-file/SheetJS → rows[][] per sheet
→ for each sheet:
Heading block (sheet name)
→ Table-like structure as child blocks
→ flattenToOperations(blocks)
→ standard signing/publish flow
Block mapping options:
As a Code block (simplest, matches current CLI behavior for tables): Render the spreadsheet as a markdown/CSV table string inside a {type: 'code-block', text: csvString} block. The CLI already renders markdown tables as Code blocks.
As nested Paragraph blocks (richer but verbose): Each row becomes a Paragraph block, cells separated by formatting. Poor UX for large sheets.
Multi-sheet handling: Each worksheet becomes a Heading block with the sheet's content as children. The flattenToOperations() function already handles nested parent-child block trees via MoveBlocks operations.
CLI usage would look like:
seed-cli document create z6Mk... --title "Q4 Report" --xlsx data.xlsx --key main
# Each sheet becomes a section under the document
Practical recommendation: Start with Option A (read-excel-file) rendering sheets as Code blocks containing markdown tables. This requires zero schema changes and works with the existing pipeline immediately.
PDF — pdfjs-dist
npm: pdfjs-dist
Bundle size: ~400–800 KB gzipped (heavy)
Browser support: Yes (this is Firefox's PDF engine)
Structural quality: Low — PDFs have no semantic structure
Extracts text items with position coordinates (x, y), font info, and text content per page via page.getTextContent(). Does not natively give you headings, paragraphs, or tables — you must reconstruct structure from:
Headings: larger font size
Paragraphs: spatial proximity grouping
Tables: grid-aligned text detection
Images: separate rendering pass
Repo: https://github.com/mozilla/pdf.js
SDK Integration
PDF is the hardest format because pdfjs-dist returns positioned text items, not semantic blocks. A heuristic layer is needed between the parser and HMBlockNode[].
Conversion flow:
.pdf → pdfjs getTextContent() → TextItem[] with {str, transform, fontName, ...}
→ heuristic grouping:
1. Group text items into lines (same Y coordinate within tolerance)
2. Group lines into paragraphs (vertical gap < threshold)
3. Detect headings (font size > body font size)
4. Detect lists (lines starting with "•", "-", "1.")
→ HMBlockNode[] (mostly Paragraph + Heading blocks)
→ flattenToOperations(blocks)
→ standard signing/publish flow
What maps cleanly to HMBlockNode:
Large-font text → {type: 'Heading', text} with annotations
Body text groups → {type: 'Paragraph', text} with bold/italic from font metadata
Bullet/numbered patterns → Paragraph with childrenType: 'Unordered' | 'Ordered'
What doesn't map well:
Multi-column layouts (text items interleave columns)
Tables (must detect grid alignment — complex heuristic)
Images (requires separate canvas rendering per page, then blob upload)
Headers/footers (appear as regular text items)
Mathematical formulas (not extractable as LaTeX)
The existing parseInlineFormatting() in markdown.ts builds annotation spans from character positions. The same pattern applies here — as you concatenate text items into a paragraph string, track font changes to create Bold/Italic annotations with correct starts/ends offsets.
CLI usage would look like:
seed-cli document create z6Mk... --title "Research Paper" --pdf paper.pdf --key main
# Best-effort text extraction with heuristic heading detection
Practical recommendation: Offer this as a "basic text import" rather than promising structural fidelity. The --verbose flag could show warnings about low-confidence heading detection or skipped elements.
Alternative: unpdf
npm: unpdf
Bundle size: ~390 KB gzipped
Cleaner async/await API wrapper around pdf.js
Methods: extractText, extractLinks, getMeta
Repo: https://github.com/unjs/unpdf
SDK note: extractText returns plain string per page — simpler but loses all font/position metadata needed for heading detection. Only viable for plain-text-only import.
Summary
All three formats converge at the same point: once you have HMBlockNode[], the existing flattenToOperations() → createChange() → createVersionRef() → client.publish() pipeline handles the rest unchanged.
DOCX is the clear quick win — mammoth's HTML output maps almost 1:1 to the block model. XLSX is straightforward but limited to table-like rendering. PDF requires the most custom heuristic code for the least structural fidelity.