File Import Libraries Research

File Import Libraries ResearchResearch into frontend libraries for importing .docx, .xlsx, and .pdf files into the Seed Hypermedia block model

Where file import plugs in: Each format parser produces HMBlockNode[], then the existing flattenToOperations() converts those into DocumentOperation[] (ReplaceBlock + MoveBlocks ops). From there, the standard signing and publishing flow takes over — no changes needed downstream.

For updates (document update --replace-body), the CLI already has matchBlockIds() and computeReplaceOps() in block-diff.ts that diff old vs new block trees and emit minimal operations. File import can use the same path.

DOCX — `mammoth`

npm: mammoth
Bundle size: ~160 KB gzipped
Browser support: Yes
Structural quality: Excellent
Converts .docx to clean semantic HTML. Maps Word styles to HTML elements:
| Word Style | HTML Output | |-----------|-------------| | Heading 1 | <h1> | | Heading 2 | <h2> | | Bold | <strong> | | Italic | <em> | | Lists | <ol> / <ul> | | Tables | <table> | | Images | <img> (base64) |
Custom style mappings are supported.
Repo: https://github.com/mwilliamson/mammoth.js
SDK Integration
Two possible conversion paths:
Path A — mammoth → HTML → BlockNote's HTML-to-blocks
BlockNote (used in packages/editor) already has tryParseHTMLToBlocks() which converts HTML into its internal block model. From there, the existing editor-to-HMBlockNode conversion can be reused. This path gets the richest output with minimal new code but couples the import to the editor package.
Path B — mammoth → HTML → custom HMBlockNode[] mapper
Write a standalone HTML-to-HMBlockNode converter. Walk the DOM tree and map:
- <h1>–<h6> → {type: 'Heading', text, annotations} with heading-level nesting
- <p> → {type: 'Paragraph', text, annotations}
- <strong>, <em>, <code>, <a> → annotation spans with starts/ends character positions
- <ul> / <ol> → Paragraph blocks with childrenType: 'Unordered' | 'Ordered'
- <table> → Code block (current CLI behavior) or future Table block type
- <img> → extract base64 data, upload via blob storage, create {type: 'Image', link: ipfsUrl}
Then call flattenToOperations(blocks) and feed into the standard document create or document update flow.
Image handling note: mammoth extracts images as base64 <img> tags. These need to be uploaded to blob storage (UnixFS chunking, ~256KB chunks, via storeBlobs with 4MB message limit) and replaced with IPFS CID links before creating the Image blocks.
CLI usage would look like:
```
seed-cli document create z6Mk... --title "Imported Doc" --docx report.docx --key main
seed-cli document update hm://z6Mk.../doc --replace-docx updated.docx --key main
```
Alternative: officeparser
- npm: officeparser
- Produces a hierarchical AST (paragraphs, headings, tables, lists, bold/italic)
- Also handles .pptx, .xlsx, .odt, .pdf, .rtf
- Newer and less battle-tested than mammoth

XLSX — `read-excel-file` or `xlsx` (SheetJS)

Option A: `read-excel-file` (lightweight, recommended)

npm: read-excel-file
Bundle size: ~37 KB gzipped
Browser support: Yes (browser-first design)
Returns rows as arrays of cells (string/number/Date/boolean)
Supports schema-based parsing for typed JSON objects
Does not handle formulas

Option B: `xlsx` / SheetJS (full-featured)

npm: xlsx
Bundle size: ~300–500 KB gzipped
Browser support: Yes
Handles merged cells, formulas, number formatting
Caveat: npm version stuck at 0.18.5; newer versions distributed via cdn.sheetjs.com
Docs: https://docs.sheetjs.com/

SDK Integration

Spreadsheets don't map to the existing markdown parser at all — they need a dedicated converter.

Conversion flow:

.xlsx → read-excel-file/SheetJS → rows[][] per sheet
  → for each sheet:
      Heading block (sheet name)
        → Table-like structure as child blocks
  → flattenToOperations(blocks)
  → standard signing/publish flow

Block mapping options:

As a Code block (simplest, matches current CLI behavior for tables): Render the spreadsheet as a markdown/CSV table string inside a {type: 'code-block', text: csvString} block. The CLI already renders markdown tables as Code blocks.
As nested Paragraph blocks (richer but verbose): Each row becomes a Paragraph block, cells separated by formatting. Poor UX for large sheets.

Multi-sheet handling: Each worksheet becomes a Heading block with the sheet's content as children. The flattenToOperations() function already handles nested parent-child block trees via MoveBlocks operations.

CLI usage would look like:

seed-cli document create z6Mk... --title "Q4 Report" --xlsx data.xlsx --key main
# Each sheet becomes a section under the document

Practical recommendation: Start with Option A (read-excel-file) rendering sheets as Code blocks containing markdown tables. This requires zero schema changes and works with the existing pipeline immediately.

PDF — `pdfjs-dist`

npm: pdfjs-dist
Bundle size: ~400–800 KB gzipped (heavy)
Browser support: Yes (this is Firefox's PDF engine)
Structural quality: Low — PDFs have no semantic structure
Extracts text items with position coordinates (x, y), font info, and text content per page via page.getTextContent(). Does not natively give you headings, paragraphs, or tables — you must reconstruct structure from:
- Headings: larger font size
- Paragraphs: spatial proximity grouping
- Tables: grid-aligned text detection
- Images: separate rendering pass
Repo: https://github.com/mozilla/pdf.js
SDK Integration
PDF is the hardest format because pdfjs-dist returns positioned text items, not semantic blocks. A heuristic layer is needed between the parser and HMBlockNode[].
Conversion flow:
```
.pdf → pdfjs getTextContent() → TextItem[] with {str, transform, fontName, ...}
  → heuristic grouping:
      1. Group text items into lines (same Y coordinate within tolerance)
      2. Group lines into paragraphs (vertical gap < threshold)
      3. Detect headings (font size > body font size)
      4. Detect lists (lines starting with "•", "-", "1.")
  → HMBlockNode[] (mostly Paragraph + Heading blocks)
  → flattenToOperations(blocks)
  → standard signing/publish flow
```
What maps cleanly to HMBlockNode:
- Large-font text → {type: 'Heading', text} with annotations
- Body text groups → {type: 'Paragraph', text} with bold/italic from font metadata
- Bullet/numbered patterns → Paragraph with childrenType: 'Unordered' | 'Ordered'
What doesn't map well:
- Multi-column layouts (text items interleave columns)
- Tables (must detect grid alignment — complex heuristic)
- Images (requires separate canvas rendering per page, then blob upload)
- Headers/footers (appear as regular text items)
- Mathematical formulas (not extractable as LaTeX)
The existing parseInlineFormatting() in markdown.ts builds annotation spans from character positions. The same pattern applies here — as you concatenate text items into a paragraph string, track font changes to create Bold/Italic annotations with correct starts/ends offsets.
CLI usage would look like:
```
seed-cli document create z6Mk... --title "Research Paper" --pdf paper.pdf --key main
# Best-effort text extraction with heuristic heading detection
```
Practical recommendation: Offer this as a "basic text import" rather than promising structural fidelity. The --verbose flag could show warnings about low-confidence heading detection or skipped elements.
Alternative: unpdf
- npm: unpdf
- Bundle size: ~390 KB gzipped
- Cleaner async/await API wrapper around pdf.js
- Methods: extractText, extractLinks, getMeta
- Repo: https://github.com/unjs/unpdf
- SDK note: extractText returns plain string per page — simpler but loses all font/position metadata needed for heading detection. Only viable for plain-text-only import.

Summary

All three formats converge at the same point: once you have HMBlockNode[], the existing flattenToOperations() → createChange() → createVersionRef() → client.publish() pipeline handles the rest unchanged.

DOCX is the clear quick win — mammoth's HTML output maps almost 1:1 to the block model. XLSX is straightforward but limited to table-like rendering. PDF requires the most custom heuristic code for the least structural fidelity.

Do you like what you are reading? Subscribe to receive updates.

Unsubscribe anytime

DOCX — `mammoth`

SDK Integration

Alternative: `officeparser`

XLSX — `read-excel-file` or `xlsx` (SheetJS)

Option A: `read-excel-file` (lightweight, recommended)

Option B: `xlsx` / SheetJS (full-featured)

SDK Integration

PDF — `pdfjs-dist`

SDK Integration

Alternative: `unpdf`

Summary

DOCX — mammoth

SDK Integration

Alternative: officeparser

XLSX — read-excel-file or xlsx (SheetJS)

Option A: read-excel-file (lightweight, recommended)

Option B: xlsx / SheetJS (full-featured)

SDK Integration

PDF — pdfjs-dist

SDK Integration

Alternative: unpdf

Summary

DOCX — `mammoth`

Alternative: `officeparser`

XLSX — `read-excel-file` or `xlsx` (SheetJS)

Option A: `read-excel-file` (lightweight, recommended)

Option B: `xlsx` / SheetJS (full-featured)

PDF — `pdfjs-dist`

Alternative: `unpdf`