Some customers want our help with document processing, for example emails or invoices. We need to extract the information from each document and turn it into structured records.

    I’d treat this as an “ingestion → extraction → validation → structuring → publishing” pipeline, with humans-in-the-loop where it matters (accuracy, exceptions, and policy).

    1) Define the target “structured record”

      Start with a schema per document type (invoice, receipt, purchase order, email request, etc.). Keep it simple and extensible.

      Invoice example (core fields)

        Vendor: name, VAT ID, address

        Buyer: name, VAT ID

        Invoice: number, issue date, due date, currency

        Totals: subtotal, tax breakdown, total

        Line items: description, qty, unit price, tax rate

        Payment: IBAN, payment terms

        Provenance: source file/email id, received timestamp, page count

        Evidence: “this value came from this snippet / bbox / page”

    That last part (evidence) is crucial for trust and audits.
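
      As a concrete starting point, here’s a minimal sketch of that record as Python dataclasses (field names and types are illustrative, not a standard):

      ```python
      # A minimal invoice record as Python dataclasses (illustrative names/types).
      from dataclasses import dataclass, field

      @dataclass
      class Evidence:
          snippet: str                    # exact text the value came from
          page: int                       # 1-based page number
          bbox: tuple[float, float, float, float] | None = None  # x0, y0, x1, y1

      @dataclass
      class LineItem:
          description: str
          qty: float
          unit_price: float
          tax_rate: float

      @dataclass
      class Invoice:
          vendor_name: str
          vendor_vat_id: str | None
          buyer_name: str
          invoice_number: str
          issue_date: str                 # ISO 8601
          due_date: str | None
          currency: str                   # ISO 4217, e.g. "EUR"
          subtotal: float
          total: float
          line_items: list[LineItem] = field(default_factory=list)
          iban: str | None = None
          source_id: str = ""             # provenance: source file/email id
          evidence: dict[str, Evidence] = field(default_factory=dict)  # field name -> where it was found
      ```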

    2) Ingest + normalize

      Inputs: PDFs, scans, email bodies, email attachments, EDI-ish PDFs, images.

      Steps:

        Collect from sources (email inbox, upload folder, API).

        Convert to a canonical “document bundle”:

          text (best-effort)

          layout (pages, blocks)

          images (per page)

          metadata (sender, dates, thread id)

        De-duplicate (hashing) and classify.
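
      A sketch of the canonical bundle plus hash-based de-duplication (the structure is illustrative):

      ```python
      # Canonical "document bundle" plus content-hash de-duplication (sketch).
      import hashlib
      from dataclasses import dataclass, field

      @dataclass
      class Page:
          text: str                      # best-effort extracted text
          image_path: str | None = None  # rendered page image, if any

      @dataclass
      class DocumentBundle:
          doc_id: str
          raw_bytes: bytes               # original file, kept immutable
          pages: list[Page] = field(default_factory=list)
          metadata: dict = field(default_factory=dict)  # sender, dates, thread id

          def content_hash(self) -> str:
              return hashlib.sha256(self.raw_bytes).hexdigest()

      _seen_hashes: set[str] = set()

      def is_duplicate(bundle: DocumentBundle) -> bool:
          h = bundle.content_hash()
          if h in _seen_hashes:
              return True
          _seen_hashes.add(h)
          return False
      ```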

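      The flow, sketched as a Mermaid diagram (reconstructed from the steps described in this document):

      ```mermaid
      %% Reconstructed from the pipeline steps described in this document
      flowchart LR
          A[Ingest: email / uploads / API] --> B[Normalize to document bundle]
          B --> C[Classify + route]
          C --> D[Extract: rules, OCR, LLM]
          D --> E[Validate + score confidence]
          E -->|high| F[Auto-ingest]
          E -->|medium| G[Human review]
          E -->|low| H[Manual entry]
          F --> I[Export to customer systems]
          G --> I
      ```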

    3) Classify document type + route

      Use a lightweight classifier:

        Heuristics (sender, keywords like “Invoice”, “Factura”, amounts, IBAN)

        ML/LLM classification as fallback

      Route to an extractor specialized for:

        Invoices

        Receipts

        Contracts

        Emails (requests, approvals, complaints, support)
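
      A sketch of the heuristics-first classifier (keyword lists and thresholds are illustrative; `llm_classify` is a placeholder for your model call):

      ```python
      # Heuristics first, ML/LLM fallback (sketch; keyword lists are illustrative).
      import re

      KEYWORDS = {
          "invoice": ["invoice", "factura", "rechnung", "vat", "total due"],
          "receipt": ["receipt", "cash", "change due"],
          "contract": ["agreement", "party", "hereinafter"],
          "email_request": ["could you", "please", "request"],
      }

      IBAN_RE = re.compile(r"\b[A-Z]{2}\d{2}[A-Z0-9]{10,30}\b")

      def classify(text: str, sender: str = "") -> str:
          t = text.lower()
          scores = {k: sum(kw in t for kw in kws) for k, kws in KEYWORDS.items()}
          if IBAN_RE.search(text):
              scores["invoice"] += 2       # an IBAN is a strong invoice signal
          if "billing" in sender or "invoice" in sender:
              scores["invoice"] += 1       # sender heuristic
          best = max(scores, key=scores.get)
          if scores[best] >= 2:            # confident enough for heuristics alone
              return best
          return llm_classify(text)        # fall back to a model

      def llm_classify(text: str) -> str:
          # Placeholder: call your classifier/LLM and return one of the known types.
          raise NotImplementedError
      ```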

    4) Extract with “hybrid” methods (best results in practice)

      Don’t bet everything on one technique.

      For digital PDFs (text-based):

        Parse text + layout (tables, key-value zones)

        Use deterministic patterns for high-signal fields (VAT/IVA IDs, dates, invoice number formats, IBAN)

      For scanned PDFs/images:

        OCR

        Then the same as above, but with lower confidence

      LLM step (structured):

        Ask the model to output strict JSON that matches your schema

        Provide the model with:

          extracted text

          layout hints (tables, page headings)

          instructions like “return null if missing, don’t guess”

        Have the model also return citations/evidence (snippet + page, or bbox id) for each field.
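
      A sketch of that LLM step; `call_llm` stands in for whatever model client you use, and the prompt wording is only a starting point:

      ```python
      # Strict-JSON extraction with per-field evidence (sketch).
      import json

      PROMPT = """Extract an invoice record from the document below.
      Return ONLY JSON matching this schema. Use null for missing values; do not guess.
      For each field, include evidence as {{"snippet": "...", "page": N}}.

      Schema:
      {schema}

      Document text:
      {text}
      """

      def call_llm(prompt: str) -> str:
          # Placeholder for your model client (OpenAI, Anthropic, local, ...).
          raise NotImplementedError

      def extract_invoice(text: str, schema: str) -> dict:
          raw = call_llm(PROMPT.format(schema=schema, text=text))
          record = json.loads(raw)   # fail loudly if the model broke the JSON contract
          if "evidence" not in record:
              raise ValueError("model returned no evidence map")
          return record
      ```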

    5) Validate and score confidence

      Run validators after extraction:

        Invoice number present?

        Totals match: sum(line_items) ≈ subtotal, subtotal + taxes ≈ total

        Dates are sensible (due date ≥ issue date)

        VAT/IVA format valid per country

        IBAN checksum valid

        Currency code matches the currency symbols found in the document

      Compute an overall confidence score and decide automation level:

        High confidence → auto-ingest

        Medium → “review required”

        Low → “manual entry”
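
      A sketch of the validators and the routing rule (thresholds and field names are illustrative; dates are assumed to be ISO 8601 strings):

      ```python
      # Validation + confidence routing (sketch).
      def iban_is_valid(iban: str) -> bool:
          """ISO 13616 mod-97 check: move the first 4 chars to the end,
          map letters to numbers (A=10 ... Z=35), and test mod 97 == 1."""
          s = iban.replace(" ", "").upper()
          if not (15 <= len(s) <= 34) or not s.isalnum() or not s[:2].isalpha():
              return False
          digits = "".join(str(int(c, 36)) for c in s[4:] + s[:4])
          return int(digits) % 97 == 1

      def totals_match(items: list[dict], subtotal: float, taxes: float,
                       total: float, tol: float = 0.01) -> bool:
          calc = sum(i["qty"] * i["unit_price"] for i in items)
          return abs(calc - subtotal) <= tol and abs(subtotal + taxes - total) <= tol

      def route(rec: dict) -> str:
          checks = [
              rec.get("invoice_number") is not None,
              totals_match(rec["line_items"], rec["subtotal"], rec["taxes"], rec["total"]),
              (rec.get("due_date") or rec["issue_date"]) >= rec["issue_date"],  # ISO strings sort correctly
              iban_is_valid(rec.get("iban") or ""),
          ]
          score = sum(checks) / len(checks)
          if score >= 0.9:                 # illustrative thresholds
              return "auto-ingest"
          return "review required" if score >= 0.6 else "manual entry"
      ```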

    6) Human-in-the-loop review UI (where you win deals)

      For medium confidence cases:

        Show the document side-by-side with extracted fields

        Highlight evidence snippets

        One-click fixes plus a captured “why” (so the system can learn from them)

      Every correction becomes training data:

        vendor-specific templates

        recurring line-item patterns

        preferred mappings (e.g., account codes, cost centers)
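
      One way to capture corrections so they are usable as training data (fields are illustrative):

      ```python
      # Every reviewer correction becomes a labeled example (sketch).
      from dataclasses import dataclass

      @dataclass
      class Correction:
          doc_id: str
          field: str                     # e.g. "total", "vendor_vat_id"
          old_value: str | None          # what the extractor produced
          new_value: str                 # what the reviewer entered
          reason: str                    # the captured "why"
          vendor_id: str | None = None   # enables vendor-specific templates
      ```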

    7) Map to the customer’s systems

      Structured output typically needs to flow into:

        ERP/accounting (NetSuite, SAP, Odoo, QuickBooks, Xero)

        CRM/ticketing (HubSpot, Zendesk, Jira)

        Document repository / knowledge base

      Use a canonical internal model → export adapters:

        JSON (API)

        CSV (legacy)

        UBL / Factur-X / PEPPOL-like formats if needed
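
      The adapter idea in code (a sketch; real ERP exports will need per-target field mappings):

      ```python
      # One canonical record, thin export adapters per target (sketch).
      import csv
      import io
      import json

      def to_json(record: dict) -> str:
          return json.dumps(record, ensure_ascii=False, indent=2)

      def to_csv(records: list[dict], fields: list[str]) -> str:
          buf = io.StringIO()
          writer = csv.DictWriter(buf, fieldnames=fields, extrasaction="ignore")
          writer.writeheader()
          writer.writerows(records)
          return buf.getvalue()

      # Adding a new target (ERP API, UBL, ...) means adding one adapter here.
      # (Signatures vary per target; a real registry would normalize them.)
      ADAPTERS = {"json": to_json, "csv": to_csv}
      ```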

    8) Store as “structured + source + provenance”

      Keep:

        Original document (immutable)

        Extracted structured record (versioned)

        Evidence map (field → snippet/page/bbox)

        Processing log (model version, OCR version, rules triggered)

      This makes audits, dispute resolution, and debugging straightforward.
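
      Roughly what gets persisted per document (the shape and the version tags are illustrative):

      ```python
      # One persisted processing result per document version (sketch).
      invoice_record = {"invoice_number": "F-2024-001", "total": 1210.00}  # abbreviated

      stored = {
          "original_uri": "s3://bucket/inbox/abc123.pdf",  # immutable source (illustrative path)
          "record": invoice_record,                        # extracted structured record
          "record_version": 3,
          "evidence": {"total": {"snippet": "Total: 1.210,00 EUR", "page": 1}},
          "log": {
              "model": "extractor-v2",                     # illustrative version tags
              "ocr": "tesseract-5.3",
              "rules_triggered": ["iban_checksum", "totals_match"],
          },
      }
      ```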

    9) Practical deployment approach

      Phase 1 (2–4 weeks): one document type (invoices) + 2–3 pilot customers

        Build schema, ingestion, extraction, validation, review UI, export to one target system.

      Phase 2: vendor learning + email intake

        Recognize repeat vendors and handle their quirks

        Parse “invoice via email” workflows (threading, attachments, approvals)

      Phase 3: multi-doc workflows

        Purchase order ↔ invoice matching

        Payment status updates

        Exceptions and disputes

    10) What I’d implement first (minimal but powerful)

      Ingestion from email + uploads

      Invoice classifier

      Extraction pipeline:

        PDF text extraction → LLM JSON

        OCR fallback

      Validation engine (math + formats)

      Review UI with evidence highlights

      Export to customer’s accounting system

      Continuous learning from corrections
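
      Composed from the sketches above, the first pilot is roughly one orchestration function (`normalize`, `export`, and `INVOICE_SCHEMA` are hypothetical helpers):

      ```python
      # End-to-end orchestration of the minimal pipeline (sketch; composes the
      # earlier snippets, with `normalize` and `export` left as hypothetical helpers).
      def process(raw: bytes, metadata: dict) -> str:
          bundle = normalize(raw, metadata)                 # step 2: ingest + normalize
          if is_duplicate(bundle):
              return "duplicate"
          text = "\n".join(p.text for p in bundle.pages)
          if classify(text, metadata.get("sender", "")) != "invoice":
              return "routed elsewhere"
          record = extract_invoice(text, schema=INVOICE_SCHEMA)  # step 4: LLM JSON
          decision = route(record)                          # step 5: validate + score
          if decision == "auto-ingest":
              export(record)                                # step 7: push to ERP/CRM
          return decision
      ```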

      If you tell me:

        which formats (Spanish “Factura” PDFs? scans?),

        where the structured data needs to end up (ERP/CRM?), and

        typical volume (10/day vs 10k/day),
        I can sketch an architecture and a concrete “first pilot” plan with the exact fields and validators.

    How to archive emails

      Perfect — I’ll format this as a Seed-ready archive document, following the conventions you described:

        One Email Thread document

        One Email Message document per message

        Attachments linked

        Structured metadata

        Full provenance-ready structure

        Clean block structure for htmlToBlocks() or markdown import

      Below is a Seed archive representation in Markdown form, ready for seed document create.

    Invoice