Some customers want our help with document processing, for example emails or invoices. We need to extract the information from each document and turn it into structured records.

    I’d treat this as an “ingestion → extraction → validation → structuring → publishing” pipeline, with humans-in-the-loop where it matters (accuracy, exceptions, and policy).

    1) Define the target “structured record”

      Start with a schema per document type (invoice, receipt, purchase order, email request, etc.). Keep it simple and extensible.

      Invoice example (core fields)

        Vendor: name, VAT ID, address

        Buyer: name, VAT ID

        Invoice: number, issue date, due date, currency

        Totals: subtotal, tax breakdown, total

        Line items: description, qty, unit price, tax rate

        Payment: IBAN, payment terms

        Provenance: source file/email id, received timestamp, page count

        Evidence: “this value came from this snippet / bbox / page”

    That last part (evidence) is crucial for trust and audits.
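
      As a concrete starting point, here’s a minimal sketch of that record as Python dataclasses (field names and types are illustrative, not a standard):

      ```python
      # A minimal invoice record as Python dataclasses (illustrative names/types).
      from dataclasses import dataclass, field

      @dataclass
      class Evidence:
          snippet: str                    # exact text the value came from
          page: int                       # 1-based page number
          bbox: tuple[float, float, float, float] | None = None  # x0, y0, x1, y1

      @dataclass
      class LineItem:
          description: str
          qty: float
          unit_price: float
          tax_rate: float

      @dataclass
      class Invoice:
          vendor_name: str
          vendor_vat_id: str | None
          buyer_name: str
          invoice_number: str
          issue_date: str                 # ISO 8601
          due_date: str | None
          currency: str                   # ISO 4217, e.g. "EUR"
          subtotal: float
          total: float
          line_items: list[LineItem] = field(default_factory=list)
          iban: str | None = None
          source_id: str = ""             # provenance: source file/email id
          evidence: dict[str, Evidence] = field(default_factory=dict)  # field name -> where it was found
      ```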

    2) Ingest + normalize

      Inputs: PDFs, scans, email bodies, email attachments, EDI-ish PDFs, images.

      Steps:

        Collect from sources (email inbox, upload folder, API).

        Convert to a canonical “document bundle”:

          text (best-effort)

          layout (pages, blocks)

          images (per page)

          metadata (sender, dates, thread id)

        De-duplicate (hashing) and classify.
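
      A sketch of the canonical bundle plus hash-based de-duplication (the structure is illustrative):

      ```python
      # Canonical "document bundle" plus content-hash de-duplication (sketch).
      import hashlib
      from dataclasses import dataclass, field

      @dataclass
      class Page:
          text: str                      # best-effort extracted text
          image_path: str | None = None  # rendered page image, if any

      @dataclass
      class DocumentBundle:
          doc_id: str
          raw_bytes: bytes               # original file, kept immutable
          pages: list[Page] = field(default_factory=list)
          metadata: dict = field(default_factory=dict)  # sender, dates, thread id

          def content_hash(self) -> str:
              return hashlib.sha256(self.raw_bytes).hexdigest()

      _seen_hashes: set[str] = set()

      def is_duplicate(bundle: DocumentBundle) -> bool:
          h = bundle.content_hash()
          if h in _seen_hashes:
              return True
          _seen_hashes.add(h)
          return False
      ```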

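      The flow, sketched as a Mermaid diagram (reconstructed from the steps described in this document):

      ```mermaid
      %% Reconstructed from the pipeline steps described in this document
      flowchart LR
          A[Ingest: email / uploads / API] --> B[Normalize to document bundle]
          B --> C[Classify + route]
          C --> D[Extract: rules, OCR, LLM]
          D --> E[Validate + score confidence]
          E -->|high| F[Auto-ingest]
          E -->|medium| G[Human review]
          E -->|low| H[Manual entry]
          F --> I[Export to customer systems]
          G --> I
      ```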

    3) Classify document type + route

      Use a lightweight classifier:

        Heuristics (sender, keywords like “Invoice”, “Factura”, amounts, IBAN)

        ML/LLM classification as fallback

      Route to an extractor specialized for:

        Invoices

        Receipts

        Contracts

        Emails (requests, approvals, complaints, support)
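
      A sketch of the heuristics-first classifier (keyword lists and thresholds are illustrative; `llm_classify` is a placeholder for your model call):

      ```python
      # Heuristics first, ML/LLM fallback (sketch; keyword lists are illustrative).
      import re

      KEYWORDS = {
          "invoice": ["invoice", "factura", "rechnung", "vat", "total due"],
          "receipt": ["receipt", "cash", "change due"],
          "contract": ["agreement", "party", "hereinafter"],
          "email_request": ["could you", "please", "request"],
      }

      IBAN_RE = re.compile(r"\b[A-Z]{2}\d{2}[A-Z0-9]{10,30}\b")

      def classify(text: str, sender: str = "") -> str:
          t = text.lower()
          scores = {k: sum(kw in t for kw in kws) for k, kws in KEYWORDS.items()}
          if IBAN_RE.search(text):
              scores["invoice"] += 2       # an IBAN is a strong invoice signal
          if "billing" in sender or "invoice" in sender:
              scores["invoice"] += 1       # sender heuristic
          best = max(scores, key=scores.get)
          if scores[best] >= 2:            # confident enough for heuristics alone
              return best
          return llm_classify(text)        # fall back to a model

      def llm_classify(text: str) -> str:
          # Placeholder: call your classifier/LLM and return one of the known types.
          raise NotImplementedError
      ```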

    4) Extract with “hybrid” methods (best results in practice)

      Don’t bet everything on one technique.

      For digital PDFs (text-based):

        Parse text + layout (tables, key-value zones)

        Use deterministic patterns for high-signal fields (VAT/IVA IDs, dates, invoice number formats, IBAN)

      For scanned PDFs/images:

        OCR

        Then the same as above, but with lower confidence

      LLM step (structured):

        Ask the model to output strict JSON that matches your schema

        Provide the model with:

          extracted text

          layout hints (tables, page headings)

          instructions like “return null if missing, don’t guess”

        Have the model also return citations/evidence (snippet + page, or bbox id) for each field.
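
      A sketch of that LLM step; `call_llm` stands in for whatever model client you use, and the prompt wording is only a starting point:

      ```python
      # Strict-JSON extraction with per-field evidence (sketch).
      import json

      PROMPT = """Extract an invoice record from the document below.
      Return ONLY JSON matching this schema. Use null for missing values; do not guess.
      For each field, include evidence as {{"snippet": "...", "page": N}}.

      Schema:
      {schema}

      Document text:
      {text}
      """

      def call_llm(prompt: str) -> str:
          # Placeholder for your model client (OpenAI, Anthropic, local, ...).
          raise NotImplementedError

      def extract_invoice(text: str, schema: str) -> dict:
          raw = call_llm(PROMPT.format(schema=schema, text=text))
          record = json.loads(raw)   # fail loudly if the model broke the JSON contract
          if "evidence" not in record:
              raise ValueError("model returned no evidence map")
          return record
      ```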

    5) Validate and score confidence

      Run validators after extraction:

        Invoice number present?

        Totals match: sum(line_items) ≈ subtotal, subtotal + taxes ≈ total

        Dates are sensible (due date ≥ issue date)

        VAT/IVA format valid per country

        IBAN checksum valid

        Currency code matches the currency symbols found in the document

      Compute an overall confidence score and decide automation level:

        High confidence → auto-ingest

        Medium → “review required”

        Low → “manual entry”
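
      A sketch of the validators and the routing rule (thresholds and field names are illustrative; dates are assumed to be ISO 8601 strings):

      ```python
      # Validation + confidence routing (sketch).
      def iban_is_valid(iban: str) -> bool:
          """ISO 13616 mod-97 check: move the first 4 chars to the end,
          map letters to numbers (A=10 ... Z=35), and test mod 97 == 1."""
          s = iban.replace(" ", "").upper()
          if not (15 <= len(s) <= 34) or not s.isalnum() or not s[:2].isalpha():
              return False
          digits = "".join(str(int(c, 36)) for c in s[4:] + s[:4])
          return int(digits) % 97 == 1

      def totals_match(items: list[dict], subtotal: float, taxes: float,
                       total: float, tol: float = 0.01) -> bool:
          calc = sum(i["qty"] * i["unit_price"] for i in items)
          return abs(calc - subtotal) <= tol and abs(subtotal + taxes - total) <= tol

      def route(rec: dict) -> str:
          checks = [
              rec.get("invoice_number") is not None,
              totals_match(rec["line_items"], rec["subtotal"], rec["taxes"], rec["total"]),
              (rec.get("due_date") or rec["issue_date"]) >= rec["issue_date"],  # ISO strings sort correctly
              iban_is_valid(rec.get("iban") or ""),
          ]
          score = sum(checks) / len(checks)
          if score >= 0.9:                 # illustrative thresholds
              return "auto-ingest"
          return "review required" if score >= 0.6 else "manual entry"
      ```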

    6) Human-in-the-loop review UI (where you win deals)

      For medium confidence cases:

        Show the document side-by-side with extracted fields

        Highlight evidence snippets

        One-click fixes plus a captured “why” (so the system can learn from them)

      Every correction becomes training data:

        vendor-specific templates

        recurring line-item patterns

        preferred mappings (e.g., account codes, cost centers)
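
      One way to capture corrections so they are usable as training data (fields are illustrative):

      ```python
      # Every reviewer correction becomes a labeled example (sketch).
      from dataclasses import dataclass

      @dataclass
      class Correction:
          doc_id: str
          field: str                     # e.g. "total", "vendor_vat_id"
          old_value: str | None          # what the extractor produced
          new_value: str                 # what the reviewer entered
          reason: str                    # the captured "why"
          vendor_id: str | None = None   # enables vendor-specific templates
      ```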

    7) Map to the customer’s systems

      Structured output typically needs to flow into:

        ERP/accounting (NetSuite, SAP, Odoo, QuickBooks, Xero)

        CRM/ticketing (HubSpot, Zendesk, Jira)

        Document repository / knowledge base

      Use a canonical internal model → export adapters:

        JSON (API)

        CSV (legacy)

        UBL / Factur-X / PEPPOL-like formats if needed
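
      The adapter idea in code (a sketch; real ERP exports will need per-target field mappings):

      ```python
      # One canonical record, thin export adapters per target (sketch).
      import csv
      import io
      import json

      def to_json(record: dict) -> str:
          return json.dumps(record, ensure_ascii=False, indent=2)

      def to_csv(records: list[dict], fields: list[str]) -> str:
          buf = io.StringIO()
          writer = csv.DictWriter(buf, fieldnames=fields, extrasaction="ignore")
          writer.writeheader()
          writer.writerows(records)
          return buf.getvalue()

      # Adding a new target (ERP API, UBL, ...) means adding one adapter here.
      # (Signatures vary per target; a real registry would normalize them.)
      ADAPTERS = {"json": to_json, "csv": to_csv}
      ```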

    8) Store as “structured + source + provenance”

      Keep:

        Original document (immutable)

        Extracted structured record (versioned)

        Evidence map (field → snippet/page/bbox)

        Processing log (model version, OCR version, rules triggered)

      This makes audits, dispute resolution, and debugging straightforward.
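
      Roughly what gets persisted per document (the shape and the version tags are illustrative):

      ```python
      # One persisted processing result per document version (sketch).
      invoice_record = {"invoice_number": "F-2024-001", "total": 1210.00}  # abbreviated

      stored = {
          "original_uri": "s3://bucket/inbox/abc123.pdf",  # immutable source (illustrative path)
          "record": invoice_record,                        # extracted structured record
          "record_version": 3,
          "evidence": {"total": {"snippet": "Total: 1.210,00 EUR", "page": 1}},
          "log": {
              "model": "extractor-v2",                     # illustrative version tags
              "ocr": "tesseract-5.3",
              "rules_triggered": ["iban_checksum", "totals_match"],
          },
      }
      ```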

    9) Practical deployment approach

      Phase 1 (2–4 weeks): one document type (invoices) + 2–3 pilot customers

        Build schema, ingestion, extraction, validation, review UI, export to one target system.

      Phase 2: vendor learning + email intake

        Recognize repeat vendors and handle their quirks

        Parse “invoice via email” workflows (threading, attachments, approvals)

      Phase 3: multi-doc workflows

        Purchase order ↔ invoice matching

        Payment status updates

        Exceptions and disputes

    10) What I’d implement first (minimal but powerful)

      Ingestion from email + uploads

      Invoice classifier

      Extraction pipeline:

        PDF text extraction → LLM JSON

        OCR fallback

      Validation engine (math + formats)

      Review UI with evidence highlights

      Export to customer’s accounting system

      Continuous learning from corrections
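
      Composed from the sketches above, the first pilot is roughly one orchestration function (`normalize`, `export`, and `INVOICE_SCHEMA` are hypothetical helpers):

      ```python
      # End-to-end orchestration of the minimal pipeline (sketch; composes the
      # earlier snippets, with `normalize` and `export` left as hypothetical helpers).
      def process(raw: bytes, metadata: dict) -> str:
          bundle = normalize(raw, metadata)                 # step 2: ingest + normalize
          if is_duplicate(bundle):
              return "duplicate"
          text = "\n".join(p.text for p in bundle.pages)
          if classify(text, metadata.get("sender", "")) != "invoice":
              return "routed elsewhere"
          record = extract_invoice(text, schema=INVOICE_SCHEMA)  # step 4: LLM JSON
          decision = route(record)                          # step 5: validate + score
          if decision == "auto-ingest":
              export(record)                                # step 7: push to ERP/CRM
          return decision
      ```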

      If you tell me:

        which formats (Spanish “Factura” PDFs? scans?),

        where the structured data needs to end up (ERP/CRM?), and

        typical volume (10/day vs 10k/day),
        I can sketch an architecture and a concrete “first pilot” plan with the exact fields and validators.

    How to archive emails

      Perfect — I’ll format this as a Seed-ready archive document, following the conventions you described:

        One Email Thread document

        One Email Message document per message

        Attachments linked

        Structured metadata

        Full provenance-ready structure

        Clean block structure for htmlToBlocks() or markdown import

      Below is a Seed archive representation in Markdown form, ready for seed document create.

    Invoice