Lesson 214 of 239 · Project: Document Intelligence Service

Extract Fields from Messy Text

Welcome to your flagship portfolio project. Over four lessons you will build a Document Intelligence Service: the kind of pipeline a fintech or ops team runs to turn a pile of messy invoices into clean, validated, reconciled rows in a database. By the end you can tell a hiring manager you "built a schema-validated extraction pipeline with reconciliation and evals", and actually mean it.

Step one is the dirty part nobody likes but everybody needs: pulling structured fields out of unstructured text. A vendor emails an invoice as a blob of text. Your job is to write parse_invoice(text), which returns a dict with vendor, date, total (a float), and line_items (a list of {"desc", "amount"} dicts).

The catch, and the reason this is real work: the documents are not uniform. One says Total:, the next says Amount Due, a third pads everything with extra spaces. Labels vary, casing varies, whitespace varies. A parser that only handles one exact layout is useless on real mail.

Vendor: Acme Corp
Date: 2026-01-15
Line: Widgets x10    120.00
Line: Shipping        8.50
Total: 128.50
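
For the sample above, the target dict would look roughly like this. It is worked out by hand from the field descriptions in this lesson rather than captured from a run, and the variable name expected is only illustrative.

```python
# Hand-derived target for the sample invoice above (illustrative name).
expected = {
    "vendor": "Acme Corp",
    "date": "2026-01-15",  # kept as a raw string, not parsed into a date
    "total": 128.5,
    "line_items": [
        {"desc": "Widgets x10", "amount": 120.0},
        {"desc": "Shipping", "amount": 8.5},
    ],
}
```

Note that the two line-item amounts add up to the total; here that is only a sanity check on the sample.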

The rules your parser must follow:

  • The vendor comes from a line labelled vendor (case-insensitive). Take everything after the colon and strip it.
  • The date comes from a line labelled date. Keep the raw string (a later lesson validates it).
  • The total comes from a line whose label is total OR amount due (both seen in the wild). Parse the number to a float, tolerating a leading $ and thousands commas like 1,250.00.
  • Each line item sits on a line labelled line or item: the text before the trailing number is the desc (stripped), and the trailing number is the amount (a float).
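
The number handling in these rules can be isolated in one small helper. This is a minimal sketch: the name to_amount is hypothetical, and letting float() raise ValueError on non-numeric text is an assumption, since the lesson does not say what should happen to an unparsable number.

```python
def to_amount(raw: str) -> float:
    """Convert text such as '$1,250.00' or ' 128.50 ' to a float.

    Assumption: non-numeric input raises ValueError (float's own behaviour).
    """
    cleaned = raw.strip().replace("$", "").replace(",", "")
    return float(cleaned)


print(to_amount("$1,250.00"))  # 1250.0
print(to_amount(" 128.50 "))   # 128.5
```

Stripping the $ and the commas before calling float() keeps the helper independent of which label the number came from, so the total and the line-item amounts can share it.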

A clean way in: split on newlines, and for each line split once on the first : to get label, rest. Lower-case the label, strip it, and branch. For the trailing money on a line item, a small regex like re.search(r"(-?[\d,]+\.?\d*)\s*$", rest) grabs the last number. Missing fields default sensibly: no vendor line -> "", no total -> 0.0, no items -> [].
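
One possible shape for that approach is sketched below; treat it as a starting point, not the reference solution. Several details are assumptions that the lesson does not specify: the helper _to_float and the constant TRAILING_NUMBER are hypothetical names, the regex gains an optional $ so a line item like "$120.00" does not leave a stray $ in the description, a missing date defaults to "", and a number that fails to parse is skipped rather than raising.

```python
import re

# Regex from the text above, with an optional "$" added (an assumption, so
# that "$120.00" does not leave a stray "$" at the end of the description).
TRAILING_NUMBER = re.compile(r"(-?\$?[\d,]+\.?\d*)\s*$")


def _to_float(raw):
    """Return raw as a float, ignoring "$" and commas; None if not numeric."""
    try:
        return float(raw.strip().replace("$", "").replace(",", ""))
    except ValueError:
        return None


def parse_invoice(text):
    # Defaults for missing fields; date -> "" is an assumption.
    result = {"vendor": "", "date": "", "total": 0.0, "line_items": []}
    for line in text.splitlines():
        label, sep, rest = line.partition(":")  # split once on the first colon
        if not sep:
            continue  # no colon, so no label to branch on
        label = " ".join(label.lower().split())  # lower-case, trim, collapse spaces
        rest = rest.strip()
        if label == "vendor":
            result["vendor"] = rest
        elif label == "date":
            result["date"] = rest  # raw string; validated in a later lesson
        elif label in ("total", "amount due"):
            amount = _to_float(rest)
            if amount is not None:
                result["total"] = amount
        elif label in ("line", "item"):
            match = TRAILING_NUMBER.search(rest)
            amount = _to_float(match.group(1)) if match else None
            if amount is not None:
                result["line_items"].append(
                    {"desc": rest[: match.start()].strip(), "amount": amount}
                )
    return result


sample = """Vendor: Acme Corp
Date: 2026-01-15
Line: Widgets x10    120.00
Line: Shipping        8.50
Total: 128.50"""

print(parse_invoice(sample))
```

Using str.partition instead of str.split means a line without a colon is easy to detect (the separator comes back empty), and any later colons stay inside rest untouched.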

Production teams increasingly throw an LLM at this step, but the LLM's output still has to land in exactly this shape or the rest of the pipeline breaks, which is why you build the deterministic parser first and trust it. Press Run to parse a sample invoice and see the structured dict come out.
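
To make "exactly this shape" concrete, here is a rough sketch of a structural check that could sit in front of the rest of the pipeline, whichever extractor produced the dict. The function name has_invoice_shape is hypothetical, and its strictness (exact key sets, float rather than int for amounts) is an assumption; it checks structure only and does not validate the values themselves.

```python
def has_invoice_shape(obj):
    """True if obj matches the dict shape described in this lesson.

    Assumptions: keys must match exactly, and amounts must be floats.
    """
    if not isinstance(obj, dict):
        return False
    if set(obj) != {"vendor", "date", "total", "line_items"}:
        return False
    if not isinstance(obj["vendor"], str) or not isinstance(obj["date"], str):
        return False
    if not isinstance(obj["total"], float):
        return False
    if not isinstance(obj["line_items"], list):
        return False
    # Every line item needs exactly a str "desc" and a float "amount".
    return all(
        isinstance(item, dict)
        and set(item) == {"desc", "amount"}
        and isinstance(item["desc"], str)
        and isinstance(item["amount"], float)
        for item in obj["line_items"]
    )


good = {"vendor": "Acme Corp", "date": "2026-01-15", "total": 128.5, "line_items": []}
bad = {"vendor": "Acme Corp", "date": "2026-01-15", "total": "128.50", "line_items": []}
print(has_invoice_shape(good))  # True
print(has_invoice_shape(bad))   # False (total is a string)
```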

Your turn

Write parse_invoice(text) that turns a messy invoice string into a dict with vendor (str), date (raw str), total (float), and line_items (a list of {"desc", "amount"} dicts). Be tolerant: labels are case-insensitive, total may instead read amount due, line items are labelled line or item, and numbers may carry a $ or thousands commas. Missing vendor -> "", missing total -> 0.0, no items -> [].
