Syllabus Lesson 217 of 239 · Project: Document Intelligence Service

Batch Eval Over Labeled Docs

You can parse, validate, and reconcile a single document. The question a hiring manager actually asks is: how good is it across thousands? You answer that the way every serious ML and data team does: with an eval. You take a batch of documents you have hand-labelled (the gold set), run your extractor over them, and measure how often each field came out right. No eval, no credibility.

Write evaluate(predictions, gold). Both are lists of dicts of the same length, aligned by position: predictions[i] is your extractor's output for the document whose true answer is gold[i]. Compare them field by field and return a summary:

{
  "accuracy": 0.83,          # overall fraction of fields that matched
  "per_field": {            # correct count AND total per field
    "vendor": {"correct": 5, "total": 6},
    "total":  {"correct": 4, "total": 6},
    ...
  },
  "n": 6                     # number of documents compared
}
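To read the per_field block, it can help to turn each correct/total pair into a rate. The sketch below assumes a summary shaped like the one above and uses only the two rows shown there; per_field_rates is a hypothetical helper name, not part of the exercise.

```python
def per_field_rates(summary):
    """Turn each {"correct", "total"} pair into a fraction (hypothetical helper)."""
    rates = {}
    for field, counts in summary["per_field"].items():
        total = counts["total"]
        # Guard against a zero total so this never divides by zero.
        rates[field] = round(counts["correct"] / total, 4) if total else 0.0
    return rates


# Only the two per-field rows shown in the lesson's example summary.
summary = {
    "per_field": {
        "vendor": {"correct": 5, "total": 6},
        "total": {"correct": 4, "total": 6},
    }
}

rates = per_field_rates(summary)
print(rates)
```

A per-field view like this shows which field is dragging the overall number down, which the single accuracy figure cannot.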

The rules:

  • The set of fields to score is the union of every key seen in the gold records. (A field missing from a prediction counts as wrong for that field.)
  • A field is correct when the predicted value equals the gold value. Use .get(field) on the prediction so a missing key compares as None rather than crashing.
  • accuracy is total correct fields divided by total compared fields (documents times fields), rounded to 4 places.
  • Empty-set safe: if gold is empty, return {"accuracy": 0.0, "per_field": {}, "n": 0}. Never divide by zero.
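The rules above can be sketched with a couple of plain dict loops. Treat this as one possible shape rather than the reference solution. It makes two assumptions the lesson does not spell out: a gold record that lacks a field from the union is read with .get (so it compares as None), and a batch whose gold records have no keys at all gets an accuracy of 0.0. The sample records are made up for illustration.

```python
def evaluate(predictions, gold):
    """Score position-aligned predictions against gold records, field by field."""
    if not gold:
        return {"accuracy": 0.0, "per_field": {}, "n": 0}

    # Union of every key seen in the gold records (sorted only for stable output).
    fields = sorted({key for record in gold for key in record})
    per_field = {field: {"correct": 0, "total": 0} for field in fields}

    for pred, truth in zip(predictions, gold):
        for field in fields:
            per_field[field]["total"] += 1
            # Assumption: a gold record missing this field also compares as None.
            if pred.get(field) == truth.get(field):
                per_field[field]["correct"] += 1

    correct = sum(counts["correct"] for counts in per_field.values())
    compared = sum(counts["total"] for counts in per_field.values())
    # Assumption: gold records with no keys at all yield 0.0, not an error.
    accuracy = round(correct / compared, 4) if compared else 0.0
    return {"accuracy": accuracy, "per_field": per_field, "n": len(gold)}


# Made-up batch: one missing field and one wrong vendor.
predictions = [
    {"vendor": "Acme", "total": 120.0},
    {"vendor": "Globex"},
    {"vendor": "Initech", "total": 75.5},
]
gold = [
    {"vendor": "Acme", "total": 120.0},
    {"vendor": "Globex", "total": 40.0},
    {"vendor": "Initrode", "total": 75.5},
]

result = evaluate(predictions, gold)
print(result)
```

Here four of the six compared fields match, so the accuracy is 0.6667 with n equal to 3.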

pandas makes the bookkeeping tidy if you want it (build a DataFrame of per-field hits and sum the columns), but a couple of plain dict loops are just as good. There is nothing to seed here: the batch is fixed, so your accuracy is a number the grader can check exactly.

This closes the loop on your Document Intelligence Service: extract -> repair -> validate -> reconcile -> measure. "I built a schema-validated extraction pipeline with reconciliation and a field-level eval harness" is a sentence that gets you a second interview. Press Run to score a small labelled batch.

Your turn

Write evaluate(predictions, gold) over two position-aligned lists of record dicts. Score every field in the union of the gold records' keys; a field is correct when prediction.get(field) == gold[field] (a missing key compares as None). Return {"accuracy", "per_field", "n"} where accuracy is total correct over total compared fields (rounded to 4 dp), per_field[field] is {"correct", "total"}, and n is the document count. If gold is empty, return {"accuracy": 0.0, "per_field": {}, "n": 0} (no division by zero).
