Syllabus Lesson 215 of 239 · Project: Document Intelligence Service
Project: Document Intelligence Service

Validate Against a Schema + Repair JSON

Your parser produces a dict, but on a real pipeline the data often arrives as JSON text -> from an upstream service, a saved file, or an LLM that was told to "return JSON." Two things always go wrong, and this lesson handles both: the JSON is malformed, and the JSON is well-formed but wrong.

Part one: repair the JSON. Models love to wrap JSON in ```json code fences, leave a trailing comma, or use single quotes. Raw json.loads throws on all of these. Write repair_json(text) that cleans the text up, then parses it -> returning the Python object. The repair steps, in order:

  • Strip Markdown code fences: remove a leading ```json or ``` and a trailing ```.
  • Remove trailing commas before a } or ] (a regex like re.sub(r",\s*([}\]])", r"\1", text)).
  • Turn single quotes into double quotes so it is legal JSON.
  • Then json.loads the cleaned string and return the result.

The whole point is that repair_json must not raise on the messy inputs the tests feed it -> it repairs and parses instead of crashing.

Part two: validate against a schema. Now that you have an object, does it have the right shape? Write validate_record(obj, schema) -> a dict {"ok": bool, "errors": [..]}. The schema maps each required field name to a Python type, like:

schema = {"vendor": str, "total": float, "line_items": list}

For every field in the schema: if it is missing from obj, add an error string "missing: <field>"; if it is present but the wrong type, add "wrong type: <field>". ok is True only when errors is empty.

One sharp edge that bites everyone: in Python bool is a subclass of int, so isinstance(True, int) is True. A schema asking for an int must reject a JSON true. Guard it: when the expected type is int, also require not isinstance(value, bool). Float fields are commonly fed whole numbers, so accept an int where a float is expected (but still reject bool).

Together these two functions are the gatekeeper of the pipeline: nothing malformed and nothing mis-shaped gets through. Press Run to repair a fenced, trailing-comma blob and validate the result.

Your turn

Write two functions. repair_json(text) strips Markdown code fences, removes trailing commas before }/], converts single quotes to double quotes, then json.loads the result and returns the object (it must NOT raise on those messy inputs). validate_record(obj, schema) returns {"ok": bool, "errors": [..]}: a "missing: <field>" error per absent schema field and a "wrong type: <field>" error per type mismatch, with ok true only when there are no errors. Reject a JSON true where an int is required (bool is not an int here), and accept an int where a float is expected.

Spotted a problem in this lesson? Report it

Code · runs in your browser
Output