Validate Against a Schema + Repair JSON
Your parser produces a dict, but in a real pipeline the data often arrives as JSON text: from an upstream service, a saved file, or an LLM that was told to "return JSON." Two things commonly go wrong, and this lesson handles both: the JSON is malformed, or the JSON is well-formed but has the wrong shape.
Part one: repair the JSON. Models love to wrap JSON in ```json code fences, leave a trailing comma, or use single quotes. Raw json.loads raises on all of these. Write repair_json(text) that cleans the text up, then parses it and returns the Python object. The repair steps, in order:
- Strip Markdown code fences: remove a leading ```json or ``` and a trailing ```.
- Remove trailing commas before a } or ] (a regex like re.sub(r",\s*([}\]])", r"\1", text)).
- Turn single quotes into double quotes so it is legal JSON.
- Then json.loads the cleaned string and return the result.
The whole point is that repair_json must not raise on the messy inputs the tests feed it: it repairs and parses instead of crashing.
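The repair steps above can be sketched as follows. This is a minimal illustration rather than a reference solution: the fence-stripping patterns and the sample string `messy` are assumptions made for the example, and the quote replacement is deliberately naive (it would also rewrite an apostrophe inside a string value, and the trailing-comma regex does not know whether a comma sits inside a string).

```python
import json
import re


def repair_json(text):
    """Clean up common LLM-style JSON damage, then parse it."""
    cleaned = text.strip()

    # 1. Strip a leading ```json or ``` fence and a trailing ``` fence.
    cleaned = re.sub(r"^```(?:json)?\s*", "", cleaned)
    cleaned = re.sub(r"\s*```$", "", cleaned)

    # 2. Remove trailing commas that sit directly before } or ].
    cleaned = re.sub(r",\s*([}\]])", r"\1", cleaned)

    # 3. Naively turn single quotes into double quotes.
    cleaned = cleaned.replace("'", '"')

    # 4. Parse the cleaned string and return the Python object.
    return json.loads(cleaned)


# Hypothetical messy input: fenced, single-quoted, with trailing commas.
messy = "```json\n{'vendor': 'Acme', 'total': 12.5, 'line_items': [1, 2,],}\n```"
print(repair_json(messy))
# → {'vendor': 'Acme', 'total': 12.5, 'line_items': [1, 2]}
```

Input that is already clean passes through unchanged, since none of the three substitutions match it.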
Part two: validate against a schema. Now that you have an object, does it have the right shape? Write validate_record(obj, schema), which returns a dict {"ok": bool, "errors": [..]}. The schema maps each required field name to a Python type, like:
schema = {"vendor": str, "total": float, "line_items": list}
For every field in the schema: if it is missing from obj, add an error string "missing: <field>"; if it is present but the wrong type, add "wrong type: <field>". ok is True only when errors is empty.
One sharp edge that bites everyone: in Python bool is a subclass of int, so isinstance(True, int) is True. A schema asking for an int must reject a JSON true. Guard it: when the expected type is int, also require not isinstance(value, bool). Float fields are commonly fed whole numbers, so accept an int where a float is expected (but still reject bool).
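One way to put the validation rules and the bool guard together is sketched below. The helper name `_matches` and the sample records are hypothetical, introduced only for illustration, and the sketch assumes obj is a dict.

```python
def _matches(value, expected):
    """Return True if value is acceptable for the expected schema type."""
    # bool is a subclass of int, so reject it explicitly for numeric fields.
    if expected in (int, float) and isinstance(value, bool):
        return False
    # Float fields also accept whole numbers.
    if expected is float:
        return isinstance(value, (int, float))
    return isinstance(value, expected)


def validate_record(obj, schema):
    """Check obj against schema and collect one error string per problem."""
    errors = []
    for field, expected in schema.items():
        if field not in obj:
            errors.append(f"missing: {field}")
        elif not _matches(obj[field], expected):
            errors.append(f"wrong type: {field}")
    return {"ok": not errors, "errors": errors}


schema = {"vendor": str, "total": float, "line_items": list}

# An int is accepted where a float is expected.
print(validate_record({"vendor": "Acme", "total": 12, "line_items": []}, schema))
# → {'ok': True, 'errors': []}

# A bool is rejected for a numeric field, and the absent field is reported.
print(validate_record({"vendor": "Acme", "total": True}, schema))
# → {'ok': False, 'errors': ['wrong type: total', 'missing: line_items']}
```

Errors come out in schema order because the loop iterates over schema.items(), which keeps the output predictable for tests.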
Together these two functions are the gatekeeper of the pipeline: nothing malformed and nothing mis-shaped gets through. Press Run to repair a fenced, trailing-comma blob and validate the result.
Write two functions. repair_json(text) strips Markdown code fences, removes trailing commas before }/], converts single quotes to double quotes, then json.loads the result and returns the object (it must NOT raise on those messy inputs). validate_record(obj, schema) returns {"ok": bool, "errors": [..]}: a "missing: <field>" error per absent schema field and a "wrong type: <field>" error per type mismatch, with ok true only when there are no errors. Reject a JSON true where an int is required (bool is not an int here), and accept an int where a float is expected.