Syllabus Lesson 151 of 239 · Build a RAG Pipeline
Build a RAG Pipeline

Chunking Long Documents

Retrieval-augmented generation (RAG) is how you point a general model at your data: a policy handbook, a product manual, last quarter's support tickets. The model never memorised any of it, so at question time you retrieve the relevant passages and paste them into the prompt. The whole pipeline is just five honest pieces: chunk the documents, embed each chunk, store the vectors, retrieve the top matches for a query, and build a grounded prompt. You will build all five across this module. The model that writes the final sentence can be the on-device WebLLM in your browser, but every step around it is plain, testable Python, which is exactly the part employers screen for.

Step one is chunking. A 40-page document is too big to embed as one vector and too big to stuff into a prompt, so you cut it into overlapping windows. Why overlapping? Because a fact you need might straddle a hard boundary. If you cut cleanly at character 500 and the sentence "refunds take five business days" spans 495 to 515, neither chunk contains the whole answer. A few characters of overlap let the same sentence live intact in both neighbours.

The shape of it: you slide a window of size characters forward by a step of size - overlap each time.

def chunk(text, size, overlap):
    step = size - overlap
    chunks = []
    i = 0
    while i < len(text):
        chunks.append(text[i:i + size])
        if i + size >= len(text):
            break          # this window already reached the end
        i += step
    return chunks

That break matters: without it you keep stepping past the end and emit a tiny scrap like "g" as a final "chunk". Stopping as soon as a window reaches the end keeps the last chunk a real one.

What to build. Write chunk(text, size, overlap) returning a list of overlapping windows. Rules the tests check:

  • Validate inputs: raise ValueError if size <= 0, or if overlap is negative or >= size (an overlap that big never advances).
  • Empty text returns an empty list [].
  • The last overlap characters of one chunk equal the first overlap characters of the next.
  • Full coverage: stitching chunks[0] plus chunks[i][overlap:] for every later chunk reproduces the original text exactly, with no characters lost or doubled.

Off-by-one errors live here, so press Run and watch a paragraph split into windows you can eyeball.

Your turn

Write chunk(text, size, overlap) that slides a window of size characters forward by size - overlap each step and returns the list of windows. Raise ValueError when size <= 0 or when overlap is negative or >= size. Return [] for empty text. Adjacent chunks must share exactly overlap characters, and stitching the first chunk with chunk[overlap:] of every later chunk must rebuild the original text.

Spotted a problem in this lesson? Report it

Code · runs in your browser
Output