The ML Around the LLM

Intent Classification (TF-IDF + LogisticRegression)

Not every "AI feature" needs a large language model. Many production LLM apps put a small, boring classifier in front of the model: it reads the incoming message, decides what kind of request it is (billing? a bug report? an account change?), and routes accordingly. That decision is fast, costs nothing per call, runs offline, and is deterministic: the same input always gives the same answer. You spend an expensive LLM call only when you actually need one.

The workhorse for short text is TfidfVectorizer + LogisticRegression. TF-IDF turns each message into a sparse vector of word counts, reweighted so that words appearing in many messages count for less; logistic regression learns a linear boundary between the intent classes. On a few dozen labelled examples it typically trains in milliseconds, and a single prediction is far cheaper than an LLM call.

You will wire the two into one object with a Pipeline so the vectorizer and the classifier are fit and applied together as a unit:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(random_state=0, max_iter=1000)),
])
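# texts: a list of message strings; labels: the matching list of intent names.
pipe.fit(texts, labels)
pipe.predict(["can I get a refund"])   # -> e.g. array(['billing'], ...), given suitable training data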

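Why random_state=0? Some of LogisticRegression's solvers (sag, saga, and liblinear) shuffle the data pseudo-randomly; the default lbfgs solver does not use the seed at all. Pinning it anyway costs nothing and keeps training reproducible if the solver ever changes, which is the whole point of the deterministic layer: your routing must not flip between runs. The course follows this rule everywhere.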
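One way to convince yourself that training is reproducible is to fit the same pipeline twice and compare the results. The sketch below does that on a tiny ticket set invented for illustration; build_and_fit and the held_out messages are hypothetical names, not part of the lesson's grader.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Hypothetical training data, invented for this illustration.
texts = [
    "I was charged twice this month",
    "please refund my payment",
    "the app crashes when I log in",
    "error message on the upload page",
    "change the email on my account",
    "update my account password",
]
labels = ["billing", "billing", "bug", "bug", "account", "account"]


def build_and_fit():
    """Build a fresh pipeline and fit it on the same data."""
    pipe = Pipeline([
        ("tfidf", TfidfVectorizer()),
        ("clf", LogisticRegression(random_state=0, max_iter=1000)),
    ])
    return pipe.fit(texts, labels)


first = build_and_fit()
second = build_and_fit()

# The learned weights should match between the two runs.
same_weights = np.allclose(first["clf"].coef_, second["clf"].coef_)

# So should the routing decisions on unseen messages.
held_out = ["refund for a double charge", "the upload page crashes"]
same_routes = list(first.predict(held_out)) == list(second.predict(held_out))

print(same_weights, same_routes)
```

If either comparison came back False, the routing layer could behave differently from one deployment to the next, which is exactly what the pinned seed is meant to rule out.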

This is exactly the contrast with the LLM. A model call is non-deterministic, costs money, and adds latency. Here is the kind of fallback you might wire downstream once you know the intent (shown for context, never graded):

# Pseudocode: reach for the model only when the cheap classifier predicts the "general" intent.
intent = pipe.predict([msg])[0]
if intent == "general":
    reply = await window._floatiTutor.complete(msg)   # the on-device LLM
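One way to decide that the classifier is unsure is to look at its predicted probabilities rather than at the label alone. The sketch below is an assumption about how such a gate could look, not the course's implementation: route, CONFIDENCE_THRESHOLD, the "needs_llm" marker, and the training data are all hypothetical, and the 0.6 cut-off is an arbitrary value you would tune on real traffic.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Hypothetical training data, invented for this illustration.
texts = [
    "I was charged twice this month",
    "please refund my payment",
    "the app crashes when I log in",
    "error message on the upload page",
    "change the email on my account",
    "update my account password",
]
labels = ["billing", "billing", "bug", "bug", "account", "account"]

pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(random_state=0, max_iter=1000)),
]).fit(texts, labels)

CONFIDENCE_THRESHOLD = 0.6  # hypothetical cut-off, not a recommended value


def route(message, model=pipe, threshold=CONFIDENCE_THRESHOLD):
    """Return (intent, confidence); intent is "needs_llm" when the model is unsure."""
    probabilities = model.predict_proba([message])[0]  # one probability per class
    best = probabilities.argmax()
    confidence = float(probabilities[best])
    if confidence < threshold:
        return "needs_llm", confidence  # hand the message to the expensive model
    return str(model.classes_[best]), confidence


intent, confidence = route("please refund my payment")
print(intent, round(confidence, 2))
```

With so little training data the probabilities stay close to uniform, so many messages may fall below the threshold; a larger labelled set pushes confident predictions higher and sends fewer messages to the LLM.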

Build two functions. train_classifier(texts, labels) returns a fitted Pipeline (TF-IDF + LogisticRegression with random_state=0). predict(model, new_texts) takes that model and a list of strings and returns a list of predicted labels. Press Run to train on a small support-ticket set and watch it route held-out messages to the right team.

Your turn

Write train_classifier(texts, labels) returning a fitted sklearn Pipeline of TfidfVectorizer() then LogisticRegression(random_state=0, ...), and predict(model, new_texts) returning a list of predicted labels for a list of input strings. The seed makes it reproducible; the classifier must send distinct held-out messages to their correct intents.
