A Toxicity Moderator With an Abstain Band
Keyword screens catch the obvious attacks, but they cannot read tone. For "is this message toxic?" you want a trained classifier that learned the patterns from labelled examples. In spirit, this is what a content-moderation endpoint does under the hood, and you can build a respectable one in a dozen lines with scikit-learn. No GPU, no cloud, no LLM.
The recipe is the workhorse text-classification pipeline:
- TfidfVectorizer turns each message into a vector of word importances. Common words across all messages get down-weighted; words that are distinctive to a message get up-weighted. That is the "TF-IDF" idea.
- LogisticRegression learns a weight per word and outputs a probability that the message is toxic. We pin random_state=0 so the fit is reproducible.
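To make the TF-IDF bullet concrete, here is a rough pure-Python sketch of the weighting. The function name tfidf_vectors and the whitespace tokenizer are illustrative assumptions, not part of the lesson. The smoothed IDF and the length normalization follow what scikit-learn documents as its defaults, but TfidfVectorizer tokenizes differently, so treat the numbers as indicative only.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {word: weight} dict per document (simplified TF-IDF)."""
    tokenized = [doc.lower().split() for doc in docs]  # naive tokenizer (assumption)
    n_docs = len(tokenized)

    # Document frequency: in how many messages does each word appear?
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)  # raw count of each word in this message
        weights = {
            word: count * (math.log((1 + n_docs) / (1 + df[word])) + 1)
            for word, count in tf.items()
        }
        # Scale each vector to unit length so long messages do not dominate.
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({word: w / norm for word, w in weights.items()})
    return vectors


docs = ["you are great", "you are awful"]
vecs = tfidf_vectors(docs)
```

In this toy pair, "you" and "are" appear in both messages and so carry less weight than "great" or "awful", which each appear in only one. Those distinctive words are the signal the classifier can learn from.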
The twist that makes it production-shaped is an abstain band. A raw classifier always commits to a label, even when it is 51% sure. In moderation that is reckless: a borderline call should go to a human, not be auto-decided. So instead of taking the model's hard predict, you read its probability and apply your own two-sided rule:
proba = model.predict_proba(X)[:, toxic_col]  # P(toxic) per message
labels = []
for p in proba:  # apply the band to each message in turn
    if p >= high: labels.append("toxic")
    elif p <= low: labels.append("clean")
    else: labels.append("abstain")  # send to a human

You will train on a small labelled set, then classify a held-out batch into "toxic", "clean", or "abstain" using a low edge and a high edge. The clearly-rude messages should come back toxic, the clearly-friendly ones clean, and anything the model is unsure about should land in abstain rather than guess. A small training set keeps probabilities huddled near the middle, so a tight band like (0.45, 0.55) is what actually separates the classes here.
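The choice of band width is a trade-off, and a small experiment may help build intuition for it. The sketch below counts how many messages land in each bucket for a given (low, high) pair. The helper name band_counts and the scores are made up for illustration; they are not outputs of the lesson's model.

```python
def band_counts(probas, low, high):
    """Count how many scores fall into each bucket for a (low, high) band."""
    if not 0.0 <= low <= high <= 1.0:
        raise ValueError("expected 0 <= low <= high <= 1")
    counts = {"toxic": 0, "clean": 0, "abstain": 0}
    for p in probas:
        if p >= high:
            counts["toxic"] += 1
        elif p <= low:
            counts["clean"] += 1
        else:
            counts["abstain"] += 1  # would be routed to a human
    return counts


# Hypothetical P(toxic) scores huddled near 0.5, as a small training set tends to give.
scores = [0.62, 0.58, 0.51, 0.49, 0.43, 0.40]

tight = band_counts(scores, 0.45, 0.55)  # band sized for huddled scores
wide = band_counts(scores, 0.20, 0.80)   # band sized for a confident model
```

On these made-up scores the tight band auto-decides four of the six messages, while the wide band sends all six to a human. A wider band is safer but costs more reviewer time, so the edges are worth tuning against real scores rather than fixing in advance.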
Why a classifier and not the LLM itself
Hosted moderation models exist (OpenAI's moderation endpoint, Perspective API, and so on), and in production you would often call one of those. We build a local one here for three reasons: it runs offline in this sandbox, it costs nothing per call, and it makes the moving parts visible: features, a learned weight per word, a probability, and a human-set threshold. Those concepts transfer directly to whatever moderation service you later plug in.
Build train_moderator(texts, labels) that fits a Pipeline(TfidfVectorizer, LogisticRegression(random_state=0)) and returns it. Then write classify(model, texts, low, high) that reads predict_proba for the toxic class and returns a list of "toxic" (proba >= high), "clean" (proba <= low), or "abstain" (in between). Finally set preds = classify(model, HELD_OUT, 0.45, 0.55).
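One possible shape for these two functions is sketched below, assuming scikit-learn is installed. The training texts, the 1 = toxic / 0 = clean label encoding, and the toxic_label parameter are assumptions made for illustration; the lesson's actual data set is not shown here, so substitute your own.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline


def train_moderator(texts, labels):
    """Fit TF-IDF features followed by logistic regression; return the pipeline."""
    model = Pipeline([
        ("tfidf", TfidfVectorizer()),
        ("clf", LogisticRegression(random_state=0)),
    ])
    model.fit(texts, labels)
    return model


def classify(model, texts, low, high, toxic_label=1):
    """Return "toxic", "clean", or "abstain" for each text."""
    # Find which predict_proba column belongs to the toxic class.
    toxic_col = list(model.named_steps["clf"].classes_).index(toxic_label)
    proba = model.predict_proba(texts)[:, toxic_col]
    results = []
    for p in proba:
        if p >= high:
            results.append("toxic")
        elif p <= low:
            results.append("clean")
        else:
            results.append("abstain")  # borderline: leave it to a human
    return results


# Hypothetical toy data (1 = toxic, 0 = clean).
TRAIN_TEXTS = [
    "you are an idiot", "shut up you loser", "nobody likes you, get lost",
    "thanks for your help", "have a lovely day", "great work on the report",
]
TRAIN_LABELS = [1, 1, 1, 0, 0, 0]
HELD_OUT = ["you loser", "lovely work, thanks", "see you at the meeting"]

model = train_moderator(TRAIN_TEXTS, TRAIN_LABELS)
preds = classify(model, HELD_OUT, 0.45, 0.55)
```

Looking the toxic column up through classes_ avoids hard-coding an index, which would break silently if the labels were encoded differently.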