Responsible AI: Safety, Moderation & Red-Teaming

A Toxicity Moderator With an Abstain Band

Keyword screens catch the obvious attacks, but they cannot read tone. For "is this message toxic?" you want a trained classifier that has learned the patterns from labelled examples. That is the same basic idea behind a content-moderation endpoint (hosted ones typically use larger models), and you can build a respectable version in about a dozen lines with scikit-learn. No GPU, no cloud, no LLM.

The recipe is the workhorse text-classification pipeline:

TfidfVectorizer turns each message into a vector of word importances. Common words across all messages get down-weighted; words that are distinctive to a message get up-weighted. That is the "TF-IDF" idea.
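LogisticRegression learns a weight per word and outputs a probability that the message is toxic. We pin random_state=0 so results stay reproducible. With the default lbfgs solver the fit is deterministic anyway; the seed only matters if you switch to the liblinear, sag, or saga solvers.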
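
As a minimal sketch of those two steps wired together, the snippet below fits the pipeline on six toy messages. The messages and labels are invented for illustration and are not the lesson's dataset; make_pipeline is used here as a convenience, and an explicit Pipeline would work the same way.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy data, invented for illustration (1 = toxic, 0 = clean).
texts = [
    "you are an idiot",
    "shut up you fool",
    "nobody likes you, loser",
    "thanks for the help",
    "have a great day",
    "really nice work, friend",
]
labels = [1, 1, 1, 0, 0, 0]

# Step 1: TF-IDF features. Step 2: logistic regression on those features.
model = make_pipeline(TfidfVectorizer(), LogisticRegression(random_state=0))
model.fit(texts, labels)

# One row per message, one column per class, in the order of model.classes_.
proba = model.predict_proba(["you fool", "great work"])
print(model.classes_)
print(proba.round(2))
```

Each row of proba sums to 1, and the pipeline accepts raw strings because the vectorizer runs before the classifier on every call.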

The twist that makes it production-shaped is an abstain band. A raw classifier always commits to a label, even when it is 51% sure. In moderation that is reckless: a borderline call should go to a human, not be auto-decided. So instead of taking the model's hard predict, you read its probability and apply your own two-sided rule:

probas = model.predict_proba(X)[:, toxic_col]   # P(toxic) per message
labels = []
for p in probas:                                # one decision per message
    if p >= high:    labels.append("toxic")
    elif p <= low:   labels.append("clean")
    else:            labels.append("abstain")   # send to a human
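
The snippet above relies on toxic_col, the index of the toxic class in the predict_proba output. One way to find it, sketched here under the assumption that the labels are the strings "toxic" and "clean", is to look the class up in model.classes_ rather than hard-coding a column number. The training messages are invented for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy data with string labels, invented for illustration.
texts = ["you are an idiot", "shut up you fool",
         "thanks for the help", "have a great day"]
labels = ["toxic", "toxic", "clean", "clean"]

model = make_pipeline(TfidfVectorizer(), LogisticRegression(random_state=0))
model.fit(texts, labels)

# predict_proba columns follow model.classes_, which scikit-learn keeps sorted.
classes = list(model.classes_)
toxic_col = classes.index("toxic")

# Slice out P(toxic) for each message using the looked-up column.
p_toxic = model.predict_proba(["you fool"])[:, toxic_col]
print(classes, toxic_col)
```

Looking the column up keeps the code correct if the label encoding changes, for example from strings to 0/1 integers.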

You will train on a small labelled set, then classify a held-out batch into "toxic", "clean", or "abstain" using a low edge and a high edge. The clearly-rude messages should come back toxic, the clearly-friendly ones clean, and anything the model is unsure about should land in abstain rather than guess. A small training set keeps probabilities huddled near the middle, so a tight band like (0.45, 0.55) is what actually separates the classes here.
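
To see why the width of the band matters, here is a small sketch that applies the two-sided rule to a hand-written list of probabilities. The numbers are invented to mimic scores huddled near 0.5 and are not real model output; band_labels is a hypothetical helper name.

```python
def band_labels(probas, low, high):
    """Map each P(toxic) to 'toxic', 'clean', or 'abstain'."""
    if low > high:
        raise ValueError("low must not exceed high")
    out = []
    for p in probas:
        if p >= high:
            out.append("toxic")
        elif p <= low:
            out.append("clean")
        else:
            out.append("abstain")
    return out

# Invented probabilities clustered around the middle of the range.
probas = [0.41, 0.47, 0.50, 0.53, 0.58]

tight = band_labels(probas, 0.45, 0.55)
wide = band_labels(probas, 0.20, 0.80)
print(tight)  # ['clean', 'abstain', 'abstain', 'abstain', 'toxic']
print(wide)   # all five land in 'abstain'
```

With scores this compressed, the wide band sends everything to a human, while the tight band still auto-decides the two messages furthest from 0.5.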

Why a classifier and not the LLM itself

Hosted moderation models exist (OpenAI's moderation endpoint, Perspective API, and so on), and in production you would often call one of those. We build a local one here for three reasons: it runs offline in this sandbox, it costs nothing per call, and it makes the moving parts visible: features, a learned weight per word, a probability, and a human-set threshold. Those concepts transfer directly to whatever moderation service you later plug in.

Your turn

Build train_moderator(texts, labels) that fits a Pipeline(TfidfVectorizer, LogisticRegression(random_state=0)) and returns it. Then write classify(model, texts, low, high) that reads predict_proba for the toxic class and returns a list of "toxic" (proba >= high), "clean" (proba <= low), or "abstain" (in between). Finally set preds = classify(model, HELD_OUT, 0.45, 0.55).
