Your First Machine Learning Models

Regression on Noisy Data

In an earlier lesson you fit a line to data that was a perfect line (y = 2x) and the error came out at zero. That felt great, but it was a lie of sorts: real data is never a clean line. Sensors jitter, people behave unpredictably, measurements round. The honest version of regression is fitting a line through a cloud of points that only roughly follows a trend, and then reporting how far off you still are. This lesson is about reading those error numbers without flinching.

Making realistic data

We start from a true relationship, y = 3x + 7, and then add Gaussian noise (small random wobbles drawn from a bell curve) so no point sits exactly on the line:

import numpy as np

rng = np.random.RandomState(42)            # fixed seed for reproducible noise
x = np.linspace(0, 10, 200)
noise = rng.normal(0, 2.0, size=x.shape)   # mean 0, standard deviation 2.0
y = 3 * x + 7 + noise
X = x.reshape(-1, 1)                        # sklearn wants 2-D features

The fixed seed means everyone draws the same noise, so the numbers below are reproducible and checkable. The model never sees the true slope of 3 or intercept of 7; its job is to recover them from the noisy points.
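
To make the point about the seed concrete, here is a small sketch that checks two things: that the same seed reproduces the same noise, and that the drawn sample only approximately matches the mean of 0 and standard deviation of 2.0 that were requested. The seed 7 and the names noise_a, noise_b, and noise_c are picked purely for this illustration.

```python
import numpy as np

# Two generators built from the same seed produce identical draws.
rng_a = np.random.RandomState(42)
rng_b = np.random.RandomState(42)
noise_a = rng_a.normal(0, 2.0, size=200)
noise_b = rng_b.normal(0, 2.0, size=200)
same_draws = np.array_equal(noise_a, noise_b)

# A different seed (7 is arbitrary) gives a different set of wobbles.
noise_c = np.random.RandomState(7).normal(0, 2.0, size=200)
differs = not np.array_equal(noise_a, noise_c)

print("same seed, same noise:", same_draws)
print("different seed differs:", differs)
print("sample mean:", noise_a.mean())        # close to 0, not exactly 0
print("sample std: ", noise_a.std(ddof=1))   # close to 2.0, not exactly 2.0
```

The sample mean and standard deviation land near the requested values rather than on them, which is one reason the fitted line later lands near 3 and 7 rather than exactly on them.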

Fit, predict, and measure honestly

Same ritual as always, but we grade on a held-out test set:

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
model = LinearRegression().fit(X_train, y_train)
preds = model.predict(X_test)
mae = mean_absolute_error(y_test, preds)
r2  = r2_score(y_test, preds)
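
As a quick sanity check on the split and the fit, the sketch below rebuilds the same data so it can run on its own, then prints how many rows went to each side and what line the model recovered. It is an illustration only and leaves the error metrics for the sections that follow.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# Rebuild the lesson's data so this block is self-contained.
rng = np.random.RandomState(42)
x = np.linspace(0, 10, 200)
y = 3 * x + 7 + rng.normal(0, 2.0, size=x.shape)
X = x.reshape(-1, 1)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
model = LinearRegression().fit(X_train, y_train)   # the model only sees training rows

print("train rows:", X_train.shape[0])   # 80% of the 200 points
print("test rows: ", X_test.shape[0])    # 20% of the 200 points
print("slope:     ", model.coef_[0])     # should land near 3
print("intercept: ", model.intercept_)   # should land near 7
```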

What the numbers mean

  • The recovered line. model.coef_[0] comes out near 3 and model.intercept_ near 7. The model dug the real trend out from under the noise, but not perfectly: it lands close, not exact.
  • MAE is no longer zero. Here it is roughly 1.8, meaning a typical prediction misses the true value by about 1.8 in y's units. That is not a bug. With noise of standard deviation 2.0 baked in, some error is unavoidable: a positive MAE is the honest cost of real data. Contrast the perfect line earlier, where MAE was ~0 by construction.
  • R squared sits below 1. r2_score measures the fraction of the variation in y that the model explains: 1.0 is perfect, 0 is no better than always guessing the mean, and a negative value is worse than that. Here it is around 0.95: high, because there is a strong linear signal, but strictly under 1.0, because the noise can never be explained away. A reported R squared of exactly 1.0 on real data is a red flag, usually a sign that information leaked, for example by scoring a model on the data it was trained on or letting the target slip into the features.
  • Residuals tell you if the miss is fair. A residual is actual - predicted for one point. Their mean should sit near 0: that says the model is not systematically over- or under-shooting, just scattering above and below the line the way symmetric noise should. A residual mean far from 0 would mean a biased model.
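
The definitions in the list above can be worked through by hand. The sketch below uses four made-up actual and predicted values (not the lesson's data) and computes the residual mean, the MAE, and R squared from their definitions, then compares the last two against sklearn's functions. The array values and variable names are chosen only for illustration.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, r2_score

# Made-up values for illustration only; not the lesson's data.
y_true = np.array([10.0, 13.0, 16.0, 19.0])
y_pred = np.array([11.0, 12.0, 17.0, 18.0])

residuals = y_true - y_pred               # actual - predicted, one per point
resid_mean = residuals.mean()             # near 0 means no systematic over/undershoot
mae_by_hand = np.abs(residuals).mean()    # average size of a miss, sign ignored

ss_res = np.sum(residuals ** 2)                   # variation the model leaves unexplained
ss_tot = np.sum((y_true - y_true.mean()) ** 2)    # variation around the mean of y
r2_by_hand = 1 - ss_res / ss_tot                  # fraction of variation explained

print("residuals:    ", residuals)
print("residual mean:", resid_mean)
print("MAE by hand:  ", mae_by_hand, "sklearn:", mean_absolute_error(y_true, y_pred))
print("R^2 by hand:  ", r2_by_hand, "sklearn:", r2_score(y_true, y_pred))
```

In this toy case every prediction misses by exactly 1, alternating above and below, so the MAE is 1 while the residual mean is 0. That shows the two numbers answer different questions: how big the misses are versus whether they lean in one direction.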

So the deliverable here is a set of honest numbers, not a magic zero. Press Run to fit the line and print the recovered slope, the MAE, the R squared, and the residual mean.
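
To see why zero is out of reach, consider a model that somehow knew the true line exactly. It would still miss every point by the size of that point's noise. The sketch below estimates that floor for noise with standard deviation 2.0, both by simulation and with the standard formula for the mean absolute value of a zero-mean Gaussian, sigma * sqrt(2 / pi). The seed and sample size are arbitrary choices for this illustration.

```python
import numpy as np

sigma = 2.0                                  # the lesson's noise standard deviation
rng = np.random.RandomState(0)               # seed chosen only for this sketch
noise = rng.normal(0, sigma, size=1_000_000)

# A model that predicted exactly 3x + 7 would be off by |noise| at each point.
floor_simulated = np.abs(noise).mean()
floor_formula = sigma * np.sqrt(2 / np.pi)   # mean absolute value of Gaussian noise

print("simulated MAE floor:", floor_simulated)
print("formula MAE floor:  ", floor_formula)
```

Both figures come out at about 1.6. A test set of only 40 points will scatter somewhat above or below that, so a measured MAE in this neighborhood is what an honest fit looks like here.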

Your turn

Build noisy data from y = 3x + 7 plus Gaussian noise (rng = np.random.RandomState(42), x = np.linspace(0, 10, 200), noise = rng.normal(0, 2.0, size=x.shape), y = 3 * x + 7 + noise, X = x.reshape(-1, 1)). Split with train_test_split(X, y, test_size=0.2, random_state=0), fit a LinearRegression as model, predict on the test set, then store mae (mean absolute error), r2 (r2_score), and resid_mean (the mean of y_test - preds) as floats.
