Regression on Noisy Data
In an earlier lesson you fit a line to data that was a perfect line (y = 2x) and the error came out at zero. That felt great, but it was a lie of sorts: real data is never a clean line. Sensors jitter, people behave unpredictably, measurements round. The honest version of regression is fitting a line through a cloud of points that only roughly follows a trend, and then reporting how far off you still are. This lesson is about reading those error numbers without flinching.
Making realistic data
We start from a true relationship, y = 3x + 7, and then add Gaussian noise (small random wobbles drawn from a bell curve) so that no point sits exactly on the line:
import numpy as np
rng = np.random.RandomState(42)
x = np.linspace(0, 10, 200)
noise = rng.normal(0, 2.0, size=x.shape)  # mean 0, standard deviation 2.0
y = 3 * x + 7 + noise
X = x.reshape(-1, 1)  # sklearn wants 2-D features
The fixed seed means everyone draws the same noise, so the numbers below are reproducible and checkable. The model never sees the true slope of 3 or intercept of 7; its job is to recover them from the noisy points.
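To see what the fixed seed buys you, here is a small sketch that rebuilds the data and inspects the noise. The variable names noise_again and same_draw are ours, introduced only for illustration. It checks that re-seeding reproduces the identical draw and that the sample mean and standard deviation land near, but not exactly on, the requested 0 and 2.0.

```python
import numpy as np

# Same recipe as the lesson: true line y = 3x + 7 plus Gaussian noise.
rng = np.random.RandomState(42)
x = np.linspace(0, 10, 200)
noise = rng.normal(0, 2.0, size=x.shape)
y = 3 * x + 7 + noise
X = x.reshape(-1, 1)  # one column of features, 200 rows

# A fresh generator with the same seed produces the identical noise.
noise_again = np.random.RandomState(42).normal(0, 2.0, size=x.shape)
same_draw = np.array_equal(noise, noise_again)

print("X shape:", X.shape)
print("same draw after re-seeding:", same_draw)
print("sample noise mean:", noise.mean())  # near 0, not exactly 0
print("sample noise std:", noise.std())    # near 2.0, not exactly 2.0
```

The sample statistics only approximate the parameters passed to rng.normal, because 200 points are a finite sample from the bell curve.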
Fit, predict, and measure honestly
Same ritual as always, but we grade on a held-out test set:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
model = LinearRegression().fit(X_train, y_train)
preds = model.predict(X_test)
mae = mean_absolute_error(y_test, preds)
r2 = r2_score(y_test, preds)
What the numbers mean
- The recovered line. model.coef_[0] comes out near 3 and model.intercept_ near 7. The model dug the real trend out from under the noise, but not perfectly: it lands close, not exact.
- MAE is no longer zero. Here it is roughly 1.8, meaning a typical prediction misses the true value by about 1.8 in y's units. That is not a bug. With noise of standard deviation 2.0 baked in, some error is unavoidable: a positive MAE is the honest cost of real data. Contrast the perfect line earlier, where MAE was ~0 by construction.
- R squared sits below 1. r2_score measures the fraction of the variation the model explains: 1.0 is perfect, 0 is no better than guessing the mean. Here it is around 0.95, which is high because there is a strong linear signal, but strictly under 1.0 because the noise can never be explained away. A reported R squared of exactly 1.0 on real data is a red flag, usually a sign of leakage, such as scoring a model on data it has already memorized or letting the target slip into the features.
- Residuals tell you if the miss is fair. A residual is actual - predicted for one point. Their mean should sit near 0: that says the model is not systematically over- or under-shooting, just scattering above and below the line the way symmetric noise should. A residual mean far from 0 would mean a biased model.
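The metrics above are short formulas, and writing them out by hand can make the numbers less mysterious. The sketch below is a NumPy-only illustration, not the lesson's exact pipeline: mae_by_hand and r2_by_hand are hypothetical helper names, np.polyfit stands in for LinearRegression, and a shuffled index split stands in for train_test_split. Because the split differs, the printed values will be close to the ones quoted above but not identical.

```python
import numpy as np

def mae_by_hand(y_true, y_pred):
    """Mean absolute error: the average size of the miss."""
    return float(np.mean(np.abs(y_true - y_pred)))

def r2_by_hand(y_true, y_pred):
    """R squared: 1 - (model's squared error) / (squared error of guessing the mean)."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return float(1.0 - ss_res / ss_tot)

# Same data recipe as the lesson.
rng = np.random.RandomState(42)
x = np.linspace(0, 10, 200)
y = 3 * x + 7 + rng.normal(0, 2.0, size=x.shape)

# Stand-in split (assumption, not train_test_split): hold out 20% of shuffled indices.
idx = np.random.RandomState(0).permutation(len(x))
test_idx, train_idx = idx[:40], idx[40:]

# Stand-in fit (assumption, not LinearRegression): least-squares line via np.polyfit.
slope, intercept = np.polyfit(x[train_idx], y[train_idx], deg=1)
preds = slope * x[test_idx] + intercept

mae = mae_by_hand(y[test_idx], preds)
r2 = r2_by_hand(y[test_idx], preds)
resid_mean = float(np.mean(y[test_idx] - preds))  # actual - predicted, averaged

# Contrast: the same steps on a noise-free line give a near-zero MAE and R squared of ~1.
y_clean = 3 * x + 7
clean_slope, clean_intercept = np.polyfit(x[train_idx], y_clean[train_idx], deg=1)
clean_preds = clean_slope * x[test_idx] + clean_intercept
clean_mae = mae_by_hand(y_clean[test_idx], clean_preds)
clean_r2 = r2_by_hand(y_clean[test_idx], clean_preds)

print("slope:", slope, "intercept:", intercept)
print("noisy MAE:", mae, "R squared:", r2, "residual mean:", resid_mean)
print("clean MAE:", clean_mae, "R squared:", clean_r2)
```

sklearn's mean_absolute_error and r2_score compute these same two quantities, so the helpers are only there to show what is being measured.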
So the deliverable here is a set of honest numbers, not a magic zero. Press Run to fit the line and print the recovered slope, the MAE, the R squared, and the residual mean.
Build noisy data from y = 3x + 7 plus Gaussian noise (rng = np.random.RandomState(42), x = np.linspace(0, 10, 200), noise = rng.normal(0, 2.0, size=x.shape), y = 3 * x + 7 + noise, X = x.reshape(-1, 1)). Split with train_test_split(X, y, test_size=0.2, random_state=0), fit a LinearRegression as model, predict on the test set, then store mae (mean absolute error), r2 (r2_score), and resid_mean (the mean of y_test - preds) as floats.