Train/Test Split: Honest Measurement
Here is the single most important habit in machine learning. If you train a model and then test it on the same data, you learn nothing useful. A model can score perfectly just by memorizing the answers it already saw. That is like grading a student on the exact questions they studied: it measures memory, not understanding.
The fix is to hold some data back. You train on most of it, then test on rows the model has never seen. Test performance is your honest estimate of how it will do on real, future data.
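To see why held-back rows matter, here is a minimal sketch in plain Python. The data is made up (random numbers with coin-flip labels), and the "model" is only a hypothetical lookup table, not a real learning algorithm. Because the labels are pure chance, any honest score should land near 50 percent.

```python
import random

random.seed(0)

# Hypothetical toy data: each row is a pair of random numbers and a coin-flip label.
# There is no pattern to learn, so an honest score should sit near 0.5.
def make_rows(n):
    return [((random.random(), random.random()), random.randint(0, 1)) for _ in range(n)]

seen = make_rows(200)    # rows the "model" gets to study
unseen = make_rows(200)  # fresh rows it has never met

# A "model" that only memorizes: a lookup table from features to label.
memory = {features: label for features, label in seen}

def predict(features):
    # Return the memorized answer, or guess 0 for an unfamiliar row.
    return memory.get(features, 0)

def accuracy(rows):
    return sum(predict(features) == label for features, label in rows) / len(rows)

acc_seen = accuracy(seen)
acc_unseen = accuracy(unseen)
print("score on seen rows:  ", acc_seen)
print("score on unseen rows:", acc_unseen)
```

The score on the seen rows is perfect even though nothing was learned. Only the score on the unseen rows shows that the model is guessing.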
scikit-learn gives you one function for this:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
It returns four pieces, always in that order: train features, test features, train labels, test labels. Two arguments matter:
test_size=0.2 holds out 20 percent for testing and keeps 80 percent for training.
random_state=0 fixes the shuffle so you get the same split every run. Without it the split is random each time, and your results would jiggle and could not be checked. In this whole course, always pass random_state=0.
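As a quick check of the return order and of both arguments, here is a small sketch on made-up data. The arrays are assumptions chosen for illustration (50 rows, 3 columns), built so that each feature row encodes its own row number and makes it easy to confirm that features and labels stay paired after the shuffle.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: row i of X is [3i, 3i+1, 3i+2], and its label is i % 2.
X = np.arange(150).reshape(50, 3)
y = np.arange(50) % 2

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# 20 percent of 50 rows is 10 test rows, leaving 40 training rows.
print(X_train.shape, X_test.shape)  # (40, 3) (10, 3)
print(y_train.shape, y_test.shape)  # (40,) (10,)

# Rows are shuffled, but each feature row keeps its own label.
original_rows = X_train[:, 0] // 3
still_paired = np.array_equal(original_rows % 2, y_train)
print("features and labels still paired:", still_paired)

# Splitting again with the same random_state gives the identical split.
X_train2, X_test2, y_train2, y_test2 = train_test_split(
    X, y, test_size=0.2, random_state=0
)
same_split = np.array_equal(X_test, X_test2) and np.array_equal(y_test, y_test2)
print("same split on second run:", same_split)
```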
From now on, one strict rule applies: you may train on the training set, but you grade only on the test set. Never report a score from data the model trained on.
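In practice the rule might look like the sketch below. The data is made up, and the decision tree is only an assumed stand-in, since this lesson does not name a model. The point is the pattern: fit on the training set, score on the test set.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data: 200 rows, 2 features, labels with some noise mixed in.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] + rng.normal(scale=1.0, size=200) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Train on the training set only. The model choice here is an assumption.
model = DecisionTreeClassifier(random_state=0)
model.fit(X_train, y_train)

# Scoring on training rows rewards memorization, so do not report this number.
train_score = model.score(X_train, y_train)

# Scoring on held-out rows is the honest estimate. Report this one.
test_score = model.score(X_test, y_test)

print("train score (do not report):", train_score)
print("test score (report this):   ", test_score)
```

An unrestricted decision tree can fit its training rows exactly, so the train score says little. The test score is usually lower and is the number to trust.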
Arrays X (10 rows, 2 columns) and y (10 labels) are given. Use train_test_split with test_size=0.2 and random_state=0 to get X_train, X_test, y_train, y_test. Then set n_train to the number of training rows and n_test to the number of test rows.