Your First Machine Learning Models

Train/Test Split: Honest Measurement

Here is the single most important habit in machine learning. If you train a model and then test it on the same data, the score tells you almost nothing about how the model will handle new data. A model can score perfectly just by memorizing the answers it already saw. That is like grading a student on the exact questions they studied: it measures memory, not understanding.

The fix is to hold some data back. You train on most of it, then test on rows the model has never seen. Test performance is your honest estimate of how it will do on real, future data.
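To make "holding data back" concrete, here is a minimal hand-rolled sketch using only NumPy. The toy arrays, the 80/20 proportion, and the seed are assumptions chosen for illustration; it is one plausible way to do the split by hand, not the method the course requires.

```python
import numpy as np

# Toy data, invented for illustration: 20 rows, 1 feature, 20 labels.
X = np.arange(20).reshape(20, 1)
y = np.arange(20) % 2

# Shuffle the row indices so the held-out rows are a random sample.
rng = np.random.default_rng(0)
indices = rng.permutation(len(X))

# Keep 80 percent of the shuffled rows for training, the rest for testing.
n_train = round(0.8 * len(X))
train_idx, test_idx = indices[:n_train], indices[n_train:]

X_train, y_train = X[train_idx], y[train_idx]
X_test, y_test = X[test_idx], y[test_idx]

# No row index appears in both sets.
overlap = set(train_idx) & set(test_idx)
print(len(X_train), len(X_test), len(overlap))
```

The key property is the empty overlap: every test row is one the model would never have seen during training.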

scikit-learn gives you one function for this:

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

It returns four pieces, always in that order: train features, test features, train labels, test labels. Two arguments matter:

  • test_size=0.2 holds out 20 percent for testing and keeps 80 percent for training.
  • random_state=0 fixes the shuffle so you get the same split every run. Without it the split changes on every run, so your scores shift a little each time and nobody, including you, can reproduce them. In this whole course, always pass random_state=0.
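The return order and the effect of random_state can be checked with a short sketch. The toy arrays below (50 rows, 3 columns) are invented for illustration and are not the lesson's data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data, invented for illustration: row i of X is [3i, 3i+1, 3i+2], label is i.
X = np.arange(150).reshape(50, 3)
y = np.arange(50)

# Two calls with the same random_state produce the same split.
a_train, a_test, ya_train, ya_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
b_train, b_test, yb_train, yb_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

print(a_train.shape, a_test.shape)     # (40, 3) (10, 3)
print(ya_train.shape, ya_test.shape)   # (40,) (10,)
print(np.array_equal(a_test, b_test))  # True

# Features and labels stay aligned after shuffling:
# the first column of each test row is still 3 times its label.
print(np.array_equal(a_test[:, 0], 3 * ya_test))  # True
```

If you drop random_state from either call, the two splits will generally differ from run to run, which is exactly the behavior the rule above is meant to prevent.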

From now on a strict rule: you may train on the training set, but you only ever grade on the test set. Never report a score from data the model trained on.
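The sketch below illustrates why the rule matters. It is built on assumptions that are not part of this lesson: the features are random noise, the labels are coin flips, and the model is scikit-learn's DecisionTreeClassifier, picked only because an unrestricted tree can memorize its training rows.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Random features and coin-flip labels: there is no real pattern to learn.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = rng.integers(0, 2, size=1000)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# An unrestricted decision tree keeps splitting until it fits every training row.
model = DecisionTreeClassifier(random_state=0)
model.fit(X_train, y_train)

# Score on data the model saw versus data it never saw.
train_acc = model.score(X_train, y_train)
test_acc = model.score(X_test, y_test)
print(f"train accuracy: {train_acc:.2f}")
print(f"test accuracy:  {test_acc:.2f}")
```

The training score comes out perfect even though the labels are pure noise, while the test score lands near chance. Only the test number reflects what the model would do on new data.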

Your turn

Arrays X (10 rows, 2 columns) and y (10 labels) are given. Use train_test_split with test_size=0.2 and random_state=0 to get X_train, X_test, y_train, y_test. Then set n_train to the number of training rows and n_test to the number of test rows.
