Your First Machine Learning Models

Underfitting and Overfitting

This is the central tension in all of machine learning, and now you have the tools to see it directly. A model can fail in two opposite ways.

  • Underfitting: the model is too simple to capture the pattern. It does poorly on both the training data and the test data. Think of drawing a straight line through data that clearly curves.
  • Overfitting: the model is too complex and memorizes the training data, including its noise and quirks. It scores great on training but poorly on test, because it learned the specific examples instead of the general rule.
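
The straight-line picture can be made concrete with a small regression sketch. This is only an illustration: the data (a noisy y = x^2 curve), the alternating train/test split, and the use of mean squared error instead of accuracy are all assumptions made for the example, not part of the lesson's dataset.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: a clear curve (y = x^2) plus a little noise.
x = np.linspace(-3, 3, 60)
y = x**2 + rng.normal(scale=0.5, size=x.size)

# Alternate points between train and test so both cover the full range.
x_train, y_train = x[::2], y[::2]
x_test, y_test = x[1::2], y[1::2]


def mse(coeffs, xs, ys):
    """Mean squared error of a fitted polynomial on (xs, ys)."""
    return float(np.mean((np.polyval(coeffs, xs) - ys) ** 2))


line = np.polyfit(x_train, y_train, deg=1)   # straight line: too simple
curve = np.polyfit(x_train, y_train, deg=2)  # matches the true shape

line_train, line_test = mse(line, x_train, y_train), mse(line, x_test, y_test)
curve_train, curve_test = mse(curve, x_train, y_train), mse(curve, x_test, y_test)

print(f"line : train MSE {line_train:.2f}, test MSE {line_test:.2f}")
print(f"curve: train MSE {curve_train:.2f}, test MSE {curve_test:.2f}")
```

The straight line should show a large error on both sets, which is the signature of underfitting: the model is wrong everywhere, not just on unseen data.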

A decision tree's complexity is controlled mainly by max_depth, the maximum number of question layers (levels of splits) it is allowed between the root and any leaf:

from sklearn.tree import DecisionTreeClassifier

shallow = DecisionTreeClassifier(max_depth=1, random_state=0)     # too simple: one question only
deep    = DecisionTreeClassifier(max_depth=None, random_state=0)  # unlimited depth

A depth-1 tree gets to ask just one question, so it is forced to be crude and tends to underfit. An unlimited-depth tree keeps splitting until every leaf contains a single class, which on noisy data often means isolating individual training rows. It can therefore drive training accuracy to a perfect 1.0 while doing worse on unseen test rows.
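
As a rough sketch of that contrast, the two trees can be fitted on a stand-in dataset. The data here comes from scikit-learn's make_classification with flip_y adding label noise; the sample counts, noise level, and split size are assumptions chosen for illustration, and the lesson's prepared dataset will give different numbers.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Stand-in noisy dataset (an assumption; the lesson supplies its own).
# flip_y randomly reassigns a share of the labels to simulate noise.
X, y = make_classification(n_samples=600, n_features=10, n_informative=4,
                           flip_y=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

shallow = DecisionTreeClassifier(max_depth=1, random_state=0).fit(X_train, y_train)
deep = DecisionTreeClassifier(max_depth=None, random_state=0).fit(X_train, y_train)

# For classifiers, .score() returns accuracy on the given rows.
for name, tree in [("shallow", shallow), ("deep", deep)]:
    print(f"{name:8s} depth={tree.get_depth():2d} "
          f"train={tree.score(X_train, y_train):.3f} "
          f"test={tree.score(X_test, y_test):.3f}")
```

get_depth() reports how deep each fitted tree actually grew, which makes it easy to see how far the unlimited tree went to separate the training rows.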

The tell-tale sign of overfitting is a gap: training accuracy much higher than test accuracy. The whole craft of tuning a model is finding the middle complexity that does well on test data, which is the only score that counts.
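
One way to look for that middle complexity is to sweep max_depth and print the gap at each setting. The sketch below again assumes a synthetic noisy dataset from make_classification; the list of depths is an arbitrary choice for illustration, not a recommendation.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Stand-in noisy data (an assumption, not the lesson's dataset).
X, y = make_classification(n_samples=600, n_features=10, n_informative=4,
                           flip_y=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Map each max_depth to (train accuracy, test accuracy, gap).
results = {}
for depth in [1, 2, 3, 5, 8, None]:
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    train_acc = tree.score(X_train, y_train)
    test_acc = tree.score(X_test, y_test)
    results[depth] = (train_acc, test_acc, train_acc - test_acc)

for depth, (train_acc, test_acc, gap) in results.items():
    print(f"max_depth={str(depth):4s} train={train_acc:.3f} "
          f"test={test_acc:.3f} gap={gap:+.3f}")
```

Reading down the output, training accuracy typically climbs with depth while test accuracy levels off or drops, so the gap widens. The depth where test accuracy peaks is the candidate for the middle complexity.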

You will train both a shallow and a deep tree, record train and test accuracy for each, and compute the overfitting gap so the relationship is right there in the numbers.

Your turn

A noisy dataset and a train/test split are already prepared. Train shallow = DecisionTreeClassifier(max_depth=1, random_state=0) and deep = DecisionTreeClassifier(max_depth=None, random_state=0) on the training data. Record four numbers: shallow_train, shallow_test, deep_train, deep_test (accuracy on the matching set for each tree). Then set overfit_gap = deep_train - deep_test.
