Underfitting and Overfitting
The tension between underfitting and overfitting is central to machine learning, and now you have the tools to see it directly. A model can fail in two opposite ways.
- Underfitting: the model is too simple to capture the pattern. It does poorly on both the training data and the test data. Think of drawing a straight line through data that clearly curves.
- Overfitting: the model is too complex and memorizes the training data, including its noise and quirks. It scores very well on the training data but poorly on the test data, because it learned the specific examples instead of the general rule.
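The straight-line picture can be put into numbers with a small sketch in plain Python. Everything here is made up for illustration: the data follows y = x squared with no noise, and fit_line and mse are hypothetical helper names, not part of any library. The line model is too simple for curved data, so its error stays high on both sets, while a model that is allowed to use x squared fits well.

```python
# Hypothetical toy data: y = x**2, sampled at different points for train and test.
x_train = [-3 + 0.5 * i for i in range(13)]      # -3.0, -2.5, ..., 3.0
x_test = [-2.75 + 0.5 * i for i in range(12)]    # -2.75, -2.25, ..., 2.75
y_train = [x ** 2 for x in x_train]
y_test = [x ** 2 for x in x_test]


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept (illustrative helper)."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def mse(model, xs, ys, transform=lambda x: x):
    """Mean squared error of a (slope, intercept) model on the given points."""
    slope, intercept = model
    return sum((slope * transform(x) + intercept - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def square(x):
    return x ** 2


# Too simple: a straight line in x cannot follow the curve.
line = fit_line(x_train, y_train)
line_train = mse(line, x_train, y_train)
line_test = mse(line, x_test, y_test)

# Flexible enough: the same least-squares fit, but on the feature x**2.
curve = fit_line([square(x) for x in x_train], y_train)
curve_train = mse(curve, x_train, y_train, transform=square)
curve_test = mse(curve, x_test, y_test, transform=square)

print("line : train MSE", line_train, "test MSE", line_test)
print("curve: train MSE", curve_train, "test MSE", curve_test)
```

The line's error is large on both the training and the test points, which is the signature of underfitting: the problem is not memorization but a model that cannot express the pattern at all.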
A decision tree's complexity is controlled by max_depth, the number of question-layers it is allowed:
from sklearn.tree import DecisionTreeClassifier
shallow = DecisionTreeClassifier(max_depth=1, random_state=0)  # too simple: one question only
deep = DecisionTreeClassifier(max_depth=None, random_state=0)  # unlimited depth
A depth-1 tree gets to ask just one question, so it is forced to be crude and tends to underfit. An unlimited-depth tree keeps splitting until it isolates the training rows, so it can drive training accuracy to a perfect 1.0 while doing worse on unseen test rows.
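To see what "one question" versus "keeps splitting" looks like in practice, the sketch below fits both trees and inspects their size. The dataset is an assumption: it is generated with make_classification as a stand-in, since the lesson's prepared data is not shown here. get_depth and get_n_leaves are scikit-learn methods on a fitted tree.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Hypothetical stand-in data: 300 rows, 10 features.
# flip_y=0.2 assigns roughly 20% of the labels at random, which acts as noise.
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=4, flip_y=0.2, random_state=0
)

shallow = DecisionTreeClassifier(max_depth=1, random_state=0).fit(X, y)
deep = DecisionTreeClassifier(max_depth=None, random_state=0).fit(X, y)

# Compare the size of each fitted tree and its accuracy on the rows it was trained on.
for name, tree in [("shallow", shallow), ("deep", deep)]:
    print(
        name,
        "depth:", tree.get_depth(),
        "leaves:", tree.get_n_leaves(),
        "training accuracy:", tree.score(X, y),
    )
```

The shallow tree ends up with a single split and two leaves. The deep tree grows many leaves, each covering only a few training rows, which is how it reaches perfect training accuracy even though some of the labels are noise.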
The tell-tale sign of overfitting is a gap: training accuracy much higher than test accuracy. The whole craft of tuning a model is finding the middle complexity that does well on test data, which is the only score that counts.
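One way to look for that middle complexity is to sweep max_depth and watch the gap. This is a minimal sketch under assumptions: the data is a synthetic stand-in from make_classification, the 70/30 split and the list of depths are arbitrary choices, and the variable names are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hypothetical noisy dataset and split (not the lesson's prepared data).
X, y = make_classification(
    n_samples=600, n_features=10, n_informative=4, flip_y=0.2, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

results = []
for depth in [1, 2, 3, 5, 8, None]:
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    train_acc = tree.score(X_train, y_train)  # accuracy on rows the tree has seen
    test_acc = tree.score(X_test, y_test)     # accuracy on held-out rows
    gap = train_acc - test_acc                # a large gap suggests overfitting
    results.append((depth, train_acc, test_acc, gap))
    print(f"max_depth={depth!s:>4}  train={train_acc:.3f}  test={test_acc:.3f}  gap={gap:.3f}")
```

Training accuracy reaches 1.0 for the unlimited tree, so the interesting column is the test score and the gap next to it. Which depth gives the best test score depends on the data and the split, so no specific numbers are quoted here.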
You will train both a shallow and a deep tree, record train and test accuracy for each, and compute the overfitting gap so the relationship is right there in the numbers.
A noisy dataset and a train/test split are already prepared. Train shallow = DecisionTreeClassifier(max_depth=1, random_state=0) and deep = DecisionTreeClassifier(max_depth=None, random_state=0) on the training data. Record four numbers: shallow_train, shallow_test, deep_train, deep_test (accuracy on the matching set for each tree). Then set overfit_gap = deep_train - deep_test.