This tree is of course much bigger and more complex for larger datasets than our simple example above. The tree grows and takes its shape according to the data we provide it (training the model). Still, this simple diagram shows how straightforward the algorithm actually is. You can already imagine that to split the data properly, you need to ask the right question at every node, starting from the top one. In other words, which features to split on and which conditions to use are crucial to building a well-performing decision tree. So how do we decide?
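Decision tree algorithms make this choice by scoring every candidate split with an impurity measure, typically Gini impurity or entropy, and keeping the split that reduces impurity the most (the largest information gain). As a rough sketch of the two measures (an illustration only, not scikit-learn's internal implementation):
import numpy as np

def gini_impurity(labels):
    # Gini impurity: 1 minus the sum of squared class probabilities
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def entropy(labels):
    # Entropy: negative sum of p * log2(p) over the classes present
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

print(gini_impurity([0, 0, 0, 1]), entropy([0, 0, 0, 1]))  # mostly one class: Gini ~0.38, entropy ~0.81
print(gini_impurity([0, 1, 0, 1]), entropy([0, 1, 0, 1]))  # evenly mixed: Gini 0.5, entropy 1.0
A good question is one whose answer sends the samples into child nodes with impurity close to zero, and that is exactly what the algorithm searches for at every split.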
We can use the DecisionTreeClassifier model from the scikit-learn library (see the DecisionTreeClassifier documentation):
# Import train_test_split function and the dataset
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier
cancer = load_breast_cancer()
y = cancer.target
X = cancer.data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
clf = DecisionTreeClassifier(random_state=0)  # default splitting criterion is 'gini'
clf.fit(X_train, y_train)
print("Training accuracy:{:.2f}".format(clf.score(X_train,y_train)))
print("Test accuracy: {:.2f}".format(clf.score(X_test,y_test)))
Our training accuracy being higher than our test accuracy indicates that the model is overfitting the training data. Let’s plot our decision tree and examine its complexity.
from sklearn import tree
import matplotlib.pyplot as plt
fig, axes = plt.subplots(nrows = 1,ncols = 1,figsize = (16,8), dpi=100)
tree.plot_tree(clf, feature_names = cancer.feature_names, class_names=cancer.target_names, filled = True, fontsize = 5);
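Besides plotting it, we can also query the fitted tree’s size directly, which gives a quick numerical sense of its complexity (the exact values depend on the training split):
print("Tree depth:", clf.get_depth())
print("Number of leaves:", clf.get_n_leaves())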
Changing the attribute selection measure (ASM) used as the splitting criterion from Gini to entropy:
clf = DecisionTreeClassifier(criterion='entropy', random_state=0)
clf.fit(X_train, y_train)
print("Training accuracy:{:.2f}".format(clf.score(X_train,y_train)))
print("Test accuracy: {:.2f}".format(clf.score(X_test,y_test)))
Our test accuracy improved when using entropy as the attribute selection measure for splitting. We can look at the tree again to see whether any of the splits have changed.
fig, axes = plt.subplots(nrows = 1,ncols = 1,figsize = (16,8), dpi=100)
tree.plot_tree(clf, feature_names = cancer.feature_names, class_names=cancer.target_names, filled = True, fontsize = 5);
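If the plot is hard to read at this size, the same splits can also be printed as text rules with scikit-learn’s export_text helper, which makes it easier to compare the trees node by node; here is a sketch that prints only the first three levels:
from sklearn.tree import export_text
# Print the tree's decision rules as indented text, limited to the first three levels
print(export_text(clf, feature_names=list(cancer.feature_names), max_depth=3))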
As we can see, our tree is quite complex, which results in overfitting the data. We can reduce the tree’s complexity, and thereby prevent overfitting, by setting a maximum depth (max_depth).
clf = DecisionTreeClassifier(criterion='entropy', max_depth=3, random_state=0)  # keep the entropy criterion, but cap the tree at three levels
clf.fit(X_train, y_train)
print("Training accuracy:{:.2f}".format(clf.score(X_train,y_train)))
print("Test accuracy: {:.2f}".format(clf.score(X_test,y_test)))
By reducing the maximum depth of our decision tree to three, we were able to decrease overfitting and slightly increase our test accuracy. If we examine our tree diagram now, we will see a much simpler tree…
By setting max_depth to 3, we reduced the complexity of the decision tree by pruning it. This pruned model is less complex and a little easier to interpret than the previous model, where the tree kept splitting until all leaves were pure (an impurity of 0).
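max_depth is only one of several ways to keep a tree small. Two common alternatives in scikit-learn are min_samples_leaf, which forbids leaves with too few samples, and ccp_alpha, which enables cost-complexity (post-)pruning. The values below are illustrative examples, not tuned for this dataset:
# Require at least 10 samples in every leaf instead of capping the depth
clf_leaf = DecisionTreeClassifier(criterion='entropy', min_samples_leaf=10, random_state=0)
clf_leaf.fit(X_train, y_train)
print("Test accuracy (min_samples_leaf=10): {:.2f}".format(clf_leaf.score(X_test, y_test)))

# Cost-complexity pruning: a larger ccp_alpha prunes the tree more aggressively
clf_ccp = DecisionTreeClassifier(criterion='entropy', ccp_alpha=0.01, random_state=0)
clf_ccp.fit(X_train, y_train)
print("Test accuracy (ccp_alpha=0.01): {:.2f}".format(clf_ccp.score(X_test, y_test)))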
Decision trees can also be used for regression. To see how tree depth affects a regression fit, let’s generate a small one-dimensional dataset: a noisy sine curve.
import numpy as np
np.random.seed(5)
# 40 random points in [0, 5), sorted so predictions can be plotted as a curve
X = np.sort(5 * np.random.rand(40, 1), axis=0)
y = np.sin(X).ravel()
# Add noise to every 5th target
y[::5] += 1 * (0.5 - np.random.rand(8))
from sklearn.tree import DecisionTreeRegressor
dt_reg = DecisionTreeRegressor(criterion="squared_error", random_state=0)
dt_reg.fit(X, y)
fig, axes = plt.subplots(nrows = 1, ncols = 1, figsize=(16,8), dpi=100)
tree.plot_tree(dt_reg, feature_names=['X'], filled=True, fontsize=5);
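As with the classifier, we can check how large this unconstrained regressor has grown; with no depth limit it keeps splitting until the leaves fit the training points almost exactly:
print("Regressor depth:", dt_reg.get_depth())
print("Number of leaves:", dt_reg.get_n_leaves())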
# generate evenly spaced test inputs over the same range
T = np.linspace(0, 5, 100)[:, np.newaxis]
#creating two regression trees with different depths
dt_reg_1 = DecisionTreeRegressor(max_depth = 10, random_state=0)
dt_reg_2 = DecisionTreeRegressor(max_depth = 3, random_state=0)
#training the models
dt_reg_1.fit(X, y)
dt_reg_2.fit(X, y)
#making predictions for the random test data we generated above
y_pred_1 = dt_reg_1.predict(T)
y_pred_2 = dt_reg_2.predict(T)
#comparison plot to see the effect of tree depth
plt.figure()
plt.scatter(X, y, s=40, c="orange", label="actual")
plt.plot(T, y_pred_1, color="b", label="max_depth=10", linewidth=2)
plt.plot(T, y_pred_2, color="g", label="max_depth=3", linewidth=2)
plt.xlabel("X")
plt.ylabel("y")
plt.title("Decision Tree Regression")
plt.legend()
plt.show()
Looking at the figure, we can see that the decision tree regressor with max depth set to ten overfits the data, capturing all of the noise. In contrast, the tree with max depth set to three generalizes much better, producing a good fit to the underlying curve without chasing the noise.
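To put numbers on this instead of judging by eye, we can hold out part of the sine data and compare the two depths with mean squared error. This is a sketch; the exact values depend on the random noise and the split:
from sklearn.metrics import mean_squared_error
# Hold out 30% of the sine data and compare training vs. test error for both depths
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
for depth in (3, 10):
    reg = DecisionTreeRegressor(max_depth=depth, random_state=0)
    reg.fit(X_tr, y_tr)
    print("max_depth={:>2d}  train MSE: {:.3f}  test MSE: {:.3f}".format(
        depth,
        mean_squared_error(y_tr, reg.predict(X_tr)),
        mean_squared_error(y_te, reg.predict(X_te))))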
Advantages
Disadvantages