What is a Decision Tree?

A decision tree is a machine learning model that resembles a tree-like graph (refer to Pradyun’s post: https://thecodestories.com/2020/03/29/graphs-an-overview/). It is used either to classify a set of features into a target class or to predict a target value based on several input values. The latter is called regression, and this post covers only a decision tree regressor, not a classifier. A decision tree is made up of a root, nodes, branches and leaves. The root is the first node of the tree – it has no branches coming into it. The root splits into 2 nodes connected by branches, and each node can in turn split into 2 more nodes. A leaf is a node that contains a prediction – it has no branches coming out of it. Each node in the tree contains a condition on a certain feature, and based on whether that condition is fulfilled or not, the algorithm moves down towards a leaf, or prediction. Below is an example of a decision tree with a depth of 2:

A sample decision tree with a depth of 2. This is the decision tree obtained upon fitting a model on the Boston Housing dataset. The root and nodes contain conditions on certain parameters. The leaves contain the values that are predicted.
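The traversal described above can be sketched in plain Python. The feature names, thresholds and leaf values below are invented for illustration – they are not taken from the fitted model:

```python
# A minimal sketch of decision-tree traversal: each internal node holds a
# condition (feature, threshold); each leaf holds a predicted value.
# All names and numbers here are made up for illustration.

def predict(node, sample):
    """Walk from the root to a leaf, following the conditions."""
    while "value" not in node:               # internal node
        if sample[node["feature"]] <= node["threshold"]:
            node = node["left"]
        else:
            node = node["right"]
    return node["value"]                     # leaf: return the prediction

# A hand-built tree with a depth of 2.
tree = {
    "feature": "rooms", "threshold": 6.5,
    "left":  {"feature": "crime_rate", "threshold": 0.1,
              "left":  {"value": 25.0}, "right": {"value": 18.0}},
    "right": {"feature": "rooms", "threshold": 7.5,
              "left":  {"value": 32.0}, "right": {"value": 45.0}},
}

print(predict(tree, {"rooms": 7.0, "crime_rate": 0.05}))  # 32.0
```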

The Boston Housing Dataset

The Boston Housing dataset is a built-in dataset in sklearn, meant for regression. It contains 506 observations of houses in Boston across 13 training features such as crime rate, tax and number of rooms, and one target feature, the median value of the house in $1000s.

You can read more about the Boston housing dataset here: https://www.kaggle.com/c/boston-housing

Exploring the Boston Dataset using Pandas

Pandas is a dataframe exploration and manipulation module in Python. It can be used to convert dictionary-like objects into pandas DataFrame objects. To convert the Boston dataset into a pandas dataframe, run the following code:

import pandas as pd
from sklearn.datasets import load_boston

boston = load_boston()
X = pd.DataFrame(boston.data, columns=boston.feature_names)
y = pd.DataFrame(boston.target, columns=["target"])

The dataframe X holds all the features of the houses and y holds the target values. Try commands like X.shape, X.describe() and X.columns to get to know your data.
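These exploration calls work on any DataFrame; the toy data below stands in for the Boston features so the sketch is self-contained:

```python
import pandas as pd

# A small stand-in for the Boston features, just to demonstrate the calls.
X = pd.DataFrame({
    "CRIM": [0.006, 0.027, 0.027],   # per-capita crime rate
    "RM":   [6.575, 6.421, 7.185],   # average rooms per dwelling
    "TAX":  [296.0, 242.0, 242.0],   # property-tax rate
})

print(X.shape)       # (rows, columns) -> (3, 3)
print(X.columns)     # the feature names
print(X.describe())  # count, mean, std, min, quartiles, max per column
```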

Training and Testing a Decision Tree Regressor Using scikit-learn

Decision trees are available to us in the module sklearn.tree, from which we can import DecisionTreeRegressor. We then split the features and targets into training and testing data using train_test_split() from sklearn.model_selection. We create and then fit the model just like we did for the k-NN classifier.

import pandas as pd
from sklearn.datasets import load_boston
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split

boston = load_boston()
X = pd.DataFrame(boston.data, columns=boston.feature_names)
y = pd.DataFrame(boston.target, columns=["target"])
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

dtr_model = DecisionTreeRegressor(random_state=0, max_depth=2)
dtr_model.fit(X_train, y_train)

Now, we need to measure our model’s error. One way to do this is with mean_absolute_error from sklearn.metrics, which calculates the average absolute deviation of the predicted target values from the actual target values.

from sklearn.metrics import mean_absolute_error

y_preds = dtr_model.predict(X_test)
print(mean_absolute_error(y_test, y_preds), dtr_model.get_depth())
#get_depth returns the depth of the tree
The output obtained upon running the above code. The mean error is about 3.9, or $3900 (since the target is in thousands of dollars), when the depth of the tree is 2.
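As a sanity check, the same metric can be computed by hand – MAE is just the mean of the absolute differences. The arrays below are invented, not the Boston predictions:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Invented actual and predicted values, just to illustrate the formula.
actual = np.array([24.0, 21.6, 34.7, 33.4])
preds  = np.array([22.0, 25.6, 30.7, 33.4])

manual = np.mean(np.abs(actual - preds))   # mean of |actual - predicted|
print(manual)                              # 2.5
print(mean_absolute_error(actual, preds))  # same value
```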

Also try changing the max_depth parameter to see how the mean absolute error changes with the changing tree depth.

Optimizing the Model

Clearly, the mean absolute error depends on the depth of the tree. A low tree depth means that the model cannot capture trends in the data and is said to be underfit. If the depth is very high, the model learns the outliers and noise in the training data and is said to be overfit. So, how do we choose the appropriate max depth value? Hello, data visualisation. Add the following code to view the mean absolute error for different values of max_depth as a graph:

import matplotlib.pyplot as plot

maeList = []
depthList = []

for depths in range(1, 19):
    dtr_model = DecisionTreeRegressor(random_state=0, max_depth=depths)
    dtr_model.fit(X_train, y_train)
    preds = dtr_model.predict(X_test)
    this_mae = mean_absolute_error(y_test, preds)
    maeList.append(this_mae)
    depthList.append(depths)

plot.figure()
plot.plot(depthList, maeList)
plot.title("Mean Absolute Error for different max depth values")
plot.xlabel("Depth")
plot.ylabel("MAE")
plot.show()
The output obtained upon running the code. From this, we can infer that our first model was underfit, and models with depths > 9 appear to be overfit. 5 and 6 seem to be the ideal depths for the tree.
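Rather than reading the best depth off the graph, we can also search the loop’s lists for the minimum MAE. The values in maeList below are made up so the sketch stands alone:

```python
# Given the depthList/maeList pairs collected in the loop above, the best
# depth is the one with the smallest MAE. The MAE values here are invented.
depthList = [1, 2, 3, 4, 5, 6, 7, 8]
maeList   = [5.1, 3.9, 3.2, 2.9, 2.6, 2.5, 2.7, 2.8]

best_index = maeList.index(min(maeList))
best_depth = depthList[best_index]
print(best_depth)  # 6 in this made-up example
```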

Visualising the Predictions Made by the Model

It might help us to know how far the model’s predictions are from the actual values. We shall set max_depth to 6, now that we know it is the optimal depth for the tree. Run the following code:

import pandas as pd
from sklearn.datasets import load_boston
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error
import matplotlib.pyplot as plot

boston = load_boston()
X = pd.DataFrame(boston.data, columns=boston.feature_names)
y = pd.DataFrame(boston.target, columns=["target"])
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

dtr_model = DecisionTreeRegressor(random_state=0, max_depth=6)
dtr_model.fit(X_train, y_train)
y_preds = dtr_model.predict(X_test)
print(mean_absolute_error(y_test, y_preds), dtr_model.get_depth())

plot.figure()
plot.plot(y_test, y_test)
plot.scatter(y_test, y_preds, s=10, c="red")
plot.title("Actual vs Predicted values")
plot.xlabel("Actual values")
plot.ylabel("Predicted values")
plot.show()
The output obtained upon running the above code. The blue line represents the actual values of the testing targets and the red dots are the model’s predicted values.

Visualising the Decision Tree

NOTE: You will have to install the modules graphviz and pydotplus to run the following code. Graphviz is software for visualising graph data structures of all types. Make sure graphviz is included in the PATH system variable.

It is possible for us to visualise the tree with all its nodes and leaves. We do so using the export_graphviz() method from sklearn.tree, which encodes the tree in the DOT graph-description format. Then, we convert the DOT data into a graph object using pydotplus and write it out to a .png file, which we can find and view in a file explorer. The code for doing so is as follows:

import pandas as pd
from sklearn.datasets import load_boston
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error
from sklearn.tree import export_graphviz
import pydotplus

boston = load_boston()
X = pd.DataFrame(boston.data, columns=boston.feature_names)
y = pd.DataFrame(boston.target, columns=["target"])
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

dtr_model = DecisionTreeRegressor(random_state=0, max_depth=6)
dtr_model.fit(X_train, y_train)
y_preds = dtr_model.predict(X_test)
print(mean_absolute_error(y_test, y_preds), dtr_model.get_depth())

dot_data = export_graphviz(dtr_model, feature_names=boston.feature_names, out_file=None,
                           filled=True,
                           rounded=True)
graph = pydotplus.graph_from_dot_data(dot_data)
graph.write_png("boston.png")

To find and view the tree, search for the filename you have saved the tree under. A link to the image obtained is given below. You might have to zoom in to the image to read anything meaningful.
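As an aside, recent versions of scikit-learn (0.21+) also ship a matplotlib-based sklearn.tree.plot_tree, which avoids installing Graphviz and pydotplus entirely. A minimal sketch on a tiny invented dataset:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; drop this line for interactive use
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeRegressor, plot_tree

# A tiny invented dataset, just to have a tree to draw.
X = [[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]]
y = [10.0, 12.0, 20.0, 22.0, 30.0, 32.0]

model = DecisionTreeRegressor(random_state=0, max_depth=2)
model.fit(X, y)

plt.figure(figsize=(8, 5))
plot_tree(model, feature_names=["x"], filled=True, rounded=True)
plt.savefig("tree.png")  # or plt.show() in an interactive session
```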

The Complete Code

import pandas as pd
from sklearn.datasets import load_boston
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error
import matplotlib.pyplot as plot
from sklearn.tree import export_graphviz
import pydotplus

maeList = []
depthList = []

boston = load_boston()
X = pd.DataFrame(boston.data, columns=boston.feature_names)
y = pd.DataFrame(boston.target, columns=["target"])
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

dtr_model = DecisionTreeRegressor(random_state=0, max_depth=6)
dtr_model.fit(X_train, y_train)
y_preds = dtr_model.predict(X_test)
print(mean_absolute_error(y_test, y_preds), dtr_model.get_depth())

dot_data = export_graphviz(dtr_model, feature_names=boston.feature_names, out_file=None,
                           filled=True,
                           rounded=True)
graph = pydotplus.graph_from_dot_data(dot_data)
graph.write_png("boston.png")

for depths in range(1, 19):
    dtr_model = DecisionTreeRegressor(random_state=0, max_depth=depths)
    dtr_model.fit(X_train, y_train)
    preds = dtr_model.predict(X_test)
    this_mae = mean_absolute_error(y_test, preds)
    maeList.append(this_mae)
    depthList.append(depths)

plot.figure(1)
plot.plot(y_test, y_test)
plot.scatter(y_test, y_preds, s=10, c="red")
plot.title("Actual vs Predicted values")
plot.xlabel("Actual values")
plot.ylabel("Predicted values")

plot.figure(2)
plot.plot(depthList, maeList)
plot.title("Mean Absolute Error for different max depth values")
plot.xlabel("Depth")
plot.ylabel("MAE")

plot.show()

The github links for the complete program and the tree image are:

https://github.com/adityapentyala/Python/blob/master/decisiontreeregressor_Boston.py

https://github.com/adityapentyala/Python/blob/master/boston.png


To learn about a Decision Tree Classifier, visit Saarang’s post:
