Understanding Decision Trees in Machine Learning


The math behind decision trees and how to implement them using Python and sklearn


Decision Trees

The decision tree is a supervised machine learning algorithm that is mostly used in classification problems.

The decision tree is basically greedy, top-down, recursive partitioning. “Greedy” because at each step we pick the best split possible. “Top-down” because we start with the root node, which contains all the records, and then partition downward. “Recursive” because each resulting subset is partitioned again in the same way.

diagram showing the relationships of root node, decision node, and leaf/terminal nodes
Image by Author

Root Node → The topmost node in the decision tree is known as the root node.
Decision Node → A subnode that splits into further subnodes is known as a decision node.
Leaf/Terminal Node → A node that does not split is known as a leaf node or terminal node.


Data Set

three-column table of sample data for the variables Age, BMI, and Diabetes

Image by Author

I have taken a small data set that contains the features BMI and Age and the target variable Diabetes.

Let’s predict if a person of a given age and BMI will have diabetes or not.


Data Set Representation

Chart showing the prediction of diabetes plotted for the sample data, with BMI as the x-axis, Age as the y-axis

Image by Author

We can’t draw a single line that serves as the decision boundary. Instead, we split the data again and again to build up the decision boundary. This is how the decision tree algorithm works.

Diabetes prediction chart with BMI as the x-axis, Age as the y-axis, showing boundary lines

Image by Author

This is how partitioning happens in the decision tree.


Important Terms in Decision Tree Theory

Entropy

Entropy is a measure of randomness or uncertainty. For a binary classification problem, entropy ranges between 0 and 1. If entropy is 0, the subset is pure (no randomness). If entropy is 1, randomness is at its maximum. Entropy is denoted by H(S).

Formula

Entropy = -(P(0) * log2(P(0)) + P(1) * log2(P(1)))

P(0) → Probability of class 0

P(1) → Probability of class 1
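The formula above can be sketched as a small Python function for the binary case, using the base-2 logarithm and the usual convention that 0 * log2(0) counts as 0:

```python
import math

def entropy(p1):
    """Binary entropy of a subset, given p1 = probability of class 1."""
    p0 = 1 - p1
    # Convention: 0 * log2(0) is treated as 0 (a pure subset has no uncertainty).
    terms = [p * math.log2(p) for p in (p0, p1) if p > 0]
    return -sum(terms)

print(entropy(0.0))  # pure subset -> 0.0
print(entropy(0.5))  # maximum uncertainty -> 1.0
```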


Relationship Between Entropy and Probability

Image by Author

If entropy is 0, the subset is pure (no randomness): either all yes or all no. If entropy is 1, randomness is at its maximum.

Let’s plot a graph P(1)-Probability of Class 1 vs. Entropy.

From the above explanation, we know that:
If P(1) is 0, Entropy = 0
If P(1) is 1, Entropy = 0
If P(1) is 0.5, Entropy = 1

diagram with probability of class1 on x-axis and entropy on y-axis, showing how entropy varies between 0 and 1

Image by Author

The entropy level always ranges between 0 and 1.
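The curve above can be reproduced with a minimal matplotlib sketch (the Agg backend is selected so it runs without a display; the endpoints are nudged away from 0 and 1 to avoid log2(0)):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render without a display
import matplotlib.pyplot as plt

p1 = np.linspace(0.001, 0.999, 200)  # avoid log2(0) at the endpoints
H = -(p1 * np.log2(p1) + (1 - p1) * np.log2(1 - p1))

plt.plot(p1, H)
plt.xlabel("P(1) - probability of class 1")
plt.ylabel("Entropy")
plt.title("Entropy peaks at P(1) = 0.5")
plt.savefig("entropy_curve.png")
```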


Information Gain

Information gain is calculated for a split by subtracting the weighted entropies of each branch from the original entropy. We will use it to decide the ordering of attributes in the nodes of a decision tree.

Gain(S, A) = H(S) - Σ (|Sv| / |S|) * H(Sv), summed over all values v of A

H(S) → Entropy
A → Attribute
S → Set of examples {x}
V → Possible values of A
Sv → Subset of S for which A has value v
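The weighted-entropy subtraction can be sketched directly from class counts. This is a minimal implementation, not a library function:

```python
import math

def entropy_counts(counts):
    """Entropy of a subset described by its class counts, e.g. [4, 3]."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def information_gain(parent_counts, child_counts):
    """Gain = parent entropy minus the size-weighted entropies of the children.

    parent_counts: class counts before the split, e.g. [4, 3]
    child_counts:  class counts per branch, e.g. [[3, 0], [1, 3]]
    """
    total = sum(parent_counts)
    weighted = sum(sum(child) / total * entropy_counts(child)
                   for child in child_counts)
    return entropy_counts(parent_counts) - weighted

# A split into one pure branch and one 1-vs-3 branch, from a 4-vs-3 parent:
print(round(information_gain([4, 3], [[3, 0], [1, 3]]), 3))  # → 0.522
```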


How Does a Decision Tree Work?

In our data set, we have two attributes, BMI and Age. Our sample data set has seven records.

Let’s start building a decision tree with this data set.

Step 1. Root node

In the decision tree, we start with the root node. Let’s take all the records (seven in our given data set) as our training samples.

three-column table of sample data for the variables Age, BMI, and Diabetes

Image by Author

It has three yes and four no.
The probability of class 0 is 4/7 (four out of seven records belong to class 0):
P(0) = 4/7
The probability of class 1 is 3/7 (three out of seven records belong to class 1):
P(1) = 3/7

Calculate the Entropy of the root node

Image by Author
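The computation behind that figure can be checked in a few lines of Python. With P(0) = 4/7 and P(1) = 3/7, the root entropy works out to roughly 0.985:

```python
import math

p0, p1 = 4/7, 3/7  # class probabilities at the root (4 no, 3 yes)
root_entropy = -(p0 * math.log2(p0) + p1 * math.log2(p1))
print(round(root_entropy, 3))  # → 0.985
```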

Step 2. How does splitting occur?

We have two attributes, BMI and Age. Based on these attributes, how does splitting occur? And how do we check the effectiveness of a split?

1. If we select the attribute BMI as the splitting variable and ≤30 as the splitting point, we get one pure subset.

[A splitting point is considered at each data point in the data set, so if the data points are unique, there are n−1 candidate split points for n data points. Whichever splitting variable and splitting point give the highest information gain, that split is selected. For a large data set, it is common to consider only split points at certain percentiles (10%, 20%, 30%, …) of the distribution of values. Since this is a small data set, I selected ≤30 as the split point by inspecting the data points.]
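The bracketed note can be made concrete: candidate split points are commonly taken as midpoints between consecutive sorted unique values of a feature. A small sketch (the BMI values here are made up for illustration, not the article's actual data):

```python
def candidate_splits(values):
    """Midpoints between consecutive sorted unique feature values."""
    unique = sorted(set(values))
    return [(a + b) / 2 for a, b in zip(unique, unique[1:])]

# Hypothetical BMI values -- n unique values yield n-1 candidate split points.
bmi = [22, 25, 28, 31, 33, 38, 41]
print(candidate_splits(bmi))  # → [23.5, 26.5, 29.5, 32.0, 35.5, 39.5]
```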

diagram showing how sample data is split if BMI is less than or equal to 30

Image by Author

The entropy of a pure subset is 0.

Let’s calculate the entropy of the other subset. Here we get three yes and one no.
P(0) = 1/4 (one out of four records)
P(1) = 3/4 (three out of four records)

Image by Author

We have to calculate the information gain to decide which attribute to choose for splitting.

Image by Author

2. Let’s select the attribute Age as the splitting variable and ≤45 as the splitting point.

diagram showing how sample data is split if Age is less than or equal to 45

Image by Author

First, let’s calculate the entropy of the True subset. It has one yes and one no. It means a high level of uncertainty. The entropy is 1.

Let’s calculate the entropy of the False subset. It has two yes and three no.

Image By Author

Let’s calculate the information gain.

Image by Author

We have to choose the attribute with the highest information gain. In our example, the BMI attribute has the higher information gain, so BMI is chosen as the splitting variable.
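The two candidate splits can be compared numerically using the class counts stated above. A quick sketch (reusing a small counts-based gain helper):

```python
import math

def entropy_counts(counts):
    """Entropy of a subset described by its class counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropies of the child branches."""
    total = sum(parent)
    weighted = sum(sum(ch) / total * entropy_counts(ch) for ch in children)
    return entropy_counts(parent) - weighted

parent = [4, 3]  # 4 no, 3 yes at the root

# BMI <= 30: a pure branch (3 no) and a branch with 1 no, 3 yes
gain_bmi = information_gain(parent, [[3, 0], [1, 3]])

# Age <= 45: a branch with 1 no, 1 yes and a branch with 3 no, 2 yes
gain_age = information_gain(parent, [[1, 1], [3, 2]])

print(round(gain_bmi, 3), round(gain_age, 3))  # → 0.522 0.006
```

BMI wins by a wide margin, which is why the tree splits on BMI first.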

After splitting by the attribute BMI, we get one pure subset (a leaf node) and one impure subset. Let’s split that impure subset again, this time on the attribute Age. Then we have two pure subsets (leaf nodes).

diagram showing how sample data is split if BMI is less than or equal to 30 and Age is less than or equal to 43

Image by Author

Now we have created a decision tree with pure subsets.


Python Implementation of a Decision Tree Using sklearn

1. Import the libraries.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

2. Load the data.

df=pd.read_csv("Diabetes1.csv")
df.head()

3. Split x and y variables.

The BMI and Age attributes are taken as the x variable.
The Diabetes attribute (target variable) is taken as the y variable.

x=df.iloc[:,:2]
y=df.iloc[:,2]
x.head(3)
y.head(3)

4. Model building with sklearn

from sklearn import tree
model=tree.DecisionTreeClassifier(criterion="entropy")
model.fit(x,y)

Output: DecisionTreeClassifier(criterion='entropy')

5. Model score

model.score(x,y)

Output: 1.0

(Since we took a very small data set, the score is 1.)

6. Model prediction

Let’s predict whether a person of Age 47 and BMI 29 will have diabetes. This exact record is present in the data set.

model.predict([[29,47]])

Output: array([‘no’], dtype=object)

The prediction is no, which is the same as in the data set.

Let’s predict whether a person of Age 47 and BMI 45 will have diabetes. This record is not in the data set.

model.predict([[45,47]])

Output: array([‘yes’], dtype=object)

Predicted as yes.

7. Visualize the model

tree.plot_tree(model)
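If the plotted tree is hard to read, sklearn can also print the learned rules as text with `tree.export_text`. A self-contained sketch follows; the inline data set here is hypothetical, shaped like the article's seven records, and is not the actual Diabetes1.csv:

```python
import pandas as pd
from sklearn import tree

# Hypothetical stand-in data -- seven records mirroring the article's structure.
df = pd.DataFrame({
    "BMI":      [22, 26, 29, 33, 36, 38, 41],
    "Age":      [30, 50, 47, 48, 40, 50, 55],
    "Diabetes": ["no", "no", "no", "yes", "no", "yes", "yes"],
})
x, y = df[["BMI", "Age"]], df["Diabetes"]

# random_state pins down tie-breaking between equally good splits.
model = tree.DecisionTreeClassifier(criterion="entropy", random_state=0).fit(x, y)
print(tree.export_text(model, feature_names=["BMI", "Age"]))
```

This prints an indented rule list (one line per branch), which is often easier to inspect than the plot for small trees.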

GitHub Link

The code and data set used in this article are available in my GitHub repository.

My other blogs on machine learning:

Linear Regression in Python

Logistic Regression in Python

Naive Bayes Classifier in Machine Learning

An Introduction to Support Vector Machine

An Introduction to K-Nearest Neighbors Algorithm


I hope that you have found this article helpful. Thanks for reading!
