How to calculate information gain in decision trees

A Simple Explanation of Information Gain and Entropy

Information Gain, like Gini Impurity, is a metric used to train Decision Trees. Specifically, these metrics measure the quality of a split. For example, say we have the following data:

The Dataset

What if we made a split at x = 1.5?

An Imperfect Split

This imperfect split breaks our dataset into these branches:

  • Left branch, with 4 blues.
  • Right branch, with 1 blue and 5 greens.

It’s clear this split isn’t optimal, but how good is it? How can we quantify the quality of a split?

That’s where Information Gain comes in.

Confused? Not sure what Decision Trees are or how they’re trained? Read the beginning of my introduction to Random Forests and Decision Trees.


Before we get to Information Gain, we have to first talk about Information Entropy. For a dataset with C classes, entropy is defined as

E = -\sum_{i}^{C} p_i \log_2 p_i

where p_i is the probability of randomly picking an element of class i (i.e. the proportion of the dataset made up of class i).

The easiest way to understand this is with an example. Consider a dataset with 1 blue, 2 greens, and 3 reds. Then

E = -(p_b \log_2 p_b + p_g \log_2 p_g + p_r \log_2 p_r)

We know p_b = 1/6 because 1/6 of the dataset is blue. Similarly, p_g = 2/6 (greens) and p_r = 3/6 (reds). Thus,

E = -(\frac{1}{6} \log_2(\frac{1}{6}) + \frac{2}{6} \log_2(\frac{2}{6}) + \frac{3}{6} \log_2(\frac{3}{6})) = 1.46

What about a dataset of all one color? Consider 3 blues as an example. The entropy would be

E = -(1 \log_2 1) = 0

It’s finally time to answer the question we posed earlier: how can we quantify the quality of a split?

Let’s consider this split again:

An Imperfect Split

Before the split, we had 5 blues and 5 greens, so the entropy was

E_{before} = -(0.5 \log_2 0.5 + 0.5 \log_2 0.5) = 1

After the split, we have two branches.

Left Branch has 4 blues, so E_{left} = 0 because it’s a dataset of all one color.

Right Branch has 1 blue and 5 greens, so

E_{right} = -(\frac{1}{6} \log_2(\frac{1}{6}) + \frac{5}{6} \log_2(\frac{5}{6})) = 0.65

Now that we have the entropies for both branches, we can determine the quality of the split by weighting the entropy of each branch by how many elements it has. Since Left Branch has 4 elements and Right Branch has 6, we weight them by 0.4 and 0.6, respectively:

E_{split} = 0.4 * 0 + 0.6 * 0.65 = 0.39

We started with E_{before} = 1 entropy before the split and now are down to 0.39! Information Gain = how much Entropy we removed, so

\text{Gain} = 1 - 0.39 = 0.61

This makes sense: higher Information Gain = more Entropy removed, which is what we want. In the perfect case, each branch would contain only one color after the split, which would be zero entropy!
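All of the arithmetic above fits in a few lines of Python. This is a minimal sketch; the `entropy` helper is not from any library, and takes a list of class counts for one branch:

```python
import math

def entropy(counts):
    """Entropy in bits of a branch, given a list of class counts, e.g. [1, 5]."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

# Before the split at x = 1.5: 5 blues and 5 greens
e_before = entropy([5, 5])              # 1.0

# After the split: left branch has 4 blues, right branch 1 blue and 5 greens
e_left = entropy([4])                   # 0, all one color
e_right = entropy([1, 5])               # about 0.65

# Weight each branch by its share of the 10 elements (4/10 and 6/10)
e_split = 0.4 * e_left + 0.6 * e_right  # about 0.39
gain = e_before - e_split               # about 0.61
```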


Information Entropy can be thought of as how unpredictable a dataset is. Its formula is

E = -\sum_{i}^{C} p_i \log_2 p_i

Information Gain is calculated for a split by subtracting the weighted entropies of each branch from the original entropy. When training a Decision Tree using these metrics, the best split is chosen by maximizing Information Gain.

Want to learn more? Check out my explanation of Gini Impurity, a similar metric, or my in-depth guide Random Forests for Complete Beginners.

Entropy and Information Gain in Decision Trees | by Jeremiah Lutes


A simple look at some key Information Theory concepts and how to use them when building a Decision Tree Algorithm.

What criteria should a decision tree algorithm use to split variables/columns?

Before building a decision tree algorithm, the first step is to answer this question. Let’s take a look at one of the ways to answer it. To do so, we will need to understand and use a few key concepts from information theory.

Let’s examine this method by taking the following steps:

  1. Take a very brief look at what a Decision Tree is.
  2. Define and examine the formula for Entropy.
  3. Discuss what a Bit is in information theory.
  4. Define Information Gain and use entropy to calculate it.
  5. Write some basic Python functions using the above concepts.

In data science, the decision tree algorithm is a supervised learning algorithm for classification or regression problems. Our end goal is to use historical data to predict an outcome. Unlike linear regression, decision trees can pick up nonlinear interactions between variables in the data.

Let’s look at a very simple decision tree. Below is a workflow that can be used to make a decision on whether or not to eat a peanut butter cookie.

A decision tree example on whether or not to eat a cookie

In this example, a decision tree can pick up on the fact that you should only eat the cookie if certain criteria are met. This is the ultimate goal of a decision tree. We want to keep making decisions(splits) until certain criteria are met. Once met we can use it to classify or make a prediction. This example is very basic using only two variables ( allergy, ruining dinner). But, if you have a dataset with thousands of variables/columns how do you decide which variables/columns are the most efficient to split on? A popular way to solve this problem, especially if using an ID3 algorithm, is to use entropy and information gain.

Let’s say we have some data and we want to use it to make an online quiz that predicts something about the quiz taker. After looking at the relationships in the data we have decided to use a decision tree algorithm. If you have never been sucked into an online quiz, you can see hundreds of examples here. The goal of the quiz will be to guess if the quiz taker is from one of America’s midwest states. The questions in the quiz will revolve around whether they like a certain type of food or not. Below is a small fictional dataset with fifteen entries. Each entry has answers to a series of questions. Most questions ask whether they liked a certain type of food, to which the participant answered (1) for yes or (0) for no. The last column (“midwest?”) is our target column, meaning that once the decision tree is built, this is the classification we are trying to guess.

To get us started we will use an information theory metric called entropy. In data science, entropy is used as a way to measure how “mixed” a column is. Specifically, entropy is used to measure disorder. Let’s start by finding the entropy of our target column, “midwest?”.

Our target column, “midwest?”

There are ten people who live in the midwest and five people who don’t. If someone asked you how mixed the column is, you could say it was sort of mixed, with a majority (2/3) of the people from the midwest. Entropy gives us a way to quantify the answer “sort of mixed”. The more mixed the (1)s and (0)s in the column are, the higher the entropy. If “midwest?” had equal amounts of (1)s and (0)s, our entropy would be 1. If “midwest?” consisted only of (1)s, the entropy would be 0.

We can use the following formula to calculate entropy:

the formula for entropy

Let’s go through each step of the formula and calculate the entropy for the “midwest?” column.

  1. We need to iterate through each unique value in a single column and assign it to i. For this example, we have 2 cases (c) in the “midwest?” column: either (0) or (1).
  2. We then compute the probability of that value occurring in the data. For the case of (1), the probability is 10/15. For the case of (0), the probability is 5/15.
  3. We take the probability of each case and multiply it by the logarithm base 2 of the probability. 2 is the most common base because entropy is measured in bits (more on that later). The full explanation of why 2 is used is out of the scope of this post, but a user on Stack Exchange offers a good explanation. For the case of (1), we get 10/15*log2(10/15). For the case of (0), we get 5/15*log2(5/15).
  4. Next, we take our product from each case above and sum them together. For this example, 10/15*log2(10/15) + 5/15*log2(5/15).
  5. Finally, we negate the total sum from above: -(10/15*log2(10/15) + 5/15*log2(5/15)).

Once we put all the steps together, we get:

E = -(10/15*log2(10/15) + 5/15*log2(5/15))

Our final entropy is .918278. So, what does that really mean?
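As a quick sanity check, the steps above can be run directly in Python:

```python
import math

p_one, p_zero = 10 / 15, 5 / 15  # probabilities of (1) and (0) in "midwest?"
entropy = -(p_one * math.log2(p_one) + p_zero * math.log2(p_zero))
# entropy is about 0.918
```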

Moving forward it will be important to understand the concept of bit. In information theory, a bit is thought of as a binary number representing 0 for no information and 1 for a full bit of information. We can represent a bit of information as a binary number because it either has the value (1) or (0). Suppose there’s an equal probability of it raining tomorrow (1) or not raining(0). If I tell you that it will rain tomorrow, I’ve given you one bit of information.

We can also think of entropy as information. Suppose we have a loaded six-sided die which always lands on (3). Each time we roll the die, we know upfront that the result will be (3). We gain no new information by rolling the die, so entropy is 0. On the other hand, if the die is fair and we roll a (3), there was a 1/6 chance of rolling the (3). Now we have gained information. Thus, rolling the die gives us information: which number the die landed on.

For a deeper dive into the concept of a bit of information, you can read more here.

We get less than one “bit” of information — only .918278 — because there are more (1)s in the “midwest?” column than (0)s. This means that if we were predicting a new value, we could guess that the answer is (1) and be right more often than wrong (because there’s a 2/3 probability of the answer being 1). Due to this prior knowledge, we gain less than a full “bit” of information when we observe a new value.

Our goal is to find the best variable(s)/column(s) to split on when building a decision tree. Eventually, we want to keep splitting the variables/columns until our mixed target column is no longer mixed.

For instance, let’s look at the entropy of the “midwest?” column after we have split our dataset on the “potato_salad?” column.

split on the “potato_salad?” column

Above, our dataset is split into two sections. On the left side is everyone who likes potato salad; on the right side, everyone who doesn’t. We will focus on the left side, which now has seven people from the midwest and two who aren’t. By using the formula for entropy on the left split’s midwest column, the new entropy is .764204. This is great! Our goal is to lower the entropy, and we went from .918278 to .764204. But we can’t stop there; if we look at the right column, our entropy went up, since there are equal amounts of (1)s and (0)s. What we need is a way to see how the entropy changes on both sides of the split. The formula for information gain will do that. It gives us a number to quantify how many bits of information we have gained each time we split our data.

Earlier we established we want splits that lower the entropy of our target column. When we split on “potato_salad?” we saw that entropy in the “midwest?” went down on the left side. Now we need to understand the total entropy lowered when we look at both sides of the split. Let’s take a look at information gain.

Information gain will use the following formula:

Let’s break down what is going on here.

We’ll go back to our “potato_salad?” example. The variables in the above formula will represent the following:

  • T = Target, our “midwest?” column
  • A = the variable(column) we are testing, “potato_salad?”
  • v = each value in A, each value in the “potato_salad?” column
  1. First, we’ll calculate the original entropy for (T) before the split, .918278.
  2. Then, for each unique value (v) in variable (A), we compute the number of rows in which (A) takes on the value (v), and divide it by the total number of rows. For the “potato_salad?” column we get 9/15 for the unique value of (1) and 6/15 for the unique value of (0).
  3. Next, we multiply the results by the entropy of the rows where (A) is (v). For the left split (split on 1 for “potato_salad?”), we get 9/15 * .764204. For the right side of the split (split on 0 for “potato_salad?”), we get 6/15 * 1.
  4. We add all of these subset products together: 9/15*.764204 + 6/15*1 = .8585224.

  5. We then subtract this from the overall entropy to get the information gain: .918278 - .8585224 = .059756.

Our information gain is .059756. What does that tell us?

Here’s an alternate explanation. We’re finding the entropy of each set post-split, weighting it by the number of items in each split, then subtracting from the current entropy. If the result is positive, we’ve lowered entropy with our split. The higher the result is, the more we’ve lowered entropy.

We end up with .059756, which means that we gain .059756 bits of information by splitting our data set on the “potato_salad?” variable/column. Our information gain is low, but it’s still positive because we lowered the entropy on the left side of the split.

Now we need to repeat this process for every column we are using. Instead of doing this by hand let’s write some Python code.

Now that we understand information gain, we need a way to repeat this process to find the variable/column with the largest information gain. To do this, we can create a few simple functions in Python.

Importing the Data

Let’s turn our above table into a DataFrame using the Python pandas library. We will import pandas and use the read_csv() function to make a DataFrame named “midwest”.

import pandas as pd
midwest = pd.read_csv('midwest.csv')

A Python Function for Entropy

For this function, we will need the NumPy library to use the bincount() function and the math module to use the log() function.

import numpy
import math

Next, we will define our function with one parameter. The argument given will be the series, list, or NumPy array in which we are trying to calculate the entropy.

def calc_entropy(column):

We will need to find the percentage of each case in the column. We can use the numpy.bincount() function for this. The return value is a NumPy array which will store the count of each unique value from the column that was passed as an argument.

counts = numpy.bincount(column)

We’ll store the probabilities of each unique value by dividing the “counts” array by the length of the column.

probabilities = counts / len(column)

We can then initialize a variable named “entropy” and set it to 0.

entropy = 0

Next, we can use a “for loop” to loop through each probability in our probabilities array, multiply it by the logarithm base 2 of the probability using the math.log() function, and add each term to our stored entropy variable. *Be sure to check that the probability is greater than 0; otherwise log(0) is undefined.

for prob in probabilities:
    if prob > 0:
        entropy += prob * math.log(prob, 2)

Finally, we’ll return our negated entropy variable.

return -entropy

All together now:
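The original post embedded the assembled function as a gist; reconstructed from the pieces above, it looks like this:

```python
import numpy
import math

def calc_entropy(column):
    """Calculate the entropy of a pandas Series, list, or NumPy array of labels."""
    # Count occurrences of each unique non-negative integer label
    counts = numpy.bincount(column)
    # Convert counts to probabilities
    probabilities = counts / len(column)

    entropy = 0
    for prob in probabilities:
        if prob > 0:
            entropy += prob * math.log(prob, 2)

    return -entropy
```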

Great! Now we can build a function to calculate information gain.

A Python Function for Information Gain

We’ll need to define a function that will have three parameters, one for the entire dataset, one for the name of the column we want to split on, and one for the name of our target column.

def calc_information_gain(data, split_name, target_name):

Next, we can use the entropy function from earlier to calculate the original entropy of our target column.

original_entropy = calc_entropy(data[target_name])

Now we need to split our column.

*For this example we will only use the variables/columns with two unique values. If you want to split on a variable/column such as “age”, there are several ways to do this. One way is to split on every unique value. Another way is to simplify the calculation of information gain and make splits simpler by not splitting for each unique value. Instead, the median is found for the variable/column being split on. Any rows where the value of the variable is below the median will go to the left branch, and the rest of the rows will go to the right branch. To compute information gain, we’ll only have to compute entropies for two subsets. We won’t be walking through this method, but once the split on the median is performed, the rest of the steps would be the same as outlined below.

Since the columns we are working with only have two unique values we will make a left split and a right split.

We’ll start by using pandas.Series.unique() to give us an array of the unique values in the column.

values = data[split_name].unique()

Next, we will create a left and right split using “values”.

left_split = data[data[split_name] == values[0]]
right_split = data[data[split_name] == values[1]]

Now we can initiate a variable to subtract from our original entropy.

to_subtract = 0

Then we’ll iterate through each subset created by our split, calculate the probability of the subset, and add the product of that probability and the subset’s target column’s entropy.

for subset in [left_split, right_split]:
    prob = (subset.shape[0] / data.shape[0])
    to_subtract += prob * calc_entropy(subset[target_name])

Finally, we can return the difference of to_subtract subtracted from the original entropy.

return original_entropy - to_subtract

The entire function is below.
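Reconstructed from the snippets above (with calc_entropy repeated so the example is self-contained):

```python
import math
import numpy
import pandas as pd

def calc_entropy(column):
    """Entropy of a series/list/array of class labels (from the previous section)."""
    counts = numpy.bincount(column)
    probabilities = counts / len(column)
    entropy = 0
    for prob in probabilities:
        if prob > 0:
            entropy += prob * math.log(prob, 2)
    return -entropy

def calc_information_gain(data, split_name, target_name):
    """Information gain from splitting `data` on a column with two unique values."""
    original_entropy = calc_entropy(data[target_name])

    # Split the dataset into the rows for each of the two unique values
    values = data[split_name].unique()
    left_split = data[data[split_name] == values[0]]
    right_split = data[data[split_name] == values[1]]

    # Weight each subset's entropy by its share of the rows
    to_subtract = 0
    for subset in [left_split, right_split]:
        prob = subset.shape[0] / data.shape[0]
        to_subtract += prob * calc_entropy(subset[target_name])

    return original_entropy - to_subtract
```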

A Python Function for the Highest Information Gain

Our final function will be one that will return the variable/column name with the highest information gain.

As mentioned earlier, we are only using the columns with two unique values for this example. We’ll store those column names in a list to use in the function. To keep things simple, we’ll hard-code this list for this example, but on a large dataset it would be best to write code that builds the list dynamically based on the criteria we use to choose the columns.

columns = ['apple_pie?', 'potato_salad?', 'sushi?']

Let’s wrap the final step in a function so we can reuse it as needed. It will have one parameter, the list of columns we want to find the highest information gain for.

def highest_info_gain(columns):

We’ll initialize an empty dictionary to store our information gains.

information_gains = {}

And then we can iterate through the list of columns and store the result in our information_gains dictionary.

for col in columns:
    information_gain = calc_information_gain(midwest, col, 'midwest?')
    information_gains[col] = information_gain

Finally, we can return the key of the highest value in our dictionary.

return max(information_gains, key=information_gains.get)

All together now:
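A runnable sketch of the whole pipeline. Since the original CSV isn’t included here, the `midwest` DataFrame below is a stand-in constructed to match the counts quoted in this article; the values in the `apple_pie?` column are invented for illustration:

```python
import math
import numpy
import pandas as pd

def calc_entropy(column):
    """Entropy of a series/list/array of class labels."""
    counts = numpy.bincount(column)
    probabilities = counts / len(column)
    entropy = 0
    for prob in probabilities:
        if prob > 0:
            entropy += prob * math.log(prob, 2)
    return -entropy

def calc_information_gain(data, split_name, target_name):
    """Information gain from splitting `data` on a two-valued column."""
    original_entropy = calc_entropy(data[target_name])
    values = data[split_name].unique()
    left_split = data[data[split_name] == values[0]]
    right_split = data[data[split_name] == values[1]]
    to_subtract = 0
    for subset in [left_split, right_split]:
        prob = subset.shape[0] / data.shape[0]
        to_subtract += prob * calc_entropy(subset[target_name])
    return original_entropy - to_subtract

# Stand-in for the article's fictional 15-person quiz dataset
midwest = pd.DataFrame({
    'apple_pie?':    [1]*5 + [0]*5 + [1]*3 + [0]*2,
    'potato_salad?': [1]*7 + [0]*3 + [1]*2 + [0]*3,
    'sushi?':        [1]*8 + [0]*2 + [1]*1 + [0]*4,
    'midwest?':      [1]*10 + [0]*5,
})

columns = ['apple_pie?', 'potato_salad?', 'sushi?']

def highest_info_gain(columns):
    """Return the column name with the largest information gain vs 'midwest?'."""
    information_gains = {}
    for col in columns:
        information_gain = calc_information_gain(midwest, col, 'midwest?')
        information_gains[col] = information_gain
    return max(information_gains, key=information_gains.get)

print(highest_info_gain(columns))  # prints sushi?
```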

Once we execute our final function

print(highest_info_gain(columns))

we see the variable/column with the highest information gain is ‘sushi?’.

We can visualize a split on sushi below:

Splitting our dataset on the sushi column

Our left split has two people out of six from the midwest. The right split has eight out of nine people from the midwest. This was an efficient split that lowered our entropy on both sides. If we were to continue, we would use recursion to keep splitting each branch, aiming to end each branch with an entropy of zero.

Decision trees can be a useful machine learning algorithm to pick up nonlinear interactions between variables in the data. In this example, we looked at the beginning stages of a decision tree classification algorithm. We then looked at three information theory concepts, entropy, bit, and information gain. By using these concepts we were able to build a few functions in Python to decide which variables/columns were the most efficient to split on. With a firm grasp on these concepts, we can move forward to build a decision tree.

The right decision tree for the right choice

We show how to use decision trees for machine learning, data visualization, and visual demonstration of the decision-making process. This will be useful not only for data analysts, but also for anyone looking for a methodology that helps make more balanced decisions in life and business.

This article is based on the translation: How to Create a Perfect Decision Tree by Upasana Priyadarshiny.

The main tasks that a decision tree solves in machine learning, data analysis and statistics:

  • Classification - when you need to attribute an object to a specific category, taking into account its features.
  • Regression - using data to predict a quantitative trait, taking into account the influence of other traits.

Decision trees can also help visualize the decision-making process and make the right choice in situations where the results of one decision affect the results of subsequent decisions. Let's try to explain how it works with simple real-life examples.

What is a decision tree

Visually, a decision tree can be represented as a map of possible outcomes from a series of interrelated choices. This helps to compare possible actions based on their cost, likelihood, and benefit.

As the name suggests, this uses a tree decision model. Such tree diagrams can be useful both in the process of discussing something, and for compiling an algorithm that mathematically determines the best choice.

Typically, a decision tree starts with a single node that branches off into possible outcomes. Each of them continues the scheme and creates additional nodes that continue to develop along the same lines. This gives the model a tree shape.

Decision tree: to play or not to play. The weather forecast is taken into account

There can be three different types of nodes in the decision tree:

  • Decision nodes - show a decision to be made.
  • Chance nodes - show the probabilities of certain outcomes.
  • End nodes - show the final result of a decision path.

Advantages and Disadvantages of the Decision Tree Method

Advantages of the method:

  1. Decision trees are created according to clear rules, they are easy to apply and interpret.
  2. Both continuous and qualitative (discrete) variables can be processed.
  3. You can work with gaps in the data: decision trees allow you to fill an empty field with the most likely value.
  4. Helps you determine which fields are more important for prediction or classification.

Disadvantages of the method:

  1. There is a possibility of errors in classification problems with a large number of classes and a relatively small number of training examples.
  2. Process instability: a change in one node can lead to the construction of a completely different tree, due to the hierarchical nature of its structure.
  3. The process of "growing" a decision tree can be quite computationally expensive, because at each node every attribute must be evaluated until the best split is found. Some algorithms use combinations of fields, in which case the optimal combination must be found by the "weight" of the coefficients. The pruning algorithm is also expensive, since a large number of potential branches must be formed and compared.

How to create a decision tree

As an example, let's take a scenario where a group of astronomers is studying a planet to see if it can become a new Earth.

There are N decisive factors that need to be carefully considered in order to make a reasonable decision. These factors can be the presence of water on the planet, the temperature range, whether the surface is subject to constant storms, whether flora and fauna can survive in this climate, and hundreds of other parameters.

We will explore the problem through the decision tree.

  1. Is the habitable temperature between 0 and 100 degrees Celsius?
  2. Is there water there?
  3. Do flora and fauna flourish?
  4. Is the surface of the planet prone to storms?

Thus, we have a complete decision tree.

Classification rules

First, let's define the terms and their meanings.

Classification rules are cases in which all scenarios are considered and a class variable is assigned to each.

The class variable is the final result that our solution leads to. A class variable is assigned to each leaf (terminal) node.

Here are the classification rules from the example decision tree about exploring a new planet:

  1. If the temperature is not between -0.15°C and 99.85°C, then survival is difficult.
  2. If the temperature is between -0.15°C and 99.85°C, but there is no water, then survival is difficult.
  3. If the temperature is between -0.15°C and 99.85°C, there is water but no flora and fauna, then survival is difficult.
  4. If the temperature is between -0.15 °C and 99.85 °C, there is water, there is flora and fauna, the surface is not prone to storms, then survival is possible.
  5. If the temperature is between -0.15°C and 99.85°C, there is water, there is flora and fauna, but the surface is prone to storms, then survival is difficult.

Why is it difficult to build an ideal decision tree?

  • Internal Node - a node with one incoming and two or more outgoing connections.
  • Leaf Node is the final element without outgoing connections.
  • When the decision tree is created, we start at the root node, check the test conditions, and assign the control to one of the outgoing connections. Then we test the conditions again and assign the next node. To consider the tree complete, you need to bring all the conditions to the leaf node (Leaf Node). The leaf node will contain all the class labels that influence the pros and cons of the decision.

    Please note that we started with the “temperature” feature and considered it to be the root. If you choose a different feature, then the tree will turn out different. This is the principle of the method - you need to choose the optimal root and use it to build a tree, decide which tree is needed to complete the task.

    There are different ways to find the most appropriate decision tree for a particular situation. Below we will talk about one of them.

    How to use a greedy algorithm to build a decision tree

    The essence of the approach is the principle of so-called greedy maximization of information gain. It is based on heuristic problem solving: making the optimal local choice at each node, thus reaching a solution that is most likely to be optimal.

    The algorithm can be simplified as follows:

    • On each node, select the best verification method.
    • Then split the node into all possible outcomes (internal nodes).
    • Repeat steps until all conditions result in end nodes.

    The main question is: "How do we choose the initial conditions for testing?" The answer lies in the values of entropy and information gain. Let's look at what they are and how they affect the creation of our tree:

    1. Entropy - in a decision tree, this means uniformity. If the data is completely homogeneous, the entropy is 0; if the data is split 50-50%, the entropy is 1. The simplest way to explain the term: it is how much decision-relevant information we don't yet know.
    2. Information gain - moves inversely to entropy: the higher the information gain, the lower the entropy, the less unaccounted-for data, and the better the solution.

    In total: we select the attribute with the highest information gain for the next stage of splitting. This helps to choose the best solution at any node.

    See example. We have a large amount of such data:

    Looking for the optimal solution with a greedy algorithm

    First step: create two conditional classes:

    • "Yes": a person can buy a computer.
    • "No": a person will not buy one.

    The second step is to calculate the probability value for each of them.

    1. For the result "Yes", "buys_computer=yes", the formula is: $$P(buys = yes) = {9 \over 14}$$
    2. For the result "No", "buys_computer=no": $$P(buys = no) = {5 \over 14}$$

    Third step: calculate the entropy value by putting the probability values into the formula.

    $$Info(D) = I(9,5) = -{9 \over 14}log_{2}({9 \over 14})-{5 \over 14}log_{2}({5 \over 14})=0.940$$

    This is a fairly high entropy.
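The same calculation can be checked in a couple of lines of Python (a small sketch):

```python
import math

# Probabilities of "yes" (9 of 14) and "no" (5 of 14) buyers
p_yes, p_no = 9 / 14, 5 / 14
info_d = -(p_yes * math.log2(p_yes) + p_no * math.log2(p_no))
# info_d is about 0.940
```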

    Fourth step: go deeper and calculate the information gain for each attribute, in order to find the appropriate root attribute for the decision tree.

    An explanation is needed here. For example, what would the information gain indicate if we took the "Age" attribute as a basis? This data shows how many people in each age group buy and don't buy a product.

    For example, among people aged 30 and under, two people buy (Yes) and three people don't (No). An entropy is calculated for each of the three age ranges, and the age-conditioned entropy is the weighted sum over those three ranges.

    As a result, the difference between the total entropy Info(D) and the entropy conditioned on age, Info_age(D) (here equal to 0.694), is the desired information gain: Gain(Age) = Info(D) - Info_age(D) = 0.940 - 0.694 = 0.246.
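A hedged sketch of this step in Python. Only the under-30 group's counts (2 yes, 3 no) are quoted in this article; the counts for the other two groups are assumed from the classic buys_computer dataset (31-40: 4 yes, 0 no; over 40: 3 yes, 2 no):

```python
import math

def entropy2(yes, no):
    """Binary entropy (in bits) of a yes/no count pair."""
    total = yes + no
    e = 0.0
    for count in (yes, no):
        if count:
            p = count / total
            e -= p * math.log2(p)
    return e

# Weight each age group's entropy by its share of the 14 rows
info_age = (5/14) * entropy2(2, 3) + (4/14) * entropy2(4, 0) + (5/14) * entropy2(3, 2)
gain_age = 0.940 - info_age  # about 0.246
```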


    Let's compare the information gain rates for all attributes:

    • Age = 0.246.
    • Income = 0.029.
    • Student = 0.151.
    • Credit score = 0.048.

    It turns out that the information gain for the age attribute is the most significant, so it's worth splitting on it. Similarly, we compare the information gain at each split to decide whether to take a given attribute or not.

    Thus, the optimal tree looks like this:

    The classification rules for this tree can be written as follows:

    • If a person is less than 30 years old and is not a student, then he will not buy a computer.
    • If a person's age is less than 30 years old and he is a student, he will buy a computer.
    • If a person is between 31 and 40 years old, they are more likely to buy a computer.
    • If a person is over 40 and has an excellent credit score, they will not buy a computer.
    • If a person is over 40 years old, with an average credit score, they are likely to buy a computer.

    Here we have reached the optimal decision tree. Ready.

    From Decision Tree to Random Forest

    Decision trees are a set of very popular supervised classification algorithms. They are very popular for several reasons: they are quite good at classification problems, the decision path is relatively easy to interpret, and the algorithm for building (training) them is fast and simple.


    There is also an ensemble version of the decision tree: random forest. A random forest is essentially a collection of N decision trees, which increases the reliability of the predictions.

    In this article, we provide a brief overview of the decision tree growth algorithm, its performance metrics, techniques to avoid overfitting the training set, and the enhancements introduced by random forest of decision trees.

    What is a decision tree?

    A decision tree is a flowchart-like structure consisting of nodes and branches (Figure 1). At each node, data splitting is performed based on one of the input features, generating two or more branches as output. More and more splits are performed in the upcoming nodes, and more branches are generated to split the original data. This continues until a node is generated in which all or almost all of the data belongs to the same class, and no further splits - or branches - are possible anymore.

    This whole process generates a tree structure. The first split node is called the root node; the terminal nodes are called leaves and are associated with a class label. The paths from root to leaf form the classification rules. If only binary splits are possible, we speak of binary trees. Here, however, we deal with the more general case of non-binary decision trees.

    Let's use an example. We have collected data on a person's past sailing plans, i.e., whether the person went sailing or not, based on various external conditions - or "input features" - such as wind speed in knots, maximum temperature, the outlook, and whether the boat was in winter storage. Input features are often also referred to as non-target attributes or explanatory variables. Now we want to build a decision tree that will predict the sailing outcome (yes or no). The sailing outcome feature is also known as the target or dependent variable.

    If we knew the exact classification rules, we could build a decision tree by hand. But this is rarely the case. We usually have data: input features on one side and prediction targets on the other side. A number of automated procedures can help us extract rules from the data to build such a decision tree, such as ID3 and C4.5 (J. Ross Quinlan) or the CART algorithm. In all of them, the goal is to train a decision tree that defines rules for predicting the target variable, which in our example is whether we go sailing on a new day or not.

    Building a decision tree

    Let's see how to automatically build a decision tree by following one of the algorithms listed above.

    The goal of any of these algorithms is to subdivide the training set until each subset is "clean" in terms of the target class or small enough. To be clear:

    • A pure subset is a subset that contains only instances of one class.
    • Each split operation is implemented by a rule that splits the incoming data based on the values of one of the input features.

    In summary, a decision tree is made up of three different building blocks: nodes, branches, and leaves. The nodes identify the split function and implement the split operation on the input data subset; branches extend from the node and identify different subsets of the data obtained by the split operation; and the leaves at the end of the path identify the final data subset and associate the class with that particular decision path.

    For example, in the tree in Figure 1, the split at the first node is on the "Storage" feature and divides the input subset into two subsets: one where "Storage" is "yes" and one where "Storage" is "no". If we follow the path for the data rows where Storage = yes, we find the second splitting rule. Here, the input dataset is split into two datasets: one where Wind > 6 and another where Wind ≤ 6.

    The goal of any of these algorithms is to recursively divide the training set into subsets until each subset becomes as pure as possible in terms of the output class. Therefore, at each step, the algorithm uses the feature that results in the purest output subsets.

    Quality measures

    At each iteration, in order to decide which feature leads to the purest subsets, we must be able to measure the purity of a dataset. Various metrics and indices have been proposed in the past. Here we describe some of them, perhaps the most commonly used: information gain, the Gini index, and the gain ratio.

    During training, the chosen measure of quality is calculated for all candidate possibilities to determine which one will give the best separation.


    Entropy

    Entropy is a concept used to measure information or disorder. And, of course, we can use it to measure the purity of a dataset.

    If we consider the target classes as the possible states of a data point, the entropy of a dataset can be defined as minus the sum, over all classes, of the probability of each class multiplied by the logarithm of that probability. For a binary classification problem, the entropy then ranges between 0 and 1.

    Entropy(p) = − Σᵢ pᵢ log₂(pᵢ)

    Where p is the entire dataset, the sum runs over the N classes i = 1 … N, and pᵢ is the frequency of class i in the dataset.

    To better understand entropy, let's work on two different example datasets, both with two classes, respectively represented as blue dots and red crosses (Figure 2). In the example dataset on the left, we have a mixture of blue dots and red crosses. In the example dataset on the right, we only have red crosses. This second situation - a dataset with only samples from one class - is what we are aiming for: a "pure" subset of the data.

    Let's now calculate the entropy for these two sets of binary data.

    For the example on the left, the probability is 7/13 for the class with red crosses and 6/13 for the class with blue dots. Note that here we have almost as many data points for one class as for the other. The above formula results in an entropy value of 0.99.

    For the example on the right, the probability is 13/13 = 1.0 for the red cross class and 0/13 = 0.0 for the blue dot class. Note that here we only have red crosses. In this case, the formula above results in an entropy value of 0.0.
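    The two entropy values above can be reproduced in a few lines of Python. This is a minimal sketch; the label strings are arbitrary placeholders for the two classes.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return sum(-(n / total) * log2(n / total)
               for n in Counter(labels).values())

mixed = ["cross"] * 7 + ["dot"] * 6  # left dataset: 7 red crosses, 6 blue dots
pure = ["cross"] * 13                # right dataset: 13 red crosses only

print(entropy(mixed))  # ≈ 0.996, the ~0.99 value reported above
print(entropy(pure))   # 0.0
```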

    Entropy can be a measure of purity, disorder, or information. Due to the mixed classes, the dataset on the left is less pure and more confusing (more disorder, i.e. higher entropy). However, more confusion also means more information. Indeed, if there are only points of the same class in the data set, there is not much information that can be extracted from it, no matter how long you try. By comparison, if a dataset has points from both classes, it also has a higher potential for extracting information. So the higher entropy value of the dataset on the left can also be seen as more potential information.

    The goal of each split in a decision tree is to move from a mixed dataset to two (or more) purer subsets. Ideally, the split should result in subsets with an entropy of 0.0. In practice, however, it is sufficient if the split results in subsets with an overall lower entropy than the original dataset.

    Figure 3. A split at a tree node must move from a dataset with higher entropy to subsets with lower total entropy.

    Information gain (ID3)

    To evaluate how good a feature is for splitting, the difference in entropy before and after splitting is calculated.

    That is, first we calculate the entropy of the dataset before splitting, and then we calculate the entropy for each subset after splitting. Finally, the sum of the output entropies, weighted by subset size, is subtracted from the entropy of the dataset before splitting. This difference measures the gain in information or the decrease in entropy. If the information gain is a positive number, it means we are moving from a confusing dataset to a series of cleaner subsets.

    InformationGain = Entropy(before) − Σⱼ wⱼ · Entropy(j, after)

    Where "before" is the dataset before the split, the sum runs over the K subsets j = 1 … K created by the split, (j, after) is subset j after the split, and wⱼ = |(j, after)| / |before| is the fraction of the data that falls into subset j.

    At each step, we then choose to split the data on the feature with the highest information gain, as this results in the purest subsets. The algorithm that applies this measure is the ID3 algorithm. The disadvantage of the ID3 algorithm is its preference for features with a large number of values, which leads to large decision trees.
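    The calculation described above, i.e., parent entropy minus the size-weighted entropy of the subsets, can be sketched as follows. A perfect split into pure subsets recovers the full entropy of the parent.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return sum(-(n / total) * log2(n / total)
               for n in Counter(labels).values())

def information_gain(parent, subsets):
    """Entropy of the parent minus the size-weighted entropy of the
    subsets produced by a split."""
    total = len(parent)
    after = sum(len(s) / total * entropy(s) for s in subsets)
    return entropy(parent) - after

parent = ["yes"] * 5 + ["no"] * 5
# a perfect split into two pure subsets recovers the full parent entropy
print(information_gain(parent, [["yes"] * 5, ["no"] * 5]))  # 1.0
```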

    Gain ratio (C4.5)

    The gain ratio measure used in the C4.5 algorithm introduces the concept of SplitInfo. SplitInfo is defined as minus the sum of the weights multiplied by the logarithm of the weights, where the weights are the ratio of the number of data points in each subset to the number of data points in the parent dataset.

    The gain ratio is then calculated by dividing the information gain from the ID3 algorithm by the SplitInfo value.

    GainRatio = InformationGain / SplitInfo, with SplitInfo = − Σⱼ wⱼ log₂(wⱼ)

    Where "before" is the dataset before the split, the sum runs over the K subsets j = 1 … K created by the split, and wⱼ = |(j, after)| / |before| is the fraction of the data in subset (j, after).
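    The gain ratio can be sketched as below. SplitInfo penalizes splits that shatter the data into many tiny subsets; in this even two-way example, both the information gain and SplitInfo equal 1.0 bit.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return sum(-(n / total) * log2(n / total)
               for n in Counter(labels).values())

def gain_ratio(parent, subsets):
    """C4.5 gain ratio: information gain divided by SplitInfo."""
    total = len(parent)
    weights = [len(s) / total for s in subsets]
    gain = entropy(parent) - sum(w * entropy(s)
                                 for w, s in zip(weights, subsets))
    split_info = sum(-w * log2(w) for w in weights if w > 0)
    return gain / split_info

parent = ["yes"] * 5 + ["no"] * 5
print(gain_ratio(parent, [["yes"] * 5, ["no"] * 5]))  # 1.0 / 1.0 = 1.0
```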

    Gini index (CART)

    Another indicator of purity - or actually impurity - used by the CART algorithm is the Gini index.

    The Gini index is based on the Gini impurity. The Gini impurity is defined as 1 minus the sum of the squared probabilities of the classes in the dataset.

    GiniImpurity(p) = 1 − Σᵢ pᵢ²

    Where p is the entire dataset, the sum runs over the N classes i = 1 … N, and pᵢ is the frequency of class i in the dataset.

    The Gini index is then defined as the weighted sum of the Gini impurity of the various subsets after splitting, where each part is weighted by the ratio of the subset size to the size of the parent data set.

    For a dataset with two classes, the Gini index ranges between 0 and 0.5: 0 if the dataset is pure and 0.5 if the two classes are equally distributed. Thus, the feature with the lowest Gini index is used for the next split.

    GiniIndex = Σⱼ wⱼ · GiniImpurity(j, after)

    Where the sum runs over the K subsets j = 1 … K generated by the split, (j, after) is subset j after the split, and wⱼ is the fraction of the data in that subset.
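    Both quantities can be sketched in a few lines. An evenly mixed two-class dataset has the maximum impurity of 0.5, while a split into pure subsets yields a Gini index of 0.0.

```python
from collections import Counter

def gini_impurity(labels):
    """1 minus the sum of squared class probabilities."""
    total = len(labels)
    return 1 - sum((n / total) ** 2 for n in Counter(labels).values())

def gini_index(subsets):
    """Size-weighted sum of the Gini impurities of the subsets."""
    total = sum(len(s) for s in subsets)
    return sum(len(s) / total * gini_impurity(s) for s in subsets)

print(gini_impurity(["yes"] * 5 + ["no"] * 5))  # 0.5: evenly mixed
print(gini_index([["yes"] * 5, ["no"] * 5]))    # 0.0: both subsets pure
```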

    Identifying the splits

    For a nominal feature, we have two different split options. We can either create a child node for each value the feature takes in the training set, or perform a binary split. In the case of a binary split, all possible subsets of feature values are tested. This latter process is more computationally expensive, but still relatively simple.

    For numerical features, identifying the best split is more difficult. Every numeric value could in principle be a candidate split point, but that would make computing the quality measures too expensive! Hence, for numeric features, the split is always binary. In the training data, the candidate split points are taken between every two consecutive values of the selected numerical feature, and the binary split producing the best quality measure is adopted. The split point itself can then be the midpoint between the two consecutive values, the largest value of the lower subset, or the smallest value of the upper subset.
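    Taking the midpoint option, candidate split points for a numeric feature can be enumerated like this (a minimal sketch):

```python
def candidate_split_points(values):
    """Midpoints between consecutive distinct values of a numeric
    feature; each midpoint is one candidate binary split."""
    distinct = sorted(set(values))
    return [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]

print(candidate_split_points([6, 3, 9, 3]))  # [4.5, 7.5]
```

    Each candidate would then be scored with one of the quality measures above, and the best-scoring split adopted.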

    Size and overfit

    Decision trees, like many other machine learning algorithms, can overfit the training data. Trees that are too deep can produce models that are too detailed and do not generalize to new data. On the other hand, trees that are too shallow can lead to overly simple models that underfit the data. Thus, the size of the decision tree is critical.

    Figure 4. The size of the decision tree is important. A tree that is too large and detailed (right) may overfit the training data, while a tree that is too small (left) may be too simple to fit the data.

    There are two ways to avoid an overspecialized tree: pruning and/or stopping early.


    Pruning

    Pruning is applied to the decision tree after the training phase. In essence, we allow the tree to grow freely as far as its settings allow, without applying any explicit restrictions. At the end, we prune back those branches that are sparsely populated, so as to avoid overfitting the training data. Indeed, sparsely populated branches are likely to over-focus on singular data points. Removing them should therefore help generalization on new, unseen data.

    There are many different pruning methods. Here we explain the two most commonly used: reduced error pruning and minimum description length pruning, MDL for short.

    In reduced error pruning, at each iteration a sparsely populated branch is pruned and the tree is applied again to the training data. If pruning the branch does not reduce accuracy on the training set, the branch is removed permanently.

    MDL pruning uses the description length to decide whether a branch should be pruned. The description length is defined as the number of bits needed to encode the tree plus the number of bits needed to encode the misclassified training samples. When a tree branch is pruned, the description lengths of the unpruned tree and the pruned tree are calculated and compared. If the description length of the pruned tree is shorter, the pruning is kept.

    Early stop

    Another way to avoid overfitting is to stop early based on a stop criterion.

    One of the common stopping criteria is the minimum number of samples per node. A branch stops growing when a node would contain a number of data samples less than or equal to the set minimum. Thus, a higher value of this minimum results in shallower trees, while a smaller value results in deeper trees.
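    The stopping check can be sketched as a small predicate, which a tree-growing routine would evaluate before attempting a new split. The pure-node check is included here as well, since a pure node never needs further splitting.

```python
def stop_splitting(node_labels, min_samples=5):
    """Early-stopping check: stop growing a branch when the node holds
    min_samples or fewer samples, or when it is already pure."""
    return len(node_labels) <= min_samples or len(set(node_labels)) == 1

print(stop_splitting(["yes", "no", "yes"]))  # True: only 3 samples left
print(stop_splitting(["yes", "no"] * 5))     # False: 10 mixed samples
```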

    Random forest of decision trees

    As we said at the beginning, evolving the decision tree toward more reliable predictions has led to the random forest. Let's see how the random forest model improves on the original decision tree algorithms.

    Many are better than one. This, simply put, is the concept behind the random forest algorithm: many decision trees can produce more accurate predictions than a single decision tree. Indeed, the random forest algorithm is a supervised classification algorithm that builds N decision trees, each trained in a slightly different way, and combines them to obtain more accurate and stable predictions.

    Let's emphasize this concept a second time. The whole idea relies on several decision trees, which are all trained in slightly different ways, and all of which are taken into account for the final decision.

    Figure 5. The random forest algorithm is based on many decision trees that are trained in slightly different ways; they are all taken into account for the final classification.

    Bootstrap training sets

    Let's focus on "a little different".

    The random forest learning algorithm applies the general technique of bagging to tree learners. A single decision tree is trained on the entire training set. In a random forest, N decision trees are each trained on a subset of the original training set obtained by bootstrapping the original dataset, that is, by random sampling with replacement.

    In addition, the input features can also differ from tree to tree, as random subsets of the original feature set. As a rule of thumb, if m is the number of input features in the original dataset, a subset of √m randomly extracted input features is used to train each decision tree.
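    Both randomization steps can be sketched as below. The feature names echo the sailing example; the seed value is arbitrary.

```python
import math
import random

def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement (bagging)."""
    return [rng.choice(rows) for _ in rows]

def random_feature_subset(features, rng):
    """Draw about sqrt(m) of the m input features, without replacement."""
    k = max(1, round(math.sqrt(len(features))))
    return rng.sample(features, k)

rng = random.Random(42)
rows = list(range(100))
sample = bootstrap_sample(rows, rng)
print(len(sample))  # 100 rows, with repeats
print(random_feature_subset(["wind", "temp", "outlook", "storage"], rng))
```

    Each of the N trees would be trained on its own bootstrap sample and feature subset.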

    Figure 6. The decision trees of the random forest are all trained slightly differently, each on a bootstrapped subset of the original dataset. The set of input features also varies for each decision tree in the random forest.

    Majority rule

    N slightly different trees will give N slightly different predictions for the same input vector. Usually, majority rule is used to make the final decision. The forecast offered by the majority of N trees is taken as final.

    The advantage of this strategy is obvious. While the predictions of a single tree are very sensitive to noise in the training set, the majority vote of many trees is not, provided the trees are not correlated. Bootstrapping is a way to decorrelate the trees by training them on different training sets.
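    The majority rule itself is a one-liner (a minimal sketch; `Counter.most_common` breaks exact ties by first occurrence):

```python
from collections import Counter

def majority_vote(predictions):
    """Final forest prediction: the class predicted by most trees."""
    return Counter(predictions).most_common(1)[0][0]

print(majority_vote(["yes", "no", "yes", "yes", "no"]))  # yes
```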

    Out of Bag (OOB) Error

    A popular metric for measuring random forest prediction error is out-of-bag error.

    The out-of-bag error is the mean prediction error over all training samples xᵢ, where each xᵢ is predicted using only the trees that did not have xᵢ in their bootstrap sample. This estimate avoids the need for an independent validation dataset, but may underestimate the actual performance improvement.
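    The definition above can be sketched as follows. This is an illustrative sketch only: `trees` are stand-in predict functions, and `in_bag[t]` is our own name for the set of row indices used to train tree t.

```python
def oob_error(rows, labels, trees, in_bag):
    """Mean misclassification rate where each row is scored only by
    trees whose bootstrap sample (in_bag[t]) did not contain it."""
    wrong = scored = 0
    for i, (x, y) in enumerate(zip(rows, labels)):
        votes = [t(x) for t, bag in zip(trees, in_bag) if i not in bag]
        if not votes:
            continue  # row appeared in every tree's bootstrap sample
        scored += 1
        prediction = max(set(votes), key=votes.count)
        wrong += prediction != y
    return wrong / scored

# two toy "trees" that always predict a fixed class
trees = [lambda x: "yes", lambda x: "no"]
in_bag = [{0}, {1}]  # tree 0 trained on row 0, tree 1 on row 1
print(oob_error([10, 20], ["no", "yes"], trees, in_bag))  # 0.0
```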


    Summary

    In this article, we looked at several important aspects of decision trees: how a decision tree is trained to solve a classification problem, which quality measures are used to select the input feature to split on, and techniques to avoid the overfitting problem.

    We also explained the strategy the random forest algorithm uses to make decision tree predictions more reliable, i.e., to limit the dependence on noise in the training set. Indeed, by using a set of N decorrelated decision trees, the random forest improves the accuracy and reliability of the final classification.

    Now let's use it and see if we can sail tomorrow!
