How to calculate information gain in decision trees
A Simple Explanation of Information Gain and Entropy
Information Gain, like Gini Impurity, is a metric used to train Decision Trees. Specifically, these metrics measure the quality of a split. For example, say we have the following data:
The Dataset

What if we made a split at $x = 1.5$?
An Imperfect Split

This imperfect split breaks our dataset into these branches:
- Left branch, with 4 blues.
- Right branch, with 1 blue and 5 greens.
It’s clear this split isn’t optimal, but how good is it? How can we quantify the quality of a split?
That’s where Information Gain comes in.
Confused? Not sure what Decision Trees are or how they’re trained? Read the beginning of my introduction to Random Forests and Decision Trees.
Before we get to Information Gain, we have to first talk about Information Entropy. For a dataset with $C$ classes, the entropy is

$$E = -\sum_{i}^{C} p_i \log_2 p_i$$
where $p_i$ is the probability of randomly picking an element of class $i$ (i.e. the proportion of the dataset made up of class $i$).
The easiest way to understand this is with an example. Consider a dataset with 1 blue, 2 greens, and 3 reds. Then
$$E = -(p_b \log_2 p_b + p_g \log_2 p_g + p_r \log_2 p_r)$$
We know $p_b = \frac{1}{6}$ because $\frac{1}{6}$ of the dataset is blue. Similarly, $p_g = \frac{2}{6}$ (greens) and $p_r = \frac{3}{6}$ (reds). Thus,
$$\begin{aligned} E &= -\left(\tfrac{1}{6} \log_2 \tfrac{1}{6} + \tfrac{2}{6} \log_2 \tfrac{2}{6} + \tfrac{3}{6} \log_2 \tfrac{3}{6}\right) \\ &= \boxed{1.46} \end{aligned}$$
What about a dataset of all one color? Consider 3 blues as an example. The entropy would be
$$E = -(1 \log_2 1) = \boxed{0}$$
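If you'd like to verify these numbers in code, here's a minimal Python sketch (the entropy helper is just for illustration, not from any particular library):

```python
import math

def entropy(proportions):
    """Entropy in bits for a list of class proportions."""
    return -sum(p * math.log2(p) for p in proportions if p > 0)

# 1 blue, 2 greens, 3 reds
print(entropy([1/6, 2/6, 3/6]))  # ~1.46

# 3 blues: a dataset of all one color
print(entropy([1.0]))  # 0.0
```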
It’s finally time to answer the question we posed earlier: how can we quantify the quality of a split?
Let’s consider this split again:
An Imperfect Split

Before the split, we had 5 blues and 5 greens, so the entropy was
$$\begin{aligned} E_{before} &= -(0.5 \log_2 0.5 + 0.5 \log_2 0.5) \\ &= \boxed{1} \end{aligned}$$
After the split, we have two branches.
Left Branch has 4 blues, so $E_{left} = \boxed{0}$ because it's a dataset of all one color.
Right Branch has 1 blue and 5 greens, so
$$\begin{aligned} E_{right} &= -\left(\tfrac{1}{6} \log_2 \tfrac{1}{6} + \tfrac{5}{6} \log_2 \tfrac{5}{6}\right) \\ &= \boxed{0.65} \end{aligned}$$
Now that we have the entropies for both branches, we can determine the quality of the split by weighting the entropy of each branch by how many elements it has. Since Left Branch has 4 elements and Right Branch has 6, we weight them by $0.4$ and $0.6$, respectively:
$$\begin{aligned} E_{split} &= 0.4 \cdot 0 + 0.6 \cdot 0.65 \\ &= \boxed{0.39} \end{aligned}$$
We started with $E_{before} = 1$ entropy before the split and now are down to $0.39$! Information Gain = how much Entropy we removed, so
$$\text{Gain} = 1 - 0.39 = \boxed{0.61}$$
This makes sense: higher Information Gain = more Entropy removed, which is what we want. In the perfect case, each branch would contain only one color after the split, which would be zero entropy!
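As a quick sanity check, the same split can be scored in a few lines, reusing the illustrative entropy helper from the sketch above:

```python
# Reusing the entropy() helper from the earlier sketch
e_before = entropy([5/10, 5/10])          # 1.0: 5 blues, 5 greens
e_left = entropy([1.0])                   # 0.0: 4 blues only
e_right = entropy([1/6, 5/6])             # ~0.65: 1 blue, 5 greens

# Weight each branch by its share of the elements (4/10 and 6/10)
e_split = 0.4 * e_left + 0.6 * e_right    # ~0.39
print(e_before - e_split)                 # ~0.61 Information Gain
```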
Recap
Information Entropy can be thought of as how unpredictable a dataset is:

$$E = -\sum_{i}^{C} p_i \log_2 p_i$$
Information Gain is calculated for a split by subtracting the weighted entropies of each branch from the original entropy. When training a Decision Tree using these metrics, the best split is chosen by maximizing Information Gain.
Want to learn more? Check out my explanation of Gini Impurity, a similar metric, or my in-depth guide Random Forests for Complete Beginners.
Entropy and Information Gain in Decision Trees | by Jeremiah Lutes
A simple look at some key Information Theory concepts and how to use them when building a Decision Tree Algorithm.
What criteria should a decision tree algorithm use to split variables/columns?
Before building a decision tree algorithm, the first step is to answer this question. Let's take a look at one of the ways to answer it. To do so, we will need to understand and use a few key concepts from information theory.
Let’s examine this method by taking the following steps:
- Take a very brief look at what a Decision Tree is.
- Define and examine the formula for Entropy.
- Discuss what a Bit is in information theory.
- Define Information Gain and use entropy to calculate it.
- Write some basic Python functions using the above concepts.
In data science, the decision tree algorithm is a supervised learning algorithm for classification or regression problems. Our end goal is to use historical data to predict an outcome. Unlike linear regression, decision trees can pick up nonlinear interactions between variables in the data.
Let’s look at a very simple decision tree. Below is a workflow that can be used to make a decision on whether or not to eat a peanut butter cookie.
A decision tree example on whether or not to eat a cookie

In this example, a decision tree can pick up on the fact that you should only eat the cookie if certain criteria are met. This is the ultimate goal of a decision tree. We want to keep making decisions (splits) until certain criteria are met. Once met, we can use it to classify or make a prediction. This example is very basic, using only two variables (allergy, ruining dinner). But if you have a dataset with thousands of variables/columns, how do you decide which variables/columns are the most efficient to split on? A popular way to solve this problem, especially if using an ID3 algorithm, is to use entropy and information gain.
Let's say we have some data and we want to use it to make an online quiz that predicts something about the quiz taker. After looking at the relationships in the data, we have decided to use a decision tree algorithm. If you have never been sucked into an online quiz, you can see hundreds of examples here. The goal of the quiz will be to guess if the quiz taker is from one of America's midwest states. The questions in the quiz will revolve around whether they like a certain type of food or not. Below is a small fictional dataset with fifteen entries. Each entry has answers to a series of questions. Most questions are about whether they liked a certain type of food, in which the participant answered (1) for yes or (0) for no. The last column ("midwest?") is our target column, meaning that once the decision tree is built, this is the classification we are trying to guess.
To get us started we will use an information theory metric called entropy. In data science, entropy is used as a way to measure how “mixed” a column is. Specifically, entropy is used to measure disorder. Let’s start by finding the entropy of our target column, “midwest?”.
Our target column, "midwest?"

There are ten people who live in the midwest and five people who don't. If someone were to ask you how mixed the column is, you could say it was sort of mixed, with a majority (2/3) of the people from the midwest. Entropy gives us a way to quantify the answer "sort of mixed". The more mixed the (1)s and (0)s in the column are, the higher the entropy. If "midwest?" had equal amounts of (1)s and (0)s, the entropy would be 1. If "midwest?" consisted only of (1)s, the entropy would be 0.
We can use the following formula to calculate entropy:
The formula for entropy is

$$E = -\sum_{i}^{C} p_i \log_2 p_i$$

Let's go through each step of the formula and calculate the entropy for the "midwest?" column.
- We need to iterate through each unique value in a single column and assign it to i. For this example, we have 2 cases (c) in the "midwest?" column, either (0) or (1).
- We then compute the probability of that value occurring in the data. For the case of (1), the probability is 10/15. For the case of (0), the probability is 5/15.
- We take the probability of each case and multiply it by the logarithm base 2 of the probability. 2 is the most common base because entropy is measured in bits (more on that later). The full explanation of why 2 is used is out of the scope of this post, but a user on Stack Exchange offers a good explanation.
For the case of (1), we get 10/15*log2(10/15). For the case of (0), we get 5/15*log2(5/15).
- Next, we take the products from each case above and sum them together. For this example, 10/15*log2(10/15) + 5/15*log2(5/15).
- Finally, we negate the total sum from above: -(10/15*log2(10/15) + 5/15*log2(5/15)).
Once we put the steps all together we get the below:
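Putting those steps into a couple of lines of Python (using only the standard math module):

```python
import math

p_one, p_zero = 10/15, 5/15  # probabilities of (1) and (0) in "midwest?"
entropy = -(p_one * math.log2(p_one) + p_zero * math.log2(p_zero))
print(entropy)  # ~0.918296
```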
Our final entropy is .918296. So, what does that really mean?
Moving forward, it will be important to understand the concept of a bit. In information theory, a bit is thought of as a binary number representing 0 for no information and 1 for a full bit of information. We can represent a bit of information as a binary number because it either has the value (1) or (0). Suppose there's an equal probability of it raining tomorrow (1) or not raining (0). If I tell you that it will rain tomorrow, I've given you one bit of information.
We can also think of entropy as information. Suppose we have a loaded six-sided die which always lands on (3). Each time we roll the die, we know upfront that the result will be (3). We gain no new information by rolling the die, so the entropy is 0. On the other hand, if the die is fair and we roll a (3), there was only a 1/6 chance of rolling that (3). Now we have gained information by rolling the die: which number it landed on (log2(6), roughly 2.58 bits).
For a deeper dive into the concept of a bit of information, you can read more here.
We get less than one "bit" of information (only .918296) because there are more (1)s in the "midwest?" column than (0)s. This means that if we were predicting a new value, we could guess that the answer is (1) and be right more often than wrong (because there's a 2/3 probability of the answer being 1). Due to this prior knowledge, we gain less than a full "bit" of information when we observe a new value.
Our goal is to find the best variable(s)/column(s) to split on when building a decision tree. Eventually, we want to keep splitting the variables/columns until our mixed target column is no longer mixed.
For instance, let’s look at the entropy of the “midwest?” column after we have split our dataset on the “potato_salad?” column.
Split on the "potato_salad?" column

Above, our dataset is split into two sections. On the left side is everyone who likes potato salad; on the right side, everyone who doesn't. We will focus on the left side, which now has seven people from the midwest and two people who aren't. Using the formula for entropy on the left split's midwest column, the new entropy is .764204. This is great! Our goal is to lower the entropy, and we went from .918296 to .764204. But we can't stop there; if we look at the right column, the entropy went up, since there are an equal number of (1)s and (0)s. What we need is a way to see how the entropy changes on both sides of the split. The formula for information gain will do that. It gives us a number to quantify how many bits of information we have gained each time we split our data.
Earlier we established we want splits that lower the entropy of our target column. When we split on “potato_salad?” we saw that entropy in the “midwest?” went down on the left side. Now we need to understand the total entropy lowered when we look at both sides of the split. Let’s take a look at information gain.
Information gain uses the following formula:

$$IG(T, A) = E(T) - \sum_{v \in A} \frac{|T_v|}{|T|} \cdot E(T_v)$$
Let's break down what is going on here.
We’ll go back to our “potato_salad?” example. The variables in the above formula will represent the following:
- T = Target, our “midwest?” column
- A = the variable(column) we are testing, “potato_salad?”
- v = each value in A, each value in the “potato_salad?” column
- First, we'll calculate the original entropy for (T) before the split: .918296.
- Then, for each unique value (v) in variable (A), we compute the number of rows in which (A) takes on the value (v), and divide it by the total number of rows. For the “potato_salad?” column we get 9/15 for the unique value of (1) and 6/15 for the unique value of (0).
- Next, we multiply the results by the entropy of the rows where (A) is (v). For the left split (split on 1 for "potato_salad?") we get 9/15 * .764204. For the right side of the split (split on 0 for "potato_salad?") we get 6/15 * 1.
- We add all of these subset products together: 9/15 * .764204 + 6/15 * 1 = .858522.
- We then subtract this from the overall entropy to get the information gain: .918296 - .858522 = .059774.
Our information gain is .059774. What does that tell us?
Here’s an alternate explanation. We’re finding the entropy of each set post-split, weighting it by the number of items in each split, then subtracting from the current entropy. If the result is positive, we’ve lowered entropy with our split. The higher the result is, the more we’ve lowered entropy.
We end up with .059774, which means that we gain .059774 bits of information by splitting our data set on the "potato_salad?" variable/column. Our information gain is low, but it's still positive, which is because we lowered the entropy on the left side of the split.
Now we need to repeat this process for every column we are using. Instead of doing this by hand let’s write some Python code.
Now that we understand information gain, we need a way to repeat this process to find the variable/column with the largest information gain. To do this, we can create a few simple functions in Python.
Importing the Data
Let’s turn our above table into a DataFrame using the Python pandas library. We will import pandas and use the read_csv() function to make a DataFrame named “midwest”.
import pandas as pd
midwest = pd.read_csv('midwest.csv')
A Python Function for Entropy
For this function, we will need the NumPy library to use the bincount() function and the math module to use the log() function.
import numpy
import math
Next, we will define our function with one parameter. The argument given will be the series, list, or NumPy array in which we are trying to calculate the entropy.
def calc_entropy(column):
We will need to find the percentage of each case in the column. We can use the numpy.bincount() function for this. The return value is a NumPy array which will store the count of each unique value from the column that was passed as an argument.
counts = numpy.bincount(column)
We’ll store the probabilities of each unique value by dividing the “counts” array by the length of the column.
probabilities = counts / len(column)
We can then initialize a variable named “entropy” and set it to 0.
entropy = 0
Next, we can use a "for loop" to loop through each probability in our probabilities array, multiply it by the logarithm base 2 of the probability using the math.log() function, and add each case to our stored entropy variable. *Be sure to check that the probability is greater than 0; otherwise log(0) will throw an error.
for prob in probabilities:
    if prob > 0:
        entropy += prob * math.log(prob, 2)
Finally, we’ll return our negated entropy variable.
return -entropy
All together now:
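```python
# calc_entropy assembled from the steps above
import numpy
import math

def calc_entropy(column):
    # Count occurrences of each unique value in the column
    counts = numpy.bincount(column)
    # Divide the counts by the column length to get probabilities
    probabilities = counts / len(column)

    entropy = 0
    for prob in probabilities:
        if prob > 0:
            # Accumulate p * log2(p) for every non-zero probability
            entropy += prob * math.log(prob, 2)

    # Negate the sum to get the entropy in bits
    return -entropy
```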
Great! Now we can build a function to calculate information gain.
A Python Function for Information Gain
We’ll need to define a function that will have three parameters, one for the entire dataset, one for the name of the column we want to split on, and one for the name of our target column.
def calc_information_gain(data, split_name, target_name):
Next, we can use the entropy function from earlier to calculate the original entropy of our target column.
original_entropy = calc_entropy(data[target_name])
Now we need to split our column.
*For this example we will only use the variables/columns with two unique values. If you want to split on a variable/column such as "age", there are several ways to do this. One way is to split on every unique value. Another way is to simplify the calculation of information gain and make splits simpler by not splitting for each unique value. Instead, the median is found for the variable/column being split on. Any rows where the value of the variable is below the median will go to the left branch, and the rest of the rows will go to the right branch. To compute information gain, we'll only have to compute entropies for two subsets. We won't be walking through this method, but once the split on the median is performed, the rest of the steps would be the same as outlined below.
Since the columns we are working with only have two unique values we will make a left split and a right split.
We'll start by using the pandas.Series.unique() function to give us an array of the unique values in the column:
values = data[split_name].unique()
Next, we will create a left and right split using “values”.
left_split = data[data[split_name] == values[0]]
right_split = data[data[split_name] == values[1]]
Now we can initiate a variable to subtract from our original entropy.
to_subtract = 0
Then we'll iterate through each subset created by our split, calculate the probability of the subset, and add the product of the probability and the subset's target column entropy.
for subset in [left_split, right_split]:
    prob = (subset.shape[0] / data.shape[0])
    to_subtract += prob * calc_entropy(subset[target_name])
Finally, we can return the result of subtracting to_subtract from the original entropy.
return original_entropy - to_subtract
The entire function is below.
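```python
# calc_information_gain assembled from the steps above
def calc_information_gain(data, split_name, target_name):
    # Entropy of the target column before the split
    original_entropy = calc_entropy(data[target_name])

    # The two unique values of the column we are splitting on
    values = data[split_name].unique()

    # Rows matching each unique value
    left_split = data[data[split_name] == values[0]]
    right_split = data[data[split_name] == values[1]]

    # Weighted entropy of the target column within each subset
    to_subtract = 0
    for subset in [left_split, right_split]:
        prob = subset.shape[0] / data.shape[0]
        to_subtract += prob * calc_entropy(subset[target_name])

    # Information gain is the entropy removed by the split
    return original_entropy - to_subtract
```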
A Python Function for the Highest Information Gain
Our final function will be one that will return the variable/column name with the highest information gain.
As mentioned earlier, we are only using the columns with two unique values for this example. We'll store those column names in a list to use in the function. To get to the point, we'll hard-code this list for this example, but in a large dataset it would be best to write code that builds the list dynamically based on the criteria we use to choose the columns.
columns = ['apple_pie?', 'potato_salad?', 'sushi?']
Let’s wrap the final step in a function so we can reuse it as needed. It will have one parameter, the list of columns we want to find the highest information gain for.
def highest_info_gain(columns):
We'll initialize an empty dictionary to store our information gains.
information_gains = {}
And then we can iterate through the list of columns and store the result in our information_gains dictionary.
for col in columns:
    information_gain = calc_information_gain(midwest, col, 'midwest?')
    information_gains[col] = information_gain
Finally, we can return the key of the highest value in our dictionary.
return max(information_gains, key=information_gains.get)
All together now:
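```python
# highest_info_gain assembled from the steps above
def highest_info_gain(columns):
    information_gains = {}

    # Information gain of every candidate column against the target
    for col in columns:
        information_gain = calc_information_gain(midwest, col, 'midwest?')
        information_gains[col] = information_gain

    # Return the column name with the largest information gain
    return max(information_gains, key=information_gains.get)
```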
Once we execute our final function
print(highest_info_gain(columns))
# sushi?
we see the variable/column with the highest information gain is ‘sushi?’.
We can visualize a split on sushi below:
Splitting our dataset on the sushi column

Our left split has two people out of six from the midwest. The right split has eight out of nine people from the midwest. This was an efficient split and lowered our entropy on both sides. If we were to continue, we would use recursion to keep splitting each branch, with the goal of ending each branch with an entropy of zero.
Decision trees can be a useful machine learning algorithm to pick up nonlinear interactions between variables in the data. In this example, we looked at the beginning stages of a decision tree classification algorithm. We then looked at three information theory concepts, entropy, bit, and information gain. By using these concepts we were able to build a few functions in Python to decide which variables/columns were the most efficient to split on. With a firm grasp on these concepts, we can move forward to build a decision tree.
The right decision tree for the right choice
We show you how to use decision trees for machine learning, data visualization, and visual demonstration of the decision-making process. This will be useful not only for data analysts, but also for anyone looking for a methodology that helps make more balanced decisions in life and business.
This article is based on the translation: How to Create a Perfect Decision Tree by Upasana Priyadarshiny.
The main tasks that a decision tree solves in machine learning, data analysis, and statistics are:
- Classification - assigning an object to a specific category based on its features.
- Regression - using data to predict a quantitative attribute, taking into account the influence of other attributes.
Decision trees can also help visualize the decision-making process and make the right choice in situations where the results of one decision affect the results of subsequent decisions. Let's try to explain how it works with simple real-life examples.
What is a decision tree
Visually, a decision tree can be represented as a map of possible outcomes from a series of interrelated choices. This helps to compare possible actions based on their cost, likelihood, and benefit.
As the name suggests, this approach uses a tree-like decision model. Such tree diagrams can be useful both when discussing a problem and when building an algorithm that mathematically determines the best choice.
Typically, a decision tree starts with a single node that branches into possible outcomes. Each outcome leads to additional nodes, which branch off into further possibilities. This gives the model its tree shape.
Decision tree: to play or not to play, taking the weather forecast into account

There can be three different types of nodes in a decision tree:
- Decision nodes - show a decision to be made.
- Chance nodes - show the probabilities of certain outcomes.
- End nodes - show the final result of a decision path.
Advantages and Disadvantages of the Decision Tree Method
Advantages of the method:
- Decision trees are created according to clear rules, so they are easy to apply and interpret.
- Both continuous and qualitative (discrete) variables can be processed.
- You can work with data gaps: decision trees allow you to fill in an empty field with the most likely value.
- They help you determine which fields are more important for prediction or classification.
Disadvantages of the method:
- There is a possibility of errors in classification problems with a large number of classes and a relatively small number of training examples.
- Process instability: a change in one node can lead to the construction of a completely different tree, due to the hierarchical nature of its structure.
- The process of "growing" a decision tree can be quite computationally expensive, because at each node each attribute must be expanded until the best solution or branch is found. Some algorithms use combinations of fields, in which case you have to look for the optimal combination by the "weight" of the coefficients.
The pruning (or "pruning") algorithm is also expensive, since a large number of potential branches must be formed and compared.
How to create a decision tree
As an example, let's take a scenario where a group of astronomers is studying a planet to see if it can become a new Earth.
There are N decisive factors that need to be carefully considered in order to make a reasonable decision. These factors can be the presence of water on the planet, the temperature range, whether the surface is subject to constant storms, whether flora and fauna can survive in this climate, and hundreds of other parameters.
We will explore the problem through the decision tree.
- Is the habitable temperature between 0 and 100 degrees Celsius?
- Is there water there?
- Do flora and fauna flourish?
- Is the surface of the planet prone to storms?
Thus, we have a complete decision tree.
Classification rules
First, let's define the terms and their meanings.
Classification rules are cases in which all scenarios are considered and a class variable is assigned to each.
The class variable is the final result that our decision leads to. A class variable is assigned to each leaf (terminal) node.
Here are the classification rules from the example decision tree about exploring a new planet:
- If the temperature is not between -0.15°C and 99.85°C, then survival is difficult.
- If the temperature is between -0.15°C and 99.85°C, but there is no water, then survival is difficult.
- If the temperature is between -0.15°C and 99.85°C, there is water but no flora and fauna, then survival is difficult.
- If the temperature is between -0.15 °C and 99.85 °C, there is water, there is flora and fauna, the surface is not prone to storms, then survival is possible.
- If the temperature is between -0.15°C and 99.85°C, there is water, there is flora and fauna, but the surface is prone to storms, then survival is difficult.
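Written as code, these classification rules collapse into a simple chain of checks. Here is a minimal Python sketch (the function and parameter names are purely illustrative):

```python
def assess_survival(temperature_c, has_water, has_flora_fauna, storm_prone):
    """Classification rules from the planet-exploration decision tree."""
    if not (-0.15 <= temperature_c <= 99.85):
        return "survival is difficult"
    if not has_water:
        return "survival is difficult"
    if not has_flora_fauna:
        return "survival is difficult"
    if storm_prone:
        return "survival is difficult"
    return "survival is possible"

print(assess_survival(25.0, True, True, False))  # survival is possible
```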
Why is it difficult to build an ideal decision tree?
When the decision tree is created, we start at the root node, check the test condition, and follow one of the outgoing branches. We then test the condition again and move to the next node. The tree is complete once all conditions lead to a leaf node. The leaf node will contain the class labels that weigh the pros and cons of the decision.
Please note that we started with the "temperature" feature and made it the root. If you choose a different feature, the tree will turn out differently. This is the principle of the method: you need to choose the optimal root, use it to build the tree, and decide which tree is best suited to the task.
There are different ways to find the most appropriate decision tree for a particular situation. Below we will talk about one of them.
How to use a greedy algorithm to build a decision tree
The essence of the approach is the principle of so-called greedy maximization of information gain. It is based on heuristic problem solving: making the optimal local choice at each node in order to reach a solution that is likely to be close to the best overall.
The algorithm can be simplified as follows:
- At each node, select the best attribute to test.
- Then split the node into all possible outcomes (internal nodes).
- Repeat steps until all conditions result in end nodes.
The main question: "How to choose the initial conditions for testing?" . The answer lies in the values of entropy and information gain (information gain). We tell you what they are and how they affect the creation of our tree:
- Entropy - in a decision tree, this measures how homogeneous a node is. If the data is completely homogeneous, the entropy is 0; if the data is split 50-50, the entropy is 1. The simplest way to explain this term is as the amount of information we don't yet know that is relevant to making a decision.
- Information gain - the reduction in entropy produced by a split: the higher the information gain, the lower the resulting entropy, which means less unaccounted-for data and a better split.
In short, we select the attribute that has the highest information gain to use for the next split. This helps to choose the best option at any node.
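A rough Python sketch of this greedy procedure, assuming the data is a pandas DataFrame of categorical attributes (the helper names below are illustrative, not from a specific library):

```python
import math

def entropy(series):
    """Entropy in bits of a column of class labels."""
    probs = series.value_counts(normalize=True)
    return -sum(p * math.log2(p) for p in probs if p > 0)

def info_gain(data, attribute, target):
    """Entropy reduction obtained by splitting data on an attribute."""
    weighted = sum((len(subset) / len(data)) * entropy(subset[target])
                   for _, subset in data.groupby(attribute))
    return entropy(data[target]) - weighted

def build_tree(data, attributes, target):
    """Greedy, recursive construction: best local split at every node."""
    if entropy(data[target]) == 0 or not attributes:
        return data[target].mode()[0]  # leaf node: (majority) class label
    # Greedy local choice: the attribute with the highest information gain
    best = max(attributes, key=lambda a: info_gain(data, a, target))
    remaining = [a for a in attributes if a != best]
    return {best: {value: build_tree(subset, remaining, target)
                   for value, subset in data.groupby(best)}}
```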
Let's look at an example. We have a large amount of data like this:
| Number | Age | Income | Student | Credit rating | Computer purchase |
| --- | --- | --- | --- | --- | --- |
| 1 | <30 | Low | Yes | Fair | Yes |
| 2 | 30-40 | … | … | … | … |

Looking for the optimal solution with a greedy algorithm