How decision tree classifier works


Decision Tree Algorithm, Explained - KDnuggets

 

Introduction

Classification is a two-step process in machine learning, consisting of a learning step and a prediction step. In the learning step, the model is developed based on given training data. In the prediction step, the model is used to predict the response for given data. The Decision Tree is one of the easiest and most popular classification algorithms to understand and interpret.

 

Decision Tree Algorithm

The Decision Tree algorithm belongs to the family of supervised learning algorithms. Unlike many other supervised learning algorithms, the decision tree algorithm can be used for solving both regression and classification problems.

The goal of using a Decision Tree is to create a training model that can be used to predict the class or value of the target variable by learning simple decision rules inferred from prior data (training data).

In Decision Trees, for predicting a class label for a record we start from the root of the tree. We compare the values of the root attribute with the record’s attribute. On the basis of comparison, we follow the branch corresponding to that value and jump to the next node.

 

Types of Decision Trees

Types of decision trees are based on the type of target variable we have. It can be of two types:

  1. Categorical Variable Decision Tree: A decision tree that has a categorical target variable is called a categorical variable decision tree.
  2. Continuous Variable Decision Tree: A decision tree that has a continuous target variable is called a continuous variable decision tree.

Example: Let's say we have a problem to predict whether a customer will pay his renewal premium with an insurance company (yes/no). Here we know that the income of customers is a significant variable, but the insurance company does not have income details for all customers. Since we know this is an important variable, we can build a decision tree to predict customer income based on occupation, product, and various other variables. In this case, we are predicting values for a continuous variable.

 

Important Terminology related to Decision Trees

 

  1. Root Node: It represents the entire population or sample and this further gets divided into two or more homogeneous sets.
  2. Splitting: It is a process of dividing a node into two or more sub-nodes.
  3. Decision Node: When a sub-node splits into further sub-nodes, it is called a decision node.
  4. Leaf / Terminal Node: Nodes that do not split are called leaf or terminal nodes.
  5. Pruning: When we remove sub-nodes of a decision node, the process is called pruning. You can say it is the opposite of splitting.
  6. Branch / Sub-Tree: A subsection of the entire tree is called a branch or sub-tree.
  7. Parent and Child Node: A node which is divided into sub-nodes is called the parent node of those sub-nodes, whereas the sub-nodes are the children of the parent node.

Decision trees classify the examples by sorting them down the tree from the root to some leaf/terminal node, with the leaf/terminal node providing the classification of the example.

Each node in the tree acts as a test case for some attribute, and each edge descending from the node corresponds to the possible answers to the test case. This process is recursive in nature and is repeated for every subtree rooted at the new node.

 

Assumptions while creating Decision Tree

Below are some of the assumptions we make while using Decision tree:

  • In the beginning, the whole training set is considered as the root.
  • Feature values are preferred to be categorical. If the values are continuous then they are discretized prior to building the model.
  • Records are distributed recursively on the basis of attribute values.
  • Ordering attributes as the root or internal nodes of the tree is done using some statistical approach.

Decision Trees follow a Sum of Product (SOP) representation. The Sum of Product (SOP) is also known as Disjunctive Normal Form. For a class, every branch from the root of the tree to a leaf node having the same class is a conjunction (product) of values; different branches ending in that class form a disjunction (sum).

The primary challenge in the decision tree implementation is to identify which attributes to consider at the root node and at each level. Handling this is known as attribute selection. We have different attribute selection measures to identify the attribute which can be considered as the root node at each level.

 

How do Decision Trees work?

The decision of making strategic splits heavily affects a tree’s accuracy. The decision criteria are different for classification and regression trees.

Decision trees use multiple algorithms to decide to split a node into two or more sub-nodes. The creation of sub-nodes increases the homogeneity of the resultant sub-nodes. In other words, we can say that the purity of a node increases with respect to the target variable. The decision tree splits the nodes on all available variables and then selects the split which results in the most homogeneous sub-nodes.

The algorithm selection is also based on the type of target variables. Let us look at some algorithms used in Decision Trees:

ID3 → (Iterative Dichotomiser 3)
C4.5 → (successor of ID3)
CART → (Classification And Regression Tree)
CHAID → (Chi-square Automatic Interaction Detection; performs multi-level splits when computing classification trees)
MARS → (Multivariate Adaptive Regression Splines)

The ID3 algorithm builds decision trees using a top-down greedy search approach through the space of possible branches with no backtracking. A greedy algorithm, as the name suggests, always makes the choice that seems to be the best at that moment.

Steps in ID3 algorithm:

  1. It begins with the original set S as the root node.
  2. On each iteration of the algorithm, it iterates through every unused attribute of the set S and calculates the entropy (H) and information gain (IG) of this attribute.
  3. It then selects the attribute which has the smallest entropy or largest information gain.
  4. The set S is then split by the selected attribute to produce subsets of the data.
  5. The algorithm continues to recurse on each subset, considering only attributes never selected before.

 

Attribute Selection Measures

If the dataset consists of N attributes, then deciding which attribute to place at the root or at different levels of the tree as internal nodes is a complicated step. Just randomly selecting any node to be the root can't solve the issue. If we follow a random approach, it may give us bad results with low accuracy.

For solving this attribute selection problem, researchers worked and devised some solutions. They suggested using criteria like:

Entropy,
Information gain,
Gini index,
Gain Ratio,
Reduction in Variance
Chi-Square

These criteria calculate a value for every attribute. The values are sorted, and attributes are placed in the tree by following that order, i.e., the attribute with the highest value (in the case of information gain) is placed at the root.
While using Information Gain as a criterion, we assume attributes to be categorical, and for the Gini index, attributes are assumed to be continuous.

 

Entropy

Entropy is a measure of the randomness in the information being processed. The higher the entropy, the harder it is to draw any conclusions from that information. Flipping a coin is an example of an action that provides information that is random.

From the above graph, it is quite evident that the entropy H(X) is zero when the probability is either 0 or 1. The entropy is maximum when the probability is 0.5, because it reflects perfect randomness in the data and there is no chance of perfectly determining the outcome.

ID3 follows the rule: a branch with an entropy of zero is a leaf node, and a branch with entropy more than zero needs further splitting.

Mathematically, entropy for one attribute is represented as:

E(S) = Σ -Pi · log2(Pi), summed over the c classes i present in S

Where S → the current state, and Pi → the probability of an event of state S, or the percentage of class i in a node of state S.

Mathematically, entropy for multiple attributes is represented as:

E(T, X) = Σ P(c) · E(c), summed over the values c of the selected attribute X

where T → the current state and X → the selected attribute.
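To make these formulas concrete, here is a minimal NumPy sketch (the helper names entropy and conditional_entropy are my own, not from any particular library) that computes the entropy of a label vector and the weighted entropy of the subsets produced by splitting on an attribute:

import numpy as np

def entropy(labels):
    # Entropy E(S) of a vector of class labels, in bits
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def conditional_entropy(attribute, labels):
    # Weighted entropy E(T, X): size-weighted entropy of the subsets produced by splitting on `attribute`
    values, counts = np.unique(attribute, return_counts=True)
    weights = counts / counts.sum()
    return sum(w * entropy(labels[attribute == v]) for v, w in zip(values, weights))

y = np.array(['yes', 'yes', 'no', 'yes', 'no'])
print(entropy(y))   # ~0.971 bits for a 3:2 class split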

 

Information Gain

Information gain or IG is a statistical property that measures how well a given attribute separates the training examples according to their target classification. Constructing a decision tree is all about finding an attribute that returns the highest information gain and the smallest entropy.

Information Gain

 

 

Information gain is a decrease in entropy. It computes the difference between entropy before split and average entropy after split of the dataset based on given attribute values. ID3 (Iterative Dichotomiser) decision tree algorithm uses information gain.

Mathematically, IG is represented as:

IG(T, X) = Entropy(T) - Entropy(T, X)

In a much simpler way, we can conclude that:

Information Gain = Entropy(before) - Σ Entropy(j, after), summed over the K subsets j created by the split

Where “before” is the dataset before the split, K is the number of subsets generated by the split, and (j, after) is subset j after the split.
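As an illustrative sketch (assuming categorical attributes and labels held in NumPy arrays; the helper names are mine), information gain can be computed as the parent entropy minus the size-weighted entropy of the subsets:

import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(attribute, labels):
    # IG = entropy before the split minus the size-weighted entropy of the K subsets after it
    values, counts = np.unique(attribute, return_counts=True)
    weights = counts / counts.sum()
    after = sum(w * entropy(labels[attribute == v]) for v, w in zip(values, weights))
    return entropy(labels) - after

outlook = np.array(['sunny', 'sunny', 'rain', 'rain', 'overcast'])
play = np.array(['no', 'no', 'yes', 'yes', 'yes'])
print(information_gain(outlook, play))   # ~0.971: 'outlook' separates the classes completely here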

 

Gini Index

You can understand the Gini index as a cost function used to evaluate splits in the dataset. It is calculated by subtracting the sum of the squared probabilities of each class from one. It favors larger partitions and is easy to implement, whereas information gain favors smaller partitions with distinct values.

Gini Index = 1 - Σ (Pi)², where Pi is the probability of class i in the node

 

 

Gini Index works with the categorical target variable “Success” or “Failure”. It performs only Binary splits.

A higher value of the Gini index implies higher inequality and higher heterogeneity.

Steps to Calculate Gini index for a split

  1. Calculate the Gini impurity for the sub-nodes, using the above formula, 1 - (p² + q²), for success (p) and failure (q).
  2. Calculate the Gini index for the split using the weighted Gini score of each node of that split.

CART (Classification and Regression Tree) uses the Gini index method to create split points.
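A hedged sketch of these steps for a binary, CART-style split, using the impurity form 1 - Σp² described above (lower weighted impurity means purer sub-nodes; the function names are illustrative):

import numpy as np

def gini(labels):
    # Gini impurity: 1 minus the sum of squared class probabilities
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def gini_of_split(left_labels, right_labels):
    # Weighted Gini score of a binary split; lower means a better split
    n = len(left_labels) + len(right_labels)
    return (len(left_labels) / n * gini(left_labels)
            + len(right_labels) / n * gini(right_labels))

left = np.array(['success'] * 8 + ['failure'] * 2)
right = np.array(['success'] * 1 + ['failure'] * 9)
print(gini_of_split(left, right))   # 0.25 here; 0 would mean perfectly pure sub-nodes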

 

Gain ratio

Information gain is biased towards choosing attributes with a large number of values as root nodes. It means it prefers the attribute with a large number of distinct values.

C4.5, an improvement of ID3, uses Gain ratio which is a modification of Information gain that reduces its bias and is usually the best option. Gain ratio overcomes the problem with information gain by taking into account the number of branches that would result before making the split. It corrects information gain by taking the intrinsic information of a split into account.

Let us consider a dataset that has users and their movie genre preferences based on variables like gender, age group, rating, and so on. With the help of information gain, you split at 'Gender' (assuming it has the highest information gain), and now the variables 'Age Group' and 'Rating' may be equally important. The gain ratio penalizes the variable with more distinct values, which helps us decide the split at the next level.

Gain Ratio

Gain Ratio = [Entropy(before) - Σ Entropy(j, after)] / [- Σ (Nj/N) · log2(Nj/N)], with both sums running over the K subsets j created by the split

Where "before" is the dataset before the split, K is the number of subsets generated by the split, (j, after) is subset j after the split, and Nj/N is the fraction of records that fall into subset j.
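A rough sketch of this calculation (my own helper names; the gain is divided by the split information, i.e. the entropy of the subset sizes), showing how an ID-like attribute with many distinct values is penalized relative to an equally informative attribute with few values:

import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def gain_ratio(attribute, labels):
    # C4.5-style gain ratio: information gain divided by the split information
    # (the entropy of the subset sizes), which penalizes many-valued attributes
    values, counts = np.unique(attribute, return_counts=True)
    weights = counts / counts.sum()
    gain = entropy(labels) - sum(w * entropy(labels[attribute == v])
                                 for v, w in zip(values, weights))
    split_info = -np.sum(weights * np.log2(weights))
    return gain / split_info if split_info > 0 else 0.0

labels = np.array(['yes', 'yes', 'no', 'no'])
gender = np.array(['m', 'm', 'f', 'f'])     # two values, fully informative
user_id = np.array([1, 2, 3, 4])            # unique per row, also "fully informative"
print(gain_ratio(gender, labels), gain_ratio(user_id, labels))   # 1.0 vs 0.5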

 

Reduction in Variance

Reduction in variance is an algorithm used for continuous target variables (regression problems). This algorithm uses the standard formula of variance to choose the best split. The split with lower variance is selected as the criteria to split the population:

Variance = Σ (X - X̄)² / n

Above, X̄ (X-bar) is the mean of the values, X is the actual value, and n is the number of values.

Steps to calculate Variance:

  1. Calculate variance for each node.
  2. Calculate variance for each split as the weighted average of each node variance.
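A minimal sketch of this criterion for a single candidate binary split (illustrative names; a real implementation would scan many candidate splits and keep the one with the largest reduction):

import numpy as np

def variance_reduction(split_mask, y):
    # Variance of the parent node minus the size-weighted variance of the two children
    y = np.asarray(y, dtype=float)
    left, right = y[split_mask], y[~split_mask]
    n = len(y)
    weighted = len(left) / n * left.var() + len(right) / n * right.var()
    return y.var() - weighted

x = np.array([1, 2, 3, 8, 9, 10])                      # feature used to form the candidate split
y = np.array([10.0, 12.0, 11.0, 30.0, 32.0, 31.0])     # continuous target
print(variance_reduction(x < 5, y))                    # large reduction: a good regression split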

 

Chi-Square

The acronym CHAID stands for Chi-squared Automatic Interaction Detector. It is one of the oldest tree classification methods. It finds the statistical significance of the differences between sub-nodes and the parent node. We measure it by the sum of squares of the standardized differences between the observed and expected frequencies of the target variable.

It works with the categorical target variable “Success” or “Failure”. It can perform two or more splits. The higher the value of Chi-square, the higher the statistical significance of the differences between a sub-node and its parent node.

It generates a tree called CHAID (Chi-square Automatic Interaction Detector).

Mathematically, Chi-square is represented as:

Chi-square = Σ (Observed - Expected)² / Expected, accumulated over Success and Failure in each sub-node

Steps to Calculate Chi-square for a split:

  1. Calculate the Chi-square for an individual node by calculating the deviation for both Success and Failure.
  2. Calculate the Chi-square of the split as the sum of the Chi-square values for Success and Failure of each node of the split.
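One common way to compute such a statistic is sketched below (illustrative function name; the expected counts assume each child node keeps the parent's class proportions):

import numpy as np

def chi_square_of_split(parent_labels, child_label_arrays):
    # For every child node and every class, accumulate (observed - expected)^2 / expected,
    # where the expected count assumes the child keeps the parent's class proportions
    classes, parent_counts = np.unique(parent_labels, return_counts=True)
    parent_props = parent_counts / parent_counts.sum()
    chi2 = 0.0
    for child in child_label_arrays:
        expected = parent_props * len(child)
        observed = np.array([(child == c).sum() for c in classes])
        chi2 += np.sum((observed - expected) ** 2 / expected)
    return chi2

parent = np.array(['success'] * 10 + ['failure'] * 10)
children = [parent[:8], parent[8:]]            # a hypothetical binary split of the same records
print(chi_square_of_split(parent, children))   # larger values mean a more significant split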

 

How to avoid/counter Overfitting in Decision Trees?

The common problem with decision trees, especially when there is a table full of columns, is that they overfit a lot. Sometimes it looks like the tree memorized the training data set. If no limit is set on a decision tree, it will give you 100% accuracy on the training data set, because in the worst case it will end up making one leaf for each observation. This affects the accuracy when predicting samples that are not part of the training set.

Here are two ways to remove overfitting:

  1. Pruning Decision Trees.
  2. Random Forest

Pruning Decision Trees

The splitting process results in fully grown trees until the stopping criteria are reached. But, the fully grown tree is likely to overfit the data, leading to poor accuracy on unseen data.

Pruning in action

 

 

In pruning, you trim off the branches of the tree, i.e., remove the decision nodes starting from the leaf nodes such that the overall accuracy is not disturbed. This is done by segregating the actual training set into two sets: a training data set D and a validation data set V. Prepare the decision tree using the segregated training data set D, then continue trimming the tree to optimize the accuracy of the validation data set V.

Pruning

 

 

In the above diagram, the 'Age' attribute on the left-hand side of the tree has been pruned because it has more importance on the right-hand side of the tree, hence removing overfitting.

Random Forest

Random Forest is an example of ensemble learning, in which we combine multiple machine learning algorithms to obtain better predictive performance.

Why the name “Random”?

Two key concepts that give it the name random:

  1. A random sampling of training data set when building trees.
  2. Random subsets of features considered when splitting nodes.

A technique known as bagging is used to create an ensemble of trees where multiple training sets are generated with replacement.

In the bagging technique, N samples are drawn from the data set using randomized sampling with replacement. Then, using a single learning algorithm, a model is built on each sample. Finally, the resulting predictions are combined in parallel using voting or averaging.
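As a hedged illustration of this idea with scikit-learn (using a synthetic dataset rather than the supermarket data used later in this article), RandomForestClassifier trains each tree on a bootstrap sample and considers a random subset of features at every split:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=400, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Each of the 100 trees is trained on a bootstrap sample (bagging) and
# considers a random subset of features at every split
forest = RandomForestClassifier(n_estimators=100, max_features='sqrt', random_state=0)
forest.fit(X_train, y_train)
print('Test accuracy:', accuracy_score(y_test, forest.predict(X_test)))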

Random Forest in action

 

 

 

Which is better: linear or tree-based models?

Well, it depends on the kind of problem you are solving.

  1. If the relationship between dependent & independent variables is well approximated by a linear model, linear regression will outperform the tree-based model.
  2. If there is a high non-linearity & complex relationship between dependent & independent variables, a tree model will outperform a classical regression method.
  3. If you need to build a model that is easy to explain to people, a decision tree model will always do better than a linear model. Decision tree models are even simpler to interpret than linear regression!

 

Decision Tree Classifier Building in Scikit-learn

The dataset that we have is supermarket data, which can be downloaded from here.
Load all the basic libraries.

import numpy as np
 import matplotlib.pyplot as plt 
 import pandas as pd

 

Load the dataset. It consists of 5 columns: User ID, Gender, Age, EstimatedSalary, and Purchased.

data = pd.read_csv('/Users/ML/DecisionTree/Social.csv')
 data.head()

 

Dataset

 

 

We will take only Age and EstimatedSalary as our independent variables X, because other features like Gender and User ID are irrelevant and have no effect on the purchasing capacity of a person. Purchased is our dependent variable y.

feature_cols = ['Age', 'EstimatedSalary']
X = data.iloc[:, [2, 3]].values
y = data.iloc[:, 4].values

 

The next step is to split the dataset into training and test.

from sklearn.model_selection import train_test_split
 X_train, X_test, y_train, y_test = train_test_split(X,y,test_size = 0.25, random_state= 0)

 

Perform feature scaling

#feature scaling
 from sklearn.preprocessing import StandardScaler
 sc_X = StandardScaler()
 X_train = sc_X.fit_transform(X_train)
 X_test = sc_X.transform(X_test)

 

Fit the model in the Decision Tree classifier.

from sklearn.tree import DecisionTreeClassifier
 classifier = DecisionTreeClassifier()
 classifier = classifier.fit(X_train,y_train)

 

Make predictions and check accuracy.

# Prediction
y_pred = classifier.predict(X_test)

# Accuracy
from sklearn import metrics
print('Accuracy Score:', metrics.accuracy_score(y_test, y_pred))

 

The decision tree classifier gave an accuracy of 91%.

Confusion Matrix

from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)

# Output:
# array([[64,  4],
#        [ 2, 30]])

 

It means 6 observations (the 4 + 2 off-diagonal entries) have been misclassified.

Let us first visualize the model prediction results.

from matplotlib.colors import ListedColormap
X_set, y_set = X_test, y_test
X1, X2 = np.meshgrid(np.arange(start=X_set[:, 0].min() - 1, stop=X_set[:, 0].max() + 1, step=0.01),
                     np.arange(start=X_set[:, 1].min() - 1, stop=X_set[:, 1].max() + 1, step=0.01))
plt.contourf(X1, X2, classifier.predict(np.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
             alpha=0.75, cmap=ListedColormap(("red", "green")))
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1],
                c=ListedColormap(("red", "green"))(i), label=j)
plt.title("Decision Tree (Test set)")
plt.xlabel("Age")
plt.ylabel("Estimated Salary")
plt.legend()
plt.show()

 

Let us also visualize the tree:

You can use Scikit-learn’s export_graphviz function to display the tree within a Jupyter notebook. For plotting trees, you also need to install Graphviz and pydotplus.

conda install python-graphviz
pip install pydotplus

The export_graphviz function converts the decision tree classifier into a dot file, and pydotplus converts this dot file to a PNG or a form displayable in Jupyter.

from sklearn.tree import export_graphviz
from io import StringIO  # sklearn.externals.six is deprecated in newer scikit-learn versions
from IPython.display import Image
import pydotplus

dot_data = StringIO()
export_graphviz(classifier, out_file=dot_data,
                filled=True, rounded=True, special_characters=True,
                feature_names=feature_cols, class_names=['0', '1'])
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
Image(graph.create_png())
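As an aside, newer scikit-learn versions (0.21+) can draw the same tree without Graphviz via the built-in plotter; a minimal alternative sketch using the classifier and feature_cols defined above:

from sklearn import tree
import matplotlib.pyplot as plt

plt.figure(figsize=(12, 8))
tree.plot_tree(classifier, feature_names=feature_cols,
               class_names=['0', '1'], filled=True, rounded=True)
plt.show()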

 

Decision Tree.

 

 

In the decision tree chart, each internal node has a decision rule that splits the data. The value labelled gini refers to the Gini index, which measures the impurity of the node. You can say a node is pure when all of its records belong to the same class; such nodes are known as leaf nodes.

Here, the resultant tree is unpruned. This unpruned tree is unexplainable and not easy to understand. In the next section, let’s optimize it by pruning.

Optimizing the Decision Tree Classifier

criterion: optional (default=”gini”), or Choose attribute selection measure: This parameter allows us to use different attribute selection measures. Supported criteria are “gini” for the Gini index and “entropy” for the information gain.

splitter: string, optional (default=”best”) or Split Strategy: This parameter allows us to choose the split strategy. Supported strategies are “best” to choose the best split and “random” to choose the best random split.

max_depth: int or None, optional (default=None) or Maximum Depth of a Tree: The maximum depth of the tree. If None, then nodes are expanded until all the leaves contain less than min_samples_split samples. The higher value of maximum depth causes overfitting, and a lower value causes underfitting (Source).

In Scikit-learn, optimization of the decision tree classifier is performed only by pre-pruning. The maximum depth of the tree can be used as a control variable for pre-pruning.

# Create Decision Tree classifier object
classifier = DecisionTreeClassifier(criterion="entropy", max_depth=3)

# Train Decision Tree Classifier
classifier = classifier.fit(X_train, y_train)

# Predict the response for the test dataset
y_pred = classifier.predict(X_test)

# Model Accuracy: how often is the classifier correct?
print("Accuracy:", metrics.accuracy_score(y_test, y_pred))

 

Well, the classification rate increased to 94%, which is better accuracy than the previous model.

Now let us again visualize the pruned Decision tree after optimization.

dot_data = StringIO()
export_graphviz(classifier, out_file=dot_data,
                filled=True, rounded=True, special_characters=True,
                feature_names=feature_cols, class_names=['0', '1'])
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
Image(graph.create_png())

 

Decision Tree after pruning

 

 

This pruned model is less complex and easier to explain and understand than the previous decision tree model plot.

 

Conclusion

In this article, we have covered a lot of details about decision trees: how they work; attribute selection measures such as Information Gain, Gain Ratio, and Gini Index; decision tree model building, visualization, and evaluation on a supermarket dataset using the Python Scikit-learn package; and optimizing decision tree performance using parameter tuning.

Well, that's all for this article. Hope you have enjoyed reading it; feel free to share your comments/thoughts/feedback in the comment section.

Source

 

 

Thanks for reading !!!

Bio: Nagesh Singh Chauhan is a Data Science enthusiast. Interested in Big Data, Python, Machine Learning.

Original. Reposted with permission.

A Dive Into Decision Trees. How do Decision Trees work? | by Abhijit Roy


How do Decision Trees work?

Decision Trees are some of the most used machine learning algorithms. They are used for both classification and regression. They can be used for both linear and non-linear data, but they are mostly used for non-linear data. Decision Trees, as the name suggests, work on a set of decisions derived from the data and its behavior. They do not use a linear classifier or regressor, so their performance is independent of the linear nature of the data. Boosting and bagging algorithms have been developed as ensemble models using the basic principle of decision trees, compiled with some modifications to overcome some important drawbacks of decision trees and provide better results. One of the other most important reasons to use tree models is that they are very easy to interpret.

Decision Trees

Decision Trees can be used for both classification and regression. The methodologies are a bit different, though principles are the same. The decision trees use the CART algorithm (Classification and Regression Trees). In both cases, decisions are based on conditions on any of the features. The internal nodes represent the conditions and the leaf nodes represent the decision based on the conditions.

Classification

A decision tree is a graphical representation of all possible solutions to a decision based on certain conditions.

On each step or node of a decision tree, used for classification, we try to form a condition on the features to separate all the labels or classes contained in the dataset to the fullest purity. Let’s see how the idea works.

Say, the table given above is our dataset. If we try to analyze the dataset and create a decision tree based on the dataset, we will obtain something like the tree given below

In the root node, we are using “Feat 2 ≥3” as the condition to segregate our dataset for the first time. We can see if the answer is false, we reach a leaf, that has only the entries with label 3. So, label 3 is completely separated. But, if the answer is True, we reach an intermediate node. In this case, there are 3 entries, 2 of which belong to label 2 and one belongs to label 1. So, this result has impurity as there are two labels mixed. We apply another condition on the intermediate state and obtain the purely separated labels. On Leaf 2, we have obtained only the Label 1 entry and on leaf 3, we have only the Label 2 entries.

Now, the questions that may arise are:

  1. How do we fix the deciding value in case of numerical features, for instance, “feat 2≥3”?
  2. How do we decide which features should we use to create our internal condition nodes?
  3. How to decide which feature to use first, for instance, in the previous case, we had feature 1 and feature 2, so, which feature should we use as the root?

We will see the answers to these questions as we move forward.

Before we jump into the details of those answers let’s look at some definitions which will play major roles in answering the questions.

Entropy: It gives the measure of impurity or randomness in the data. It is given by:

Entropy = -P(class 1) x log(P(class 1)) - P(class 2) x log(P(class 2))

where P denotes the probability.

If there are two classes, class 1 and class 2, of equal numbers, i.e, the number of entries of class 1 is equal to the number of entries of class 2, and we randomly select an entry, it will belong to any class 1 or 2 with a 50% probability each. In such cases, the entropy will be high.

If a certain dataset has all data belonging to either class 1 or class 2, the entropy obtained is 0, as in that case P(class1) or P(class2) will be equal to 0. If P(class1) = 0 then P(class2) should be equal to 1. So, it is evident that the entropy will be high if we have impure or mixed class labels in a dataset.

Entropy H(x) plotted against Pr(x==1)

The above diagram shows the variation of the probability of labels with entropy. We can see if the probability of a label is 0.5, the entropy is the maximum.

If in a dataset there are a total of 20 entries or rows, and 14 of them belongs to label 1 and 6 of them belongs to label 2, the entropy will be equal to:

= -P(label 1)·log2(P(label 1)) - P(label 2)·log2(P(label 2))

= -(14/20)·log2(14/20) - (6/20)·log2(6/20)

= 0.881
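A quick check of this arithmetic (entropy here uses log base 2, which is what yields the value above):

import math

p1, p2 = 14 / 20, 6 / 20
print(round(-p1 * math.log2(p1) - p2 * math.log2(p2), 3))   # 0.881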

Information Gain: The information gain is the decrease in the entropy after the dataset is split on the basis of an attribute. Constructing a decision tree depends on finding the attribute that returns the highest information gain. It helps in choosing which feature or attribute will be used to create the deciding internal node at a particular point.

It is given by:

Information Gain = Entropy(S) - [(Weighted average) x (Entropy of each feature)]

Now, how does this actually operate?

Say, this is our dataset. So, we have feature 1 and feature 2; which of them should be our root node? The decision is based on the amount of information gain each of the features provides. So, let's check.

Firstly, we want to check the distribution of the labels based on each feature, in order to get an insight into how much information gain a feature will provide.

For instance for feature 1,

We can see if we select the condition “Feature 1==2”, we can successfully separate label 2 from the dataset purely.

Now, Entropy E(s) = -4/10·log2(4/10) - 6/10·log2(6/10) = 0.97

For Feature 1:

E(feature_1==1) = -3/4·log2(3/4) - 1/4·log2(1/4) = 0.81

E(feature_1==2) = -3/3·log2(3/3) = 0

E(feature_1==3) = -2/3·log2(2/3) - 1/3·log2(1/3) = 0.91

For Feature 1, the weighted average =

4/10·0.81 + 3/10·0 + 3/10·0.91 = 0.597

Information Gain = 0.97 - 0.597 = 0.373
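This arithmetic can be checked in a few lines of Python (illustrative helper; using unrounded intermediate values gives about 0.371, which matches the figure above up to rounding):

import math

def entropy(counts):
    # Entropy in bits from a list of class counts
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c)

e_parent = entropy([4, 6])                                            # ~0.971
weighted = 4/10 * entropy([3, 1]) + 3/10 * entropy([3]) + 3/10 * entropy([2, 1])
print(round(e_parent - weighted, 3))                                  # information gain ~0.371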

Again, let’s do the same for feature 2:

For Feature 2:

E(feature_2==1) = -2/3·log2(2/3) - 1/3·log2(1/3) = 0.91

E(feature_2==2) = -2/3·log2(2/3) - 1/3·log2(1/3) = 0.91

E(feature_2==3) = -3/4·log2(3/4) - 1/4·log2(1/4) = 0.81

For Feature 2, the weighted average =

3/10·0.91 + 3/10·0.91 + 4/10·0.81 = 0.87

So, Information Gain for feature 2 = 0.97 - 0.87 = 0.10

Now, we found information gain at the point for feature 1 is larger than that of feature 2. So, we will use feature 1 to form the root node.

Gini index: Gini Index is also a measure of impurity used to build a decision tree using the CART mechanism.

It is given by:

Gini Impurity = 1 - (Probability of 'Class 1')² - (Probability of 'Class 2')²

Let’s see its working also,

Say, this is our distribution on a dataset of 100 entries. A condition applied to Feature 1 gives the above conclusion.

Gini Impurity of Leaf 1: 1 - (35/43)² - (8/43)² = 0.302

Gini Impurity of Leaf 2: 1 - (15/57)² - (42/57)² = 0.38

Now, the overall Gini Impurity for Feature 1:

The weighted average of Gini impurity:

= 43/100·0.302 + 57/100·0.38 ≈ 0.35

Similarly, we calculate the Gini impurity index for other features also.

One thing to notice is that if every entry in a dataset belongs to only 1 class, i.e, either class 1 or class 2,

Gini Impurity = 1 - (1)² - (0)² = 0

So, we can see for a pure distribution the Gini impurity index is 0.

Gini Impurity index can also be used to decide which feature should be used to create the condition node. The feature that results in a smaller Gini impurity index is chosen to create the internal condition node at that point.

We have seen the concepts we need to know in order to understand the working of the decision tree. I think we have found the answers to our second and third questions, i.e., how the features are decided and which feature is used to form the condition for the conclusion. We now need to find the answer to the first question, i.e., how to obtain the values which are used to form the condition in the case of continuous numerical features.

The process to find the best suitable value to create a condition is pretty easy. First, we sort the dataset based on the numerical valued feature in an ascending manner. Next, we find the average of the adjacent pairs of numeric values.

Say, we obtained the averages as shown above. So, avg_1=(val_1+val_2)/2. Next, we fit the average values in the condition to check which provides the minimum Gini Impurity index or which provides the maximum information gain, based on the methods we are using. The average value serves as the cutoff value in our condition formed for the feature.
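A rough sketch of this scan over candidate cutoffs for a single numeric feature (the function name is illustrative; it returns the midpoint with the smallest total sum of squared errors):

import numpy as np

def best_numeric_split(x, y):
    # Scan candidate cutoffs (midpoints of adjacent sorted feature values) and
    # return the one whose two sides have the smallest total sum of squared errors
    order = np.argsort(x)
    x, y = np.asarray(x, dtype=float)[order], np.asarray(y, dtype=float)[order]
    best_cut, best_sse = None, np.inf
    for cut in (x[:-1] + x[1:]) / 2:
        left, right = y[x <= cut], y[x > cut]
        if len(left) == 0 or len(right) == 0:
            continue
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if sse < best_sse:
            best_cut, best_sse = cut, sse
    return best_cut, best_sse

x = [1, 2, 3, 10, 11, 12]
y = [5.0, 6.0, 5.5, 20.0, 21.0, 19.5]
print(best_numeric_split(x, y))   # cutoff around 6.5 with a small total SSE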

Similarly, for discrete value-based or class-based features, we try and fit every value present in the set and create the condition. Finally, we select the value or condition that gives the minimum Gini impurity index. So, this wraps up our first query.

Now, we have seen we consider the Gini impurity index or Information gain to decide which condition we should consider. In some cases, some features show no improvement at all or 0 information gain, such features are never used in a decision tree. This is how the decision tree does automatic feature selection.

One of the chief challenges of the decision tree is that it results in overfitting the data. This is mostly because it creates a condition-based approach to the training data, so it fits very tightly on the training data. The more conditions applied to the training data in the tree, the deeper it grows and the tighter it fits the data. But after a point, it starts taking into consideration very small changes in some features that give very low information gain; mostly these points destroy the generalization of the model and behave like outliers on the test data. We can restrict these conditions by using a threshold on the information gain, such that if the information gain provided by the condition is less than a given value, we won't consider the condition. This partially prevents overfitting and helps create a generalized model.

Pruning is a method of removal of nodes to get the optimal solution and a tree with reduced complexity. It removes branches or nodes in order to create a sub-tree that has reduced overfitting tendency. We will talk about the concept once we are done with Regression trees.

Regression

To understand the concept of regression trees, we must have a clear idea about the concept of regression and linear regression. If needed, feel free to go through my article on regression.

Let’s consider the two conditions:

In the first distribution, we can see that we can easily fit a line to it. So, in this case, we can use linear regression to predict a value. But in the case of the second distribution, it is very evident that we can't fit any particular straight line to the distribution to predict a value, as the distribution behaves very differently in different ranges. In other words, in the second case, the distribution is non-linear. In such non-linear cases, regression trees are used. A point of argument can be that we can use polynomial regression for non-linear data, but sometimes distributions are too complex to analyze and the polynomial degrees hard to decide, so the decision tree's condition-based approach is preferred.

In regression trees, the leaves represent a continuous numeric value in contrast to classification trees which normally represent boolean or discrete values at the leaves.

The above diagram represents the basic structure of the regression trees. The tree grows more complex and difficult to analyze when multiple features chip in and the dimensionality of the feature set increases.

Now, let’s see how we decide which value we should pick for creating the conditions. It is a bit different from what we did in the case of classification trees.

Let's say we have a feature set comprising three features: feature 1, feature 2, and feature 3.

Distribution for Feature 1

Let’s consider the above image as our distribution for feature 1. Y-axis has the values on which we need to fit the regression and the X-axis has the feature values. So, we consider the values shown in the diagram as Val 1, Val 2, and so on, as the cutoff values to form the condition on Feature 1. These values are nothing but the average of feature 1 values of two corresponding points. Now, let's see how the calculations move forward.

Now, let's consider the above diagrams. The diagrams are not to scale, so avg_2 is not equal to avg_4. When we consider val_1 as the cutoff value for feature 1, we take the average of all the Y-values for the points to the left of the line as avg_1, and similarly, we take the average of all the Y-values for the points to the right of the line as avg_2. So, for any value for which the condition feature 1 > val_1 holds true, the tree provides avg_2 as the prediction, and vice versa.

Next, we take the squared errors for all the points. So, for any point to the right of the line, the squared error of the point is (actual value - predicted value (i.e., avg_2))², and similarly, for any point on the left, the squared error is (actual value - predicted value (i.e., avg_1))². We find the squared error for all the points and sum them up. This is called the Sum of Squared Errors (SSE). Say, for value 1 the sum of squared errors is S_1.

Similarly, for value 4, the sum of squared errors is calculated, say S_4. We can observe that S_4 will be much less than S_1. So, we find the sum of squared errors for all the values we considered as candidate cutoffs for the condition. If we plot them, we obtain the plot shown below.

In the above diagram, we can see that the val_4 for Feature 1 provides the minimum sum of squared error for the point. So, now we have seen how the cutoff value for a particular value is decided.

Now, if we have multiple features, how do we decide which feature should be used to create the node? For this, we create candidates. For instance, say for feature 1, val_4 gives the minimum sum of squared errors; it is termed the candidate for feature 1. Similarly, we find candidates for feature 2 and feature 3. Say the SSE corresponding to feature 2's candidate is S_2_3, and that of feature 3's candidate is S_3_5. We compare S_4, S_2_3, and S_3_5 and select the minimum. The feature corresponding to the minimum will be used to create the deciding node.

The above-described method is used recursively to create regression trees. So, with this, we have seen how the CART mechanism works and how classification and regression trees are formed. Now, let's briefly go through a very important concept in trees: pruning.

Pruning

The method is used in trees to reduce overfitting. It is a procedure in which nodes and branches of trees are reduced in order to reduce the depth of the tree.

The above diagram shows how the concept of Pruning works.

There are basically two types of Pruning:

  1. Pre-Pruning
  2. Post Pruning

In pre-pruning, we set parameters like 'min_samples', 'max_depth', and 'max_leaves' during the creation of the tree. All of these parameters need hyperparameter tuning and are found using cross-validation and grid-search methods. This restricts the tree to a certain depth or a certain number of leaves.

Post-pruning methods are mostly done after the tree is already formed. Cost complexity pruning is the most used post-pruning approach.
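As a hedged sketch of cost complexity pruning with scikit-learn (synthetic data used here for illustration), cost_complexity_pruning_path enumerates candidate ccp_alpha values, and a held-out set can pick among them:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

# Candidate alphas from the cost-complexity pruning path of the fully grown tree
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)

# Refit one tree per alpha and keep the alpha that scores best on held-out data
best_alpha = max(path.ccp_alphas,
                 key=lambda a: DecisionTreeClassifier(ccp_alpha=a, random_state=0)
                 .fit(X_train, y_train).score(X_val, y_val))
print('chosen ccp_alpha:', best_alpha)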

Conclusion

Decision trees carry huge importance as they form the base of ensemble learning models in the case of both bagging and boosting, which are among the most used algorithms in the machine learning domain. Again, due to their simple structure and interpretability, decision trees are used in several human-interpretable models like LIME.

In this article, we have seen how decision trees work using the principles of CART.

Hope this helps.

Decision Trees Loginom Wiki

Synonyms: Classification Tree

Sections: Algorithms

A decision tree is a classifier built on the basis of “if, then” decision rules arranged in a tree-like hierarchical structure.

The decision tree is based on the process of recursively splitting the initial set of objects into subsets associated with predefined classes. Partitioning is performed using decision rules, in which attribute values are checked according to a given condition.

Based on supervised learning. A set of observations is used as a training dataset, for which a class label is predefined.

Structurally, a decision tree consists of objects of two types: nodes and leaves. The nodes contain decision rules and the subsets of observations that satisfy them. The leaves contain observations classified by the tree: each leaf is associated with one of the classes, and an object that ends up in a leaf is assigned the corresponding class label.

Visually, nodes and leaves in the tree are clearly distinguishable: the nodes contain rules that split the observations in them, and further branching is performed from them. There are no rules in the leaves; they are marked with the label of the class whose objects fall into that leaf. Leaves do not branch, and they terminate a branch of the tree (which is why they are sometimes called terminal nodes).

If the class assigned by the tree matches the target class, then the object is recognized, otherwise it is unrecognized. The top node of the tree is called the root node. It contains the entire training or production dataset.

The decision tree divides objects in a multidimensional feature space with axis-parallel planes (lines in the two-dimensional case), so the resulting decision boundary is piecewise linear.

The tree shown in the figure solves the problem of classifying objects by two attributes into three classes.

In the figure, the circles represent objects of class 1, the squares represent objects of class 2, and the triangles represent objects of class 3. The feature space is divided by lines into three subsets associated with classes. The same subsets will correspond to the three possible outcomes of the classification. The class "triangles" has unrecognized examples ("squares"), i.e. examples that fall into subsets associated with another class.

Theoretically, the algorithm can generate new partitions until all examples are recognized correctly, i.e. until the subsets associated with leaves become homogeneous in class composition. However, this leads to a complication of the tree: a large number of branches, nodes, and leaves complicates its structure and impairs its interpretability. Therefore, in practice, the size of the tree is limited even at the expense of some loss of accuracy. This process is called decision tree simplification and can be implemented using early stopping and pruning techniques.

Decision trees are greedy algorithms. They can be dichotomous (binary), having only two descendants in a node, and polychotomous, having more than 2 descendants in a node. Dichotomous trees are easier to construct and interpret.

Decision trees have become one of the most popular classification methods in data mining and business intelligence today. Therefore, they are part of almost any analytical software.

A large number of different algorithms for constructing decision trees have been developed. The most famous is the family of algorithms based on the information gain criterion (ID3, C4.5, C5.0), proposed by Ross Quinlan in the early 1980s.

Also widely known is the CART (Classification and Regression Tree) algorithm, which, as the name implies, allows you to solve not only classification problems but also regression problems. The algorithm was proposed by Leo Breiman and co-authors in 1984.

The wide popularity of decision trees is due to the following advantages:

  • the rules in them are formed practically in natural language, which makes the explanatory power of decision trees very high;
  • can work with both numerical and categorical data;
  • require relatively little data preprocessing; in particular, they do not require normalization or the creation of dummy variables, and they can work with missing values;
  • can work with large amounts of data.

At the same time, decision trees have a number of limitations:

  • instability - even small changes in data can lead to significant changes in classification results;
  • because decision tree algorithms are greedy, they are not guaranteed to build an optimal tree;
  • tendency to overfit.

Currently, decision trees continue to develop: new algorithms (CHAID, MARS, Random Forest) and their modifications are being created, and problems of constructing ensembles of models based on decision trees are being studied.

Decision trees are becoming an important tool for business process management and decision support. General principles of work and areas of application are described in the article "Decision Trees: General Principles".

Decision Tree Classifier

Decision Tree Classifier is a simple and widely used classification method that applies a simple idea to solve a classification problem.

The above example shows how we make decisions considering the different dimensions of our lives. The same steps are performed in the decision tree classifier.

Introduction

A decision tree is a simple representation for classifying examples. It is a Supervised machine learning algorithm in which the data is continuously partitioned according to a certain parameter.

The decision tree is one of the most powerful and popular tools for classification and prediction. A decision tree is a flowchart-like tree structure, where each inner node represents an attribute test, each branch represents the result of the test, and each leaf node contains a class label.

The decision tree consists of :

  • Root. This is the topmost node in the decision tree. It learns to partition the data based on attribute values. It partitions the tree in a recursive manner called recursive partitioning.
  • Branch/Subtree. A subsection of a decision tree is called a branch or subtree. It connects two tree nodes.
  • Decision node. When a sub-node is split into further sub-nodes, it is called a decision node.
  • Leaf/terminal node. Nodes with no child nodes (no further splitting) are called leaf or terminal nodes.

How does the decision tree algorithm work?

The basic idea of ​​any decision tree algorithm is:

  1. Select the best attribute (feature) using an attribute selection measure (ASM) to split the records.
  2. Make this attribute a decision node and split the dataset into smaller subsets.
  3. Start building the tree by repeating the process recursively for each child until one of the following conditions is met:
  • All tuples belong to the same attribute value.
  • There are no more attributes left.
  • There are no more instances left.

Before we delve into the concept, let's understand impurity and the types of impurity measures. This will help us better understand the algorithm.

An impurity, as its name implies, is a mixture of two or more things (i.e., heterogeneous) rather than a single component (i.e., homogeneous).

As you can see in the figure, the raw form of data that we will get is impure data (Figure A). By building node by node (Figure B), we try to make it clean at every level, and finally classify it (Figure C).

There are a couple of measures of impurity, but in this story we will only talk about two such measures,

  1. Entropy
  2. Gini index / Gini impurity

Entropy

Entropy is the amount of information needed to accurately describe a sample. So, if the sample is homogeneous (pure), meaning all the elements are the same, the entropy is 0; if the sample is equally divided between the classes, the entropy is at its maximum of 1.

Gini index / Gini impurity

The Gini index is a measure of inequality in a sample. It has a value between 0 and 1. A Gini index of 0 means that the sample is perfectly homogeneous (pure) and all elements are similar, while a Gini index approaching 1 means maximum inequality among elements (impure). It is computed as one minus the sum of the squares of the probabilities of each class. This is illustrated as,

Building a decision tree:

Step 1 . Calculate the entropy of the target.

Step 2. The data set is then split on the different attributes. The entropy for each branch is calculated and added proportionally to get the total entropy for the split. The resulting entropy is subtracted from the entropy before the split. The result is the information gain, i.e., the decrease in entropy.

Step 3 . Select the attribute with the highest information gain as the decision node, split the dataset into its branches, and repeat the same process for each branch.

Step 4a . A branch with entropy 0 is a terminal node.

Step 4b . A branch with entropy greater than 0 needs further splitting.

Now we have learned almost all the necessary intuition behind the decision tree algorithm, so let's look at its code.

# Importing important libraries
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Make a model
model = DecisionTreeClassifier()

# Fit the model with the training data
model.fit(train_x, train_y)

# Depth of the decision tree
print('Depth of the Decision Tree :', model.get_depth())

# Predict the target on the train dataset
predict_train = model.predict(train_x)
print('Target on train data', predict_train)

# Accuracy score on the train dataset
accuracy_train = accuracy_score(train_y, predict_train)
print('accuracy_score on train dataset : ', accuracy_train)

# Predict the target on the test dataset
predict_test = model.predict(test_x)
print('Target on test data', predict_test)

# Accuracy score on the test dataset
accuracy_test = accuracy_score(test_y, predict_test)
print('accuracy_score on test dataset : ', accuracy_test)

Decision Tree - Overfitting

Overfitting is a significant practical challenge for decision tree models and many other predictive models. Overfitting occurs when the learning algorithm continues to develop hypotheses that reduce the training set error at the cost of increasing the test set error. There are several approaches to avoid overfitting when building decision trees.

  • Pre-pruning stops the tree from growing early, before it perfectly classifies the training set.
  • Post-pruning allows the tree to perfectly classify the training set and then prunes it back.

From a practical point of view, the second approach, pruning the tree after it is fully grown, is more successful, because it is not easy to judge exactly when to stop growing a tree. Another way to address this overfitting problem is to use a different method altogether, i.e., a random forest classifier.

Advantages and disadvantages of decision trees in machine learning

A decision tree is used to solve both classification and regression problems. But the main disadvantage of a decision tree is that it usually results in overfitting of data. Let's take a closer look at its advantages and disadvantages.

Benefits of Decision Tree

1. Clear Visualization: The algorithm is easy to understand, interpret and visualize since this idea is mainly used in our daily life. The result of a decision tree can be easily interpreted by humans.

2. Simplicity and ease of understanding. The decision tree looks like simple if-else statements that are very easy to understand.

3. Decision trees can be used for both classification and regression problems.

4. Decision trees can handle both continuous and categorical variables.

5. Feature scaling not required. In the case of a decision tree, no feature scaling (standardization and normalization) is required, because it uses a rule-based approach rather than distance calculations.

6. The decision tree can automatically handle missing values.

7. The decision tree is usually robust to outliers and can handle them automatically.

8. Less training time: The training period is shorter compared to random forest because it builds only one tree, unlike the forest of trees in a random forest.

Decision tree weaknesses

1. Overfitting: This is the main problem of the decision tree. It usually overfits the data, eventually leading to incorrect predictions. To fit the data (even noisy data), the tree keeps generating new nodes, and eventually it becomes too complex to interpret. Thus, it loses its generalization capabilities. It performs very well on the training data but starts to make many mistakes on unseen data.

2. High variance. As mentioned in point 1, a decision tree usually overfits the data. Due to overfitting, there is a very high probability of high variance in the output, which leads to many errors in the final estimates and shows high inaccuracy in the results.

