How to make a math tree diagram

Probability Tree Diagrams Explained! — Mashup Math

This quick introduction will teach you how to calculate probabilities using tree diagrams.

Figuring out probabilities in math can be confusing, especially since there are many rules and procedures involved. Luckily, there is a visual tool called a probability tree diagram that you can use to organize your thinking and make calculating probabilities much easier.

At first glance, a probability tree diagram may seem complicated, but this page will teach you how to read a tree diagram and how to use them to calculate probabilities in a simple way. Follow along step-by-step and you will soon become a master of reading and creating probability tree diagrams.

What is a Probability Tree Diagram?

Example 01: Probability of Tossing a Coin Once

Let’s start with a common probability event: flipping a coin that has heads on one side and tails on the other:

This simple probability tree diagram has two branches: one for each possible outcome heads or tails. Notice that the outcome is located at the end-point of a branch (this is where a tree diagram ends).

Also, notice that the probability of each outcome occurring is written as a decimal or a fraction on each branch. In this case, the probability for either outcome (flipping a coin and getting heads or tails) is fifty-fifty, which is 0.5 or 1/2.

Example 02: Probability of Tossing a Coin Twice

Now, let’s look at a probability tree diagram for flipping a coin twice!

Notice that this tree diagram is portraying two consecutive events (the first flip and the second flip), so there is a second set of branches.

Using the tree diagram, you can see that there are four possible outcomes when flipping a coin twice: Heads/Heads, Heads/Tails, Tails/Heads, Tails/Tails.

And since there are four possible outcomes, there is a 0. 25 (or ¼) probability of each outcome occurring. So, for example, there is a 0.25 probability of getting heads twice in a row.

How to Find Probability

The rule for finding the probability of a particular event in a probability tree diagram occurring is to multiply the probabilities of the corresponding branches.

For example, to prove that there is 0.25 probability of getting two heads in a row, you would multiply 0.5 x 0.5 (since the probability of getting a heads on the first flip is 0.5 and the probability of getting heads on the second flip is also 0.5).

0.5 x 0.5 = 0.25

Repeat this process on the other three outcomes as follows, and then add all of the outcome probabilities together as follows:

Note that the sum of the probabilities of all of the outcomes should always equal one.

From this point, you can use your probability tree diagram to draw several conclusions such as:

·       The probability of getting heads first and tails second is 0. 5x0.5 = 0.25

·       The probability of getting at least one tails from two consecutive flips is 0.25 + 0.25 + 0.25 = 0.75

·       The probability of getting both a heads and a tails is 0.25 + 0.25 = 0.5

Independent Events and Dependent Events

What is an independent event?

Notice that, in the coin toss tree diagram example, the outcome of each coin flip is independent of the outcome of the previous toss. That means that the outcome of the first toss had no effect on the probability of the outcome of the second toss. This situation is known as an independent event.

 What is a dependent event?

Unlike an independent event, a dependent event is an outcome that depends on the event that happened before it. These kinds of situations are a bit trickier when it comes to calculating probability, but you can still use a probability tree diagram to help you.

Let’s take a look at an example of how you can use a tree diagram to calculate probabilities when dependent events are involved.

How to Make a Tree Diagram

Example 03:

Greg is a baseball pitcher who throws two kinds of pitches, a fastball, and a knuckleball. The probability of throwing a strike is different for each pitch:

·       The probability of throwing a fastball for a strike is 0.6

·       The probability of throwing a knuckleball for a strike 0.2

Greg throws fastballs more frequently than he throws knuckleballs. On average, for every 10 pitches he throws, 7 of them are fastballs (0.7 probability) and 3 of them are knuckleballs (0.3 probability).

So, what is the probability that the pitcher will throw a strike on any given pitch?

 To find the probability that Greg will throw a strike, start by drawing a tree diagram that shows the probability that he will throw a fastball or a knuckleball

The probability of Greg throwing a fastball is 0. 7 and the probability of him throwing a knuckleball is 0.3. Notice that the sum of the probabilities of the outcomes is 1 because 0.7 + 0.3 is 1.00.

 Next, add branches for each pitch to show the probability for each pitch being a strike, starting with the fastball:

Remember that the probability of Greg throwing a fastball for a strike is 0.6, so the probability of him not throwing it for a strike is 0.4 (since 0.6 + 0.4 = 1.00)

Repeat this process for the knuckleball:

Remember that the probability of Greg throwing a knuckleball for a strike is 0.2, so the probability of him not throwing it for a strike is 0.8 (since 0.2 + 0.8 = 1.00)

Now that the probability tree diagram has been completed, you can perform your outcome calculations. Remember that the sum of the probability outcomes has to equal one:

Since you are trying to figure out the probability that Greg will throw a strike on any given pitch, you have to focus on the outcomes that result in him throwing a strike: fastball for a strike or knuckleball for a strike:

The last step is to add the strike outcome probabilities together:

0. 42 + 0.06 = 0.48

 The probability of Greg throwing a strike is 0.48 or 48%.

Probability Tree Diagrams: Key Takeaways

·      A probability tree diagram is a handy visual tool that you can use to calculate probabilities for both dependent and independent events.

·      To calculate probability outcomes, multiply the probability values of the connected branches.

·      To calculate the probability of multiple outcomes, add the probabilities together.

·      The probability of all possible outcomes should always equal one. If you get any other value, go back and check for mistakes.


Check out the animated video lessons and keep

Check out the video lessons below to learn more about how to use tree diagrams and calculating probability in math:

Have thoughts? Share your input in the comments section below!

(Never miss a Mashup Math blog--click here to get our weekly newsletter!)

By Anthony Persico

Anthony is the content crafter and head educator for YouTube's MashUp Math. You can often find me happily developing animated math lessons to share on my YouTube channel . Or spending way too much time at the gym or playing on my phone.


Probability Tree Diagrams: Examples, How to Draw

Probability > How to Use a Probability Tree

Probability trees are useful for calculating combined probabilities for sequences of events. It helps you to map out the probabilities of many possibilities graphically, without the use of complicated probability formulas.

Watch the video for an example.

How to draw a probability tree

Watch this video on YouTube.

Can’t see the video? Click here.

Why Use a probability tree?
Sometimes you don’t know whether to multiply or add probabilities. A probability tree makes it easier to figure out when to add and when to multiply. Plus, seeing a graph of your problem, as opposed to a bunch of equations and numbers on a sheet of paper, can help you see the problem more clearly.

Parts of a Probability Tree Diagram

A probability tree has two main parts: the branches and the ends(sometimes called leaves). The probability of each branch is generally written on the branches, while the outcome is written on the ends of the branches.

Multiplication and Addition
Probability Trees make the question of whether to multiply or add probabilities simple: multiply along the branches and add probabilities down the columns. In the following example (from Yale University), you can see how adding the far right column adds up to 1, which is what we would expect the sum total of all probabilities to be:
.9860 + 0.0040 + 0.0001 + 0.0099 = 1

Real Life Uses

Probability trees aren’t just a theoretical tool used the in the classroom—they are used by scientists and statisticians in many branches of science, research and government. For example, the following tree was used by the Federal government as part of an early warning program to assess the risk of more eruptions on Mount Pinatubo, an active volcano in the Philippines.
Image: USGS.

How to Use a Probability Tree or Decision Tree

Sometimes, you’ll be faced with a probability question that just doesn’t have a simple solution. Drawing a probability tree (or tree diagram) is a way for you to visually see all of the possible choices, and to avoid making mathematical errors. This how to will show you the step-by-step process of using a decision tree.
How to Use a Probability Tree: Steps
Example question: An airplane manufacturer has three factories A B and C which produce 50%, 25%, and 25%, respectively, of a particular airplane. Seventy percent of the airplanes produced in factory A are passenger airplanes, 25% of those produced in factory B are passenger airplanes, and 25% of the airplanes produced in factory C are passenger airplanes. If an airplane produced by the manufacturer is selected at random, calculate the probability the airplane will be a passenger plane.

Step 1:Draw lines to represent the first set of options in the question (in our case, 3 factories). Label them: Our question lists A B and C so that’s what we’ll use here.

Step 2: Convert the percentages to decimals, and place those on the appropriate branch in the diagram. For our example, 50% = 0.5, and 25% = 0.25.

Step 3: Draw the next set of branches. In our case, we were told that 70% of factory A’s output was passenger. Converting to decimals, we have 0.7 P (“P” is just my own shorthand here for “Passenger”) and 0.3 NP (“NP” = “Not Passenger”).

Step 4:Repeat step 3 for as many branches as you are given.

Step 5: Multiply the probabilities of the first branch that produces the desired result together. In our case, we want to know about the production of passenger planes, so we choose the first branch that leads to P.

Step 6: Multiply the remaining branches that give the desired result. In our example there are two more branches that can lead to P.

Step 6: Add up all of the probabilities you calculated in steps 5 and 6. In our example, we had:

.35 + .0625 + .0625 = .475

That’s it!

Example 2

Example Question: If you toss a coin three times, what is the probability of getting 3 heads?

The first step is to figure out your probability of getting a heads by tossing the coin once. The probability is 0.5 (you have a 50% probability of tossing a heads and 50% probability of tossing a tails). Those probabilities are represented at the ends of each branch.

Next, add two more branches to each branch to represent the second coin toss. The probability of getting two heads is shown by the red arrow. To get the probability, multiply the branches:
0.5 * 0.5 = 0.25 (25%).
This makes sense because your possible results for one head and one tails is HH, HT, TT, or TH (each combination has a 25% probability).

Finally, add a third row (because we were trying to find the probability of throwing 3 heads). Multiplying across the branches for HHH we get:
0.5 * 0.5 * 0.5 = 0.125, or 12.5%.

In most cases, you will multiply across the branches to get probabilities. However, you may also want to add vertically to get probabilities. For example, if we wanted to find out our probability of getting HHH OR TTT, we would first calculated the probabilities for each (0.125) and then we would add both those probabilities: 0.125 + 0.125 = 0.250.

Tip: You can check you drew the tree correctly by adding vertically: all the probabilities vertically should add up to 1.

Next: Tree Diagram Real Life Example


Punongbayan, R. et al. USGS Repository: Eruption Hazard Assessments and Warnings.

Stephanie Glen. "Probability Tree Diagrams: Examples, How to Draw" From Elementary Statistics for the rest of us!


Need help with a homework or test question? With Chegg Study, you can get step-by-step solutions to your questions from an expert in the field. Your first 30 minutes with a Chegg tutor is free!

Comments? Need to post a correction? Please Contact Us.

Decision trees - CART mathematical apparatus. Part 1

The general principle of constructing decision trees was given in the article "Decision Trees - Basic Principles of Operation".

This article will focus on the CART algorithm. CART, short for Classification And Regression Tree, is a binary decision tree algorithm first published by Briman et al. in 1984 [1]. The algorithm is designed to solve classification and regression problems. There are also several modified versions - IndCART and DB-CART algorithms. The IndCART algorithm, which is part of the Ind package, differs from CART in using a different way of handling missing values, does not perform the regression part of the CART algorithm, and has different cutoff parameters. The DB-CART algorithm is based on the following idea: instead of using the training dataset to determine splits, use it to estimate the distribution of input and output values, and then use this estimate to determine splits. DB, respectively, means - "distribution based". This idea is claimed to result in a significant reduction in classification error compared to standard tree building methods. The main differences between the CART algorithm and the ID3 family algorithms are:

  • binary representation of the decision tree;
  • partition quality evaluation function;
  • tree pruning mechanism;
  • algorithm for handling missing values;
  • building regression trees.

Binary representation of the decision tree

In the CART algorithm, each decision tree node has two children. At each step of building the tree, the rule formed in the node divides the given set of examples (training set) into two parts - the part in which the rule is true (child - right) and the part in which the rule is not true (child - left). To select the optimal rule, the function of estimating the quality of the partition is used.

Proposed algorithmic solution.

It is quite obvious. Each node (structure or class) must have references to two descendants Left and Right - similar structures. The node must also contain a rule identifier (for more details on the rules, see below), describe in some way the right side of the rule, contain information about the number or ratio of examples of each class of the training sample "passed" through the node, and have the sign of a terminal node - a leaf. These are the minimum requirements for the structure (class) of a tree node.

Partition quality evaluation function

Decision tree training refers to the supervised learning class, that is, the training and test samples contain classified sets of examples. The evaluation function used by the CART algorithm is based on the intuitive idea of ​​reducing impurity (uncertainty) in a node. Consider a problem with two classes and a node that has 50 instances of one class each. The node has the maximum "impurity". If a partition is found that splits the data into two subgroups of 40:5 examples in one and 10:45 in the other, then intuitively the "impurity" will decrease. It will completely disappear when a split is found that will create subgroups 50:0 and 0:50. In the CART algorithm, the idea of ​​"impurity" is formalized in index 92$ ,

where p i is the probability (relative frequency) of class i in T .

If the set t is divided into two parts t 1 and t 2 with the number of examples in each N 1 and N 2 , :

$$Gini_{split}\,(T) = \frac{N_1}{N}\,\cdot\,Gini\,(T_1)\,+\,\frac{N_2}{N}\, \cdot\,Gini\,(T_2)$$

The best split is the one for which Gini split (T) is minimal. 2\Biggr)\,\rightarrow\,\min$$ 92\rightarrow\max$$

As a result, the best partition is the one for which the value is maximum . Less commonly, the CART algorithm uses other partitioning criteria Twoing, Symmetric Gini, etc., see [1] for more details.

Partitioning rules

The vector of predictor variables supplied to the input of the tree can contain both numeric (ordinal) and categorical variables. In any case, at each node, the partition goes only through one variable . If the variable is of a numeric type, then a rule of the form 9 is formed in the node0031 x i <= c . Where with is some threshold, which is most often chosen as the arithmetic mean of two adjacent ordered values ​​of the variable x i of the training sample. If the variable is of categorical type, then the rule x i V(x i ) is formed in the node, where V(x i ) is some non-empty subset of the set of values ​​of the variable x i in the training set. Therefore, for n values ​​of a numeric attribute, the algorithm compares n-1 partitions, and for a categorical one (2 n-1 – 1). At each step of tree construction, the algorithm sequentially compares all possible partitions for all attributes and selects the best attribute and the best partition for it.

Proposed algorithmic solution.

Let's agree that the source of the data necessary for the operation of the algorithm is represented as a flat table. Each row of the table describes one example of a training/test set.

Each step of building a tree actually consists of a set of three laborious operations.

The first is sorting the data source by column. Required to calculate the threshold when the attribute currently being considered is of a numeric type. At each step of building a tree, the number of sorts will be at least equal to the number of attributes of a numeric type.

The second is the separation of the data source. After the best partition is found, it is necessary to split the data source in accordance with the rule of the formed node and recursively call the construction procedure for the two halves of the data source.

Both of these operations are connected (if acted directly) with the movement of significant amounts of memory. Here, the data source is intentionally not called a table, since you can significantly reduce the time spent on building a tree if you use an indexed data source. Access to data in such a source occurs not directly, but through logical indexes of data rows. You can sort and split such a source with minimal performance loss.

The third operation, which takes 60–80% of the program execution time, is the calculation of indices for all possible partitions. If you have n - numeric attributes and m - examples in the sample, then you get a table of n * (m-1) - indexes, which takes up a large amount of memory. This can be avoided by using one column for the current attribute and one row for the best (maximum) indexes for all attributes. You can even use only a few numeric values, resulting in a fast, but poorly readable code. You can significantly increase productivity if you use that L \u003d N - R, l i = "n i " – r i , while l i and r i always change and only by one when moving to the next line for the current attribute. That is, counting the number of classes, and this is the main operation, will be performed quickly if you know the number of instances of each class in total in the table and, when moving to a new row in the table, change by one only the number of instances of one class - the class of the current example.

It is convenient to represent all possible partitions for categorical attributes by analogy with the binary representation of a number. If the attribute has n unique values. 2 n - partitions. The first (where all zeros) and the last (all ones) are of no interest to us, we get 2 n - 2. And since the order of the sets is also unimportant here, we get (2 n - 2) / 2 or (2 n- 1 - 1) the first (from one) binary representations. If {A, B, C, D, E} are all possible values ​​of some attribute X, then for the current partition, which has a representation, say {0, 0, 1, 0, 1}, we obtain the rule X in {C, E} for the right branch and [ not {0, 0, 1, 0, 1} = {1, 1, 0, 1, 0} = X in {A, B, D} ] for the left branch.

Often, attribute values ​​of a categorical type are represented in the database as string values. In this case, it is faster and more convenient to create a cache of all attribute values ​​and work not with values, but with indexes in the cache.

Tree pruning

Tree pruning, originally called minimal cost-complexity tree pruning, is the most significant difference between the CART algorithm and other tree building algorithms. CART sees pruning as getting a compromise between two problems: getting an optimal tree size and getting an accurate estimate of the misclassification probability.

The main pruning problem is the large number of all possible pruned subtrees for a single tree. More precisely, if a binary tree has | T | – sheets, then there exists ~[1.5028369 |T| ] pruned subtrees. And if the tree has at least 1000 leaves, then the number of pruned subtrees becomes simply huge.

The basic idea of ​​the method is not to consider all possible subtrees, limiting ourselves only to the "best representatives" according to the estimate below.

Denote | T | – number of tree leaves, R(T) – tree classification error equal to the ratio of the number of incorrectly classified examples to the number of examples in the training set. Let's define $C_{\alpha}\,(T)$ – total cost (estimate/cost-complexity indicator) of the tree T as:

$C_{\alpha}\,(T) = R\,(T)\, +\,\alpha\,*\,|T|$, where | T | is the number of leaves (terminal nodes) of the tree, is some parameter that varies from 0 to + $\infty$. The total cost of a tree consists of two components - the tree classification error and the penalty for its complexity. If the classification error of a tree is constant, then as the value increases, the total cost of the tree will increase. Then, depending on , a less branched tree that gives a larger classification error may cost less than a tree that gives a smaller error, but more branched.

Let's define T max - the maximum size of the tree to be cut. If we fix the value of $\alpha$, then there is a smallest minimizable subtree $\alpha$ that satisfies the following conditions:

  1. $C_{\alpha}\Bigl(T\,(\alpha)\Bigr) = min_{\ ,\,T \leq T_{\max}}\,C_{\alpha}(T)$
  2. $if\,\,C_{\alpha}\,(T) = C_{\alpha}\Bigl(T\,(\alpha)\Bigr)\,then\,\,T\,(\alpha) \leq\,T$

The first condition says that there is no such subtree of the tree T max , which would have a lower cost than $T\,(\alpha)$ for this value of $\alpha$. The second condition says that if there is more than one subtree that has a given total cost, then we choose the smallest tree.

It can be shown that for any value there is such a smallest minimizable subtree. But this task is not trivial. What she is saying is that it cannot be that two trees reach a minimum total cost and they are incomparable, i.e. neither of them is a subtree of the other. We will not prove this result.

Although it has an infinite number of values, there are a finite number of subtrees of the tree Tmax. You can build a sequence of decreasing supports of the tree T Max :

T 1 > T 2 > T 3 > ... (t 1 },

(where T 1 - - root node of the tree) such that T k is the smallest minimizable subtree for $[\alpha_k,\,\alpha_{k+1})$. This is an important result, as it means that we can get the next tree in the sequence by applying pruning to the current tree. This allows us to develop an efficient algorithm for finding the smallest minimized subtree for various values ​​of $\alpha$. The first tree in this sequence is the smallest subtree of the tree T max having the same classification error as T max , i.e. $T_1 = T(\alpha=0)$. Clarification: if the splitting goes on until only one class remains in each node, then T 1 = T max , but since prepruning methods are often used, then there may be a subtree of the tree T max having the same classification error.

Proposed algorithmic solution.

Calculation algorithm T 1 of T max simple. Find any pair of leaves with a common ancestor that can be merged, i.e. truncated to the parent node without increasing the classification error. R(t) = R(l) + R(r) , where r and l are leaves of node t . Continue until there are no more pairs left. So we get a tree that has the same cost as T max at = 0, but less branching than T max .

How do we get the next tree in the sequence and the corresponding $\alpha$ value? Let T t be the branch of the tree T with the root node t . For what values ​​will the tree T - T t be better than T ? If we cut at node t , then its contribution to the total cost of the tree T – T t becomes $C_{\alpha}(\{t\}) = R\,(t)\,+\, \alpha$, where R(t) = r(t)* p(t) , r(t) is the classification error of node t and p(t) is the proportion of cases that "passed" through node t . Alternative: R(t)= m/n , where m is the number of examples classified incorrectly, and n is the total number of classified examples for total trees.

The contribution T t to the total cost of tree T is $C_{\alpha}(T_t) = R\,(T_t)\,+\,\alpha\,|T_t|$, where

$R\,(T_t) = \sum_{{t'}\in {T_t}}{R(t')}$

Tree T – T t will be better than T when $C_{\alpha}\{(t\}) = C_{\alpha}\,(T_t)$, because with this they have the same value in magnitude, but T – T t is the smaller of the two. When $C_{\alpha}\{(t\}) = C_{\alpha}\,(T_t)$ we get:

$$R\,(T_t)\,+\,\alpha\,|T_t | = R\,(t)\,+\,\alpha$$


So for any node t to T 1 if we increase $\alpha$ then when

$$\alpha = \frac{R\,(t)\,-\,R\ ,(T_{1,\,t})}{|T_{1,\,t}|\,-\,1}$$

the tree obtained by pruning at node t will be better than T 1 .

The main idea is the following: we calculate this value for each node in the tree T 1 , and then we select "weak ties" (there may be more than one), i.e. nodes for which the value is

$$g\,(t) = \frac{R\,(t)\,-\,R\,(T_{1,\,t})}{|T_{1,\,t} |\,-\,1}$$

is the smallest. We prune T 1 at these nodes to get T 2 , the next tree in the sequence. Then we continue this process for the resulting tree and so on until we get the root node (a tree with only one node).

Algorithm for calculating a sequence of trees.

$T_1 = T\,(\alpha = 0)$

$\alpha_1 = 0$

k = 1

while T k > {root node} do begin

for all nonterminal nodes (!leaves) in t T k

$$g\,(t) = \frac {R\,(t)\,-\,R\,(T_{k,\,t})}{|T_{k,\,t}|\,-\,1}$$

$\alpha_{k+1} = min_{t}\,\,g_k(t)$

Traverse from top to bottom all nodes and truncate those where {k+1}$

k = k + 1


Nodes must be traversed from top to bottom so as not to cut off nodes that are cut off by themselves, as a result of cutting off the n-th ancestor.

Consider the method in more detail using an example.

As an example, let's calculate the sequence of subtrees and corresponding values ​​for the tree shown in fig. 1.

Obviously, T 1 = T max , since all sheets contain examples of the same class and cutting off any node ( t 3 or t 5 leads to an increase in the error) 9004 5 903 classification.
Then we calculate g 1 (t) for all nodes t to t 1 .

$$g\,(t_1) = \frac{R\,(t_1)\,-\,R\,(T_{1,\,t_1})}{|T_{1,\,t_1}| \,-\,1}$$

R(t 1 ) – classification error. If you turn t 1 into a leaf, then you should match it with some class; since the number of examples of both classes is the same (equal to 100), then the class is chosen at random, in any case it incorrectly classifies 100 examples. In total, the tree was trained on 200 examples (100+100=200 for the root node). R(t 1 ) = m/n. m=100, n=200. R(t 1 ) = 1/2.

$R\,(T_1,\,t_1)$ is the sum of errors of all leaves of subtree . It is calculated as the sum over all sheets of the ratio of the number of misclassified examples in sheet to the total number of examples for in tree . In the example, we divide everything by 200. Since for the subtree rooted at t 1 , it is the same tree T 1 , all leaves do not have misclassified examples, so 95\frac{0}{200} = 0$$

$|T_1,\,t_1|$ number of leaves of subtree rooted at node t 1 . There are 5 in total.

We get:

g 1 (t 1 ) = (1/2 - 0)/(5 - 1) = 1/8.
g 1 (t 2 ) = 3/20.
g 1 (t 3 ) = 1/20.
g 1 (t 5 ) = 1/20.

Nodes t 3 and t 5 both store the minimum value g 1 , we get a new tree T 2 by cutting T 1 at both of these nodes. We got the tree shown in Figure 2.

Next, we continue to calculate g - the value for T 2 .

g 2 (t 1 ) = (100/200 - (0/200 + 10/200 + 10/200)) / (3 - 1) = 2/10.

g 2 (t 2 ) = (60/200 - (0/200 + 10/200)) / (2 - 1) = 1/4.

The minimum is stored in the node t 1 , so we truncate at t 1 and get the root of the tree (T 3 = {t 1 ) This completes the cutting process.

The sequence of values ​​is: $\alpha$ 1 = 0, $\alpha$ 2 = 1/20, $\alpha$ 3 = 2/10 . So T 1 is the best tree for [0, 1/20) , T 2 – for [1/20, 2/10) and

Selection of the final tree

So, we have a sequence of trees, we need to choose the best tree from it. The one we will use in the future. The most obvious is the choice of the final tree through testing on a test sample. The tree with the lowest classification error is the best tree. However, this is not the only possible way.

For a continuation of the description, see the article "Description of the CART algorithm. Part 2".


  • L. Breiman, J.H. Friedman, R.A. Olshen, and C.T. Stone. Classification and Regression Trees. Wadsworth, Belmont, California, 1984.
  • J.R. Quinlan. C4.5 Programs for Machine Learning. Morgan Kaufmann, San Mateo, California, 1993.
  • Machine Learning, Neural and Statistical Classification. Editors: D. Michie, D.J. Spiegelhalter, C.C. Taylor, 02/17/1994.

Decision trees - C4.5 mathematical apparatus | Part 1

Deconstructing decision tree learning algorithm C4.5: requirements for training dataset and classification of new objects.

This article will consider the mathematical apparatus of the learning algorithm for decision trees C4.5. The algorithm was proposed by R. Quinlan as an improved version of the ID3 algorithm, which added the ability to work with missing data. The basic building ideas were described in the article Decision Trees: General Principles.

Before proceeding to the description of the algorithm, let's define the mandatory requirements for the structure of the training data set and directly to the data itself, under which the C4.5 algorithm will work and give correct results.

  1. Data must be structured, i.e. be a table whose columns are attributes (features) that describe a subject area or business process, and rows are training examples that are classified objects for which a class label is given (since the algorithm uses supervised learning). All rows must contain the same set of attributes.
  2. One of the attributes must be specified as the target, i.e. class attribute. Each training example must have a class label. Input attributes can be either continuous or discrete, while a class attribute can only be discrete, i.e. take a finite number of unique values.
  3. Each instance of the training set must uniquely refer to the corresponding class. Probabilistic estimates of the degree of belonging of examples to a class are not used (such a formulation refers to fuzzy decision trees). The number of classes in the training set should be much less than the number of training examples.

Description of the learning algorithm

Let a learning set S be given, containing m attributes and n examples. For the set S, k classes C_1,C_2,…C_k are defined. The task is to build a hierarchical classification model in the form of a decision tree based on the training set S.

The decision tree is built from top to bottom - from the root node to the leaves.

At the first step of training, an "empty" tree is formed, which consists only of the root node containing the entire training set. It is required to split the root node into subsets from which descendant nodes will be formed. To do this, one of the attributes is selected and rules are formed that break the training set into subsets, the number of which is equal to the number p of unique attribute values.

As a result of splitting, p (according to the number of attribute values) subsets are obtained and, accordingly, p descendants of the root node are formed, each of which is assigned its own subset. This procedure is then recursively applied to all subsets until the training stop condition is met.

The main problem in training decision trees is choosing the attribute that will provide the best split (according to some measure of quality) at the current node. Although some decision tree learning algorithms allow each attribute to be used only once, in our case this restriction will not apply - each attribute can be used for splitting an arbitrary number of times.

Let a partitioning rule be applied to the training set, which uses the attribute A, which takes p values ​​a_1,a_2,…,a_p. As a result, p subsets S_1,S_2,…,S_p will be created, where examples will be distributed in which attribute A takes the corresponding value.

This raises the question: is the split by the selected attribute the best, or could we get a better split by choosing another attribute? To answer this question, we use information about the number of examples of all classes in the training set and in each resulting subset. 9{k} \frac{N(C_jS)}{N(S)}\text{Info}(S_i), (3)

branching attribute, you can use the following criterion:

\text{Gain}(A)=\text{Info}(S)- \text{Info}_A(S), (4)

gain - increase, increase). Then the criterion value is calculated for all potential partition attributes, and the attribute that maximizes it is selected.

The described procedure is applied to subsets S_i and further, until the values ​​of the criterion cease to increase significantly with new partitions or another stop condition is met.

If during tree construction an “empty” node is formed, where no example has fallen, then it is converted into a leaf that is associated with the class most often found in the immediate ancestor of the node.

The information gain criterion is based on the property of entropy, which means that it is the largest when all classes are equally probable, i.e. the class choice is maximally undefined, and is 0 when all instances in the node belong to the same class (in this case, the ratio under the logarithm is 1 and its value is 0). Thus, the increase in information reflects an increase in the class homogeneity of the resulting nodes.

The described procedure applies to discrete attributes. In the case of continuous attributes, the algorithm works a little differently. The threshold against which all values ​​will be compared is selected. Let the numeric attribute X take on a finite set of values ​​{x_1,x_2,…,x_p }. By ordering the examples in ascending order of attribute values, we get that any value between x_i and x_{i+1} divides all examples into two subsets. The first subset will contain the attribute values ​​x_1,x_2,…,x_i, and the second one will contain {x_{i+1},x_{i+2},…,x_p}.

Then the average can be chosen as the threshold: {T_1,T_2,…,T_{n-1}}. Consistently applying formulas (2), (3) and (4) to all potential thresholds, we choose the one that gives the maximum value according to criterion (4). Then, this value is compared with the criterion value (4) calculated for other attributes. If this value is the largest of all attributes, then it is selected as the threshold for testing.

It should be noted that all numerical tests are binary, i.e. divide the tree node into two branches.

Practical use of decision trees

After the decision tree is built on the training data set and a decision is made about its performance (the percentage of correctly recognized examples on the training set is quite large), you can start practical work with the tree - classifying new objects.

The new object to be classified first enters the root node of the tree, and then moves through the nodes, each of which checks if the attribute value matches the rule in this node, after which the object is redirected to one of the descendant nodes.

Learn more