We make several decisions in life. Few of them are choosing to save or spend money, buying a home with a partner, breaking up with partner, getting a pet, choice of degree to study, choosing a University, quitting a job, quitting smoking, getting married, having children, buying a new car and many more.. Recent addition – Should I get a COVID test?
Decision Tree is the graphical (inverted tree-like) representation of all the possible solutions to a decision based on certain conditions. These conditions are generally if-then statements. The deeper the tree, the more complex these conditions will be. Each condition is a question and its answers help quantify the information from the question.
Now let’s look at different parts of a Decision Tree/ Terminologies used.
1. Decision Node/ Root Node: This is the starting point of algorithm with entire data set.
2. Leaf Node: This is end point of the algorithm which carries the decision. It cannot be split further.
3. Branches: Represents different options/ courses of action available while making a decision. They are commonly indicated with an arrow line.
4. Parent Node: Every node that further gives way to other sub-nodes.
5. Child Node: Nodes arising out of a parent node.
6. Pruning: Removing unwanted branches thereby reduces the complexity of the model.
All said and done, but how do we pick the best attribute? We are going to use 2 methods – Information Gain and Gini Index. They are going to help us pick the best attribute!
Construction of decision trees is all about finding an attribute which gives the highest Information Gain (IG). It is based on the concept of entropy, which is the degree of uncertainty, impurity or disorder. It aims to reduce the level of entropy starting from the root node to the leaf nodes.
Equation of Information Gain
Information Gain = Entropy parent – [weighted average] * [Entropy children]
We’ll discuss every part of this formula as we progress.
Now what’s Entropy?
Entropy controls how a decision tree decides to split the data. This is the 1st step in the Algorithm and it is a metric which measures the impurity of the data set.
Equation of Entropy
Entropy = -∑ P (x)log P(x)
*P(x) denotes the probability
Let’s understand these terminologies and equations better with an example of any major life decision. Right now, I’m on a fence about whether to buy a home or continue renting.
Buying a home is a major life decision and a major factor in anyone’s decision making process comes down to finances. Of course, there are many other factors/ attributes like location or size of the house, situations etc... I have kept it simple and listed 3 attributes influencing our decision.
Please understand these attributes before we proceed. First attribute is “Financial Stability” – Do you feel confident with your financial situation? Do you have funds to pay bills and you are debt free and saved enough for future goals and for any emergencies? If the answer is yes, then you are Stable!
Next one is “Duration of Stay” – How long are you going to stay in a same place. If you intend to stay in one place for a long time or not? Lower limit is 5yrs here.
Last one “Budget” - We need to assess our financial situation and be realistic about it. Can you afford down payment, taxes, maintenance, Repairs & others?
Decision is a Label.
Values under every attribute are child node and values under Decision is a parent node.
Let’s work with Financial Stability Attribute.
There are 4 values under this attribute and corresponding value (Label) for every attribute.
Values under Decision column is a parent node (Buy, Rent, Buy, Rent)
Values under Financially Stable is a child node (Stable, Unstable, Stable, Stable)
There are 2 types of values (Rent, Buy) in parent node and 4 values (Buy, Rent, Buy, Rent)
P(Rent) = Fraction of Rent values in parent node = 2/4 = 0.5
P(Buy) = Fraction of Buy values in patent node = 2/4 = 0.5
Therefore, Entropy of Parent node:
Entropy (Parent) = - [0.5 log2 (0.5) + 0.5 log2 (0.5)]
= - [-0.5 + (-0.5)]
Hence Entropy of Parent node is 1. This is the 1st half of the Information Gain formula!
Now next step is to find out Entropy of Child nodes.
When Attribute (Financial Stability) is Stable, Decision is BBR
When Attribute (Financial Stability) is Unstable, Decision is R
Now, let's find out Entropy of both the child nodes (BBR, R) using Entropy formula.
Entropy of left-side child node (BBR) = Buy, Buy, Rent = 0.9
Entropy of right-side child node (R) = Rent = 0
Total No. of examples in parent node (BRBR) = 4
No. of examples in left-side child node (BBR) = 3
No. of examples in right-side child node (R) = 1
[Weighted Average] Entropy (Children) = [0.9 * ¾] + [0 * ¼] = 0.675
This is the 2nd half of Information Gain formula!
Information Gain = Entropy parent – [weighted average] (Entropy children)
So, Information Gain for Financial Stability = 1 – 0.675 = 0.325
In similar manner, we calculate IG for Duration of Stay and IG for Budget.
IG (Financial Stability) = 0.325
IG (Duration of Stay) = 0
IG (Budget) = 1
Decision Tree Algorithm constructs a Decision Tree based on Attribute that have highest Information Gain. In our case, best attribute is Budget whose Information gain is the highest compared to other 2 attributes. Hence Budget is the winner!
Gini Index or Gini Impurity measures the probability of an attribute being wrongly labelled when it is randomly chosen. But what does ‘impurity’ mean? The degree of Gini index varies between 0 and 1, where 0 denotes that all elements belong to one class, and 1 denotes that the elements are randomly distributed across various classes. A Gini Index of 0.5 denotes equally distributed elements among some classes. If all the elements belong to a single class, then it can be called pure.
While building the decision tree, we would prefer choosing an attribute with the least Gini index as the root node.
Equation of Gini Index
where pi is the probability of an object being classified to a specific class.
Let’s understand how the Gini Index works using our earlier example.
1. Gini Index for Financial Stability Attribute
If (Financial Stability = Stable & Decision = Buy) = 2/3
If (Financial Stability = Stable & Decision = Rent) = 1/3
Gini Index (Stable) = 1 – [(2/3)^2+ (1/3)^2]
= 1- [0.44 + 0.11] = 0.45
If (Financial Stability = Unstable & Decision = Buy) = 0
If (Financial Stability = Unstable & Decision = Rent) = 1
Gini Index (Unstable) = 1 – [(0)^2+(1)^2 ]
= 1 – 1= 0
Weighted sum of the Gini Indices can be calculated as follows:
Gini Index for Financially Stable = [(3/4)0.45 + (1/4) 0] = 0.3375
2. Gini Index for Duration of Stay Attribute
P (More than 5yrs) = 2/4
If (Duration of Stay= More than 5 yrs & Decision= Buy) = 1/2
If (Duration of Stay= More than 5 yrs & Decision= Rent) = 1/2
Gini Index (More than 5 yrs) = 1 – [(1/2)^2+ (1/2)^2] = 0.5
P (Less than 5yrs) = 2/4
If Duration of Stay=Less than 5 yrs & Decision = Buy =1/2
If Duration of Stay= Less than 5 yrs & Decision = Rent =1/2
Gini Index (Less than 5 yrs)=1 – [(1/2)^2+ (1/2)^2] = 0.5
Weighted sum of the Gini Indice calculations
Gini Index for Duration of Stay = [(2/4)0.5 + (2/4)0.5] = 0.50
3. Gini Index for Budget Attribute
P (Yes) = 2/4
If (Budget = Yes & Decision = Buy) = 2/2
If (Budget = Yes & Decision = Rent) = 0
Gini Index (Yes) = 1 – [(2/2)^2+ (0)^2] = 0
P (No) = 2/4
If (Budget = No & Decision = Buy) = 0
If (Budget = No & Decision = Rent) =2/2
Gini Index (No)= 1 – [(1/2)^2 + (1/2)^2] = 0
Weighted sum of the Gini Indice calculations
Gini Index for Budget = [(2/4)0 + (2/4)0] = 0
From our calculations, we observe that Budget has the lowest Gini Index and hence it will be chosen as the root node for decision tree. Budget is again the winner!
Gini Index, unlike Information Gain, isn’t computationally intensive as it doesn’t involve the logarithm function used to calculate entropy in information gain, which is why Gini Index is preferred over Information Gain. Using Decision Tree, we arrived at a conclusion that Budget is the best attribute to consider while deciding to buy a home.
There are a few pros and cons that come along with the decision trees. Let’s discuss the advantages first.
Decision trees takes less time in processing the data.
Even with few missing values in the dataset, the performance of the model won’t be affected.
A Decision Tree model is intuitive and easy to explain to the technical teams and to stakeholders too. Hence, it can be implemented across several organizations.
Here comes the disadvantages.
In decision trees, small changes in the data can cause a large change in the structure of the decision tree that in turn leads to instability.
Training time drastically increases, proportional to the size of the dataset. Calculations can go complex compared to the other traditional algorithms.
There is still a lot more to learn. Hope this article gave you a quick start to explore other advanced classification algorithms.