A Beginner's Guide to Decision Trees

We make many decisions in life. A few of them: choosing to save or spend money, buying a home with a partner, breaking up with a partner, getting a pet, choosing a degree to study, choosing a university, quitting a job, quitting smoking, getting married, having children, buying a new car, and many more. A recent addition – should I get a COVID test?

A Decision Tree is a graphical (inverted tree-like) representation of all the possible solutions to a decision based on certain conditions. These conditions are generally if-then statements. The deeper the tree, the more complex the conditions become. Each condition is a question, and its answers help quantify the information gained by asking it.

Now let’s look at the different parts of a Decision Tree and the terminology used.

1. Decision Node/ Root Node: The starting point of the algorithm, containing the entire data set.

2. Leaf Node: The end point of the algorithm, which carries the decision. It cannot be split further.

3. Branches: Represent the different options/ courses of action available while making a decision. They are commonly drawn as arrow lines.

4. Parent Node: Every node that further gives way to other sub-nodes.

5. Child Node: Nodes arising out of a parent node.

6. Pruning: Removing unwanted branches, thereby reducing the complexity of the model.

All said and done, but how do we pick the best attribute to split on? We are going to use 2 methods – Information Gain and the Gini Index. They are going to help us pick the best attribute!
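Before going deeper, it may help to see the two impurity measures behind these methods side by side. Here is a minimal sketch in plain Python (the helper names `entropy` and `gini` are my own); the Gini index is 1 minus the sum of squared class probabilities, while entropy uses a base-2 logarithm:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gini(labels):
    """Gini index: 1 - sum of squared class probabilities."""
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

# A perfectly mixed two-class node is maximally impure under both measures,
# while a pure node (all one class) scores 0 under both.
print(entropy(["Buy", "Rent", "Buy", "Rent"]))  # 1.0
print(gini(["Buy", "Rent", "Buy", "Rent"]))     # 0.5
```

Both functions agree on which nodes are pure and which are mixed; they differ only in how strongly they penalize mixing.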

Information Gain

Constructing a decision tree is all about finding the attribute that gives the highest Information Gain (IG). It is based on the concept of entropy, which is the degree of uncertainty, impurity or disorder. The aim is to reduce the level of entropy starting from the root node down to the leaf nodes.

Equation of Information Gain

Information Gain = Entropy(parent) – [weighted average] × Entropy(children)

We’ll discuss every part of this formula as we progress.

Now what’s Entropy?

Entropy controls how a decision tree decides to split the data. Computing it is the first step of the algorithm; it is a metric that measures the impurity of the data set.

Equation of Entropy

Entropy = -∑ P(x) log₂ P(x)

*P(x) denotes the probability of class x
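The formula translates almost directly into code. Here is a minimal sketch in plain Python (the function name `entropy_from_probs` is my own) that takes a list of class probabilities; by convention, terms with zero probability contribute nothing:

```python
from math import log2

def entropy_from_probs(probs):
    """Entropy = -sum(P(x) * log2(P(x))), skipping zero probabilities
    (by the convention that 0 * log 0 = 0)."""
    return -sum(p * log2(p) for p in probs if p > 0)

print(entropy_from_probs([0.5, 0.5]))    # 1.0  (maximally impure two-class set)
print(entropy_from_probs([2/3, 1/3]))    # ≈ 0.918
```

Note that entropy is highest (1 bit, for two classes) when the classes are evenly split, and drops to 0 when one class has all the probability.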

Let’s understand these terms and equations better with an example of a major life decision. Right now, I’m on the fence about whether to buy a home or continue renting.

Buying a home is a major life decision, and a major factor in anyone’s decision-making process comes down to finances. Of course, there are many other factors/ attributes, like the location or size of the house, personal circumstances and so on. To keep it simple, I have listed 3 attributes influencing our decision.

Please understand these attributes before we proceed. The first attribute is “Financial Stability” – do you feel confident about your financial situation? Can you pay your bills, and are you debt free, with enough saved for future goals and emergencies? If the answer is yes, then you are Stable!

The next one is “Duration of Stay” – how long do you intend to stay in the same place? The lower limit here is 5 years.

The last one is “Budget” – we need to assess our financial situation and be realistic about it. Can you afford the down payment, taxes, maintenance, repairs and other costs?

The Decision column is the label.

The values under each attribute form the child nodes, and the values under the Decision column form the parent node.

Let’s work with the Financial Stability attribute.

  • There are 4 values under this attribute, each with a corresponding Decision value (label).

  • The values under the Decision column form the parent node (Buy, Rent, Buy, Rent)

  • The values under Financially Stable form the child nodes (Stable, Unstable, Stable, Stable)

  • There are 2 distinct values (Rent, Buy) in the parent node, and 4 values in total (Buy, Rent, Buy, Rent)

  1. P(Rent) = Fraction of Rent values in parent node = 2/4 = 0.5

  2. P(Buy) = Fraction of Buy values in parent node = 2/4 = 0.5

Therefore, Entropy of Parent node:

Entropy (Parent) = - [0.5 log2 (0.5) + 0.5 log2 (0.5)]

= - [-0.5 + (-0.5)]

= 1

Hence the entropy of the parent node is 1. This is the first half of the Information Gain formula!

The next step is to find the entropy of the child nodes.

When the attribute (Financial Stability) is Stable, the Decisions are Buy, Buy, Rent (BBR)

When the attribute (Financial Stability) is Unstable, the Decision is Rent (R)

Now, let's find the entropy of both child nodes (BBR, R) using the entropy formula.

Entropy of left-side child node (BBR) = Entropy of (Buy, Buy, Rent) = - [(2/3) log₂ (2/3) + (1/3) log₂ (1/3)] ≈ 0.918

Entropy of right-side child node (R) = Entropy of (Rent) = 0 (a pure node)

Total No. of examples in parent node (BRBR) = 4

No. of examples in left-side child node (BBR) = 3

No. of examples in right-side child node (R) = 1
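Putting the pieces together, the weighted average of the child entropies (weighted by child size) and the resulting Information Gain for the Financial Stability split can be computed as follows. This is a sketch in plain Python; the variable names are my own:

```python
from math import log2

def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    n = len(labels)
    probs = [labels.count(v) / n for v in set(labels)]
    return -sum(p * log2(p) for p in probs if p > 0)

parent = ["Buy", "Rent", "Buy", "Rent"]  # BRBR, 4 examples
left   = ["Buy", "Buy", "Rent"]          # Stable   -> BBR, 3 examples
right  = ["Rent"]                        # Unstable -> R,   1 example

# Weighted average of the child entropies, weighted by child size
weighted_children = ((len(left) / len(parent)) * entropy(left)
                     + (len(right) / len(parent)) * entropy(right))

info_gain = entropy(parent) - weighted_children
print(round(entropy(parent), 3))    # 1.0
print(round(weighted_children, 3))  # 0.689
print(round(info_gain, 3))          # 0.311
```

So splitting on Financial Stability gains about 0.311 bits of information. To pick the best attribute, the same calculation would be repeated for Duration of Stay and Budget, and the attribute with the highest Information Gain would be chosen for the split.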