# Decision Tree, Information Gain and Gini Index for Dummies

A Decision Tree is a diagram or chart that people use to determine a course of action or to show a statistical probability. Each path through it represents a possible decision, outcome or reaction, and leads to an end result.

A Decision Tree is a structure that includes decision nodes, branches, and leaf nodes. Each internal node signifies a test on an attribute, each branch denotes the outcome of a test, and each leaf node holds a class label.
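The structure described above can be sketched in a few lines of Python. This is only an illustration: the `Node` class and the `classify` helper are hypothetical names, and the tiny tree borrows the Lifestyle attribute from the insomnia example used later in this article.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Node:
    # An internal (decision) node stores the attribute it tests.
    attribute: Optional[str] = None
    # A leaf node stores a class label instead.
    label: Optional[str] = None
    # Each branch maps one outcome of the test to a child node.
    branches: Dict[str, "Node"] = field(default_factory=dict)


def classify(node: Node, example: Dict[str, str]) -> str:
    # Follow the branch matching the example's value until a leaf is reached.
    while node.label is None:
        node = node.branches[example[node.attribute]]
    return node.label


# Hypothetical one-level tree: one decision node, two branches, two leaves.
tree = Node(
    attribute="Lifestyle",
    branches={"Good": Node(label="Acute"), "Bad": Node(label="Hyper")},
)

print(classify(tree, {"Lifestyle": "Good"}))  # Acute
```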

Decision Trees are used to find an answer to a complex problem by breaking it down into a sequence of simple tests.

To decide which attribute to test at each node, the following measures can be used:

1. Information Gain

2. Gini Index

3. Gain Ratio

Let us further understand how to calculate Information Gain and Gini Index with the help of an example.

Insomnia is a sleep disorder marked by problems getting to sleep, staying asleep and sleeping as long as one would like. It can have serious effects, leading to excessive lethargy, a higher risk of accidents and health effects from sleep deprivation.

A few major causes of Insomnia are irregular sleep schedule, unhealthy eating habits, illness and pain, bad lifestyle and stress. The following is the data we will use:

__Information Gain__ - It is the main criterion that decision tree algorithms use to choose which feature to split on when constructing the tree. It measures how much information a feature gives us about the class.

Information Gain = entropy(parent) – [weighted average] entropy(children)

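Entropy – It is a measure of the impurity in a group of examples. It is 0 when all examples belong to the same class and 1 when two classes are evenly mixed. Information gain is the decrease in entropy produced by a split, where each child's entropy is weighted by the fraction of the parent's examples it contains.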
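As a minimal sketch of these two definitions, assuming entropy is the usual Shannon entropy in bits (the sum of -p * log2(p) over the classes), the functions below work on plain lists of class labels. The function names are illustrative, not taken from any library.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return sum(-(n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(parent, children):
    """Parent entropy minus the size-weighted average entropy of the children."""
    total = len(parent)
    weighted = sum(len(child) / total * entropy(child) for child in children)
    return entropy(parent) - weighted


# A parent with two evenly mixed classes, split into two pure children.
parent = ["Acute", "Acute", "Hyper", "Hyper"]
print(entropy(parent))  # 1.0
print(information_gain(parent, [["Acute", "Acute"], ["Hyper", "Hyper"]]))  # 1.0
```

A split that leaves both children as mixed as the parent gives a gain of 0, which is the other extreme.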

1. Analysis of Sleep Schedule feature:

Let us take the values of the Insomnia column as the parent node (Acute, Acute, Hyper, Hyper) and split it by the values of the Sleep Schedule column into child nodes (On Track, On Track, On Track, Off Track).

P(Acute) = fraction of Acute examples in parent node = 2/4 = 0.5

P(Hyper) = fraction of Hyper examples in parent node = 2/4 = 0.5

Therefore, Entropy of Parent Node:

= - [0.5 log2 (0.5) + 0.5 log2 (0.5)]

= - [- 0.5 + (- 0.5)]

= 1

Entropy of left child node (On Track: Acute, Acute, Hyper) = - [2/3 log2 (2/3) + 1/3 log2 (1/3)] ≈ 0.918, rounded to 0.9 below

Entropy of right child node (Off Track: Hyper) = 0

[Weighted Average] Entropy (children) = 3/4 * 0.9 + 1/4 * 0

= 0.675

Information Gain of Sleep Schedule

= entropy(parent) – [weighted average] entropy(children)

= 1 – 0.675

Information Gain = 0.325 (about 0.311 if the child entropy is not rounded)

2. Analysis of Eating Habits feature:

Entropy of left child node (Healthy: Acute, Hyper) = 1

Entropy of right child node (Unhealthy: Acute, Hyper) = 1

Information Gain = Entropy(parent) – [Weighted average] Entropy (children)

= 1 - (2/4 * 1 + 2/4 * 1)

= 1 - 1

Information Gain = 0

3. Analysis of Lifestyle feature:

Entropy of left child node (Good: Acute, Acute) = 0

Entropy of right child node (Bad: Hyper, Hyper) = 0

Information Gain = Entropy(parent) – [Weighted average] Entropy (children)

= 1 - (2/4 * 0 + 2/4 * 0)

= 1 - 0

Information Gain = 1

4. Analysis of Stress feature:

Entropy of left child node (Stressed: Acute, Hyper) = 1

Entropy of right child node (Relaxed: Acute, Hyper) = 1

Information Gain = Entropy(parent) – [Weighted average] Entropy (children)

= 1 - (2/4 * 1 + 2/4 * 1)

= 1 - 1

Information Gain = 0

As per the calculations above, the information gain of Sleep Schedule is 0.325, Eating Habits is 0, Lifestyle is 1 and Stress is 0. The Decision Tree algorithm splits first on the feature that has the highest information gain. In our case that is Lifestyle, with an information gain of 1.
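The four analyses can be reproduced with a short script. This is a sketch under one assumption: each child node is represented only by its (Acute, Hyper) counts, read off the analyses above, since those counts are all the calculation needs. The names `splits` and `gains` are illustrative. Without rounding the child entropy to 0.9, Sleep Schedule comes out at about 0.311 rather than 0.325; the ranking of the features is unchanged.

```python
from math import log2


def entropy(counts):
    """Entropy (in bits) of a node given its per-class counts."""
    total = sum(counts)
    return sum(-(c / total) * log2(c / total) for c in counts if c > 0)


def information_gain(parent_counts, child_counts):
    """Parent entropy minus the size-weighted average entropy of the children."""
    total = sum(parent_counts)
    weighted = sum(sum(child) / total * entropy(child) for child in child_counts)
    return entropy(parent_counts) - weighted


# (Acute, Hyper) counts for the parent and for each feature's two child nodes.
parent = (2, 2)
splits = {
    "Sleep Schedule": [(2, 1), (0, 1)],  # On Track, Off Track
    "Eating Habits": [(1, 1), (1, 1)],   # Healthy, Unhealthy
    "Lifestyle": [(2, 0), (0, 2)],       # Good, Bad
    "Stress": [(1, 1), (1, 1)],          # Stressed, Relaxed
}

gains = {name: information_gain(parent, children) for name, children in splits.items()}
for name, gain in gains.items():
    print(f"{name}: {gain:.3f}")

# The feature with the highest gain becomes the first split.
best = max(gains, key=gains.get)
print("Best split:", best)  # Best split: Lifestyle
```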

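__Gini Index__ - Gini Index or Gini Impurity measures the probability that a randomly chosen element would be classified wrongly if it were labeled at random according to the class distribution in the node. It ranges from 0 (all elements belong to one class) up to, but never reaching, 1. With two classes, as in our example, the maximum is 0.5.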

Formula –

$$\text{Gini Index} = 1 - \sum_{i=1}^{n} (P_i)^2$$

Here, $P_i$ is the probability that an element belongs to class $i$, and $n$ is the number of classes.
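As a small sketch, assuming the Gini Index of a node is one minus the sum of the squared class proportions (the form used in the calculations that follow), it can be computed from a list of class labels like this. The function name `gini` is illustrative.

```python
from collections import Counter


def gini(labels):
    """Gini index of a node: 1 minus the sum of squared class proportions."""
    total = len(labels)
    return 1 - sum((n / total) ** 2 for n in Counter(labels).values())


print(gini(["Acute", "Acute"]))  # 0.0 (pure node)
print(gini(["Acute", "Hyper"]))  # 0.5 (two classes, evenly mixed)
```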

1. Analysis of Sleep Schedule feature:

P (On Track) = 3/4

P (Off Track) = 1/4

If (Sleep Schedule = On Track and Insomnia = Acute): 2/3

If (Sleep Schedule = On Track and Insomnia = Hyper): 1/3

Gini Index (On Track) = 1 - [(2/3)^2 + (1/3)^2]

≈ 1 - [0.44 + 0.11]

= 0.45

If (Sleep Schedule = Off Track and Insomnia = Acute): 0

If (Sleep Schedule = Off Track and Insomnia = Hyper): 1

Gini Index (Off Track) = 1 - [0^2 + 1^2]

= 0

Weighted sum of Gini indices (Sleep Schedule)

= [3/4 * 0.45] + [1/4 * 0]

= 0.3375
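The Sleep Schedule figure can be checked with exact fractions. This sketch again assumes each child node is given as (Acute, Hyper) counts, and the helper names are illustrative. The exact weighted Gini is 1/3 (about 0.333); the 0.3375 above comes from rounding the On Track Gini index of 4/9 to 0.45.

```python
from fractions import Fraction


def gini(counts):
    """Gini index of a node from its per-class counts, as an exact fraction."""
    total = sum(counts)
    return 1 - sum(Fraction(c, total) ** 2 for c in counts)


def weighted_gini(children):
    """Gini indices of the child nodes, weighted by each child's share of examples."""
    total = sum(sum(child) for child in children)
    return sum(Fraction(sum(child), total) * gini(child) for child in children)


# (Acute, Hyper) counts for the On Track and Off Track child nodes.
sleep_schedule = [(2, 1), (0, 1)]
result = weighted_gini(sleep_schedule)
print(result)  # 1/3
```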

2. Analysis of Eating Habits feature:

P (Healthy) = 2/4

P (Unhealthy) = 2/4

If (Eating Habits = Healthy and Insomnia = Acute): 1/2

If (Eating Habits = Healthy and Insomnia = Hyper): 1/2

Gini Index (Healthy) = 1 - [(1/2)^2 + (1/2)^2]

= 0.5

If (Eating Habits = Unhealthy and Insomnia = Acute): 1/2

If (Eating Habits = Unhealthy and Insomnia = Hyper): 1/2

Gini Index (Unhealthy) = 1 - [(1/2)^2 + (1/2)^2]

= 0.5

Weighted sum of Gini indices (Eating Habits)