Decision Trees are a type of Supervised Machine Learning model, where we predict the value of a target based on several inputs. Each branch of the decision tree represents a possible outcome.

We can build a decision tree with the help of attribute selection measures such as:

Information Gain

Gini Index

Here, we will try to build a decision tree with the help of two of these measures: Information Gain and Gini Index.

We want to find out which feature predicts the Height of a Child most accurately.

We will take three features: Family History, Diet and Physical Activities. Height of Child will be the parent node.

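1. Information Gain: First we will use Information Gain. The formula for computing Information Gain is:

Formula 1: Information Gain = Entropy(parent) - Weighted average of Entropy(children)

Entropy measures the amount of randomness (the part that can't be predicted) in a node. Here the parent node is Height of Child. The formula for Entropy is:

Formula 2: Entropy = - Σ p(i) * log2(p(i)), summed over every class i (here: tall and short)

The weighted average of the entropy of the children is:

Formula 3: Weighted average = Σ (samples in child / samples in parent) * Entropy(child), summed over every child node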

p(short) = fraction of short in chart = 1/4 = 0.25

p(tall) = fraction of tall in chart = 3/4 = 0.75

Therefore, using formula 2:

Entropy(parent) = - {0.25 * log2(0.25) + 0.75 * log2(0.75)}

= - {(-0.50) + (-0.31)}

= 0.81
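The parent entropy can be checked with a few lines of Python. This is only a minimal sketch: the `entropy` helper and the `parent` list are names assumed here for illustration, with "T" and "S" standing for tall and short.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels (formula 2)."""
    total = len(labels)
    return sum(-(n / total) * log2(n / total) for n in Counter(labels).values())

# Assumed encoding of the parent node: 3 tall ("T") and 1 short ("S").
parent = ["T", "S", "T", "T"]
print(round(entropy(parent), 2))  # 0.81
```

With only two classes the entropy can never exceed 1, which is a quick sanity check on any hand calculation.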

Now we will calculate Information Gain for our first feature Family History:

Here the parent node is TSTT, and splitting on Family History gives two child nodes: TST (left side) and T (right side).

First we calculate the entropy of the left side of the child node, where p(tall) is 2/3 = 0.67 and p(short) is 1/3 = 0.33.

So, Entropy(left side: TST) = - {0.67 * log2(0.67) + 0.33 * log2(0.33)}

= 0.92 (using formula 2)

For the right side of the child node, p(tall) is 1 and p(short) is 0.

So, Entropy(right side: T) = - {0 + 1 * log2(1)}

= 0 (using formula 2)

Weighted average according to formula 3:

= (3/4) * 0.92 + (1/4) * 0

= 0.69

Information Gain for Family History according to formula 1, with Entropy(parent) = 0.81:

= 0.81 - 0.69

= 0.12
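The Family History calculation can be reproduced in code as a rough check. The function names and the list encoding of the child nodes (TST on the left, T on the right) are assumptions made for this sketch.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy in bits (formula 2)."""
    total = len(labels)
    return sum(-(n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the children (formulas 1 and 3)."""
    weighted = sum(len(child) / len(parent) * entropy(child) for child in children)
    return entropy(parent) - weighted

# Assumed encoding: parent node TSTT, split by Family History into TST and T.
parent = list("TSTT")
family_history = [list("TST"), list("T")]
print(round(information_gain(parent, family_history), 2))  # 0.12
```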

Information Gain for our second feature Diet:

Splitting on Diet gives two child nodes: TTT (left side) and S (right side).

For the left side of the child node, p(tall) is 1 and p(short) is 0.

Entropy(TTT) = - {1 * log2(1) + 0}

= 0 (using formula 2)

For the right side of the child node, p(tall) is 0 and p(short) is 1.

Entropy(S) = - {0 + 1 * log2(1)}

= 0 (using formula 2)

Weighted average = (3/4) * 0 + (1/4) * 0

= 0 (using formula 3)

Information Gain for Diet = 0.81 - 0

= 0.81 (using formula 1)

Information Gain for third feature Physical Activity:

Splitting on Physical Activity gives two child nodes: TT (left side) and TS (right side).

For the left side of the child node, p(tall) is 1 and p(short) is 0.

Entropy(TT) = - {1 * log2(1) + 0}

= 0 (using formula 2)

For the right side of the child node, p(tall) is 1/2 = 0.5 and p(short) is 1/2 = 0.5.

Entropy(TS) = - {0.5 * log2(0.5) + 0.5 * log2(0.5)}

= 1 (using formula 2)

Weighted average = (2/4) * 0 + (2/4) * 1

= 0.5 (using formula 3)

Information Gain for Physical Activities = 0.81 - 0.5

= 0.31 (using formula 1)

Information Gain(family history) => 0.12

Information Gain(diet) => 0.81

Information Gain(physical activities) => 0.31

Conclusion: The higher the Information Gain, the better the feature separates the classes. Diet has the highest Information Gain, so it is the feature that predicts the height of a child most accurately.
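To compare the three features side by side, the same calculation can be run over every split and the largest gain picked. This is a sketch under assumptions: the `splits` dictionary is a hypothetical encoding of the child nodes used in the walkthrough (TST/T, TTT/S and TT/TS).

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy in bits (formula 2)."""
    total = len(labels)
    return sum(-(n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(parent, children):
    """Formula 1: parent entropy minus weighted child entropy."""
    weighted = sum(len(child) / len(parent) * entropy(child) for child in children)
    return entropy(parent) - weighted

parent = list("TSTT")
# Hypothetical encoding of the child nodes produced by each feature.
splits = {
    "Family History": [list("TST"), list("T")],
    "Diet": [list("TTT"), list("S")],
    "Physical Activities": [list("TT"), list("TS")],
}

gains = {name: information_gain(parent, children) for name, children in splits.items()}
for name, gain in gains.items():
    print(f"{name}: {gain:.2f}")

best = max(gains, key=gains.get)
print("Best feature:", best)  # Best feature: Diet
```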

2. Gini Index: The Gini Index measures the probability of a randomly chosen sample being wrongly classified.

The Gini Index varies between 0 and 1, where 0 means that every sample in the node belongs to the same class.

The formula for Gini Index is:

Formula 4: Gini Index = 1 - Σ (p(i))^2, summed over every class i

where p(i) is the probability of an object being classified to a particular class.

Gini Index for Family History:

p(family history = Tall): 3/4

p(family history = short): 1/4

If (family history = Tall and height of child = Tall): 2/3

If (family history = Tall and height of child = Short): 1/3

Gini Index(tall) = 1 - {(2/3)^2 + (1/3)^2}

= 1 - (4/9 + 1/9)

= 0.44 (using formula 4)

If (family history = Short and height of child = Tall): 1

If (family history = Short and height of child = Short): 0

Gini Index(short) = 1 - {(1)^2 + (0)^2}

= 0

Gini Index(family history) = (3/4) * 0.44 + (1/4) * 0

= 0.33
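The Gini calculation for Family History can be sketched the same way. The helper names `gini` and `weighted_gini`, and the list encoding of the two groups, are assumptions made for this example.

```python
from collections import Counter

def gini(labels):
    """Gini Index of one group: 1 minus the sum of squared class fractions (formula 4)."""
    total = len(labels)
    return 1 - sum((n / total) ** 2 for n in Counter(labels).values())

def weighted_gini(groups):
    """Size-weighted average of the Gini Index over all groups of a split."""
    total = sum(len(group) for group in groups)
    return sum(len(group) / total * gini(group) for group in groups)

# Assumed encoding: family history = Tall -> 2 tall, 1 short; family history = Short -> 1 tall.
tall_history = list("TTS")
short_history = list("T")

print(round(gini(tall_history), 2))                            # 0.44
print(round(gini(short_history), 2))                           # 0.0
print(round(weighted_gini([tall_history, short_history]), 2))  # 0.33
```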

Gini Index for Diet:

p(Diet = Good): 3/4

p(Diet = bad): 1/4

If (Diet = good and height of child = tall): 1

If (Diet = good and height of child = short): 0

Gini Index(good) = 1 - {(1)^2 + (0)^2}

= 0 (using formula 4)

If (Diet = bad and height of child = Tall): 0

If (Diet = bad and height of child = Short): 1

Gini Index(bad) = 1 - {(0)^2 + (1)^2}

= 0

Gini Index(diet) = (3/4) * 0 + (1/4) * 0

= 0

Gini Index for Physical Activities:

p(physical activities = No): 2/4

p(physical activities = Yes): 2/4

If (physical activities = No and height of child = Tall): 1/2

If (physical activities = No and height of child = Short): 1/2

Gini Index(no) = 1- {(0.5)^2 + (0.5)^2}

= 0.5 (using formula 4)

If (physical activities = Yes and height of child = Tall): 1

If (physical activities = Yes and height of child = Short): 0

Gini Index(yes) = 1 - {(1)^2 + (0)^2}

= 0

Gini Index(physical activities) = (2/4) * 0.5 + (2/4) * 0

= 0.25

Gini Index(family history) => 0.33

Gini Index(diet) => 0

Gini Index(physical activities) => 0.25

Conclusion: The lower the Gini Index, the better the chance of an accurate outcome. Diet has the lowest Gini Index, which is why it is the most suitable feature.
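As a rough check of the three Gini values, every split can be scored in one loop and the smallest result selected. The `splits` dictionary below is a hypothetical encoding of the groups used in the calculations (for example, Physical Activities = No gives one tall and one short child).

```python
from collections import Counter

def gini(labels):
    """Gini Index of one group (formula 4)."""
    total = len(labels)
    return 1 - sum((n / total) ** 2 for n in Counter(labels).values())

def weighted_gini(groups):
    """Size-weighted average of the Gini Index over the groups of a split."""
    total = sum(len(group) for group in groups)
    return sum(len(group) / total * gini(group) for group in groups)

# Hypothetical encoding of the groups produced by each feature.
splits = {
    "Family History": [list("TTS"), list("T")],
    "Diet": [list("TTT"), list("S")],
    "Physical Activities": [list("TS"), list("TT")],
}

scores = {name: weighted_gini(groups) for name, groups in splits.items()}
for name, score in scores.items():
    print(f"{name}: {score:.2f}")

best = min(scores, key=scores.get)
print("Best feature:", best)  # Best feature: Diet
```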

Unlike Information Gain, the Gini Index is not difficult to compute, as it does not involve the logarithm function (used when calculating Entropy for Information Gain). That is why the Gini Index is often preferred over Information Gain.
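To see how the two measures relate, both can be evaluated for a node with two classes at a few values of p(tall). This is an illustrative sketch only; `binary_entropy` and `binary_gini` are names assumed here, and the chosen probabilities are arbitrary.

```python
from math import log2

def binary_entropy(p):
    """Entropy of a two-class node where one class has probability p (formula 2)."""
    if p in (0.0, 1.0):
        return 0.0  # a pure node has no randomness; avoids log2(0)
    return -(p * log2(p) + (1 - p) * log2(1 - p))

def binary_gini(p):
    """Gini Index of a two-class node (formula 4); needs no logarithm."""
    return 1 - (p ** 2 + (1 - p) ** 2)

for p in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(f"p(tall)={p:.2f}  entropy={binary_entropy(p):.2f}  gini={binary_gini(p):.2f}")
```

Both measures are 0 for a pure node and reach their maximum at an even split, which is consistent with both of them selecting Diet in this example.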

Thank You!
