Just as the trees are a vital part of human life, tree-based algorithms are an important part of machine learning. The structure of a tree has given the inspiration to develop the algorithms and feed it to the machines to learn things we want them to learn and solve problems in real life. These tree-based learning algorithms are considered to be one of the best and most used supervised learning methods. Methods like decision trees, random forest, gradient boosting are being popularly used in all kinds of data science problems. Hence, for every beginner in machine learning, it’s important to learn these algorithms and use them for modeling.

**What are Decision Trees?**

A decision tree is a tree-like graph with nodes representing the place where we pick an attribute and ask a question; edges represent the answers to the question, and the leaves represent the actual output or class label. They are used in non-linear decision making with a simple linear decision surface.

Decision trees classify the examples by sorting them down the tree from the root to some leaf node, with the leaf node providing the classification to the example. Each node in the tree acts as a test case for some attribute, and each edge descending from that node corresponds to one of the possible answers to the test case. This process is recursive in nature and is repeated for every subtree rooted at the new nodes.

**Decision Trees in Real-Life**

You've probably used a decision tree before in your own life to make a decision. Let's take an example of the decision about if you want to play tennis on a particular day with your child.

It might depend on various factors like whether or not you get free from your office on time and able to leave early enough and whether you reach home before 6 pm in the evening depending upon traffic or whether your child has some other activity already scheduled that day; in all the cases, your decision to go out to play tennis with your son depends mainly upon your and your child availability on that particular day and the weather outside.

If the weather is fine, and you are free from the office and reach home on time and your child doesn't have any other class, you may want to go out to the tennis court with him. If you reach on time but he already has some other activity scheduled that day, you might want to just relax at home, watching some movie.

'

This is a clear example of a *real-life decision tree*. We’ve built a tree to model a set of **sequential,** hierarchical decisions that ultimately lead to some final result. Notice that we’ve also chosen our decisions to be quite “high-level” in order to keep the tree small. For example, what if we set up *many* possible options for the weather such as 25 degrees sunny, 25 degrees raining, 26 degrees sunny, 26 degrees raining, 27 degrees sunny…. etc, our tree would be huge! The **exact** temperature really isn’t too relevant, we just want to know whether it’s OK to be outside or not.

So, you calculate all these factors for the last few days and form a lookup table like the one below.

**How does the Decision Tree algorithm work?**

The basic idea behind any decision tree algorithm is as follows:

Select the best attribute using Attribute Selection Measures(ASM) to split the records.

Make that attribute a decision node and breaks the dataset into smaller subsets.

Starts tree building by repeating this process recursively for each child until one of the condition will match:

All the tuples belong to the same attribute value.

There are no more remaining attributes.

There are no more instances.

ASM provides a rank to each feature(or attribute) by explaining the given dataset. The best score attribute will be selected as a splitting attribute (Source). In the case of a continuous-valued attribute, split points for branches also need to define. The most popular selection measures are** Information Gain**, **Gain Ratio,** and **Gini Index**.

**What is Information Gain and how do you measure it?**

Information gain is a statistical property that measures how well a given attribute separates the training examples according to their target classification.

Information Gain is a decrease in entropy(randomness). It computes the difference between entropy before split and average entropy after split of the dataset based on given attribute values. The basic formula to calculate entropy is :

Where Pi is the probability that an arbitrary tuple in D belongs to class Ci.

Information Gain is, then calculated as:

Now, let's calculate the entropy of each parent node and child nodes:

E(Parent Node) = -[P(tennis)log(base2)P(tennis) + P(Home)log(base2)P(Home)]

= - [(1/2)log(base2)(1/2) + (1/2)log(base2)(1/2)]

= 1

Since P(tennis) = 1/2 and P(Home) = 1/2

Now, let's find out if the parent node splits by the first attribute,i.e **weather**.

If we see the given data above, according to the attribute weather, we will get the following values for our decision to play or not:

Rainy: Stay at Home

Sunny : Tennis,Tennis,StayHome

Entropy of the Right child Node: E(rainy) = -P(Home)log(base2)P(Home)

=-1log(1)= 0

Entropy of the left child node : E(Sunny) = - [P(Home)log(base2)P(Home)

+ P(Tennis)log(base2)P(Tennis)]

= -[(1/3)log(base2)(1/3) + (2/3)log(base2)(2/3)]

= 0.9

So, Information Gain(Weather) = Entropy(Parent) - [sum of(weighted average*entropy of each child)]

= 1 - [(3/4)*(0.9)+(1/4)*0] = 0.325

Likewise, we calculated the entropy and information gain for other attributes and get the following results:

IG(WorkSchedule) = 1

IG(Traffic) = 0.325

IG(Availability) = 0.325

Since Infomation gain for attribute,"Work Schedule" is highest, we can say that the most accurate attribute whichh will predict whether you will play tennis that day is your **work schedule**.

**What is the Gini Index?**

Decision tree algorithm CART (Classification and Regression Tree) uses the Gini method to create split points.

Where pi is the probability that a tuple in D belongs to class Ci.

Gini Index, also known as Gini impurity, **calculates the amount of probability of a specific attribute that is classified incorrectly when selected randomly. **If all the elements are linked with a single class then it can be called pure. The Gini Index considers a binary split for each attribute. You can compute a weighted sum of the impurity of each partition. If a binary split on an attribute A partitions data D into D1 and D2, the Gini index of D is:

In the case of a discrete-valued attribute, the subset that gives the minimum Gini index for that chosen is selected as a splitting attribute. In the case of continuous-valued attributes, the strategy is to select each pair of adjacent values as a possible split-point and point with a smaller Gini index chosen as the splitting point.

The attribute with the **minimum Gini index** is chosen as the splitting attribute.

Now, let's try to calculate the Gini Index for our data above based on the attribute. **weather**

Gini(Sunny) = 1 - [(2/3)*(2/3) + (1/3)*(1/3)] = 4/9 = 0.444

Gini(Rainy) = 1 - [(1/1)*(1/1)] = 0

So, Gini(Weather) = (3/4) * 0.444 + (1/4) * 0 = 0.333

Based on the other attributes, the Gini Index is as follows:

Gini(Traffic) = (3/4) * {1 - [(1/3)*(1/3) + (2/3)*(2/3)] } + (1/4) * { 1- [ (1/1)*(1/1)]} = 0.333

Gini(Work Schedule) = (2/4) * [1 - (1*1)] + (2/4) * [1 - (1*1)] = 0

Gini(Availability) = 0.333

So, the attribute with the minimum Gini Index, that is,** Work Schedul**e is the splitting attribute for decision making here.