Classification is a technique that divides a data set into different categories by assigning a label to each record. The reason to classify data is to analyse it and to predict which group or category new data points belong to.

**Decision trees** are extremely useful for data analytics and machine learning because they break down complex data into more manageable parts. They're often used in these fields for prediction analysis, data classification, and regression.

Types of classification algorithms:

1) Decision tree

2) Random forest

3) Naïve Bayes

4) Logistic regression

**When and where should decision trees be used?**

**When?**

A decision tree is one way to display an algorithm that contains only conditional control statements, so it is a good choice whenever the prediction logic needs to be transparent and easy to explain.

**Where?**

**Decision trees** are commonly **used** in operations research, specifically in **decision** analysis, to help identify the strategy most likely to reach a goal, and they are also a popular tool in machine learning.

A simple decision tree looks like the figure below:

A decision tree consists of the following terminology:

**Root/Parent Node: -**

It is the starting node of the decision tree and represents the full data set. It is divided further into two or more nodes.

**Leaf Node: -**

They are the final output nodes; the tree cannot be divided further after a leaf node.

**Splitting: -**

It is the process of dividing a decision node (parent node) into sub-nodes based on given conditions.

**Branches/sub tree: -**

A section of the tree formed by splitting a node; each branch leads to a subtree.

**Pruning: -**

It is the process of removing unwanted branches from the tree. It is the opposite of splitting.

**Parent/Child Node: -**

The node that is split is called the parent node, and the sub-nodes that descend from it are called child nodes.

Now, how and where do we split the tree? Which attribute is the best choice for the root node? Two common criteria are entropy (with information gain) and the Gini index.

**1)Entropy: -**

*It is the measurement of the impurity or randomness in the data points.*

A high degree of disorder means a high level of impurity; let me simplify it. **Entropy is usually between 0 and 1**, although, depending on the number of groups or classes present in the data set, it can be larger than 1; it signifies the same meaning, i.e. a higher value means a higher level of disorder.

For the sake of simple interpretation, let us confine the value of entropy between 0 and 1.

*“Entropy is a degree of randomness or uncertainty, in turn, satisfies the target of Data Scientists and ML models to reduce uncertainty.”*

Entropy is calculated by the formula:

Entropy(S) = − Σᵢ pᵢ · log₂(pᵢ)

Here pᵢ denotes the probability of class i in the data set S, and the sum runs over all classes.
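As a quick sketch, the entropy formula can be computed from the class counts of a node (the function name and example counts here are illustrative, not from the article):

```python
from math import log2

def entropy(class_counts):
    """Entropy of a node, given the number of samples in each class."""
    total = sum(class_counts)
    # Skip empty classes: by convention 0 * log2(0) = 0.
    probs = [c / total for c in class_counts if c > 0]
    return -sum(p * log2(p) for p in probs)

print(entropy([10, 0]))  # 0.0 — a pure node has zero entropy
print(entropy([5, 5]))   # 1.0 — a 50/50 binary node has maximum entropy
```

A pure node gives entropy 0, and an evenly mixed two-class node gives entropy 1, matching the 0-to-1 range described above.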

**What is Information Gain?**

The concept of entropy plays an important role in calculating Information Gain.

Information Gain quantifies **the reduction in uncertainty, disorder or impurity** achieved by a split: **the tree is grown so that entropy decreases from the top (root node) to the bottom (leaf nodes).** For a split of data set S on attribute A:

Information Gain(S, A) = Entropy(S) − Σᵥ (|Sᵥ| / |S|) · Entropy(Sᵥ)

where Sᵥ is the subset of S in which attribute A has value v.
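The idea can be sketched in a few lines: information gain is the parent's entropy minus the size-weighted average entropy of the children. The function names and the example counts below are illustrative assumptions, not from the article:

```python
from math import log2

def entropy(counts):
    """Entropy of a node from its per-class sample counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

def information_gain(parent_counts, children_counts):
    """Entropy of the parent minus the weighted entropy of the children."""
    n = sum(parent_counts)
    weighted = sum(sum(child) / n * entropy(child) for child in children_counts)
    return entropy(parent_counts) - weighted

# Splitting a node with 9 positive / 5 negative samples into [6, 2] and [3, 3]:
print(round(information_gain([9, 5], [[6, 2], [3, 3]]), 3))  # 0.048
```

A split that separates the classes perfectly would give the maximum possible gain (the parent's full entropy), while a split that leaves the children as mixed as the parent gives a gain of 0.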

**2)Gini Index**

Gini Index is also known as Gini Impurity. It measures the probability that a randomly chosen element from the node would be classified incorrectly if it were labelled at random according to the class distribution in that node.

The Gini Index is calculated as follows:

Gini = 1 − Σᵢ (pᵢ)²

where pᵢ denotes the probability of an element being classified into a distinct class.
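This formula translates directly into code. A minimal sketch (the function name and example counts are illustrative):

```python
def gini(class_counts):
    """Gini impurity: 1 - sum of squared class probabilities."""
    total = sum(class_counts)
    return 1 - sum((c / total) ** 2 for c in class_counts)

print(gini([10, 0]))  # 0.0 — a pure node has zero impurity
print(gini([5, 5]))   # 0.5 — maximum impurity for two classes
```

Unlike entropy, the Gini index for a two-class node peaks at 0.5 rather than 1, but both are 0 for a pure node and largest for an even mixture.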

Calculate the Gini index for the following table (14 records, each with the binary attributes High BPS and High cholesterol and a binary Target):

Decision tree for the above table:

1. Target is the decision node.

2. It is subdivided into parent nodes (High BPS, High cholesterol).

3. Each parent node is divided into child nodes based on how many 1s or 0s it contains, e.g. HBPS1 and HBPS0.

4. These are again divided into leaf nodes based on Target = 1 and Target = 0.

(A leaf node is an end node; it cannot be divided further.)

Now let us calculate the Gini index for High BPS and for High cholesterol and find which factor the decision should be made on.

The factor that gives the lowest Gini index is the winner.

1. **Gini index for High Bps:**

Decision tree for High BPS

Probability for the parent node:

P(BPS=1) = 10/14

P(BPS=0) = 4/14

Now we calculate the Gini index for each child node:

i) For BPS = 1 (10 records):

P(BPS=1 and Target=1) = 8/10

P(BPS=1 and Target=0) = 2/10

Gini(BPS=1) = 1 − {(8/10)² + (2/10)²}

= 1 − (0.64 + 0.04)

= 0.32

ii) For BPS = 0 (4 records):

P(BPS=0 and Target=0) = 4/4 = 1

P(BPS=0 and Target=1) = 0

Gini(BPS=0) = 1 − {(1)² + (0)²}

= 1 − 1

= 0

Weighted Gini index:

w.g. = P(BPS=1) · Gini(BPS=1) + P(BPS=0) · Gini(BPS=0)

= (10/14) · 0.32 + (4/14) · 0

= 0.229
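The worked example can be checked with a few lines of Python; the class counts are taken from the calculation above, and the function name is illustrative:

```python
def gini(counts):
    """Gini impurity: 1 - sum of squared class probabilities."""
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

# Child nodes of the High BPS split:
# BPS=1 has 8 records with Target=1 and 2 with Target=0; BPS=0 has 0 and 4.
g_bps1 = gini([8, 2])  # ≈ 0.32
g_bps0 = gini([0, 4])  # 0.0 — a pure node

# Weight each child by its share of the 14 records.
weighted = (10 / 14) * g_bps1 + (4 / 14) * g_bps0
print(round(weighted, 3))  # 0.229
```

Repeating the same computation for High cholesterol and comparing the two weighted Gini values tells us which attribute wins the split.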

I hope this article was helpful to you all in understanding decision trees. Please leave your queries, if any, below.
