Are Decision Trees Effective?

Classification is a technique for dividing a data set into categories by assigning a label to each record. We classify data so that we can analyse it and predict which group or category new data falls into.


Decision trees are extremely useful in data analytics and machine learning because they break complex data down into more manageable parts. They are often used in these fields for predictive analysis, data classification, and regression.


Types of classification algorithms: -

1) Decision tree

2) Random forest

3) Naïve Bayes

4) Logistic regression


When and where should decision trees be used?

A decision tree is one way to display an algorithm that contains only conditional control statements. Decision trees are commonly used in operations research, specifically in decision analysis, to help identify the strategy most likely to reach a goal, but they are also a popular tool in machine learning.


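In code terms, a tree of conditional control statements is just nested if/else branches. Here is a minimal sketch in plain Python; the rule and the return labels are made up for illustration, using the blood-pressure and cholesterol features that appear in the worked example later in this article:

```python
# A decision tree is equivalent to nested conditional (if/else) statements.
# The rule below is hypothetical, not a trained model.
def classify(high_bps, high_cholesterol):
    if high_bps == 1:                  # root split
        if high_cholesterol == 1:      # second split
            return "at risk"
        return "needs monitoring"
    return "healthy"

print(classify(high_bps=1, high_cholesterol=1))  # at risk
print(classify(high_bps=0, high_cholesterol=0))  # healthy
```

Each `if` corresponds to an internal node of the tree, and each `return` corresponds to a leaf.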


A simple decision tree looks like the figure below.

[Figure: examples of a simple decision tree]




A decision tree consists of the following terminology.

Root/Parent Node: -

It is the start of the decision tree and represents the full data set. It is further divided into two or more nodes.

Leaf Node: -

Leaf nodes are the final output nodes; the tree cannot be divided further after a leaf node.

Splitting: -

It is the process of dividing a decision node/parent node into sub-nodes based on given conditions.

Branches/sub tree: -

A sub-tree formed by splitting a node of the main data set.

Pruning: -

It is the process of removing unwanted branches from the tree; it is the opposite of splitting.

Parent/Child Node: -

The main starting node is called the parent node, and the sub-nodes derived from it are called child nodes.
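To make the terminology concrete, here is a minimal sketch of a decision tree as a nested Python dictionary: internal (parent) nodes store the feature to test, the keys 0/1 are the branches, and plain strings are leaf nodes. The feature names and labels are hypothetical.

```python
# Internal nodes are dicts holding a "feature" to test;
# leaf nodes are plain strings (the final prediction).
tree = {
    "feature": "high_bps",           # root/parent node
    1: {                             # branch for high_bps == 1
        "feature": "high_cholesterol",
        1: "at risk",                # leaf node
        0: "healthy",                # leaf node
    },
    0: "healthy",                    # leaf node
}

def predict(node, sample):
    """Walk from the root to a leaf, following the branch
    that matches the sample's feature value."""
    while isinstance(node, dict):    # keep descending until we hit a leaf
        node = node[sample[node["feature"]]]
    return node

print(predict(tree, {"high_bps": 1, "high_cholesterol": 0}))  # healthy
```

The while-loop stopping at a leaf mirrors the definition above: a leaf node cannot be divided further.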


Now, how do we decide where to split the tree? Which attribute is the best attribute for the root node?

1)Entropy: -

It is a measure of the impurity or randomness in the data points.

A high degree of disorder means a high level of impurity. Entropy is calculated between 0 and 1, although, depending on the number of groups or classes present in the data set, it can be larger than 1; a larger value still signifies the same thing, i.e. a higher level of disorder.

For the sake of simple interpretation, let us confine the value of entropy between 0 and 1.

“Entropy is a degree of randomness or uncertainty; reducing it, in turn, satisfies the goal of data scientists and ML models to reduce uncertainty.”


Entropy is calculated by the formula:

Entropy(S) = - Σ p(i) · log2 p(i)

Here p(i) denotes the probability of class i in the data set.
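As a sketch, the formula can be computed in plain Python from a list of class counts (the counts below are made up for illustration):

```python
import math

def entropy(counts):
    """Entropy = -sum(p_i * log2(p_i)) over the class probabilities."""
    total = sum(counts)
    probs = [c / total for c in counts if c > 0]  # 0 * log(0) is taken as 0
    return -sum(p * math.log2(p) for p in probs)

print(entropy([7, 7]))  # evenly mixed two-class set: entropy = 1.0
```

A pure set (all samples in one class) gives an entropy of 0, and an evenly mixed two-class set gives the maximum value of 1.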

What is Information Gain?

The concept of entropy plays an important role in calculating Information Gain.

Information Gain quantifies the reduction in uncertainty (disorder or impurity) produced by a split, with the intention of decreasing entropy from the top of the tree (the root node) to the bottom (the leaf nodes):

Information Gain = Entropy(parent) - weighted average Entropy(children)
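This can be sketched in plain Python: compute the parent's entropy, subtract the size-weighted entropy of the child nodes. The split counts below are invented for illustration:

```python
import math

def entropy(counts):
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def information_gain(parent_counts, children_counts):
    """Parent entropy minus the size-weighted average entropy of the children."""
    n = sum(parent_counts)
    weighted = sum(sum(child) / n * entropy(child) for child in children_counts)
    return entropy(parent_counts) - weighted

# Hypothetical split: 14 samples (9 vs 5) divided into groups of (6,2) and (3,3).
gain = information_gain([9, 5], [[6, 2], [3, 3]])
print(round(gain, 3))  # 0.048
```

A split that separates the classes perfectly recovers the full parent entropy as gain, while a split that leaves the children as mixed as the parent gains nothing.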


2)Gini Index

Gini Index is also known as Gini Impurity. It measures the probability that a randomly chosen element is classified incorrectly when it is labelled at random according to the class distribution.


Gini Index is calculated as follows:

Gini Index = 1 - Σ (p(i))^2

where p(i) denotes the probability of an element being classified into a distinct class.
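A sketch of this formula in plain Python, again taking a list of class counts (the example counts are made up):

```python
def gini(counts):
    """Gini impurity = 1 - sum(p_i ** 2) over the class probabilities."""
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

print(round(gini([8, 2]), 2))  # 1 - (0.8^2 + 0.2^2) = 0.32
print(gini([4, 0]))            # pure node: 1 - 1^2 = 0.0
```

Like entropy, Gini is 0 for a pure node; for two classes it peaks at 0.5 for an evenly mixed node.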

Calculate the Gini index for the following table:

[Table: 14 patient records with columns High BPS, High cholesterol and Target (0/1)]

Decision tree for the above table:

[Figure: decision tree built from the table]

1. Target is the decision node.

2. It is subdivided into parent nodes (High BPS, High cholesterol).

3. Each parent node is divided into child nodes based on its feature value, e.g. BPS=1 and BPS=0.

4. These are again divided into leaf nodes based on Target=1 and Target=0.

(A leaf node is an end node; it cannot be divided further.)


Now let us calculate the Gini index for High BPS and High cholesterol to find which factor the decision should be based on.

The factor that gives the lowest Gini index is the winner.


1. Gini index for High Bps:

Decision tree for High BPS:

[Figure: split on High BPS]



Probability for the parent node:

P(BPS=1) = 10/14

P(BPS=0) = 4/14


Now we calculate for child node:


i) For BPS = 1:

If (BPS=1 and Target=1) = 8/10

If (BPS=1 and Target=0) = 2/10

Gini index G(BPS=1) = 1 - {(8/10)^2 + (2/10)^2}

= 1 - {0.64 + 0.04}

= 0.32


ii) For BPS = 0:

If (BPS=0 and Target=0) = 4/4 = 1

If (BPS=0 and Target=1) = 0/4 = 0

Gini index G(BPS=0) = 1 - {(1)^2 + (0)^2}

= 1 - 1

= 0


Weighted Gini index:

G(High BPS) = P(BPS=0) × G(BPS=0) + P(BPS=1) × G(BPS=1)

= (4/14) × 0 + (10/14) × 0.32

= 0.229

(The same calculation is then repeated for High cholesterol, and the feature with the lower weighted Gini index is chosen for the split.)
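The arithmetic above can be checked with a short plain-Python script, using the child-node counts from the worked example:

```python
def gini(counts):
    """Gini impurity = 1 - sum(p_i ** 2)."""
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

# Child-node class counts from the worked example:
# BPS = 1: 8 samples with Target=1, 2 with Target=0  (10 of 14 samples)
# BPS = 0: 0 samples with Target=1, 4 with Target=0  (4 of 14 samples)
g_bps1 = gini([8, 2])   # 0.32
g_bps0 = gini([0, 4])   # 0.0
weighted = (10 / 14) * g_bps1 + (4 / 14) * g_bps0
print(round(weighted, 3))  # 0.229
```

Running the same function on the High cholesterol counts from the table would give the competing score to compare against.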


I hope this article was helpful to you all in understanding decision trees. Please leave any queries below.





 

© Numpy Ninja.