This article covers the basics of decision trees: their structure, the types of decision trees, and their advantages and disadvantages, ending with their applications, without going into high-level detail.
Target Audience: Beginners in Data Science
What Is a Decision Tree?
A decision tree is a type of supervised machine learning model used to categorize data or make predictions based on previous data. It has a flowchart-like structure, and because it is a form of supervised learning, the model is trained and tested on data that contains the desired categorization.
The decision tree algorithm can be used to solve both regression and classification problems. A decision tree may not always provide a clear answer or decision; instead, it may present options so the data scientist can make an informed choice. Decision trees mimic human thinking, so it is generally easy for data scientists to understand and interpret the results.
Decision Tree Structure:
1. Root Node:
It is the starting point of the tree, from which the first split is made.
2. Splitting:
It is the process of dividing a node into two or more sub-nodes.
3. Decision Node:
When a sub-node splits into further sub-nodes, it is called a decision node.
4. Leaf / Terminal Node:
The nodes that cannot be split further are called leaf or terminal nodes.
5. Pruning:
The process of removing the sub-nodes of a decision node is called pruning.
6. Branch / Sub-Tree:
These are the arrows connecting the nodes; they show the flow through the tree.
7. Parent and Child Node:
A node that is divided into sub-nodes is called the parent node of those sub-nodes, and the sub-nodes are the children of that parent node.
The root and decision nodes contain the questions or criteria that need to be answered, while the leaf nodes hold the outcomes.
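The structure described above can be seen directly in code. Below is a minimal sketch, assuming scikit-learn is available; the data and feature names are invented for illustration. In the printed output, the first split is the root node, indented splits are decision nodes, and lines ending in a class label are leaf nodes.

```python
# Minimal sketch of decision tree structure using scikit-learn
# (assumed available). Data and feature names are invented.
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy data: [hours_worked, targets_met] -> promoted (1) or not (0)
X = [[40, 0], [45, 1], [50, 1], [35, 0], [60, 1], [30, 0]]
y = [0, 1, 1, 0, 1, 0]

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# Root node: the first split printed; decision nodes: indented splits;
# leaf nodes: the lines that end in "class: ...".
print(export_text(tree, feature_names=["hours_worked", "targets_met"]))
```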
Types of Decision trees:
Decision trees can typically be classified into two types:
1. Categorical Variable decision trees
In this type of decision tree, the outcome is categorical, such as a simple yes or no; each record fits into exactly one of the categories. The data is placed into a single category based on the decisions made at the nodes throughout the tree.
Ex. Did the employee achieve a target? Yes or no
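As a rough sketch of a categorical variable decision tree (assuming scikit-learn is available; the features and figures below are made up), the model places each employee into exactly one category:

```python
# Hypothetical example: did the employee achieve the target? yes / no.
from sklearn.tree import DecisionTreeClassifier

# Features: [sales_made, hours_trained]; label: target achieved?
X = [[10, 5], [2, 1], [8, 4], [1, 2], [12, 6], [3, 0]]
y = ["yes", "no", "yes", "no", "yes", "no"]

clf = DecisionTreeClassifier(random_state=0).fit(X, y)
# A new employee's record falls into exactly one category:
print(clf.predict([[9, 5]]))
```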
2. Continuous variable decision trees
In this type of decision tree, the answer is not a simple yes or no; the outcome depends on the decisions made further up the tree. It is also called a regression tree. The advantage of this type of tree is that the outcome can be predicted from multiple variables rather than a single variable, as in a categorical variable decision tree. Continuous variable decision trees are used to create predictions, and the approach can handle both linear and non-linear relationships if the correct algorithm is selected.
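A minimal sketch of a continuous variable (regression) tree, again assuming scikit-learn and using invented data, shows that the prediction is a number rather than a category:

```python
# Hypothetical regression tree: the output is a number, not a class.
from sklearn.tree import DecisionTreeRegressor

# Feature: years of experience; target: salary in thousands
X = [[1], [2], [3], [5], [7], [10]]
y = [30, 35, 40, 55, 70, 95]

reg = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, y)
# The predicted salary is the average of the training values in a leaf.
print(reg.predict([[6]]))
```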
Understanding the advantages and disadvantages is helpful in deciding whether this model fits a particular use case.
Advantages of Decision Trees:
A few of the advantages are listed below, but the list is not exhaustive:
Decision trees are easy to understand and interpret.
The model produces interpretable results, and a tree's reliability can be tested and quantified.
They require less time for data preparation, as they do not require data normalization.
Decision trees are very helpful in data cleaning; they take much less time in the data cleaning process than other modeling techniques.
They take less time in data exploration to find important variables and their relationships with other variables.
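The "no normalization" advantage can be sketched in a few lines (assuming scikit-learn; the toy data is invented): because a tree splits on raw thresholds, features on very different scales can be used directly.

```python
# Sketch: a tree splits on raw thresholds, so features on very
# different scales (toy data below) need no normalization step.
from sklearn.tree import DecisionTreeClassifier

# Feature scales differ by orders of magnitude: [income_usd, age_years]
X = [[20000, 25], [120000, 45], [30000, 30], [150000, 50]]
y = [0, 1, 0, 1]

clf = DecisionTreeClassifier(random_state=0).fit(X, y)
print(clf.score(X, y))  # fits the raw data without any scaling
```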
Disadvantages of Decision Trees:
These are not ideal for large data sets.
Very small changes in the data can produce a completely different tree; this is termed variance, so decision trees are considered unstable. Techniques such as bagging and boosting were introduced to address this.
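Bagging reduces this variance by training many trees on bootstrap samples and averaging their votes. A minimal sketch, assuming scikit-learn and using a random forest (a bagging-based tree ensemble) on invented data:

```python
# Sketch of variance reduction via bagging, using a random forest
# (a bagging-based tree ensemble) from scikit-learn. Toy data only.
from sklearn.ensemble import RandomForestClassifier

X = [[1], [2], [3], [6], [7], [8]]
y = [0, 0, 0, 1, 1, 1]

# 25 trees, each trained on a bootstrap sample, vote on the answer;
# the averaged vote is more stable than any single tree.
forest = RandomForestClassifier(n_estimators=25, random_state=0).fit(X, y)
print(forest.predict([[7]]))
```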
When a dataset has variables with many interconnected levels, the calculations become more complex.
Applications of Decision Trees:
Applications span several areas, including:
Customer relationship management and fraud detection