Updated: Mar 18
Before we jump into finding the answer to the above question, let’s try to understand what the “Decision tree” algorithm is.
So, what is a Decision tree?
If we strip down to the basics, decision tree algorithms are nothing but a series of if-else statements that can be used to predict a result based on the data set. This flowchart-like structure helps us in decision-making.
The idea of a decision tree is to divide the data set into smaller data sets based on the descriptive features until we reach a small enough set that contains data points that fall under one label.
Each internal node, starting from the root (parent) node, tests one feature of the data set, and the leaf (child) nodes represent the outcomes. For instance, even a simple decision tree can be used to predict whether I should write this blog or not.
Amazing, isn’t it? Even such simple decision-making is possible with decision trees. They are easy to understand and interpret because they mimic human thinking.
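To make the "series of if-else statements" idea concrete, here is a minimal sketch of a hand-written tree for the blog-writing decision. The features `has_topic` and `free_hours` and the 2-hour threshold are made up for illustration; a real decision tree algorithm would learn the questions and thresholds from data.

```python
def should_write_blog(has_topic: bool, free_hours: float) -> str:
    """A hand-written decision tree: each `if` is an internal node, each return is a leaf."""
    # Root node: test the first (hypothetical) feature
    if not has_topic:
        return "don't write"
    # Second node: test the second (hypothetical) feature
    if free_hours >= 2:
        return "write"
    return "don't write"


print(should_write_blog(has_topic=True, free_hours=3))  # prints: write
```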
Alright, now coming to the main question: “Is decision tree a classification or regression model?” To answer this question, let us first understand classification and regression through an example.
Take a student’s exam results. Regression is used to predict the student’s marks, whereas classification is used to predict whether the student has passed or failed the exams.
So, what is the difference between regression and classification?
Regression is used when we are trying to predict an output variable that is continuous, whereas classification is used when we are trying to predict the class that a set of features should fall into.
A decision tree can be used for either regression or classification. It works by splitting the data up in a tree-like pattern into smaller and smaller subsets. Then, when predicting the output value of a set of features, it will predict the output based on the subset that the set of features falls into.
There are two types of decision trees:
Classification trees are used when the response variable is categorical, so the dataset is split into the classes of that variable.
Regression trees, on the other hand, are used when the response variable is continuous.
In other words, regression trees predict numeric values, while classification trees predict class labels.
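The difference between the two types shows up at the leaves. As a small sketch, assuming the common convention that a classification leaf predicts the most frequent class among its training points and a regression leaf predicts their mean, the two kinds of prediction look like this (the function names and the sample marks are illustrative only):

```python
from collections import Counter
from statistics import mean


def classification_leaf(labels):
    """Predict the most common class among the training labels in this leaf."""
    return Counter(labels).most_common(1)[0][0]


def regression_leaf(values):
    """Predict the mean of the training targets in this leaf."""
    return mean(values)


# The same leaf, seen as a classification problem and as a regression problem
print(classification_leaf(["pass", "pass", "fail"]))  # prints: pass
print(regression_leaf([62, 70, 78]))                  # prints: 70
```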
How Classification and Regression Trees Work
A classification tree splits the dataset based on the homogeneity of data. Say, for instance, there are two variables, salary and location, which determine whether or not a candidate will accept a job offer.
If the training data shows that splitting on salary produces a group in which 95% of people accept the job offer, the data gets split there and salary becomes a top node in the tree. This split makes that group “95% pure”. Measures of impurity like entropy are used to quantify the homogeneity of the data when it comes to classification trees.
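Here is a rough sketch of how entropy can score a split like the one described above. The toy data (40 candidates, with a 95% pure high-salary group) is invented for illustration, and `information_gain` is simply the parent's entropy minus the size-weighted entropy of the subsets, which is one common way to compare candidate splits.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels; 0 means perfectly pure."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def information_gain(parent, children):
    """Entropy of the parent minus the size-weighted entropy of the child subsets."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted


# Hypothetical toy data: did each candidate accept the offer?
high_salary = ["accept"] * 19 + ["reject"] * 1   # 95% accept
low_salary = ["accept"] * 4 + ["reject"] * 16
parent = high_salary + low_salary

print(round(entropy(parent), 3))       # impurity before the split
print(round(entropy(high_salary), 3))  # impurity of the 95% pure group
print(round(information_gain(parent, [high_salary, low_salary]), 3))
```

A larger information gain means the split produced purer subsets, so the tree would prefer the feature and threshold that maximize it.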
In a regression tree, a regression model is fit to the target variable using each of the independent variables. The data is then split at several points for each independent variable.
At each of those points, the errors between the predicted values and the actual values are squared and added up to get the Sum of Squared Errors (SSE). The point that has the lowest SSE is chosen as the split point. This process is continued recursively.
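The SSE search can be sketched in a few lines. This version assumes the simplest possible "model" on each side of a split, namely predicting the mean of that side, and tries a threshold halfway between each pair of neighbouring feature values. The data (hours studied vs. marks) and the function names are made up for illustration.

```python
def sse(values):
    """Sum of squared errors when every value is predicted by the group mean."""
    if not values:
        return 0.0
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)


def best_split(xs, ys):
    """Return (threshold, total SSE) for the candidate split with the lowest SSE."""
    pairs = sorted(zip(xs, ys))
    best = None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold fits between identical feature values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [y for x, y in pairs if x <= threshold]
        right = [y for x, y in pairs if x > threshold]
        total = sse(left) + sse(right)
        if best is None or total < best[1]:
            best = (threshold, total)
    return best


# Hypothetical toy data: hours studied vs. marks
hours = [1, 2, 3, 6, 7, 8]
marks = [40, 42, 41, 80, 82, 81]
print(best_split(hours, marks))  # prints: (4.5, 4.0)
```

To grow a full regression tree, the same search would be repeated on the left and right subsets until a stopping rule (for example, a maximum depth) is reached.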
Advantages of Decision Trees
1. Decision trees are easy to interpret.
2. Building a decision tree requires some data preparation from the user, but normalization of the data is not required.
Disadvantages of Decision Trees
1. Decision trees are likely to overfit noisy data. The probability of overfitting on noise increases as a tree gets deeper.
** A decision tree can be used for either regression or classification.
** Decision trees are easy to understand, visualize and interpret.
** The flowchart-like structure helps us in decision-making.
Thanks for reading!