Random Forest:
Random Forest is a supervised learning algorithm that builds on decision trees. As the name suggests, it creates a forest out of a number of trees. It can be used for both classification and regression problems.
To understand Random Forest better, we must first know what a decision tree is and how it works.
Decision Tree:
I am sure all of us use the decision tree technique in our day-to-day lives, knowingly or unknowingly. We just don’t give a fancy name to that decision-making process. Let’s see it with an example.
Ask yourself: are you hungry?
Decision trees are a type of model used for both classification and regression. A decision tree typically starts with a single node (the root) and then divides into branches, one for each possible outcome. In the example above, we are deciding whether we are hungry or not. The first node splits into two branches, Yes and No. When the answer is Yes, the branch splits further into two; when it is No, it stops splitting.
It’s very easy to determine the outcome visually in the example above, but in general it is not, because real data will not be this clean. A decision tree has to evaluate many candidate splits before each branch is divided and a final outcome is reached.
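The hungry-or-not tree can be written down as a couple of nested conditions. This is only a rough sketch: the second question (has_food_at_home) and the outcome labels are hypothetical, since the example only says that the Yes branch splits again and the No branch stops.

```python
def decide(hungry: bool, has_food_at_home: bool) -> str:
    """Walk a tiny hand-written decision tree and return the outcome at the leaf."""
    if not hungry:
        # "No" branch: the tree stops splitting here, so this is a leaf.
        return "do nothing"
    # "Yes" branch: split again on a second (hypothetical) question.
    if has_food_at_home:
        return "cook"
    return "order food"


print(decide(hungry=True, has_food_at_home=False))
print(decide(hungry=False, has_food_at_home=True))
```

A real decision tree learns these questions and their order from the data instead of having them written by hand.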
Now let’s go back to our main topic...
Random Forest Classifier:
A random forest builds a number of decision trees on randomly selected samples of the data, gets a prediction from each tree, and selects the final answer by voting.
To put it simply, a random forest builds multiple decision trees and merges them together to get a more accurate and stable prediction.
Let’s see how the random forest classifier works…
Let’s say you want to go on a holiday and can’t decide on the place. So you ask your friends, and each of them gives a recommendation. Now you have a list of places, so you ask them to vote for the best one. The place that gets the most votes is the final choice for your trip.
In this example, each friend’s recommendation is like the prediction of a single decision tree. Collecting all the recommendations, putting them to a vote, and going with the winner is what the random forest algorithm does.
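The voting step in the holiday example can be sketched in a few lines of Python. The place names and the helper majority_vote are made up for illustration; in a real forest each vote would be the class predicted by one tree.

```python
from collections import Counter


def majority_vote(votes):
    """Return the option that received the most votes.

    If two options tie, the one that was seen first wins.
    """
    counts = Counter(votes)
    winner, _ = counts.most_common(1)[0]
    return winner


# Hypothetical recommendations, one per friend (one per tree in a forest).
recommendations = ["Goa", "Manali", "Goa", "Ooty", "Goa", "Manali"]
print(majority_vote(recommendations))
```

For regression problems the same idea applies, except that the predictions are averaged instead of counted.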
A large number of relatively uncorrelated trees operating together will usually outperform any of the individual trees. The “forest” is an ensemble of decision trees, usually trained with the “bagging” (bootstrap aggregating) method. Bagging trains each model on a random sample of the training data drawn with replacement and then combines their predictions, which improves the overall result.
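To make the sampling part of bagging concrete, here is a small sketch using only the Python standard library. The helper bootstrap_sample and the ten-row stand-in dataset are assumptions for illustration; the actual tree training is left out.

```python
import random


def bootstrap_sample(rows, rng):
    """Draw len(rows) items from rows with replacement."""
    return [rng.choice(rows) for _ in rows]


rng = random.Random(0)   # fixed seed so the sketch is repeatable
rows = list(range(10))   # stand-in for ten training rows

# One bootstrap sample per tree; each tree would be trained on its own sample.
samples = [bootstrap_sample(rows, rng) for _ in range(3)]
for i, sample in enumerate(samples):
    left_out = sorted(set(rows) - set(sample))
    print(f"tree {i}: sample={sample} left out={left_out}")
```

Because rows are drawn with replacement, some rows appear more than once in a sample and others not at all, which is what makes the trees differ from one another.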
Important Parameters of Random Forest Classifier:
1. n_estimators - The number of trees the algorithm builds before taking the majority vote (classification) or the average (regression) of their predictions. More trees generally improve performance and make predictions more stable, at the cost of longer training time.
2. criterion - The function used to measure the quality of a split. The criterion can be Gini impurity or entropy; by default the split uses Gini.
3. max_features - One of the most important parameters: the maximum number of features the random forest considers when looking for the best split at a node.
4. min_samples_leaf - The minimum number of samples that must end up in a leaf node. A split is only considered if it leaves at least this many samples in each of its branches.
5. n_jobs - Tells the engine how many jobs it can run in parallel. A value of 1 means it uses only one processor, and -1 means it uses all available processors.
6. random_state - Controls the randomness of the sampling. With a fixed value of random_state, the same hyperparameters, and the same training data, the model will always produce the same results.
7. oob_score - OOB means out-of-bag. Setting this parameter gives a built-in validation estimate that works somewhat like cross-validation. Because each tree is trained on a random sample drawn with replacement, about one-third of the data is left out of that tree’s training set and can be used to evaluate performance. These samples are called the out-of-bag samples.
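The parameter names above match scikit-learn’s RandomForestClassifier, so a minimal sketch using that library could look like the following. The choice of library, the synthetic dataset from make_classification, and the specific values are assumptions made for illustration, not recommended settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data, used here only so the example is self-contained.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

model = RandomForestClassifier(
    n_estimators=100,      # number of trees in the forest
    criterion="gini",      # split quality measure ("entropy" is the alternative)
    max_features="sqrt",   # features considered at each split
    min_samples_leaf=1,    # minimum samples required at a leaf
    n_jobs=-1,             # use all available processors
    random_state=42,       # fix the randomness for repeatable results
    oob_score=True,        # score the forest on out-of-bag samples
)
model.fit(X, y)

print("Out-of-bag accuracy:", model.oob_score_)
print("Number of trees:", len(model.estimators_))
```

After fitting, oob_score_ holds the accuracy estimated on the out-of-bag samples, which gives a quick check of the model without setting aside a separate validation set.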
Summary:
Random forest is a great choice for anyone who needs to develop a model quickly. Random forests are also often hard to beat performance-wise, and they can handle many different feature types, such as binary, categorical, and numerical.
Overall, random forest is a simple and flexible tool. That being said, like any model, it has its own limitations.