All about Random Forests and handling Missing Values in them.
Random Forest is one of the most popular and powerful supervised machine learning algorithms, and it can perform both classification and regression tasks. As the name suggests, the algorithm builds a forest out of a number of decision trees. In general, the more decision trees in the forest, the more stable the prediction, although the gain in accuracy levels off once the forest is large enough.
To build a forest of multiple decision trees, we use the same splitting criteria that we used for single decision trees, such as Information Gain and the Gini Index.
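As a rough illustration of the two split criteria mentioned above, here is a minimal sketch in plain Python. The function names and the toy labels are assumptions made for this example; they do not come from any particular library.

```python
from collections import Counter
from math import log2

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def entropy(labels):
    """Shannon entropy of the class labels, in bits."""
    n = len(labels)
    return -sum((count / n) * log2(count / n) for count in Counter(labels).values())

def information_gain(parent, left, right):
    """Entropy of the parent minus the size-weighted entropy of the two children."""
    n = len(parent)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted

# Hypothetical labels at a node, and a candidate split that separates them perfectly.
parent = ["yes", "yes", "yes", "no", "no", "no"]
left = ["yes", "yes", "yes"]
right = ["no", "no", "no"]

print(gini(parent))                           # 0.5
print(information_gain(parent, left, right))  # 1.0
```

A tree picks, at each node, the split with the lowest impurity in the children (or, equivalently, the highest gain).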
How does a Random Forest Work:
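In a Random Forest, we grow many trees instead of the single tree of a CART model. Each tree is trained on a random sample of the training data and considers a random subset of the features when splitting, which is where the "random" in the name comes from.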
To classify a new object based on its attributes, each tree gives a classification, and we say that the tree votes for that class. The Random Forest classifier chooses the class with the most votes. For regression, the Random Forest regressor takes the average of the outputs of the different trees. We will not go into detail about how Random Forests work in this blog; maybe we will cover that in another one.
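The voting and averaging steps can be sketched in a few lines of plain Python. The per-tree predictions below are made-up values for illustration, and `forest_classify` and `forest_regress` are hypothetical helper names, not library functions.

```python
from collections import Counter
from statistics import mean

def forest_classify(tree_votes):
    """Return the class that received the most votes from the trees."""
    return Counter(tree_votes).most_common(1)[0][0]

def forest_regress(tree_outputs):
    """Return the average of the individual tree outputs."""
    return mean(tree_outputs)

# Hypothetical predictions from five trees for one new sample.
votes = ["spam", "ham", "spam", "spam", "ham"]
outputs = [10.0, 12.0, 11.0, 9.0, 13.0]

print(forest_classify(votes))   # spam
print(forest_regress(outputs))  # 11.0
```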
In real-world data, values are sometimes absent for various reasons, ranging from human error during data entry and incorrect sensor readings to software bugs in the data processing pipeline. Training a model on a data set with a lot of missing values can drastically reduce the quality of the machine learning model.
Handling missing values is one of the greatest challenges analysts face, because the choice of how to handle them largely determines how robust the resulting model is.
Types of Missing Data:
Missing Completely at Random (MCAR)
The probability that a value is missing is unrelated to the data, whether observed or unobserved. (We know the data is missing purely by chance.)
Missing at Random (MAR)
The missingness is not purely random, but it can be fully accounted for by variables for which we have complete information. (We know why the data is missing, and we have measured the cause.)
Missing Not at Random (MNAR)
The data is missing because of something we haven't measured, possibly the missing value itself. (We have no idea why the data is missing, or we know why but didn't measure the cause.)
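To make the three mechanisms concrete, the following sketch simulates each one on a toy dataset. Everything here is an assumption chosen for illustration: the `age` and `income` columns, the thresholds, and the missingness probabilities.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical records: age is always observed, income may go missing.
rows = [{"age": random.randint(20, 70), "income": random.gauss(50_000, 15_000)}
        for _ in range(1000)]

def mask_income(rows, p_missing):
    """Copy the rows, blanking income with a probability computed per row."""
    out = []
    for row in rows:
        missing = random.random() < p_missing(row)
        out.append({"age": row["age"],
                    "income": None if missing else row["income"]})
    return out

# MCAR: every row has the same 20% chance, whatever its values are.
mcar = mask_income(rows, lambda r: 0.2)

# MAR: the chance depends only on age, which is fully observed.
mar = mask_income(rows, lambda r: 0.4 if r["age"] > 50 else 0.1)

# MNAR: the chance depends on the income value that goes missing.
mnar = mask_income(rows, lambda r: 0.4 if r["income"] > 60_000 else 0.1)

def missing_rate(masked):
    """Fraction of rows whose income is missing."""
    return sum(r["income"] is None for r in masked) / len(masked)

for name, data in [("MCAR", mcar), ("MAR", mar), ("MNAR", mnar)]:
    print(name, round(missing_rate(data), 3))
```

In the MAR case the pattern can be recovered from the `age` column. In the MNAR case nothing in the remaining data explains which incomes were dropped.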
One of the main advantages of the Random Forest algorithm is that it tends to handle MCAR and MAR data well and can maintain accuracy when part of the data is missing. It adapts to the structure of the data, taking both variance and bias into account, and tends to produce better results on large datasets.
Random Forests and Missing Data:
Missing data is problematic because many statistical analyses require complete data. Researchers who want to use such an analysis are forced to choose between imputing the missing values and discarding the incomplete records. Simply discarding them is not good practice, as valuable information may be lost and inferential power compromised. Imputing the missing data is therefore a more practical way to proceed in those cases.
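The trade-off between discarding and imputing can be seen on a tiny example. This is a minimal sketch with made-up data, and it uses plain mean imputation as the simplest possible baseline, not a Random Forest method.

```python
from statistics import mean

# Hypothetical dataset: None marks a missing value.
data = [
    {"height": 170.0, "weight": 65.0},
    {"height": None,  "weight": 80.0},
    {"height": 160.0, "weight": None},
    {"height": 180.0, "weight": 90.0},
]

def drop_incomplete(rows):
    """Listwise deletion: keep only the rows with no missing value."""
    return [r for r in rows if None not in r.values()]

def impute_mean(rows):
    """Replace each missing value with the mean of that column's observed values."""
    columns = list(rows[0].keys())
    means = {c: mean(r[c] for r in rows if r[c] is not None) for c in columns}
    return [{c: means[c] if r[c] is None else r[c] for c in columns} for r in rows]

print(len(drop_incomplete(data)))  # 2
print(impute_mean(data)[1])        # {'height': 170.0, 'weight': 80.0}
```

Deletion throws away half of the rows here, including the observed weight of 80.0, while imputation keeps all four rows.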
While many statistical methods have been developed for imputing missing data, many of them perform poorly in high-dimensional and large-scale data settings.
By contrast, Random Forests can
1. handle mixed types of missing data,
2. address interactions and nonlinearity,
3. scale to high dimensions while avoiding overfitting,
4. and yield measures of variable importance useful for variable selection.
Some Missing Data Algorithms in RF:
1. RF Proximity Algorithm.
A detailed explanation of the RF proximity algorithm can be found in the link below.
2. MICE (Multivariate Imputation by Chained Equations).
A detailed explanation of MICE can be found in the link below.
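To give a feel for the proximity idea before following the links, here is a minimal sketch of a single imputation step for one numeric column. The proximity of two samples is taken as the fraction of trees in which they land in the same leaf, and a missing value is filled with the proximity-weighted average of the observed values. The leaf indices below are made up by hand; in a full run they would come from a forest fitted on roughly filled-in data, and the step would be repeated a few times. The function names are hypothetical.

```python
def proximity(leaves_a, leaves_b):
    """Fraction of trees in which two samples fall into the same leaf."""
    same = sum(a == b for a, b in zip(leaves_a, leaves_b))
    return same / len(leaves_a)

def proximity_impute(values, leaf_ids):
    """Fill each missing value with the proximity-weighted mean of the observed ones."""
    filled = list(values)
    for i, value in enumerate(values):
        if value is not None:
            continue
        # Pair each observed value with its proximity to sample i.
        weighted = [(proximity(leaf_ids[i], leaf_ids[j]), values[j])
                    for j in range(len(values)) if values[j] is not None]
        total = sum(w for w, _ in weighted)
        if total > 0:  # leave the gap if no observed sample shares a leaf
            filled[i] = sum(w * x for w, x in weighted) / total
    return filled

# Hypothetical leaf indices: leaf_ids[i][t] is the leaf of sample i in tree t.
leaf_ids = [
    [0, 1, 0, 5],  # shares 3 of 4 leaves with the last sample
    [0, 9, 9, 9],  # shares 1 of 4 leaves with the last sample
    [3, 2, 1, 0],  # shares no leaf with the last sample
    [0, 1, 0, 2],  # sample with the missing value
]
values = [10.0, 20.0, 30.0, None]

print(proximity_impute(values, leaf_ids))  # [10.0, 20.0, 30.0, 12.5]
```

The sample that never shares a leaf with the incomplete one gets weight zero, so its value of 30.0 has no influence on the imputed result.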
Being able to impute missing data effectively is of great importance to scientists working with real-world data today. Many findings have shown Random Forests to be among the best-performing of the different methods, with excellent prediction performance and the ability to handle all forms of data. Given that RF has the characteristics needed for handling missing data, it is a very attractive choice for imputing data.