Missing data is very common in statistical analysis, and imputing missing values is an important step in data analysis. Many algorithms used to analyze large-scale data require fully observed datasets without any missing values. However, this is seldom the case in practice, for example in medical research.
There are many techniques for handling missing values, from simple mean/median/mode imputation to more sophisticated methods such as KNN. So how much does the choice of method affect the final result? The answer is: very much.
Simple methods like mean/median/mode imputation often don't work well. They underestimate the variance in the data, because many values are set to exactly the same number, which would not be the case in real life.
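To make the variance point concrete, here is a minimal sketch using only the Python standard library. The sample values and the choice of which entries are "missing" are made up for illustration; the point is only that filling every gap with the same mean shrinks the sample variance.

```python
import statistics

# Hypothetical complete sample of heights (cm); values are illustrative only.
heights = [47.1, 52.3, 59.1, 64.7, 68.2, 74.0, 81.6, 88.4]

# Pretend three values are missing (indices chosen arbitrarily).
missing_idx = {1, 5, 7}
observed = [h for i, h in enumerate(heights) if i not in missing_idx]

# Mean imputation: every missing entry receives the same observed mean.
fill = statistics.mean(observed)
imputed = [fill if i in missing_idx else h for i, h in enumerate(heights)]

var_observed = statistics.variance(observed)
var_imputed = statistics.variance(imputed)

print(f"variance of observed values:    {var_observed:.2f}")
print(f"variance after mean imputation: {var_imputed:.2f}")
```

The imputed entries sit exactly on the mean, so they add nothing to the sum of squared deviations while increasing the sample size, which is why the second variance comes out smaller.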
Among the standard imputation methods such as KNN and MICE, several studies have reported that MissForest performs favorably, and it has been used as a benchmark for non-parametric imputation methods. In a comparison study by Waljee et al., MissForest was found to consistently produce the lowest imputation error compared with other imputation methods, including KNN and MICE, when data were missing completely at random (MCAR).
MissForest - How does it work?
To illustrate, we will take a simple dataset of kids' data with weight (kg) and height (cm) columns. There are missing values in the height column, and we will use MissForest to impute them.
MissForest is an imputation algorithm that uses random forests to do the task. It works as follows:
Step 1 - Initialization.
For each variable containing missing values, the missing values are replaced with the variable's mean (for continuous variables) or its most frequent class (for categorical variables).
Mean of the observed heights = (47.1 + 64.7 + 68.2 + 81.6) / 4 = 65.4
So 65.4 (the mean) replaces every NaN in the height column.
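The initialization step can be sketched in a few lines of plain Python. Only the four observed heights come from the example above; the weights, the number of missing rows and their positions are assumptions made up for this sketch.

```python
import math
import statistics

# Hypothetical rows: the weights and the positions of the missing heights are
# invented; only the four observed heights are taken from the example.
weights = [3.0, 5.6, 7.0, 7.6, 8.1, 11.0]
heights = [47.1, math.nan, 64.7, math.nan, 68.2, 81.6]

# Remember which entries were missing; later steps re-predict exactly these.
is_missing = [math.isnan(h) for h in heights]

# Initial guess: the mean of the observed heights, rounded to one decimal.
observed = [h for h in heights if not math.isnan(h)]
initial_guess = round(statistics.mean(observed), 1)

# Replace every NaN with the initial guess.
initialized = [initial_guess if m else h for h, m in zip(heights, is_missing)]

print(initial_guess)
print(initialized)
```

Keeping the `is_missing` mask around matters: the mean is only a placeholder, and the algorithm needs to know which cells to overwrite in the next step.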
Step 2 - Imputation. The variables are imputed sequentially, in ascending order of their number of missing values. For each variable, the observations are divided into two parts according to whether the variable is observed or missing in the original dataset. The rows where it is observed are used as the training set, and the rows where it is missing are used as the prediction set.
The training set and the prediction set are fed into a random forest (RF) model.
This RF model is trained to predict height from weight. Its predictions are then filled in for the prediction set to produce a transformed dataset.
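A single imputation pass for the height column might look like the sketch below. It uses scikit-learn's `RandomForestRegressor` as a stand-in for the random forest; the weights and the positions of the missing heights are hypothetical, and the forest settings (`n_estimators=100`, `random_state=0`) are arbitrary choices for the example rather than values taken from MissForest itself.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Hypothetical data: weights (kg) are invented, observed heights (cm) come
# from the example, and the NaN positions are assumed.
weight = np.array([3.0, 5.6, 7.0, 7.6, 8.1, 11.0])
height = np.array([47.1, np.nan, 64.7, np.nan, 68.2, 81.6])

# Split rows by whether height is observed (training) or missing (prediction).
missing = np.isnan(height)
X_train = weight[~missing].reshape(-1, 1)
y_train = height[~missing]
X_pred = weight[missing].reshape(-1, 1)

# Train a random forest to predict height from weight.
rf = RandomForestRegressor(n_estimators=100, random_state=0)
rf.fit(X_train, y_train)

# Fill the predictions into the rows that were originally missing.
imputed = height.copy()
imputed[missing] = rf.predict(X_pred)

print(imputed)
```

Here weight is fully observed, so it can be used as a predictor directly. With several incomplete columns, the predictors would first carry their step 1 initial guesses (or their latest imputations).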
Step 3 - Stop. Once all the variables with missing data have been imputed, one imputation iteration is complete. The imputation then continues for multiple iterations.
The reason for the multiple iterations is that, from iteration 2 onwards, the random forests performing the imputation are trained on better and better quality data: the model uses its current imputations to improve itself further.
The imputation process is iterated until the relative sum of squared differences (or, for categorical variables, the proportion of falsely classified entries) between the current and the previous imputation results increases for the first time. MissForest then outputs the previous imputation as the final result.
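The stopping rule can be sketched as follows. The two difference measures follow my reading of the criterion described above: the squared change between successive imputations divided by the sum of squares of the new imputation for continuous data, and the share of changed entries among the missing ones for categorical data. The function names and the sequence of imputed columns in `history` are made up for illustration.

```python
def continuous_difference(new, old):
    """Relative sum of squared differences between two imputed columns."""
    numerator = sum((n - o) ** 2 for n, o in zip(new, old))
    denominator = sum(n ** 2 for n in new)
    return numerator / denominator


def categorical_difference(new, old, n_missing):
    """Proportion of originally missing entries whose class changed."""
    changed = sum(1 for n, o in zip(new, old) if n != o)
    return changed / n_missing


# Hypothetical height column after the mean fill and three iterations.
history = [
    [47.1, 65.4, 64.7, 65.4, 68.2, 81.6],  # step 1: mean fill
    [47.1, 59.1, 64.7, 66.0, 68.2, 81.6],  # iteration 1
    [47.1, 59.1, 64.7, 66.3, 68.2, 81.6],  # iteration 2
    [47.1, 58.0, 64.7, 66.9, 68.2, 81.6],  # iteration 3
]

# diffs[k] compares history[k + 1] with history[k].
diffs = [
    continuous_difference(history[i], history[i - 1])
    for i in range(1, len(history))
]

# Stop the first time the difference increases and keep the previous result.
final = history[-1]
for i in range(1, len(diffs)):
    if diffs[i] > diffs[i - 1]:
        final = history[i]  # the imputation before the increase
        break

print(diffs)
print(final)
```

In this made-up sequence the difference shrinks from iteration 1 to iteration 2 and grows again at iteration 3, so the result of iteration 2 is kept.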
In each subsequent iteration, the model either adjusts its predictions or keeps them the same. In the table below, we can see that the second row (height 59.1) is kept the same by the model, but the 4th row is treated as part of the prediction set again.
The iterative process of training and predicting continues until a stopping criterion is met or a user-specified maximum number of iterations is reached. Generally, datasets become well imputed after 4 to 5 iterations, depending on the size of the dataset and the amount of missing data. The maximum number of iterations defaults to 10, which limits the computation time to a reasonable level.
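For experimenting with the whole loop in Python, one option is scikit-learn's `IterativeImputer` with a random forest as its estimator. This is an approximation in the spirit of MissForest, not the missForest implementation itself: in particular, its early-stopping rule is based on a tolerance rather than on the first increase of the difference described above. The data below are hypothetical apart from the four observed heights.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor

# Hypothetical data: columns are weight (kg) and height (cm).
data = np.array([
    [3.0, 47.1],
    [5.6, np.nan],
    [7.0, 64.7],
    [7.6, np.nan],
    [8.1, 68.2],
    [11.0, 81.6],
])

imputer = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=100, random_state=0),
    initial_strategy="mean",       # step 1: start from the column mean
    imputation_order="ascending",  # fewest missing values first
    max_iter=10,                   # cap on the number of iterations
    random_state=0,
)

# Runs the initialize / train / predict loop and returns the completed array.
completed = imputer.fit_transform(data)

print(completed)
```

On a dataset this small scikit-learn may emit a convergence warning. That only means the tolerance was not reached within `max_iter` rounds, and the completed array is still returned.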
Advantages:
1. It can be applied to mixed data types, both numerical and categorical.
2. No preprocessing is required (no standardization, normalization, scaling, data splitting, etc.).
3. It is robust to noisy data, as random forests effectively have built-in feature selection.
4. Using OOB (out-of-bag) error estimates, it assesses the quality of an imputation without the need for laborious cross-validation.
Disadvantages:
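MissForest trains a random forest for every variable with missing values in each iteration, so it is more expensive to run than simpler methods. If the dataset is sufficiently small, that extra cost may not be worth it.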
Also, it is an algorithm, not a model object, meaning it must be rerun every time data is imputed, which could be problematic in some production environments.
Conclusion:
Being able to impute missing data effectively is of great importance to scientists working with real-world data today. MissForest is a highly accurate imputation method and, in the comparisons cited above, outperformed other methods such as MICE and KNN. So the next time you encounter missing data in your dataset, try using MissForest!