Does MICE fill in null values? Yes, it does, as part of a machine learning workflow. MICE (Multiple Imputation by Chained Equations) is a technique used during data analysis to fill in missing values when they are **Missing Completely At Random (MCAR)**.

__Introduction__

MICE was first introduced in 2001 as an R package. It did not reach Python until 2019, when it was introduced under the name 'IterativeImputer'.

In this technique, the missing values are imputed iteratively using predictions based on the values of the other columns. MICE imputation can be done on a single column or on multiple columns at once.

MICE is a powerful algorithm that produces some of the most plausible values for imputation and, in turn, a stronger model.

How do you all feel about learning the code for MICE in two programming languages? Interesting, right?

That is what I am here to show in this blog.

Let's embark on the adventure of learning MICE in both R and Python.

To understand the basic working of MICE, please refer to the blog written by my peer.

__Understanding MICE in R Programming__

In R, MICE can be run with a simple one-line call:

```
mice(
  data,
  m = 5,
  method = NULL,
  predictorMatrix,
  ignore = NULL,
  where = NULL,
  blocks,
  visitSequence = NULL,
  formulas,
  blots = NULL,
  post = NULL,
  defaultMethod = c("pmm", "logreg", "polyreg", "polr"),
  maxit = 5,
  printFlag = TRUE,
  seed = NA,
  data.init = NULL,
  ...
)
```

The most commonly used arguments of mice() are:

**data** - The dataset in which the null values are to be imputed.

**m** - Number of multiple imputations. The default is 5.

**method** - Specifies the imputation method. It can be a single method for all the columns, or different methods for different columns. For example: pmm, logreg, polyreg.

**maxit** - Maximum number of iterations.

**seed** - An integer that seeds the random number generator, so the same results are produced on every run.

Please use the command **help(mice)** to learn more about the mice() arguments.

__MICE Example__

To give you an idea of how it works in R, let's take a dataset that is built into the mice package. First, let us install the package using the code below.

`install.packages('mice')`

Note that this step needs to be performed only once in RStudio. It works the same way as __pip install__ in Python.

Once installed, the RStudio Packages tab shows a tick mark next to the package name to indicate that the package was installed correctly.

We have to load the package every time we want to use it, similar to __import__ in Python. It is done as follows.

`library('mice')`

To use the inbuilt data, we will follow the code shown in the screenshot.

The orange lines are the code and the white lines show the output.

The above data frame, mice's built-in nhanes dataset, has 4 variables with 25 observations: Age (age), BMI (bmi), Hypertensive (hyp), and Cholesterol (chl). Below is the explanation of the variables:

__age__

Age group (1=20-39 years, 2=40-59 years, 3=60+ years)

__bmi__

Body mass index (kg/m²)

__hyp__

Hypertensive (1=Not Hypertensive, 2=Hypertensive)

__chl__

Total serum cholesterol (mg/dL)

Below is the summary of the dataset and the code for the same.

The summary shows that bmi, hyp, and chl have 9, 8, and 10 missing values respectively.

We perform some **basic EDA (Exploratory Data Analysis)** by checking the data types of the columns.

Skimming through the data, it is clear that Hypertensive and Age are categorical columns, whereas the above screenshot shows their data type as numerical.

When a data type needs to be changed, we use the lines of code below:

```
data$hyp = as.factor(data$hyp)
data$age = as.factor(data$age)
```

Thus, the data type is changed for both columns.

It can be seen that **age** says Factor w/ 3 levels, meaning it has three categories (1, 2, 3), and the **hyp** column says Factor w/ 2 levels, meaning it has two categories (1, 2).

To **check the pattern of null values** in the data, the md.pattern() function built into the mice package can be used.

The code generates both a numerical and a graphical representation. To better understand them, let us go through a few points.

The **left axis** sums to the total number of rows. 13 indicates that there are 13 rows with **no missing values**; 7 indicates that there are 7 rows with only the **age** values present. The rest of the numbers on the left-hand side, 3, 1, and 1, show the number of rows missing only **cholesterol**, only **bmi**, or only **hyp and bmi**, respectively.

The **bottom axis** shows the **total number of null values** per column, as shown in the summary of the data (Figure 3). The **right axis** gives the number of missing values in each row pattern. The **missing values** are **marked in pink** in the graphical representation. The value 27 indicates the **total number of missing values in the data**.

Looking at the pattern of the missing data, we can see that the values are **missing completely at random**, and filling them with a single summary statistic like the mean, median, or mode would not give us meaningful data.

Thus, we will use MICE to do the imputation.

__Imputing the data__

```
imp_data = mice(data, m = 5, method = c("", "pmm", "logreg", "pmm"), maxit = 20)
```

The above code says that mice() will impute the null values in the data **five times, using up to 20 iterations each**.

Let's use predictive mean matching (**pmm**) in R to do the imputation. Since **hyp** is a categorical variable, logistic regression (**logreg**) is used for it. The empty "" in the method vector is for the **age** column, as it has no missing values to fill.

The output has been stored in imp_data.

In each round, mice has produced values that could plausibly be close to the missing value. How do we know which one to use? We have to choose from the 5 rounds of imputations.

__Choosing the right data__

Let us check the summary of the data before MICE imputation.

The above is the summary of the data after the **age** and **hyp** columns were changed to categorical.

It is seen that the mean and median of **bmi** lie between 26.50 and 26.75. Also, the interquartile range of **bmi** runs from 22.65 to 28.93.

Let's have a look at the **bmi** imputed values (Figure 8.2).

Imputations 2, 4, and 5 have values above 30, which is greater than the 3rd-quartile value (28.93), so those imputations are removed from the race. Imputation 1 has values closer to the mean than imputation 3.

Likewise, for the **chl** column (Figure 8.3), the other imputations have values like 284, which is well above the 3rd-quartile value, and keeping them might skew the data.

Only round 1 of the imputations has more values close to the mean and median, with many lying within the interquartile range.

*Thus, we can conclude that round 1 of the imputations is the best fit for the data.*

To fill the data with values from that imputation, we use the code below.

`final_data = complete(imp_data,1)`

**final_data** now holds the imputed values.

__Checking the data fit__

We can plot the data to see if the imputed values are a good fit. Using a strip plot, we can see how well the imputed values fit among the observed values.

`stripplot(imp_data, pch = 20, cex = 1.2)`

__The blue dots are the observed values and the magenta dots are the imputed values.__ We can see that the imputed values are a good fit to the data for further processing.

Now that we have understood the working of MICE in R, let us learn the Python variant too.

__Understanding MICE in Python__

In Python, MICE can be run through a class called **IterativeImputer**. It was introduced as an experimental feature and is available in the **sklearn.impute** module.

This is how the IterativeImputer signature looks:

```
IterativeImputer(
    estimator=None,
    *,
    missing_values=nan,
    sample_posterior=False,
    max_iter=10,
    tol=0.001,
    n_nearest_features=None,
    initial_strategy='mean',
    imputation_order='ascending',
    skip_complete=False,
    min_value=-inf,
    max_value=inf,
    verbose=0,
    random_state=None,
    add_indicator=False,
)
```

Some of the most commonly used parameters of the IterativeImputer are:

**estimator** - The algorithm the imputer uses at each step, e.g. linear regression, logistic regression, or XGBoost.

**max_iter** - The maximum number of iterations the algorithm can go through. Same as maxit in R.

**tol** - Tolerance for the stopping condition: the amount of change allowed in the imputed values between iterations. If the change in values between iterations falls below the tolerance, the iterations stop and the values obtained at that point are treated as final.

**n_nearest_features** - The number of nearest features that can be used to determine the missing value.

**initial_strategy** - The simple method used to initially impute the data. It can be the mean, median, most frequent value, or a constant.

**verbose** - Controls the debug messages printed during processing. It can be 0, 1, or 2; the higher the number, the more messages.

**random_state** - A seed to make sure the same results are produced every time.

To understand the rest of the parameters, we can either do a Shift+Tab or use the below link.

Let's get the ball rolling.

__IterativeImputer Example__

To begin with, let us import the required packages,

```
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.impute import IterativeImputer
```

We will first create a dataset with missing values.

__Data Creation__

```
df = pd.DataFrame({
    'age': [25, 27, 29, 31, 33, np.nan],
    'experience': [np.nan, 3, 5, 7, 9, 11],
    'salary': [50, np.nan, 110, 140, 170, 200],
    'purchased': [0, 1, 1, 0, 1, 0]
})
df
```
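Before imputing, it can help to confirm where the missing values sit. This quick check (an addition not shown in the original screenshots) is the rough Python analogue of R's md.pattern() summary:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'age': [25, 27, 29, 31, 33, np.nan],
    'experience': [np.nan, 3, 5, 7, 9, 11],
    'salary': [50, np.nan, 110, 140, 170, 200],
    'purchased': [0, 1, 1, 0, 1, 0]
})

# Count the missing cells in each column: one each in age, experience, and salary.
print(df.isna().sum())
```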

All the columns are self-explanatory. The target variable here is **purchased**. Thus, the data is split into the columns to be processed and the target column.

```
X = df.drop(columns='purchased')
y = df['purchased']
```

We will be using the X dataset.

__Imputer implementation__

A linear regression model is used as the estimator in the IterativeImputer to perform the imputation.

```
lr = LinearRegression()
imp = IterativeImputer(estimator=lr, verbose=2, max_iter=30, tol=1e-10, imputation_order='roman')
```

__Getting the final data__

When we fit and transform,

`imp.fit_transform(X)`

the output will give you the imputed data.

As a **first step**, the algorithm fills in the missing values with the mean of each column. In the next step, it removes the imputed value for the age column, keeps the rest of the imputed values, and tries to predict the value for the age column. The difference between the prediction in step 2 and the mean in step 1 is the **change**. The scaled tolerance is the given tolerance multiplied by the maximum value present in the data; in this case the scaled tolerance is 1e-10 * 200 = 2e-08.

Similar steps repeat for every iteration. When the **change** drops below the **scaled tolerance**, the iterations stop.
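The stopping rule described above can be written out in a few lines of plain NumPy. This is an illustration of the criterion, not scikit-learn's internal code; the `converged` helper is a name made up for this sketch:

```python
import numpy as np

tol = 1e-10  # the tolerance passed to IterativeImputer

# Same toy data as before (purchased column dropped), with NaNs for the gaps.
X = np.array([[25.0, np.nan, 50.0],
              [27.0, 3.0, np.nan],
              [29.0, 5.0, 110.0],
              [31.0, 7.0, 140.0],
              [33.0, 9.0, 170.0],
              [np.nan, 11.0, 200.0]])

# Scaled tolerance: tol times the largest absolute observed value (here 200).
scaled_tol = tol * np.nanmax(np.abs(X))
print(scaled_tol)  # 2e-08, matching the arithmetic above

def converged(X_prev, X_new, scaled_tol):
    # Stop once the largest change in any cell falls below the scaled tolerance.
    return np.max(np.abs(X_new - X_prev)) < scaled_tol
```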

The above output shows that the **change** is much smaller than the **scaled tolerance** in the 9th round. This stopping condition takes precedence over the maximum number of iterations, i.e. 30 in this case.

The output array shows that the missing values have been imputed with 35, 1, and 80 for age, experience, and salary respectively.
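Putting all the steps together, here is a minimal end-to-end sketch (assuming a scikit-learn version where IterativeImputer still needs the experimental-feature import). Because the relationships in this toy data are exactly linear (age = experience + 24, salary = 15 × age − 325), the imputer lands on 35, 1, and 80:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

df = pd.DataFrame({
    'age': [25, 27, 29, 31, 33, np.nan],
    'experience': [np.nan, 3, 5, 7, 9, 11],
    'salary': [50, np.nan, 110, 140, 170, 200],
    'purchased': [0, 1, 1, 0, 1, 0]
})
X = df.drop(columns='purchased')

# Iteratively regress each column on the others until the change between
# rounds drops below the scaled tolerance (or max_iter is reached).
imp = IterativeImputer(estimator=LinearRegression(), max_iter=30,
                       tol=1e-10, imputation_order='roman')
X_imputed = imp.fit_transform(X)
print(X_imputed)
```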

__Conclusion__

MICE has thus filled in the missing values in the data. I hope this blog was able to guide you through the journey of null-value imputation with MICE and IterativeImputer in R and Python respectively.

Please feel free to add in your comments and feedback.

Keep learning and keep exploring!
