# Let the MICE fill the missing values with R and Python

MICE? Filling in null values? Yes, it does, when it is part of machine learning. MICE (Multiple Imputation by Chained Equations) is a technique used during data analysis to fill in missing values. It is designed for values that are missing at random, which includes the stricter case where they are **Missing Completely At Random (MCAR).**

__Introduction__

MICE was first introduced in 2001 as an R package. It was not until 2019 that it was introduced in Python, under the name 'IterativeImputer'.

In this technique, the missing values are imputed iteratively, using predictions based on the values of the other columns. MICE imputation can be done on a single column or on multiple columns at once.

MICE is a powerful algorithm. Because each missing value is predicted from the other columns, the imputed values tend to be plausible, which gives any model trained on the data a stronger footing.

How do you feel about learning the code for MICE in two programming languages? Interesting, right?

That is what I am here to show in this blog.

Let's embark on the adventure of learning MICE in both R and Python.

To understand the basic working of MICE, please refer to the blog written by my peer:

__https://www.numpyninja.com/post/mice-algorithm-to-impute-missing-values-in-a-dataset__

__Understanding MICE in R Programming__

In R, MICE can be run with a single call to the mice() function. Its full signature is shown below.

```
mice(
data,
m = 5,
method = NULL,
predictorMatrix,
ignore = NULL,
where = NULL,
blocks,
visitSequence = NULL,
formulas,
blots = NULL,
post = NULL,
defaultMethod = c("pmm", "logreg", "polyreg", "polr"),
maxit = 5,
printFlag = TRUE,
seed = NA,
data.init = NULL,
...
)
```

The most commonly used arguments of mice() are:

**data** - The data in which the null values must be imputed.

**m** - The number of multiple imputations. The default is 5.

**method** - The imputation method. It can be a single method for all the columns or a different method for each column, for example pmm, logreg or polyreg.

**maxit** - The number of iterations. The default is 5.

**seed** - A number used to seed the random number generator, so that the same imputed values are produced on every run.

Please use the command **help(mice)** to learn more about the mice() arguments.

__MICE Example__

To give you an idea of how it works in R, let's take a dataset that is built into the mice package. As a first step, let us install the package using the code below.

`install.packages('mice')`

Note that this step needs to be performed only once in RStudio. It works the same way as __pip install in Python__.

Once installed, the package is listed in the RStudio Packages tab, and a tick mark next to its name shows that it has been loaded for the current session.

We have to load the package every time we are going to use it. It is similar to the __import in Python.__ It is done using the code below.

`library('mice')`

To use the inbuilt data we will follow the code written in the screenshot.

The orange lines are the code and the white lines show the output.

The above DataFrame has 4 variables with 25 observations: Age (age), BMI (bmi), Hypertensive (hyp) and Cholesterol (chl). These match the nhanes dataset that ships with the mice package. Below is the explanation of the variables:

__age__

Age group (1=20-39 years, 2=40-59 years, 3=60+ years)

__bmi__

Body mass index (kg/m²)

__hyp__

Hypertensive (1=Not Hypertensive,2=Hypertensive)

__chl__

Total serum cholesterol (mg/dL)

Below is the summary of the dataset and the code for the same.

The summary shows that bmi, hyp and chl have 9, 8 and 10 missing values respectively.

Next, we perform **basic EDA (Exploratory Data Analysis)** by checking the data types of the columns in the data.

Skimming through the data, we can tell that Hypertensive and Age are categorical columns, whereas the above screenshot shows that their data type is numerical.

When a data type needs to be changed, we use the lines of code below:

```
data$hyp = as.factor(data$hyp)
data$age = as.factor(data$age)
```

Thus, the data type is changed for both the columns.

It can be seen that **age** says Factor w/ 3 levels, which means it has three categories (1, 2, 3), and the **hyp** column says Factor w/ 2 levels, which means it has two categories (1, 2).

For **checking the pattern of null values** in the data, the md.pattern() function built into the mice package can be used.

The code generates both a numerical and a graphical representation. For a better understanding of these representations, let us go through a few points.

The **left axis** gives the number of rows that share each pattern, and it sums up to the total number of rows. 13 indicates that there are 13 rows with **no missing values.** 7 indicates that there are 7 rows with only the **age** value present. The rest of the numbers on the left-hand side, 3, 1 and 1, show the number of rows that are missing only **cholesterol**, only **bmi**, or both **hyp and bmi** respectively.

The **bottom axis** shows the **total number of null values** in each column, as shown in the summary of the data (Figure 3).

The **right axis** gives the number of missing values in each row of the pattern.

The **missing values** are **marked in pink** in the graphical representation. The value 27 indicates the **total number of missing values in the data.**
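The counts in this pattern can be cross-checked with a little arithmetic. The sketch below is in Python purely for illustration. It re-types the five patterns described above (the variable names are my own) and recomputes the row total, the per-column totals on the bottom axis, and the grand total.

```python
# Each entry: (number of rows, columns missing in those rows),
# copied from the pattern counts described in the text.
patterns = [
    (13, set()),                    # complete rows
    (3, {"chl"}),                   # only cholesterol missing
    (1, {"bmi"}),                   # only bmi missing
    (1, {"hyp", "bmi"}),            # hyp and bmi missing
    (7, {"hyp", "bmi", "chl"}),     # only age present
]

# Left axis: the row counts add up to the number of observations.
total_rows = sum(count for count, _ in patterns)

# Bottom axis: missing values per column.
missing_per_column = {
    col: sum(count for count, cols in patterns if col in cols)
    for col in ("age", "bmi", "hyp", "chl")
}

# Bottom-right corner: total number of missing values.
total_missing = sum(missing_per_column.values())

print(total_rows)          # 25
print(missing_per_column)  # {'age': 0, 'bmi': 9, 'hyp': 8, 'chl': 10}
print(total_missing)       # 27
```

The results agree with the summary of the data: 25 rows, and 9, 8 and 10 missing values for bmi, hyp and chl.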

Looking at the pattern of the missing data, we can understand that the values appear to be **missing completely at random**, and filling them with a single-value method like the mean, median or mode will not give us meaningful data.

Thus, we will use MICE to do the imputation.

__Imputing the data__

```
imp_data = mice(data, m = 5, method = c("", "pmm", "logreg", "pmm"), maxit = 20)
```

The above code says that mice() will create **five imputed versions** of the data, running **20 iterations** for each.

Let's try predictive mean matching (**pmm**) in R to do the imputation of the numeric columns. Since **hyp** is a categorical variable, logistic regression (**logreg**) is used for it. The empty string "" in the method is for the **age** column, as it has no values that need to be filled.

The output is stored in imp_data.

Each of the 5 imputations gives one candidate value for every missing cell, so there are several plausible values for each missing value. How do we know which one to use? We have to choose from the 5 rounds of imputations.

__Choosing the right data__

Let us check the summary of the data before MICE imputation.

The above is the summary of the data after the **age** and **hyp** columns are changed to categorical.

It is seen that the mean and median of **bmi** lie between 26.50 and 26.75. Also, the first and third quartiles of **bmi** are 22.65 and 28.93.

Let's have a look at the **bmi** imputed values (Figure 8.2).

Imputations 2, 4 and 5 have values above 30, which is greater than the 3rd quartile value (28.93). Thus, those imputations are removed from the race. Imputation 1 has more values close to the mean than imputation 3.

Likewise, for the **chl** column (Figure 8.3), it is seen that the other imputations have values like 284, which is way above the 3rd quartile value, and having them might skew the data.

Only round 1 of the imputations has most of its values close to the mean and median, with many lying within the interquartile range.

*Thus, we can come to the conclusion that round 1 of the imputations is the best fit for the data.*
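The manual comparison above can also be written down as a small scoring rule. The sketch below is in Python purely for illustration. It takes the bmi quartiles quoted in the text and scores each imputation round by the share of its values that fall inside the interquartile range. The candidate values are made up for the example (the real ones are in the figures), and the names `candidates` and `share_inside_iqr` are my own.

```python
# Quartiles of the observed bmi values, as quoted in the text.
Q1, Q3 = 22.65, 28.93

# HYPOTHETICAL imputed bmi values per imputation round, made up for illustration.
candidates = {
    1: [26.3, 27.2, 24.9, 27.4],
    2: [30.1, 22.0, 35.3, 27.5],
    3: [22.7, 28.7, 25.5, 20.4],
}

def share_inside_iqr(values, low=Q1, high=Q3):
    """Fraction of values lying between the first and third quartile."""
    inside = [v for v in values if low <= v <= high]
    return len(inside) / len(values)

# Score every round and pick the one with the largest share inside the IQR.
scores = {round_id: share_inside_iqr(vals) for round_id, vals in candidates.items()}
best_round = max(scores, key=scores.get)

print(scores)
print(best_round)
```

With these made-up numbers, round 1 wins because all of its values sit inside the interquartile range, which is the same reasoning used by eye above.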

To fill the data with values from the imputation we use the below code.

`final_data = complete(imp_data,1)`

The **final_data** has the imputed values.

__Checking the data fit__

We can plot the data to see if the imputed values are a good fit to the data. Using a strip plot, we can see how well the imputed values fit in with the observed values.

`stripplot(imp_data, pch = 20, cex = 1.2)`

*The blue dots are the observed values and the magenta dots are the imputed values.* We can see that the values are a good fit to the data for further processing.

Now that we have understood the working of MICE in R, let us learn the Python variation too.

__Understanding MICE in Python__

In Python, MICE can be run with a class called **IterativeImputer**, which lives in the **sklearn.impute** module of scikit-learn. It is still marked as experimental, so it has to be enabled explicitly by importing enable_iterative_imputer from sklearn.experimental before it can be imported.

This is the signature of the IterativeImputer.

```
IterativeImputer(
estimator=None,*,
missing_values=nan,
sample_posterior=False,
max_iter=10,
tol=0.001,
n_nearest_features=None,
initial_strategy='mean',
imputation_order='ascending',
skip_complete=False,
min_value=-inf,
max_value=inf,
verbose=0,
random_state=None,
add_indicator=False,)
```

Some of the most commonly used parameters of the IterativeImputer are:

**estimator** - The model used to predict the missing values of each column from the other columns. Eg - Linear Regression, Logistic Regression or XGBoost. If none is given, scikit-learn uses BayesianRidge.

**max_iter** - The maximum number of iterations the algorithm can go through. Same as maxit in R.

**tol** - Tolerance of the stopping condition. It is the amount of tolerance to the change in values between iterations. If the change in values between two iterations is smaller than the (scaled) tolerance, the iterations are stopped and the values obtained at that point are treated as the final values.

**n_nearest_features** - The number of other features used to estimate the missing values of each column. The nearest features are the ones most strongly correlated with that column.

**initial_strategy** - The simple method that is used to initially impute the data. It can be 'mean', 'median', 'most_frequent' (the mode) or 'constant'.

**verbose** - Controls the debug messages that are printed during processing. It can be 0, 1 or 2. The higher the number, the more messages.

**random_state** - A number used to seed the random number generator, so that the results are reproducible on every run.

To understand the rest of the parameters, we can either press shift+tab in a Jupyter notebook or use the link below:

__https://scikit-learn.org/stable/modules/generated/sklearn.impute.IterativeImputer.html__

Let's get the ball rolling.

__IterativeImputer Example__

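To begin with, let us import the required packages:

```
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
# IterativeImputer is still experimental, so it must be enabled explicitly first
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
```

We will first create a dataset with missing values.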

__Data Creation__

```
df = pd.DataFrame({
'age': [25,27,29,31,33,np.nan],
'experience': [np.nan, 3,5,7,9,11],
'salary': [50, np.nan, 110,140,170,200],
'purchased' : [0,1,1,0,1,0]
})
df
```
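Before imputing, it can help to confirm where the gaps are, much like the summary step in the R section. The sketch below rebuilds the same DataFrame and counts the missing values with standard pandas calls. The variable names `missing_per_column` and `rows_with_gaps` are my own.

```python
import numpy as np
import pandas as pd

# Same toy dataset as above.
df = pd.DataFrame({
    'age': [25, 27, 29, 31, 33, np.nan],
    'experience': [np.nan, 3, 5, 7, 9, 11],
    'salary': [50, np.nan, 110, 140, 170, 200],
    'purchased': [0, 1, 1, 0, 1, 0],
})

# Number of missing values in each column:
# one each in age, experience and salary, none in purchased.
missing_per_column = df.isna().sum()
print(missing_per_column)

# Rows that contain at least one missing value.
rows_with_gaps = df[df.isna().any(axis=1)]
print(rows_with_gaps)
```

Each of the three feature columns has exactly one gap, and each gap sits in a different row.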

All the columns are self-explanatory. The target variable here is **purchased**. Thus, the data is split into the columns to be processed and the target column.

```
# drop the target column; the columns= keyword works across pandas versions
X = df.drop(columns='purchased')
y = df['purchased']
```

We will be using the X dataset.

__Imputer implementation__

A Linear Regression model is used in the IterativeImputer to perform the imputation.

```
lr = LinearRegression()
imp = IterativeImputer(estimator=lr, verbose=2, max_iter=30, tol=1e-10, imputation_order='roman')
```

__Getting the final data__

When we call fit_transform,

`imp.fit_transform(X)`

the output gives us the imputed data as an array.
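Since fit_transform returns a plain NumPy array, the column names are lost. The sketch below puts the whole Python example together in one place and wraps the result back into a DataFrame. It assumes a scikit-learn version in which IterativeImputer still needs the experimental enable import. The wrapping step and the name `imputed` are my additions, not part of the steps above.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.linear_model import LinearRegression

# Feature columns of the toy dataset (the target 'purchased' is left out).
X = pd.DataFrame({
    'age': [25, 27, 29, 31, 33, np.nan],
    'experience': [np.nan, 3, 5, 7, 9, 11],
    'salary': [50, np.nan, 110, 140, 170, 200],
})

imp = IterativeImputer(
    estimator=LinearRegression(),
    max_iter=30,
    tol=1e-10,
    imputation_order='roman',  # visit the columns from left to right
)

# fit_transform returns an array, so restore the column names and index.
imputed = pd.DataFrame(imp.fit_transform(X), columns=X.columns, index=X.index)
print(imputed.round(2))
```

The observed values stay untouched and only the three missing cells are filled in.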

As a **first step,** the algorithm fills in the missing values with the mean of each column. In the next step, it removes the imputed value for the age column and keeps the rest of the imputed values. The algorithm then tries to predict the value for the age column from the other columns, and does the same for the remaining columns in turn. The largest difference between the values at the end of one round and the values before it (for the first round, the means from step 1) is the **change**.

The scaled tolerance is the given tolerance * the maximum absolute value observed in the data. In this case, the scaled tolerance is 1e-10 * 200 = 2e-08.

Similar steps repeat for every iteration. When the **change** drops below the **scaled tolerance**, the iterations are stopped.

The above output shows that the **change** is much smaller than the **scaled tolerance** in the 9th round. This stopping condition takes precedence over the maximum number of iterations, i.e. 30 in this case, so the algorithm stops early.

The output array shows that the missing values of age, experience and salary have been imputed with 35, 1 and 80 respectively.
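To make the steps described above concrete, here is a from-scratch NumPy sketch of the same loop: mean start, one least-squares fit per column, and the scaled-tolerance stopping rule. It is a simplified illustration under my own assumptions, not scikit-learn's actual implementation, and the function name `chained_linear_impute` is hypothetical.

```python
import numpy as np

def chained_linear_impute(X, max_iter=30, tol=1e-10):
    """Toy chained-equations imputer: mean start, then one OLS fit per column."""
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    # Scaled tolerance: tol times the largest observed absolute value.
    scaled_tol = tol * np.nanmax(np.abs(X))
    # Step 1: fill every gap with the mean of its column.
    filled = np.where(missing, np.nanmean(X, axis=0), X)
    for iteration in range(1, max_iter + 1):
        previous = filled.copy()
        # Visit the columns from left to right ('roman' order in the text).
        for col in range(X.shape[1]):
            rows = missing[:, col]
            if not rows.any():
                continue
            others = np.delete(filled, col, axis=1)
            design = np.column_stack([np.ones(len(others)), others])
            # Fit on the rows where this column was observed ...
            coef, *_ = np.linalg.lstsq(design[~rows], filled[~rows, col], rcond=None)
            # ... and overwrite only the cells that were originally missing.
            filled[rows, col] = design[rows] @ coef
        # Change: largest absolute difference from the previous round.
        change = np.max(np.abs(filled - previous))
        if change < scaled_tol:
            break
    return filled, iteration, change, scaled_tol

# Columns: age, experience, salary (same values as the toy dataset).
X = np.array([
    [25, np.nan, 50],
    [27, 3, np.nan],
    [29, 5, 110],
    [31, 7, 140],
    [33, 9, 170],
    [np.nan, 11, 200],
])

filled, n_iter, change, scaled_tol = chained_linear_impute(X)
print(np.round(filled, 2))
print(n_iter, change, scaled_tol)
```

Because the three columns in this toy dataset are exact linear functions of one another, the loop settles on the same kind of values the text reports for age, experience and salary. On real data the estimator choice and the visiting order matter much more.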

__Conclusion__

MICE has thus filled in the missing values in the data. I hope this blog was able to guide you through the journey of null value imputation through MICE and IterativeImputer in R and Python respectively.

Please feel free to add in your comments and feedback.

Keep learning and keep exploring!
