Feature Engineering: Handling Missing Data with Python

We often feel the urge to fill up spaces which are empty, be it home, heart or data.

In real-world scenarios, we often observe that data is missing from datasets.


Before looking at techniques to handle missing data, let's see what kinds of problems it can present and what the various kinds of missing data are.

Missing data presents several problems:


1) Missing data reduces statistical power, which refers to the probability that a test will reject the null hypothesis when it is false.

2) Missing data can cause bias in the estimation of parameters.

3) It can reduce the representativeness of the samples.

Missing data reduces the power of a trial. Some amount of missing data is expected, and the target sample size is often increased to allow for it. However, increasing the sample size cannot eliminate the potential bias. Missing data should therefore be given more attention in the design and conduct of studies and in the analysis of the resulting data.

What are the different types of Missing Data?


1. Missing Completely at Random (MCAR): A variable is missing completely at random (MCAR) if the probability of being missing is the same for all observations. When data is MCAR, there is no relationship at all between the missingness and any other values, observed or missing, within the dataset. In other words, the missing data points are a random subset of the data; there is no systematic mechanism that makes some data more likely to be missing than others.

Data is rarely MCAR.

In the sketch below, we can see data going missing without any rule.
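Here the DataFrame and the 20% missing rate are made-up assumptions, purely for illustration: each salary has the same chance of being blanked out, independent of everything else in the dataset.

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({"age": rng.integers(20, 60, 10),
                   "salary": rng.integers(20000, 90000, 10).astype(float)})

# MCAR: every row has the same 20% chance of losing its salary,
# regardless of any value in the dataset
mcar_mask = rng.random(len(df)) < 0.2
df.loc[mcar_mask, "salary"] = np.nan
print(df)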



The statistical advantage of data that are MCAR is that the analysis remains unbiased. Power may be lost in the design, but the estimated parameters are not biased by the absence of the data.

2. Missing Not at Random (MNAR): These are systematic missing values. There is a definite relationship between the missingness and the data itself; in particular, the probability of a value being missing depends on the (unobserved) value.

The only way to obtain an unbiased estimate of the parameters in such a case is to model the missing data.

In the sketch below, we can see that only salaries less than or equal to 30000 are missing.
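The salaries here are made-up numbers, purely for illustration: the values go missing precisely because of what they are.

import pandas as pd

salary = pd.Series([25000., 48000., 30000., 72000., 29000., 55000.])

# MNAR: exactly the salaries <= 30000 are blanked out, so the
# missingness depends on the (unobserved) value itself
salary_mnar = salary.where(salary > 30000)
print(salary_mnar)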


3. Missing at Random (MAR): The missing data here is affected only by the complete (observed) variables and not by the characteristics of the missing data itself. In other words, whether a data point is missing is not related to the missing value itself, but it is related to some (or all) of the observed data.

MAR assumes that we can predict the missing value based on the other, observed data.

In the sketch below, we can see that the missing data belongs to people with age > 50.
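Again the data is made up for illustration: salary goes missing only where the fully observed age column exceeds 50.

import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [23, 41, 56, 62, 35, 58],
                   "salary": [31000., 48000., 52000., 67000., 40000., 59000.]})

# MAR: missingness in 'salary' is driven by the observed 'age',
# not by the salary values themselves
df.loc[df["age"] > 50, "salary"] = np.nan
print(df)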



Now let's see how we can handle missing values with various techniques.


The techniques for handling missing values are:

1. Mean/Median/Mode replacement

2. Random Sample Imputation

3. Capturing NaN values with a new feature

4. End of Distribution imputation

5. Arbitrary value imputation

6. Frequent categories imputation

Mean/Median/Mode Imputation

Mean/median imputation assumes that the data are missing completely at random (MCAR). We handle the missing values by replacing the NaN with the mean or median of the variable; for categorical variables, we replace the NaN with the mode (the most frequent value).

Assumption: The data is missing completely at random (MCAR).

In Python we can do it with the following code:


def median_rep(df, field, median):
    # create a new column in which the NaNs are replaced by the given median
    df[field + "_median"] = df[field].fillna(median)

or

from sklearn.impute import SimpleImputer  # Imputer was removed from sklearn.preprocessing

values = mydata.values
imputer = SimpleImputer(strategy='median')
values = imputer.fit_transform(values)
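For categorical variables, the same idea uses the mode, i.e. the most frequent category. A minimal sketch in the style of median_rep above (mode_rep is a hypothetical helper, not a library function):

def mode_rep(df, field):
    # mode() can return several values if there are ties; take the first
    most_frequent = df[field].mode()[0]
    df[field + "_mode"] = df[field].fillna(most_frequent)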

Advantages

1) Easy to implement

2) Fast way to obtain the complete dataset.

3) Works well with small numerical datasets.

Disadvantages

1) Changes or distorts the original variance (see the sketch after this list).

2) Distorts the correlations with the other variables.

3) Not very accurate.

4) Works only at the column level (each column is imputed independently of the others).
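To illustrate disadvantage 1, here is a minimal sketch with made-up numbers showing how median imputation shrinks the variance:

import numpy as np
import pandas as pd

s = pd.Series([10., 12., np.nan, 14., np.nan, 11., 40., np.nan])
imputed = s.fillna(s.median())

# var() skips the NaNs in the original series; filling them with the
# median pulls values toward the centre, so the variance shrinks
print(s.var(), imputed.var())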


Random Sample Imputation


Random sample imputation replaces missing data with a random sample extracted from the non-missing values of the variable. It works with both numerical and categorical variables. A list of variables can be indicated, or the imputer will automatically select all variables.

Assumption: The data is missing completely at random (MCAR).


def random_rep(df, field):
    df[field + "_random"] = df[field]
    # draw as many random non-missing values as there are NaNs
    random_value = df[field].dropna().sample(df[field].isnull().sum(), random_state=0)
    # pandas needs matching indexes in order to align the values
    random_value.index = df[df[field].isnull()].index
    df.loc[df[field].isnull(), field + "_random"] = random_value

or


import feature_engine.missing_data_imputers as mdi  # feature_engine.imputation in newer versions

imputer = mdi.RandomSampleImputer(random_state=[field1, field2],
                                  seed='observation',
                                  seeding_method='add')  # or 'multiply'
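Like other scikit-learn-style transformers, the imputer is then fitted on the training data and used to transform new data. A hypothetical usage sketch (X_train and X_test are assumed DataFrames):

imputer.fit(X_train)                    # memorise the pool of observed values
X_train_t = imputer.transform(X_train)
X_test_t = imputer.transform(X_test)    # impute unseen data consistently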