Feature Engineering: Handling Missing Data with Python
We often have an intention of filling up spaces which are empty, be it home, heart or data.
In real-world scenarios, we often find that data is missing from datasets.
Before looking at techniques to handle missing data, let's see what kinds of problems it can cause and what kinds of missing data exist.
Missing data presents several problems.
1) The missing data reduces statistical power, which refers to the probability that the test will reject the null hypothesis when it is false.
2) The missing data can cause bias in the estimation of parameters.
3) It can reduce the representativeness of the samples.
Missing data reduces the power of a trial. Some amount of missing data is expected, and the target sample size is usually increased to allow for it. However, a larger sample cannot eliminate the potential bias. More attention should be given to missing data in the design and conduct of studies and in the analysis of the resulting data.
What are the different types of Missing Data?
1. Missing Completely at Random (MCAR): A variable is missing completely at random (MCAR) if the probability of being missing is the same for all observations. When data is MCAR, there is no relationship between the missingness and any other values, observed or missing, within the dataset. In other words, the missing data points are a random subset of the data. No systematic mechanism makes some data more likely to be missing than others.
Data is rarely MCAR.
For example, values may be missing here and there across the dataset without following any rule.
The statistical advantage of data that are MCAR is that the analysis remains unbiased. Power may be lost in the design, but the estimated parameters are not biased by the absence of the data.
2. Missing Not at Random (MNAR): The values are missing systematically. There is a relationship between the probability that a value is missing and the values in the dataset, including the unobserved value itself.
The only way to obtain an unbiased estimate of the parameters in such a case is to model the missing data.
For example, in a salary dataset only salaries equal to or less than 30000 might be missing.
3. Missing at Random (MAR): The missingness here is affected only by the complete (observed) variables and not by the characteristics of the missing data itself. In other words, whether a data point is missing is not related to the missing value itself, but it is related to some (or all) of the observed data.
MAR assumes that we can predict the missing value from the other observed data.
For example, data may be missing only for people with age > 50.
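The three mechanisms can be simulated on a toy dataset to make the difference concrete. The sketch below is only an illustration: the column names `age` and `salary`, the sample size, the missingness rates and the thresholds are all assumptions chosen to mirror the examples above, not data from a real study.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 10_000

# Hypothetical toy data: salary loosely increases with age.
age = rng.integers(20, 70, size=n)
salary = 20_000 + 800 * (age - 20) + rng.normal(0, 5_000, size=n)
df = pd.DataFrame({"age": age, "salary": salary})

# MCAR: every salary has the same 20% chance of being missing.
mcar = df["salary"].mask(rng.random(n) < 0.2)

# MAR: missingness depends only on an observed variable (age > 50).
mar = df["salary"].mask((df["age"] > 50) & (rng.random(n) < 0.5))

# MNAR: missingness depends on the missing value itself (salary <= 30000).
mnar = df["salary"].mask(df["salary"] <= 30_000)

print("true mean :", round(df["salary"].mean()))
print("MCAR mean :", round(mcar.mean()))
print("MAR mean  :", round(mar.mean()))
print("MNAR mean :", round(mnar.mean()))
```

In this simulation the mean of the observed salaries stays close to the true mean under MCAR, while under MAR and MNAR the naive mean of the observed values drifts away from it.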
Now let's see how we can handle them with various techniques.
The techniques for handling missing values are:
1. Mean/Median/Mode replacement
2. Random Sample Imputation
3. Capturing NaN values with a new feature
4. End of Distribution imputation
5. Arbitrary imputation
6. Frequent categories imputation
Mean/Median/Mode replacement
We replace the NaN values with the mean, the median, or the mode (the most frequent value) of the variable.
Assumption: The data is missing completely at random (MCAR).
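In Python, we can do this with the following code:
def median_rep(df, field, median):
    # add a new column in which NaN values are replaced by the given median
    df[field + "_median"] = df[field].fillna(median)

# equivalent approach with scikit-learn
# (the old sklearn.preprocessing.Imputer was replaced by sklearn.impute.SimpleImputer)
from sklearn.impute import SimpleImputer
values = mydata.values
imputer = SimpleImputer(strategy='median')
transformed_values = imputer.fit_transform(values)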
Advantages:
1) Easy to implement.
2) Fast way to obtain a complete dataset.
3) Works well with small numerical datasets.
Disadvantages:
1) Distorts the original variance, because every imputed value sits at the same central point.
2) Impacts correlation with other variables.
3) Not very accurate.
4) Works only at the column level; relationships between features are ignored.
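The effect on variance is easy to check numerically. The following is a minimal sketch on synthetic data; the `age` column, its distribution and the 30% missingness rate are assumptions made up for the illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Hypothetical numeric column with roughly 30% of values removed at random.
age = pd.Series(rng.normal(40, 12, size=5_000), name="age")
age_missing = age.mask(rng.random(age.size) < 0.3)

# Median imputation: every gap receives the same central value.
median = age_missing.median()
age_imputed = age_missing.fillna(median)

print("variance before imputation:", round(age_missing.var(), 1))
print("variance after imputation :", round(age_imputed.var(), 1))
```

Because all imputed values are identical, the variance of the imputed column comes out smaller than the variance of the observed values.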
Random Sample Imputation
Random sample imputation replaces missing data with a random sample extracted from the observed values of the variable. It works with both numerical and categorical variables. With the feature_engine imputer, a list of variables can be indicated, or the imputer will automatically select all variables.
def random_rep(df, field):
    df[field + "_random"] = df[field]
    # random values drawn from the observed data to fill the NaN entries
    random_value = df[field].dropna().sample(df[field].isnull().sum(), random_state=0)
    # pandas needs matching indexes in order to assign the sampled values
    random_value.index = df[df[field].isnull()].index
    df.loc[df[field].isnull(), field + "_random"] = random_value

# note: newer feature_engine releases appear to expose the imputers under feature_engine.imputation
import feature_engine.missing_data_imputers as mdi
imputer = mdi.RandomSampleImputer(
    random_state=[field1, field2],
    seed='observation',
    seeding_method='add',  # or 'multiply'
)
· random_state (int, str or list, default=None) – The random_state can take an integer to set the seed when extracting the random samples. Alternatively, it can take a variable name or a list of variables, whose values will be used to determine the seed observation by observation.
· seed (str, default='general') – Indicates whether the seed should be set for each observation with missing values, or if one seed should be used to impute all variables in one go.
general: one seed will be used to impute the entire data frame. This is equivalent to setting the seed in pandas.sample(random_state).
observation: the seed will be set for each observation using the values of the variables indicated in random_state for that particular observation.
· seeding_method (str, default='add') – If more than one field is indicated to seed the random sampling per observation, you can choose to combine those values by addition or by multiplication. Can take the values 'add' or 'multiply'.
Advantages:
1) Easy to implement.
2) Unlike the other imputation techniques, random sample imputation preserves the original distribution of the variable.
3) There is less distortion in variance.
Disadvantages:
1) Randomness does not work in every situation; how useful it is depends on the particular data.
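To see how well the spread of a variable survives random sample imputation, one can compare the standard deviation before and after. This is a sketch under assumptions: the helper `random_sample_impute` is a hypothetical standalone variant written for this illustration (it returns a new Series and falls back to sampling with replacement when there are more gaps than observed values), and the `income` data is synthetic.

```python
import numpy as np
import pandas as pd

def random_sample_impute(series, random_state=0):
    """Fill NaN entries with values drawn at random from the observed ones."""
    missing = series.isnull()
    observed = series.dropna()
    n_missing = int(missing.sum())
    # Assumption: sample with replacement only when there are more gaps
    # than observed values, so the draw never fails.
    sampled = observed.sample(
        n=n_missing,
        replace=bool(n_missing > len(observed)),
        random_state=random_state,
    )
    # Align the sampled values with the positions of the missing entries.
    sampled.index = series[missing].index
    filled = series.copy()
    filled.loc[missing] = sampled
    return filled

rng = np.random.default_rng(1)
income = pd.Series(rng.normal(50_000, 8_000, size=2_000))
income_missing = income.mask(rng.random(income.size) < 0.3)
income_filled = random_sample_impute(income_missing)

print("std of observed values:", round(income_missing.std()))
print("std after imputation  :", round(income_filled.std()))
```

Since every imputed value is copied from an observed one, the two standard deviations should end up close to each other, in contrast to mean or median imputation.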
Capturing NaN values with a new feature
This strategy works well if the data is missing not at random (MNAR).
Capture the NaN values in a new feature with a key value of 0 or 1, and then replace the NaN values in the field using any strategy such as mean, median or mode.
This way we keep a key that lets us check later whether the NaN values had any effect on the distribution of the data.
Assumption: The data is missing not at random (MNAR).
import numpy as np

def capture_nan(df, field):
    # 1 where the value is missing, 0 otherwise
    df[field + '_NAN'] = np.where(df[field].isnull(), 1, 0)
Advantages:
1. Easy to implement.
2. Captures the importance of missing values.
Disadvantages:
1. Creates additional features. If many fields have to be tracked for missing values, this increases the dimensionality of the data (curse of dimensionality).
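A small worked example may help show how the indicator and the imputation fit together. The `salary` values below are made up for the illustration, and using the median as the fill strategy is just one possible choice.

```python
import numpy as np
import pandas as pd

# Hypothetical data: two salaries were not reported.
df = pd.DataFrame({"salary": [25_000.0, np.nan, 48_000.0, np.nan, 61_000.0]})

# 1 marks a row where salary was missing, 0 a row where it was observed.
df["salary_NAN"] = np.where(df["salary"].isnull(), 1, 0)

# Fill the gaps afterwards; the indicator keeps the missingness information.
df["salary"] = df["salary"].fillna(df["salary"].median())

print(df)
```

The indicator has to be created before the gaps are filled, otherwise the information about which rows were missing is lost.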
End of Distribution imputation
If observation suggests that the values are not missing at random, then capturing that information is important. In this scenario, one would want to replace missing data with values that are at the tails of the distribution of the variable.
Assumption: The data is missing not at random (MNAR).
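def impute_nan(df, field, extreme=None):
    # default to the right tail of the distribution: mean + 3 * std
    if extreme is None:
        extreme = df[field].mean() + 3 * df[field].std()
    df[field + "_end_distribution"] = df[field].fillna(extreme)

# set up the imputer
# note: newer feature_engine releases appear to expose the imputers under feature_engine.imputation
import feature_engine.missing_data_imputers as mdi
tail_imputer = mdi.EndTailImputer(distribution='gaussian', tail='right', fold=3, variables=[field])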
right tail: mean + 3*std
left tail: mean - 3*std
Advantages:
1. Easy and quick to implement.
2. It captures the importance of missing values (if one suspects the missing data is valuable).
Disadvantages:
1. It may distort the variable and mask its predictive power if the missingness is not important.
2. It may hide true outliers if a large share of the data is missing, or create unintended outliers.
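The outlier concern can be made visible by counting how many values end up at the tail after imputation. This is a rough sketch on synthetic data; the `age` distribution and the 10% missingness rate are assumptions, and the right-tail rule (mean + 3*std) is the one stated above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)

# Hypothetical column with about 10% of values missing.
age = pd.Series(rng.normal(35, 10, size=1_000))
age_missing = age.mask(rng.random(age.size) < 0.1)

# Right-tail replacement value computed from the observed data only.
right_tail = age_missing.mean() + 3 * age_missing.std()
age_imputed = age_missing.fillna(right_tail)

# Count how many values now sit at or beyond the right tail.
n_at_tail = int((age_imputed >= right_tail).sum())
print("right-tail value:", round(right_tail, 2))
print("values at or beyond the tail:", n_at_tail)
```

Every imputed row lands exactly on the tail value, so a variable that had very few extreme values before can suddenly show a large cluster of them.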
Arbitrary value imputation
It is defined as replacing all occurrences of missing values within a variable by an arbitrary value. Ideally the value should be different from the median/mean/mode, and not within the normal range of the variable.
Assumption: The data is missing not at random (MNAR).
def arb_rep(df, field, arbVal):
    # add a new column in which NaN values are replaced by the arbitrary value
    df[field + "_arbVal"] = df[field].fillna(arbVal)

import numpy as np
from sklearn.impute import SimpleImputer
# create the imputer, with fill value 99 as the arbitrary value
imputer = SimpleImputer(missing_values=np.nan, strategy='constant', fill_value=99)
Advantages:
Implementation is easy.
It’s a fast way to obtain complete datasets.
It captures the importance of a value being “missing”, if there is one.
Disadvantages:
Distortion of the original variable distribution and variance.
Distortion of the covariance with the remaining dataset variables.
If the arbitrary value is at the end of the distribution, it may mask or create outliers.
We need to be careful not to choose an arbitrary value too similar to the mean or median (or any other typical value of the variable distribution).
The higher the percentage of NA, the higher the distortions.
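The last point can be illustrated by imputing the same variable at several missingness rates. The sketch below uses synthetic data; the fill value 99, the distribution of `values` and the three rates are assumptions picked for the demonstration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
values = pd.Series(rng.normal(50, 5, size=5_000))
ARBITRARY_VALUE = 99  # assumption: well outside the normal range of the data

stds = {}
for frac in (0.05, 0.20, 0.50):
    # Remove roughly `frac` of the values, then fill them with the constant.
    with_gaps = values.mask(rng.random(values.size) < frac)
    filled = with_gaps.fillna(ARBITRARY_VALUE)
    stds[frac] = filled.std()
    print(f"{frac:.0%} missing -> std after imputation: {stds[frac]:.2f}")

print(f"original std: {values.std():.2f}")
```

With these settings the standard deviation after imputation grows with each step in the missingness rate, moving further and further from that of the original variable.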