During data collection, missing data, or missing values, occur when no value is stored for a variable in an observation. They occur in almost all research, even in well-designed and controlled studies. Missing data can reduce the statistical power of a study and can produce biased estimates and a skewed data set, which lead to invalid conclusions. There are many reasons for missing data in a clinical data set; some of them are listed below.
(i) Patient refusal to respond to specific questions (e.g., income, work status)
(ii) Loss of patient to follow-up
(iii) Investigator or mechanical error
(iv) Physicians not ordering certain investigations for some patients (e.g., cholesterol test, vitamin tests)
(v) Past data might get corrupted due to improper maintenance
(vi) Observations are not recorded for certain fields due to some reasons
Broadly, missing values are classified into three major categories:
(i) Missing completely at random (MCAR)
(ii) Missing at random (MAR)
(iii) Missing not at random (MNAR)
Missing data is a common occurrence in clinical research. It occurs when the values of the variables of interest are not measured or recorded for all subjects in the sample. The most common practices for handling missing data are complete-case analysis (subjects with missing data are excluded) and mean-value imputation (missing values are replaced with the mean value of that variable). However, in many settings, these approaches can lead to biased estimates of statistics.
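To make the two common practices concrete, here is a minimal sketch on a small made-up table. The values and the name `toy` are hypothetical and are not taken from the sepsis dataset.

```python
import numpy as np
import pandas as pd

# Toy data with hypothetical values (not from the sepsis dataset).
toy = pd.DataFrame({
    "HR":   [80.0, np.nan, 90.0, 100.0],
    "Temp": [36.5, 37.0, np.nan, 38.0],
})

# Complete-case analysis: drop every row that has any missing value.
complete_cases = toy.dropna()

# Mean-value imputation: replace each missing value with its column mean.
mean_imputed = toy.fillna(toy.mean())

print(complete_cases)
print(mean_imputed)
```

In this toy table, complete-case analysis throws away half of the rows, while mean-value imputation keeps every row but fills the gaps with one global value that ignores which subject or time point the gap belongs to.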
In this article, a different approach is taken to impute variables that have multiple time-based records for each individual admitted to hospital. For this, a dataset on sepsis (also available on Kaggle) has been used, which has hourly records of the vitals and laboratory tests of the patients admitted to the hospital.
First of all, let us briefly describe the dataset. It contains the records of 40336 individuals and 43 variables, which include vitals readings recorded hourly, laboratory test results and patient demographics. There are multiple hourly records for each patient for vital variables such as heart rate, body temperature, pulse rate, blood pressure, O2 saturation and respiration, and the missing percentage of the vitals is less than 16%, excluding temperature. On the other hand, laboratory variables such as glucose, white blood cell count, pH, calcium, chloride and platelets have very few records, and their null percentage is more than 90%, excluding glucose level. The reason for the high null percentage in the laboratory values is that blood samples are not taken every hour and the results take hours to arrive. As a result, there are hardly more than 3 records per 24 hours for the laboratory variables, which leads to a high percentage of missing data. The missing percentages of all the variables are shown below.
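The missing percentage of each column can be computed along the following lines. This is only a sketch on a made-up table (the name `toy` and its values are hypothetical); on the real data the same expression would be applied to the full DataFrame.

```python
import numpy as np
import pandas as pd

# Toy data with hypothetical values: one vital and one laboratory variable.
toy = pd.DataFrame({
    "HR":      [80.0, 82.0, np.nan, 85.0],
    "Glucose": [np.nan, np.nan, 110.0, np.nan],
})

# isna() marks missing cells; the column mean of that boolean mask is the
# fraction missing, so multiplying by 100 gives a percentage.
missing_pct = toy.isna().mean().mul(100).round(2).sort_values(ascending=False)
print(missing_pct)
```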
Sepsis is the body’s extreme response to an infection. It happens when an infection you already have triggers a chain reaction throughout your body. Without timely treatment, sepsis can rapidly lead to tissue damage, organ failure, and even death. Healthcare professionals diagnose sepsis using a number of initial physical findings, such as fever, abnormal blood pressure, increased heart rate, and difficulty in breathing.
That is why we concentrate here only on the vitals of each patient. In the data, only a few vitals records are missing over a patient's whole stay in the hospital. The vitals and laboratory tests of each individual are independent of those of other patients and depend purely on that patient's condition. This must be taken into account when handling missing data in clinical research where records are taken hourly or at short intervals. For that reason, the nulls are imputed patient-wise, using the interpolate method to fill them. Here the vitals records of a random patient have been taken, where we can observe that only a few vital records are missing.
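One way to check how many vitals readings are missing for each patient is sketched below. The table is made up for illustration; only the column names `Patient_ID`, `HR` and `Resp` follow the article's code, and the values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical hourly records for two patients.
toy = pd.DataFrame({
    "Patient_ID": [1, 1, 1, 2, 2, 2],
    "HR":   [80.0, np.nan, 84.0, 95.0, 96.0, np.nan],
    "Resp": [18.0, 19.0, 20.0, np.nan, 22.0, 23.0],
})

# Count the missing readings per patient and per vital.
missing_per_patient = (
    toy.drop(columns="Patient_ID").isna().groupby(toy["Patient_ID"]).sum()
)
print(missing_per_patient)

# Look at the records of a single patient.
one_patient = toy[toy["Patient_ID"] == 1]
print(one_patient)
```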
Among the vitals, mean arterial pressure (MAP), systolic blood pressure (SBP) and diastolic blood pressure (DBP) are related to each other by the formula
MAP = DBP + (SBP - DBP)/3
together with the rough approximation
DBP ≈ 2 SBP/3.
So, with the help of these formulas, we can impute a few of the missing entries in MAP and DBP:
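import pandas as pd

df = pd.read_csv("../input/sepsis-classification/Dataset.csv")
# Fill missing DBP from SBP using DBP = 2*SBP/3
df['DBP'] = df['DBP'].fillna((2 * df['SBP'] / 3).round(2))
# Fill missing MAP using (SBP + 2*DBP)/3, which equals DBP + (SBP - DBP)/3
df['MAP'] = df['MAP'].fillna(((df['SBP'] + 2 * df['DBP']) / 3).round(2))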
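As a quick check of these two formulas, the following sketch applies them to a small made-up table (the name `toy` and its values are hypothetical). It assigns the result of `fillna` back to the column, which behaves the same way across pandas versions.

```python
import numpy as np
import pandas as pd

# Hypothetical blood pressure readings with gaps in DBP and MAP.
toy = pd.DataFrame({
    "SBP": [120.0, 120.0, 120.0],
    "DBP": [80.0, 80.0, np.nan],
    "MAP": [93.33, np.nan, np.nan],
})

# Step 1: fill missing DBP from SBP using the approximation DBP = 2*SBP/3.
toy["DBP"] = toy["DBP"].fillna((2 * toy["SBP"] / 3).round(2))

# Step 2: fill missing MAP from SBP and the now more complete DBP.
# (SBP + 2*DBP)/3 is algebraically the same as DBP + (SBP - DBP)/3.
toy["MAP"] = toy["MAP"].fillna(((toy["SBP"] + 2 * toy["DBP"]) / 3).round(2))

print(toy)
```

The order matters: DBP is filled first so that the MAP step can use it. Rows in which SBP itself is missing stay missing after both steps.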
Since we are concentrating on imputing the vitals of each individual, the leading null records are filled using the backward fill method, the trailing null records are filled using the forward fill method, and the void entries in between are imputed using the interpolate method. The code for that is:
# Interpolate each vital within each patient's own records.
# transform keeps the original row index, so the result aligns with df.
vitals = ['HR', 'O2Sat', 'Temp', 'SBP', 'MAP', 'DBP', 'Resp']
for col in vitals:
    df[col] = df.groupby('Patient_ID')[col].transform(
        lambda group: group.interpolate(method='linear', limit_direction='both')
    )
In the above code, setting limit_direction to 'both' works as backward fill for the leading rows, forward fill for the trailing rows, and interpolation for the rows in between. With this, every variable that has at least one record for a patient is filled for that patient. After this imputation, a few rows are still left in which the variables contain null values. These are the records of patients who have not a single entry for a specific variable. Here we can do a complete-case analysis, which means removing all the rows with missing entries.
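The behaviour described above can be seen on a small made-up example. The table and its values are hypothetical; only the column names follow the article's code. The sketch uses `transform` so that the result keeps the original row index.

```python
import numpy as np
import pandas as pd

# Patient 1 has gaps at the start, in the middle and at the end.
# Patient 2 has no HR reading at all.
toy = pd.DataFrame({
    "Patient_ID": [1, 1, 1, 1, 1, 2, 2],
    "HR": [np.nan, 80.0, np.nan, 90.0, np.nan, np.nan, np.nan],
})

# Interpolate within each patient's own records only.
toy["HR"] = toy.groupby("Patient_ID")["HR"].transform(
    lambda s: s.interpolate(method="linear", limit_direction="both")
)
print(toy)
# Patient 1 becomes 80, 80, 85, 90, 90; patient 2 stays entirely NaN.

# Complete-case step: drop the rows that could not be imputed.
cleaned = toy.dropna(subset=["HR"])
print(cleaned)
```

Because the interpolation runs inside each group, the readings of patient 1 never leak into patient 2, whose rows are removed by the final `dropna`.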
Conclusion: In clinical datasets that contain multiple time-based records per patient taken at short intervals, we can use the interpolate method of imputation instead of the mean or median approach. This approach is also useful in datasets that have a categorical grouping variable whose groups are entirely different from, or independent of, one another. In these cases, interpolation within each group is a more effective and less biased method of imputation than filling with a global mean or median.