top of page

Explorative Data Analysis (EDA) of sepsis Dataset using Python

This dataset is used to train the Machine learning model to predict the Sepsis in Patients and also to analyze data in Tableau or power BI.. The performance of the ML-based prediction algorithm is highly dependent on the data set. Having a good data set is the greatest obstacle to producing an accurate prediction.

Sepsis is the life-threatening infection caused by body’s extreme reaction to infection harms the body’s tissues. Sepsis can result from any viral or bacterial infections in the body like pneumonia, UTI.

What is EDA?

Exploratory Data Analysis (EDA) is a method performed to understand the data. It is necessary and initial step in any data related project. EDA involves generating summary statistics for numerical data in the dataset and creating various graphical representations to understand the data better. It reveals the important information regarding the data like datatypes, outliers and patterns. This information is useful for the data Analysis and also to create better machine learning models.

understand the data

This step involves importing the libraries like pandas, numpy, seaborn and matplot lib and loading the data into dataframe.

The data consisted of a combination of hourly vital sign summaries, laboratory values, and patients’ demographics. Dataset contains 1552210 records with 43 features or fields.

describe() method shows basic statistical characteristics of each numerical feature (int64 and float64 types): number of non-missing values, mean, standard deviation, range, median, 0.25, 0.50, 0.75 quartiles and for categorical features, count, unique, top (most frequent value), and corresponding frequency. This gives us a broad idea of our dataset.


  • Vital Signs : Heart Rate, Temperature , Blood Pressure, etc.

  • Laboratory Values : Platelet Count, Glucose , Calcium etc.

  • Demographics : Age, Gender, ICULOS, HospitalAdmtime

  • SepsisLabel : 0 and 1

Data set has sepsis Label feature its value is either 0 or 1. The patients can be categorized under 3 conditions based on this feature.

Non sepsis Patients: NO sepsis on admission (Sepsis Label 0).

Sepsis after admission: No sepsis on admission, but later developed Sepsis in ICU (Sepsis Label 0 to 1).

Sepsis before admission: sepsis on admission (Sepsis Label 1).

Sepsis Patients are sum of the Patients Sepsis after admission and Sepsis before admission.

There are 40,336 number of patients in the dataset and in that 92.7 % patients are Sepsis patients, 7.3% patients are Non sepsis.

There are 55.9% Male patients and 44.1 % Female patients.

Cleaning the data:

Changing the datatype:

Data types of Patient_ID, Unit1, Unit2,Gender are changed to string.

Unnamed: 0 column is dropped.

Finding the missing values :

Lot of Numerical parameters in the dataset have missing values these can be imputed by front and back filling, filling with mean, filling with median, filling with mode, and mean-median filling. we can also drop the columns which have more percentage of null values by dropna() function.

Finding duplicate values:

We can use duplicate.sum() function to find the sum of duplicate values. It will show the number of duplicate values if they are present in the data. This data does not have duplicate values.

Finding the correlation:

We can find the pairwise correlation between the different columns of the data using the corr() method.

The resulting coefficient is a value between -1 and 1 inclusive, where:

  • 1: Total positive linear correlation

  • 0: No linear correlation, the two variables most likely do not affect each other

  • -1: Total negative linear correlation

Pearson Correlation is the default method of the function “corr”.

Now, we will create a heatmap using Seaborn to visualize the correlation between the different columns of our data:

  • Bilirubin_direct and Bilirubin_total are highly corelated and Bilirubin_direct is missing for about 96 % of patients. so,we can drop the Bilirubin_direct.

  • DBP(diastolic BP),MAP (mean arterial pressure), and SBP (systolic BP) are highly corelated. MAP can be calculated from DBP and SBP, thus, we can disregard DBP and SBP.

  • hemoglobin (Hgb) and hematocrit (Hct) values happened to be highly correlated. we can consider one feature, i.e hemoglobin (Hgb).

  • pH, HCO3, and PaCO2 are also highly correlated. we can consider any one from these.

Finding the outliers:

Outlier of any feature can be found by following code:

We can also use boxplot to visualize the outliers. A boxplot helps us in visualizing the data in terms of quartiles.

As we can observe from the above boxplot that the normal range of data lies within the block and the outliers are denoted by the small circles in the extreme end of the graph.

So to handle it we can either drop the outlier values or replace the outlier values using IQR(Interquartile Range Method).

IQR is calculated as the difference between the 25th and the 75th percentile of the data. The percentiles can be calculated by sorting the selecting values at specific indices. The IQR is used to identify outliers by defining limits on the sample values that are a factor k of the IQR. The common value for the factor k is the value 1.5.

pH has outlier. These can be removed or replace them to the upper and lower limit based on its value. similar way we can check outliers for other numerical columns.


EDA is crucial step it helps in better understanding of the problem statement and relationships between various features of the data. It involves a variety of graphical techniques to analyze the distribution of the data, the number of null values, and outliers.


55 views0 comments

Recent Posts

See All
bottom of page