Pandas is a one-stop shop for data analysis tasks. It is a powerful data manipulation library in Python, and it is my go-to library when I am given a dataset to analyze. I love it for its versatility. With Pandas, one can read data from a variety of sources, prepare the data, clean it, transform it, analyze and visualize it, all within a single framework.

In this blog, we will use the Pandas library in Python to go through the various steps involved in a data analysis process, end to end.

Defining the Objective.

The first step of a data analysis process is **defining the objective of the project or study**. This gives the right direction for our analysis.

*The objective of this study is to find the critical factors contributing to Gestational Diabetes.*

We will be analyzing the Gestational Diabetes Mellitus dataset from **Kaggle**. Kindly note that this is a synthetic dataset.

The dataset contains two files:

*Gestational_diabetes_data.csv*

*Glucose_lowering_therapies.xlsx*

*Once the goal is defined, we can start collecting the data.*

Data Collection.

If you wish to follow along, you may download this dataset on your local computer. If you choose to work locally you will need a Python environment to be able to code. I have used the Anaconda distribution of Python and will be working on a Jupyter notebook locally. If you prefer not to go this route, you may also simply create a Jupyter notebook on Kaggle and import this dataset there and start coding.

*First step is to install Pandas.*

*Next, import Pandas into Jupyter notebook.*

*Loading the dataset into Pandas Dataframes.*

**Dataframes** are two-dimensional data structures that let us organize data in a tabular format, similar to a table in SQL or a spreadsheet. We can load the dataset files into **Dataframes** using the **reader** functions of Pandas, such as **pandas.read_csv()**.

If we print the type, we can see that it is a Dataframe.

We can print the contents of the Dataframe by typing its name in a cell of the Jupyter Notebook and executing it.
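Here is a minimal sketch of the loading step. Since the real CSV is not bundled with this post, the sketch reads a tiny in-memory stand-in; the rows are made up, and the column names are assumptions taken from the columns mentioned later in this blog.

```python
import io

import pandas as pd

# Hypothetical stand-in for Gestational_diabetes_data.csv so the sketch runs
# without the real file. The rows below are invented for illustration.
sample_csv = io.StringIO(
    "PatientID,Age,Glucose,Outcome\n"
    "3,31,150.0,1\n"
    "1,27,98.0,0\n"
    "2,35,172.0,1\n"
)

# With the downloaded file this would be:
# diab_df = pd.read_csv("Gestational_diabetes_data.csv")
diab_df = pd.read_csv(sample_csv)

print(type(diab_df))  # the class of the loaded object
print(diab_df)        # in Jupyter, a bare `diab_df` on the last line of a cell renders the table
```

Passing a file path instead of the `StringIO` object is the only change needed when working with the actual download.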

Similarly, the **read_excel()** reader function is used to read the Excel file **glucose_lowering_therapies.xlsx** into the DataFrame **diab_therapy_df**.

Data exploration.

Now that the Dataset is loaded, we can start exploring it to understand its structure and contents. We will be using various methods and attributes of Pandas to do so.

**head()** to look at the top rows of a Dataframe.

**tail()** to look at bottom rows of a Dataframe.

Let’s look at all the **columns** in the Dataframe.

**shape** attribute to look at the number of rows and columns.

**info()** method in Pandas to get the summary of the Dataframe. It lists each column's name, non-null count and data type, along with the total number of rows and columns.

Let’s call the **info()** method on the **diab_therapy_df** Dataframe.

The **describe()** method is used to get descriptive statistics of a Dataframe.
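The exploration calls above can be tried together in one cell. This is only a sketch on a small made-up Dataframe; the column names and values are assumptions, not the actual Kaggle data.

```python
import pandas as pd

# Made-up rows standing in for diab_df (assumed column names).
diab_df = pd.DataFrame({
    "PatientID": [1, 2, 3, 4, 5, 6],
    "Age": [27, 35, 31, 29, 40, 33],
    "Glucose": [98.0, 172.0, 150.0, None, 181.0, 110.0],
    "Outcome": [0, 1, 1, 0, 1, 0],
})

print(diab_df.head(3))     # top 3 rows (default is 5)
print(diab_df.tail(3))     # bottom 3 rows
print(diab_df.columns)     # column labels
print(diab_df.shape)       # (number of rows, number of columns)
diab_df.info()             # per-column non-null counts and dtypes
print(diab_df.describe())  # count, mean, std, min, quartiles, max for numeric columns
```

Note that `shape` and `columns` are attributes, so they are used without parentheses, while the others are method calls.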

Sorting data.

Sorting data helps identify patterns in the dataset and improves readability. Let us sort the Dataframe **diab_df** by *PatientID*.

Data cleaning.

Data cleaning is a crucial step in any analysis project. It is the process of ensuring accuracy and reliability in the data. The goal of data cleaning is to remove inconsistencies and errors in the dataset before we move on to the next step, the analysis.

*Duplicates in data.*

**DataFrame.duplicated()** returns a Boolean Series that marks the duplicate rows of the DataFrame. Let’s check for duplicates in both the **Dataframes**.

The **drop_duplicates()** method is used to get rid of duplicate rows.
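A small sketch of the duplicate check and removal follows. The Dataframe here is made up, with one row deliberately repeated, and the column names are assumptions.

```python
import pandas as pd

# Made-up data with one repeated row (index 2 repeats index 1).
diab_df = pd.DataFrame({
    "PatientID": [1, 2, 2, 3],
    "Glucose": [98.0, 172.0, 172.0, 150.0],
    "Outcome": [0, 1, 1, 1],
})

dup_mask = diab_df.duplicated()  # True for each row that repeats an earlier row
print(dup_mask.sum())            # number of duplicate rows
print(diab_df[dup_mask])         # the duplicate rows themselves

# Keep the first occurrence of each row and renumber the index.
diab_df = diab_df.drop_duplicates().reset_index(drop=True)
```

By default both methods compare all columns and keep the first occurrence; the `subset` and `keep` parameters change that behaviour.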

*Missing data.*

The **isna()** method is used to detect missing values. The Glucose column of the **diab_df** Dataframe has 10 missing values, whereas **diab_therapy_df** does not have any.

Let us now look at observations where **Glucose** is null.

There are various methods to handle missing values, like **dropna()**, **fillna()**.

However, we will be handling the missing values of **Glucose** on the basis of the **Outcome** column. We will impute the missing values of **Glucose** where **Outcome is 1** with the mean of **Glucose** where **Outcome is 1**, and similarly for **Glucose** where **Outcome is 0**. Note that **Outcome = 1** implies **Gestational Diabetes Mellitus positive** and **Outcome = 0** implies **Gestational Diabetes Mellitus negative.**
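One possible way to do this group-wise imputation is with `groupby().transform("mean")` followed by `fillna()`. This is a sketch on made-up values and not necessarily the exact code used for the screenshots; the column names are assumptions.

```python
import numpy as np
import pandas as pd

# Made-up data with one missing Glucose value per Outcome group.
diab_df = pd.DataFrame({
    "PatientID": [1, 2, 3, 4, 5, 6],
    "Glucose": [100.0, np.nan, 90.0, 170.0, np.nan, 180.0],
    "Outcome": [0, 0, 0, 1, 1, 1],
})

print(diab_df.isna().sum())                # missing values per column
print(diab_df[diab_df["Glucose"].isna()])  # rows where Glucose is null

# Mean Glucose of each row's own Outcome group (NaN values are skipped),
# returned as a Series aligned with the original rows.
group_mean = diab_df.groupby("Outcome")["Glucose"].transform("mean")

# Fill only the missing entries with their group's mean.
diab_df["Glucose"] = diab_df["Glucose"].fillna(group_mean)
```

Using `transform` keeps the result the same length as the Dataframe, so the fill values line up row by row without any manual filtering per outcome.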

Data manipulation and transformations.

Many times we need to compute new columns from the existing ones for our analysis. Let us calculate the **BMI** from the ‘**Weight**’ and ‘**Height**’ columns.
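As a sketch, BMI is weight divided by the square of height. The units of the dataset's columns are not stated here, so this example assumes **Weight** is in kilograms and **Height** is in metres; if the heights are in centimetres, divide them by 100 first.

```python
import pandas as pd

# Made-up rows. Units are an assumption: kilograms and metres.
diab_df = pd.DataFrame({
    "PatientID": [1, 2],
    "Weight": [64.0, 81.0],  # assumed kilograms
    "Height": [1.60, 1.80],  # assumed metres
})

# BMI = weight (kg) / height (m) squared, computed for all rows at once.
diab_df["BMI"] = (diab_df["Weight"] / diab_df["Height"] ** 2).round(2)

print(diab_df)
```

The arithmetic is vectorized, so no explicit loop over the rows is needed.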

During the processing stage of data analysis, we might need to transform the existing columns to change their datatypes.

We can convert the column **Date** from **object** to **datetime** format using the **pd.to_datetime()** function of Pandas.

Converting the **datatype** of **Date** from **Object** to **Datetime64** format.
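A minimal sketch of the conversion is shown below. The date strings are made up and assumed to be in ISO `YYYY-MM-DD` form; for other layouts, the `format` parameter of `pd.to_datetime()` can be supplied.

```python
import pandas as pd

# Made-up rows with dates stored as plain strings.
diab_df = pd.DataFrame({
    "PatientID": [1, 2, 3],
    "Date": ["2023-01-15", "2023-02-20", "2023-03-05"],
})

print(diab_df["Date"].dtype)  # string-like dtype before the conversion

diab_df["Date"] = pd.to_datetime(diab_df["Date"])

print(diab_df["Date"].dtype)     # a datetime64 dtype after the conversion
print(diab_df["Date"].dt.month)  # date parts are now available through .dt
```

Once the column is a datetime type, filtering by date ranges and extracting the year, month or weekday become straightforward.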

Data Analysis; Merging, Grouping and aggregates.

Let us find the **distribution of GDM** (*Gestational Diabetes Mellitus*) positive and negative cases in our data using the **value_counts()** method of Pandas.

**value_counts()** is a handy method to have in our Pandas toolkit. It gives the frequency of each distinct value in a column of a Dataframe.

The above result shows that there are **83 patients** with a GDM **Outcome** of **1** (**GDM positive**) and **22 patients** with a GDM **Outcome** of **0** (**GDM negative**).

Let us look at the percentages.

*79.05% of the patients tested positive for GDM.*
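The counts and percentages can be reproduced with a sketch like the one below. It rebuilds only an `Outcome` column from the 83 and 22 counts quoted above, so it is a reconstruction for illustration rather than the real dataset.

```python
import pandas as pd

# Rebuild just the Outcome column from the counts quoted in the text.
diab_df = pd.DataFrame({"Outcome": [1] * 83 + [0] * 22})

# Absolute frequency of each outcome.
counts = diab_df["Outcome"].value_counts()

# normalize=True returns proportions; multiply by 100 for percentages.
percentages = (diab_df["Outcome"].value_counts(normalize=True) * 100).round(2)

print(counts)
print(percentages)
```

The `normalize=True` flag saves us from dividing by the row count by hand.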

Now that we have looked at the GDM distribution, it is time to dive deep into our analysis and look at the factors that might be contributing to GDM.

We group by the ‘**Outcome**’ of **Gestational Diabetes Mellitus** and take the **mean** and **median** of the variables **BMI**, **Glucose**, **BloodPressure**, **Age**, **SkinThickness** and **DiabetesPedigreeFunction**.

From the above grouping and aggregation of the variables of the **diab_df** dataframe, we can see that the means and medians of **BMI**, **Glucose**, **BloodPressure**, **Age**, **SkinThickness** and **DiabetesPedigreeFunction** are all relatively higher where the GDM outcome is 1.

Here is another way of achieving this.
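The grouping and aggregation can be sketched in two equivalent ways. The first uses `groupby().agg()`. The second uses `pivot_table()`, which is an assumption on my part since the alternative is not named in the text. The data is made up and uses only three of the six variables for brevity.

```python
import pandas as pd

# Made-up rows: three GDM-negative and three GDM-positive patients.
diab_df = pd.DataFrame({
    "Outcome": [0, 0, 0, 1, 1, 1],
    "BMI": [22.0, 24.0, 26.0, 28.0, 30.0, 35.0],
    "Glucose": [90.0, 100.0, 110.0, 150.0, 170.0, 190.0],
    "Age": [24, 27, 30, 31, 35, 36],
})
factors = ["BMI", "Glucose", "Age"]

# Way 1: group by Outcome, then apply mean and median to each factor.
# Result columns are (factor, statistic) pairs.
by_outcome = diab_df.groupby("Outcome")[factors].agg(["mean", "median"])
print(by_outcome)

# Way 2 (assumed alternative): the same summary through pivot_table.
# Result columns are (statistic, factor) pairs.
alt = diab_df.pivot_table(index="Outcome", values=factors, aggfunc=["mean", "median"])
print(alt)
```

Both results have one row per outcome; they differ only in the order of the two column levels.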

Let us dig deeper and look at the correlation of various variables with **GDM** outcome.

In the above analysis we can see that **Glucose**, **Age** and **BMI** have a **stronger positive correlation** with **GDM outcome**. **BloodPressure** has a **weaker correlation** with **GDM outcome**.
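A sketch of such a correlation check with `DataFrame.corr()` follows. The rows are made up so that Glucose tracks the outcome closely and BloodPressure barely does; the numbers are not those of the real dataset.

```python
import pandas as pd

# Made-up rows for illustration only.
diab_df = pd.DataFrame({
    "Outcome": [0, 0, 0, 1, 1, 1],
    "Glucose": [90.0, 100.0, 110.0, 150.0, 170.0, 190.0],
    "BloodPressure": [70.0, 82.0, 76.0, 78.0, 74.0, 80.0],
})

# Pearson correlation matrix of the numeric columns, then keep only the
# Outcome column and drop Outcome's correlation with itself.
corr_with_outcome = (
    diab_df[["Outcome", "Glucose", "BloodPressure"]]
    .corr()["Outcome"]
    .drop("Outcome")
    .sort_values(ascending=False)
)

print(corr_with_outcome)
```

Sorting the result puts the variables most strongly associated with the outcome at the top, which makes the comparison easy to read.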

A point to note here is that in this blog we are analyzing a synthetic dataset. In real life, results may vary depending on the data.

**Merging** in **Pandas** is like **joins** in **SQL**. Merging **dataframes** enables comprehensive analysis.

For merging in Pandas we use the function **pd.merge()**. We specify how we want to merge and on what key *column(s)* **to merge the** *dataframes*.

Here we are **merging** *diab_df* (the original gestational diabetes dataset) and *diab_therapy_df* (the glucose-lowering therapies dataset) based **on the common column ‘*PatientId*’**, and we will be doing a **left merge**.

This merge will return all the rows from the **left dataframe**, that is **diab_df**, and the matching rows from *diab_therapy_df*, **and nulls where there is no match.** That is, if a patient in *diab_df* does not have a corresponding entry in *diab_therapy_df*, the resulting merged DataFrame will contain **NaN** (null) values for the columns from *diab_therapy_df*.
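Here is a minimal sketch of the left merge on two tiny made-up Dataframes. The key is written as `PatientID` here and the `Insulin` column with its Yes/No encoding is hypothetical, so adjust both to the actual headers in the files.

```python
import pandas as pd

# Left table: three patients (made-up values).
diab_df = pd.DataFrame({
    "PatientID": [1, 2, 3],
    "Glucose": [98.0, 172.0, 150.0],
})

# Right table: therapy records exist for patients 1 and 2 only.
diab_therapy_df = pd.DataFrame({
    "PatientID": [1, 2],
    "Insulin": ["No", "Yes"],  # hypothetical column name and encoding
})

# how="left" keeps every row of diab_df; on= names the shared key column.
merged_df = pd.merge(diab_df, diab_therapy_df, how="left", on="PatientID")

print(merged_df)  # patient 3 has NaN in the Insulin column
```

Switching `how` to `"inner"` would instead drop patient 3, since that patient has no match in the therapies table.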

Now that the dataframes are merged, let us answer some questions.

*What is the distribution of the patients that received these therapies?*

We see that 90 patients received all three therapies.

*What were the average glucose levels of the patients for various combinations of the therapies received?*

We see that the average glucose was 175.92 for patients who received all three of these glucose-lowering therapies, and 131.92 for patients who did not receive any of these therapies.
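Both questions can be answered by grouping on the therapy columns. The sketch below uses a made-up merged Dataframe; the therapy column names and the Yes/No encoding are assumptions, and the numbers will not match the figures quoted above.

```python
import pandas as pd

# Made-up merged data. Therapy column names and encoding are hypothetical.
merged_df = pd.DataFrame({
    "PatientID": [1, 2, 3, 4, 5],
    "Glucose": [170.0, 180.0, 130.0, 134.0, 150.0],
    "Insulin": ["Yes", "Yes", "No", "No", "Yes"],
    "Metformin": ["Yes", "Yes", "No", "No", "No"],
    "NutritionalCounseling": ["Yes", "Yes", "No", "No", "Yes"],
})
therapy_cols = ["Insulin", "Metformin", "NutritionalCounseling"]

# Number of patients for each combination of therapies.
therapy_counts = merged_df.groupby(therapy_cols).size()
print(therapy_counts)

# Average glucose for each combination of therapies.
avg_glucose = merged_df.groupby(therapy_cols)["Glucose"].mean()
print(avg_glucose)
```

Each result is indexed by the combination of therapy values, so a single combination can be looked up with a tuple such as `("Yes", "Yes", "Yes")`.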

Data Visualization.

We can plot various graphs off of a DataFrame in Pandas using DataFrame.plot().

*Bar Plot.*

Let us visualize the patient count as per GDM outcome using a Bar graph.

We specify the type of graph through the kind parameter (kind='bar').
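A sketch of the bar graph, assuming matplotlib is installed since Pandas plotting relies on it. The `Outcome` column is rebuilt from the 83 and 22 counts quoted earlier, and the title and axis labels are my own choices.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; not needed inside Jupyter

import pandas as pd

# Outcome column rebuilt from the counts quoted in the text.
diab_df = pd.DataFrame({"Outcome": [1] * 83 + [0] * 22})

# Count patients per outcome, then draw one bar per outcome.
ax = diab_df["Outcome"].value_counts().plot(kind="bar", title="Patients by GDM outcome")
ax.set_xlabel("Outcome")
ax.set_ylabel("Number of patients")
```

`plot()` returns a matplotlib Axes object, which is what lets us set the labels afterwards.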

*Scatter Plot.*

Let us visualize the relationship between **BMI** and **Glucose** through a scatter plot. We pass **kind='scatter'**, **x='BMI'** and **y='Glucose'** as the parameters of the **plot()** method.
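A sketch of the scatter plot on a few made-up BMI and Glucose values, again assuming matplotlib is available:

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; not needed inside Jupyter

import pandas as pd

# Made-up values for illustration only.
diab_df = pd.DataFrame({
    "BMI": [22.0, 24.5, 27.0, 30.0, 33.5],
    "Glucose": [92.0, 105.0, 140.0, 165.0, 188.0],
})

# One point per patient: BMI on the x-axis, Glucose on the y-axis.
ax = diab_df.plot(kind="scatter", x="BMI", y="Glucose", title="BMI vs Glucose")
```

For scatter plots, `x` and `y` must be column names of the Dataframe, and Pandas uses them as the axis labels.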

*Pie Chart.*

Let us visualize the **demographics** data. We do a **value_counts()** on **Ethnicity** for **GDM** positive patients and then plot the result as a **pie chart**.
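A sketch of the pie chart follows. The ethnicity labels "A", "B" and "C" are placeholders, since the actual categories in the dataset are not listed here, and matplotlib is assumed to be installed.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; not needed inside Jupyter

import pandas as pd

# Made-up rows. "A", "B" and "C" are placeholder ethnicity labels.
diab_df = pd.DataFrame({
    "Outcome": [1, 1, 1, 0, 1, 0],
    "Ethnicity": ["A", "B", "A", "B", "C", "A"],
})

# Keep GDM positive patients only, then count them per ethnicity.
gdm_positive = diab_df[diab_df["Outcome"] == 1]
ethnicity_counts = gdm_positive["Ethnicity"].value_counts()

# autopct prints each slice's share of the total as a percentage.
ax = ethnicity_counts.plot(kind="pie", autopct="%1.1f%%", title="GDM positive patients by ethnicity")
ax.set_ylabel("")  # hide the default y-axis label
```

Filtering before counting matters here: counting first would include the GDM negative patients in the slices.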

*Hist Plot.*

Let us look at the age-wise distribution of GDM positive patients.
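A sketch of such a histogram with **kind='hist'** is shown below. The ages are made up, the choice of 5 bins is arbitrary, and matplotlib is assumed to be installed.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; not needed inside Jupyter

import pandas as pd

# Made-up rows for illustration only.
diab_df = pd.DataFrame({
    "Outcome": [1, 1, 1, 1, 1, 0, 0],
    "Age": [24, 28, 31, 35, 39, 26, 30],
})

# Select the Age of GDM positive patients and bin it into 5 intervals.
positive_age = diab_df.loc[diab_df["Outcome"] == 1, "Age"]
ax = positive_age.plot(kind="hist", bins=5, title="Age distribution of GDM positive patients")
ax.set_xlabel("Age")
```

Changing `bins` controls how coarse or fine the age groups are.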

*Box Plot.*

*Key Takeaways.*

*79.05%* of patients were **GDM positive** in this dataset.

**90 patients out of 105**, that is **85.71%**, took **all three** types of glucose-lowering therapies, viz. *Insulin*, *Metformin* and *Nutritional Counseling*.

The arithmetic mean and median of **BMI**, **Glucose**, **BloodPressure**, **Age**, **SkinThickness** and **DiabetesPedigreeFunction** are all relatively higher for patients where the GDM outcome is 1.

**Glucose**, **Age** and **BMI** have a **stronger positive correlation** with **GDM outcome**.

Though we have gained some valuable insights, this is just the beginning. Feel free to explore more, perform additional analysis and statistical tests to gain deeper insights into the factors influencing Gestational Diabetes Mellitus.

With this, I would love to conclude that Pandas has indeed proved to be a versatile tool for analyzing this Gestational Diabetes dataset end to end.

*Thank You !*
