Implementation of GDM Data with modelling

Gestational diabetes is a problem many women suffer from during pregnancy. This model is built to understand its preconditions, so that precautions can be taken at an early stage if symptoms are present from the very start.

To do this, GDM data is taken and exploratory data analysis (EDA) is performed.

In Python:

import pandas as pd
import numpy as np
df = pd.read_excel('../input/gdgdgdm/GDM.xlsx')


In the EDA, the data is analyzed to find features that can be used for detecting and analyzing gestational diabetes.

Looking at the data, we first have to figure out which columns were recorded before delivery and which were recorded after delivery.

The after-delivery data is dropped, as it contributes nothing to detecting gestational diabetes.

Next, we see that two types of test, GCT and OGTT, were performed. Glucose levels at 0 hr, 1 hr and 2 hrs are recorded for these tests, and gestational diabetes is indicated wherever any value is 7.8 or higher.

Glucose test values are not available for 87 of the 600 patients, as they were either transferred to another hospital or miscarried. These rows are deleted from the main data.

This is implemented by:

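# np.select returns the choice of the first condition that matches,
# so a row is labelled '0' only when no reading reached 7.8.
# Comparisons against missing (NaN) readings are False, so rows with
# no readings at all fall through to the default.
conditions = [
    (df['1h glucose'] >= 7.8) | (df['OGTT 0h value'] >= 7.8) | (df['OGTT 1h value'] >= 7.8) | (df['OGTT 2h value'] >= 7.8),
    (df['1h glucose'] < 7.8) | (df['OGTT 0h value'] < 7.8) | (df['OGTT 1h value'] < 7.8) | (df['OGTT 2h value'] < 7.8),
    (df['1h glucose'] == '') | (df['OGTT 0h value'] == '') | (df['OGTT 1h value'] == '') | (df['OGTT 2h value'] == ''),
]

# Labels are kept as strings; 'nan' marks rows that cannot be labelled.
choices = ['1', '0', 'nan']

df['GDM'] = np.select(conditions, choices, default='nan')
df.drop(df[df['GDM'] == 'nan'].index, inplace=True)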

Handling categorical features

The categorical features used are Smoking, Ethnicity, Previous GDM, Age>30, BMI>30, Screening method and Vit D. These are converted to numeric values using ordinal encoding.
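As a rough illustration of ordinal encoding, the sketch below maps each distinct label in a column to an integer code using pandas categoricals. The toy frame and its label values ("A", "B", "yes", "no") are assumptions made for this example and are not taken from the GDM file. The original encoding may have used a different tool or a different code order.

```python
import pandas as pd

# Toy data; the label values are illustrative assumptions, not the real GDM data.
toy = pd.DataFrame({
    "Ethnicity": ["A", "B", "A", "C"],
    "Previous GDM": ["no", "yes", "no", "no"],
})

encoded = toy.copy()
mappings = {}
for col in ["Ethnicity", "Previous GDM"]:
    cat = toy[col].astype("category")
    # Categories are sorted, so codes run 0, 1, 2, ... in label order.
    mappings[col] = dict(enumerate(cat.cat.categories))
    encoded[col] = cat.cat.codes

print(encoded)
print(mappings)  # code -> original label, kept so the encoding can be reversed
```

Keeping the `mappings` dictionary makes it possible to translate the codes back to labels when reading the plots later.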

Handling NULL Values

After this we check how many features have null values. We use the seaborn library for this:

import seaborn as sns

This graph gives an idea of the NULL values. The features whose NULL values need to be handled are 'V1 U creatinine ', 'V1 U protein ', 'V1 CRP', 'V1 ALT ', 'V1 Creatinine', 'V1 Platelet ', 'V1 Hb', 'WCC' and 'V1 HbA1c (mmol/mol)'.
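The same information can also be read off as numbers. The following is a minimal sketch on a made-up three-column frame (the values are invented for illustration) that counts missing entries per column and lists the columns that would need imputing.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the GDM frame; the values are invented for illustration.
toy = pd.DataFrame({
    "V1 CRP": [1.2, np.nan, 3.4, np.nan],
    "V1 Hb": [120.0, 118.0, np.nan, 131.0],
    "WCC": [7.1, 6.5, 8.0, 9.2],
})

null_mask = toy.isnull()        # True where a value is missing
null_counts = null_mask.sum()   # number of missing values per column
needs_imputing = null_counts[null_counts > 0].index.tolist()

print(null_counts)
print(needs_imputing)
# With seaborn installed, sns.heatmap(null_mask, cbar=False) draws this mask.
```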

The NULL values are handled here with the KNN and MICE algorithms, as the common simple methods of imputation won't be helpful in medical data.
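A minimal sketch of both approaches follows, assuming scikit-learn is used. The text does not say which library or settings were chosen, so `n_neighbors=2`, `max_iter=10` and the toy matrix are assumptions. scikit-learn's `IterativeImputer` is a MICE-style chained-equations imputer that returns a single imputed dataset rather than multiple imputations.

```python
import numpy as np
# IterativeImputer is experimental and must be enabled explicitly.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, KNNImputer

# Toy numeric matrix with gaps; the values are invented for illustration.
X = np.array([
    [1.0, 2.0, np.nan],
    [3.0, 4.0, 3.0],
    [np.nan, 6.0, 5.0],
    [8.0, 8.0, 7.0],
    [9.0, np.nan, 9.0],
])

# KNN: fill each gap from the rows closest on the observed features.
knn_filled = KNNImputer(n_neighbors=2).fit_transform(X)

# MICE-style: model each column from the others, iterating several rounds.
mice_filled = IterativeImputer(max_iter=10, random_state=0).fit_transform(X)

print(knn_filled)
print(mice_filled)
```

Both imputers leave the observed values untouched and only fill in the missing cells.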

The glucose levels at 0 hr, 1 hr and 2 hrs for GCT and OGTT will have NULL values, since each patient goes through one test or the other. These columns are used only to derive GDM, our target feature.

Rows with a NULL value for GDM are also dropped, as they cannot lead to any conclusion.

We then try to find the relation between each feature and the probability of having gestational diabetes.

We do this by plotting with seaborn.

First we plot the yes and no counts for gestational diabetes, our target.


The graph shows that out of 516 rows of data, 433 do not have gestational diabetes while 83 do.
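The class balance can also be checked numerically. The sketch below rebuilds a toy target column with the same 433/83 split reported above (the frame itself is a stand-in, not the real data) and computes the counts and shares.

```python
import pandas as pd

# Stand-in target column with the split reported in the text.
toy = pd.DataFrame({"GDM": ["0"] * 433 + ["1"] * 83})

counts = toy["GDM"].value_counts()               # absolute counts per class
share = toy["GDM"].value_counts(normalize=True)  # fraction of rows per class

print(counts)
print(share.round(3))
```

Roughly 16% of the rows are positive, so the target is imbalanced. This is worth keeping in mind when reading the count plots that follow.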

Drawing the same graph split by Age>30 gives:

sns.countplot(x='GDM',hue='Age >30 10',data=df,palette='RdBu_r')

As the dataset contains more women aged over 30, the graph does not show a clear relation between gestational diabetes and age. The number of women with gestational diabetes is higher in the over-30 group, but the same is true of women without gestational diabetes.
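When one group dominates the raw counts like this, comparing rates within each group can be easier to read than comparing bar heights. The following is a minimal sketch using `pd.crosstab` with row normalisation. The ten toy rows are invented, and only the column names follow the snippets in this post.

```python
import pandas as pd

# Invented rows; 1 = age over 30, 0 = age 30 or under (assumed coding).
toy = pd.DataFrame({
    "GDM": ["1", "0", "0", "0", "1", "0", "0", "0", "0", "0"],
    "Age >30 10": [1, 1, 1, 1, 1, 1, 0, 0, 0, 0],
})

# normalize="index" turns each row into fractions that sum to 1,
# giving the GDM rate within each age group.
rates = pd.crosstab(toy["Age >30 10"], toy["GDM"], normalize="index")

print(rates)
```

The same call works with 'Smoking 123' or 'Overweight 123' in place of the age column.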

Drawing the graph split by Smoking gives:

sns.countplot(x='GDM',hue='Smoking 123',data=df,palette='RdBu_r')

Here 3 means the woman has never smoked, while 1 and 2 cover women who currently smoke and women who quit after pregnancy.

Since women who have never smoked make up a larger share of those who did not develop gestational diabetes, we can say that non-smoking women are less prone to gestational diabetes than smoking ones.

Drawing the graph split by Overweight gives:

sns.countplot(x='GDM',hue='Overweight 123',data=df,palette='RdBu_r')

The graph shows that women who are not overweight are less likely to develop gestational diabetes.