
# Simpson's Paradox: a weird statistical behavior found in a dataset of Multiple Sclerosis patients

Simpson's Paradox is an unusual statistical behavior in which a trend that appears when we analyze the whole population is reversed when we divide the population into subgroups.

image source: https://unsplash.com/photos/vR6bNYTVlpo

Comorbidities and Symptoms of Covid-19 in Multiple Sclerosis Patients:

The dataset contains patients with Multiple Sclerosis, and we were asked to study the effects of Covid-19 on them in a data analysis hackathon at NumpyNinja.

These patients had different comorbidities, listed in the dataset as Cardiovascular, Kidney disease, Liver disease, Diabetes, Hypertension, Immunodeficiency, Lung disease, Malignancy and Neurological-Neuromuscular.

The symptoms recorded for patients suspected of having Covid-19 are self-isolation, chills, dry cough, fatigue, fever, loss of smell and taste, nasal congestion, pain, pneumonia, shortness of breath and sore throat.

The comorbidities were grouped under 4 main categories:

1. Cardiometabolic Diseases (which affect the heart and metabolism): heart disease, hypertension, diabetes and obesity.

2. Immunodeficiency (which affects the immune system): immunodeficiency and malignancy.

3. Neurological/Neuromuscular (which affects a person's nervous system).

4. Organ System conditions (which affect a person's organs): kidney disease, lung disease and liver disease.
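The grouping above can be written down as a simple lookup table. The following is a minimal Python sketch: the names `CATEGORY_BY_COMORBIDITY` and `categorize`, and the lowercase labels used as keys, are assumptions made for illustration and are not column names from the hackathon dataset.

```python
# Hypothetical lookup table: comorbidity label -> category from the list above.
CATEGORY_BY_COMORBIDITY = {
    "heart disease": "Cardiometabolic Diseases",
    "hypertension": "Cardiometabolic Diseases",
    "diabetes": "Cardiometabolic Diseases",
    "obesity": "Cardiometabolic Diseases",
    "immunodeficiency": "Immunodeficiency",
    "malignancy": "Immunodeficiency",
    "neurological/neuromuscular": "Neurological/Neuromuscular",
    "kidney disease": "Organ System conditions",
    "lung disease": "Organ System conditions",
    "liver disease": "Organ System conditions",
}


def categorize(comorbidities):
    """Return the sorted list of categories covering a patient's comorbidities."""
    categories = set()
    for name in comorbidities:
        key = name.strip().lower()
        # Labels missing from the table are skipped rather than guessed.
        if key in CATEGORY_BY_COMORBIDITY:
            categories.add(CATEGORY_BY_COMORBIDITY[key])
    return sorted(categories)


print(categorize(["Diabetes", "Lung disease"]))
```

Under this scheme a patient with several comorbidities can fall into more than one category, so the subgroup counts can overlap.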

Here is the bar graph, which shows the total number of patients split into recovered and not recovered.

This graph clearly shows that the total number of patients who recovered from Covid-19 is greater than the number who did not recover.

Now, let's see what we get if we divide this population into subgroups based on the categories listed above.

After plotting these graphs, we clearly see that within the subgroups the number of patients who did not recover is higher than the number who recovered from Covid-19. This is exactly the opposite of what we see when we look at the total recovered patients.

This behavior, in which a trend reverses when we divide the total population into subgroups, is known as Simpson's paradox.

This happens because a third variable, which is not shown in the graph, affects the two variables that are plotted. Such a variable is often called a lurking or confounding variable.
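To make the mechanism concrete, here is a small Python sketch with hypothetical counts (they are not taken from the hackathon dataset). The groups `A` and `B` and the age bands `younger` and `older` are made-up names for illustration. Group A has the higher recovery rate within each age band, yet the lower rate once the age bands are merged, because most of group A sits in the age band that recovers less often.

```python
# Hypothetical counts, NOT from the hackathon dataset:
# (group, age band) -> (recovered, total patients)
counts = {
    ("A", "younger"): (9, 10),
    ("A", "older"): (30, 100),
    ("B", "younger"): (80, 100),
    ("B", "older"): (2, 10),
}


def rate(recovered, total):
    """Recovery rate as a fraction between 0 and 1."""
    return recovered / total


def pooled_rate(group):
    """Recovery rate for one group with its age bands merged."""
    recovered = sum(r for (g, _age), (r, _t) in counts.items() if g == group)
    total = sum(t for (g, _age), (_r, t) in counts.items() if g == group)
    return recovered / total


# Within each age band, group A does better than group B.
for age in ("younger", "older"):
    a = rate(*counts[("A", age)])
    b = rate(*counts[("B", age)])
    print(f"{age}: A={a:.0%}  B={b:.0%}")

# With the age bands merged, the comparison flips.
print(f"pooled: A={pooled_rate('A'):.1%}  B={pooled_rate('B'):.1%}")
```

The pooled rate of each group is a weighted average of its age-band rates, and the weights (how many patients fall in each band) differ between the groups. That difference in weights is what lets the pooled comparison point the other way.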

Let's bring the age demographics into the above graphs.

If we plot the bar graph of the total population by recovered/not recovered patients, with age as the third variable, the results are:

The age distribution of recovered and not recovered patients in the given dataset is: 124 recovered and 81 not recovered for ages 18-50, and 32 recovered and 12 not recovered for ages above 50.
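The counts quoted above can be turned into recovery rates with a few lines of Python. This is a minimal sketch: the names `age_counts` and `recovery_rate` are assumptions for illustration, and only the four counts come from the text.

```python
# Counts quoted in the text: age band -> (recovered, not recovered)
age_counts = {
    "18-50": (124, 81),
    "above 50": (32, 12),
}


def recovery_rate(recovered, not_recovered):
    """Share of patients who recovered, as a fraction between 0 and 1."""
    return recovered / (recovered + not_recovered)


for band, (rec, not_rec) in age_counts.items():
    total = rec + not_rec
    print(f"{band}: {rec}/{total} recovered ({recovery_rate(rec, not_rec):.1%})")

# Totals across both age bands.
total_recovered = sum(rec for rec, _ in age_counts.values())
total_not_recovered = sum(not_rec for _, not_rec in age_counts.values())
overall = recovery_rate(total_recovered, total_not_recovered)
print(f"overall: {total_recovered}/{total_recovered + total_not_recovered} recovered ({overall:.1%})")
```

These rates use age alone and do not account for comorbidities. The two age bands are also very different in size (205 patients versus 44), so the younger band carries most of the weight in the overall figure.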

Considering age and comorbidities together, we find that patients aged 51 and above have a lower recovery rate than patients aged 50 and below. This becomes evident when we analyze recovery with respect to age. The dataset also contains many more patients aged 50 and below, so the younger group dominates the overall totals.

With this, we conclude that we should not analyze a dataset by considering only one or two of the factors that might affect the outcome. We should look at the overall picture, including variables that are not shown in the graphs, before drawing any conclusions.
