
Handling outliers in data cleaning:

An outlier is a data point that differs significantly from the other observations. Outliers typically sit at the extremes of the data, near the minimum or maximum values. Any outliers present in a dataset need to be identified and handled.


Why is it important to find and handle outliers?

Finding and handling outliers is essential for ensuring the accuracy, reliability, and interpretability of data analysis results, as well as improving the performance of statistical models and decision-making processes.


Basic Definitions:

To find outliers in a dataset, and as a data analyst in general, understanding some basic definitions is essential. Data is the foundation of data analytics; it is simply the value assigned to a specific observation or measurement.


Qualitative Data - uses descriptive terms to measure or classify something of interest. If you connect a dataset in Tableau, the dimensions are broadly the qualitative data and the measures are the quantitative data.


Consider the default “SuperStore” dataset in Tableau: Customer ID, Customer Name, Product Category, Order Date, and Postal Code are good examples of qualitative data.


Though some of them, like Customer ID, are numerical, aggregating them would not make any sense.


Quantitative Data - uses numerical values to describe something of interest. In its raw form, making sense of the patterns in this data can be very difficult, because our brains are not very efficient at processing long lists of raw numbers. Hence, these values are aggregated. E.g.: Sales, Profit, Discount, Unit Price.


Mean - The most common measure of central tendency is the average (mean), which we calculate by adding all the values in the data set and then dividing the result by the number of observations.


Median - The median is the middle value in the data set, for which half the observations are higher and half are lower. When there is an even number of data points, the median is the average of the two center points.


Mode - The observation in the data set that appears most frequently. A data set can have more than one mode if several values are tied for the highest frequency.


Range - The difference between the highest value and the lowest value in the dataset.


Variance - A measure of how far the data points in the set lie from the mean of the data set.

Take the distance from every point to the mean and square those values. Then sum up all those squared values and divide by the number of observations (or by one less than that number when working with a sample); the result is the variance of the data set.


Standard Deviation - The standard deviation measures the spread of a data set. If a data set is spread out and has a wide range, it will have a large standard deviation. Mathematically speaking, the standard deviation is the square root of the variance.
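
As a rough illustration of the six definitions above, here is a minimal sketch using Python's built-in statistics module. The numbers in values are made up for this example and are not taken from the article's table. It uses the population formulas (divide by the number of observations); statistics.variance and statistics.stdev would give the sample versions (divide by one less), which is also the default in pandas.

```python
import statistics

# Hypothetical sample values, made up for illustration only.
values = [2, 4, 4, 5, 7, 9, 11]

mean = statistics.mean(values)            # sum of values / number of values
median = statistics.median(values)        # middle value of the sorted data
modes = statistics.multimode(values)      # every value tied for most frequent
value_range = max(values) - min(values)   # highest value minus lowest value
variance = statistics.pvariance(values)   # average squared distance from the mean
std_dev = statistics.pstdev(values)       # square root of the variance

print("mean:", mean)
print("median:", median)
print("mode(s):", modes)
print("range:", value_range)
print("variance:", round(variance, 2))
print("standard deviation:", round(std_dev, 2))
```

Because multimode returns a list, it also covers the case where more than one value is tied for the mode.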


To understand the definitions better, try applying them to the table below.


Finding Mean, Median, Mode, Range, Variance and Standard Deviation for a sample of data:


Table:1


There is another concept known as Quartile, which plays a major role in boxplots.


Quartile - If a collection of numbers is put in order and split into fourths, the quartiles are the boundary points. There are three quartiles, the first is the 25th percentile, the second is the 50th percentile (which is also the median) and the third is the 75th percentile. The difference between the first and the third quartile is called the interquartile range.
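
To make the quartile definition concrete, here is a small sketch with the standard library's statistics.quantiles. The values are hypothetical, and the "inclusive" method is just one of several conventions for computing quartiles, so other tools may report slightly different boundary points for the same data.

```python
import statistics

# Hypothetical, already sorted values used only for illustration.
values = [1, 3, 4, 6, 7, 8, 10, 12, 15]

# n=4 splits the data into fourths and returns the three boundary points.
q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")

# The interquartile range is the distance between the first and third quartile.
iqr = q3 - q1

print("Q1 (25th percentile):", q1)
print("Q2 (50th percentile, the median):", q2)
print("Q3 (75th percentile):", q3)
print("IQR:", iqr)
```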

Table:2


Now that we have understood the basic definitions, let's jump into the techniques for finding outliers.


Outliers:

Outliers lie outside the range of common values in the dataset; they are either very high or very low values.

There are many techniques to find outliers; here I am going to discuss 3 common ones:


  1. Histogram

  2. Box Plots

  3. Standard deviation


About the dataset:

I took an example dataset (Healthcare_Diabetes) from Kaggle, which is available at the link below.


This is a dataset for predicting diabetes based on some diagnostic measurements, from the NIDDK (National Institute of Diabetes and Digestive and Kidney Diseases).


I imported the dataset into a Jupyter notebook using Python; it has 768 observations for 9 variables, as shown below.

Predicting diabetes beyond a certain age may not be as useful as predicting it earlier in life for several reasons:


  1. Preventive Measures - Identifying diabetes at a younger age allows for the implementation of preventive measures such as lifestyle changes (diet, exercise) and early medical intervention (medications, monitoring) to delay or prevent the onset of diabetes or its complications.

  2. Healthier Aging: Addressing risk factors for diabetes earlier in life can promote healthier aging by reducing the likelihood of developing chronic conditions associated with diabetes, such as heart disease, stroke, and kidney disease.

  3. Improved Quality of Life: Early detection and management of diabetes can lead to better control of blood sugar levels and reduce the risk of diabetes-related complications, thereby improving the quality of life for individuals living with diabetes.

  4. Cost-Effectiveness: Detecting diabetes earlier may be more cost-effective in the long run by reducing healthcare costs associated with treating complications of uncontrolled diabetes.

  5. Public Health Impact: Identifying and addressing risk factors for diabetes in younger populations can have broader public health implications by reducing the overall burden of diabetes and its associated healthcare costs.


So, for this analysis, observations beyond a certain age are treated as the outliers.


Histogram:

Let me create a histogram for age in Python to show the outliers.

In this histogram, age 80 is a very obvious outlier. It needs to be handled in order to do a reasonable analysis.
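
The idea behind the histogram can be sketched without any plotting library: count how many ages fall into each bin and look for an isolated bar far from the rest. The ages below are made up for illustration (they are not the Kaggle data), and the 10-year bin width is an arbitrary choice.

```python
from collections import Counter

# Hypothetical ages, made up for illustration (not the Kaggle data).
ages = [21, 22, 24, 25, 27, 29, 31, 33, 35, 38, 41, 45, 52, 58, 81]

bin_width = 10
# Map each age to the lower edge of its bin, e.g. 27 -> 20.
bin_counts = Counter((age // bin_width) * bin_width for age in ages)

# Print one text bar per bin, including empty bins, so gaps are visible.
for lower in range(min(bin_counts), max(bin_counts) + bin_width, bin_width):
    count = bin_counts.get(lower, 0)
    print(f"{lower:>3}-{lower + bin_width - 1:<3} {'#' * count}")
```

The lone bar in the 80-89 bin, separated from the rest by empty bins, is what an obvious outlier looks like. In a notebook, a plotting call such as matplotlib's plt.hist would draw the same counts graphically.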


Box Plot:

Another way to find outliers is using a box plot. Before creating a box plot, it helps to recall the definition of quartiles given above. The width of the box is the interquartile range (IQR), which spans the middle 50% of the data. Any value more than 1.5 times the IQR (1.5*IQR) beyond either edge of the box is considered an outlier.


IQR = Q3 - Q1 (middle 50% of data)

Min = Q1 - 1.5*IQR (lowest value side)

Max = Q3 + 1.5*IQR (highest value side)

Outliers = any value outside the Min/Max range

The marked outliers need to be handled to do a reasonable analysis.
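
The 1.5*IQR rule that a box plot applies can be sketched in a few lines. The ages are again hypothetical, and the quartiles use the "inclusive" method of statistics.quantiles, which is an assumption; plotting libraries may compute quartiles slightly differently.

```python
import statistics

# Hypothetical ages, made up for illustration (not the Kaggle data).
ages = [21, 22, 24, 25, 27, 29, 31, 33, 35, 38, 41, 45, 52, 58, 81]

# First and third quartiles mark the two edges of the box.
q1, _, q3 = statistics.quantiles(ages, n=4, method="inclusive")
iqr = q3 - q1

# Fences sit 1.5 * IQR beyond each edge of the box.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Anything outside the fences is flagged as an outlier.
outliers = [age for age in ages if age < lower_fence or age > upper_fence]

print("IQR:", iqr)
print("fences:", lower_fence, upper_fence)
print("outliers:", outliers)
```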


Standard deviation:

The third method to find outliers is a mathematical calculation using the standard deviation. To understand it better, apply the definitions of variance and standard deviation to the scatter plot below.

The red dotted line in the above scatter plot is the mean value.

Values at least 3 standard deviations away from the mean are considered outliers. This applies mainly to data that are normally distributed. The threshold of 3 standard deviations can be changed to 2 or 4+ depending on the data.


Let's calculate the mean and standard deviation using Python code.

Here the mean is 33, and the SD is approximately 12. Most of the values lie between 33 - 12 = 21 and 33 + 12 = 45.


However, values more than 3 SDs away from the mean are considered outliers, which gives

max = 33 + (3*11.75) = 68.25

min = 33 - (3*11.75) = -2.25


The lower bound is negative, so no age can fall below it. Ages beyond about 68 are considered outliers, hence they need to be handled to do a reasonable analysis.
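
The cut-off arithmetic can be double-checked with a short sketch. The helper sigma_bounds and the list sample_ages are hypothetical names introduced for this example; only the mean of 33 and the SD of 11.75 come from the figures quoted above.

```python
def sigma_bounds(mean, sd, k=3):
    """Return the (lower, upper) cut-offs k standard deviations from the mean."""
    return mean - k * sd, mean + k * sd

# Mean and SD as quoted in the article for the age column.
lower, upper = sigma_bounds(33, 11.75)
print("lower bound:", lower)
print("upper bound:", upper)

# Hypothetical ages, made up to show how the bounds are applied.
sample_ages = [21, 25, 33, 45, 67, 69, 81]
flagged = [age for age in sample_ages if age < lower or age > upper]
print("flagged as outliers:", flagged)
```

Passing k=2 or k=4 to sigma_bounds shows how a stricter or looser threshold changes which ages get flagged.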


Conclusion:

While predicting diabetes beyond a certain age may not be as impactful as predicting it earlier, screening and risk assessment for diabetes should ideally be implemented throughout the lifespan to identify individuals at risk and intervene early to prevent or delay the onset of the disease and its complications.


References:

  1. Data science in Python: Data Prep & EDA - Maven Analytics, Alice Zhao

  2. The Complete Idiot's Guide to Statistics, by Robert A. Donnelly, Jr., Ph.D.
