top of page
Writer's pictureSoumi Sen

A Guide to Descriptive Statistics Analysis using PostgreSQL


Welcome to our journey into the world of descriptive analysis – where numbers tell stories! 📊 In this blog, we'll unravel the magic behind simple yet powerful statistical tools that help us make sense of data. From understanding averages to exploring how data spreads its wings, we're here to make descriptive analysis easy and exciting! PostgreSQL provides several advanced statistical functions for more sophisticated analysis. Let's dive in and uncover various descriptive statistics techniques and how to compute them in the PostgreSQL database.


What is Descriptive Statistics?

Descriptive statistics is to summarize and describe the main features of a dataset. Descriptive statistics provide a way to simplify and communicate large amounts of data in a meaningful and easily understandable manner.


Types of Descriptive Statistics:

1.    Measures of central tendency

2.    Measures of dispersion

3.    Distribution

4.    Statistical Dependence

 

 

1.   Measures of Central Tendency:

 

Measures of central tendency describe a variable with a single value in the middle or center of its distribution. The three main methods used to find the measure of central tendency are mean, median and mode.These statistics describe the center or average of a dataset.


   - Mean (Average): The sum of all values divided by the number of values.

      In PostgreSQL, the AVG() function calculates the mean. The ROUND() function round off the result from the AVG().

A select statement to calculate the mean BMI for patient in Covid dataset:



A select statement to calculate the mean BMI for patient in Covid dataset group by SEX:



   - Median: 

The median is the middle value of a variable when data is sorted in either ascending or descending order.

In PostgreSQL, the PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY column_name) clause computes the median.

A select statement to calculate the median BMI for patient in Covid dataset:



A select statement to calculate the median BMI for patient in Covid dataset group by SEX:



 

   - Mode:

 The mode is the value that appears most frequently in a data set(variable).

In PostgreSQL, the MODE() WITHIN GROUP (ORDER BY column_name) clause computes the mode.

A select statement to calculate the mode BMI for patient in Covid dataset:

 


A select statement to calculate the median BMI for patient in Covid dataset group by SEX:



2. Measures of Dispersion (Variability):

   - These statistics describe the extent to which numerical data spread from the middle value. It interprets the variability of data. The most frequently used measures of dispersion are Range, Variance, Standard Deviation and Interquartile Range.

 

   - Range: The difference between the maximum and minimum values.

In PostgreSQL, the MAX() function computes the maximum value, and the MIN() function the minimum value in a column. The difference between the two values is the range.

 

A select statement to calculate the BMI range for patient in Covid dataset:



A select statement to calculate the BMI range for patient in Covid dataset group by SEX:



- Variance: A measure of how spread out the values are from the mean.

In PostgreSQL, the VARIANCE() function computes the variance of a variable. Also, there are two more functions, VAR_POP() is used for population data while VAR_SAMP() for sample data.

A select statement to calculate the BMI Variance for patient in Covid dataset:




A select statement to calculate the BMI Variance for patient in Covid dataset group by SEX:



- Standard Deviation: The square root of the variance. It provides an indication of how spread out the values are from the mean/average of the dataset. A higher standard deviation indicates greater variability, while a lower standard deviation indicates that the values are more tightly clustered around the mean.

In PostgreSQL, the STDDEV() function computes the standard deviation of a variable. Also, there are two more functions to compute standard deviation in PostgreSQL. STDDEV_POP() calculates the population standard deviation and STDDEV_SAMP() for the sample standard deviation.



A select statement to calculate the BMI Standard Deviation for patient in Covid dataset:



A select statement to calculate the BMI Standard Deviation for patient in Covid dataset group by SEX:



- Interquartile Range (IQR): The range between the first quartile (Q1) and the third quartile (Q3).It is a measure of statistical dispersion that describes the range containing the middle 50% of a dataset.

It is particularly useful in situations where extreme values might disproportionately influence other measures of spread, such as the range or standard deviation. IQR can be used to identify potential outliers in a dataset. Values beyond a certain multiple of the IQR from Q1 or Q3 are considered outliers.

In PostgreSQL, the PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY column_name) clause computes the third quartile and the PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY column_name) clause the first quartile.

A select statement to calculate the BMI Interquartile range for patient in Covid dataset:



A select statement to calculate the BMI Interquartile range for patient in Covid dataset group by SEX:



3. Frequency Distributions:

   - Organizing and summarizing data into a table, graph, or chart, showing the frequency of each value or range of values.

 

These descriptive statistics provide insights into the characteristics of a dataset and help in understanding the shape, center, and spread of the data. They are fundamental for making sense of data in various fields, such as research, business, and decision-making.

A select statement to compute the frequency of the values of BMI for patients in COVID dataset:




A select statement to compute the CUMMULATIVE frequency of the values of BMI for patients in COVID dataset:



4.   Statistical Dependence:

Statistical dependence refers to the relationship or association between two or more variables in a dataset. When two variables are statistically dependent, the value of one variable provides information about the likely value of another. In other words, the occurrence or value of one variable is related to the occurrence or value of another variable.

It has a score of from -1 to 1. A score of zero (0) between two variables indicates that there is no relationship between them. A score of -1 indicates a perfect inverse relationship between the variables, and a score of 1 indicates a perfect linear relationship.


The CORR() function computes the pearson correlation coefficient in PostgreSQL.

 

A select statement to compute the correlation between BMI and SMOKER for the COVID dataset:



CONCLUSION:

 

In this blog, we understood how various descriptive statistics helped to understand the valuable insight hidden in the dataset. Here in this blog, I explained each type of descriptive statistics with the help of BMI variable in COVID dataset.

We've navigated through measures of central tendency, exploring the mean, median, and mode, which provided a glimpse into the center of our data's distribution. Delving into measures of dispersion, such as the standard deviation and interquartile range, we uncovered the extent of variability and spread within our dataset.


Thank you for reading!

 


266 views

Recent Posts

See All
bottom of page