# Descriptive Statistics for the Total Beginner

*If you are completely new to the world of descriptive statistics, yet eager to understand it, you’ve come to the right place. This post is the brief, basic primer you need.*

## What is Statistics?

Let’s begin by understanding what statistics is.

Statistics is defined as the practice or science of collecting and analyzing numerical data in large quantities, especially for the purpose of inferring proportions in a whole from those in the representative sample.

Simply put, though, statistics is the science of learning from data.

**It can be broadly classified into two major areas:**

- Descriptive Statistics
- Inferential Statistics

In this post, I’ll focus on descriptive statistics, and will come to inferential statistics in a future blog post.

## What is Descriptive Statistics?

Descriptive statistics is a way to describe and summarize data in a simple, easy-to-understand manner.

For instance, assume you have some data and need to tell someone about it without handing over all of it. Of course, whatever you tell them has to be representative of that data. This is where descriptive statistics comes in handy.

**Descriptive statistics broadly deals with two things:**

- Measures of Central Tendency
- Measures of Variability

**Measures of Central Tendency**

A measure of central tendency is a single value that describes a set of data by identifying the central position of that set of data.

For example, say you have a small dataset of numbers: 5, 3, 1, 5, 2

And you need to describe these numbers to someone without giving them all of the numbers, using just one number.

If it were me, I would want to come up with a number that is indicative of all the numbers in the set, in other words a number that captures the central tendency of the data.

The best-known measure of central tendency, and one you might already be familiar with, is the average, or arithmetic mean. There are others too, such as the median and the mode.

Depending on the situation, it might be more appropriate to use one over the other, which, by the way, is a topic for a whole separate post. For now, let’s focus on understanding what they mean (no pun intended 😄).

#### Mean

The mean, or arithmetic mean, also known as the average, is the most popular measure of central tendency. It’s simply the sum of the values in the dataset divided by the number of values.

For instance, take the same small dataset used earlier: 5, 3, 1, 5, 2

The arithmetic mean would be (5 + 3 + 1 + 5 + 2) / 5 = 16 / 5 = 3.2
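Here is a minimal Python sketch of the same calculation. The variable names (`data`, `manual_mean`, `library_mean`) are just ones I picked for illustration.

```python
from statistics import mean

data = [5, 3, 1, 5, 2]

# Sum of the values divided by the number of values
manual_mean = sum(data) / len(data)

# The standard library's statistics.mean does the same calculation
library_mean = mean(data)

print(manual_mean)   # 3.2
print(library_mean)  # 3.2
```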

#### Median

The median is the middle value of a dataset once its values are arranged in order of magnitude.

So to find the median of the example dataset, we first need to arrange it in order of magnitude.

Original dataset: 5, 3, 1, 5, 2

Re-ordered dataset: 1, 2, 3, 5, 5

The median, or middle value, of this dataset is 3.

Of course, this dataset has an odd number of values, so the middle value is clear. That is not the case when a dataset has an even number of values. Then the median is the arithmetic mean of the middle two values of the ordered dataset.

For instance, let’s add another number to our dataset to give it an even number of values, and calculate the median again.

New dataset (with an even number of values): 5, 3, 1, 5, 2, 6

Re-ordered dataset: 1, 2, 3, 5, 5, 6

The median of this dataset is the arithmetic mean of the middle two values: (3 + 5) / 2 = 4
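The odd and even cases can be sketched in Python as follows. `manual_median` is a hypothetical helper name of my own, written out only to mirror the steps described above.

```python
def manual_median(values):
    # Arrange the values in order of magnitude first
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd number of values: take the single middle value
        return ordered[mid]
    # Even number of values: average the two middle values
    return (ordered[mid - 1] + ordered[mid]) / 2

odd_data = [5, 3, 1, 5, 2]
even_data = [5, 3, 1, 5, 2, 6]

print(manual_median(odd_data))   # 3
print(manual_median(even_data))  # 4.0
```

In practice you would probably reach for `statistics.median` from Python's standard library, which handles both cases the same way.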

#### Mode

The final measure of central tendency is the mode, and it is probably the simplest of all. It’s the value that appears most frequently in the dataset.

For instance, the mode of our original dataset 5, 3, 1, 5, 2 is 5, as it appears more times (twice) than any other value.
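One way to sketch this in Python is to count how often each value appears and pick the most frequent one. The variable names are my own choices for illustration.

```python
from collections import Counter

data = [5, 3, 1, 5, 2]

# Count how many times each value appears
counts = Counter(data)

# most_common(1) returns a one-item list holding the (value, count)
# pair with the highest count
mode_value, frequency = counts.most_common(1)[0]

print(mode_value)  # 5
print(frequency)   # 2
```

The standard library's `statistics.mode` gives the same answer for this dataset.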

In the above examples, I’ve used an extremely small dataset of 5 or 6 values, just to share the concept. Real-world datasets are often far larger, sometimes running to millions of values, and these measures of central tendency are calculated in exactly the same way for them.

Though of course, like I mentioned earlier, depending on the situation and a few other factors (which I will get into in a future blog post), one may be more appropriate to use than another.

That being said, to accurately describe a dataset, it’s always useful to pair a measure of central tendency with a measure of variability or dispersion, as you will see below.

**Measures of Variability (Dispersion)**

Measures of variability, also known as measures of dispersion, help us understand how spread out the data is: how far the values in the dataset are from the central tendency, and from each other.

This is important because it’s not always possible to describe a dataset effectively from its central tendency alone. Using a measure of variability together with a measure of central tendency gives a clearer picture of the dataset than using either on its own.

Here’s an example to illustrate the point.

Consider these two datasets.

Dataset 1: -5, 0, 5, 10, 15

Dataset 2: 3, 4, 5, 6, 7

Let’s calculate the mean of the two datasets.

Mean of Dataset 1: (-5 + 0 + 5 + 10 + 15) / 5 = 25 / 5 = 5

Mean of Dataset 2: (3 + 4 + 5 + 6 + 7) / 5 = 25 / 5 = 5

As you can see, the mean of both datasets is exactly the same, even though the datasets are quite different from each other. In the first dataset, the numbers are spread much farther from the mean (apart from the 5 that sits exactly on the mean, the closest data points are 5 away from it), whereas in the second dataset even the farthest data point is only 2 away from the mean.

This is why a measure of central tendency alone cannot always describe a dataset effectively, and why we need to understand measures of variability or dispersion.

Let’s start with some of the most commonly used measures of variability: range, variance and standard deviation.

#### Range

The range is simply the difference between the highest value and the lowest value in the dataset.

Sticking with the same datasets, let’s calculate the range of each.

Dataset 1: -5, 0, 5, 10, 15

Range of Dataset 1: 15 - (-5) = 20

Dataset 2: 3, 4, 5, 6, 7

Range of Dataset 2: 7 - 3 = 4
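In Python, the range calculation is a one-liner built on `max` and `min`. `value_range` is a hypothetical helper name of my own (I avoided calling it `range`, since that would shadow Python's built-in of the same name).

```python
dataset_1 = [-5, 0, 5, 10, 15]
dataset_2 = [3, 4, 5, 6, 7]

def value_range(values):
    # Highest value minus lowest value
    return max(values) - min(values)

print(value_range(dataset_1))  # 20
print(value_range(dataset_2))  # 4
```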

#### Variance

Variance is essentially the average of the squared differences from the mean.

I know it sounds complicated when I put it that way, but it’s really simple once you break it down, which is exactly what I’ll do, step by step.

To calculate variance:

You first need the mean of the dataset (the simple average of its values).

Then you subtract the mean from each data point in the dataset and square the difference (this is where the squared differences come from).

Side note: The differences are squared to handle the negative values. I can elaborate on the reasoning in a different blog post, but for now it’s sufficient to understand that squaring ensures no difference is negative, so differences above and below the mean can’t cancel each other out.

Finally, you calculate the average of those squared differences. And that’s it. You have the variance of the dataset.

Variance is best understood with an example.

Let’s calculate the variance of Dataset 1.

Dataset 1: -5, 0, 5, 10, 15

Mean of Dataset 1: 5

So the variance of Dataset 1 is:

{(-5 - 5)² + (0 - 5)² + (5 - 5)² + (10 - 5)² + (15 - 5)²} / 5

= {(-10)² + (-5)² + (0)² + (5)² + (10)²} / 5

= {100 + 25 + 0 + 25 + 100} / 5

= 250 / 5 = 50

Let’s calculate the variance of Dataset 2.

Dataset 2: 3, 4, 5, 6, 7

Mean of Dataset 2: 5

So the variance of Dataset 2 is:

{(3 - 5)² + (4 - 5)² + (5 - 5)² + (6 - 5)² + (7 - 5)²} / 5

= {(-2)² + (-1)² + (0)² + (1)² + (2)²} / 5

= {4 + 1 + 0 + 1 + 4} / 5

= 10 / 5 = 2
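The three steps can be sketched in Python like this. `population_variance` is a hypothetical helper name of my own. It divides by the number of values, exactly as in the worked examples above, which is what Python's standard library calls the population variance (`statistics.pvariance`).

```python
dataset_1 = [-5, 0, 5, 10, 15]
dataset_2 = [3, 4, 5, 6, 7]

def population_variance(values):
    # Step 1: the mean of the dataset
    mean_value = sum(values) / len(values)
    # Step 2: squared difference of each data point from the mean
    squared_diffs = [(x - mean_value) ** 2 for x in values]
    # Step 3: the average of those squared differences
    return sum(squared_diffs) / len(values)

print(population_variance(dataset_1))  # 50.0
print(population_variance(dataset_2))  # 2.0
```

One thing to watch out for: `statistics.variance` (without the `p`) divides by one less than the number of values, so it will not reproduce the numbers in this post.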

#### Standard Deviation

One concern with variance is its unit of measurement.

For instance, if the values in our datasets were in meters, the variance would be in square meters (m²).

So the variance of Dataset 1 would be 50 m² and the variance of Dataset 2 would be 2 m², which, let’s face it, is an odd unit for describing the spread of lengths.

This is why many find it easier to talk in terms of standard deviation, which is simply the square root of the variance.

It also keeps the units of measurement simpler to understand: if the original values of the dataset are in meters, the standard deviation will also be in meters.

So let’s calculate the standard deviation of Datasets 1 and 2.

Dataset 1: -5, 0, 5, 10, 15

Variance of Dataset 1: 50

So the standard deviation of Dataset 1 is √50 ≈ 7.07

Dataset 2: 3, 4, 5, 6, 7

Variance of Dataset 2: 2

So the standard deviation of Dataset 2 is √2 ≈ 1.41
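As a rough Python sketch, the standard deviation is the square root of the variance calculated the same way as above (dividing by the number of values). `population_std_dev` is a hypothetical helper name of my own.

```python
import math

dataset_1 = [-5, 0, 5, 10, 15]
dataset_2 = [3, 4, 5, 6, 7]

def population_std_dev(values):
    mean_value = sum(values) / len(values)
    # Variance: the average of the squared differences from the mean
    variance = sum((x - mean_value) ** 2 for x in values) / len(values)
    # Standard deviation: the square root of the variance
    return math.sqrt(variance)

print(round(population_std_dev(dataset_1), 2))  # 7.07
print(round(population_std_dev(dataset_2), 2))  # 1.41
```

The standard library's `statistics.pstdev` performs the same calculation.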

Now that we’ve explored both the measures of central tendency and the measures of variability using Datasets 1 and 2, I believe you can see why it’s best to use them together to describe a dataset effectively.

When you look at just the mean of these two datasets, you might think they are very similar, as they both have the same mean of 5.

But once you start factoring in the measures of variability, you realize how different these two datasets are from each other.

Even if we look at range alone, the first dataset has a much larger range of 20, whereas the second dataset has a range of only 4. This by itself tells us that the values in the first dataset are more dispersed than those in the second.

This is why it’s always helpful to describe a dataset using both a measure of central tendency and a measure of dispersion. Together, they give a much more accurate idea of the dataset than either measure on its own.
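To put the two kinds of measures side by side, here is a small Python sketch. `describe` is a hypothetical helper of my own (not a library function) that reports the mean, the range and the standard deviation, using the divide-by-the-number-of-values calculation from this post.

```python
from statistics import mean, pstdev

dataset_1 = [-5, 0, 5, 10, 15]
dataset_2 = [3, 4, 5, 6, 7]

def describe(values):
    # One measure of central tendency plus two measures of variability
    return {
        "mean": mean(values),
        "range": max(values) - min(values),
        "std_dev": round(pstdev(values), 2),
    }

summary_1 = describe(dataset_1)
summary_2 = describe(dataset_2)

print(summary_1)
print(summary_2)
```

The two summaries share the same mean, but the range and standard deviation make the difference between the datasets obvious at a glance.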

And that’s it for this introductory post on descriptive statistics. I hope you found it useful.