# Statistics for Data Analysis

What is Statistics?

Statistics is a branch of applied mathematics concerned with collecting, organizing, analyzing, and interpreting quantitative data. It is central to data analytics: data analysts and data scientists use statistics to gather, review, and analyze data and to draw conclusions from it. Statistics divides into 2 types: descriptive and inferential.

Types of Statistics:

Descriptive Statistics:

In short, descriptive statistics summarize or describe a set of data. There are 2 types of descriptive statistics:

1. Measure of Central tendency

2. Measures of Variability

1. Measure of Central tendency: This type of descriptive statistic gives you the center of the set of data. Mean, median, and mode are all examples, and each one locates the center of the data in a different way.

• Mean/Average (Arithmetic): You can find the arithmetic mean by adding all of the numbers and then dividing by how many numbers there are.

Example: Find the mean of 1, 2, 3, 4, 5

Step-1: Add all the numbers together: 1+2+3+4+5=15

Step-2: Divide the sum by the count: 15/5=3. You divide by 5 because there are 5 numbers in the set.
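The two steps above can be sketched in Python. This is a minimal illustration rather than part of the original walkthrough: the variable names are made up for the example, and `statistics.mean` from the standard library is used only as a cross-check.

```python
from statistics import mean

numbers = [1, 2, 3, 4, 5]

# Step 1: add all the numbers together.
total = sum(numbers)                 # 15

# Step 2: divide the sum by how many numbers there are.
manual_mean = total / len(numbers)   # 3.0

# Cross-check with the standard library.
library_mean = mean(numbers)

print(total, manual_mean, library_mean)
```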

• Sample Mean: The sample mean is the average of a sample drawn from a larger set of data. When the full data set is too large to measure, you take a sample of it and find the mean of that sample instead.

Example: What is the average height of women in Brazil? It is not possible to measure every single woman in Brazil, so we take a sample of the population and use the mean of the sample as an estimate of the population mean. Let's say we sampled 5 women with heights of 5.5ft, 6ft, 5ft, 5ft, and 5.5ft.

Formula for sample mean: x̄ = ( Σ Xi ) / n .

Here x̄ represents the sample mean.

Xi represents each individual value in the sample.

n refers to the number of values in the sample.

Note: Σ means “the sum of”

1st step: The formula may look intimidating at first, but it comes down to 2 steps. First, add up all the sample values. In our case that is 5.5ft+6ft+5ft+5ft+5.5ft=27ft.

2nd step: Divide the sum by the number of values. In our case we have 5 values, so 27/5=5.4ft. The sample mean is 5.4ft.
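As a rough sketch, the sample mean formula x̄ = (Σ Xi) / n can be written as a small Python function. The function name `sample_mean` and the list name `heights_ft` are assumptions made for this illustration; the heights are the five values from the example.

```python
def sample_mean(values):
    """Return x-bar = (sum of the values) / n for a non-empty list."""
    n = len(values)
    if n == 0:
        raise ValueError("sample_mean needs at least one value")
    return sum(values) / n

# The five sampled heights from the example, in feet.
heights_ft = [5.5, 6.0, 5.0, 5.0, 5.5]

total_height = sum(heights_ft)     # 1st step: 27.0
x_bar = sample_mean(heights_ft)    # 2nd step: 27.0 / 5

print(f"sum = {total_height} ft, n = {len(heights_ft)}, sample mean = {x_bar} ft")
```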

• Population mean: The population mean is the average of the entire set of data, not just a sample of it.

Formula for population mean: μ = (Σ Xi )/ N .

μ represents the population mean.

Xi represents all the individual items in the set of data.

N represents the number of items in the data set.

Note: Σ means “the sum of.”

Example: What is the average number of meals each person had at the hospital?

1 meal (2 people)

2 meals (3 people)

3 meals (57 people)

Step-1: Add all of the individual items. In our case, we would add all of the meals: (1×2)+(2×3)+(3×57)=2+6+171=179 meals.

Step-2: Divide by the number of items in the data set. In our case that is the number of people, 2+3+57=62, so we divide 179 by 62, which is about 2.89. Our mean is about 2.89 meals per person.
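Because the meals are given as counts (how many people ate how many meals), the population mean can be computed straight from that frequency table. The sketch below assumes the table is stored in a dictionary called `meal_counts`, a name chosen only for this example.

```python
# Frequency table from the example: meals eaten -> number of people.
meal_counts = {1: 2, 2: 3, 3: 57}

# Step 1: total meals = each meal count times the people who ate that many.
total_meals = sum(meals * people for meals, people in meal_counts.items())  # 179

# Step 2: divide by the number of people in the data set (N).
population_size = sum(meal_counts.values())                                 # 62
mu = total_meals / population_size

print(total_meals, population_size, round(mu, 2))
```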

• Median: The median is the middle value of a set of numbers. Write the numbers down in ascending order and then find the middle. If there is an even number of values, take the mean of the 2 middle numbers.

Example-1: Find the median of 2,1,4,3, and 5.

Step-1: Write them down in ascending order. 1,2,3,4, and 5.

Step-2: Find the middle. Since this set has an odd number of values, the median is simply the number in the middle. In this case, it would be 3.

Example-2: Find the median of 1,2,3,4,5, and 6.

Step-1: Write them down in ascending order. 1,2,3,4,5, and 6

Step-2: Since there are 2 numbers in the middle, 3 and 4, all you do is find the average of the 2 numbers. In this case, it would be (3+4)/2= 3.5. The median of this data set is 3.5
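Both median examples follow the same recipe: sort, then take the middle value or the average of the two middle values. A minimal Python sketch of that recipe follows; `manual_median` is a hypothetical helper written for this illustration, and `statistics.median` is the standard-library equivalent used as a cross-check.

```python
from statistics import median

def manual_median(values):
    """Sort the values, then return the middle one (odd count)
    or the mean of the two middle ones (even count)."""
    if not values:
        raise ValueError("manual_median needs at least one value")
    ordered = sorted(values)           # Step 1: ascending order
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:                     # odd count: a single middle value
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2   # even count: average the two

odd_result = manual_median([2, 1, 4, 3, 5])        # 3
even_result = manual_median([1, 2, 3, 4, 5, 6])    # 3.5

print(odd_result, even_result)
print(median([2, 1, 4, 3, 5]), median([1, 2, 3, 4, 5, 6]))
```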

• Mode: The mode of a data set is the value that appears most often. It is possible to have multiple modes. A data set with 2 modes is called bimodal.

Example-1: Find the mode of 1,2,3,4,5,6,1,2,8,9,4,1, and 1.

Find which number is repeated the most. In this case, it would be 1, which appears 4 times.

Example-2: Find the mode of 1,2,3,4,5,6,7,8,9,1, and 2.

Find which number is repeated the most. Since 1 and 2 each appear twice and every other number appears once, the modes are 1 and 2. This data set is bimodal.
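Counting how often each value appears can be sketched in Python with `collections.Counter`. The helper name `modes` is an assumption for this example; `statistics.multimode` (available in Python 3.8 and later) returns the same list and is shown as a cross-check.

```python
from collections import Counter
from statistics import multimode

def modes(values):
    """Return every value that ties for the highest count, in first-seen order.
    Raises ValueError on an empty list because max() has nothing to compare."""
    counts = Counter(values)
    highest = max(counts.values())
    return [value for value, count in counts.items() if count == highest]

example_1 = [1, 2, 3, 4, 5, 6, 1, 2, 8, 9, 4, 1, 1]
example_2 = [1, 2, 3, 4, 5, 6, 7, 8, 9, 1, 2]

modes_1 = modes(example_1)   # one mode: [1]
modes_2 = modes(example_2)   # two modes (bimodal): [1, 2]

print(modes_1, modes_2)
print(multimode(example_1), multimode(example_2))
```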

2. Measures of Variability: This type of descriptive statistic describes how spread out the data set is. Variance, standard deviation, and the minimum and maximum values all measure the scatter of the data set, while skewness and kurtosis describe the shape of its distribution. A few of these are explained below.

1. Variance: Variance tells you how scattered the data is. It is the average of the squared differences from the mean. (The larger the variance, the more spread out the data is from the mean.)

Example: The teacher wanted to find the variance of her students' test scores. Her students' test scores were 38, 72, and 55.

Step-1: Find the mean. (38+72+55)/3=55

Step-2: Find the difference between each test score and the mean.

38-55=-17

72-55=17

55-55=0

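Step-3: Square all of the differences. (-17)×(-17)=289, 17×17=289, and 0×0=0

Step-4: Add all of the squared differences. 289+289+0=578

Step-5: Divide by the number of differences. 578/3≈192.67 (rounded). Dividing by the number of scores gives the population variance; when the scores are only a sample of a larger group, the sample variance divides by one less than the number of scores instead.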
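The five variance steps can be followed one by one in Python. This sketch assumes the three scores are 38, 72, and 55 (the values used in the step-by-step calculation) and divides by the number of scores, which is what `statistics.pvariance` also does; the variable names are made up for the illustration.

```python
from statistics import pvariance

scores = [38, 72, 55]

# Step 1: find the mean.
mean_score = sum(scores) / len(scores)                    # 55.0

# Step 2: difference between each score and the mean.
differences = [score - mean_score for score in scores]    # [-17.0, 17.0, 0.0]

# Step 3: square each difference.
squared = [d ** 2 for d in differences]                   # [289.0, 289.0, 0.0]

# Step 4: add the squared differences.
sum_of_squares = sum(squared)                             # 578.0

# Step 5: divide by the number of scores (population variance).
variance = sum_of_squares / len(scores)

print(round(variance, 2), round(pvariance(scores), 2))
```

Note that `statistics.variance` (without the `p`) divides by one less than the number of values, so it would give a different result for the same scores.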

2. Standard deviation: The standard deviation is the square root of the variance. It tells you how far the values typically fall from the mean, in the same units as the data.

Example: We will use the example from above.

Step-1: First find the variance. We will use 192.67 from the above example.

Step-2: Find its square root. The square root of 192.67 is 13.88 (rounded).
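Taking the square root of the variance is a one-line step in Python. The sketch below recomputes the variance from the scores 38, 72, and 55 so that it stands alone, then compares the result with `statistics.pstdev`, the standard-library population standard deviation; the variable names are assumptions for the example.

```python
import math
from statistics import pstdev

scores = [38, 72, 55]

# Variance, dividing by the number of scores as in the worked example.
mean_score = sum(scores) / len(scores)
variance = sum((score - mean_score) ** 2 for score in scores) / len(scores)

# Standard deviation is the square root of the variance.
std_dev = math.sqrt(variance)

print(round(std_dev, 2), round(pstdev(scores), 2))
```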

Inferential Statistics:

Building on descriptive statistics, inferential statistics takes the measures that descriptive statistics produces and uses them to make predictions. It is used to draw conclusions about a whole group (the population) from a sample drawn from that group. A numerical characteristic of the population, such as its mean, is called a parameter.

Example: Estimating the average demand for a product by surveying a sample of customers about their shopping habits, in order to predict how much more of that product will have to be produced.

Happy Analyzing!