Have you ever collected pebbles on the beach? Remember sorting them by size and color? Statistics is like that, but for any kind of data! It's a way to make sense of information, whether it's pebbles, surveys, or social media trends.
This blog will be your guide to understanding fundamental statistics, step by step.
Imagine that when you were 5 years old, you and your friends collected different pebbles from the beach. Statistics is just another fun way to learn more about all the pebbles your group collected. You might have unknowingly used statistics while working out:
Who got how many pebbles?
Who had the most red pebbles?
What different shapes of pebbles did all of you collect together?
Which colors were unique among the collected pebbles?
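As a rough illustration, those childhood questions map onto a few lines of Python. The friends' names and pebble colors below are made up for the example.

```python
from collections import Counter

# Hypothetical data: each friend's pebbles, recorded by color.
pebbles = {
    "Asha": ["red", "blue", "red", "green"],
    "Ben": ["blue", "blue", "grey"],
    "Chloe": ["red", "green"],
}

# Who got how many pebbles?
counts = {friend: len(colors) for friend, colors in pebbles.items()}

# Who had the most red pebbles?
red_counts = {friend: colors.count("red") for friend, colors in pebbles.items()}
most_red = max(red_counts, key=red_counts.get)

# Which colors appear in the whole collection, and how often?
color_totals = Counter(color for colors in pebbles.values() for color in colors)
unique_colors = sorted(color_totals)

print(counts)
print(most_red)
print(unique_colors)
```

Counting, comparing, and listing distinct values are the same steps you would apply to any dataset, whatever it describes.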
So, statistics is nothing but the superpower of drawing insights or conclusions from the available data. Before getting into these concepts, we need to understand the data itself.
Data Types
Data is everywhere. It's the pebbles that you collected with your friends. Information, on the other hand, is what you get when you make sense of the data.
There are different ways to categorize the data related to pebbles:
Categorical Data: This is dividing your pebbles by color (Ex: Red/Blue/Green). Categorical data groups things together based on similarities. Categorical data that can be grouped without any order is called nominal data (Ex: colors of pebbles), whereas categorical data that has an order to it is called ordinal data (Ex: size of pebbles: Small/Medium/Large).
Numerical Data: This is describing your pebbles with numbers, such as measuring each pebble so you can sort them from smallest to largest. There are two types of numerical data: discrete and continuous.
Discrete data can only take whole-number values and often represents things that can be counted (Ex: the number of red pebbles). Now suppose you want to measure the weight of each pebble in grams. The weights could take any value, like 3.25 grams or 7.56 grams; they are not limited to whole numbers the way a count of pebbles is. This kind of data, where values can fall anywhere within a continuous range, is called continuous data.
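Here is a minimal sketch of how these data types might look in Python. The field names (color, size, weight_g) and the three sample pebbles are assumptions made for illustration.

```python
# Ordinal categories carry an order, so we store that order explicitly.
SIZE_ORDER = ["Small", "Medium", "Large"]

# Hypothetical pebbles:
#   color    -> nominal (categories with no order)
#   size     -> ordinal (categories with an order)
#   weight_g -> continuous (any value within a range)
collection = [
    {"color": "Red", "size": "Small", "weight_g": 3.25},
    {"color": "Blue", "size": "Large", "weight_g": 7.56},
    {"color": "Red", "size": "Medium", "weight_g": 5.10},
]

# Discrete: a count is always a whole number.
red_count = sum(1 for p in collection if p["color"] == "Red")

# Sorting by size only makes sense because sizes have an order.
by_size = sorted(collection, key=lambda p: SIZE_ORDER.index(p["size"]))

print(red_count)
print([p["size"] for p in by_size])
```

Notice that sorting colors this way would be meaningless, because nominal data has no built-in order.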
Proportional data
This is a type of numerical data that compares a part to its whole. Ex: Let's say you and your friends collected 20 pebbles in a basket. Out of those, 2 are red. To figure out the proportion of red pebbles, you divide the number of red pebbles (2) by the total number of pebbles (20). This gives 0.1, which is equivalent to 10% of the total collection.
Question #1: Let's say you have 5 pebbles of 2.5 grams each. Is this proportional data, continuous data, or discrete data? (Please share your answers in the comments.)
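The same arithmetic as a small Python sketch, using the numbers from the example above (the variable names are just for illustration):

```python
total_pebbles = 20
red_pebbles = 2

# Proportion = part / whole
proportion_red = red_pebbles / total_pebbles

# Multiply by 100 to express the proportion as a percentage.
percent_red = proportion_red * 100

print(f"{proportion_red:.2f} of the pebbles are red ({percent_red:.0f}%)")
```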
Distributions
A distribution is all about understanding how your pebbles are clustered. It helps us look at the bucket of pebbles with some perspective:
Normal Distribution: If most of the pebbles had similar weights clustered around a middle value (say, between 2 grams and 8 grams, with most near 5 grams), and few or no outliers, it would be a normal distribution (Bell Curve).
Uniform Distribution: If every weight in a range (say, anywhere from 2 grams to 8 grams) were equally common, with no weight showing up more often than the others, it would be a uniform distribution (Flat Line).
Bimodal Distribution: If half of the pebbles were light (2-4 grams) and the other half were comparatively heavy (20-40 grams), it would be a bimodal distribution (Double Peaks).
Skewed Distribution: If a few of the pebbles were small (10%) but the majority were large (90%), so that the curve trails off to one side, it would be a skewed distribution (Leaning Curve).
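To get a feel for these shapes, here is a rough simulation using Python's standard library. The sample sizes, the 1-gram spread of the normal case, and the exact weight ranges for the skewed case are assumptions chosen for illustration, not measurements.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is repeatable

# Normal: weights cluster around 5 g (spread of 1 g is an assumption).
normal = [random.gauss(5, 1) for _ in range(1000)]

# Uniform: every weight between 2 g and 8 g is equally likely.
uniform = [random.uniform(2, 8) for _ in range(1000)]

# Bimodal: half light (2-4 g), half heavy (20-40 g).
bimodal = ([random.uniform(2, 4) for _ in range(500)]
           + [random.uniform(20, 40) for _ in range(500)])

# Skewed: 10% small pebbles, 90% large ones.
skewed = ([random.uniform(1, 3) for _ in range(100)]
          + [random.uniform(14, 16) for _ in range(900)])

for name, data in [("normal", normal), ("uniform", uniform),
                   ("bimodal", bimodal), ("skewed", skewed)]:
    mean = statistics.mean(data)
    median = statistics.median(data)
    print(f"{name:8s} mean={mean:6.2f} median={median:6.2f}")
```

In the symmetric cases (normal and uniform) the mean and median land close together. In the skewed case the handful of small pebbles pulls the mean below the median, which is one simple way to spot a leaning curve.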
Question #2: Let's say you have 50 pebbles. Five of them weigh 2.5 grams each and 3 of them weigh 7.5 grams each. All the other pebbles weigh more than 15 grams. Which distribution does this data represent (normal/uniform/bimodal/skewed)? (Please share your answers in the comments.)
Sampling Distribution
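Sampling is when we take only part of the entire data for analysis (Ex: 10% of the data). If we draw many samples and record the average of each one, those averages form a sampling distribution. Its curve is skinnier than the curve of the original data, because averaging smooths out the extremes. Sampling does not change the underlying chances, though. For example: Let's say 2 out of 20 pebbles are red (10%). In a random sample of 5 pebbles, each pebble still has a 10% chance of being red, so on average we expect half a red pebble per sample. What changes is the variability: one sample might contain no red pebbles at all while another contains both, so a small sample gives a less precise picture.
By carefully collecting a representative sample and choosing a suitable sample size, we can gain valuable insights into the overall distribution.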
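A quick simulation can check what sampling does to the red-pebble example (20 pebbles, 2 of them red, samples of 5). The number of trials and the random seed are arbitrary choices for this sketch.

```python
import math
import random

random.seed(0)  # fixed seed so the sketch is repeatable

# The population from the example: 20 pebbles, 2 of them red.
population = ["red"] * 2 + ["other"] * 18
sample_size = 5
trials = 10_000

# Draw many samples (without replacement) and record the red share of each.
red_shares = []
for _ in range(trials):
    sample = random.sample(population, sample_size)
    red_shares.append(sample.count("red") / sample_size)

average_red_share = sum(red_shares) / trials

# Exact chance that a sample of 5 contains at least one red pebble.
p_no_red = math.comb(18, 5) / math.comb(20, 5)
p_at_least_one_red = 1 - p_no_red

print(f"average red share across samples: {average_red_share:.3f}")
print(f"chance of at least one red pebble: {p_at_least_one_red:.3f}")
```

Averaged over many samples, the red share stays close to the population's 10%. Any single sample of 5, however, can only show 0%, 20%, or 40% red, which is the extra variability a small sample brings.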
Hypothesis
The concept of a hypothesis consists of the following:
Null Hypothesis:
Imagine you are curious about how much a pebble weighs. A null hypothesis would be saying, "Maybe the weights of all the pebbles are the same." It's like a starting point where we assume there isn't any special reason for a specific outcome.
For our example above, the null hypothesis would be that we are not expecting to find a bunch of pebbles all weighing exactly 5 grams each. It's like saying it's probably just a coincidence if a few pebbles happen to weigh the same.
Variance:
Variance looks at how scattered the pebble weights are. If we weigh them all, are the weights clustered together (low variance) or scattered all over the place (high variance)?
Low Variance: Imagine most of the pebbles weigh around 3-4 grams, with just a few lighter or heavier. In this low-variance situation, coincidence (our null hypothesis) becomes a less convincing explanation for finding several pebbles at exactly 5 grams. There just wouldn't be that many pebbles close to 5 grams to begin with.
High Variance: But what if the pebble weights are all over the map, ranging from 1 gram to 10 grams? This high-variance situation makes it more likely that we find a few pebbles that coincidentally weigh 5 grams each. After all, with the weights so spread out, there's a higher chance of some pebbles landing at the 5-gram mark purely by chance.
So, the null hypothesis sets the starting point, and variance helps us understand how likely the coincidence explanation (the null hypothesis) is to hold true, based on how spread out the pebble weights actually are.
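The low-variance and high-variance situations can be compared with a short Python sketch. Both lists of weights are invented for illustration, and the 0.25-gram tolerance for "close to 5 grams" is an assumption. This shows spread only; it is not a formal hypothesis test.

```python
import statistics

# Hypothetical weights in grams.
low_spread = [3.1, 3.4, 3.6, 3.8, 3.5, 3.3, 3.9, 4.0, 2.5, 4.6]
high_spread = [1.0, 2.2, 3.5, 4.9, 5.0, 5.1, 6.4, 7.8, 9.1, 10.0]

def near_five(weights, tolerance=0.25):
    """Count pebbles within `tolerance` grams of 5 g."""
    return sum(1 for w in weights if abs(w - 5.0) <= tolerance)

for name, weights in [("low", low_spread), ("high", high_spread)]:
    variance = statistics.pvariance(weights)  # population variance
    print(f"{name:4s} variance={variance:5.2f} near 5 g: {near_five(weights)}")
```

With these made-up numbers, the tightly clustered pebbles have a much smaller variance and none of them land near 5 grams, while the widely scattered pebbles have a larger variance and a few land near 5 grams just by being spread out.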
Conclusion
So, we've explored some fundamental statistical concepts: data types, distributions, sampling, and hypothesis testing. Remember, statistics is like having a superpower for drawing insights from data. By understanding these basic principles, you'll be well on your way to analyzing information with confidence. Don't forget to leave your answers and questions in the comments below – let's explore the world of data together!