We are surrounded by numerous facts pushed to us by various media sources: blogs, online ads, TV advertisements, word of mouth, social media posts and more. Often without our noticing, these influence our thoughts, positively or negatively, and our daily decisions about what to buy, whom to believe and so on. If the sources are biased or untruthful, important decisions based on the resulting perception can do the individual lasting harm.
One age-old example is the cigarette advertising of the mid-1900s.
In the 1930s, a nationwide advertisement claimed that a significant number of physicians endorsed cigarette smoking. As we now know, tobacco use is harmful and the claim behind that advertisement was false, but the damage that such false statistical figures can cause is easy to infer.
It is the role of the statistician, data analyst or data scientist to analyze, verify and publish findings in the most truthful, objective and unbiased manner possible, using a sample that makes sense in the context of the research undertaken.
What is Statistics?
The Britannica website defines statistics as the science of collecting, analyzing, presenting, and interpreting data. The data collection method and the analytic approach determine the veracity of a statistical claim. Statistics lays down the applicable approaches depending on the statistical problem chosen.
Why is statistics important?
Statistical claims form the basis of many daily-life decisions, but unless the figures are quite incredible, most people believe rather than question them. It is therefore important to know the ways statistical data can be misrepresented, so as to determine what is believable and what is not. Cherry picking is the selective use of data points to present a desired statistical outcome. Overgeneralization is a conclusion drawn from a flawed sample. Biased sampling is very similar to cherry picking, except that it means deliberately selecting a sample from a very specific subgroup that does not truly reflect the actual outcome.
Sometimes statistics are reported without a margin of error, even for a very clear incidence of a certain outcome; omitting it makes the claim look stronger than it really is, and is therefore misleading.
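As a rough sketch, the margin of error for a reported proportion can be approximated from the sample size, assuming a simple random sample and the normal approximation (the 70% figure below is hypothetical):

```python
import math

def margin_of_error(p: float, n: int, z: float = 1.96) -> float:
    """Approximate 95% margin of error for a sample proportion,
    assuming a simple random sample and the normal approximation."""
    return z * math.sqrt(p * (1 - p) / n)

# A claim like "70% of respondents agree" means much less from 50
# respondents than from 5,000.
print(round(margin_of_error(0.70, 50), 3))    # ≈ 0.127, i.e. ±12.7 points
print(round(margin_of_error(0.70, 5000), 3))  # ≈ 0.013, i.e. ±1.3 points
```

The same headline percentage can therefore be either a vague hint or a precise estimate, which is exactly why the margin should be reported alongside it.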
Conflating correlation with causation is another misleading way to present data, since correlation does not always imply causation.
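A small Python sketch with hypothetical monthly returns shows how cherry picking shifts a summary statistic:

```python
from statistics import mean

# Hypothetical monthly returns (%) for a product over one year.
returns = [-3.0, -1.5, 4.0, -2.0, 5.5, -4.0, 6.0, -1.0, -2.5, 3.5, -3.5, -1.0]

honest = mean(returns)                               # uses every data point
cherry_picked = mean([r for r in returns if r > 0])  # keeps only good months

print(f"All twelve months: {honest:.2f}%")        # ≈ 0.04% — roughly flat
print(f"Positive months only: {cherry_picked:.2f}%")  # 4.75% — looks great
```

Both numbers are "true" arithmetic, but only the first one honestly summarizes the year.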
How do we know the data is correct?
This is where we circle back to the statistical problem, which tells us what kind of dataset we are looking for. Broadly, a dataset may cover a population or a sample: a population includes all members of the target group, while a sample is a specific subset of it.
Collecting data from an entire population is often expensive and time-consuming, so depending on the statistical problem it may be better to collect data from a representative sample of the population.
While selecting or determining the sample, one needs to ensure that it is an accurate representation of the larger group and is unbiased.
Bias originates in multiple ways. Sampling bias occurs when the medium or method of collection unwittingly targets a specific subset of the target group and excludes other subsets.
Self-selection bias arises when the data collected is not a true reflection of the target group, because only those participants with strong personal opinions choose to respond. Confirmation bias is rooted in the statistician’s own preconceived notions or beliefs, and may influence how the sample is sought or how the results are interpreted.
Inaccurate model fitting (overfitting or underfitting) also leads to an incorrect basis for analysis: an overfitted model follows the sample data, including its noise, too closely, while an underfitted model is too simple to capture the underlying pattern.
A popular way to counter biased sampling is to randomize or stratify the sample. The latter means including a further level of detail in the data to qualify where the samples are captured from, providing a truer perspective.
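As an illustration, here is a sketch in Python of both approaches on a hypothetical population that is 70% urban and 30% rural:

```python
import random

# Hypothetical target group: 70% urban, 30% rural respondents.
population = ([{"area": "urban", "id": i} for i in range(700)]
              + [{"area": "rural", "id": i} for i in range(300)])

random.seed(42)  # fixed seed so the sketch is reproducible

# Simple random sample: every member has an equal chance of selection.
simple = random.sample(population, 100)

# Stratified sample: draw from each subgroup in proportion to its size,
# so neither urban nor rural respondents are over- or under-represented.
urban = [p for p in population if p["area"] == "urban"]
rural = [p for p in population if p["area"] == "rural"]
stratified = random.sample(urban, 70) + random.sample(rural, 30)

print(sum(p["area"] == "urban" for p in stratified))  # exactly 70
```

The simple random sample is unbiased on average but its urban/rural split varies from draw to draw; the stratified sample fixes the split by construction.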
What kinds of data are collected?
In broad terms, most data can be classified as qualitative or quantitative. Quantitative data can be measured, while qualitative data can be categorized and labelled.
What are the different analytic approaches?
The most common types of analytic approaches are descriptive and inferential.
Descriptive statistics summarize data using measures of central tendency and measures of variation (dispersion). That is, descriptive statistics produce a figure or a range of figures that gives an impression of the data: a common example of the former is the mean, while an example of the latter is the variance.
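Both kinds of measures are available in Python's standard library; the daily sales figures below are hypothetical:

```python
from statistics import mean, median, variance

# Hypothetical daily sales figures for a small shop (note the outlier, 40).
sales = [12, 15, 11, 14, 13, 40, 12]

print(round(mean(sales), 2))      # central tendency, pulled up by the outlier
print(median(sales))              # central tendency, robust to the outlier: 13
print(round(variance(sales), 1))  # dispersion (sample variance)
```

Reporting the median alongside the mean is a simple guard against a single extreme value dominating the summary.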
Another approach is inferential statistics, which may include predictive analysis. Inferential statistics looks at the data and forms hypotheses that are then tested with various statistical tests to determine their validity. This may be followed by predictive analysis, where statistical models are employed to predict outcomes based on historical or existing data; forecasting is a common example. This is helpful for future planning of resources or supplies in some contexts.
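One simple, assumption-light way to test such a hypothesis is a permutation test. The exam scores below are hypothetical, and this is a sketch of the idea rather than the only way inference is done:

```python
import random
from statistics import mean

# Hypothetical test scores for two teaching methods.
group_a = [78, 85, 82, 88, 90, 84]
group_b = [72, 75, 80, 74, 78, 76]

observed = mean(group_a) - mean(group_b)  # the gap we want to explain

# Permutation test: if the teaching method made no difference, reshuffling
# the group labels should produce gaps this large fairly often.
random.seed(0)  # fixed seed so the sketch is reproducible
pooled = group_a + group_b
trials = 10_000
count = 0
for _ in range(trials):
    random.shuffle(pooled)
    gap = mean(pooled[:6]) - mean(pooled[6:])
    if gap >= observed:
        count += 1

p_value = count / trials
print(f"observed gap: {observed:.2f}, p-value: {p_value:.4f}")
```

A small p-value means the observed gap is unlikely under pure chance, so the hypothesis that the groups differ gains support.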
What are some of the ways data can statistically be presented?
A distribution describes the range and shape of all the data points of a quantitative variable, and may be used descriptively or predictively.
Correlation tries to find a linear relationship between two different quantitative variables.
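Pearson's correlation coefficient is the usual measure of such a linear relationship; the study-hours data below is hypothetical:

```python
from statistics import mean

# Hypothetical data: hours studied vs. exam score.
hours = [1, 2, 3, 4, 5, 6]
score = [52, 55, 61, 64, 70, 74]

def pearson(x, y):
    """Pearson correlation coefficient: strength of the linear
    relationship between two quantitative variables (-1 to 1)."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

print(round(pearson(hours, score), 3))  # ≈ 0.996, a strong linear relationship
```

Even a coefficient this close to 1 says nothing about causation: the same code would report a strong correlation between ice-cream sales and drownings, both driven by summer weather.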
Probability is the likelihood of an outcome occurring, expressed relative to a fixed number of possible outcomes.
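In the classical view this is simply the number of favourable outcomes divided by the total number of equally likely outcomes:

```python
from fractions import Fraction

def probability(favourable: int, total: int) -> Fraction:
    """Classical probability: favourable outcomes over total
    equally likely outcomes."""
    return Fraction(favourable, total)

# Example: drawing an ace from a standard 52-card deck.
p_ace = probability(4, 52)
print(p_ace)                  # 1/13
print(round(float(p_ace), 4)) # 0.0769
```

Using exact fractions avoids rounding artefacts when small probabilities are combined.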
What should we remember?
Daily statistical claims from multiple media sources should be looked at with an objective eye to judge the truthfulness of the statistics. Some key points to consider before we believe these facts are:
· Is the sample large enough to be representative of the target population, and is it unbiased?
· Is the analytic approach applicable and well suited to the data?
· Are the inferences drawn using the correct methods?
If we can answer these questions in the affirmative, it is quite likely the statistical fact in question deserves to shape our opinions going forward.