Exploratory data analysis (EDA) is like "interviewing" the data: the analyst gets to know the data and learns what interesting things it has to say. Analysts should explore the data for potential research questions before jumping into confirming answers with hypothesis tests and inferential statistics.
EDA involves the following steps:
Classifying variables as continuous, categorical, etc.
Summarizing variables using descriptive statistics.
Visualizing variables using charts.
Now, let's explore each step in detail.
I. Classifying variables.
What are variables?
Variables are characteristics that vary across observations. Each variable provides a different piece of information about an observation. Classifying variables gives us distinctions that are helpful in the analysis.
Classifying variables is somewhat arbitrary and built on rules of thumb rather than hard-and-fast criteria.
Categorical (Qualitative) variables describe a quality or characteristic of an observation. A typical question answered by categorical variables is "Which kind/type?". They are often represented by non-numeric values.
Binary Variables - these variables can take only two levels, often stated as yes/no responses. Some examples are:
* Married? (Y/N)
* Sex (F/M)
* Vegan diet (Y/N)
Nominal Variables - qualitative variables with more than two levels and no intrinsic ordering between the levels. Some examples are:
* Country of origin
* Favorite color
* Favorite travel destination
Ordinal Variables - these variables take more than two levels, and there is an intrinsic ordering between the levels. Some examples are:
* Beverage size (small, medium, large)
* Class (freshman, sophomore, junior, senior)
Quantitative variables describe a measurable quantity of an observation. A typical question answered by quantitative variables is "How much?" or "How many?". They are mostly represented by numbers.
Continuous Variables - these variables can take an infinite number of values between any two other values. Some examples are:
* Height
* Surface area
Discrete Variables - these variables can take only a finite number of countable values between any two values. Some examples are:
* Number of individuals in a household
* Number of students in a classroom
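The classification rules above can be sketched as a small Python helper. This is only a rule-of-thumb illustration, not a standard algorithm: the function name `classify_variable` and the `ordered_levels` parameter are hypothetical, and the sketch assumes that whole-number data are counts (discrete) and that an ordinal variable is recognized only when the analyst supplies the ordering.

```python
from numbers import Real


def classify_variable(values, ordered_levels=None):
    """Rule-of-thumb classification of one variable (a list of observed values)."""
    levels = set(values)
    # bool is a subclass of int in Python, so exclude it from the numeric check.
    numeric = all(isinstance(v, Real) and not isinstance(v, bool) for v in values)
    if numeric:
        # Assumption: whole numbers are counts; anything else is a measurement.
        if all(float(v).is_integer() for v in values):
            return "discrete"
        return "continuous"
    if len(levels) <= 2:
        return "binary"
    # Ordering cannot be inferred from labels, so it must be passed in explicitly.
    if ordered_levels is not None and levels <= set(ordered_levels):
        return "ordinal"
    return "nominal"


print(classify_variable(["Y", "N", "Y"]))
print(classify_variable(["red", "blue", "green"]))
print(classify_variable(["small", "large", "medium"],
                        ordered_levels=["small", "medium", "large"]))
print(classify_variable([3, 4, 2, 5]))
print(classify_variable([152.4, 160.0, 171.5]))
```

Heights recorded as whole numbers would be labeled discrete by this heuristic, which is one reason the classification is better treated as a starting point than as a verdict.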
II. Summarizing variables using descriptive statistics.
Descriptive statistics summarize data either as simple quantitative measures, such as percentages and means, or as visual summaries, such as histograms and box plots. Both categorical and quantitative variables can be described this way. When there is more than one variable, descriptive statistics also help summarize relationships between variables, using tools such as scatter plots. Descriptive statistics can be broadly classified into:
Sorting/grouping
Sorting and grouping are often done using frequency distribution tables. For continuous variables, it is better to group the values into intervals in the frequency table. Another way of presenting a frequency distribution is the "stem and leaf" diagram, which keeps every individual value and is therefore considered a more exact form of description. For example, consider the heights of ten 7th graders: 60, 64, 62, 58, 67, 70, 59, 69, 62, 66. Taking the tens digit as the stem and the units digit as the leaf, the stem and leaf plot for this data is:
5 | 8 9
6 | 0 2 2 4 6 7 9
7 | 0
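A stem and leaf plot for the ten heights can also be built programmatically. The following is a minimal Python sketch, assuming whole-number values where the tens part is the stem and the units digit is the leaf; the function name `stem_and_leaf` is illustrative.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Map each stem (value // 10) to its sorted list of leaves (value % 10)."""
    plot = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)
        plot[stem].append(leaf)
    return dict(plot)


heights = [60, 64, 62, 58, 67, 70, 59, 69, 62, 66]
plot = stem_and_leaf(heights)

# One row per stem, leaves in ascending order.
for stem in sorted(plot):
    print(f"{stem} | {' '.join(str(leaf) for leaf in plot[stem])}")
```

Because every leaf is kept, the original values can be read back from the plot, which a grouped frequency table does not allow.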
Illustration/Visual display of data
The most common tools for visual display are frequency diagrams, bar charts (for discrete variables), and histograms (for continuous variables). Composite bar charts can be used to compare variables. A pie chart can show how a total quantity is divided among its constituent categories. Scatter diagrams can illustrate the relationship between two variables.
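The simple quantitative measures mentioned in this section, such as means and percentages, can be computed with Python's standard library. The sketch below reuses the ten heights from the example above; the beverage-size list is made-up illustration data, and the helper name `percentages` is hypothetical.

```python
import statistics
from collections import Counter

# Quantitative variable: the ten heights from the stem and leaf example.
heights = [60, 64, 62, 58, 67, 70, 59, 69, 62, 66]
summary = {
    "mean": statistics.mean(heights),
    "median": statistics.median(heights),
    "mode": statistics.mode(heights),
    "stdev": statistics.stdev(heights),  # sample standard deviation
    "min": min(heights),
    "max": max(heights),
}
print(summary)


def percentages(values):
    """Share of each level of a categorical variable, in percent."""
    counts = Counter(values)
    total = len(values)
    return {level: 100 * n / total for level, n in counts.items()}


# Categorical variable: made-up beverage sizes, for illustration only.
sizes = ["small", "medium", "large", "medium"]
print(percentages(sizes))
```

Means and standard deviations only make sense for quantitative variables, while counts and percentages are the natural summary for categorical ones.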
III. Visualizing variables using charts.
Visualizing variables using charts makes the data more interesting, dynamic, relevant, and well received by diverse audiences. With the right chart, you can shine a light on hidden information and details that you wouldn’t uncover in a spreadsheet. The following are some of the most popular chart types:
Bar charts - effective at comparing categories within a single measure.
Bullet charts - show progress against a goal by comparing measures.
Line graphs - connect several distinct data points, presenting them as one continuous evolution.
Histograms and box plots - show where the data is clustered and can compare categories.
Maps - useful for visualizing location-specific questions or aiding geographical exploration.
Pie charts - very powerful for adding detail to other visualizations but aren’t as effective on their own.
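To make the idea of a histogram concrete, here is a minimal text-mode sketch in Python. It assumes whole-number data and no plotting library (a real analysis would normally use a charting tool), and the names `histogram`, `BIN_WIDTH`, and `START` are hypothetical. The data are the ten heights from the stem and leaf example.

```python
def histogram(values, bin_width, start):
    """Count the values falling into each [lower, lower + bin_width) bin."""
    counts = {}
    for v in values:
        lower = start + ((v - start) // bin_width) * bin_width
        counts[lower] = counts.get(lower, 0) + 1
    # Bins with no values are simply absent from the result.
    return dict(sorted(counts.items()))


BIN_WIDTH = 5
START = 55
heights = [60, 64, 62, 58, 67, 70, 59, 69, 62, 66]
bins = histogram(heights, bin_width=BIN_WIDTH, start=START)

# Draw one bar per bin; the bar length is the number of values in the bin.
for lower, count in bins.items():
    print(f"{lower}-{lower + BIN_WIDTH - 1}: {'#' * count}")
```

The longest bar marks the interval where the data cluster, which is the main thing a histogram is meant to show.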
Finally, a few tips and techniques to make visualization more intuitive and interesting:
Choose the right charts/graphs.
Use predictable patterns for layouts.
Tell data stories quickly with clear color cues.
Add contextual clues with shapes and designs.
Use size to visualize values.
Apply text appropriately, as needed.