Structured Query Language or SQL

One of the most popular and easy-to-learn tools for data analysis is SQL. It lets us **retrieve**, **filter**, **sort**, **join**, and **aggregate** data. We can uncover patterns, trends, and outliers in our datasets using simple SQL queries.

Exploratory data analysis involves **understanding** the dataset we are working with. It is a process of **familiarizing** ourselves with the data, and a fundamental step in the analytical process.

Exploratory Data Analysis (EDA) with SQL

It is a process of using SQL (Structured Query Language) queries to explore and analyze tables in a database management system (DBMS).

In this blog, we will go through the steps involved in EDA with SQL.

I’ve used **PostgreSQL** as the DBMS to demonstrate. If you’re using a different SQL database, refer to its documentation for the correct syntax. Most commands and functions discussed here should be applicable across other SQL systems.

The **dataset** used is the *Hospital Dataset* from **Kaggle**: https://www.kaggle.com/datasets/namratakapoor1/hcdataset

To simulate a real-life scenario, I created a **database** named *Hospital* and a **schema** named *hospitaldata*, as shown in the figure below, and loaded this dataset into the PostgreSQL DBMS. The steps to load the data by creating tables and importing CSV files are not included here, as in real-world projects you would most likely be working with pre-loaded databases for exploration.

ERD of Hospital Data.

On gaining access to a database, the first things you’ll want to explore are the tables, columns, and their relationships. As shown in the figure above, there are 6 tables in the *hospitaldata* schema. *Patients* and *Providers* are the base tables, holding patient and healthcare-provider information respectively; the rest of the tables hold transactional data.

An **Entity Relationship Diagram (ERD)** illustrates the various tables (entities), their columns (attributes), and the relationships between these tables. An ERD helps us **understand the structure and connections in a database**.

Describing the Structure of a Table.

In PostgreSQL, the **information_schema.columns** view contains **metadata** about all the columns in the database. Querying it helps us understand the structure of a table: the table schema, column names, data types, whether columns are nullable, etc. Here’s an example query to describe the **patients** table in the *hospitaldata* schema.
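A query along these lines does the job (a sketch — select whichever metadata columns you need):

```sql
-- Describe the structure of the patients table
SELECT column_name,
       data_type,
       is_nullable
FROM information_schema.columns
WHERE table_schema = 'hospitaldata'
  AND table_name   = 'patients';
```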

Likewise, we can look at the other tables in this database. Let’s look at the *edvisits* table.
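It's the same metadata query, pointed at a different table — only the table name changes:

```sql
-- Describe the structure of the edvisits table
SELECT column_name, data_type, is_nullable
FROM information_schema.columns
WHERE table_schema = 'hospitaldata'
  AND table_name   = 'edvisits';
```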

Retrieving Data from the Database: The SELECT Statement.

Once we’re familiar with the tables and columns, the next step is to understand the data stored in those tables. To do this, we typically examine a few rows of data. The SELECT statement lets us retrieve data from the tables.

**SELECT * FROM table_name;** lets us look at the rows, and it also tells us the total number of rows in the queried table. For instance, here the **total row count is 945**:

```sql
SELECT * FROM hospitaldata.patients;
```

**Note**: In “hospitaldata.patients”, **hospitaldata** is the **schema name**, followed by the **table name** *patients*.

Filtering Data: The WHERE Clause.

The WHERE clause lets us filter data on the basis of some condition. Let’s filter rows from the **providers** table where the *providerspeciality* is ‘Surgery’.
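A minimal sketch of that filter:

```sql
-- Filter providers whose speciality is Surgery
SELECT *
FROM hospitaldata.providers
WHERE providerspeciality = 'Surgery';
```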

Sorting and Ordering Data.

Sorting data lets us organize data in a sequence and improves readability.

We can sort data in SQL using the **ORDER BY** clause.

Let’s sort the **readmissionregistry** table by *admissionid*.
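A sketch of the sort (ORDER BY defaults to ascending; add DESC for descending order):

```sql
-- Sort the readmission registry by admission id
SELECT *
FROM hospitaldata.readmissionregistry
ORDER BY admissionid;
```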

The LIMIT Clause.

In real life, as data analysts, we often work with massive datasets containing millions of rows. We can optimize the performance of a SELECT statement with a **LIMIT** clause, which restricts the number of rows returned.
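For example, a quick peek at just the first few rows might look like:

```sql
-- Return only the first 10 rows instead of scanning the whole result
SELECT *
FROM hospitaldata.patients
LIMIT 10;
```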

COUNT Function & Missing values.

We can use the **COUNT** function to get the count of rows in a table.

If we give a column name to the **COUNT** function, it will give us the count of non-NULL values in that column. Below is the query for getting the count of values in the **race** column of the **patients** table.
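A minimal version of that count query:

```sql
-- Count of non-NULL values in the race column
SELECT COUNT(race)
FROM hospitaldata.patients;
```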

The value returned is **939**, which is less than the total row count of the **patients** table, **945**. This hints that there are missing values in the *race* column.

We can validate this using the **IS NULL** condition in the **WHERE** clause. The query returned all 6 rows where the value of the **race** column is missing, i.e. NULL.
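A sketch of the validation query:

```sql
-- Rows where race is missing
SELECT *
FROM hospitaldata.patients
WHERE race IS NULL;
```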

Checking for Duplicates with the GROUP BY Clause.

Finding duplicates can be tricky at times. Five of our tables have primary keys ensuring uniqueness at the primary-key level, but we still need to check for duplicates across other columns manually.

Here is an example query that checks for duplicate records in the **patients** table based on a combination of columns:
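A query along these lines does the check:

```sql
-- Potential duplicate patients: same DOB, gender, race, and language
SELECT dateofbirth, gender, race, ptlanguage, COUNT(*) AS occurrences
FROM hospitaldata.patients
GROUP BY dateofbirth, gender, race, ptlanguage
HAVING COUNT(*) > 1;
```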

This query groups records by *dateofbirth*, *gender*, *race*, and *ptlanguage*, and counts the number of occurrences for each group. The **HAVING** clause with **count(*) > 1** returns the groups with duplicates. We get 6 rows with a count of 2; we need to investigate these records further to determine what should be done with them.

The **readmissionregistry** table has no primary key. We can use the same logic to check for duplicates there.

There are none in this table as the query returned 0 rows.

Joins

We can explore how the tables are related to each other using joins. So far we have queried single tables; with SQL joins we can query multiple related tables. The join used in the query below is an **inner** join (the default). It returns only the rows where there is a match in both tables, **patients** and *admissionsdischarges*.
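A sketch of such a join — the join key here is an assumption (I’m guessing a shared *patientid* column; check the ERD for the actual key names):

```sql
-- Inner join: only patients that have a matching admission record
SELECT p.patientid,
       a.*
FROM hospitaldata.patients AS p
INNER JOIN hospitaldata.admissionsdischarges AS a
        ON p.patientid = a.patientid;  -- assumed join key
```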

Summarizing Data

We can get a basic statistical summary with SQL using functions like AVG, MIN, MAX, STDDEV, and VARIANCE.

**MIN** and **MAX** functions let us find the **range** for a column. We can also find out if there are any extremely odd values or outliers using these functions.

**AVG** function gives the arithmetic mean. It lets us look at the central tendency of the column values.

Standard deviation and variance measure the amount of dispersion in a set of values. They tell us how far values are spread from their mean. We have the **STDDEV** and **VARIANCE** functions in **PostgreSQL**.
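A sketch pulling these functions together — the table and column here (*edvisits*.*acuity*) are assumptions based on the dataset described above, so substitute any numeric column:

```sql
-- Basic statistical summary of a numeric column (assumed: acuity in edvisits)
SELECT MIN(acuity)      AS min_acuity,
       MAX(acuity)      AS max_acuity,
       AVG(acuity)      AS avg_acuity,
       STDDEV(acuity)   AS stddev_acuity,
       VARIANCE(acuity) AS var_acuity
FROM hospitaldata.edvisits;
```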

Since variance and standard deviation can be computed for a **sample** of data or for the **population**, **PostgreSQL** also provides corresponding sample and population functions: *var_samp*, *stddev_samp*, *var_pop*, and *stddev_pop* respectively. These functions are good to have in our toolkit, letting us explore data in a meaningful, statistical way.

Summarizing and Aggregating Data

We can also calculate summary statistics for a grouped result. For instance, if we want to calculate the average **acuity** (severity) per reason for visit, we group by the *reasonforvisit* column and then take the average of *acuity* on the grouped data.
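A sketch of the grouped aggregation (assuming *reasonforvisit* and *acuity* live in the *edvisits* table):

```sql
-- Average acuity per reason for visit
SELECT reasonforvisit,
       AVG(acuity) AS avg_acuity
FROM hospitaldata.edvisits
GROUP BY reasonforvisit
ORDER BY avg_acuity DESC;
```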

Exploratory Data Analysis with SQL is a powerful way to uncover insights from your data, but this is just the beginning. Don’t restrict yourself to what’s covered here. I sincerely hope these steps help you get started with EDA in SQL. Keep exploring, and let your curiosity guide you to even deeper insights!