For a data analyst, SQL is often the best tool for retrieving data and computing basic statistics, while Python is far handier for in-depth analysis. A Jupyter notebook is one of the best tools for this because it lets us use both SQL and Python in a single place. So for analyzing large datasets, it is well worth connecting Jupyter to Postgres.
Photo by John Barkiple on Unsplash
Jupyter notebook
Jupyter notebooks are one of the most useful tools for any data scientist or data analyst. They support 40+ programming languages and provide a web-based interactive programming environment. We can put comments, headings, code, and output in one single document. The document keeps its connection to the original data source, which means we can re-execute the code whenever we need to; this lets data scientists and analysts play with the code during presentations. These notebooks are also easy to share across teams.
Prerequisites
1. Install PostgreSQL
PostgreSQL is used as the primary data store or data warehouse for many web, mobile, geospatial, and analytics applications. At the time of writing, the latest major version is PostgreSQL 12. PostgreSQL is an advanced open-source relational database that supports SQL features such as foreign keys, subqueries, triggers, and user-defined types and functions. Postgres lets you store large and sophisticated datasets safely, and it helps developers build complex applications, run administrative tasks, and create integrated environments.
We need the username, password, and port number for our connection code in the Jupyter notebook.
I am using the DVD Rental sample database for my example. It is already imported into my Postgres instance and contains multiple tables.
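If you still need to load it, a typical approach (a sketch, assuming the sample database has been downloaded as dvdrental.tar) is to create the database and restore the dump from a terminal:

# create an empty database, then restore the sample dump into it (dvdrental.tar is assumed)
createdb -U postgres dvdrental
pg_restore -U postgres -d dvdrental dvdrental.tar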
2. Install the Anaconda package
Once you have successfully installed Anaconda, open Anaconda Navigator and click on Jupyter Notebook.
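You can also launch the notebook server directly from a terminal:

jupyter notebook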
Steps:
First, we need to install the Python libraries for our task.
Psycopg2:
This is the main library we need for the connection. Psycopg2 is the most widely used PostgreSQL adapter for Python: it is what lets Python code open a connection to a Postgres database, and it is the library behind the ps alias in the connection code below.
So first we need to install it.
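If it is not already available, it can be installed from a notebook cell (a common approach; adjust for your environment):

!pip install psycopg2-binary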
Let’s import the libraries we need:
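A minimal import, using the ps alias that the connection code below assumes:

import psycopg2 as ps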
pandas.io.sql:
This module ships with pandas, so installing pandas is enough. Its read_sql_query function converts the result of a SQL query (or a whole database table) into a DataFrame.
Then let’s import it:
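Again a minimal import, using the sqlio alias that the query code below assumes:

import pandas.io.sql as sqlio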
The code below establishes the connection between Postgres and the Jupyter notebook:
conn2 = ps.connect(dbname="dvdrental",
                   user="postgres",
                   password="yourpassword",
                   host="localhost",
                   port="5432")
dbname : the name of the database you want to analyze
user : the Postgres username; "postgres" is the default
password : your Postgres password, in double quotes
host : "localhost" for a database on your local computer
port : the Postgres port number; 5432 is the default, and you can check yours in Postgres
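A quick sanity check (a sketch using psycopg2's cursor API) confirms the connection works:

# open a cursor, ask the server for its version, and print the reply
cur = conn2.cursor()
cur.execute("SELECT version();")
print(cur.fetchone())
cur.close()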
To see which tables exist in our database:
sql=""" SELECT * FROM pg_catalog.pg_tables """
Running this query lists all the tables in our database.
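Executed with the sqlio helper imported earlier, the result can also be narrowed to just the user tables (a sketch; pg_catalog.pg_tables includes the system catalogs as well):

tables = sqlio.read_sql_query(sql, conn2)
# drop the pg_catalog and information_schema rows, keeping only user tables
tables[tables["schemaname"] == "public"]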
So the main work in this task is installing Anaconda and Postgres; once that is done, we can connect to a database in Postgres and start exploring our data in just three steps.
Now let’s try an example: getting the actor ID and name from the actor table.
sql=""" SELECT actor_id, first_name, last_name, last_update FROM actor """
To convert the result to a DataFrame, we use pandas.io.sql, which we imported as sqlio. Its read_sql_query function executes the SQL command and returns a pandas DataFrame, so we can simply assign it to a variable:
df = sqlio.read_sql_query(sql, conn2)
df
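When you are done, it is good practice to release the connection:

conn2.close()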
That’s all for now! I hope your life as a data analyst or data scientist becomes slightly easier with these simple methods for using SQL inside Jupyter notebooks.