
How to connect Jupyter notebook to PostgreSQL for Data Analysis

For a data analyst, SQL is often the best tool for retrieving data and calculating basic statistics, while Python is much better suited to in-depth analysis.

A Jupyter notebook is one of the best tools for this, because it lets us use both SQL and Python in the same place. So for analyzing large datasets, it is convenient to connect Jupyter directly to Postgres.


(Photo by John Barkiple on Unsplash)

Jupyter notebook

Jupyter notebooks are among the most useful tools for any data scientist or data analyst. They support 40+ programming languages and provide a web-based interactive programming environment. We can keep comments, headings, code, and output together in a single document. The document maintains its connection to the original data source, which means we can re-execute the code whenever we need to; this lets data scientists and analysts work with the code live during presentations. Notebooks are also very handy to share, and can easily be passed around across teams.


Prerequisites

1. Install PostgreSQL

PostgreSQL is used as the primary data store or data warehouse for many web, mobile, geospatial, and analytics applications; at the time of writing, the latest major version is PostgreSQL 12. It is an advanced open-source relational database that extends standard SQL with features such as foreign keys, subqueries, triggers, and user-defined types and functions. Postgres lets you store large and sophisticated datasets safely, and it helps developers build complex applications, run administrative tasks, and protect data integrity.

  • We need the username, password, and port number for the connection code in our Jupyter notebook.

  • I am using the DVD Rental sample database for this example; it is already imported into my Postgres instance and contains multiple tables.


2. Install the Anaconda package

Once you have successfully installed Anaconda, it will open the Anaconda Navigator home page; click on Jupyter Notebook to launch it.

Steps:

  • First, we need to install the Python libraries required for this task.

psycopg2:

This is the main library we need for the connection. psycopg2 is the most widely used PostgreSQL adapter for Python; it lets us open a connection to the database and execute SQL queries from the notebook.

So first we need to install the library.
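For example, it can be installed directly from a notebook cell (a sketch; psycopg2-binary is the precompiled variant, which is usually the easiest to install):

!pip install psycopg2-binary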



Let’s import the libraries we need:
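A minimal import cell might look like this (the alias ps is an assumption made here so that it matches the connection code later in this post):

import psycopg2 as ps  # PostgreSQL adapter for Python
import pandas as pd    # for handling query results as DataFrames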



pandas.io.sql:

We use the pandas.io.sql module to convert the result of a SQL query (or a whole database table) into a pandas DataFrame. It ships with pandas itself, so no separate installation is needed.


Then let’s import it
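Importing it under the alias sqlio, which the query code below assumes:

import pandas.io.sql as sqlio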


  • The code below establishes the connection between Postgres and the Jupyter notebook:

conn2 = ps.connect(dbname="dvdrental",
                   user="postgres",
                   password="yourpassword",
                   host="localhost",
                   port="5432")

dbname : the name of the database you want to analyze

user : the username for Postgres; "postgres" is the default

password : your Postgres password, in double quotes

host : "localhost" for your local computer

port : you can check your port number in Postgres; 5432 is the default
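As a quick sanity check (a sketch, assuming the connection above succeeded), we can ask the server for its version through a cursor:

cur = conn2.cursor()              # open a cursor on the connection
cur.execute("SELECT version();")  # run a trivial query
print(cur.fetchone())             # prints the PostgreSQL version string
cur.close()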


  • To see all the tables in our database:

sql = """SELECT * FROM pg_catalog.pg_tables"""

Running this query lists every table in the database, including the Postgres system tables.
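For example, we can run the query through sqlio and filter out the system catalogs (a sketch, assuming conn2 and the imports above; pg_tables has a schemaname column we can filter on):

tables = sqlio.read_sql_query(sql, conn2)
tables[tables["schemaname"] == "public"]  # keep only the user tables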

So the main work in this task is installing Anaconda and Postgres; once that is done, we can connect to a Postgres database in just three steps and start exploring our data.

Now let’s try an example: fetching the actor ID, name, and last update from the actor table.

sql = """SELECT actor_id, first_name, last_name, last_update FROM actor"""


To convert the result to a DataFrame, we use sqlio from pandas.io.sql to execute the SQL command. It returns a pandas DataFrame, so we can simply assign it to a variable:

df = sqlio.read_sql_query(sql, conn2)

df
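Finally, once the analysis is done, it is good practice to release the connection; df.head() is just the usual quick look at the first rows:

df.head()      # inspect the first five rows of the DataFrame
conn2.close()  # close the database connection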

That’s all for now! I hope your life as a data analyst or data scientist becomes a little easier with these simple steps for using SQL inside Jupyter notebooks.

