Photo by Mika Baumeister on Unsplash
As a life of Data Analyst, We have to deal with lots of Data. As Data can be collected with
different sources . Among them CSV files are the great source for Data gathering.
CSV (Comma Separated Files ) :
A Comma Separated Values (CSV) file is a plain text file that contains a list of data. These files are often used for exchanging data between different applications. For example, databases and contact managers often support CSV files.
Load CSV File:
To read CSV using pandas, we will use read_csv function, and it’s like this:
import Pandas as Pd
df = pd.read_csv('train.csv')
df.head()
The output of execution will be following:
Opening CSV files from URL :
If a CSV is on some URL then we use this code to fetch that CSV file.
import requests
from io import StringIO
url = "https://raw.githubusercontent.com/cs109/2014_data/master/countries.csv"
headers = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:66.0) Gecko/20100101 Firefox/66.0"}
req = requests.get(url, headers=headers)
data = StringIO(req.text)
pd.read_csv(data)
Sep Parameter :
CSV file doesn’t necessarily use the comma , character for field separation, it could be any character like \t, | … etc. There is one option in read_csv to help loading CSV file with specific field separator character "Sep".
pd.read_csv('movies_title_metadata.tsv',sep='\t')
Names Parameter :
Some time when we fetch CSV files it doesn't have column name i.e First row index use as column name like this :
To overcome this problem we use "name" parameter . We can specify each column name.
pd.read_csv('movies_title_metadata.tsv',sep='\t',
names=['sn','name','release_year','rating','votes','genres'])
Outcome like this
Header Parameter :
Sometimes in CSV file column name works as a row like this.
To overcome this problem we use "header " parameter and set it as 1
pd.read_csv('test.csv', header = 1)
Usecols Paramater :
Sometimes in analysis we don't require all the columns name and that we know at the time of data loading . So to eliminate those column we use "usecols" parameter.
If we take example of above data set and we want only three column (enrollee_id , gender, education_level) at the time of data loading we can do like this
pd.read_csv('tarin.csv',usecols=
['enrollee_id','gender','education_level'])
Outcome will be
Handling Dates :
Sometimes when we load a CSV file with date as a column, the datatype of that field comes out as "object" i.e string in pandas so we can't perform date function(extract day , month ) on that column. like this
Datatype in above data frame comes as "Object". To resolve that we use parse_dates parameter
pd.read_csv('ipl_data.csv',parse_dates=['date']).info()
Loading a Huge dataset in chunks :
Sometimes we have very huge dataset that takes lots of space in memory, so we can not perform some functions in a single time . To overcome with this problem we divide our dataset in small dataset using parameter "chunksize"
For example we have this dataset with 19158 rows and we want to divide it in small dataset of size 5000 rows in one . We can do that as
dfs = pd.read_csv('tarin.csv', chunksize=5000)
To read this
for chunk in dfs:
print(chunk.shape)
(5000, 14)
(5000, 14)
(5000, 14)
(4158, 14)
So it divided whole dataset in four chunk of shape (5000,14)
Conclusion
So, here I have introduced with you how to read CSV file in pandas in short tutorial, along with some common-use parameters.