
Pandas - My love returns to me

I have always loved pandas, and as a child I imagined that if I could keep one as a pet, it would stay with me forever.

Well, once I realized that was not possible, I bought a panda soft toy, a pen drive, a headrest and what not.

I never knew I would get an opportunity to work with them. Not with the live animal, of course, but with Python Pandas. :)




Introduction

Pandas is an easy-to-use yet powerful library for data analysis. As with NumPy, mathematical operations and basic operations such as changing the structure of the data can be done quickly and easily in Pandas.

Here I am going to discuss some basic but powerful functions of pandas.


1) To work with NumPy and pandas, we first have to import them as below:


import numpy as np

import pandas as pd


Just open a Kaggle notebook and load a pre-stored CSV data file in it.


2) Opening a CSV file or a text file.


housing=pd.read_csv('../input/california-housing-prices/housing.csv')


df = pd.read_csv(file_path, sep=',', header=0, index_col=False, names=None) # file_path is the path to your own file



Explanation: the 'read_csv' function has a lot of parameters, and I have specified only a few, the ones you may use most often.

A few key points:

a) header=0 means the names of the columns are in the first row of the file. If the file has no header row, you will have to specify header=None.

b) index_col=False means the first column of the data is not used as the index of the data frame. If the first column really is an index, set index_col=0 (the column position) instead; index_col=True is not accepted.

c) names=None implies you are not specifying the column names and want them to be inferred from the CSV file, which means the row given by header contains the column names. Otherwise, you can pass the names here as a list, in the same order as the columns appear in the file.

d) If you are reading a text file separated by spaces or tabs, you can simply change sep to sep=" " or sep='\t'.
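
To see these parameters working together, here is a minimal sketch. It assumes a small made-up, tab-separated text with no header row (the names raw_text, toy, id, rooms and income are only for illustration), read from memory with io.StringIO so that no real file is needed:

```python
import io

import pandas as pd

# Hypothetical tab-separated data with no header row.
raw_text = "a1\t10\t0.5\na2\t20\t1.5\na3\t30\t2.5\n"

toy = pd.read_csv(
    io.StringIO(raw_text),
    sep="\t",                         # values are separated by tabs
    header=None,                      # the first line is data, not column names
    names=["id", "rooms", "income"],  # so we supply the names ourselves
    index_col=0,                      # use the first column ("id") as the index
)

print(toy)
```

Because index_col=0 moves the first column into the index, the resulting data frame has only two regular columns.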


3) How to view the top and bottom few rows of a data frame.


For the example above, we can do it with:

a) housing.head() #By default, head gives the first 5 rows of the dataframe.


For more rows, pass the number of rows you want:

b) housing.head(10) #Returns the first 10 rows


For the bottom rows we can use:

a) housing.tail() #By default, tail gives the last 5 rows of the dataframe.


For more rows, pass the number of rows you want:

b) housing.tail(10) #Returns the last 10 rows


To look at random rows we can use:

housing.sample() #By default, this gives 1 randomly chosen row

housing.sample(5) #5 random rows
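
A small sketch of the same calls on a made-up 12-row table (toy and its values are assumptions, standing in for the housing data). It also shows two options not used above: a negative number in head, and random_state to make sample repeatable.

```python
import pandas as pd

# Toy stand-in for the housing data; the values are made up.
toy = pd.DataFrame({"total_rooms": range(100, 112), "households": range(12)})

first_five = toy.head()          # rows 0 to 4
last_three = toy.tail(3)         # rows 9 to 11
all_but_last_two = toy.head(-2)  # a negative n drops that many rows from the end

# The same random_state gives the same random rows on every run.
draw_a = toy.sample(4, random_state=0)
draw_b = toy.sample(4, random_state=0)

print(first_five)
print(last_three)
print(draw_a)
```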


4) To get the list of column names


col_nm=housing.columns.tolist()


Formatting and slicing of data


5) To select a few columns we can do:

housing[['total_rooms','households']] #returns all rows of 2 columns specified.


Now, if we want only the top 20 rows of the specified columns, we can do:


housing[['total_rooms','households']].head(20)
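
One detail worth seeing on a tiny example is the difference between single and double brackets. This is a sketch on a made-up table (toy and its values are assumptions, not the real housing data):

```python
import pandas as pd

# Toy stand-in for the housing data; the values are made up.
toy = pd.DataFrame(
    {
        "total_rooms": [880, 7099, 1467],
        "households": [126, 1138, 177],
        "population": [322, 2401, 496],
    }
)

one_col_series = toy["total_rooms"]            # single brackets give a Series
one_col_frame = toy[["total_rooms"]]           # double brackets give a DataFrame
two_cols = toy[["total_rooms", "households"]]  # a list of names selects several columns

print(type(one_col_series).__name__)
print(type(one_col_frame).__name__)
print(two_cols)
```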


6) To return the number of rows with a specified column value


housing['population'].value_counts()[1039] # returns the number of rows where population is 1039 (raises KeyError if no row has that value).


7) To return rows that satisfy multiple conditions, we can do as below:


mask=(housing['housing_median_age']==52) & (housing['ocean_proximity']=='NEAR BAY')


housing[mask] #returns the housing rows in which housing_median_age is 52 and ocean_proximity is NEAR BAY.
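
Conditions can be combined with & (and), | (or) and ~ (not), and each comparison needs its own parentheses when written inline. A minimal sketch on a made-up four-row table (toy and its values are assumptions):

```python
import pandas as pd

# Toy stand-in for the housing data; the values are made up.
toy = pd.DataFrame(
    {
        "housing_median_age": [52, 52, 30, 41],
        "ocean_proximity": ["NEAR BAY", "INLAND", "NEAR BAY", "INLAND"],
    }
)

is_52 = toy["housing_median_age"] == 52
near_bay = toy["ocean_proximity"] == "NEAR BAY"

both = toy[is_52 & near_bay]        # rows where both conditions hold
either = toy[is_52 | near_bay]      # rows where at least one holds
neither = toy[~(is_52 | near_bay)]  # rows where neither holds

print(both)
print(either)
print(neither)
```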


8) To return specified rows of specified columns by position, one can use iloc as below:

housing.iloc[[0,1,2,3,4,5,6],[5,6,7,8]] #returns the rows at positions 0-6 for the columns at positions 5, 6, 7, 8.
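
The same selection can also be written with slices, or by label with loc. This sketch assumes a made-up table with columns named c0 to c9 (hypothetical names) so the positions are easy to follow:

```python
import pandas as pd

# Toy table with 8 rows and 10 columns named c0..c9; the values are made up.
toy = pd.DataFrame({f"c{j}": range(j * 10, j * 10 + 8) for j in range(10)})

by_list = toy.iloc[[0, 1, 2, 3, 4, 5, 6], [5, 6, 7, 8]]  # explicit positions
by_slice = toy.iloc[0:7, 5:9]                            # iloc slices exclude the end
by_label = toy.loc[0:6, "c5":"c8"]                       # loc slices include the end

print(by_list)
```

All three give the same 7 rows and 4 columns; the thing to remember is that iloc slices stop before the end position while loc slices include the end label.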


9) To get the frequency of values in a series

housing['housing_median_age'].value_counts() #returns how many times each distinct value occurs, most frequent first


The count for one specific value can be read as:

housing['housing_median_age'].value_counts()[52] # Returns the number of rows for which housing_median_age is 52
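
Looking up a value that never occurs raises a KeyError, so here is a hedged sketch of two safer ways to get a single count. The series ages and its values are made up for illustration:

```python
import pandas as pd

# Made-up ages standing in for housing['housing_median_age'].
ages = pd.Series([52, 52, 30, 41, 52, 30], name="housing_median_age")

counts = ages.value_counts()

n_52 = counts[52]                      # direct lookup; works because 52 occurs
n_99 = counts.get(99, 0)               # .get returns the default when the value is absent
n_52_direct = int((ages == 52).sum())  # counting a boolean mask, no value_counts needed

print(counts)
print(n_52, n_99, n_52_direct)
```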


10) To get the rows that match specified values, one can write:


age_range=[52]

housing[housing['housing_median_age'].isin(age_range)] # Returns all rows where housing_median_age is 52


a) To exclude the specified values instead, we can use ~ to negate the condition:

age_range=[52]

housing[~housing['housing_median_age'].isin(age_range)] # Returns all rows where housing_median_age is not 52
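
isin becomes more useful once the list holds more than one value. A minimal sketch on a made-up table (toy and its values are assumptions):

```python
import pandas as pd

# Toy stand-in for the housing data; the values are made up.
toy = pd.DataFrame({"housing_median_age": [52, 30, 41, 52, 18]})

age_range = [52, 41]

kept = toy[toy["housing_median_age"].isin(age_range)]      # age is 52 or 41
dropped = toy[~toy["housing_median_age"].isin(age_range)]  # every other row

print(kept)
print(dropped)
```

Together the two results cover every row exactly once.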


11) To group-by column values and aggregate over another column


housing.groupby(['housing_median_age']).agg({'households':'mean'}) # returns mean of households grouped by housing_median_age.
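
To compute several aggregates in one go, named aggregation is one option. This is a sketch on a made-up table; toy, its values and the output column names mean_households, total_population and n_rows are all assumptions chosen for illustration:

```python
import pandas as pd

# Toy stand-in for the housing data; the values are made up.
toy = pd.DataFrame(
    {
        "housing_median_age": [52, 52, 30, 30, 30],
        "households": [100, 200, 30, 60, 90],
        "population": [300, 500, 100, 200, 300],
    }
)

# Each keyword is an output column: (input column, aggregation function).
summary = toy.groupby("housing_median_age").agg(
    mean_households=("households", "mean"),
    total_population=("population", "sum"),
    n_rows=("households", "count"),
)

print(summary)
```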


12) To group by column values and return a list of another column


housing.groupby(['housing_median_age'])['households'].agg(list) # returns the list of households values for each housing_median_age.
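
A small sketch of collecting one column into lists per group, on a made-up table (toy and its values are assumptions). Selecting the column before calling agg keeps the result to that single column:

```python
import pandas as pd

# Toy stand-in for the housing data; the values are made up.
toy = pd.DataFrame(
    {
        "housing_median_age": [52, 30, 52, 30],
        "households": [126, 177, 1138, 259],
        "population": [322, 496, 2401, 558],
    }
)

# One list of households values for each distinct age.
households_by_age = toy.groupby("housing_median_age")["households"].agg(list)

print(households_by_age)
```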


This is just the beginning, and there is more to come.

Thanks for reading!



