I have always loved pandas, and as a child I wondered whether I could keep one as a pet so that it would stay with me forever.
Once I realized that was not possible, I bought a panda soft toy, a panda pen drive, a panda headrest and what not.
I never knew I would get an opportunity to work with them; not with the live animal, of course, but with Python Pandas. :)
Introduction
Pandas is an easy-to-use yet powerful library for data analysis. As with NumPy, mathematical operations and basic operations such as changing the structure of the data can be done quickly and easily in Pandas.
Here I am going to discuss some basic but powerful functions of pandas.
1) To work with NumPy and pandas, we first have to import them as below:
import numpy as np
import pandas as pd
Just open a Kaggle notebook and load a pre-stored CSV data file in it.
2) Opening a CSV file or a text file.
housing=pd.read_csv('../input/california-housing-prices/housing.csv')
df = pd.read_csv(file_path, sep=',', header=0, index_col=False, names=None)
Explanation: The read_csv function has many parameters; I have specified only a few, the ones you are likely to use most often.
A few key points:
a) header=0 means the column names are in the first row of the file; if the file has no header row, you will have to specify header=None.
b) index_col=False means the first column of the data is not used as the index of the data frame. If the first column really is an index, set index_col=0 instead.
c) names=None means you are not specifying the column names and want them to be inferred from the CSV file, so the row given by header must contain the column names. Otherwise, you can specify the names here in the same order as the columns appear in the CSV file.
If you are reading a text file separated by spaces or tabs, you can simply change sep to sep=" " or sep='\t'.
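To see how these parameters behave together, here is a small sketch. It reads from in-memory text through io.StringIO instead of a real file, and the column names and values (id, city, price) are made up purely for illustration.

```python
import io
import pandas as pd

# Hypothetical CSV text standing in for files on disk.
csv_with_header = "id,city,price\n1,Fresno,200\n2,Oakland,350\n"
tsv_no_header = "1\tFresno\t200\n2\tOakland\t350\n"

# header=0: the first row supplies the column names.
df_a = pd.read_csv(io.StringIO(csv_with_header), sep=",", header=0)

# index_col=0: use the first column ("id") as the row index.
df_b = pd.read_csv(io.StringIO(csv_with_header), index_col=0)

# header=None with names=[...]: the file has no header row,
# so we supply the column names ourselves. sep="\t" reads tab-separated text.
df_c = pd.read_csv(
    io.StringIO(tsv_no_header),
    sep="\t",
    header=None,
    names=["id", "city", "price"],
)

print(df_a.columns.tolist())
print(df_b.index.name)
print(df_c)
```

With a real file, you would pass its path in place of the io.StringIO object.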
3) How to view the top and bottom few rows of a data frame.
In the example above we can do it by:
a) housing.head() # By default, head returns the first 5 rows of the data frame.
For more rows, or for a specific number of rows, we can do as below:
b) housing.head(10) # Returns the first 10 rows
For the bottom rows we can use:
c) housing.tail() # By default, tail returns the last 5 rows of the data frame.
For more rows, or for a specific number of rows, we can do as below:
d) housing.tail(10) # Returns the last 10 rows
For randomly chosen rows we can use:
housing.sample() # By default, returns 1 row chosen at random
housing.sample(5) # Returns 5 random rows
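If you do not have the housing file at hand, the same three functions can be tried on a tiny made-up data frame. In this sketch, the name toy and its single value column are assumptions for illustration only. The random_state argument is optional; it just makes the random draw repeatable.

```python
import pandas as pd

# A toy data frame with 10 rows, indexed 0 to 9.
toy = pd.DataFrame({"value": range(10)})

first_three = toy.head(3)   # rows with index 0, 1, 2
last_three = toy.tail(3)    # rows with index 7, 8, 9

# Two rows picked at random; fixing random_state makes the pick reproducible.
random_two = toy.sample(2, random_state=0)

print(first_three)
print(last_three)
print(random_two)
```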
4) To get the column list
col_nm=housing.columns.tolist()
Formatting and slicing of data
5) To get only a few columns we can do:
housing[['total_rooms','households']] # returns all rows of the 2 specified columns.
Now if we want only the top 20 rows of the specified columns, we can do:
housing[['total_rooms','households']].head(20)
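One detail worth seeing in action is the difference between single and double brackets: a single column name gives back a Series, while a list of names gives back a data frame. The sketch below uses a small made-up frame (the numbers are illustrative, not taken from the housing file).

```python
import pandas as pd

# Toy data with the same column names as the housing example.
toy = pd.DataFrame({
    "total_rooms": [880, 7099, 1467],
    "households": [126, 1138, 177],
    "population": [322, 2401, 496],
})

one_col = toy["total_rooms"]                     # single brackets: a Series
two_cols = toy[["total_rooms", "households"]]    # double brackets: a DataFrame
top_two = two_cols.head(2)                       # first 2 rows of those columns

print(type(one_col).__name__)
print(type(two_cols).__name__)
print(top_two)
```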
6) To return the number of rows with a specified column value
housing['population'].value_counts()[1039] # returns the number of rows where the population value is 1039.
7) To return rows that satisfy multiple conditions we can do as below:
mask=(housing['housing_median_age']==52) & (housing['ocean_proximity']=='NEAR BAY')
housing[mask] # returns the housing rows in which housing_median_age is 52 and ocean_proximity is NEAR BAY.
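Here is a small sketch of how such a mask works, on made-up rows. Each condition produces a True/False Series, and & combines them row by row (| does the same for "or"). The parentheses around each condition are needed because & binds more tightly than ==.

```python
import pandas as pd

# Toy rows; the values are illustrative only.
toy = pd.DataFrame({
    "housing_median_age": [52, 52, 30, 52, 30],
    "ocean_proximity": ["NEAR BAY", "INLAND", "NEAR BAY", "NEAR BAY", "INLAND"],
})

is_52 = toy["housing_median_age"] == 52
near_bay = toy["ocean_proximity"] == "NEAR BAY"

both = is_52 & near_bay      # True only where both conditions hold
either = is_52 | near_bay    # True where at least one condition holds

print(toy[both])
print(toy[either])
```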
8) To return specified rows for specified columns, one can use:
housing.iloc[[0,1,2,3,4,5,6],[5,6,7,8]] # returns the rows at positions 0-6 for the columns at positions 5, 6, 7, 8.
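When the positions are consecutive, as they are here, slices are a shorter way to write the same selection. The sketch below checks this on a made-up 10 x 10 frame (the column names c0 to c9 are hypothetical). Note that the end of a slice is excluded, so 0:7 means positions 0 to 6.

```python
import numpy as np
import pandas as pd

# Toy 10 x 10 frame with columns c0 ... c9.
toy = pd.DataFrame(
    np.arange(100).reshape(10, 10),
    columns=[f"c{i}" for i in range(10)],
)

by_list = toy.iloc[[0, 1, 2, 3, 4, 5, 6], [5, 6, 7, 8]]
by_slice = toy.iloc[0:7, 5:9]   # same rows and columns, written as slices

print(by_slice)
print(by_list.equals(by_slice))
```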
9) To get the frequency of values in a series
housing['housing_median_age'].value_counts() # returns, for each distinct value, the number of rows that have it
Now the count for a specific value can be seen as:
housing['housing_median_age'].value_counts()[52] # Returns the number of rows for which housing_median_age=52
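One thing to watch out for: looking up a value that never occurs in the column raises a KeyError. A minimal sketch on made-up ages shows the lookup, and also .get with a default of 0 as one possible way to handle a missing value.

```python
import pandas as pd

# Toy ages; the values are illustrative only.
ages = pd.Series([52, 52, 30, 15, 52], name="housing_median_age")

counts = ages.value_counts()   # distinct values as index, row counts as values

count_52 = counts[52]          # 52 occurs three times
count_99 = counts.get(99, 0)   # 99 never occurs; .get falls back to 0

print(counts)
print(count_52, count_99)
```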
10) To get the rows for specified values one can write
age_range=[52]
housing[housing['housing_median_age'].isin(age_range)] # Returns all rows where the value of housing_median_age is 52
a) To exclude the specified values instead, we can use:
age_range=[52]
housing[~housing['housing_median_age'].isin(age_range)] # Returns all rows where the value of housing_median_age is not 52
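isin becomes more useful once the list holds more than one value. The sketch below, on made-up ages, keeps the rows whose age is in the list and then uses ~ to keep the rest.

```python
import pandas as pd

# Toy ages; the values are illustrative only.
toy = pd.DataFrame({"housing_median_age": [52, 30, 15, 52, 41]})

age_range = [52, 30]
in_range = toy["housing_median_age"].isin(age_range)

kept = toy[in_range]       # rows whose age is 52 or 30
dropped = toy[~in_range]   # every other row

print(kept)
print(dropped)
```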
11) To group-by column values and aggregate over another column
housing.groupby(['housing_median_age']).agg({'households':'mean'}) # returns mean of households grouped by housing_median_age.
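To check the result by hand, it helps to run the same kind of aggregation on a few made-up rows. The sketch below uses named aggregation, a variation on the dictionary form above, so that two statistics get their own column names; mean_households and max_households are names chosen here for illustration.

```python
import pandas as pd

# Toy rows; the values are illustrative only.
toy = pd.DataFrame({
    "housing_median_age": [52, 52, 30, 30],
    "households": [100, 200, 50, 150],
})

# One output row per age, one output column per named statistic.
summary = toy.groupby("housing_median_age").agg(
    mean_households=("households", "mean"),
    max_households=("households", "max"),
)

print(summary)
```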
12) To group-by column values and return a list of another column
housing.groupby(['housing_median_age'])['households'].agg(list) # returns the list of households values for each housing_median_age.
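A small sketch on made-up rows shows what the grouped lists look like. Selecting the households column before calling agg(list) limits the result to that column; without the selection, agg(list) builds a list for every remaining column.

```python
import pandas as pd

# Toy rows; the values are illustrative only.
toy = pd.DataFrame({
    "housing_median_age": [52, 52, 30, 30],
    "households": [100, 200, 50, 150],
    "population": [300, 500, 120, 400],
})

# One list of households values per age.
households_lists = toy.groupby("housing_median_age")["households"].agg(list)

# Without selecting a column, every remaining column is turned into lists.
all_lists = toy.groupby("housing_median_age").agg(list)

print(households_lists)
print(all_lists)
```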
This is just the beginning, and there is more to come.
Thanks for reading!