Python is one of the most popular and powerful programming languages of recent years. It is often used to build websites and software applications (including APIs), automate tasks, and conduct data analysis. It is a beginner-friendly programming language and very popular among developers.
What is Pandas?
Pandas is a Python library used for working with data sets. It has functions for analyzing, cleaning, exploring, and manipulating data. The name "Pandas" refers to both "Panel Data" and "Python Data Analysis"; the library was created by Wes McKinney in 2008. Pandas is a fast, powerful, flexible and easy-to-use open-source data analysis and manipulation tool, built on top of the Python programming language.
Pandas helps analyze large data sets and draw conclusions based on statistical theory. It can also clean messy data sets and make them readable and relevant. Relevant data is very important in data science.
Benefits of Pandas
Identify correlations between 2 or more data columns.
Calculate max, min and average values using functions.
Delete irrelevant rows, such as those containing wrong values, empty or null values, duplicate records or incorrectly formatted data, and thereby cleanse the data.
Represents data in highly streamlined forms, which makes it simpler to understand and analyze. Simple representation is key for data science projects.
Doing more work with less code is a prominent feature of the Pandas library. Hundreds of lines of code can often be reduced to a dozen without compromising features or functionality. This helps readability and lets one focus on the analysis.
A one-liner can perform fairly complex operations, including filtering, sorting and indexing.
Can handle large data sets efficiently, saving space and time.
Most features are customizable to suit the project or function needs.
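A few of the benefits listed above can be sketched in a handful of lines. The column names and values here are hypothetical, chosen only for illustration.

```python
import pandas as pd

# Hypothetical data for illustration only
df = pd.DataFrame({
    "hours": [1, 2, 3, 4, 5],
    "score": [52, 54, 61, 68, 75],
})

# Correlation between two numeric columns (Pearson by default)
corr = df["hours"].corr(df["score"])

# Max, min and average of a column in one call
summary = df["score"].agg(["max", "min", "mean"])

# A one-liner that filters, sorts and rebuilds the index
top = (
    df[df["score"] > 55]
    .sort_values("score", ascending=False)
    .reset_index(drop=True)
)

print(round(corr, 3))
print(summary)
print(top)
```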
Critical & Handy Functions
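Function: Usage
pd.read_csv(): Reads .csv, .tsv and .txt files
pd.read_excel(): Reads spreadsheets, including the .xls and .xlsx extensions
pd.read_json(): Reads a JSON-formatted file (.json)
df.to_csv(): Writes to a CSV (comma-separated) file
df.to_excel(): Writes to an Excel file
df.to_json(): Writes to a JSON file
df.to_sql(): Inserts into a database table
pd.__version__: Displays the pandas version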
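The read and write operations above can be tried without touching the file system. The sketch below assumes in-memory buffers and an in-memory SQLite database in place of real files and a real database server; the table name "people" and the data are hypothetical. Excel is left out because it needs an optional engine package.

```python
import io
import sqlite3

import pandas as pd

# Hypothetical data for illustration only
df = pd.DataFrame({"id": [1, 2], "name": ["Ann", "Bob"]})

# CSV round trip: to_csv() with no path returns the CSV text
csv_text = df.to_csv(index=False)
df_csv = pd.read_csv(io.StringIO(csv_text))

# JSON round trip through an in-memory buffer
json_text = df.to_json(orient="records")
df_json = pd.read_json(io.StringIO(json_text))

# Insert into a database table (in-memory SQLite) and read it back
conn = sqlite3.connect(":memory:")
df.to_sql("people", conn, index=False)
df_sql = pd.read_sql("SELECT * FROM people", conn)
conn.close()

# Installed pandas version
print(pd.__version__)
```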
Usage & Examples of Critical Functions
Many pandas and DataFrame functions are useful in data science and analysis. A few very handy ones are listed here; they can be leveraged when generating reports.
head(n) displays the top n rows of the data set.
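A minimal sketch of showing the top rows with head(), using a hypothetical ten-row frame:

```python
import pandas as pd

# Hypothetical ten-row data set
df = pd.DataFrame({"n": range(10)})

first_three = df.head(3)   # top 3 rows
default_rows = df.head()   # top 5 rows when n is omitted

print(first_three)
```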
memory_usage() returns a Series containing the memory usage of each column of the DataFrame, in bytes. It is handy when calculating the size of a table.
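Per-column memory usage can be inspected roughly as follows. The dtypes are set explicitly so the byte counts are predictable; the column names are hypothetical.

```python
import pandas as pd

# Hypothetical frame with explicit 8-byte dtypes
df = pd.DataFrame({
    "a": pd.Series([1, 2, 3], dtype="int64"),
    "b": pd.Series([1.0, 2.0, 3.0], dtype="float64"),
})

# Bytes used by each column, leaving out the index
usage = df.memory_usage(index=False)

# Approximate size of the whole table
total_bytes = usage.sum()

print(usage)
print(total_bytes)
```

For columns that hold Python objects such as strings, passing deep=True gives a more accurate figure.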
Slicing selects a subset of the data set and displays only the required rows and columns.
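The text does not say which slicing tool is meant; the sketch below assumes the iloc (position-based) and loc (label- or condition-based) accessors. The employee data is hypothetical.

```python
import pandas as pd

# Hypothetical employee data
df = pd.DataFrame({
    "Name": ["Ann", "Bob", "Cid", "Dee"],
    "Department": ["HR", "IT", "IT", "Ops"],
    "Salary": [50, 60, 70, 80],
})

# Rows by position; the end of the slice is exclusive
first_two = df.iloc[0:2]

# Rows by condition, keeping only selected columns
it_staff = df.loc[df["Department"] == "IT", ["Name", "Salary"]]

print(first_two)
print(it_staff)
```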
Converts an object to datetime format. Datetime conversion is very useful when creating reports and doing date calculations.
df_order['Sale_Date'] = pd.to_datetime(df_order['Sale_Date'])
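A self-contained version of the conversion above, with hypothetical sale dates, shows why it matters for date calculations: once the column is a datetime, date parts and differences are available directly.

```python
import pandas as pd

# Hypothetical orders with dates stored as text
df_order = pd.DataFrame({"Sale_Date": ["2023-01-15", "2023-02-20"]})

# Convert the text column to datetime
df_order["Sale_Date"] = pd.to_datetime(df_order["Sale_Date"])

# Date parts are now available through the .dt accessor
df_order["Sale_Month"] = df_order["Sale_Date"].dt.month

# Date arithmetic: days between the two sales
days_between = (df_order["Sale_Date"].iloc[1] - df_order["Sale_Date"].iloc[0]).days

print(df_order)
print(days_between)
```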
Counts the unique values of a column and displays the counts.
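The function name is not given in the text; the sketch below assumes value_counts(), which counts how often each value occurs, and shows nunique() alongside it for the number of distinct values. The department values are hypothetical.

```python
import pandas as pd

# Hypothetical department column
departments = pd.Series(["IT", "HR", "IT", "Ops", "IT"])

# Occurrences of each distinct value, most frequent first
counts = departments.value_counts()

# Number of distinct values
n_unique = departments.nunique()

print(counts)
print(n_unique)
```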
drop_duplicates() removes duplicates from the record set. Duplicate records are a common problem in tables, and this is a convenient way to get rid of them.
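Removing duplicates might look like this, with hypothetical data. The subset and keep parameters control which columns define a duplicate and which copy survives.

```python
import pandas as pd

# Hypothetical records with repeats
df = pd.DataFrame({
    "id": [1, 1, 2, 2],
    "city": ["Pune", "Pune", "Oslo", "Rome"],
})

# Drop rows that are identical in every column
deduped = df.drop_duplicates()

# Treat rows with the same id as duplicates and keep the first
by_id = df.drop_duplicates(subset="id", keep="first")

print(deduped)
print(by_id)
```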
A very important aggregation function that summarizes the data and performs arithmetic operations on it.
The available operations include sum, count, median, quantile, min, max, mean, var and std. Analytical functions such as rank and dense rank can also be used to achieve the needed results. A few operations are shown below with examples.
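The aggregation function is not named in the text; the sketch below assumes groupby() combined with agg() and rank(). The departments and salaries are hypothetical.

```python
import pandas as pd

# Hypothetical salary data
df = pd.DataFrame({
    "Department": ["IT", "IT", "HR", "HR"],
    "Salary": [60, 80, 50, 70],
})

# Several aggregations per department in one call
stats = df.groupby("Department")["Salary"].agg(["sum", "mean", "max"])

# Dense rank of each salary within its department, highest first
df["rank_in_dept"] = (
    df.groupby("Department")["Salary"].rank(method="dense", ascending=False)
)

print(stats)
print(df)
```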
The merge function combines multiple DataFrames based on a key. It works much like a JOIN statement in SQL, where a key column such as a primary key is used to combine the data. A left or right outer merge can be used when all the records of one table must be displayed even if some of them have no matching key in the other.
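The difference between the default inner merge and a left outer merge can be sketched as follows. The tables and the emp_id key are hypothetical.

```python
import pandas as pd

# Hypothetical tables sharing the key column emp_id
employees = pd.DataFrame({"emp_id": [1, 2, 3], "name": ["Ann", "Bob", "Cid"]})
salaries = pd.DataFrame({"emp_id": [1, 2, 4], "salary": [50, 60, 90]})

# Inner merge (the default): only keys present in both tables
inner = employees.merge(salaries, on="emp_id")

# Left merge: every employee is kept; a missing salary becomes NaN
left = employees.merge(salaries, on="emp_id", how="left")

print(inner)
print(left)
```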
Helps sort the data, which is very relevant in data science and analysis. The sort can be ascending or descending, and is controlled through parameters.
The default sort order is ascending; to sort in descending order, pass the parameter "ascending=False". Below is an example of the default ascending sort.
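The sketch below assumes sort_values() is the sorting function meant here, and shows both the default ascending order and the descending variant. The data is hypothetical.

```python
import pandas as pd

# Hypothetical data
df = pd.DataFrame({"name": ["Ann", "Bob", "Cid"], "salary": [70, 50, 60]})

# Default sort is ascending
asc = df.sort_values("salary")

# Descending sort
desc = df.sort_values("salary", ascending=False)

print(asc)
print(desc)
```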
The loc accessor can be used to reverse the order of the rows.
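One way to reverse rows with loc is a slice with a negative step, sketched here on a hypothetical frame:

```python
import pandas as pd

# Hypothetical data
df = pd.DataFrame({"n": [1, 2, 3]})

# A slice with step -1 walks the rows from last to first
reversed_df = df.loc[::-1]

# Optionally renumber the rows after reversing
renumbered = reversed_df.reset_index(drop=True)

print(reversed_df)
```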
The index attribute displays the index labels of the rows in the table; by default this is a range starting at 0.
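A short sketch of inspecting the index of a hypothetical three-row frame:

```python
import pandas as pd

# Hypothetical data with the default index
df = pd.DataFrame({"n": [10, 20, 30]})

# The default index is a RangeIndex from 0 to the number of rows
idx = df.index

print(idx)
print(list(idx))
```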
The drop function can be used to get rid of unwanted columns and/or rows. In the example below, the Salary column is dropped.
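Dropping the Salary column, and for comparison a row, might look like this. The employee data is hypothetical.

```python
import pandas as pd

# Hypothetical employee data
df = pd.DataFrame({
    "Name": ["Ann", "Bob"],
    "Department": ["HR", "IT"],
    "Salary": [50, 60],
})

# Drop a column; the original frame is left unchanged
no_salary = df.drop(columns=["Salary"])

# Drop a row by its index label
no_first_row = df.drop(index=[0])

print(no_salary)
print(no_first_row)
```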
The melt function unpivots a DataFrame from wide to long format, converting columns into rows.
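A small sketch of melt on a hypothetical wide table, where the quarterly columns q1 and q2 become rows:

```python
import pandas as pd

# Hypothetical wide table: one column per quarter
wide = pd.DataFrame({"name": ["Ann", "Bob"], "q1": [10, 20], "q2": [30, 40]})

# Unpivot: the former column names go into "quarter", the values into "sales"
long = wide.melt(id_vars="name", var_name="quarter", value_name="sales")

print(long)
```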
The concatenation function combines multiple DataFrames into a single one, giving one view of all the frames. New columns are added if the column names differ between the frames.
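Concatenating two hypothetical frames whose columns only partly overlap shows how the extra columns are added and filled with NaN where a frame has no data:

```python
import pandas as pd

# Hypothetical frames with different columns
a = pd.DataFrame({"id": [1, 2], "x": [10, 20]})
b = pd.DataFrame({"id": [3], "y": [30]})

# Stack the rows; ignore_index renumbers them from 0
combined = pd.concat([a, b], ignore_index=True)

print(combined)
```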
The assign function helps to add a new calculated column if needed.
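A sketch of adding a calculated column with assign, using hypothetical price and quantity columns. assign returns a new DataFrame and leaves the original untouched.

```python
import pandas as pd

# Hypothetical order lines
df = pd.DataFrame({"price": [10.0, 20.0], "qty": [3, 4]})

# New column computed from existing ones
with_total = df.assign(total=lambda d: d["price"] * d["qty"])

print(with_total)
```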
The hist() function plots a histogram for each numeric column of the data set. The area of each bar is proportional to the frequency of the data elements, and its width is equal to the class interval. In the example below, the frequency of each department is plotted.
len() is the length function; it returns the count of records in the data set.
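Counting records on a hypothetical frame; the shape attribute is shown as well because it returns the row and column counts together.

```python
import pandas as pd

# Hypothetical data
df = pd.DataFrame({"n": [1, 2, 3, 4], "m": [5, 6, 7, 8]})

# Number of records (rows)
row_count = len(df)

# Rows and columns together
rows, cols = df.shape

print(row_count)
```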
NaN is the default value pandas displays for records with no data or a null value. The fillna() function replaces null values with a meaningful custom default to improve the relevance of the output.
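Replacing nulls with per-column defaults might look like this. The data and the default values "Unknown" and 0 are hypothetical choices.

```python
import pandas as pd

# Hypothetical data with missing values
df = pd.DataFrame({"name": ["Ann", None], "salary": [50.0, None]})

# A dict gives each column its own replacement value
filled = df.fillna({"name": "Unknown", "salary": 0})

print(filled)
```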
NaN is the default value pandas displays for records with no data or a null value. The dropna() function removes rows or records containing null values, so irrelevant or unwanted records can be kept out of the output.
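A sketch of the common dropna variants on hypothetical data: dropping rows with any null, only rows that are entirely null, or rows with a null in a chosen column.

```python
import pandas as pd

# Hypothetical data with missing values
df = pd.DataFrame({
    "name": ["Ann", "Bob", None],
    "salary": [50.0, None, None],
})

# Drop rows that contain any null value
complete = df.dropna()

# Drop only rows where every value is null
any_data = df.dropna(how="all")

# Drop rows where a specific column is null
has_name = df.dropna(subset=["name"])

print(complete)
```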