When working on data science and machine learning projects, we spend a lot of time analyzing the data and preprocessing it to clean the dataset. Pandas is one of the most widely used open-source libraries for data science and analysis, and is often preferred for ad-hoc data manipulation. The dataset we use will quite likely contain missing data, null values, or duplicate data that we need to handle, or we might simply want to drop a column because we think the feature is not important for building the model.
In my last blog, I discussed the dropna() and fillna() functions in Pandas, which are used to deal with missing data or NaN values. As a continuation, in this blog I want to discuss two other powerful built-in functions in Pandas, drop() and drop_duplicates(), which are widely used for data preprocessing.
Let’s begin by importing the Pandas library.
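The conventional import uses the pd alias. Printing the version is optional, but it helps when comparing behaviour across pandas releases:

```python
import pandas as pd

# Show which pandas release is installed.
print(pd.__version__)
```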
Pandas : drop() function
The Pandas drop() function removes specified rows and/or columns from a DataFrame.
Syntax:
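In recent pandas releases the call has roughly the form shown in the comment below; the exact signature has changed between versions, so check the documentation for the release you have installed. The tiny DataFrame is a hypothetical one, used only to show the shape of the call:

```python
import pandas as pd

# General form (pandas 2.x; older releases may differ slightly):
# DataFrame.drop(labels=None, *, axis=0, index=None, columns=None,
#                level=None, inplace=False, errors="raise")

# Hypothetical two-column frame, just to show the call shape.
df = pd.DataFrame({"A": [1, 2], "B": [3, 4]})

# Drop the column labelled "B" and return a new DataFrame.
result = df.drop(labels=["B"], axis=1, inplace=False)
print(result)
```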
The parameters in the syntax are defined as follows:
labels : single label or list – The index or column labels to be dropped.
axis : default 0 – The orientation (rows or columns) from which the labels are dropped. With 0 (or 'index'), labels are dropped from the index (rows); with 1 (or 'columns'), they are dropped from the columns.
inplace : boolean, default False – If True, the changes are made in the DataFrame itself and nothing is returned (None). If False, the original DataFrame is not modified, and a new DataFrame with the changes (i.e. without the dropped rows/columns) is returned.
Example use-cases:
Let’s first create a sample DataFrame to test the drop() function with various parameters.
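A small DataFrame is enough for this. The column names and values here are hypothetical, chosen only for illustration:

```python
import pandas as pd

# Hypothetical sample data with a default integer index (0..3).
df = pd.DataFrame(
    {
        "A": [1, 2, 3, 4],
        "B": [10, 20, 30, 40],
        "C": ["w", "x", "y", "z"],
    }
)
print(df)
```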
1. Dropping Rows/Columns
To remove rows or columns, we can either specify the labels together with the corresponding axis, or identify rows by their index values.
To drop columns, we can either pass the labels with axis=1 or use the columns keyword; the two forms are equivalent.
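A minimal sketch of both column-dropping forms, on a hypothetical DataFrame with columns A, B and C:

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame(
    {"A": [1, 2, 3, 4], "B": [10, 20, 30, 40], "C": ["w", "x", "y", "z"]}
)

# Form 1: labels plus axis=1 (columns).
by_axis = df.drop(["B", "C"], axis=1)

# Form 2: the columns keyword, no axis needed.
by_keyword = df.drop(columns=["B", "C"])

print(by_axis)
print(by_axis.equals(by_keyword))
```

Both calls return a new DataFrame and leave df untouched, since inplace defaults to False.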
To drop rows, we pass their index values (row labels) to drop().
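A sketch of dropping rows by index value, again on a hypothetical DataFrame. Since axis=0 is the default, it does not need to be spelled out:

```python
import pandas as pd

# Hypothetical sample data with index labels 0..3.
df = pd.DataFrame(
    {"A": [1, 2, 3, 4], "B": [10, 20, 30, 40], "C": ["w", "x", "y", "z"]}
)

# Drop the rows labelled 0 and 2 (axis=0 is the default).
rows_dropped = df.drop([0, 2])

# The index keyword is an equivalent spelling.
same_result = df.drop(index=[0, 2])

# To drop by position instead, look the labels up through df.index.
by_position = df.drop(df.index[[0, 2]])

print(rows_dropped)
```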
2. Using the “inplace” parameter
If the “inplace” parameter is set to True, the DataFrame itself is modified; otherwise a modified copy is returned and the original is left unchanged.
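The difference can be sketched on a hypothetical DataFrame. Note that with inplace=True the call returns None, so its result should not be assigned back to the variable:

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({"A": [1, 2, 3], "B": [10, 20, 30], "C": ["x", "y", "z"]})

# inplace=False (the default): df is unchanged, a new DataFrame is returned.
without_c = df.drop(columns=["C"])
print(list(df.columns))
print(list(without_c.columns))

# inplace=True: df itself is modified and the call returns None.
returned = df.drop(columns=["C"], inplace=True)
print(returned)
print(list(df.columns))
```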
Pandas : drop_duplicates() function
The Pandas drop_duplicates() function removes duplicate rows from a DataFrame.
Syntax:
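In recent pandas releases the call has roughly the form shown in the comment below; as with drop(), the exact signature depends on the installed version. The small DataFrame is hypothetical and only shows the shape of the call:

```python
import pandas as pd

# General form (pandas 2.x; older releases may differ slightly):
# DataFrame.drop_duplicates(subset=None, *, keep="first",
#                           inplace=False, ignore_index=False)

# Hypothetical frame in which rows 0 and 1 are identical.
df = pd.DataFrame({"A": [1, 1, 2], "B": ["a", "a", "b"]})

# Spell out the defaults discussed in this post.
result = df.drop_duplicates(subset=None, keep="first", inplace=False)
print(result)
```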
The parameters in the syntax are defined as follows:
subset : column label or sequence of labels – The columns used for identifying duplicates. By default, all the columns are considered.
keep : {‘first’, ‘last’, False}, default ‘first’ – Determines which duplicates (if any) are kept in the DataFrame. With ‘first’, all duplicates except the first occurrence are dropped. Similarly, with ‘last’, all duplicates except the last occurrence are dropped. With False, all the duplicates are dropped.
inplace : boolean, default False – If True, the changes are made in the DataFrame itself and nothing is returned (None). If False, the original DataFrame is not modified, and a new DataFrame with the changes (i.e. without the duplicate rows) is returned.
Example use-cases:
Let’s first create a sample DataFrame to test the drop_duplicates() function with various parameters.
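The values below are hypothetical. They are chosen so that rows 0, 1 and 2 are exact duplicates, while rows 3 and 4 agree only in columns A and B:

```python
import pandas as pd

# Hypothetical sample data containing duplicates.
df = pd.DataFrame(
    {
        "A": [1, 1, 1, 2, 2],
        "B": ["a", "a", "a", "b", "b"],
        "C": [10, 10, 10, 20, 30],
    }
)
print(df)

# duplicated() flags every row that repeats an earlier one.
print(df.duplicated())
```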
1. Dropping duplicate rows
With the default settings, duplicate rows are dropped but the first occurrence of each is kept. This is why the row at index 0 is not dropped, while the other two duplicates, i.e. the rows at index 1 and 2, are dropped.
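A sketch of the default behaviour, assuming a hypothetical DataFrame in which rows 0, 1 and 2 are identical:

```python
import pandas as pd

# Hypothetical sample data: rows 0, 1 and 2 are exact duplicates.
df = pd.DataFrame(
    {
        "A": [1, 1, 1, 2, 2],
        "B": ["a", "a", "a", "b", "b"],
        "C": [10, 10, 10, 20, 30],
    }
)

# Default keep="first": the first occurrence survives, later copies go.
deduped = df.drop_duplicates()
print(deduped)
```

The surviving rows keep their original index labels (0, 3 and 4 here); passing ignore_index=True renumbers them from 0 if that is preferred.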
2. Using the “keep” parameter
When we set the “keep” parameter to False in the drop_duplicates() function, all the duplicate rows are dropped, including the first occurrence.
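The effect of the different “keep” values can be sketched on the same kind of hypothetical data:

```python
import pandas as pd

# Hypothetical sample data: rows 0, 1 and 2 are exact duplicates.
df = pd.DataFrame(
    {
        "A": [1, 1, 1, 2, 2],
        "B": ["a", "a", "a", "b", "b"],
        "C": [10, 10, 10, 20, 30],
    }
)

# keep=False: every row that has a duplicate is removed.
no_duplicates_at_all = df.drop_duplicates(keep=False)
print(no_duplicates_at_all)

# keep="last": the last occurrence survives instead of the first.
keep_last = df.drop_duplicates(keep="last")
print(keep_last)
```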
3. Using the “subset” parameter
When we use the “subset” parameter, only the listed columns are compared to identify duplicates. So rows with duplicate values in columns A and B are removed, regardless of the values in the other columns.
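A sketch using a hypothetical DataFrame in which rows 3 and 4 match in A and B but differ in C:

```python
import pandas as pd

# Hypothetical sample data: rows 3 and 4 differ only in column C.
df = pd.DataFrame(
    {
        "A": [1, 1, 1, 2, 2],
        "B": ["a", "a", "a", "b", "b"],
        "C": [10, 10, 10, 20, 30],
    }
)

# Only A and B are compared; column C is ignored when spotting duplicates.
by_subset = df.drop_duplicates(subset=["A", "B"])
print(by_subset)
```

Row 4 is dropped here even though its C value is unique, because it repeats the A and B values of row 3. All columns are still present in the result; “subset” only controls which columns are compared.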
I hope this explanation of drop() and drop_duplicates() will be useful for everyone who is working on data preprocessing tasks.
Happy Analyzing!