Introduction to interpolation
Interpolation is one way to fill null values. Before looking at how it works, consider why it is needed. Suppose we record the temperature, blood pressure, and pulse rate of a patient in the ICU. Blood pressure is taken once an hour, temperature once every half hour, and pulse rate is monitored continuously. In the collected data, the blood pressure column will therefore be null at every half-hour mark. This is a simple case with only a few ICU measurements. When the dataset is large and contains many null values, we need a systematic way to fill them so that the analysis is not distorted. Interpolation is one such method.
Interpolation is a technique for estimating unknown data points that lie between known data points. In pandas, it is mostly used to impute missing values in a DataFrame or Series during pre-processing. Filling missing values with the average is not always the best choice, because it ignores the neighbouring values and can reduce the accuracy of the data.
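The difference between average-based filling and interpolation can be seen in a small sketch. The temperature readings below are made up for illustration and are not from any real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical half-hourly temperature readings with two gaps
temps = pd.Series([36.0, np.nan, 37.0, np.nan, 38.0])

# Mean imputation: every gap gets the same overall average
mean_filled = temps.fillna(temps.mean())

# Linear interpolation: each gap gets a value between its neighbours
interpolated = temps.interpolate()

print(mean_filled.tolist())   # [36.0, 37.0, 37.0, 37.0, 38.0]
print(interpolated.tolist())  # [36.0, 36.5, 37.0, 37.5, 38.0]
```

The mean-filled series flattens the gradual rise, while the interpolated series keeps it.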
Interpolation can be applied to both a Series and a DataFrame, and it is most common with time series data. For example, temperature rises gradually rather than suddenly, so copying the previous reading into an empty slot would not reflect the actual trend.
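For time series with unevenly spaced readings, the 'time' method from the method list may be a better fit than the default. The following is a minimal sketch with made-up timestamps and values, assuming the series has a DatetimeIndex.

```python
import numpy as np
import pandas as pd

# Hypothetical readings taken at uneven intervals: 10:00, 10:30 and 12:00
times = pd.to_datetime(["2024-01-01 10:00", "2024-01-01 10:30", "2024-01-01 12:00"])
temps = pd.Series([36.0, np.nan, 38.0], index=times)

# 'linear' ignores the index and treats the rows as equally spaced
by_position = temps.interpolate(method="linear")

# 'time' weights the estimate by the actual elapsed time
by_time = temps.interpolate(method="time")

print(round(float(by_position.iloc[1]), 2))  # 37.0
print(round(float(by_time.iloc[1]), 2))      # 36.5
```

The 10:30 reading is only a quarter of the way from 10:00 to 12:00, so the time-based estimate stays closer to the 10:00 value.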
In pandas, interpolation is done with the interpolate() method. It returns the same type of object as the input (a Series or a DataFrame). The method accepts several parameters that control its behaviour. Its details are as follows:
Syntax: DataFrame.interpolate(method='linear', axis=0, limit=None, inplace=False, limit_direction='forward', limit_area=None, downcast=None, **kwargs)
Parameters :
method : {'linear', 'time', 'index', 'values', 'nearest', 'zero', 'slinear', 'quadratic', 'cubic', 'barycentric', 'krogh', 'polynomial', 'spline', 'piecewise_polynomial', 'from_derivatives', 'pchip', 'akima'}
axis : 0 fill column-by-column and 1 fill row-by-row.
limit : Maximum number of consecutive NaNs to fill. Must be greater than 0.
limit_direction : {'forward', 'backward', 'both'}, default 'forward'. If limit is specified, consecutive NaNs will be filled in this direction.
limit_area : {None, 'inside', 'outside'}, default None. None: no fill restriction. 'inside': only fill NaNs surrounded by valid values (interpolate). 'outside': only fill NaNs outside valid values (extrapolate).
inplace : Update the NDFrame in place if possible.
downcast : Downcast dtypes if possible.
kwargs : keyword arguments to pass on to the interpolating function.
Returns : Series or DataFrame of same shape interpolated at the NaNs
The syntax above shows the default value of each parameter.
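The effect of limit and limit_area is easier to see on a small series. This is an illustrative sketch with made-up values, using the default linear method and forward direction.

```python
import numpy as np
import pandas as pd

# Leading NaN, a run of three NaNs between valid values, and a trailing NaN
s = pd.Series([np.nan, 1.0, np.nan, np.nan, np.nan, 5.0, np.nan])

# limit=1: fill at most one NaN in each consecutive run (forward direction)
limited = s.interpolate(limit=1)

# limit_area='inside': fill only NaNs that have valid values on both sides
inside_only = s.interpolate(limit_area="inside")

print(limited.tolist())      # [nan, 1.0, 2.0, nan, nan, 5.0, 5.0]
print(inside_only.tolist())  # [nan, 1.0, 2.0, 3.0, 4.0, 5.0, nan]
```

With limit=1 the trailing NaN is still filled, because it is the first NaN after a valid value. With limit_area='inside' it is left alone, because no valid value follows it.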
Types of interpolation
Three types of interpolation are covered here:
Linear interpolation
Polynomial interpolation
Interpolation through padding
Linear Interpolation:
In linear interpolation, the estimated point is assumed to lie on the line joining the nearest points to the left and right.
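The following code shows linear interpolation on a series:
import pandas as pd
import numpy as np

# Series with one missing value between 3 and 5
s = pd.Series([1, 2, 3, np.nan, 5])
print(s)
# Linear interpolation is the default, so the gap is filled with 4.0
print(s.interpolate())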
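The filled value can be checked by hand with the straight-line formula y = y0 + (x - x0) * (y1 - y0) / (x1 - x0). The sketch below applies it to the same series; the variable names are chosen for illustration only.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, 3, np.nan, 5])

# Known neighbours of the gap at position 3
x0, y0 = 2, s[2]   # left neighbour: position 2, value 3.0
x1, y1 = 4, s[4]   # right neighbour: position 4, value 5.0
x = 3

# Straight line through (x0, y0) and (x1, y1), evaluated at x
by_hand = y0 + (x - x0) * (y1 - y0) / (x1 - x0)

print(by_hand)             # 4.0
print(s.interpolate()[3])  # 4.0
```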
The following code shows interpolation on a dataframe:
import pandas as pd

# Create a dataframe with missing values in every column
df = pd.DataFrame({"A": [12, 4, 7, None, 2],
                   "B": [None, 3, 57, 3, None],
                   "C": [20, 16, None, 3, 8],
                   "D": [14, 3, None, None, 6]})
print(df)

# Fill the missing values column by column (linear, forward direction)
print(df.interpolate())
As noted in the parameter details above, interpolation runs in the forward direction by default. This is why B0 (row 0 of column B) is still null after the interpolation: there is no earlier value in the column to interpolate from.
If we set limit_direction to 'both', the interpolation is done in both directions.
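# Fill in both directions, so the leading NaN in column B is filled as well
df.interpolate(limit_direction='both')
# Fill backward only, so the trailing NaN in column B stays empty
df.interpolate(limit_direction='backward')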
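To compare the three directions side by side, the sketch below uses a standalone series with the same values as column B, which has a NaN at each end.

```python
import numpy as np
import pandas as pd

# Same values as column B: NaN at both ends
b = pd.Series([np.nan, 3, 57, 3, np.nan])

forward = b.interpolate(limit_direction="forward")
backward = b.interpolate(limit_direction="backward")
both = b.interpolate(limit_direction="both")

print(forward.tolist())   # [nan, 3.0, 57.0, 3.0, 3.0]
print(backward.tolist())  # [3.0, 3.0, 57.0, 3.0, nan]
print(both.tolist())      # [3.0, 3.0, 57.0, 3.0, 3.0]
```

The linear method does not extrapolate a trend past the ends of the data, so an edge NaN that gets filled simply receives the nearest valid value.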
We can also interpolate a single column:
df['C'].interpolate()
We can interpolate along columns or along rows. The axis parameter controls this. By default, values are filled column by column; to fill row by row, set axis to 1 explicitly.
df.interpolate(axis=1)
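As a quick check of what the axis changes, the sketch below rebuilds the same dataframe and looks at the B0 cell under both settings.

```python
import pandas as pd

df = pd.DataFrame({"A": [12, 4, 7, None, 2],
                   "B": [None, 3, 57, 3, None],
                   "C": [20, 16, None, 3, 8],
                   "D": [14, 3, None, None, 6]})

by_column = df.interpolate(axis=0)  # default: uses values above and below
by_row = df.interpolate(axis=1)     # uses values to the left and right

print(by_column.loc[0, "B"])  # nan (no earlier value in column B)
print(by_row.loc[0, "B"])     # 16.0 (halfway between A0 = 12 and C0 = 20)
```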
Polynomial Interpolation:
Linear interpolation can introduce error when the data does not follow a straight line between the known points. Polynomial interpolation, shown here on a series, can be more precise in such cases.
In mathematics, a polynomial is an expression consisting of indeterminates (also called variables) and coefficients, that involves only the operations of addition, subtraction, multiplication, and non-negative integer exponentiation of variables. The value of the largest exponent is called the degree of the polynomial. Polynomial of degree 1 is called a linear polynomial. So, linear interpolation is a polynomial interpolation with degree 1.
If a set of data contains n known points, then there exists exactly one polynomial of degree n-1 or smaller that passes through all of those points. The polynomial's graph can be thought of as "filling in the curve" to account for data between the known points. This methodology, known as polynomial interpolation, often (but not always) provides more accurate results than linear interpolation.
import pandas as pd
import numpy as np

s = pd.Series([0, 1, np.nan, 3, 4, 5, 7])
# method='polynomial' requires an order and needs SciPy to be installed
s.interpolate(method='polynomial', order=3)
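The statement that n points determine a polynomial of degree n-1 can be illustrated without pandas. The sketch below uses NumPy's polyfit and polyval instead of interpolate(method='polynomial'), so it is a stand-in for the idea rather than a view into what pandas does internally. The sample points are made up and taken from y = x**2.

```python
import numpy as np

# Three known points sampled from y = x**2; the value at x = 2 is missing
x_known = np.array([0.0, 1.0, 3.0])
y_known = np.array([0.0, 1.0, 9.0])

# n = 3 points determine a unique polynomial of degree n - 1 = 2
coeffs = np.polyfit(x_known, y_known, deg=len(x_known) - 1)
poly_estimate = np.polyval(coeffs, 2.0)

# Linear interpolation uses only the two neighbours (1, 1) and (3, 9)
linear_estimate = np.interp(2.0, x_known, y_known)

print(round(float(poly_estimate), 6))    # 4.0
print(round(float(linear_estimate), 6))  # 5.0
```

Because the data here really is quadratic, the polynomial estimate recovers the true value of 4, while the linear estimate overshoots to 5. On noisy data the comparison can go the other way.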
Interpolation with padding:
Interpolation through padding means copying the value just before a missing entry into that entry. With padding you can optionally specify a limit, which is the maximum number of consecutive NaNs the method will fill. Padding always works in the forward direction.
# Newer pandas releases deprecate method='pad' here; df.ffill() performs the same forward fill
df.interpolate(method='pad')
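The limit can be sketched on a small series. This example uses Series.ffill(), which performs the same forward fill as padding, on the assumption that the installed pandas version may warn about or reject method='pad' inside interpolate(). The values are made up.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, np.nan, 5.0])

# Copy the last valid value forward, but fill at most two NaNs per gap
padded = s.ffill(limit=2)

print(padded.tolist())  # [1.0, 1.0, 1.0, nan, 5.0]
```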
Conclusion:
We have learnt different types of interpolation for imputing null values.