Regression Algorithm Part 3: Polynomial Linear Regression Using R Language


What is a Polynomial Linear Regression?


Polynomial Linear Regression is similar to Multiple Linear Regression. The difference is that Multiple Linear Regression uses several different independent variables, whereas Polynomial Linear Regression uses the same variable raised to different powers.



Why is it called Linear Regression if it’s a Polynomial Regression?


Polynomial regression fits a nonlinear relationship between the value of x and the corresponding conditional mean of y, denoted E(y | x). The relationship is non-linear in x because x appears raised to different powers, but "linear" here refers to the coefficients. The coefficients are the unknowns being estimated, and the regression function E(y | x) is a linear function of them. That is why it is called Polynomial Linear Regression.
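To make the "linear in the coefficients" point concrete, here is a minimal sketch in base R. The x and y values are hypothetical toy numbers made up for illustration, not the Position_Salaries data. It treats x and x^2 as two ordinary predictor columns, solves the usual least-squares equations by hand, and checks that lm() arrives at the same coefficients.

```r
# Hypothetical toy data, made up for illustration (not the Position_Salaries file)
x <- 1:10
y <- 5 + 2 * x + 3 * x^2 + c(0.3, -0.2, 0.1, -0.4, 0.2, 0.0, -0.1, 0.3, -0.3, 0.1)

# Design matrix: an intercept column plus x and x^2 as ordinary predictors
X <- cbind(1, x, x^2)

# Ordinary least squares via the normal equations: (X'X) beta = X'y
beta_manual <- solve(t(X) %*% X, t(X) %*% y)

# The same model fitted with lm(); I() protects ^ inside a formula
fit <- lm(y ~ x + I(x^2))

# Both approaches estimate the same three coefficients
print(cbind(manual = as.numeric(beta_manual), lm = unname(coef(fit))))
```

The squared term only changes the columns of the design matrix. The estimation step is the same linear algebra used for any multiple linear regression.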

Let’s understand Polynomial Linear Regression using the Position_Salaries data set which is available on Kaggle. This data set contains 3 columns and 10 rows of different positions and salaries.


Required R package

First, install the caTools and ggplot2 packages and load them with library(). After that, you can perform the following operations.


  • Import libraries

install.packages('caTools')   
install.packages('ggplot2')
library(caTools)
library(ggplot2)

Note: If you use RStudio, packages need to be installed only once. After that, you only need to load them with library() in each new R session.


  • Importing the dataset

dataset <- read.csv('../input/polynomial-position-salary-data/Position_Salaries.csv')
dataset <- dataset[2:3]
dim(dataset)

The read.csv() function reads the CSV file, and the dim() function reports how many rows and columns the data set contains. In this data set, the Position and Level columns carry the same information, so we keep only the Level column (along with Salary). Also, the data set is very small, so we don’t split it into training and test sets.


The problem statement is that a candidate at level 6.5 says his previous salary was 160000. Before hiring the candidate for a new role, the company would like to check whether he is being honest about his last salary. To do this, we will use Polynomial Linear Regression to predict the salary for his level.


  • Apply Linear Regression to the dataset

linear <- lm(formula = Salary ~ ., data = dataset)
summary(linear)

The lm() function is used to create a Linear Regression model. If you look at the data set, we have one dependent variable, Salary, and one independent variable, Level. The notation formula = Salary ~ . means that Salary is modeled as a linear function of all the other columns in the data set, which here is just Level. The second argument, data, is the data set on which you want to train the regression model. After running the code above, your regression model is ready. If you check the summary of the regression model, look at the significance stars, the p-values, and R-squared to judge how well the linear model explains Salary.
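The numbers printed by summary() can also be pulled out programmatically, which is handy when comparing models. The sketch below uses a hypothetical toy data frame (the values are made up, not read from Position_Salaries.csv), so only the access pattern carries over.

```r
# Hypothetical toy data, made up for illustration
toy <- data.frame(
  Level  = 1:10,
  Salary = c(42, 47, 69, 101, 168, 252, 387, 549, 772, 1036)
)

fit <- lm(Salary ~ Level, data = toy)
s <- summary(fit)

# Coefficient table: Estimate, Std. Error, t value, Pr(>|t|)
print(coef(s))

# Individual quantities taken from the summary object
p_level <- coef(s)["Level", "Pr(>|t|)"]  # p-value of the Level coefficient
r2      <- s$r.squared                   # share of variance explained
adj_r2  <- s$adj.r.squared               # R-squared penalized for model size

print(c(p_value = p_level, r_squared = r2, adj_r_squared = adj_r2))
```

The significance stars in the printed summary are just a visual encoding of the Pr(>|t|) column.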


  • Apply Polynomial Regression to the dataset

dataset$Level2 <- dataset$Level^2
dataset$Level3 <- dataset$Level^3
dataset$Level4 <- dataset$Level^4
polynomial <- lm(formula = Salary ~ ., data = dataset)
summary(polynomial)

Now, the lm() function creates a Polynomial Linear Regression model, because the data set contains the extra columns Level2, Level3, and Level4. The fit to the training data generally improves as the degree of the polynomial increases, although a very high degree can overfit such a small data set. Compare the summaries of the Linear and Polynomial regression models and notice the difference.
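As an alternative to adding the power columns by hand, base R's poly() function can build them inside the formula. The sketch below is not the approach used in this post; it uses a hypothetical toy data frame (made-up values) to check that both routes give the same fitted values, and then compares adjusted R-squared across degrees 1 to 4.

```r
# Hypothetical toy data, made up for illustration
toy <- data.frame(
  Level  = 1:10,
  Salary = c(42, 47, 69, 101, 168, 252, 387, 549, 772, 1036)
)

# Route 1: add the power columns by hand, as in the post
manual <- transform(toy, Level2 = Level^2, Level3 = Level^3, Level4 = Level^4)
fit_manual <- lm(Salary ~ ., data = manual)

# Route 2: let poly() build the raw powers inside the formula
fit_poly <- lm(Salary ~ poly(Level, 4, raw = TRUE), data = toy)

# Both models produce the same fitted values
same_fit <- all.equal(unname(fitted(fit_manual)), unname(fitted(fit_poly)))
print(same_fit)

# Adjusted R-squared for degrees 1 to 4 on the toy data
adj_r2 <- sapply(1:4, function(d) {
  summary(lm(Salary ~ poly(Level, d, raw = TRUE), data = toy))$adj.r.squared
})
print(round(adj_r2, 4))
```

Adjusted R-squared is used here rather than plain R-squared because plain R-squared never decreases when a term is added, so it cannot signal that a higher degree has stopped helping.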


  • Visualize the Linear Regression results

ggplot() +
  geom_point(aes(x = dataset$Level, y = dataset$Salary), colour = 'red') +
  geom_line(aes(x = dataset$Level, y = predict(linear, dataset)), colour = 'blue') +
  ggtitle('Linear Regression') +
  xlab('Level') +
  ylab('Salary')

The blue straight line of the Linear Regression model doesn’t fit the data well: for several observations, the prediction is far from the real value.


  • Visualize the Polynomial Regression results

ggplot() +
  geom_point(aes(x = dataset$Level, y = dataset$Salary), colour = 'red') +
  geom_line(aes(x = dataset$Level, y = predict(polynomial, dataset)), colour = 'blue') +
  ggtitle('Polynomial Regression') +
  xlab('Level') +
  ylab('Salary')

The blue curve of the Polynomial Linear Regression model fits the data well: the predictions are very close to the real values.


  • Visualize the Regression Model results for higher resolution and smoother curve

# A finer grid of Level values (step 0.1) for drawing a smoother curve
x_grid <- seq(min(dataset$Level), max(dataset$Level), 0.1)
ggplot() +
  geom_point(aes(x = dataset$Level, y = dataset$Salary), colour = 'red') +
  geom_line(aes(x = x_grid, y = predict(polynomial, data.frame(Level = x_grid,
                                                               Level2 = x_grid^2,
                                                               Level3 = x_grid^3,
                                                               Level4 = x_grid^4))), colour = 'blue') +
  ggtitle('Polynomial Regression') +
  xlab('Level') +
  ylab('Salary')

Evaluating the model on a finer grid of Level values (steps of 0.1 instead of 1) gives a higher-resolution, smoother curve. The model itself does not change; only the number of points used to draw it does.
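The role of the grid can be checked without plotting. The sketch below uses a hypothetical toy data frame (made-up values) and, as an assumption, a poly() formula in place of the Level2 to Level4 columns. It evaluates one fitted model on a coarse and a fine grid and confirms that the predictions agree wherever the two grids share a Level value.

```r
# Hypothetical toy data, made up for illustration
toy <- data.frame(
  Level  = 1:10,
  Salary = c(42, 47, 69, 101, 168, 252, 387, 549, 772, 1036)
)

# One degree-4 model; poly() stands in for the hand-made power columns
fit <- lm(Salary ~ poly(Level, 4, raw = TRUE), data = toy)

# Two grids over the same range: step 1 and step 0.1
coarse <- data.frame(Level = seq(1, 10, by = 1))
fine   <- data.frame(Level = seq(1, 10, by = 0.1))

pred_coarse <- unname(predict(fit, coarse))
pred_fine   <- unname(predict(fit, fine))

# The fine grid simply has more points to connect when drawing the curve
print(c(coarse_points = length(pred_coarse), fine_points = length(pred_fine)))

# Every 10th point of the fine grid sits on a whole-number Level
shared <- pred_fine[seq(1, length(pred_fine), by = 10)]
print(all.equal(shared, pred_coarse))
```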


  • Predicting a new result with Linear Regression

predict(linear, data.frame(Level = 6.5))

This code predicts the salary associated with level 6.5 according to the Linear Regression model, but the prediction is pretty far from 160 k, so it is not an accurate prediction.


  • Predicting a new result with Polynomial Regression

predict(polynomial, data.frame(Level = 6.5,
                               Level2 = 6.5^2,
                               Level3 = 6.5^3,
                               Level4 = 6.5^4))

This code predicts the salary associated with level 6.5 according to the Polynomial Regression model, and it gives a prediction very close to 160 k.
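A single predicted number says nothing about how far a real salary could reasonably sit from it. One possible extension, sketched here on a hypothetical toy data frame with a made-up claimed salary, is to ask predict() for a prediction interval and check whether the claim falls inside it. The poly() formula is an assumption used so that the new data frame only needs a Level column.

```r
# Hypothetical toy data, made up for illustration
toy <- data.frame(
  Level  = 1:10,
  Salary = c(42, 47, 69, 101, 168, 252, 387, 549, 772, 1036)
)

fit <- lm(Salary ~ poly(Level, 4, raw = TRUE), data = toy)

# With poly() in the formula, only Level has to be supplied
candidate <- data.frame(Level = 6.5)

# 95% prediction interval: a matrix with columns fit, lwr, upr
pred <- predict(fit, candidate, interval = "prediction", level = 0.95)
print(pred)

# Hypothetical claimed salary, in the same units as the toy Salary column
claimed <- 300
plausible <- claimed >= pred[1, "lwr"] && claimed <= pred[1, "upr"]
print(plausible)
```

With only ten observations the interval can be wide, so this kind of check is better read as a rough plausibility test than as proof of honesty.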


The code is available on my GitHub account.


The previous parts of the series, part1 and part2, covered Linear Regression and Multiple Linear Regression.


If you like the blog or found it helpful please leave a clap!


Thank you.


© Numpy Ninja.