Simple Linear Regression in Python/NumPy

Recently, I have been watching some videos about data science and came across two series of videos about regression. The first series was titled "Simple Linear Regression" and the second was "Multiple Linear Regression". When I first saw them, my reaction was, "I thought regression was just regression and not separated into two...". If you want to follow along with me, go to www.kaggle.com to write the code; I will be writing the code in this blog.

-------------------------------------------------QUESTIONS---------------------------------------------------------

Before we get cracking, I want to answer a few short questions that could help you through this blog.

  1. What is simple linear regression?

  2. What coding languages can you use for simple linear regression?

  3. How do you use simple linear regression?

Answer to Question 1: Simple linear regression is a linear regression model with a single explanatory variable. It takes two-dimensional sample points, with one independent variable and one dependent variable, and finds a linear function (a non-vertical straight line) that predicts the dependent variable's values, as accurately as possible, as a function of the independent variable.
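To make the idea of a best-fitting line concrete, here is a minimal sketch that computes the slope and intercept with the standard least-squares formulas. The five sample points are made up for illustration (they lie exactly on y = 2x + 1), and the variable names are assumptions of this sketch, not something from the videos.

```python
import numpy as np

# Made-up sample points that lie exactly on the line y = 2x + 1
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.0, 5.0, 7.0, 9.0, 11.0])

# Least-squares estimates for the line y = intercept + slope * x
x_mean, y_mean = x.mean(), y.mean()
slope = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
intercept = y_mean - slope * x_mean

print(slope, intercept)  # 2.0 1.0
```

With real, noisy data the points will not sit exactly on the line, but the same two formulas still give the line with the smallest total squared error.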

Answer to Question 2: Python and R, among others.

Answer to Question 3: Simple linear regression is used to estimate the relationship between two variables. You can use it when you want to know how strong the relationship between two variables is.
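One common way to put a number on how strong a linear relationship is, is the Pearson correlation coefficient r, which ranges from -1 to 1. The sketch below uses made-up numbers (not the dataset from later in this blog) and hypothetical variable names, just to show the idea.

```python
import numpy as np

# Made-up numbers for illustration: years of experience and salary (in thousands)
years = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
salary = np.array([40.0, 43.0, 52.0, 55.0, 63.0])

# np.corrcoef returns a 2x2 correlation matrix; the off-diagonal entry is r
r = np.corrcoef(years, salary)[0, 1]

print(round(r, 3))
```

A value of r close to 1 or -1 means the points lie close to a straight line, while a value near 0 means there is little linear relationship.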


Hopefully, this information will help you understand what I am going to be doing in this blog.

--------------------------------------INSTRUCTIONS+CODE-----------------------------------------------------

Instructions for using Kaggle:

  1. Make a new account (or log back into your account)

  2. Create a new notebook

  3. Start writing the code from the blog as you read through the blog

Now we are going to start the instructions and code and if you want, you can follow along with me. :)


1. Import the libraries

First, I imported the libraries needed:

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

2. Import the dataset

After that, I imported a dataset that was based on salary and years of experience (this might sound familiar)

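# Load the salary data (adjust the path to wherever the CSV lives in your notebook)
dataset = pd.read_csv('/content/sample_data/Salary_Data.csv')
# Every column except the last one: years of experience (independent variable)
X = dataset.iloc[:, :-1].values
# The last column: salary (dependent variable)
y = dataset.iloc[:, -1].values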

Now let's see what the independent and dependent variables look like.

print(X)

After printing X (or the years of experience), you should have gotten:

[[ 1.1]  [ 1.3]  [ 1.5]  [ 2. ]  [ 2.2]  [ 2.9]  [ 3. ]  [ 3.2]  [ 3.2]  [ 3.7]  [ 3.9]  [ 4. ]  [ 4. ]  [ 4.1]  [ 4.5]  [ 4.9]  [ 5.1]  [ 5.3]  [ 5.9]  [ 6. ]  [ 6.8]  [ 7.1]  [ 7.9]  [ 8.2]  [ 8.7]  [ 9. ]  [ 9.5]  [ 9.6]  [10.3]  [10.5]]

Now for y, we do the same exact thing we did with X.

print(y)

After printing y (or the salary), you should have gotten:

[ 39343.  46205.  37731.  43525.  39891.  56642.  60150.  54445.  64445.   57189.  63218.  55794.  56957.  57081.  61111.  67938.  66029.  83088.   81363.  93940.  91738.  98273. 101302. 113812. 109431. 105582. 116969.  112635. 122391. 121872.]

3. Splitting the dataset into the Training set and Test set

Now, we are going to split the dataset into a training set, which is used to fit the model, and a test set, which is used to check its predictions.

from sklearn.model_selection import train_test_split 
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 1/3, random_state = 0)

Now let's see if each variable has a value/output.

print(X_train)
[[ 2.9]  [ 5.1]  [ 3.2]  [ 4.5]  [ 8.2]  [ 6.8]  [ 1.3]  [10.5]  [ 3. ]  [ 2.2]  [ 5.9]  [ 6. ]  [ 3.7]  [ 3.2]  [ 9. ]  [ 2. ]  [ 1.1]  [ 7.1]  [ 4.9]  [ 4. ]]

X_train has an output. What about X_test?

print(X_test)
[[ 1.5]  [10.3]  [ 4.1]  [ 3.9]  [ 9.5]  [ 8.7]  [ 9.6]  [ 4. ]  [ 5.3]  [ 7.9]]

X_test has a shorter output than X_train, but what about the output of y_train?

print(y_train)
[ 56642.  66029.  64445.  61111. 113812.  91738.  46205. 121872.  60150.   39891.  81363.  93940.  57189.  54445. 105582.  43525.  39343.  98273.   67938.  56957.]

It turns out that y_train's output is as long as X_train's. Now let's see what y_test gives us.

print(y_test)
[ 37731. 122391.  57081.  63218. 116969. 109431. 112635.  55794.  83088.  101302.]

The y_test output looks as short as the X_test does.

After reading through those outputs, it is clear that all four variables have an output, and I also noticed a pattern: X_train and y_train each contain 20 values, and X_test and y_test each contain 10. That matches test_size = 1/3, because one third of the 30 rows goes to the test set and the rest goes to the training set. Enough talk now, let's get cracking on the next step!
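To see where the 20 and 10 come from, here is a NumPy-only sketch of the idea behind a shuffled split. It is not scikit-learn's implementation and will not pick the same rows as train_test_split; the variable names and the seed are assumptions made for this illustration.

```python
import numpy as np

n_rows = 30                       # the salary dataset has 30 rows
n_test = round(n_rows * 1 / 3)    # one third of the rows: 10 for the test set

# Shuffle the row indices with a fixed seed so the split is repeatable
rng = np.random.default_rng(0)
shuffled = rng.permutation(n_rows)

test_idx = shuffled[:n_test]      # first 10 shuffled indices
train_idx = shuffled[n_test:]     # remaining 20 indices

print(len(train_idx), len(test_idx))  # 20 10
```

Because the same row indices are used for both X and y, each years-of-experience value stays paired with its salary, which is why the X and y lengths always match.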

4. Training the Simple Linear Regression model on the Training set

Now we will be training our simple linear regression model by using a method called .fit. If you aren't familiar with the term, I will define it in a few words. The .fit method trains your simple linear regression model on the variables you pass to it, and the call looks like this (the blanks stand for the model and the variables):

__________.fit(____, ______)

So, to train the simple linear regression model, you only have to write three lines of code.

from sklearn.linear_model import LinearRegression 
regressor = LinearRegression() 
regressor.fit(X_train, y_train) 

Now after you run it along with the first three steps of code (running each step of code in this blog is essential), you should get an output that looks like this (newer versions of scikit-learn may show just LinearRegression() instead):

LinearRegression(copy_X = True, fit_intercept = True, n_jobs = None, normalize = False)

Let's jump into the next step now!

5. Predicting the Test set results

This is the simplest and easiest step in this blog because for this step, all you need to do is write one line of code to predict the Test set results. This is what the line of code looks like:

y_pred = regressor.predict(X_test)

If you print y_pred, this is the output you will get:

[ 40835.10590871 123079.39940819  65134.55626083  63265.36777221  115602.64545369 108125.8914992  116537.23969801  64199.96201652   76349.68719258 100649.1375447 ]

Those numbers are the salaries that the simple linear regression model predicts for the years of experience in X_test.
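As a cross-check, the same kind of least-squares line can be fitted with NumPy alone using np.polyfit. This is a sketch of the underlying calculation, not what scikit-learn runs internally. The training and test values are copied from the printed outputs above, and the name y_pred_numpy is an assumption of this sketch.

```python
import numpy as np

# Training and test values copied from the printed outputs above
X_train = np.array([2.9, 5.1, 3.2, 4.5, 8.2, 6.8, 1.3, 10.5, 3.0, 2.2,
                    5.9, 6.0, 3.7, 3.2, 9.0, 2.0, 1.1, 7.1, 4.9, 4.0])
y_train = np.array([56642., 66029., 64445., 61111., 113812., 91738., 46205.,
                    121872., 60150., 39891., 81363., 93940., 57189., 54445.,
                    105582., 43525., 39343., 98273., 67938., 56957.])
X_test = np.array([1.5, 10.3, 4.1, 3.9, 9.5, 8.7, 9.6, 4.0, 5.3, 7.9])

# Fit a degree-1 polynomial (a straight line) by least squares
slope, intercept = np.polyfit(X_train, y_train, 1)

# Predict salaries for the test set: salary = intercept + slope * years
y_pred_numpy = intercept + slope * X_test

print(slope, intercept)
print(y_pred_numpy)
```

The predictions should agree with the y_pred values above up to rounding, and slope and intercept should correspond to regressor.coef_ and regressor.intercept_ from the fitted scikit-learn model.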

Let's dive into the next step now!

6. Visualizing the Training set results

After that, we will write a few lines of code that create and show a plot based on the Training set results. The lines of code for this are below:

plt.scatter(X_train, y_train, color = 'red')
plt.plot(X_train, regressor.predict(X_train), color = 'blue')
plt.title('Salary vs Experience (Training set)')
plt.xlabel('Years of Experience')
plt.ylabel('Salary')
plt.show()

With these lines of code, you should now be able to see a plot once you run it: red dots for the training data, with the blue regression line running through them.

Let's move on to the final step in this blog now.

7. Visualizing the Test set results

In the final step, we will also be creating and showing a plot, this time based on the Test set results. The lines of code for this are:

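# Red dots: the actual test-set salaries
plt.scatter(X_test, y_test, color = 'red')
# The regression line was fitted on the training set, so it is drawn from X_train again
plt.plot(X_train, regressor.predict(X_train), color = 'green')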
plt.title('Salary vs Experience (Test set)')
plt.xlabel('Years of Experience')
plt.ylabel('Salary')
plt.show()

Now after running this piece of code, you should be able to see your plot: red dots for the test data, with the same regression line drawn in green.



----------------------------------------------------ENDING---------------------------------------------------

That's all from me for this blog. I hope you enjoyed reading it!
