# Simple Linear Regression in Python/NumPy

So recently, I have been watching some videos about data science and came across two series of videos about regression. The first series was titled "Simple Linear Regression" and the second "Multiple Linear Regression". When I first saw them, I thought, "I thought regression was just regression, not split into two...". Now, if you want to follow along with me, go to __www.kaggle.com__ to write the code, and I will be writing the code in this blog.

-------------------------------------------------QUESTIONS---------------------------------------------------------

Now, before we get cracking, I want to answer a few short questions that may help you as you read through this blog.

What is simple linear regression?

What coding languages can simple linear regression be implemented in?

How do you use simple linear regression?

Answer to Question 1: Simple linear regression is a type of linear regression model with only a single predictor variable. It uses two-dimensional sample points with one independent variable and one dependent variable, and finds a linear function (a non-vertical straight line) that, as accurately as possible, predicts the dependent variable's values as a function of the independent variable.
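To make that definition concrete, here is a minimal NumPy sketch (with made-up numbers) of the closed-form least-squares line behind simple linear regression:

```python
import numpy as np

# Hypothetical sample data: one independent variable (x) and one dependent variable (y)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.0, 6.2, 7.9, 10.1])

# Closed-form least-squares estimates for the slope (b1) and intercept (b0)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
print(b0, b1)  # intercept ~0.09, slope ~1.99
```

The fitted line is then `y = b0 + b1 * x`; scikit-learn's `LinearRegression` (used later in this blog) computes the same kind of fit for us.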

Answer to Question 2: Python and R

Answer to Question 3: Simple linear regression is used to estimate the relationship between two variables. You can use it when you want to know how strong the relationship between two variables is, or to predict values of the dependent variable from the independent one.

Hopefully, with this information, you will be able to follow what I am going to be doing in this blog.

--------------------------------------INSTRUCTIONS+CODE-----------------------------------------------------

Instructions for using Kaggle:

Make a new account (or log back into your account)

Create a new notebook

Start writing the code from the blog as you read through the blog

Now we are going to start the instructions and code and if you want, you can follow along with me. :)

1. **Import the libraries**

First, I imported the libraries needed:

```python
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
```

2. **Import the dataset**

After that, I imported a dataset based on salary and years of experience (this might sound familiar):

```python
dataset = pd.read_csv('/content/sample_data/Salary_Data.csv')
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values
```

Now let's see what the outputs of the independent and dependent variables look like.

`print(X)`

After printing X (or the years of experience), you should have gotten:

`[[ 1.1] [ 1.3] [ 1.5] [ 2. ] [ 2.2] [ 2.9] [ 3. ] [ 3.2] [ 3.2] [ 3.7] [ 3.9] [ 4. ] [ 4. ] [ 4.1] [ 4.5] [ 4.9] [ 5.1] [ 5.3] [ 5.9] [ 6. ] [ 6.8] [ 7.1] [ 7.9] [ 8.2] [ 8.7] [ 9. ] [ 9.5] [ 9.6] [10.3] [10.5]]`

Now for y, we do the exact same thing we did with X.

`print(y)`

After printing y (or the salary), you should have gotten:

`[ 39343. 46205. 37731. 43525. 39891. 56642. 60150. 54445. 64445. 57189. 63218. 55794. 56957. 57081. 61111. 67938. 66029. 83088. 81363. 93940. 91738. 98273. 101302. 113812. 109431. 105582. 116969. 112635. 122391. 121872.]`
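If the `.iloc` column slicing above is unfamiliar, here is a minimal standalone sketch using a tiny made-up stand-in for the CSV (the variable names here are hypothetical, so it won't overwrite anything in your notebook):

```python
import pandas as pd

# A tiny made-up stand-in for Salary_Data.csv
demo = pd.DataFrame({
    'YearsExperience': [1.1, 1.3, 1.5],
    'Salary': [39343.0, 46205.0, 37731.0],
})

# .iloc[:, :-1] keeps every column except the last (a 2-D array for X);
# .iloc[:, -1] takes just the last column (a 1-D array for y)
X_demo = demo.iloc[:, :-1].values
y_demo = demo.iloc[:, -1].values
print(X_demo.shape, y_demo.shape)  # (3, 1) (3,)
```

That is why X printed with double brackets (a column of rows) while y printed as a flat list of numbers.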

3. **Splitting the dataset into the Training set and Test set**

Now, we are going to split the dataset into a training set and a test set, so the model can be trained on one portion of the data and evaluated on the other.

```python
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 1/3, random_state = 0)
```

Now let's check that each variable has a value/output.

`print(X_train)`

`[[ 2.9] [ 5.1] [ 3.2] [ 4.5] [ 8.2] [ 6.8] [ 1.3] [10.5] [ 3. ] [ 2.2] [ 5.9] [ 6. ] [ 3.7] [ 3.2] [ 9. ] [ 2. ] [ 1.1] [ 7.1] [ 4.9] [ 4. ]]`

X_train has an output. What about X_test?

`print(X_test)`

`[[ 1.5] [10.3] [ 4.1] [ 3.9] [ 9.5] [ 8.7] [ 9.6] [ 4. ] [ 5.3] [ 7.9]]`

X_test has an output a bit smaller than X_train, but what about the output of y_train?

`print(y_train)`

`[ 56642. 66029. 64445. 61111. 113812. 91738. 46205. 121872. 60150. 39891. 81363. 93940. 57189. 54445. 105582. 43525. 39343. 98273. 67938. 56957.]`

It turns out that y_train's output is just as long as X_train's. Now let's see what y_test gives us.

`print(y_test)`

`[ 37731. 122391. 57081. 63218. 116969. 109431. 112635. 55794. 83088. 101302.]`

The y_test output is just as short as X_test's.

Looking through those outputs, all four variables definitely have values, and there is a pattern: X_train and y_train have the same number of entries (20 each), and X_test and y_test do as well (10 each). That matches the `test_size = 1/3` we passed in: a third of the 30 rows (10) went to the test set, and the remaining 20 to the training set. Enough talk now; let's get cracking on the next step!
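You can verify that 20/10 pattern with a standalone sketch (the variable names here are made up so it won't clash with the notebook's variables):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Standalone sketch: 30 made-up rows, mirroring the salary dataset's size
X_demo = np.arange(30).reshape(-1, 1)
y_demo = np.arange(30)

# test_size=1/3 sends 10 of the 30 rows to the test set, 20 to training
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=1/3, random_state=0)
print(len(Xtr), len(ytr), len(Xte), len(yte))  # 20 20 10 10
```

Note that `random_state=0` fixes the shuffle, which is why everyone following along gets the exact same split as the blog.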

4. **Training the Simple Linear Regression model on the Training set**

Now we will train our simple linear regression model using a method called `.fit`. If you aren't familiar with the term, here it is in a few words: `.fit` trains the model on the training data you pass in, and the call looks like this (the blanks stand for gaps that can be filled with your model and variables):

`__________.fit(____, ______)`

So, to train the simple linear regression model, you only have to write three lines of code.

```python
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)
```

Now, after you run it along with the code from the earlier steps (running each step of code in this blog is essential), you should get an output that looks like this (newer versions of scikit-learn hide default parameters, so you may simply see `LinearRegression()` instead):

`LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)`
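Once fitted, the model stores the slope in `coef_` and the intercept in `intercept_`. Here is a minimal standalone sketch with made-up data lying exactly on the line y = 3x + 1 (the names `model`, `X_demo`, and `y_demo` are hypothetical, so it won't disturb the notebook's `regressor`):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Standalone sketch: made-up data lying exactly on the line y = 3x + 1
X_demo = np.array([[1.0], [2.0], [3.0], [4.0]])
y_demo = np.array([4.0, 7.0, 10.0, 13.0])

model = LinearRegression()
model.fit(X_demo, y_demo)

# The fitted line's slope and intercept
print(model.coef_[0], model.intercept_)  # ~3.0 and ~1.0
```

You can inspect `regressor.coef_` and `regressor.intercept_` the same way to see the line fitted to the salary data.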

Let's jump into the next step now!

5. __Predicting the Test set results__

This is the simplest step in the blog: all you need to do is write one line of code to predict the test set results. This is what that line looks like:

`y_pred = regressor.predict(X_test)`

If you print y_pred, this is the output you should get:

`[ 40835.10590871 123079.39940819 65134.55626083 63265.36777221 115602.64545369 108125.8914992 116537.23969801 64199.96201652 76349.68719258 100649.1375447 ]`

Those numbers are the salaries the model predicts for the years of experience in X_test; you can compare them with the actual salaries in y_test printed earlier.
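To show what predicting on held-out data looks like end to end, here is a standalone sketch with made-up numbers (the names `X_fit`, `y_fit`, `X_new`, and `y_actual` are hypothetical):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Standalone sketch with made-up data: fit on five points, predict two held-out ones
X_fit = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y_fit = np.array([3.1, 4.9, 7.2, 8.8, 11.1])
X_new = np.array([[6.0], [7.0]])
y_actual = np.array([13.0, 15.2])

model = LinearRegression().fit(X_fit, y_fit)
predictions = model.predict(X_new)

# Mean absolute error: the average gap between predicted and actual values
mae = np.mean(np.abs(predictions - y_actual))
print(predictions, mae)
```

A small mean absolute error means the predictions land close to the actual values, which is the same comparison you can make between y_pred and y_test above.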

Let's dive into the next step now!

6. __Visualizing the Training set results__

After that, we will write a few lines of code that create and display a plot of the training set results. The lines of code are down below:

```python
plt.scatter(X_train, y_train, color = 'red')
plt.plot(X_train, regressor.predict(X_train), color = 'blue')
plt.title('Salary vs Experience (Training set)')
plt.xlabel('Years of Experience')
plt.ylabel('Salary')
plt.show()
```

With these lines of code, you should see a plot once you run it: the red points are the actual training data, and the blue line is the fitted regression line.

Let's move on to the final step in this blog now.

7. __Visualizing the Test set results__

In the final step, we will also be creating and printing a plot based on the Test set results. The lines of code for this are:

```python
plt.scatter(X_test, y_test, color = 'red')
plt.plot(X_train, regressor.predict(X_train), color = 'green')
plt.title('Salary vs Experience (Test set)')
plt.xlabel('Years of Experience')
plt.ylabel('Salary')
plt.show()
```

Now, after running this piece of code, you should see your plot: the red points are the actual test data, and the line is the same regression line fitted on the training set (which is why we still pass X_train to plt.plot).

----------------------------------------------------ENDING---------------------------------------------------

That's all from me for this blog. I hope you enjoyed reading it!