Beginner's Guide to Simple Linear Regression in Python
Linear regression is a linear approximation of the relationship between two or more variables. First, let's understand what a variable is. A variable is an element, feature, or factor that can vary or change. Here we generally have two types of variables: x and y. We usually treat x as the independent variable and y as the dependent variable. This means the values of x are given as inputs, while the values of y are modeled as a function of the values of x. The aim of linear regression is to estimate this mapping function.
Linear Regression is broadly categorized into three types:
Simple Linear Regression
Multiple Linear Regression
Polynomial Linear Regression
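For reference, the three types are commonly written as shown below. This is only a sketch of the general forms: it reuses the m (slope) and c (intercept) notation of the basic formula in this post, and the subscripted coefficients m_1, ..., m_n are labels introduced here for illustration.

```latex
\begin{align*}
\text{Simple:} \quad & y = m x + c \\
\text{Multiple:} \quad & y = m_1 x_1 + m_2 x_2 + \dots + m_n x_n + c \\
\text{Polynomial:} \quad & y = m_1 x + m_2 x^2 + \dots + m_n x^n + c
\end{align*}
```

Note that the polynomial form is still called linear because it is linear in the coefficients, even though it is not linear in x.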
Here, we are going to talk about Simple Linear Regression and implement a simple linear model using Python.
Simple Linear Regression is the linear approximation of the relationship between two variables. We have one variable as x and another as y, and we estimate the value of y from the value of x. For example: predicting house price (y) from area (x), predicting the number of surviving fish (y) from water temperature (x), and many more. It can be represented by a linear equation of the form y = m*x + c + e, where e is a random error term.
Let's not focus on the random error term e, also referred to as the residual error, for now. The residual is the difference between the actual value and the value predicted by the linear regression model. I will talk more about it in later blogs. For now, the basic formula is:
y = m*x + c
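To make the formula concrete, here is a minimal sketch in plain Python. The slope m, the intercept c, and the (x, y) points are made-up values chosen by hand for illustration; they are not fitted to or taken from the dataset used in this post. The helper name predict is also just an illustrative choice.

```python
# Illustrative values only: m and c are picked by hand, not fitted to any data.
m = 2.0  # slope: change in y for a one-unit change in x
c = 1.0  # intercept: value of y when x is 0


def predict(x, m, c):
    """Return the y value on the line y = m*x + c for a given x."""
    return m * x + c


# Made-up (x, y) observations, used only to demonstrate residuals.
xs = [1.0, 2.0, 3.0]
ys = [3.5, 4.0, 7.5]

# Predicted values on the line, and residuals (actual minus predicted).
predictions = [predict(x, m, c) for x in xs]
residuals = [y - y_hat for y, y_hat in zip(ys, predictions)]

print(predictions)  # [3.0, 5.0, 7.0]
print(residuals)    # [0.5, -1.0, 0.5]
```

A positive residual means the actual point lies above the line, and a negative one means it lies below. Fitting a regression model amounts to choosing m and c so that these residuals are, overall, as small as possible.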
Now, let's dive straight into implementing simple linear regression. The dataset I will be using consists of two columns: High School GPA and SAT Score. By the end of this blog, we will have a linear model of the relationship between high school GPA and SAT score. In other words, we will predict the SAT score (y variable) from the high school GPA (x variable) of students.
Modules Used
scikit-learn (for linear regression)
Pandas (to load and manipulate the data)
Matplotlib (to visualize the linear model)
Importing the required libraries
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
Importing the dataset
df = pd.read_csv('gpa.csv')
Let's check the first 5 rows of the dataset
df.head()
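If you don't have gpa.csv at hand, the same loading and inspection steps can be tried on a small stand-in. The sketch below is only an illustration: the column names GPA and SAT and all the values are assumptions invented here, not the contents of the real file. Besides head(), it also checks the shape and counts missing values, which is a common sanity check before modelling.

```python
import io

import pandas as pd

# Stand-in for gpa.csv: column names and values are invented for illustration.
csv_text = """GPA,SAT
3.2,1650
3.5,1720
2.9,1580
3.8,1810
3.0,1600
3.6,1750
"""

# read_csv accepts a file-like object, so StringIO can play the role of the file.
df = pd.read_csv(io.StringIO(csv_text))

first_rows = df.head()       # first 5 rows by default
n_rows, n_cols = df.shape    # (number of rows, number of columns)
missing = df.isna().sum()    # count of missing values per column

print(first_rows)
print(n_rows, n_cols)
print(missing)
```

With the real file, replace the StringIO object with the path 'gpa.csv' and use whatever column names df.head() actually shows.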