Implementing Linear Regression with Gradient Descent

This post is my understanding of Linear Regression. Please feel free to comment and point out any errors you see.

The Cost Function

The cost function is used to measure the correctness of the current solution (hypothesis); the function can be, for example, the mean of the squared errors (or its square root). It is represented as

$J(\theta_0, \theta_1, \cdots, \theta_n)$

where $\theta_0$ is a constant and $\theta_1, \cdots, \theta_n$ are the parameters of the equation we are trying to solve.

In simple ML terms (I think), each parameter represents the weight of one feature in the dataset that we are using to build the model.

This is the formula for the cost function using the mean of squared differences.$^0$

$$\frac{1}{2m}\sum_{i=1}^{m} (h_\theta(x^i) - y^i)^2$$

where $h_\theta(x)$ is the hypothesis, i.e. the predicted value for $y$, and $m$ is the number of training examples.

A sample pseudocode for the mean of squared errors would be:

  • Calculate the sum of squared differences / errors between the $(X*\theta)$ vector and the $y$ vector
    • For each training example
      • Multiply the feature values $X_1$, $X_2$,..$X_n$ with their corresponding weights $\theta_1$, $\theta_2\cdots\theta_n$, and add the constant, $\theta_0$.

      • Subtract the above value from the $y$ target value of that example and square the difference.

    • Sum all the differences / errors
  • Take the mean of the differences by dividing by the number of training examples.

It can be represented like $h_\theta(x) = \theta_0+\theta_1X_1+\theta_2X_2+\cdots+\theta_nX_n$ where $n$ is the number of features we are using for the problem.
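To make this concrete, below is a minimal NumPy sketch of the hypothesis and the cost function described above. This is my own illustration; the function names, the prepended column of ones for $\theta_0$, and the tiny example data are assumptions, not part of the original post.

```python
import numpy as np

def hypothesis(X, theta):
    # h_theta(x) = theta_0 + theta_1*X_1 + ... + theta_n*X_n for every example.
    # X is an (m, n+1) matrix whose first column is all ones, so the constant
    # theta_0 is handled by the same dot product as the other weights.
    return X @ theta

def cost(X, y, theta):
    # J(theta) = 1/(2m) * sum((h_theta(x^i) - y^i)^2), the mean of squared errors.
    m = len(y)
    errors = hypothesis(X, theta) - y
    return (errors ** 2).sum() / (2 * m)

# Tiny made-up example: one feature, three training examples.
X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])          # first column of ones multiplies theta_0
y = np.array([2.0, 3.0, 4.0])
theta = np.zeros(2)
print(cost(X, y, theta))            # cost when theta_0 = theta_1 = 0
```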

Gradient Descent

We need to update the parameters $\theta_0,\theta_1, \theta_2\cdots\theta_n$ so that the cost function

$$J(\theta_0, \theta_1,\cdots,\theta_n) = \frac{1}{2m}\sum_{i=1}^{m} (h_\theta(x^i) - y^i)^2$$

can be minimized.

So, to find the values of the parameters $\theta_0,\theta_1,\theta_2\cdots\theta_n$ that minimize it, the update is performed as below

$$\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j}J(\theta_0, \theta_1,\cdots,\theta_n)$$

Where,

  • The := is the assignment operator
  • The $\theta_j$ is the parameter to update, j is the feature index number
  • The $\alpha$ is the learning rate
  • The $\frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1,\cdots,\theta_n)$ is the derivative term of the cost function; it is like the slope$^2$ of the line tangent$^1$ to the curve at the point where the current $\theta$ lies on that curve.

The update has to be performed simultaneously for every parameter $\theta$ (one per feature in the dataset), repeating until convergence, i.e. until the error given by the cost function reaches its minimum.
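As a small illustration of what "simultaneously" means here, the sketch below (my own, assuming the two-parameter hypothesis $h_\theta(x) = \theta_0+\theta_1X$ used in the next section, where the derivative terms themselves are derived) computes both gradients from the old parameter values before overwriting either of them.

```python
def one_update_step(theta_0, theta_1, X, y, alpha):
    # One gradient descent step for h(x) = theta_0 + theta_1 * x.
    m = len(y)
    # Errors are computed with the OLD theta values.
    errors = [theta_0 + theta_1 * X[i] - y[i] for i in range(m)]
    grad_0 = sum(errors) / m                            # derivative w.r.t. theta_0
    grad_1 = sum(e * x for e, x in zip(errors, X)) / m  # derivative w.r.t. theta_1
    # Only after both gradients are known do we update the parameters,
    # which is what "simultaneous update" means.
    return theta_0 - alpha * grad_0, theta_1 - alpha * grad_1

# Example: one step on three points that lie on y = 1 + 2x.
print(one_update_step(0.0, 0.0, [1.0, 2.0, 3.0], [3.0, 5.0, 7.0], alpha=0.1))
```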

Deriving the derivative term $\frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1,\cdots,\theta_n)$

$^3$To simplify the problem I am considering that we have 2 parameters, $\theta_0$ and $\theta_1$, so our hypothesis function $h_\theta(x)$ becomes $\theta_0+\theta_1X$.

Now let $g(\theta_0, \theta_1)$ be our cost function

$$g(\theta_0, \theta_1)=J(\theta_0, \theta_1) = \frac{1}{2m}\sum_{i=1}^{m} (h_\theta(x^i) - y^i)^2 = \frac{1}{2m}\sum_{i=1}^{m} (\theta_0+\theta_1X^i - y^i)^2$$

Consider $f(\theta_0, \theta_1) = h_\theta(x^i) - y^i$.

Now our equation becomes

$$g(\theta_0, \theta_1)=\frac{1}{2m}\sum_{i=1}^{m} (f(\theta_0, \theta_1)^i)^2$$

Substituting the value of $f(\theta_0, \theta_1)$, the equation becomes

$$g(f(\theta_0, \theta_1)^i)=\frac{1}{2m}\sum_{i=1}^{m} (\theta_0+\theta_1X^i - y^i)^2 \tag{1}$$

Now let us derive the partial derivative for $(1)$

$$\frac{\partial}{\partial \theta_j}g(f(\theta_0, \theta_1)^i) =\frac{\partial}{\partial \theta_j}\frac{1}{2m}\sum_{i=1}^{m} (f(\theta_0, \theta_1)^i)^2$$

Let $j$ be 0

$$\frac{\partial}{\partial \theta_0}g(f(\theta_0, \theta_1)^i) =\frac{\partial}{\partial \theta_0}\frac{1}{2m}\sum_{i=1}^{m} (f(\theta_0, \theta_1)^i)^2$$

Since we are taking the partial derivative with respect to $\theta_0$, the other variables are treated as constants; the following is similar to $\frac{\partial}{\partial x}$ of $(x^2+y)$, which is $2x$:

$$=\frac{1}{m}\sum_{i=1}^{m} f(\theta_0, \theta_1)^i =\frac{1}{m}\sum_{i=1}^{m} (h_\theta(x^i) - y^i)$$

This is because of the chain rule: when we take the derivative of a composite function like $(1)$, we need to use the formula below

$$\frac{\partial}{\partial \theta_j}g(f(\theta_0, \theta_1)) = \frac{\partial}{\partial \theta_j}g(\theta_0, \theta_1) * \frac{\partial}{\partial \theta_j} f(\theta_0, \theta_1)\tag{2}$$

In the case when $j = 0$, the partial derivative of $g$ becomes

$$\frac{\partial}{\partial \theta_0}g(\theta_0, \theta_1) =\frac{1}{\cancel{2}m}*\cancel{2}\sum_{i=1}^{m} (f(\theta_0, \theta_1)^i)^{2-1}=\frac{1}{m}\sum_{i=1}^{m} f(\theta_0, \theta_1)^i$$

and the partial derivative of $f$ becomes

$$\frac{\partial}{\partial \theta_0}f(\theta_0, \theta_1) = \frac{\partial}{\partial \theta_0}(h_\theta(x^i) - y^i)\tag{3}$$

$$\frac{\partial}{\partial \theta_0}f(\theta_0, \theta_1) = \frac{\partial}{\partial \theta_0}(\theta_0+\theta_1X^i - y^i) = 1$$

since the other variables are considered constants. That gives us

$$\frac{\partial}{\partial \theta_j}g(\theta_0, \theta_1) * \frac{\partial}{\partial \theta_j} f(\theta_0, \theta_1)=\frac{1}{m}\sum_{i=1}^{m} f(\theta_0, \theta_1)^i * 1 = \frac{1}{m}\sum_{i=1}^{m} f(\theta_0, \theta_1)^i$$

Let's start from Equation $(1)$ to perform the partial derivative when $j = 1$.

The partial derivative of $g(\theta_0, \theta_1)$ with respect to $\theta_1$ is the same as in the last derivation, but the partial derivative of $f(\theta_0, \theta_1)$ changes.

Consider $(3)$ when $j = 1$

$$\frac{\partial}{\partial \theta_1}f(\theta_0, \theta_1) = \frac{\partial}{\partial \theta_1}(h_\theta(x^i) - y^i) = \frac{\partial}{\partial \theta_1}(\theta_0+\theta_1X^i - y^i)$$

Variables other than $\theta_1$ are considered constants, so they become 0, and $\frac{\partial}{\partial \theta_1}\theta_1 = 1$, so our equation becomes

$$\frac{\partial}{\partial \theta_1}f(\theta_0, \theta_1)= 0+1* X^i-0 =X^i$$

and, according to the chain rule $(2)$, replacing the partials gives

$$\frac{\partial}{\partial \theta_1}g(f(\theta_0, \theta_1)) = \frac{\partial}{\partial \theta_1}g(\theta_0, \theta_1) * \frac{\partial}{\partial \theta_1} f(\theta_0, \theta_1) = \frac{1}{m}\sum_{i=1}^{m} f(\theta_0, \theta_1)^i * X^i = \frac{1}{m}\sum_{i=1}^{m} (\theta_0+\theta_1X^i - y^i) * X^i$$
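As a quick sanity check on these two derivative formulas (my own addition, not part of the original derivation), the analytic gradients can be compared against numerical finite-difference derivatives of the cost; the data and parameter values below are made up purely for illustration.

```python
import numpy as np

def cost(theta_0, theta_1, X, y):
    # J(theta_0, theta_1) = 1/(2m) * sum((theta_0 + theta_1*x^i - y^i)^2)
    return ((theta_0 + theta_1 * X - y) ** 2).sum() / (2 * len(y))

def analytic_gradients(theta_0, theta_1, X, y):
    m = len(y)
    errors = theta_0 + theta_1 * X - y
    # The two formulas derived above: 1/m * sum(errors) and 1/m * sum(errors * x)
    return errors.sum() / m, (errors * X).sum() / m

# Made-up data and parameter values for the check.
X = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.0, 5.0, 7.0, 9.0])
theta_0, theta_1, eps = 0.5, 0.5, 1e-6

g0, g1 = analytic_gradients(theta_0, theta_1, X, y)
num_g0 = (cost(theta_0 + eps, theta_1, X, y) - cost(theta_0 - eps, theta_1, X, y)) / (2 * eps)
num_g1 = (cost(theta_0, theta_1 + eps, X, y) - cost(theta_0, theta_1 - eps, X, y)) / (2 * eps)
print(g0 - num_g0, g1 - num_g1)  # both differences should be close to zero
```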

Now we can use the derivatives in the Gradient Descent algorithm

$$\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j}J(\theta_0, \theta_1,\cdots,\theta_n)$$

repeat until convergence {

$$\theta_0 := \theta_0 - \alpha \frac{1}{m}\sum_{i=1}^{m} (\theta_0+\theta_1X^i - y^i)$$

$$\theta_1 := \theta_1 - \alpha \frac{1}{m}\sum_{i=1}^{m} (\theta_0+\theta_1X^i - y^i)* X^i$$

}
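Putting everything together, here is a minimal Python sketch of the full algorithm for the two-parameter case. The learning rate, iteration cap, and the tolerance-based convergence test are my own choices, since the post leaves the stopping criterion open.

```python
import numpy as np

def gradient_descent(X, y, alpha=0.01, iterations=10000, tol=1e-9):
    """Fit h(x) = theta_0 + theta_1 * x by repeating the two updates above."""
    m = len(y)
    theta_0, theta_1 = 0.0, 0.0
    prev_cost = float("inf")
    for _ in range(iterations):
        errors = theta_0 + theta_1 * X - y
        # Simultaneous update: compute both gradients before changing either theta.
        grad_0 = errors.sum() / m
        grad_1 = (errors * X).sum() / m
        theta_0 -= alpha * grad_0
        theta_1 -= alpha * grad_1
        cost = (errors ** 2).sum() / (2 * m)
        if abs(prev_cost - cost) < tol:   # crude convergence test
            break
        prev_cost = cost
    return theta_0, theta_1

# Example: points that lie exactly on y = 1 + 2x, so we expect theta close to (1, 2).
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = 1.0 + 2.0 * X
print(gradient_descent(X, y))
```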

One disadvantage of Gradient Descent is that, depending on where it is initialized, it can converge to a local minimum rather than the global one. In linear regression, however, the cost function (mean of squared errors) is a convex function, i.e. it is shaped like a bowl when plotted on a graph, so there is only one minimum to find.

I tried to implement this in Python; the code can be found here.

Resources and references


  • Image from wikihow

  • $^3$ Derivation referred from here