Gradient Descent — Introduction and Implementation in Python

Introduction

Gradient Descent is an optimization algorithm in machine learning used to minimize a function by iteratively moving towards the minimum value of the function.

We basically use this algorithm when we have to find the least possible values that can satisfy a given cost function. In machine learning, more often that not we try to minimize loss functions (like Mean Squared Error). By minimizing the loss function , we can improve our model, and Gradient Descent is one of the most popular algorithms used for this purpose.

Gradient descent for a cost function

The graph above shows how exactly a Gradient Descent algorithm works.

We first take a point in the cost function and start moving in steps towards the minimum point. The size of that step, or how quickly we have to converge to the minimum point is defined by Learning Rate. We can cover more area with higher learning rate but at the risk of overshooting the minima. On the other hand, small steps/smaller learning rates will consume a lot of time to reach the lowest point.

Now, the direction in which algorithm has to move (towards minimum) is also important. We calculate this by the use of derivatives. You must be familiar with derivatives from calculus. A derivative is basically calculated as the slope of the graph at any particular point. We get that by finding the tangent line to the graph at that point. The more steep the tangent, would mean that more steps would be needed to reach minimum point, less steep would mean lesser steps are required to reach the minimum point.

Let’s take an example.

Cost function f(x) = x³-4x²+6

Derivative of f(x) [x_derivative], f’(x) = 3x²-8x (This will give the slope of any point x along f(x) )

Start with any value of x. Lets say 0.5 and learning_rate = 0.05

Iterate over a number of times and keep calculating value of x.

x = x – (x_derivative * learning_rate)

Notice the Negative Sign, this is so that we keep moving towards the minimum point and not the other way.

x = 0.5 – (-3.25*0.05) = 0.6625

Use x = 0.6625 in second iteration

x = 0.6625+(3.983*0.05) = 0.86165

and so on until we stop seeing any change in the value of x.

The number of iterations can be fixed and given by the user. Or, if you have a precision in mind (~0.001). You can stop calculating once you reach this value of precision.

We will explore both the approaches.

Python Implementation

We will implement a simple form of Gradient Descent using python. Let’s take the polynomial function in the above section and treat it as Cost function and attempt to find a local minimum value for that function.

Cost function f(x) = x³- 4x²+6

Let’s import required libraries first and create f(x). Also generate 1000 values from -1 to 4 as x and plot the curve of f(x).

# Importing required libraries
import numpy as np
import matplotlib.pyplot as pltf_x = lambda x: (x**3)-4*(x**2)+6
x = np.linspace(-1,4,1000)#Plot the curve
plt.plot(x, f_x(x))
plt.show()
f(x) = x³ – 4x² + 6

Let’s find out the derivative of f(x).

d f(x)/dx = 3x² – 8x. Let’s create a lambda function in python for the derivative.

f_x_derivative = lambda x: 3*(x**2)-8*x

Let’s create a function to plot gradient descent and also a function to calculate gradient descent by passing a fixed number of iterations as one of the inputs.

def plot_gradient(x, y, x_vis, y_vis):
    plt.subplot(1,2,2)
    plt.scatter(x_vis, y_vis, c = "b")
    plt.plot(x, f_x(x), c = "r")
    plt.title("Gradient Descent")
    plt.show()plt.subplot(1,2,1)
    plt.scatter(x_vis, y_vis, c = "b")
    plt.plot(x,f_x(x), c = "r")
    plt.xlim([2.0,3.0])
    plt.title("Zoomed in Figure")
    plt.show()
    
def gradient_iterations(x_start, iterations, learning_rate):
    
    # These x and y value lists will be used later for visualization.
    x_grad = [x_start]
    y_grad = [f_x(x_start)]
    # Keep looping until number of iterations
    for i in range(iterations):
        
        # Get the Slope value from the derivative function for x_start
        # Since we need negative descent (towards minimum), we use '-' of derivative
        x_start_derivative = - f_x_derivative(x_start)
        
        # calculate x_start by adding the previous value to 
        # the product of the derivative and the learning rate calculated above.
        x_start += (learning_rate * x_start_derivative)        
        
        x_grad.append(x_start)
        y_grad.append(f_x(x_start))print ("Local minimum occurs at: {:.2f}".format(x_start))
    print ("Number of steps: ",len(x_grad)-1)
    plot_gradient(x, f_x(x) ,x_grad, y_grad)

Now that we have defined these functions let’s call gradient_iterations functions by passing x_start = 0.5, iterations = 1000, learning_rate = 0.05

gradient_iteration(0.5, 1000, 0.05)
gradient_iteration(0.5, 1000, 0.05)

We are able to find the Local minimum at 2.67 and as we have given the number of iterations as 1000, Algorithm has taken 1000 steps. It might have reached the value 2.67 at a much earlier iteration. But since we don’t know at what point will our algorithm reach the local minimum with the given learning rate, we give a high value of iteration just to be sure that we find our local minimum.

This doesn’t sound to be very optimal because of the unnecessary number of loop iterations even after it has found the local minimum.

Let’s take another approach of fixing the number of iterations by using precision.

In this approach , Since we know the dataset, we can define the level of precision that we want and stop the algorithm once we reach that level of precision.

For this example let’s write a new function which takes precision instead of iteration number.

def gradient_precision(x_start, precision, learning_rate):
    
    # These x and y value lists will be used later for visualisation.
    x_grad = [x_start]
    y_grad = [f_x(x_start)]while True:
        
        # Get the Slope value from the derivative function for x_start
        # Since we need negative descent (towards minimum), we use '-' of derivative
        x_start_derivative = - f_x_derivative(x_start)
        
        # calculate x_start by adding the previous value to 
        # the product of the derivative and the learning rate calculated above.
        x_start += (learning_rate * x_start_derivative)
        
        
        x_grad.append(x_start)        
        y_grad.append(f_x(x_start))
        # Break out of the loop as soon as we meet precision.
        if abs(x_grad[len(x_grad)-1] - x_grad[len(x_grad)-2]) <= precision:
            breakprint ("Local minimum occurs at: {:.2f}".format(x_start))
    print ("Number of steps taken: ",len(x_grad)-1)
    plot_gradient(x, f_x(x) ,x_grad, y_grad)

Now let’s call this function with parameters x_start = 0.5, precision = 0.001, learning rate = 0.05

gradient_precision(0.5, 0.001, 0.05)
gradient_precision(0.5, 0.001, 0.05)

Local Minimum = 2.67

Number of Steps = 20

Our gradient Descent algorithm was able to find the local minimum in just 20 steps! So, in the previous method we were unnecessarily running 980 iterations!

Now that we are able to successfully minimize f(x) i.e. find the minimum value of x for which f(x) is minimum, Let’s play around with learning rate values and see how it affects the algorithm output.

Learning rate = 0.01

gradient_precision(0.5, 0.001, 0.01)
x_start = 0.5 ,precision = 0.001 , learning rate = 0.01

Since learning rate was lesser, which means the number of steps taken to reach local minimum was higher (85). As we can see in the graph, 85 x values plotted in blue, meaning our Algorithm was slower in finding local minimum.

Learning rate = 0.05

gradient_precision(0.5, 0.001, 0.05)
x_start = 0.5 ,precision = 0.001 , learning rate = 0.05

For the same precision value and x_start value, but learning rate = 0.05, we see that our Algorithm was able to find local minimum in just 20 steps. This shows that by increasing learning rate , the algorithm reaches local minimum faster.

Learning rate = 0.14

gradient_precision(0.5, 0.001, 0.14)
x_start = 0.5 ,precision = 0.001 , learning rate = 0.14

By increasing the learning rate to 0.14, the Algorithm was able to find local minimum in just 6 steps.

Hold up! Don’t fall into the trap that increasing learning rate will always reduce the number of iterations the algorithm takes to find the local minimum. Let’s just increase the learning rate by 0.01 and see the results.

x_start = 0.5 ,precision = 0.001 , learning rate = 0.15

Whoops! the number of steps taken increased this time!

Looks like learning rate = 0.14 is the sweet spot for precision = 0.001.

One thing to be noted is that this implementation will work for cases where the Cost function has only one variable x. In case of multiple variables (x,y,z….) The implementation will change and probably will post it in another article. Also There are different types of Gradient Descent as well

Batch Gradient Descent
Stochastic Gradient Descent
Mini Batch Gradient Descent

We shall see in depth about these different types of Gradient Descent in further posts.


That’s it for this post !. The jupyter notebook for this is in my github.

I do write at medium as well. Please follow me @fahadanwar10 .

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s