Beyond Gradient Descent

Sanskar Sharma
3 min read · Mar 12, 2022



The term gradient descent is familiar to anyone in the field of machine learning. Gradient descent is the most commonly used optimization algorithm for training machine learning models and neural networks: it determines the parameter values that minimize a loss function. Using gradient descent, we can traverse the error surface in a systematic manner to reach its minimum point.

Traversing error surface to reach minimum point [Source]

Looking at the error surface, it is evident that when the curve is steep, the gradient is large, and the parameter values receive a large update. When the curve is gentle, however, the gradient is small, and the parameters receive a small update as well.

As a result, if gradient descent is initialized on a nearly flat region of the surface, it can get stuck there for a long time: the parameters receive no substantial updates, and the model is very slow to converge.
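To make this concrete, here is a minimal sketch of vanilla gradient descent on a made-up one-dimensional loss, L(w) = (w − 3)². The loss, learning rate, and step count are illustrative choices, not anything prescribed by the algorithm itself.

```python
# Minimal vanilla gradient descent on a toy loss L(w) = (w - 3)^2.
# The loss function, learning rate (eta), and step count are illustrative.

def grad(w):
    return 2.0 * (w - 3.0)  # dL/dw for L(w) = (w - 3)^2

def vanilla_gd(w=0.0, eta=0.1, steps=100):
    for _ in range(steps):
        w -= eta * grad(w)  # move against the gradient
    return w
```

Note that the step size is eta * grad(w): on a gentle region where the gradient is tiny, the update is tiny too, which is exactly the stalling behavior described above.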

Some variations of vanilla gradient descent have been developed to address these issues.

MOMENTUM BASED GRADIENT DESCENT

Momentum-based gradient descent enables faster convergence by building up momentum with each update. Rather than relying on the current gradient alone, it computes each update from an exponentially weighted average of the gradients from prior iterations.

Update rule for momentum-based gradient descent:

u_t = γ · u_{t−1} + η · ∇L(w_t)
w_{t+1} = w_t − u_t

Typically, the momentum term γ (gamma) is set to 0.9 or a similar value.

In general, momentum-based gradient descent takes larger steps when asked to go in the same direction repeatedly.
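A rough sketch of this update rule, reusing the same toy quadratic loss L(w) = (w − 3)² and γ = 0.9 from the text (the other values are illustrative):

```python
# Momentum-based gradient descent on the toy loss L(w) = (w - 3)^2.
# gamma is the momentum term; u accumulates an exponentially
# weighted average of the gradients from prior iterations.

def grad(w):
    return 2.0 * (w - 3.0)

def momentum_gd(w=0.0, eta=0.1, gamma=0.9, steps=500):
    u = 0.0  # accumulated update (the history term)
    for _ in range(steps):
        u = gamma * u + eta * grad(w)  # history + current gradient
        w = w - u                      # larger steps when gradients agree
    return w
```

When successive gradients point the same way, u keeps growing, which is why momentum takes larger and larger steps in a consistent direction.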

NESTEROV ACCELERATED GRADIENT (NAG)

Because of its accumulated momentum, momentum-based gradient descent tends to overshoot the minimum, resulting in oscillations around it. Nesterov accelerated gradient (NAG) addresses this with a look-ahead mechanism: it first makes a large jump in the direction of the accumulated past gradients, computes the gradient at that look-ahead point, and then corrects itself based on this new gradient.

Update rule for NAG:

w_lookahead = w_t − γ · u_{t−1}
u_t = γ · u_{t−1} + η · ∇L(w_lookahead)
w_{t+1} = w_t − u_t

NAG is essentially built on the notion of "look before you leap." This foresight lets it correct its course sooner and avoid large oscillations around the minimum.

STOCHASTIC AND MINI-BATCH GRADIENT DESCENT

Before making a single update, vanilla gradient descent looks over all of the training data. This guarantees that the gradient we’re calculating is the true gradient that minimizes the loss function. However, because we only make one update after going through all of the data, it is quite slow.

In contrast, stochastic gradient descent performs one update for each training example. As a result, it is significantly faster than vanilla gradient descent. However, since each step is based on a single example rather than the true gradient, there is no guarantee that the loss decreases with every step.

Stochastic GD makes more oscillations compared to Vanilla GD [Source]

Mini-batch gradient descent is a compromise between vanilla gradient descent and stochastic gradient descent. It performs one update after going through k training examples, where k is the size of the mini-batch; typical values are 16, 32, 64, and so on. A larger k gives a more accurate gradient estimate but means fewer updates per pass over the data, slowing the descent.
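As a sketch, here is mini-batch gradient descent fitting a one-parameter model y = a·x to made-up data (the data, batch size, learning rate, and epoch count are all illustrative). Setting k = 1 recovers stochastic gradient descent, and k = len(data) recovers vanilla gradient descent.

```python
import random

# Mini-batch gradient descent fitting y = a * x to toy data whose
# true slope is 2. Batch size k, learning rate, and epochs are
# illustrative choices; k = 1 gives stochastic GD, k = len(data)
# gives vanilla GD.

def minibatch_gd(data, k=4, eta=0.01, epochs=200, seed=0):
    rng = random.Random(seed)
    data = list(data)  # copy so the caller's list is not shuffled
    a = 0.0            # the single model parameter
    for _ in range(epochs):
        rng.shuffle(data)                  # fresh mini-batches each epoch
        for i in range(0, len(data), k):
            batch = data[i:i + k]
            # gradient of the mean squared error over this mini-batch
            g = sum(2 * (a * x - y) * x for x, y in batch) / len(batch)
            a -= eta * g                   # one update per k examples
    return a

data = [(x, 2.0 * x) for x in range(1, 9)]
```

Averaging the gradient over k examples is what smooths out the oscillations of pure stochastic GD while still updating far more often than a full pass over the data.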

CONCLUSION

The different variants discussed in this article serve to mitigate some of the problems associated with vanilla gradient descent. Knowing about these gradient descent variations is crucial because they are frequently used in many machine learning and neural network architectures.

ACKNOWLEDGMENTS

CS6910: Deep Learning course, IIT Madras
