Gradient Descent Convergence: From Convex Optimization to Deep Learning

Authors

  • Editor-in-Chief Author

Keywords

gradient descent, convergence analysis, convex optimization, non-convex functions, learning rate, conjugate gradient method, adaptive optimization, deep learning

Abstract

Gradient descent is a fundamental optimization algorithm commonly used in machine learning and deep learning to minimize objective functions. Despite its simplicity, understanding its convergence behaviour across different optimization landscapes remains critical for efficient and stable model training. This study examines the theoretical and practical convergence properties of gradient descent in convex, strongly convex, and non-convex settings. The aim was to analyse how factors such as function smoothness, step size, and algorithmic variants affect convergence to minima or stationary points. We used both theoretical proofs and numerical experiments to assess performance. Convergence proofs were presented for convex and smooth functions, demonstrating sub-linear convergence of gradient descent, and for strongly convex functions, demonstrating linear convergence. In non-convex settings, we showed that gradient descent converges to stationary points under standard assumptions, supported by theoretical guarantees. To validate these results, we applied gradient descent, the conjugate gradient method, and an Adaptive Modified Gradient-Type (AMGT) method to convex functions and bivariate quadratic functions. Our simulations indicated that standard gradient descent ensured stable but slower convergence, whereas conjugate gradient methods offered faster descent on convex quadratic problems. AMGT achieved improved convergence speed with appropriate tuning but diverged at excessively high learning rates, underscoring its sensitivity to hyperparameters. In conclusion, the study confirms that the convergence behaviour of gradient-based methods depends significantly on problem structure and learning rate selection. For convex problems, fixed learning rates below the inverse of the Lipschitz constant ensure convergence. In non-convex domains, adaptive methods such as Adam and learning rate scheduling can improve performance, especially in deep learning applications. We recommend careful step size tuning and method selection based on problem characteristics to balance convergence speed and stability in practical applications.
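
The fixed-step rule mentioned above (a learning rate no larger than the inverse of the Lipschitz constant of the gradient) can be illustrated with a minimal sketch, shown below. This is not the paper's code: the bivariate quadratic, the matrix A, the vector b, and the stopping tolerance are illustrative assumptions, chosen only to show stable convergence of plain gradient descent with step size 1/L.

```python
# Illustrative sketch (not the study's implementation): gradient descent on a
# convex bivariate quadratic f(x) = 0.5 * x^T A x - b^T x with a fixed step
# size 1/L, where L is the Lipschitz constant of the gradient (the largest
# eigenvalue of A). A, b, and the tolerance below are assumed values.
import numpy as np

A = np.array([[3.0, 0.5],
              [0.5, 1.0]])          # symmetric positive definite -> strongly convex
b = np.array([1.0, -2.0])

L = np.linalg.eigvalsh(A).max()     # Lipschitz constant of the gradient
step = 1.0 / L                      # fixed learning rate at the 1/L threshold

x = np.zeros(2)
for k in range(500):
    grad = A @ x - b                # gradient of the quadratic
    x = x - step * grad
    if np.linalg.norm(grad) < 1e-8: # stop once the gradient is (nearly) zero
        break

x_star = np.linalg.solve(A, b)      # closed-form minimizer for comparison
print(f"iterations: {k}, error: {np.linalg.norm(x - x_star):.2e}")
```

With a strongly convex quadratic like this one, the error contracts geometrically (linear convergence); for merely convex smooth objectives the same step-size rule gives the slower, sub-linear rate discussed in the abstract.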

Published

2025-10-18