Stochastic Gradient Descent with Preconditioned Polyak Step-size
- URL: http://arxiv.org/abs/2310.02093v1
- Date: Tue, 3 Oct 2023 14:36:05 GMT
- Title: Stochastic Gradient Descent with Preconditioned Polyak Step-size
- Authors: Farshed Abdukhakimov, Chulu Xiang, Dmitry Kamzolov, Martin Takáč
- Abstract summary: Stochastic Gradient Descent with Polyak Step-size (SPS) is a method that offers an update rule that alleviates the need to fine-tune the learning rate of an optimizer.
In this paper, we propose an extension of SPS that employs preconditioning techniques, such as Hutchinson's method, Adam, and AdaGrad, to improve its performance on badly scaled and/or ill-conditioned datasets.
- Score: 1.3300175008796402
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Stochastic Gradient Descent (SGD) is one of the many iterative optimization
methods that are widely used in solving machine learning problems. These
methods display valuable properties and attract researchers and industrial
machine learning engineers with their simplicity. However, one of the
weaknesses of this type of method is the necessity to tune the learning rate
(step-size) for every loss function and dataset combination in order to solve an
optimization problem and obtain efficient performance within a given time budget.
Stochastic Gradient Descent with Polyak Step-size (SPS) is a method that offers
an update rule that alleviates the need to fine-tune the learning rate of an
optimizer. In this paper, we propose an extension of SPS that employs
preconditioning techniques, such as Hutchinson's method, Adam, and AdaGrad, to
improve its performance on badly scaled and/or ill-conditioned datasets.
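The preconditioned update described above can be sketched in a few lines. This is an illustrative reconstruction, not the paper's reference implementation: it assumes the interpolation setting (per-sample optimal loss f_i* = 0), uses an AdaGrad-style diagonal accumulator as the preconditioner, and the function name and the `gamma_max` cap are choices made here.

```python
import numpy as np

def preconditioned_sps_step(w, loss_i, grad_i, d, eps=1e-8, gamma_max=10.0):
    """One preconditioned SPS step with a diagonal preconditioner d.

    Assumes the interpolation setting, so the Polyak numerator is just the
    per-sample loss. The step-size is measured in the D^{-1} norm.
    """
    p_inv_g = grad_i / (d + eps)                 # D^{-1} g
    denom = grad_i @ p_inv_g                     # ||g||^2 in the D^{-1} norm
    gamma = min(loss_i / (denom + eps), gamma_max)
    return w - gamma * p_inv_g

# Toy badly scaled least-squares problem: f_i(w) = 0.5 * (a_i @ w - b_i)^2
rng = np.random.default_rng(0)
A = rng.normal(size=(200, 2)) * np.array([100.0, 1.0])  # ill-scaled features
w_true = np.array([1.0, -2.0])
b = A @ w_true                                   # consistent system

w = np.zeros(2)
d = np.zeros(2)                                  # AdaGrad-style accumulator
for t in range(3000):
    i = rng.integers(len(b))
    r = A[i] @ w - b[i]
    loss_i = 0.5 * r**2
    g = r * A[i]
    d += g**2                                    # update diagonal preconditioner
    w = preconditioned_sps_step(w, loss_i, g, np.sqrt(d))
```

For a per-sample quadratic, each step halves the selected residual in the preconditioned metric, so on this consistent system the iterates approach `w_true` despite the 100x feature scaling.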
Related papers
- No learning rates needed: Introducing SALSA -- Stable Armijo Line Search Adaptation [4.45108516823267]
We identify problems of current state-of-the-art line search methods, propose enhancements, and rigorously assess their effectiveness.
We evaluate these methods on orders of magnitude larger and more complex data domains than previously done.
Our work is publicly available as a Python package, which provides a simple PyTorch implementation.
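For context on the mechanism this entry builds on, here is a minimal backtracking Armijo line search, a generic textbook version rather than SALSA itself; the constants `c` and `beta` are conventional defaults assumed here.

```python
import numpy as np

def armijo_line_search(f, w, g, lr0=1.0, c=1e-4, beta=0.5, max_iter=30):
    """Shrink the step until the Armijo sufficient-decrease condition
    f(w - lr*g) <= f(w) - c * lr * ||g||^2 holds."""
    fw = f(w)
    lr = lr0
    for _ in range(max_iter):
        if f(w - lr * g) <= fw - c * lr * (g @ g):
            return lr
        lr *= beta                               # backtrack
    return lr

# Usage on a simple quadratic f(w) = 0.5 ||w||^2, whose gradient is w
f = lambda w: 0.5 * (w @ w)
w = np.array([3.0, -4.0])
lr = armijo_line_search(f, w, w)
```

On this quadratic the full step `lr0 = 1.0` already satisfies the condition, so no backtracking occurs.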
arXiv Detail & Related papers (2024-07-30T08:47:02Z)
- Non-convergence of Adam and other adaptive stochastic gradient descent optimization methods for non-vanishing learning rates [3.6185342807265415]
Deep learning algorithms are the key ingredients in many artificial intelligence (AI) systems.
Deep learning algorithms typically consist of a class of deep neural networks trained by a stochastic gradient descent (SGD) optimization method.
arXiv Detail & Related papers (2024-07-11T00:10:35Z)
- Learning rate adaptive stochastic gradient descent optimization methods: numerical simulations for deep learning methods for partial differential equations and convergence analyses [5.052293146674794]
It is known that the standard stochastic gradient descent (SGD) optimization method, as well as accelerated and adaptive SGD optimization methods such as Adam, fail to converge if the learning rates do not converge to zero.
In this work we propose and study a learning-rate-adaptive approach for SGD optimization methods in which the learning rate is adjusted based on empirical estimates.
arXiv Detail & Related papers (2024-06-20T14:07:39Z)
- SANIA: Polyak-type Optimization Framework Leads to Scale Invariant Stochastic Algorithms [1.21748738176366]
Techniques such as Adam, AdaGrad, and AdaHessian utilize a preconditioner that modifies the search direction by incorporating the curvature of the objective function.
This paper proposes SANIA to tackle these challenges.
arXiv Detail & Related papers (2023-12-28T21:28:08Z)
- Learning to Optimize Permutation Flow Shop Scheduling via Graph-based Imitation Learning [70.65666982566655]
Permutation flow shop scheduling (PFSS) is widely used in manufacturing systems.
We propose to train the model via expert-driven imitation learning, which accelerates convergence more stably and accurately.
Our model's network parameters are reduced to only 37% of theirs, and the solution gap of our model towards the expert solutions decreases from 6.8% to 1.3% on average.
arXiv Detail & Related papers (2022-10-31T09:46:26Z)
- An Empirical Evaluation of Zeroth-Order Optimization Methods on AI-driven Molecule Optimization [78.36413169647408]
We study the effectiveness of various ZO optimization methods for optimizing molecular objectives.
We show the advantages of ZO sign-based gradient descent (ZO-signGD)
We demonstrate the potential effectiveness of ZO optimization methods on widely used benchmark tasks from the Guacamol suite.
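A minimal sketch of the ZO-signGD idea mentioned above, assuming a forward finite-difference gradient estimator over random Gaussian directions; the step size, smoothing radius, and direction count are illustrative choices made here, not the paper's settings.

```python
import numpy as np

def zo_sign_grad_step(f, w, lr=0.05, mu=1e-3, n_dirs=20, rng=None):
    """One ZO-signGD step: estimate the gradient of a black-box f from
    finite differences along random directions, then move against the
    sign of the estimate (no autodiff required)."""
    rng = rng or np.random.default_rng()
    g_hat = np.zeros_like(w)
    fw = f(w)
    for _ in range(n_dirs):
        u = rng.normal(size=w.shape)
        g_hat += (f(w + mu * u) - fw) / mu * u   # directional finite difference
    g_hat /= n_dirs
    return w - lr * np.sign(g_hat)

# Toy black-box objective with minimum at w = 1
f = lambda w: np.sum((w - 1.0) ** 2)
w = np.full(5, 4.0)
rng = np.random.default_rng(0)
for _ in range(100):
    w = zo_sign_grad_step(f, w, rng=rng)
```

Because the update uses only the sign of the estimate, each coordinate moves a fixed `lr` per step, which is what makes the method robust to the noisy magnitudes of zeroth-order estimates.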
arXiv Detail & Related papers (2022-10-27T01:58:10Z)
- Logistic Q-Learning [87.00813469969167]
We propose a new reinforcement learning algorithm derived from a regularized linear-programming formulation of optimal control in MDPs.
The main feature of our algorithm is a convex loss function for policy evaluation that serves as a theoretically sound alternative to the widely used squared Bellman error.
arXiv Detail & Related papers (2020-10-21T17:14:31Z)
- Automatically Learning Compact Quality-aware Surrogates for Optimization Problems [55.94450542785096]
Solving optimization problems with unknown parameters requires learning a predictive model to predict the values of the unknown parameters and then solving the problem using these values.
Recent work has shown that including the optimization problem as a layer in a complex training model pipeline results in predictions that account for the downstream decision-making task.
We show that we can improve solution quality by learning a low-dimensional surrogate model of a large optimization problem.
arXiv Detail & Related papers (2020-06-18T19:11:54Z)
- A Primer on Zeroth-Order Optimization in Signal Processing and Machine Learning [95.85269649177336]
ZO optimization iteratively performs three major steps: gradient estimation, descent direction, and solution update.
We demonstrate promising applications of ZO optimization, such as evaluating and generating explanations from black-box deep learning models, and efficient online sensor management.
arXiv Detail & Related papers (2020-06-11T06:50:35Z)
- Extrapolation for Large-batch Training in Deep Learning [72.61259487233214]
We show that a host of variations can be covered in a unified framework that we propose.
We prove the convergence of this novel scheme and rigorously evaluate its empirical performance on ResNet, LSTM, and Transformer.
arXiv Detail & Related papers (2020-06-10T08:22:41Z)
- Stochastic Polyak Step-size for SGD: An Adaptive Learning Rate for Fast Convergence [30.393999722555154]
We propose a variant of the classical Polyak step-size (Polyak, 1987) commonly used in the subgradient method.
The proposed Polyak step-size (SPS) is an attractive choice for setting the learning rate for gradient descent.
arXiv Detail & Related papers (2020-02-24T20:57:23Z)
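The classical SPS rule this entry refers to can be sketched in a few lines; the `c` and `gamma_max` parameters follow the capped (SPS_max-style) variant, and the toy usage below assumes the per-sample optimum f_i* = 0, so treat it as an illustration rather than the paper's exact algorithm.

```python
import numpy as np

def sps_step(w, loss_i, grad_i, loss_i_star=0.0, c=0.5, gamma_max=10.0,
             eps=1e-8):
    """Plain stochastic Polyak step: gamma = (f_i(w) - f_i*) / (c * ||g||^2),
    clipped at gamma_max to guard against tiny gradients."""
    gamma = (loss_i - loss_i_star) / (c * (grad_i @ grad_i) + eps)
    return w - min(gamma, gamma_max) * grad_i

# One step on f(w) = 0.5 * w^2 from w = 2: loss = 2, gradient = 2,
# so gamma = 2 / (0.5 * 4) = 1 and the step lands (essentially) on the minimum.
w_new = sps_step(np.array([2.0]), loss_i=2.0, grad_i=np.array([2.0]))
```

The appeal is that the step-size adapts to the current loss and gradient norm, so no learning-rate schedule needs to be tuned per dataset.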
This list is automatically generated from the titles and abstracts of the papers in this site.