Learning to Accelerate by the Methods of Step-size Planning
- URL: http://arxiv.org/abs/2204.01705v1
- Date: Fri, 1 Apr 2022 19:59:40 GMT
- Title: Learning to Accelerate by the Methods of Step-size Planning
- Authors: Hengshuai Yao
- Abstract summary: Gradient descent is slow to converge for ill-conditioned problems and non-convex problems.
Step-size adaptation is an important technique for acceleration.
We show that our methods can surpass the convergence rate of Nesterov's accelerated gradient.
- Score: 11.65690857661528
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Gradient descent is slow to converge for ill-conditioned problems and
non-convex problems. An important technique for acceleration is step-size
adaptation. The first part of this paper contains a detailed review of
step-size adaptation methods, including Polyak step-size, L4, LossGrad, Adam
and IDBD. In the second part of this paper, we propose a new class of methods
of accelerating gradient descent that are quite different from existing
techniques. The new methods, which we call {\em step-size planning}, use the
{\em update experience} to learn an improved way of updating the parameters.
The methods organize the experience into $K$ steps away from each other to
facilitate planning. From the past experience, our planning algorithm, Csawg,
learns a step-size model which is a form of multi-step machine that predicts
future updates. We extend Csawg to apply step-size planning for multiple steps,
which leads to further speedup. We discuss and highlight the projection power
of the diagonal-matrix step-size for future large scale applications. We show
for a convex problem, our methods can surpass the convergence rate of
Nesterov's accelerated gradient, $1 - \sqrt{\mu/L}$, where $\mu, L$ are the
strongly convex factor of the loss function $F$ and the Lipschitz constant of
$F'$. On the classical non-convex Rosenbrock function, our planning methods
achieve zero error in fewer than 500 gradient evaluations, while gradient descent takes
about 10000 gradient evaluations to reach a $10^{-3}$ accuracy. We discuss the
connection of step-size planning to planning in reinforcement learning, in
particular, Dyna architectures. We leave convergence and convergence rate
proofs and applications of the methods to high-dimensional problems for future
work.
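The planning idea can be illustrated with a toy sketch: run gradient descent, record parameter updates $K$ steps apart, fit a diagonal (per-coordinate) step-size model that maps one $K$-step update to the next, and apply the model repeatedly to extrapolate future updates. This is only an illustrative reading of the abstract, not the paper's Csawg algorithm; the quadratic objective, step-size, and constants below are assumptions.

```python
import numpy as np

# Toy sketch (not the paper's Csawg): learn a diagonal step-size model
# from K-step update experience, then apply it to predict future updates.
def grad(x):
    return np.array([1.0, 100.0]) * x  # ill-conditioned F(x) = 0.5*(x1^2 + 100*x2^2)

alpha, K = 0.009, 5                    # base step-size and planning spacing (illustrative)
x = np.array([1.0, 1.0])
hist = [x.copy()]
for _ in range(3 * K):                 # collect update experience
    x = x - alpha * grad(x)
    hist.append(x.copy())

# fit a diagonal model w so that w * (K-step update) ~= next K-step update
d1 = hist[2 * K] - hist[K]
d2 = hist[3 * K] - hist[2 * K]
w = d2 / d1

# planning: apply the learned model repeatedly to extrapolate future updates
x_plan, d = x.copy(), d2
for _ in range(20):
    d = w * d
    x_plan = x_plan + d

print(np.linalg.norm(x_plan) < np.linalg.norm(x))  # planned point is closer to the optimum
```

On this quadratic, each coordinate's updates shrink geometrically, so the fitted diagonal entry recovers the per-coordinate contraction factor $(1-\alpha\lambda_i)^K$, and the planned extrapolation covers many future gradient steps at no gradient cost.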
Related papers
- Unified Gradient-Based Machine Unlearning with Remain Geometry Enhancement [29.675650285351768]
Machine unlearning (MU) has emerged to enhance the privacy and trustworthiness of deep neural networks.
Approximate MU is a practical method for large-scale models.
We propose a fast-slow parameter update strategy to implicitly approximate the up-to-date salient unlearning direction.
arXiv Detail & Related papers (2024-09-29T15:17:33Z) - Class Gradient Projection For Continual Learning [99.105266615448]
Catastrophic forgetting is one of the most critical challenges in Continual Learning (CL)
We propose Class Gradient Projection (CGP), which calculates the gradient subspace from individual classes rather than tasks.
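The class-wise projection idea can be sketched generically: assume (hypothetically) that an orthonormal basis `B` spanning directions important to previously learned classes has been stored, and project new gradients to be orthogonal to that subspace. This is a generic gradient-projection sketch, not CGP's exact per-class construction; all shapes and names are illustrative.

```python
import numpy as np

def project_out(grad, B):
    """Remove the component of `grad` lying in span(B), B orthonormal columns."""
    return grad - B @ (B.T @ grad)

rng = np.random.default_rng(0)
B, _ = np.linalg.qr(rng.standard_normal((10, 3)))  # basis of an old-class subspace (toy)
g = rng.standard_normal(10)                        # gradient computed on new data

g_proj = project_out(g, B)
# the projected update no longer moves the parameters within the old-class subspace
print(np.abs(B.T @ g_proj).max())  # ~ 0
```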
arXiv Detail & Related papers (2023-11-25T02:45:56Z) - ELRA: Exponential learning rate adaption gradient descent optimization
method [83.88591755871734]
We present a novel, fast (exponential rate), ab initio (hyper-parameter-free) gradient-based adaptation method.
The main idea of the method is to adapt the step size $\alpha$ by situational awareness.
It can be applied to problems of any dimension $n$ and scales only linearly in $n$.
arXiv Detail & Related papers (2023-09-12T14:36:13Z) - How Two-Layer Neural Networks Learn, One (Giant) Step at a Time [24.773974771715956]
We investigate theoretically how the features of a two-layer neural network adapt to the structure of the target function.
We compare the influence of batch size and that of multiple (but finitely many) steps.
We show that a batch-size of $n = \mathcal{O}(d)$ is indeed enough to learn multiple target directions satisfying a staircase property.
arXiv Detail & Related papers (2023-05-29T17:43:44Z) - Formal guarantees for heuristic optimization algorithms used in machine
learning [6.978625807687497]
Stochastic Gradient Descent (SGD) and its variants have become the dominant methods for large-scale machine learning (ML) optimization problems.
We provide formal guarantees for a few convex optimization methods and propose improved algorithms.
arXiv Detail & Related papers (2022-07-31T19:41:22Z) - Continuous-Time Meta-Learning with Forward Mode Differentiation [65.26189016950343]
We introduce Continuous Meta-Learning (COMLN), a meta-learning algorithm where adaptation follows the dynamics of a gradient vector field.
Treating the learning process as an ODE offers the notable advantage that the length of the trajectory is now continuous.
We show empirically its efficiency in terms of runtime and memory usage, and we illustrate its effectiveness on a range of few-shot image classification problems.
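The continuous-time view can be illustrated on a toy objective: the discrete adaptation steps become a gradient flow $dx/dt = -\nabla f(x)$, whose trajectory length is a continuous time $T$. The quadratic and the simple Euler integrator below are illustrative assumptions, not COMLN's meta-learning setup or its forward-mode machinery.

```python
import numpy as np

# Gradient flow dx/dt = -grad f(x) on f(x) = 0.5 * sum(lam * x**2),
# integrated with small Euler steps up to continuous "trajectory length" T = 3.
lam = np.array([1.0, 10.0])
x = np.array([1.0, 1.0])
dt = 1e-4
for _ in range(30000):               # 30000 * dt = T = 3.0
    x = x - dt * lam * x             # Euler step on the gradient flow

# closed form of the flow: x_i(T) = x_i(0) * exp(-lam_i * T)
print(x, np.exp(-lam * 3.0))
```

The numerical trajectory matches the closed-form exponential decay, showing how "number of steps" turns into a single continuous parameter $T$.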
arXiv Detail & Related papers (2022-03-02T22:35:58Z) - Cutting Some Slack for SGD with Adaptive Polyak Stepsizes [35.024680868164445]
We consider the family of SPS (Stochastic gradient with a Polyak Stepsize) adaptive methods.
We first show that SPS and its recent variants can all be seen as extensions of the Passive-Aggressive methods applied to nonlinear problems.
We use this insight to develop new variants of the SPS method that are better suited to nonlinear models.
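A minimal sketch of a bounded stochastic Polyak step-size follows, assuming interpolation (per-sample optima $f_i^* = 0$) on a consistent least-squares problem; the constants `c` and `gamma_max` and the problem setup are illustrative choices, not the paper's exact variants.

```python
import numpy as np

def sps_step(x, fi, gi, c=0.5, gamma_max=1.0, fi_star=0.0):
    """One SGD step with a Polyak step-size clipped at gamma_max (sketch)."""
    gamma = min(gamma_max, (fi - fi_star) / (c * np.dot(gi, gi) + 1e-12))
    return x - gamma * gi

# interpolating least squares: f_i(x) = 0.5 * (a_i . x - b_i)^2 with b = A x*
rng = np.random.default_rng(1)
A = rng.standard_normal((50, 5))
x_star = rng.standard_normal(5)
b = A @ x_star

x = np.zeros(5)
for _ in range(2000):
    i = rng.integers(50)
    r = A[i] @ x - b[i]
    x = sps_step(x, fi=0.5 * r * r, gi=r * A[i])

print(np.linalg.norm(x - x_star))  # small residual
```

With `c = 0.5` the unclipped step on this problem reduces to a Kaczmarz-style projection onto the sampled equation, which is one way to see why the Polyak step-size needs no tuned learning rate.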
arXiv Detail & Related papers (2022-02-24T19:31:03Z) - Adapting Stepsizes by Momentumized Gradients Improves Optimization and
Generalization [89.66571637204012]
AdaMomentum performs well on vision tasks and achieves state-of-the-art results consistently on other tasks including language processing.
arXiv Detail & Related papers (2021-06-22T03:13:23Z) - Balancing Rates and Variance via Adaptive Batch-Size for Stochastic
Optimization Problems [120.21685755278509]
In this work, we seek to balance the requirement that an attenuating step-size is needed for exact convergence against the fact that a constant step-size learns faster, albeit only up to an error.
Rather than fixing the minibatch size and the step-size at the outset, we propose to allow these parameters to evolve adaptively.
arXiv Detail & Related papers (2020-07-02T16:02:02Z) - Gradient Free Minimax Optimization: Variance Reduction and Faster
Convergence [120.9336529957224]
In this paper, we study gradient-free minimax optimization problems in the non-strongly-concave setting.
We show that a novel zeroth-order variance reduced descent algorithm achieves the best known query complexity.
arXiv Detail & Related papers (2020-06-16T17:55:46Z) - Explore Aggressively, Update Conservatively: Stochastic Extragradient
Methods with Variable Stepsize Scaling [34.35013145885164]
Extragradient methods have become a staple for solving large-scale saddlepoint problems in machine learning.
We show in this paper that running vanilla extragradient with stochastic gradients may jeopardize its convergence, even in simple bilinear models.
We show that a variable step-size scaling modification allows the method to converge even with stochastic gradients, and we derive sharp convergence rates under an error bound condition.
arXiv Detail & Related papers (2020-03-23T10:24:27Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed summaries (including all information) and is not responsible for any consequences.