ZO-AdaMU Optimizer: Adapting Perturbation by the Momentum and
Uncertainty in Zeroth-order Optimization
- URL: http://arxiv.org/abs/2312.15184v1
- Date: Sat, 23 Dec 2023 07:46:31 GMT
- Title: ZO-AdaMU Optimizer: Adapting Perturbation by the Momentum and
Uncertainty in Zeroth-order Optimization
- Authors: Shuoran Jiang, Qingcai Chen, Youchen Pan, Yang Xiang, Yukang Lin,
Xiangping Wu, Chuanyi Liu, Xiaobao Song
- Abstract summary: This study proposes ZO-AdaMU to adapt the simulated perturbation with momentum in its approximation.
Our convergence analysis and experiments prove this is a better way to improve convergence stability and rate in ZO-SGD.
- Score: 18.02643194439027
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Lowering the memory requirement in full-parameter training on large models
has become a hot research area. MeZO fine-tunes large language models (LLMs)
using only forward passes with a zeroth-order SGD optimizer (ZO-SGD),
demonstrating excellent performance with the same GPU memory usage as
inference. However, the simulated perturbation stochastic approximation used for
gradient estimation in MeZO leads to severe oscillations and incurs a substantial
time overhead. Moreover, without momentum regularization, MeZO shows severe
over-fitting problems. Lastly, momentum that is applied independently of the
perturbation (perturbation-irrelevant momentum) in ZO-SGD does not improve the
convergence rate. This study proposes ZO-AdaMU to resolve
the above problems by adapting the simulated perturbation with momentum in its
stochastic approximation. Unlike existing adaptive momentum methods, we
relocate momentum on simulated perturbation in stochastic gradient
approximation. Our convergence analysis and experiments prove this is a better
way to improve convergence stability and rate in ZO-SGD. Extensive experiments
demonstrate that ZO-AdaMU yields better generalization for LLM fine-tuning
across various NLP tasks than MeZO and its momentum variants.
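To make the mechanism concrete, the following is a minimal NumPy sketch of the idea of relocating momentum onto the simulated perturbation in SPSA-style gradient estimation. The function names, the convex mixing rule between the momentum buffer and fresh Gaussian noise, and all hyperparameter values are illustrative assumptions, not the authors' exact ZO-AdaMU update.

```python
import numpy as np

def spsa_grad(loss, theta, z, eps=1e-3):
    """Two-point SPSA estimate: finite difference along z, projected back onto z."""
    return (loss(theta + eps * z) - loss(theta - eps * z)) / (2 * eps) * z

def zo_adamu_like_step(loss, theta, m, lr=1e-4, beta=0.9, eps=1e-3, rng=np.random):
    """One illustrative step: the simulated perturbation itself mixes momentum
    with fresh Gaussian noise (the 'uncertainty'), rather than adding momentum
    to the gradient estimate afterwards."""
    u = rng.standard_normal(theta.shape)   # fresh perturbation (the uncertainty term)
    z = beta * m + (1.0 - beta) * u        # momentum-adapted simulated perturbation
    g = spsa_grad(loss, theta, z, eps)     # zeroth-order gradient estimate along z
    return theta - lr * g, z               # plain ZO-SGD update; carry z forward as momentum
```

Iterating this step from `m = np.zeros_like(theta)` on a simple quadratic such as `lambda w: np.sum(w ** 2)` needs only two forward evaluations per update, mirroring the memory profile of forward-pass-only fine-tuning.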
Related papers
- A Stochastic Approach to Bi-Level Optimization for Hyperparameter Optimization and Meta Learning [74.80956524812714]
We tackle the general differentiable meta learning problem that is ubiquitous in modern deep learning.
These problems are often formalized as Bi-Level optimizations (BLO).
We introduce a novel perspective by turning a given BLO problem into a stochastic optimization, where the inner loss function becomes a smooth distribution and the outer loss becomes an expected loss over the inner distribution.
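As a rough illustration of that reformulation (a sketch under stated assumptions, not the paper's estimator), one can replace the inner argmin by a Gibbs-style distribution over a finite set of candidate inner solutions and evaluate the outer loss in expectation under it; the temperature `tau` and the candidate set are hypothetical ingredients introduced only for this example.

```python
import numpy as np

def expected_outer_loss(outer_loss, inner_loss, candidates, tau=0.1):
    """Smooth the inner problem into a distribution: weight candidate inner
    solutions by exp(-inner_loss / tau), then average the outer loss."""
    energies = np.array([inner_loss(w) for w in candidates])
    weights = np.exp(-(energies - energies.min()) / tau)  # smooth distribution over inner solutions
    weights /= weights.sum()
    return sum(wt * outer_loss(w) for wt, w in zip(weights, candidates))
```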
arXiv Detail & Related papers (2024-10-14T12:10:06Z) - Zeroth-Order Fine-Tuning of LLMs in Random Subspaces [66.27334633749734]
As language models grow in size, memory demands for backpropagation increase.
Zeroth-order (ZO) optimization methods offer a memory-efficient alternative.
We show that SubZero enhances fine-tuning and converges faster than standard ZO approaches.
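A minimal sketch of the random-subspace idea follows; the Gaussian basis, the two-point finite difference, and the subspace dimension `k` are simplifying assumptions, and SubZero's actual layer-wise low-rank perturbation scheme differs in detail.

```python
import numpy as np

def random_subspace_zo_grad(loss, theta, k=16, eps=1e-3, rng=np.random):
    """Two-point zeroth-order gradient estimate restricted to a random
    k-dimensional subspace of the full parameter space."""
    d = theta.size
    P, _ = np.linalg.qr(rng.standard_normal((d, k)))  # orthonormal basis of a random subspace
    u = rng.standard_normal(k)                        # low-dimensional perturbation direction
    z = P @ u                                         # lift the direction into parameter space
    return (loss(theta + eps * z) - loss(theta - eps * z)) / (2 * eps) * z
```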
arXiv Detail & Related papers (2024-10-11T17:01:43Z) - Moreau Envelope ADMM for Decentralized Weakly Convex Optimization [55.2289666758254]
This paper proposes a proximal variant of the alternating direction method of multipliers (ADMM) for distributed optimization.
The results of our numerical experiments indicate that our method is faster and more robust than widely-used approaches.
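The basic building block behind Moreau-envelope and proximal ADMM variants is the proximal map; the generic numerical sketch below (SciPy-based, single node) is only meant to show that subproblem, not the paper's decentralized algorithm.

```python
import numpy as np
from scipy.optimize import minimize

def moreau_prox(f, x, lam=1.0):
    """prox_{lam*f}(x) = argmin_y f(y) + ||y - x||^2 / (2*lam). For a weakly convex f,
    choosing lam small enough makes this subproblem strongly convex."""
    obj = lambda y: f(y) + np.sum((y - x) ** 2) / (2.0 * lam)
    return minimize(obj, x).x
```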
arXiv Detail & Related papers (2023-08-31T14:16:30Z) - Exponential convergence rates for momentum stochastic gradient descent in the overparametrized setting [0.6445605125467574]
We prove bounds on the rate of convergence for the momentum stochastic gradient descent scheme (MSGD).
We analyze the optimal choice of the friction parameter and show that the MSGD process almost surely converges to a local minimum.
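For reference, the scheme being analyzed amounts to the familiar heavy-ball update; the learning rate and friction values below are placeholders, not the optimal choices derived in the paper.

```python
import numpy as np

def msgd_step(theta, v, grad, lr=1e-2, friction=0.9):
    """One momentum (heavy-ball) SGD step; `friction` plays the role of the
    friction parameter whose optimal choice the analysis studies."""
    v = friction * v - lr * grad   # velocity: damped by friction, driven by the stochastic gradient
    return theta + v, v            # move parameters along the velocity
```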
arXiv Detail & Related papers (2023-02-07T15:59:08Z) - SketchySGD: Reliable Stochastic Optimization via Randomized Curvature
Estimates [19.420605210427635]
SketchySGD improves upon existing gradient methods in machine learning by using randomized low-rank approximations to the subsampled Hessian.
We show theoretically that SketchySGD with a fixed stepsize converges linearly to a small ball around the optimum.
In the ill-conditioned setting we show SketchySGD converges at a faster rate than SGD for least-squares problems.
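A rough sketch of the central ingredient is below: precondition a gradient with a randomized low-rank (Nystrom-style) approximation of a positive semidefinite subsampled Hessian accessed only through Hessian-vector products. The rank, regularizer `rho`, and jitter are illustrative, and this is not the authors' implementation.

```python
import numpy as np

def nystrom_preconditioned_grad(hvp, g, rank=10, rho=1e-3, rng=np.random):
    """Approximately apply (H + rho*I)^{-1} to g, where H is available only via
    products hvp(v) = H @ v, using a rank-`rank` randomized Nystrom sketch."""
    d = g.size
    Omega, _ = np.linalg.qr(rng.standard_normal((d, rank)))       # random orthonormal test matrix
    Y = np.column_stack([hvp(Omega[:, j]) for j in range(rank)])  # Y = H @ Omega
    C = Omega.T @ Y                                               # small core matrix Omega^T H Omega
    L = np.linalg.cholesky(C + 1e-8 * np.eye(rank))               # jitter for numerical safety
    B = np.linalg.solve(L, Y.T).T                                 # Nystrom factor: H_hat = B @ B.T
    U, s, _ = np.linalg.svd(B, full_matrices=False)               # H_hat = U diag(s^2) U^T
    Ug = U.T @ g
    return U @ (Ug / (s ** 2 + rho)) + (g - U @ Ug) / rho         # Woodbury-style inverse application
```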
arXiv Detail & Related papers (2022-11-16T01:05:41Z) - Convergence and Stability of the Stochastic Proximal Point Algorithm
with Momentum [14.158845925610438]
We show how the stochastic proximal point algorithm with momentum (SPPAM) converges faster to a neighborhood of the solution than the stochastic proximal point algorithm (SPPA), with a better contraction factor.
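A compact sketch of one natural way to combine momentum with a stochastic proximal-point step is given below; the extrapolate-then-prox form and the parameter values are assumptions for illustration, and the paper's exact SPPAM recursion may differ.

```python
import numpy as np
from scipy.optimize import minimize

def sppam_like_step(f_sample, x, x_prev, eta=0.1, beta=0.5):
    """Heavy-ball extrapolation followed by a proximal step on a freshly sampled loss."""
    y = x + beta * (x - x_prev)                                     # momentum extrapolation
    obj = lambda z: f_sample(z) + np.sum((z - y) ** 2) / (2 * eta)  # prox_{eta * f_sample}(y)
    return minimize(obj, y).x, x                                    # (x_next, new x_prev)
```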
arXiv Detail & Related papers (2021-11-11T12:17:22Z) - Stochastic Mirror Descent: Convergence Analysis and Adaptive Variants
via the Mirror Stochastic Polyak Stepsize [20.376216873620763]
We investigate the convergence of stochastic mirror descent (SMD) under interpolation in relatively smooth and smooth convex optimization.
We propose a new adaptive stepsize scheme -- the mirror stochastic Polyak stepsize (mSPS).
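The following sketch shows the flavor of the method on the probability simplex with the entropic mirror map; the interpolation assumption (sampled optimal value close to zero), the infinity norm in the stepsize, and the cap `eta_max` are simplifications rather than the paper's exact mSPS rule.

```python
import numpy as np

def smd_polyak_step(x, loss_val, grad, c=0.5, eta_max=1.0):
    """One entropic mirror-descent step with a Polyak-style adaptive stepsize."""
    eta = min(loss_val / (c * np.max(np.abs(grad)) ** 2 + 1e-12), eta_max)  # Polyak-style stepsize
    y = x * np.exp(-eta * grad)   # multiplicative-weights (mirror) update
    return y / y.sum()            # renormalize onto the simplex
```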
arXiv Detail & Related papers (2021-10-28T19:49:40Z) - On the Convergence of Stochastic Extragradient for Bilinear Games with
Restarted Iteration Averaging [96.13485146617322]
We present an analysis of the Stochastic ExtraGradient (SEG) method with constant step size, and propose variations of the method that yield favorable convergence.
We prove that when augmented with averaging, SEG provably converges to the Nash equilibrium, and such a rate is provably accelerated by incorporating a scheduled restarting procedure.
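A noise-free sketch on the bilinear game min_x max_y x^T A y illustrates constant-stepsize extragradient with iterate averaging and scheduled restarts from the averaged point; the stepsize, horizon, and restart period are illustrative choices.

```python
import numpy as np

def seg_restarted_avg(A, x, y, eta=0.1, steps=200, restart_every=50):
    """Extragradient on min_x max_y x^T A y with iterate averaging and restarts."""
    x_sum, y_sum, n = np.zeros_like(x), np.zeros_like(y), 0
    for t in range(1, steps + 1):
        x_half = x - eta * (A @ y)          # extrapolation step
        y_half = y + eta * (A.T @ x)
        x = x - eta * (A @ y_half)          # update using the extrapolated point
        y = y + eta * (A.T @ x_half)
        x_sum, y_sum, n = x_sum + x, y_sum + y, n + 1
        if t % restart_every == 0:          # restart from the averaged iterate
            x, y = x_sum / n, y_sum / n
            x_sum, y_sum, n = np.zeros_like(x), np.zeros_like(y), 0
    return x, y
```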
arXiv Detail & Related papers (2021-06-30T17:51:36Z) - Fast Distributionally Robust Learning with Variance Reduced Min-Max
Optimization [85.84019017587477]
Distributionally robust supervised learning is emerging as a key paradigm for building reliable machine learning systems for real-world applications.
Existing algorithms for solving Wasserstein DRSL involve solving complex subproblems or fail to make use of gradients.
We revisit Wasserstein DRSL through the lens of min-max optimization and derive scalable and efficiently implementable extra-gradient algorithms.
arXiv Detail & Related papers (2021-04-27T16:56:09Z) - Positive-Negative Momentum: Manipulating Stochastic Gradient Noise to
Improve Generalization [89.7882166459412]
Stochastic gradient noise (SGN) acts as implicit regularization for deep learning.
Some works attempted to artificially simulate SGN by injecting random noise to improve deep learning.
For simulating SGN at low computational costs and without changing the learning rate or batch size, we propose the Positive-Negative Momentum (PNM) approach.
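A sketch of the positive-negative combination is shown below; the buffer bookkeeping and the normalization are written from the description above and may differ in detail from the paper's PNM and AdaPNM updates.

```python
import numpy as np

def pnm_step(theta, m_a, m_b, grad, t, lr=1e-2, beta1=0.9, beta0=1.0):
    """Two momentum buffers updated on alternating iterations, combined with a
    positive and a negative weight to amplify gradient noise without changing
    the learning rate or batch size."""
    if t % 2 == 0:
        m_a = beta1 * m_a + (1 - beta1) * grad      # refresh buffer A on even steps
        m_new, m_old = m_a, m_b
    else:
        m_b = beta1 * m_b + (1 - beta1) * grad      # refresh buffer B on odd steps
        m_new, m_old = m_b, m_a
    update = (1 + beta0) * m_new - beta0 * m_old    # positive-negative combination
    scale = np.sqrt((1 + beta0) ** 2 + beta0 ** 2)  # keep the update magnitude comparable
    return theta - lr * update / scale, m_a, m_b
```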
arXiv Detail & Related papers (2021-03-31T16:08:06Z) - The Role of Momentum Parameters in the Optimal Convergence of Adaptive
Polyak's Heavy-ball Methods [12.93796690939018]
We prove that the adaptive Polyak's Heavy-ball (HB) method attains an optimal individual convergence rate of $O(\frac{1}{\sqrt{t}})$.
Our new analysis shows how the HB momentum and its time-varying weight help us to achieve the acceleration in convex optimization.
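For illustration, a heavy-ball step with a decaying stepsize and a time-varying momentum weight of the kind such analyses consider is sketched below; the specific schedules are placeholders, not the paper's optimal parameter choices.

```python
import numpy as np

def adaptive_hb_step(theta, theta_prev, grad, t, lr0=0.1):
    """Heavy-ball step whose stepsize decays and whose momentum weight grows with t."""
    lr = lr0 / np.sqrt(t + 1)   # decaying stepsize
    beta = t / (t + 2)          # time-varying momentum weight
    return theta - lr * grad + beta * (theta - theta_prev), theta
```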
arXiv Detail & Related papers (2021-02-15T02:57:14Z)
This list is automatically generated from the titles and abstracts of the papers in this site.