Cautious Optimizers: Improving Training with One Line of Code
- URL: http://arxiv.org/abs/2411.16085v1
- Date: Mon, 25 Nov 2024 04:36:01 GMT
- Title: Cautious Optimizers: Improving Training with One Line of Code
- Authors: Kaizhao Liang, Lizhang Chen, Bo Liu, Qiang Liu,
- Abstract summary: We rename AdamW as Cautious, e.g. C-AdamW.
A whole new family of outcomes is revealed by our theoretical insight.
- Score: 8.393403749426097
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: AdamW has been the default optimizer for transformer pretraining. For many years, our community searches for faster and more stable optimizers with only constraint positive outcomes. In this work, we propose a \textbf{single-line modification in Pytorch} to any momentum-based optimizer, which we rename Cautious Optimizer, e.g. C-AdamW and C-Lion. Our theoretical result shows that this modification preserves Adam's Hamiltonian function and it does not break the convergence guarantee under the Lyapunov analysis. In addition, a whole new family of optimizers is revealed by our theoretical insight. Among them, we pick the simplest one for empirical experiments, showing speed-up on Llama and MAE pretraining up to $1.47\times$. Code is available at https://github.com/kyleliang919/C-Optim
Related papers
- MicroAdam: Accurate Adaptive Optimization with Low Space Overhead and Provable Convergence [35.17459630834073]
We propose a new variant of the Adam that specifically minimizes memory overheads, while maintaining theoretical convergence guarantees.
We control the resulting compression error via a novel instance of the classical empherror feedback mechanism from distributed optimization.
We prove that the resulting approach maintains theoretical convergence guarantees competitive to those of AMSGrad, while providing good practical performance.
arXiv Detail & Related papers (2024-05-24T14:25:23Z) - Self-Adjusting Evolutionary Algorithms Are Slow on Multimodal Landscapes [0.0]
We show that the positive results do not extend to other types of local optima.
On the distorted OneMax benchmark, the self-adjusting $(1, lambda)$-EA is slowed down just as elitist algorithms because self-adaptation prevents the algorithm from escaping from local optima.
arXiv Detail & Related papers (2024-04-18T10:01:08Z) - Second-Order Fine-Tuning without Pain for LLMs:A Hessian Informed Zeroth-Order Optimizer [43.86938914517675]
Fine-tuning large language models (LLMs) with classic first-order memory entails prohibitive GPU due to the backpropagation process.
Recent works have turned to zeroth-orders for fine-tuning, which save substantial memory by using two forward passes.
In this work, we propose HiZOO, a diagonal Hessian informed zeroth-order trajectories.
arXiv Detail & Related papers (2024-02-23T08:11:55Z) - MADA: Meta-Adaptive Optimizers through hyper-gradient Descent [73.1383658672682]
We introduce Meta-Adaptives (MADA), a unified framework that can generalize several known convergences and dynamically learn the most suitable one during training.
We empirically compare MADA to other populars on vision and language tasks, and find that MADA consistently outperforms Adam and other populars.
We also propose AVGrad, a modification of AMS that replaces the maximum operator with averaging, which is more suitable for hyper-gradient optimization.
arXiv Detail & Related papers (2024-01-17T00:16:46Z) - AdaLomo: Low-memory Optimization with Adaptive Learning Rate [59.64965955386855]
We introduce low-memory optimization with adaptive learning rate (AdaLomo) for large language models.
AdaLomo results on par with AdamW, while significantly reducing memory requirements, thereby lowering the hardware barrier to training large language models.
arXiv Detail & Related papers (2023-10-16T09:04:28Z) - ELRA: Exponential learning rate adaption gradient descent optimization
method [83.88591755871734]
We present a novel, fast (exponential rate), ab initio (hyper-free) gradient based adaption.
The main idea of the method is to adapt the $alpha by situational awareness.
It can be applied to problems of any dimensions n and scales only linearly.
arXiv Detail & Related papers (2023-09-12T14:36:13Z) - VeLO: Training Versatile Learned Optimizers by Scaling Up [67.90237498659397]
We leverage the same scaling approach behind the success of deep learning to learn versatiles.
We train an ingest for deep learning which is itself a small neural network that ingests and outputs parameter updates.
We open source our learned, meta-training code, the associated train test data, and an extensive benchmark suite with baselines at velo-code.io.
arXiv Detail & Related papers (2022-11-17T18:39:07Z) - Learning to Optimize Quasi-Newton Methods [22.504971951262004]
This paper introduces a novel machine learning called LODO, which tries to online meta-learn the best preconditioner during optimization.
Unlike other L2O methods, LODO does not require any meta-training on a training task distribution.
We show that our gradient approximates the inverse Hessian in noisy loss landscapes and is capable of representing a wide range of inverse Hessians.
arXiv Detail & Related papers (2022-10-11T03:47:14Z) - DOGE-Train: Discrete Optimization on GPU with End-to-end Training [28.795080637690095]
We present a fast, scalable, data-driven approach for solving relaxations of 0-1 integer linear programs.
We use a combination of graph neural networks (GNN) and the Lagrange decomposition based algorithm FastDOG.
arXiv Detail & Related papers (2022-05-23T21:09:41Z) - Training Learned Optimizers with Randomly Initialized Learned Optimizers [49.67678615506608]
We show that a population of randomly learneds can be used to train themselves from scratch in an online fashion.
A form of population based training is used to orchestrate this self-training.
We believe feedback loops of this type will be important and powerful in the future of machine learning.
arXiv Detail & Related papers (2021-01-14T19:07:17Z) - Divide and Learn: A Divide and Conquer Approach for Predict+Optimize [50.03608569227359]
The predict+optimize problem combines machine learning ofproblem coefficients with a optimization prob-lem that uses the predicted coefficients.
We show how to directlyexpress the loss of the optimization problem in terms of thepredicted coefficients as a piece-wise linear function.
We propose a novel divide and algorithm to tackle optimization problems without this restriction and predict itscoefficients using the optimization loss.
arXiv Detail & Related papers (2020-12-04T00:26:56Z) - Reverse engineering learned optimizers reveals known and novel
mechanisms [50.50540910474342]
Learneds are algorithms that can themselves be trained to solve optimization problems.
Our results help elucidate the previously murky understanding of how learneds work, and establish tools for interpreting future learneds.
arXiv Detail & Related papers (2020-11-04T07:12:43Z) - A new regret analysis for Adam-type algorithms [78.825194932103]
In theory, regret guarantees for online convex optimization require a rapidly decaying $beta_1to0$ schedule.
We propose a novel framework that allows us to derive optimal, data-dependent regret bounds with a constant $beta_1$.
arXiv Detail & Related papers (2020-03-21T19:19:51Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.