Adam-mini: Use Fewer Learning Rates To Gain More
- URL: http://arxiv.org/abs/2406.16793v7
- Date: Mon, 24 Feb 2025 11:29:08 GMT
- Title: Adam-mini: Use Fewer Learning Rates To Gain More
- Authors: Yushun Zhang, Congliang Chen, Ziniu Li, Tian Ding, Chenwei Wu, Diederik P. Kingma, Yinyu Ye, Zhi-Quan Luo, Ruoyu Sun,
- Abstract summary: Adam-mini reduces memory by cutting down the learning rate resources in Adam. Adam-mini achieves on par or better performance than AdamW with 50% less memory footprint.
- Score: 29.170425801678952
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We propose Adam-mini, an optimizer that achieves on par or better performance than AdamW with 50% less memory footprint. Adam-mini reduces memory by cutting down the learning rate resources in Adam (i.e., $1/\sqrt{v}$). By investigating the Hessian structure of neural nets, we find Adam's $v$ might not function at its full potential as effectively as we expected. We find that $\geq$ 99.9% of these learning rates in $v$ could be harmlessly removed if we (1) carefully partition the parameters into blocks following our new principle on Hessian structure; (2) assign a single but good learning rate to each parameter block. We then provide one simple way to find good learning rates and propose Adam-mini. Empirically, we verify that Adam-mini performs on par or better than AdamW on various language models sized from 39M to 13B for pre-training, supervised fine-tuning, and RLHF. The reduced memory footprint of Adam-mini also alleviates communication overheads among GPUs, thereby increasing throughput. For instance, Adam-mini achieves 49.6% higher throughput than AdamW when pre-training Llama 2-7B on $2\times$ A800-80GB GPUs, which saves 33% wall-clock time for pre-training.
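As a concrete illustration of the idea above, here is a minimal PyTorch-style sketch of a block-wise update in the spirit of Adam-mini: momentum is kept per parameter, but the second-moment term $v$ is reduced to one scalar per parameter block (an EMA of the mean squared gradient in that block). This is not the authors' released implementation; the block partitioning, function signature, and hyperparameter names are simplified assumptions.

```python
import torch

def adam_mini_step(blocks, state, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8, step=1):
    """One block-wise update in the spirit of Adam-mini (illustrative only).

    `blocks` maps a block name to the list of parameters that share a single
    second-moment scalar; in the paper, blocks follow the near-block-diagonal
    Hessian structure (e.g. per attention head, per layer component).
    """
    for name, block_params in blocks.items():
        st = state.setdefault(name, {
            "m": [torch.zeros_like(p) for p in block_params],  # per-parameter momentum
            "v": 0.0,                                           # one scalar per block
        })

        # v is an EMA of the *mean* squared gradient over the whole block.
        sq_sum = sum(float((p.grad ** 2).sum()) for p in block_params)
        n_elem = sum(p.grad.numel() for p in block_params)
        st["v"] = beta2 * st["v"] + (1 - beta2) * (sq_sum / n_elem)

        v_hat = st["v"] / (1 - beta2 ** step)
        denom = v_hat ** 0.5 + eps
        for m, p in zip(st["m"], block_params):
            m.mul_(beta1).add_(p.grad, alpha=1 - beta1)
            m_hat = m / (1 - beta1 ** step)
            p.data.add_(m_hat / denom, alpha=-lr)  # single learning-rate scale per block
```

Under this scheme the optimizer stores one float per block for $v$ instead of one per parameter, which is where the memory saving comes from.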
Related papers
- When Can You Get Away with Low Memory Adam? [48.30892531847662]
We show that SlimAdam matches Adam's performance and stability while saving up to 98% of total second moments.
Code for SlimAdam is available at https://github.com/dayal-kalra/low-memory-adam.
arXiv Detail & Related papers (2025-03-03T18:59:40Z) - Adam Exploits $\ell_\infty$-geometry of Loss Landscape via Coordinate-wise Adaptivity [6.270305440413688]
Our experiments confirm that Adam performs much worse when the favorable $\ell_\infty$-geometry is removed, while SGD provably remains unaffected.
arXiv Detail & Related papers (2024-10-10T17:58:53Z) - Symbolic Discovery of Optimization Algorithms [132.62397077095787]
We use efficient search techniques to explore an infinite and sparse program space.
Our method discovers a simple and effective optimization algorithm, Lion.
Lion is successfully deployed in production systems such as Google search ads CTR model.
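For reference, Lion is a sign-based rule that tracks only a momentum buffer (no second moment). The sketch below is a simplified restatement of the published update, not code from the paper; the function name and defaults are chosen here for illustration.

```python
import torch

def lion_step(p, g, m, lr=1e-4, beta1=0.9, beta2=0.99, weight_decay=0.0):
    """One Lion update for a single parameter tensor (simplified sketch).

    The update direction is the sign of an interpolation between the momentum
    buffer `m` and the current gradient `g`.
    """
    update = torch.sign(beta1 * m + (1 - beta1) * g)
    p.data.add_(update + weight_decay * p.data, alpha=-lr)
    m.mul_(beta2).add_(g, alpha=1 - beta2)  # momentum tracked with a separate beta2
```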
arXiv Detail & Related papers (2023-02-13T20:27:30Z) - Provable Adaptivity of Adam under Non-uniform Smoothness [79.25087082434975]
Adam is widely adopted in practical applications due to its fast convergence.
Existing convergence analyses for Adam rely on the bounded smoothness assumption.
This paper studies the convergence of randomly reshuffled Adam with diminishing learning rate.
arXiv Detail & Related papers (2022-08-21T14:57:47Z) - Maximizing Communication Efficiency for Large-scale Training via 0/1 Adam [49.426602335460295]
1-bit communication is an effective method to scale up model training, and has been studied extensively on SGD.
We propose 0/1 Adam, which improves upon the state-of-the-art 1-bit Adam via two novel methods.
arXiv Detail & Related papers (2022-02-12T08:02:23Z) - Understanding AdamW through Proximal Methods and Scale-Freeness [57.47324825501137]
AdamW is often compared with Adam-$\ell_2$, i.e., Adam with $\ell_2$ regularization.
AdamW decouples the weight decay from the gradient-based update used in Adam-$\ell_2$.
We show that the advantage of AdamW over Adam-$\ell_2$ correlates with the degree to which we expect the gradients of the network to exhibit multiple scales.
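To make the "decoupling" concrete, the sketch below contrasts the two updates for a single parameter tensor: Adam-$\ell_2$ folds the weight-decay term into the gradient before the adaptive rescaling, whereas AdamW applies the decay to the weights directly. This is a schematic comparison written for this summary, not code from the paper.

```python
import torch

def adam_l2_step(p, g, m, v, t, lr=1e-3, wd=0.01, b1=0.9, b2=0.999, eps=1e-8):
    """Adam-l2: weight decay is added to the gradient, so it is rescaled by the adaptive term."""
    g = g + wd * p                                # decay folded into the gradient
    m.mul_(b1).add_(g, alpha=1 - b1)
    v.mul_(b2).addcmul_(g, g, value=1 - b2)
    m_hat, v_hat = m / (1 - b1 ** t), v / (1 - b2 ** t)
    p.sub_(lr * m_hat / (v_hat.sqrt() + eps))

def adamw_step(p, g, m, v, t, lr=1e-3, wd=0.01, b1=0.9, b2=0.999, eps=1e-8):
    """AdamW: decay is decoupled from the adaptive update and applied to the weights directly."""
    m.mul_(b1).add_(g, alpha=1 - b1)
    v.mul_(b2).addcmul_(g, g, value=1 - b2)
    m_hat, v_hat = m / (1 - b1 ** t), v / (1 - b2 ** t)
    p.sub_(lr * (m_hat / (v_hat.sqrt() + eps) + wd * p))
```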
arXiv Detail & Related papers (2022-01-31T21:00:55Z) - Adam$^+$: A Stochastic Method with Adaptive Variance Reduction [56.051001950733315]
Adam is a widely used optimization method for deep learning applications.
We propose a new method named Adam$^+$ (pronounced as Adam-plus).
Our empirical studies on various deep learning tasks, including image classification, language modeling, and automatic speech recognition, demonstrate that Adam$^+$ significantly outperforms Adam.
arXiv Detail & Related papers (2020-11-24T09:28:53Z) - EAdam Optimizer: How $\epsilon$ Impact Adam [7.0552555621312605]
We discuss the impact of the constant $\epsilon$ for Adam in this paper.
Based on this finding, we propose a new variant of Adam called EAdam.
Our method can bring significant improvement compared with Adam.
arXiv Detail & Related papers (2020-11-04T06:39:44Z) - Adam with Bandit Sampling for Deep Learning [18.033149110113378]
We propose a generalization of Adam, called Adambs, that allows us to adapt to different training examples.
Experiments on various models and datasets demonstrate Adambs's fast convergence in practice.
arXiv Detail & Related papers (2020-10-24T21:01:26Z)
This list is automatically generated from the titles and abstracts of the papers on this site.