CAME: Confidence-guided Adaptive Memory Efficient Optimization
- URL: http://arxiv.org/abs/2307.02047v2
- Date: Mon, 7 Aug 2023 06:21:31 GMT
- Title: CAME: Confidence-guided Adaptive Memory Efficient Optimization
- Authors: Yang Luo, Xiaozhe Ren, Zangwei Zheng, Zhuo Jiang, Xin Jiang, Yang You
- Abstract summary: Adaptive gradient methods have demonstrated excellent performance in the training of large language models.
Maintaining second-moment estimates of the per-parameter gradients incurs a high extra memory overhead.
Several memory-efficient optimizers have been proposed that drastically reduce auxiliary memory usage, but at a performance penalty.
We propose CAME to simultaneously achieve two goals: fast convergence as in traditional adaptive methods, and low memory usage as in memory-efficient methods.
- Score: 20.009302737137787
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Adaptive gradient methods, such as Adam and LAMB, have demonstrated excellent
performance in the training of large language models. Nevertheless, the need
for adaptivity requires maintaining second-moment estimates of the
per-parameter gradients, which entails a high cost of extra memory overheads.
To solve this problem, several memory-efficient optimizers (e.g., Adafactor)
have been proposed to obtain a drastic reduction in auxiliary memory usage, but
with a performance penalty. In this paper, we first study a confidence-guided
strategy to reduce the instability of existing memory efficient optimizers.
Based on this strategy, we propose CAME to simultaneously achieve two goals:
fast convergence as in traditional adaptive methods, and low memory usage as in
memory-efficient methods. Extensive experiments demonstrate the training
stability and superior performance of CAME across various NLP tasks such as
BERT and GPT-2 training. Notably, for BERT pre-training on the large batch size
of 32,768, our proposed optimizer attains faster convergence and higher
accuracy compared with the Adam optimizer. The implementation of CAME is
publicly available.
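The abstract does not spell out the update rule, but the Python sketch below illustrates the two ingredients it names: an Adafactor-style factored second moment to keep auxiliary memory small, and a confidence term that damps the step where the current gradient disagrees with the running momentum. The variable names and exact formulas are illustrative assumptions, not the published CAME algorithm.

```python
import numpy as np

def confidence_guided_step(param, grad, state, lr=1e-3,
                           beta1=0.9, beta2=0.999, eps=1e-8):
    """One illustrative update for a single 2-D weight matrix.

    The second moment is kept in factored form (per-row and per-column
    averages, as in Adafactor), and the step is additionally damped where
    the gradient disagrees with the running momentum ("low confidence").
    """
    # Factored second-moment statistics: O(n + m) memory instead of O(n * m).
    state["row"] = beta2 * state["row"] + (1 - beta2) * (grad ** 2).mean(axis=1)
    state["col"] = beta2 * state["col"] + (1 - beta2) * (grad ** 2).mean(axis=0)
    v_hat = np.outer(state["row"], state["col"]) / (state["row"].mean() + eps)

    update = grad / (np.sqrt(v_hat) + eps)       # Adafactor-style normalized step

    # First moment (momentum) of the normalized update.
    state["m"] = beta1 * state["m"] + (1 - beta1) * update

    # Confidence: small deviation between the step and its momentum -> keep the
    # step large; large deviation -> damp it.
    deviation = (update - state["m"]) ** 2
    confident_update = state["m"] / (np.sqrt(deviation) + eps)

    return param - lr * confident_update


# Minimal usage on a random 4x3 weight matrix with stand-in gradients.
rng = np.random.default_rng(0)
w = rng.normal(size=(4, 3))
state = {"row": np.zeros(4), "col": np.zeros(3), "m": np.zeros((4, 3))}
for _ in range(5):
    w = confidence_guided_step(w, rng.normal(size=w.shape), state)
print(w.shape)
```

In this sketch the per-matrix second-moment statistics cost O(n + m) memory instead of O(n * m); the confidence term is kept in full size for clarity, whereas a fully memory-efficient optimizer would factor it as well.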
Related papers
- Efficient Adaptive Optimization via Subset-Norm and Subspace-Momentum: Fast, Memory-Reduced Training with Convergence Guarantees [5.399838579600896]
We introduce two complementary techniques for memory optimization.
One technique, Subset-Norm, reduces the memory footprint of the adaptive step-size (second-moment) state by sharing a step size across subsets of parameters.
The other technique, Subspace-Momentum, reduces the momentum state's memory footprint by operating in a low-dimensional subspace.
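The summary names the two techniques but not their mechanics. As a rough sketch of the subspace idea only (the projection, rank, and update rule below are assumptions, and the paper may additionally handle the gradient component outside the subspace), momentum can be tracked inside a fixed low-dimensional subspace so its state scales with the subspace rank rather than the parameter count:

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 10_000, 32                              # parameter count, subspace rank (assumed)
P, _ = np.linalg.qr(rng.normal(size=(n, k)))   # fixed orthonormal basis of the subspace

def subspace_momentum_step(w, grad, m_sub, lr=1e-2, beta=0.9):
    """Momentum is accumulated only on the projected gradient (k floats of
    state instead of n), then lifted back for the parameter update."""
    g_sub = P.T @ grad                         # project the gradient into the subspace
    m_sub = beta * m_sub + (1 - beta) * g_sub
    return w - lr * (P @ m_sub), m_sub

w, m_sub = rng.normal(size=n), np.zeros(k)
for _ in range(3):
    w, m_sub = subspace_momentum_step(w, rng.normal(size=n), m_sub)
print(m_sub.shape)                             # (32,): state scales with k, not n
```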
arXiv Detail & Related papers (2024-11-11T16:48:07Z) - Memory-Efficient Optimization with Factorized Hamiltonian Descent [11.01832755213396]
We introduce a novel adaptive optimizer, H-Fac, which incorporates a memory-efficient factorization approach to address this challenge.
By employing a rank-1 parameterization for both momentum and scaling parameter estimators, H-Fac reduces memory costs to a sublinear level.
We develop our algorithms based on principles derived from Hamiltonian dynamics, providing robust theoretical underpinnings in optimization dynamics and convergence guarantees.
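The summary does not give the exact estimators; the snippet below only illustrates what a rank-1 (row/column) parameterization of a 2-D statistic buys in memory, using squared gradients as an example. The same idea can in principle be applied to a momentum buffer.

```python
import numpy as np

rng = np.random.default_rng(1)
G2 = rng.normal(size=(768, 3072)) ** 2     # e.g. squared gradients of one weight matrix

# Rank-1 parameterization: one vector per row dimension and one per column
# dimension; the full matrix is reconstructed on the fly by an outer product
# instead of being stored with one float per parameter.
r = G2.mean(axis=1)                        # 768 values
c = G2.mean(axis=0)                        # 3072 values
approx = np.outer(r, c) / (G2.mean() + 1e-30)

print(G2.size, r.size + c.size)            # 2,359,296 vs 3,840 floats of state
print(float(np.abs(G2 - approx).mean()))   # reconstruction error of the rank-1 proxy
```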
arXiv Detail & Related papers (2024-06-14T12:05:17Z) - Adaptive Preference Scaling for Reinforcement Learning with Human Feedback [103.36048042664768]
Reinforcement learning from human feedback (RLHF) is a prevalent approach to align AI systems with human values.
We propose a novel adaptive preference loss, underpinned by distributionally robust optimization (DRO).
Our method is versatile and can be readily adapted to various preference optimization frameworks.
arXiv Detail & Related papers (2024-06-04T20:33:22Z) - Time-, Memory- and Parameter-Efficient Visual Adaptation [75.28557015773217]
We propose an adaptation method which does not backpropagate gradients through the backbone.
We achieve this by designing a lightweight network in parallel that operates on features from the frozen, pretrained backbone.
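The summary describes the mechanism only at a high level; the sketch below (a hypothetical backbone, head, and feature tap, all assumed) shows the key property: the loss is computed from a small trainable network fed by features of a frozen backbone, so no gradients are ever backpropagated through the backbone.

```python
import torch
import torch.nn as nn

# Hypothetical frozen backbone and lightweight parallel head (shapes assumed).
backbone = nn.Sequential(nn.Linear(224, 512), nn.ReLU(), nn.Linear(512, 512))
backbone.requires_grad_(False)            # frozen: its weights never receive gradients

adapter = nn.Sequential(nn.Linear(512, 64), nn.ReLU(), nn.Linear(64, 10))

x = torch.randn(8, 224)                   # stand-in batch
with torch.no_grad():                     # features are produced outside autograd,
    feats = backbone(x)                   # so backprop never touches the backbone

loss = adapter(feats).sum()               # only the small adapter is in the graph
loss.backward()
print(all(p.grad is None for p in backbone.parameters()))   # True
```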
arXiv Detail & Related papers (2024-02-05T10:55:47Z) - ROAM: memory-efficient large DNN training via optimized operator ordering and memory layout [8.99065455675796]
We propose ROAM, which operates at the graph level to derive a memory-efficient execution plan with an optimized operator order and tensor memory layout for models.
Experiments show that ROAM achieves substantial memory reductions of 35.7%, 13.3%, and 27.2% compared with PyTorch and two state-of-the-art methods, and offers a remarkable 53.7x speedup.
arXiv Detail & Related papers (2023-10-30T06:29:21Z) - Federated Learning of Large Language Models with Parameter-Efficient Prompt Tuning and Adaptive Optimization [71.87335804334616]
Federated learning (FL) is a promising paradigm to enable collaborative model training with decentralized data.
The training process of Large Language Models (LLMs) generally requires updating a significant number of parameters.
This paper proposes an efficient partial prompt tuning approach to improve performance and efficiency simultaneously.
arXiv Detail & Related papers (2023-10-23T16:37:59Z) - AdaLomo: Low-memory Optimization with Adaptive Learning Rate [59.64965955386855]
We introduce low-memory optimization with adaptive learning rate (AdaLomo) for large language models.
AdaLomo achieves results on par with AdamW, while significantly reducing memory requirements, thereby lowering the hardware barrier to training large language models.
arXiv Detail & Related papers (2023-10-16T09:04:28Z) - Online Convolutional Re-parameterization [51.97831675242173]
We present online convolutional re-parameterization (OREPA), a two-stage pipeline, aiming to reduce the huge training overhead by squeezing the complex training-time block into a single convolution.
Compared with the state-of-the-art re-param models, OREPA is able to save the training-time memory cost by about 70% and accelerate the training speed by around 2x.
We also conduct experiments on object detection and semantic segmentation and show consistent improvements on the downstream tasks.
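As a hedged illustration of why such a squeeze is possible (the block below is a generic 3x3-plus-1x1 example, not OREPA's actual training-time block), parallel linear branches can be folded into one convolution kernel because convolution is linear in its weights; OREPA performs this folding online during training rather than only at deployment.

```python
import torch
import torch.nn.functional as F

# Generic re-parameterizable block (assumed for illustration): a 3x3 branch plus
# a 1x1 branch whose outputs are summed.
w3 = torch.randn(16, 8, 3, 3)
w1 = torch.randn(16, 8, 1, 1)
x = torch.randn(2, 8, 32, 32)

y_block = F.conv2d(x, w3, padding=1) + F.conv2d(x, w1)

# Fold the block into a single 3x3 convolution: pad the 1x1 kernel to 3x3 and
# add it to the 3x3 kernel.
w_fused = w3 + F.pad(w1, [1, 1, 1, 1])
y_fused = F.conv2d(x, w_fused, padding=1)

print(torch.allclose(y_block, y_fused, atol=1e-5))   # True: one conv, same output
```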
arXiv Detail & Related papers (2022-04-02T09:50:19Z) - Efficient Few-Shot Object Detection via Knowledge Inheritance [62.36414544915032]
Few-shot object detection (FSOD) aims at learning a generic detector that can adapt to unseen tasks with scarce training samples.
We present an efficient pretrain-transfer framework (PTF) baseline with no computational increment.
We also propose an adaptive length re-scaling (ALR) strategy to alleviate the vector length inconsistency between the predicted novel weights and the pretrained base weights.
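The summary hints at the idea but not the exact formula; the snippet below shows one simple way such a re-scaling could look, matching each predicted novel weight vector's length to the average length of the pretrained base weights (the actual ALR strategy may differ in detail).

```python
import numpy as np

rng = np.random.default_rng(2)
base_w = rng.normal(size=(60, 256))            # pretrained base-class classifier weights
novel_w = 0.1 * rng.normal(size=(5, 256))      # predicted novel-class weights (too short)

target_len = np.linalg.norm(base_w, axis=1).mean()
novel_len = np.linalg.norm(novel_w, axis=1, keepdims=True)
novel_w = novel_w * (target_len / novel_len)   # lengths now match the base average

print(np.linalg.norm(novel_w, axis=1))         # each equals target_len
```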
arXiv Detail & Related papers (2022-03-23T06:24:31Z) - Practical tradeoffs between memory, compute, and performance in learned optimizers [46.04132441790654]
We identify and quantify the memory, compute, and performance trade-offs of design features in many learned and hand-designed optimizers.
We leverage our analysis to construct a learned optimizer that is both faster and more memory efficient than previous work.
arXiv Detail & Related papers (2022-03-22T16:36:36Z)