A Minimalist Optimizer Design for LLM Pretraining
- URL: http://arxiv.org/abs/2506.16659v1
- Date: Fri, 20 Jun 2025 00:10:35 GMT
- Title: A Minimalist Optimizer Design for LLM Pretraining
- Authors: Athanasios Glentis, Jiaxiang Li, Andi Han, Mingyi Hong
- Abstract summary: Training large language models typically relies on adaptive optimizers such as Adam. Recent works such as GaLore, Fira, and APOLLO have proposed state-compressed variants to reduce memory consumption. In this work, we investigate the minimal amount of optimizer state that is truly necessary to retain state-of-the-art performance in LLM pretraining.
- Score: 31.996047271119156
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Training large language models (LLMs) typically relies on adaptive optimizers such as Adam, which require significant memory to maintain first- and second-moment matrices, known as optimizer states. While recent works such as GaLore, Fira, and APOLLO have proposed state-compressed variants to reduce memory consumption, a fundamental question remains: What is the minimal amount of optimizer state that is truly necessary to retain state-of-the-art performance in LLM pretraining? In this work, we systematically investigate this question using a bottom-up approach. We find that two memory- and compute-efficient optimization techniques are particularly effective: (1) column-wise gradient normalization significantly boosts the performance of plain SGD without requiring momentum; and (2) adding first-order momentum only to the output layer - where gradient variance is highest - yields performance competitive with fully adaptive methods such as Muon. Based on these insights, we propose SCALE (Stochastic Column-normalized Last-layer Momentum), a new optimizer that combines column-normalized SGD with last-layer momentum, where column normalization refers to normalizing the gradient along the output dimension. Across multiple LLaMA models (60M-1B), SCALE matches or exceeds the performance of Adam while using only 35-45% of the total memory. It also consistently outperforms memory-efficient optimizers such as GaLore, Fira, and APOLLO, making it a strong candidate for large-scale pretraining under memory constraints. For the LLaMA 7B model, SCALE outperforms the state-of-the-art method APOLLO in terms of both perplexity and memory consumption. In addition, our method serves as a minimalist baseline for more sophisticated optimizer design.
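Based only on the abstract, the update rule can be sketched as follows: each weight matrix's gradient is normalized column-wise (along the output dimension), and first-order momentum is kept only for the output-layer parameters. This is a minimal sketch assuming a PyTorch-style optimizer interface; the epsilon constant, the choice of dim 0 as the output dimension, and the `is_last_layer` flag are illustrative assumptions, not details from the paper.

```python
import torch

class SCALESketch(torch.optim.Optimizer):
    """Minimal sketch of column-normalized SGD with last-layer momentum,
    following the abstract's description (not the authors' reference code)."""

    def __init__(self, params, lr=1e-3, beta=0.9, eps=1e-8):
        defaults = dict(lr=lr, beta=beta, eps=eps)
        super().__init__(params, defaults)

    @torch.no_grad()
    def step(self):
        for group in self.param_groups:
            lr, beta, eps = group["lr"], group["beta"], group["eps"]
            for p in group["params"]:
                if p.grad is None:
                    continue
                g = p.grad
                if g.ndim == 2:
                    # Column-wise normalization: rescale each column of the
                    # gradient by its norm taken along the output dimension
                    # (assumed here to be dim 0, as for nn.Linear weights).
                    g = g / (g.norm(dim=0, keepdim=True) + eps)
                if group.get("is_last_layer", False):
                    # First-order momentum only for the output layer,
                    # where the abstract says gradient variance is highest.
                    buf = self.state[p].setdefault("momentum", torch.zeros_like(p))
                    buf.mul_(beta).add_(g)
                    g = buf
                p.add_(g, alpha=-lr)
```

Under the same assumptions, usage would split parameters into two groups, e.g. `SCALESketch([{"params": body_params}, {"params": head_params, "is_last_layer": True}], lr=1e-3)`, so that only the LM-head group carries a momentum buffer.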
Related papers
- Low-rank Momentum Factorization for Memory Efficient Training [13.464518325870444]
Momentum Factorized SGD (MoFaSGD) maintains a dynamically updated low-rank SVD representation of the first-order momentum. We demonstrate MoFaSGD's effectiveness on large language model benchmarks, achieving a competitive trade-off between memory reduction (comparable to LoRA) and performance.
arXiv Detail & Related papers (2025-07-10T18:04:52Z)
- SVD-Free Low-Rank Adaptive Gradient Optimization for Large Language Models [37.60342078872549]
We propose a two-step procedure to approximate SVD-based gradient projections into lower-dimensional spaces. Our experiments on both pre-training and fine-tuning tasks demonstrate the effectiveness of our dual strategy.
arXiv Detail & Related papers (2025-05-23T14:37:00Z)
- Training Deep Learning Models with Norm-Constrained LMOs [56.00317694850397]
We propose a new family of algorithms that uses the linear minimization oracle (LMO) to adapt to the geometry of the problem. We demonstrate significant speedups on nanoGPT training using our algorithm, Scion, without any reliance on Adam.
arXiv Detail & Related papers (2025-02-11T13:10:34Z)
- Gradient Multi-Normalization for Stateless and Scalable LLM Training [16.037614012166063]
Training large language models (LLMs) typically relies on adaptive optimizers like Adam, which store additional state information to accelerate convergence but incur significant memory overhead. Recent efforts, such as SWAN (Ma et al., 2024), address this by eliminating the need for optimizer states while achieving performance comparable to Adam via a multi-step preprocessing procedure applied to instantaneous gradients. We introduce a novel framework for designing stateless optimizers that normalize gradients according to multiple norms. Experiments on pre-training LLaMA models with up to 1 billion parameters demonstrate a 3X speedup over Adam with significantly reduced memory requirements, outperforming other memory-efficient baselines.
arXiv Detail & Related papers (2025-02-10T18:09:53Z)
- APOLLO: SGD-like Memory, AdamW-level Performance [61.53444035835778]
Large language models (LLMs) are notoriously memory-intensive during training. Various memory-efficient optimizers have been proposed to reduce memory usage. However, they face critical challenges: (i) costly SVD operations; (ii) significant performance trade-offs compared to AdamW; and (iii) still substantial optimizer memory overhead to maintain competitive performance.
arXiv Detail & Related papers (2024-12-06T18:55:34Z)
- Zeroth-Order Fine-Tuning of LLMs in Random Subspaces [66.27334633749734]
As language models grow in size, memory demands for backpropagation increase.
Zeroth-order (ZO) optimization methods offer a memory-efficient alternative; a generic sketch of the idea appears after this list.
We show that SubZero enhances fine-tuning performance and achieves faster convergence compared to standard ZO approaches.
arXiv Detail & Related papers (2024-10-11T17:01:43Z)
- Bypass Back-propagation: Optimization-based Structural Pruning for Large Language Models via Policy Gradient [57.9629676017527]
We propose an optimization-based structural pruning method for large language models.
We learn the pruning masks in a probabilistic space directly by optimizing the loss of the pruned model.
Our method operates for 2.7 hours with around 35GB memory for the 13B models on a single A100 GPU.
arXiv Detail & Related papers (2024-06-15T09:31:03Z)
- Revisiting Zeroth-Order Optimization for Memory-Efficient LLM Fine-Tuning: A Benchmark [166.40879020706151]
This paper proposes a shift towards BP-free, zeroth-order (ZO) optimization as a solution for reducing memory costs during fine-tuning.
Unlike traditional ZO-SGD methods, our work expands the exploration to a wider array of ZO optimization techniques.
Our study unveils previously overlooked optimization principles, highlighting the importance of task alignment, the role of the forward gradient method, and the balance between algorithm complexity and fine-tuning performance.
arXiv Detail & Related papers (2024-02-18T14:08:48Z)
- AdaLomo: Low-memory Optimization with Adaptive Learning Rate [59.64965955386855]
We introduce low-memory optimization with adaptive learning rate (AdaLomo) for large language models.
AdaLomo achieves results on par with AdamW while significantly reducing memory requirements, thereby lowering the hardware barrier to training large language models.
arXiv Detail & Related papers (2023-10-16T09:04:28Z)
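As referenced in the zeroth-order entries above, the core idea behind ZO optimization is to estimate gradients from forward passes alone, so no backpropagation state needs to be stored. Below is a minimal sketch of the classic two-point (SPSA-style) estimator; the function name, the single-perturbation setup, and the closure-based `loss_fn` (which runs a forward pass on the current batch and returns a scalar loss tensor) are illustrative assumptions, not the specific SubZero or benchmark algorithms.

```python
import torch

def zo_sgd_step(params, loss_fn, lr=1e-4, mu=1e-3):
    """One zeroth-order SGD step using a two-point gradient estimate.

    Generic SPSA-style sketch: perturb all parameters by a random direction z,
    evaluate the loss at theta + mu*z and theta - mu*z, and use the scalar
    difference to scale z as a gradient estimate. No backward pass is needed.
    """
    # Sample a random direction for every parameter. Note this materializes a
    # full parameter-sized copy; in practice, implementations such as MeZO
    # regenerate z from a saved RNG seed to keep memory at SGD levels.
    zs = [torch.randn_like(p) for p in params]

    with torch.no_grad():
        # Loss at theta + mu * z.
        for p, z in zip(params, zs):
            p.add_(z, alpha=mu)
        loss_plus = loss_fn()

        # Loss at theta - mu * z (shift back by 2 * mu * z).
        for p, z in zip(params, zs):
            p.add_(z, alpha=-2 * mu)
        loss_minus = loss_fn()

        # Two-point estimate: g_hat = (f(+) - f(-)) / (2 * mu) * z.
        grad_scale = (loss_plus - loss_minus).item() / (2 * mu)

        # Restore parameters, then take the SGD step along z.
        for p, z in zip(params, zs):
            p.add_(z, alpha=mu)
            p.add_(z, alpha=-lr * grad_scale)
```

Because only scalar losses are exchanged between the two evaluations, memory demands do not grow with model depth the way backpropagation's activation storage does, which is what makes this family attractive for fine-tuning large models.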