Stable-SPAM: How to Train in 4-Bit More Stably than 16-Bit Adam
- URL: http://arxiv.org/abs/2502.17055v2
- Date: Fri, 11 Apr 2025 19:48:37 GMT
- Title: Stable-SPAM: How to Train in 4-Bit More Stably than 16-Bit Adam
- Authors: Tianjin Huang, Haotian Hu, Zhenyu Zhang, Gaojie Jin, Xiang Li, Li Shen, Tianlong Chen, Lu Liu, Qingsong Wen, Zhangyang Wang, Shiwei Liu
- Abstract summary: Low-bit precision amplifies sensitivity to learning rates and often causes unstable gradient norms. We propose Stable-SPAM, which incorporates enhanced gradient normalization and clipping techniques. Experiments show that Stable-SPAM effectively stabilizes gradient norms in 4-bit training, delivering superior performance compared to Adam and SPAM.
- Score: 94.00189300897694
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper comprehensively evaluates several recently proposed optimizers for 4-bit training, revealing that low-bit precision amplifies sensitivity to learning rates and often causes unstable gradient norms, leading to divergence at higher learning rates. Among these, SPAM, a recent optimizer featuring momentum reset and spike-aware gradient clipping, achieves the best performance across various bit levels, but struggles to stabilize gradient norms, requiring careful learning rate tuning. To address these limitations, we propose Stable-SPAM, which incorporates enhanced gradient normalization and clipping techniques. In particular, Stable-SPAM (1) adaptively updates the clipping threshold for spiked gradients by tracking their historical maxima; (2) normalizes the entire gradient matrix based on its historical $l_2$-norm statistics; and $(3)$ inherits momentum reset from SPAM to periodically reset the first and second moments of Adam, mitigating the accumulation of spiked gradients. Extensive experiments show that Stable-SPAM effectively stabilizes gradient norms in 4-bit LLM training, delivering superior performance compared to Adam and SPAM. Notably, our 4-bit LLaMA-1B model trained with Stable-SPAM outperforms the BF16 LLaMA-1B trained with Adam by up to $2$ perplexity. Furthermore, when both models are trained in 4-bit, Stable-SPAM achieves the same loss as Adam while requiring only about half the training steps. Code is available at https://github.com/TianjinYellow/StableSPAM.git.
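For intuition, the sketch below combines the three ingredients named in the abstract, adaptive spike clipping against a tracked historical maximum, rescaling of the whole gradient matrix by a historical $l_2$-norm statistic, and periodic momentum reset, in a single Adam-style step. It is a minimal illustration only: the class name, hyperparameters, and exact smoothing formulas are assumptions, and the authors' actual implementation is the linked repository.

```python
import numpy as np

class StableSPAMSketch:
    """Adam-style step with (1) adaptive spike clipping, (2) gradient rescaling
    from a historical l2-norm statistic, and (3) periodic momentum reset.
    Constants and exact formulas are illustrative assumptions."""

    def __init__(self, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8,
                 gamma=0.7, reset_interval=500):
        self.lr, self.beta1, self.beta2, self.eps = lr, beta1, beta2, eps
        self.gamma = gamma                    # smoothing for historical statistics (assumed)
        self.reset_interval = reset_interval  # momentum reset period (assumed)
        self.m = self.v = None                # Adam first/second moments
        self.max_abs = 0.0                    # tracked historical max of |g| entries
        self.norm_stat = 0.0                  # tracked historical ||g||_2 statistic
        self.t = self.k = 0                   # global step / steps since last reset

    def step(self, w, g):
        self.t += 1
        # (3) Momentum reset inherited from SPAM: periodically zero both moments so
        # accumulated spiked gradients stop influencing the update.
        if self.m is None or self.t % self.reset_interval == 0:
            self.m, self.v = np.zeros_like(w), np.zeros_like(w)
            self.k = 0
        self.k += 1

        # (1) Adaptive spike clipping: track the largest gradient entry seen so far
        # (smoothed) and clip entries beyond that threshold.
        cur_max = float(np.abs(g).max())
        self.max_abs = cur_max if self.t == 1 else (
            self.gamma * self.max_abs + (1 - self.gamma) * cur_max)
        g = np.clip(g, -self.max_abs, self.max_abs)

        # (2) Norm rescaling: keep the l2 norm of the whole gradient matrix on the
        # scale of its historical statistic.
        g_norm = float(np.linalg.norm(g))
        self.norm_stat = g_norm if self.t == 1 else (
            self.gamma * self.norm_stat + (1 - self.gamma) * g_norm)
        g = g * (self.norm_stat / (g_norm + self.eps))

        # Standard Adam update on the conditioned gradient.
        self.m = self.beta1 * self.m + (1 - self.beta1) * g
        self.v = self.beta2 * self.v + (1 - self.beta2) * g * g
        m_hat = self.m / (1 - self.beta1 ** self.k)
        v_hat = self.v / (1 - self.beta2 ** self.k)
        return w - self.lr * m_hat / (np.sqrt(v_hat) + self.eps)

# Toy usage: one step on a random "gradient".
opt = StableSPAMSketch()
w = np.zeros((4, 4))
w = opt.step(w, np.random.randn(4, 4))
```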
Related papers
- ZClip: Adaptive Spike Mitigation for LLM Pre-Training [0.3574867616159909]
Training large language models (LLMs) presents numerous challenges, including gradient instability and loss spikes.
Traditional gradient clipping techniques, such as constant or norm-based methods, fail to address these issues effectively.
We propose ZClip, an adaptive gradient clipping algorithm that dynamically adjusts the clipping threshold based on statistical properties of gradient norms over time.
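The summary above only names the idea; the sketch below shows one way clipping driven by running gradient-norm statistics can look, using an exponential moving mean and variance with a z-score-style cutoff. The coefficients and the exact detection rule are assumptions for illustration, not ZClip's published algorithm.

```python
def adaptive_norm_clip_scale(grad_norm, state, alpha=0.97, z_max=2.5, eps=1e-12):
    """Return a factor in (0, 1] to multiply the gradients by: track an exponential
    moving mean/variance of the gradient norm and shrink the step whenever the
    current norm is a statistical outlier. All constants are assumptions."""
    if "mean" not in state:                    # warm-up: record the first norm only
        state["mean"], state["var"] = grad_norm, 0.0
        return 1.0
    mean, var = state["mean"], state["var"]
    threshold = mean + z_max * var ** 0.5      # clip norms this many sigmas above the mean
    clipped = min(grad_norm, threshold)
    state["mean"] = alpha * mean + (1 - alpha) * clipped
    state["var"] = alpha * var + (1 - alpha) * (clipped - mean) ** 2
    return clipped / max(grad_norm, eps)

# Usage: compute scale = adaptive_norm_clip_scale(current_grad_norm, state), then
# multiply every gradient tensor by `scale` before the optimizer step.
```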
arXiv Detail & Related papers (2025-04-03T11:41:55Z)
- AdaGC: Improving Training Stability for Large Language Model Pretraining [18.163318397205533]
Large Language Models (LLMs) face increasingly frequent loss spikes as they scale.
While global gradient clipping mitigates this, traditional approaches overlook parameter-specific gradient variations.
We show that AdaGC converges 25% faster than global clipping.
arXiv Detail & Related papers (2025-02-16T08:13:23Z)
- SPAM: Spike-Aware Adam with Momentum Reset for Stable LLM Training [60.9776082805359]
Large Language Models (LLMs) have demonstrated exceptional performance across diverse tasks, yet their training remains highly resource-intensive and susceptible to training instability.
This paper presents a comprehensive investigation into gradient spikes observed during LLM training, revealing their prevalence across multiple architectures and datasets.
We propose Spike-Aware Adam with Momentum Reset (SPAM), a novel optimizer designed to counteract gradient spikes through momentum reset and spike-aware clipping.
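As an assumed illustration of spike-aware clipping (not necessarily SPAM's exact rule), one can flag gradient entries whose squared value is far larger than their running second-moment estimate and scale them back:

```python
import numpy as np

def spike_aware_clip(g, v, theta=5000.0, eps=1e-8):
    """Entries whose squared gradient greatly exceeds the running second moment v
    are treated as spikes and rescaled; the ratio test and threshold theta are
    illustrative assumptions."""
    spike = g * g > theta * (v + eps)
    # Shrink spiked entries so their squared value equals theta * v, keeping the sign.
    return np.where(spike, np.sign(g) * np.sqrt(theta * (v + eps)), g)
```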
arXiv Detail & Related papers (2025-01-12T15:21:22Z)
- Zeroth-Order Fine-Tuning of LLMs in Random Subspaces [66.27334633749734]
As language models grow in size, memory demands for backpropagation increase.
Zeroth-order (ZO) optimization methods offer a memory-efficient alternative.
We show that SubZero enhances fine-tuning and achieves faster convergence compared to standard ZO approaches.
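For context, the generic two-point zeroth-order estimator below shows where the memory savings come from: gradients are approximated with forward passes only, so no activations are stored for backpropagation. SubZero's specific random-subspace construction is not reproduced here.

```python
import numpy as np

def zo_two_point_grad(loss_fn, w, mu=1e-3, rng=None):
    """Generic two-point zeroth-order gradient estimate: perturb the parameters
    along a random direction and difference two loss evaluations."""
    rng = np.random.default_rng() if rng is None else rng
    u = rng.standard_normal(w.shape)
    return (loss_fn(w + mu * u) - loss_fn(w - mu * u)) / (2 * mu) * u

# Toy usage: one SGD-style step on a quadratic loss, using only forward evaluations.
loss = lambda x: float(np.sum(x ** 2))
w = np.ones(4)
w = w - 0.1 * zo_two_point_grad(loss, w)
```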
arXiv Detail & Related papers (2024-10-11T17:01:43Z)
- Fira: Can We Achieve Full-rank Training of LLMs Under Low-rank Constraint? [40.94505326255136]
Low-rank training has emerged as a promising approach for reducing memory usage in training Large Language Models.
We propose a new plug-and-play training framework for LLMs called Fira, as the first attempt to achieve full-rank training under a low-rank constraint.
We show that Fira outperforms both LoRA and GaLore, achieving performance that is comparable to or even better than full-rank training.
arXiv Detail & Related papers (2024-10-02T14:58:27Z)
- S-STE: Continuous Pruning Function for Efficient 2:4 Sparse Pre-training [20.113352600259226]
We propose S-STE, a simple yet powerful 2:4 training method with two parts: continuously projecting weights to be 2:4 sparse, and rescaling sparse weights with a per-tensor fixed scaling factor.
Results show that our method surpasses previous 2:4 pre-training recipes and is comparable even to full-parameter models.
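To make the 2:4 pattern concrete, here is a hard magnitude-based projection (keep the two largest-magnitude entries in every group of four) plus one plausible per-tensor rescaling that matches the $l_2$ norm of the dense tensor. The rescaling choice is an assumption, and S-STE's actual contribution is a continuous relaxation of this hard pruning step.

```python
import numpy as np

def project_2to4(w):
    """Hard 2:4 projection: in every group of 4 consecutive entries along the last
    axis (assumed divisible by 4), zero the 2 smallest-magnitude entries."""
    groups = w.reshape(-1, 4)
    drop = np.argsort(np.abs(groups), axis=1)[:, :2]   # indices of the 2 smallest
    mask = np.ones_like(groups, dtype=bool)
    np.put_along_axis(mask, drop, False, axis=1)
    return (groups * mask).reshape(w.shape)

def per_tensor_rescale(w_dense, w_sparse, eps=1e-12):
    """One plausible per-tensor fixed scaling factor (an assumption, not necessarily
    the paper's choice): match the l2 norm of the sparse tensor to the dense one."""
    return w_sparse * (np.linalg.norm(w_dense) / (np.linalg.norm(w_sparse) + eps))

w = np.random.randn(8, 16)
w_24 = per_tensor_rescale(w, project_2to4(w))
```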
arXiv Detail & Related papers (2024-09-13T08:29:36Z)
- Inverse-Free Fast Natural Gradient Descent Method for Deep Learning [52.0693420699086]
We present a fast natural gradient descent (FNGD) method that only requires inversion during the first epoch.
The FNGD update exhibits similarities to the averaged sum of gradients used in first-order methods, so its computational complexity is comparable to that of first-order methods.
arXiv Detail & Related papers (2024-03-06T05:13:28Z)
- Stable and low-precision training for large-scale vision-language models [108.62077651227607]
We introduce new methods for accelerating and stabilizing training for large language-vision models.
For acceleration, we introduce SwitchBack, a linear layer for int8 quantized training which provides a speed-up of 13-25%.
For stability, we analyze loss spikes and find they consistently occur 1-8 iterations after the squared gradients become under-estimated.
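As a generic illustration of int8 quantized matrix multiplication (not the exact SwitchBack layer, whose design is only summarized above), symmetric per-tensor quantization of both operands looks like this:

```python
import numpy as np

def int8_matmul(x, w):
    """Symmetric per-tensor int8 quantized matmul sketch: quantize both operands to
    int8, multiply with an int32 accumulator, then dequantize with both scales."""
    sx = np.abs(x).max() / 127.0 + 1e-12
    sw = np.abs(w).max() / 127.0 + 1e-12
    xq = np.clip(np.round(x / sx), -127, 127).astype(np.int8)
    wq = np.clip(np.round(w / sw), -127, 127).astype(np.int8)
    acc = xq.astype(np.int32) @ wq.astype(np.int32)    # integer accumulation
    return acc.astype(np.float32) * (sx * sw)          # back to floating point

x, w = np.random.randn(4, 8), np.random.randn(8, 3)
print(np.max(np.abs(int8_matmul(x, w) - x @ w)))       # small quantization error
```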
arXiv Detail & Related papers (2023-04-25T17:38:18Z)
- Large Scale Private Learning via Low-rank Reparametrization [77.38947817228656]
We propose a reparametrization scheme to address the challenges of applying differentially private SGD on large neural networks.
We are the first to apply differential privacy to the BERT model, achieving an average accuracy of 83.9% on four downstream tasks.
arXiv Detail & Related papers (2021-06-17T10:14:43Z)