AdaGC: Improving Training Stability for Large Language Model Pretraining
- URL: http://arxiv.org/abs/2502.11034v1
- Date: Sun, 16 Feb 2025 08:13:23 GMT
- Title: AdaGC: Improving Training Stability for Large Language Model Pretraining
- Authors: Guoxia Wang, Shuai Li, Congliang Chen, Jinle Zeng, Jiabin Yang, Tao Sun, Yanjun Ma, Dianhai Yu, Li Shen
- Abstract summary: Large Language Models (LLMs) face increasing loss spikes during scaling. While global gradient clipping mitigates this, traditional approaches handle parameter-specific gradient variations poorly. We show that AdaGC converges 25% faster than StableAdamW.
- Score: 18.163318397205533
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Language Models (LLMs) face increasing loss spikes during scaling, undermining training stability and final performance. While gradient clipping mitigates this issue, traditional global approaches poorly handle parameter-specific gradient variations and decaying gradient norms. We propose **AdaGC**, an adaptive gradient clipping framework that automatically adjusts local thresholds per parameter through exponential moving average of gradient norms. Theoretical analysis proves AdaGC's convergence under non-convex conditions. Extensive experiments demonstrate significant improvements: On Llama-2 7B/13B, AdaGC completely eliminates loss spikes while reducing WikiText perplexity by 3.5% (+0.14pp LAMBADA accuracy) for 7B and achieving 0.65% lower training loss with 1.47% reduced validation perplexity for 13B compared to global clipping. For CLIP ViT-Base, AdaGC converges 25% faster than StableAdamW with full spike elimination. The method shows universal effectiveness across architectures (Llama-2 7B/13B) and modalities (CLIP), with successful integration into diverse optimizers like AdamW and Lion. Source code will be released on GitHub.
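To make the abstract's mechanism concrete, below is a minimal PyTorch-style sketch of per-parameter adaptive clipping as described above: each parameter tracks an exponential moving average (EMA) of its own gradient norm and is clipped against that local threshold. The hyperparameters (`beta`, `gamma`), the warm-up handling, and the choice to update the EMA with the post-clip norm are illustrative assumptions, not the exact AdaGC recipe from the paper.

```python
import torch

def adagc_clip(params, ema_norms, beta=0.98, gamma=1.05, eps=1e-8):
    """Per-parameter adaptive gradient clipping in the spirit of AdaGC.

    Each parameter's gradient is rescaled whenever its norm exceeds
    gamma times an exponential moving average (EMA) of its own past
    norms; the EMA is then updated with the post-clip norm.
    Hyperparameter values here are illustrative, not the paper's.
    """
    for i, p in enumerate(params):
        if p.grad is None:
            continue
        g_norm = p.grad.norm()
        if ema_norms[i] is None:                      # warm-up: seed the local EMA
            ema_norms[i] = g_norm.clone()
            continue
        threshold = gamma * ema_norms[i]
        if g_norm > threshold:                        # clip only this parameter's gradient
            p.grad.mul_(threshold / (g_norm + eps))
            g_norm = threshold
        # update the local threshold statistic with the (possibly clipped) norm
        ema_norms[i] = beta * ema_norms[i] + (1 - beta) * g_norm
```

In practice, such a hook would run between `loss.backward()` and `optimizer.step()`, in the same place a global `torch.nn.utils.clip_grad_norm_` call would normally go.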
Related papers
- Stable-SPAM: How to Train in 4-Bit More Stably than 16-Bit Adam [94.00189300897694]
Low-bit precision amplifies sensitivity to learning rates and often causes unstable gradient norms.
We propose Stable-SPAM, which incorporates enhanced gradient normalization and clipping techniques.
Experiments show that Stable-SPAM effectively stabilizes gradient norms in 4-bit training, delivering superior performance compared to Adam and SPAM.
arXiv Detail & Related papers (2025-02-24T11:09:15Z) - GAQAT: gradient-adaptive quantization-aware training for domain generalization [54.31450550793485]
We propose a novel Gradient-Adaptive Quantization-Aware Training (GAQAT) framework for DG.
Our approach begins by identifying the scale-gradient conflict problem in low-precision quantization.
Extensive experiments validate the effectiveness of the proposed GAQAT framework.
arXiv Detail & Related papers (2024-12-07T06:07:21Z) - Beyond adaptive gradient: Fast-Controlled Minibatch Algorithm for large-scale optimization [1.6749379740049926]
We introduce F-CMA, a Fast-Controlled Mini-batch Algorithm with a random reshuffling method featuring a sufficient decrease condition and a line-search procedure to ensure loss reduction per epoch.
Tests show significant improvements, including a 68% decrease in overall training time, up to a 20% increase in per-epoch efficiency, and up to a 5% increase in model accuracy.
arXiv Detail & Related papers (2024-11-24T11:46:47Z) - PACE: Marrying generalization in PArameter-efficient fine-tuning with Consistency rEgularization [35.922096876707975]
PACE marries generalization in PArameter-efficient fine-tuning with Consistency rEgularization.
It not only implicitly regularizes gradients for enhanced generalization, but also implicitly aligns the fine-tuned and pre-trained models to retain knowledge.
It also improves LoRA on text classification (GLUE) and mathematical reasoning.
arXiv Detail & Related papers (2024-09-25T17:56:00Z) - LoCo: Low-Bit Communication Adaptor for Large-scale Model Training [63.040522637816906]
Low-bit communication often degrades training quality due to compression information loss.
We propose the Low-bit Communication Adaptor (LoCo), which compensates gradients on local GPU nodes before compression, without compromising quality.
Experimental results show that across large-scale model training frameworks like Megatron-LM and PyTorch's FSDP, LoCo significantly improves communication efficiency.
arXiv Detail & Related papers (2024-07-05T13:01:36Z) - Thinking Forward: Memory-Efficient Federated Finetuning of Language Models [21.438831528354513]
Finetuning large language models (LLMs) in federated learning settings requires excessive memory for resource-constrained devices.
In this paper, we introduce Spry, an FL algorithm that splits trainable weights of an LLM among participating clients.
Spry achieves a low memory footprint, high accuracy, and fast convergence.
arXiv Detail & Related papers (2024-05-24T13:37:48Z) - Sparse is Enough in Fine-tuning Pre-trained Large Language Models [98.46493578509039]
We propose a gradient-based sparse fine-tuning algorithm, named Sparse Increment Fine-Tuning (SIFT).
We validate its effectiveness on a range of tasks including the GLUE Benchmark and Instruction-tuning.
arXiv Detail & Related papers (2023-12-19T06:06:30Z) - Improved techniques for deterministic l2 robustness [63.34032156196848]
Training convolutional neural networks (CNNs) with a strict 1-Lipschitz constraint under the $l_2$ norm is useful for adversarial robustness, interpretable gradients and stable training.
We introduce a procedure to certify robustness of 1-Lipschitz CNNs by replacing the last linear layer with a 1-hidden-layer MLP.
We significantly advance the state-of-the-art for standard and provable robust accuracies on CIFAR-10 and CIFAR-100.
arXiv Detail & Related papers (2022-11-15T19:10:12Z) - IGLU: Efficient GCN Training via Lazy Updates [17.24386142849498]
Graph Convolution Networks (GCN) are used in numerous settings involving a large underlying graph as well as several layers.
Standard SGD-based training scales poorly here since each descent step ends up updating node embeddings for a large portion of the graph.
We introduce a new method IGLU that caches forward-pass embeddings for all nodes at various GCN layers.
arXiv Detail & Related papers (2021-09-28T19:11:00Z) - Understanding the Generalization of Adam in Learning Neural Networks with Proper Regularization [118.50301177912381]
We show that Adam can converge to different solutions of the objective with provably different errors, even with weight decay regularization.
We show that if the objective is convex and weight decay regularization is employed, any optimization algorithm, including Adam, will converge to the same solution.
arXiv Detail & Related papers (2021-08-25T17:58:21Z) - Gradient Centralization: A New Optimization Technique for Deep Neural Networks [74.935141515523]
Gradient Centralization (GC) operates directly on gradients by centralizing the gradient vectors to have zero mean.
GC can be viewed as a projected gradient descent method with a constrained loss function.
GC is very simple to implement and can be easily embedded into existing gradient-based DNNs with only one line of code (see the sketch after this entry).
arXiv Detail & Related papers (2020-04-03T10:25:00Z)
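Since the Gradient Centralization entry above describes a concrete, one-line operation, a brief PyTorch-style sketch may help. The axis convention chosen here (removing the mean over every dimension except the output-channel axis) follows a common implementation pattern and is an assumption on our part, not a quotation of the paper's released code.

```python
import torch

def centralize_gradient(grad: torch.Tensor) -> torch.Tensor:
    """Center a gradient tensor to zero mean, as in Gradient Centralization.

    For tensors of rank >= 2 (linear/conv weights), the mean is removed
    over every axis except the output-channel axis (dim 0); rank-1
    tensors such as biases are returned unchanged.
    """
    if grad.dim() > 1:
        dims = tuple(range(1, grad.dim()))
        grad = grad - grad.mean(dim=dims, keepdim=True)
    return grad
```

Applying `centralize_gradient` to each weight gradient just before the optimizer update would correspond to the "one line of code" embedding mentioned in the summary.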