AGGC: Adaptive Group Gradient Clipping for Stabilizing Large Language Model Training
- URL: http://arxiv.org/abs/2601.11864v1
- Date: Sat, 17 Jan 2026 01:11:07 GMT
- Title: AGGC: Adaptive Group Gradient Clipping for Stabilizing Large Language Model Training
- Authors: Zhiyuan Li, Yuan Wu, Yi Chang
- Abstract summary: We propose Adaptive Group-wise Gradient Clipping (AGGC) to stabilize the training of Large Language Models (LLMs). AGGC constructs an adaptive interval to simultaneously mitigate gradient explosion and vanishing, while employing a time-dependent scheduling mechanism. Experiments on LLaMA 2-7B, Mistral-7B, and Gemma-7B models show that AGGC consistently outperforms LoRA and frequently surpasses Full Fine-Tuning.
- Score: 23.07765612308513
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: To stabilize the training of Large Language Models (LLMs), gradient clipping is a nearly ubiquitous heuristic used to alleviate exploding gradients. However, traditional global norm clipping erroneously presupposes gradient homogeneity across different functional modules, leading to an adverse "spill-over" effect where volatile parameters force unnecessary scaling on stable ones. To overcome this, we propose Adaptive Group-wise Gradient Clipping (AGGC). AGGC partitions parameters into groups based on functional types and regulates each according to its historical behavior using an Exponential Moving Average (EMA). Specifically, it constructs an adaptive interval to simultaneously mitigate gradient explosion and vanishing, while employing a time-dependent scheduling mechanism to balance exploration and convergence. Experiments on LLaMA 2-7B, Mistral-7B, and Gemma-7B models show that AGGC consistently outperforms LoRA and frequently surpasses Full Fine-Tuning. On the GSM8K benchmark, Mistral-7B fine-tuned with AGGC achieves an accuracy of 72.93%, exceeding LoRA's 69.5%. AGGC also effectively stabilizes Reinforcement Learning with Verifiable Rewards (RLVR), enhancing the logic deduction of Qwen 2.5 and Llama 3.2 models. Experimental results demonstrate that AGGC effectively addresses the limitations of traditional gradient clipping methods, particularly in overcoming gradient heterogeneity, by utilizing a modular, adaptive clipping strategy to stabilize the training process. Due to its lightweight design, AGGC can be seamlessly integrated into existing post-training pipelines with negligible overhead.
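The contrast the abstract draws between global norm clipping and group-wise clipping can be illustrated with a minimal sketch. The abstract does not give AGGC's exact update rules, so everything below is an assumption made for illustration: the class name `GroupwiseAdaptiveClipper`, the interval multipliers `low` and `high`, the EMA decay `beta`, the linear tightening schedule, and the choice to rescale small gradients up to the lower bound are all hypothetical and are not taken from the paper.

```python
import math


def l2_norm(vec):
    """Euclidean norm of a flat list of gradient values."""
    return math.sqrt(sum(g * g for g in vec))


def global_clip(groups, max_norm):
    """Standard global-norm clipping: one scale factor shared by every group."""
    total = math.sqrt(sum(l2_norm(g) ** 2 for g in groups.values()))
    scale = min(1.0, max_norm / (total + 1e-12))
    return {name: [x * scale for x in g] for name, g in groups.items()}


class GroupwiseAdaptiveClipper:
    """Hypothetical group-wise clipper; each group tracks an EMA of its own norm."""

    def __init__(self, beta=0.9, low=0.5, high=2.0, total_steps=1000):
        self.beta, self.low, self.high = beta, low, high
        self.total_steps = total_steps
        self.ema = {}   # per-group EMA of the (clipped) gradient norm
        self.step = 0

    def _interval(self, ema):
        # Assumed linear schedule: the interval tightens toward the EMA over time.
        t = min(1.0, self.step / self.total_steps)
        lo = self.low + (1.0 - self.low) * t
        hi = self.high - (self.high - 1.0) * t
        return lo * ema, hi * ema

    def clip(self, groups):
        out = {}
        for name, g in groups.items():
            norm = l2_norm(g)
            if name not in self.ema:
                # First step for this group: initialise the EMA, no rescaling.
                self.ema[name] = norm
                out[name] = list(g)
                continue
            lower, upper = self._interval(self.ema[name])
            target = min(max(norm, lower), upper)  # clamp the norm into the interval
            scale = target / (norm + 1e-12)
            out[name] = [x * scale for x in g]
            self.ema[name] = self.beta * self.ema[name] + (1 - self.beta) * target
        self.step += 1
        return out


# Toy gradients for two functional groups (group names are illustrative only).
warmup = {"attn": [3.0, 4.0], "mlp": [0.6, 0.8]}
spike = {"attn": [30.0, 40.0], "mlp": [0.6, 0.8]}   # only "attn" explodes
vanish = {"attn": [0.03, 0.04], "mlp": [0.6, 0.8]}  # "attn" nearly vanishes

clipper = GroupwiseAdaptiveClipper()
clipper.clip(warmup)
adaptive_spike = clipper.clip(spike)
adaptive_vanish = clipper.clip(vanish)
global_spike = global_clip(spike, max_norm=10.0)
```

In this toy setup the global clip shrinks the stable `mlp` group along with the spiking `attn` group, which is the "spill-over" effect described in the abstract, while the group-wise variant leaves `mlp` essentially untouched and bounds `attn` relative to its own history.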
Related papers
- On GRPO Collapse in Search-R1: The Lazy Likelihood-Displacement Death Spiral [59.14787085809595]
We identify Lazy Likelihood Displacement (LLD) as the core mechanism driving this failure. LLD emerges early and triggers a self-reinforcing LLD Death Spiral, where declining likelihood leads to low-confidence responses. We propose a lightweight likelihood-preserving regularization, LLDS, for GRPO that activates only when a trajectory's likelihood decreases.
arXiv Detail & Related papers (2025-12-03T19:41:15Z)
- Soft Adaptive Policy Optimization [67.61886077470528]
Reinforcement learning plays an increasingly important role in enhancing the reasoning capabilities of large language models. Existing group-based policy optimization methods, such as GSPO and GRPO, alleviate this problem via hard clipping. We propose Soft Adaptive Policy Optimization (SAPO), which replaces hard clipping with a smooth, temperature-controlled gate.
arXiv Detail & Related papers (2025-11-25T14:25:19Z)
- PGTT: Phase-Guided Terrain Traversal for Perceptive Legged Locomotion [41.99844472131922]
Phase-Guided Terrain Traversal (PGTT) is a perception-aware deep-RL approach that enforces gait structure purely through reward shaping. Trained in MuJoCo (MJX) on procedurally generated stair-like terrains with curriculum and domain randomization, PGTT achieves the highest success under push disturbances.
arXiv Detail & Related papers (2025-10-21T07:00:18Z)
- From Local to Global: Revisiting Structured Pruning Paradigms for Large Language Models [27.774067682004745]
GISP (Global Iterative Structured Pruning) removes attention heads and channels using first-order, loss-based importance weights aggregated at the structure level with block-wise normalization. An iterative schedule, rather than one-shot pruning, stabilizes accuracy at higher sparsity and mitigates perplexity collapse without requiring intermediate fine-tuning. Because importance is defined by a model-level loss, GISP naturally supports task-specific objectives; we instantiate perplexity for language modeling and a margin-based objective for decision-style tasks.
arXiv Detail & Related papers (2025-10-20T19:04:09Z)
- GrAInS: Gradient-based Attribution for Inference-Time Steering of LLMs and VLMs [56.93583799109029]
GrAInS is an inference-time steering approach that operates across both language-only and vision-language models and tasks. During inference, GrAInS adjusts hidden activations at transformer layers guided by token-level attribution signals, and normalizes activations to preserve representational scale. It consistently outperforms both fine-tuning and existing steering baselines.
arXiv Detail & Related papers (2025-07-24T02:34:13Z)
- Taming LLMs by Scaling Learning Rates with Gradient Grouping [49.91587150497186]
Training large language models (LLMs) poses challenges due to their massive scale and heterogeneous architectures. This work introduces Scaling with Gradient Grouping (SGG), an optimizer wrapper that improves adaptive learning rate estimation by dynamic grouping and group-specific scaling.
arXiv Detail & Related papers (2025-06-01T15:30:37Z)
- Normalized Attention Guidance: Universal Negative Guidance for Diffusion Models [57.20761595019967]
We present Normalized Attention Guidance (NAG), an efficient, training-free mechanism that applies extrapolation in attention space with L1-based normalization and refinement. NAG restores effective negative guidance where CFG collapses while maintaining fidelity. NAG generalizes across architectures (UNet, DiT), sampling regimes (few-step, multi-step), and modalities (image, video).
arXiv Detail & Related papers (2025-05-27T13:30:46Z)
- LoRA-MGPO: Mitigating Double Descent in Low-Rank Adaptation via Momentum-Guided Perturbation Optimization [16.360816770124874]
We introduce LoRA-MGPO, a framework that incorporates Momentum-Guided Perturbation Optimization (MGPO). MGPO stabilizes training dynamics by guiding perturbations with momentum vectors derived from the gradient's state. Experiments show that LoRA-MGPO consistently achieves superior performance over LoRA and other PEFT methods.
arXiv Detail & Related papers (2025-02-20T13:14:41Z)
- AdaGC: Improving Training Stability for Large Language Model Pretraining [18.163318397205533]
Large Language Models (LLMs) face increasing loss spikes during scaling. While global clipping mitigates this, traditional approaches handle parameter-specific gradient variations poorly. We show that AdaGC converges 25% faster than global clipping.
arXiv Detail & Related papers (2025-02-16T08:13:23Z)
- Sharpness-Aware Gradient Matching for Domain Generalization [84.14789746460197]
The goal of domain generalization (DG) is to enhance the generalization capability of the model learned from a source domain to other unseen domains.
The recently developed Sharpness-Aware Minimization (SAM) method aims to achieve this goal by minimizing the sharpness measure of the loss landscape.
We present two conditions to ensure that the model can converge to a flat minimum with a small loss, and present an algorithm named Sharpness-Aware Gradient Matching (SAGM).
Our proposed SAGM method consistently outperforms the state-of-the-art methods on five DG benchmarks.
arXiv Detail & Related papers (2023-03-18T07:25:12Z)
- Byzantine-Robust Learning on Heterogeneous Data via Gradient Splitting [58.91947205027892]
Federated learning has exhibited vulnerabilities to Byzantine attacks.
Byzantine attackers can send arbitrary gradients to a central server to destroy the convergence and performance of the global model.
A wealth of robust AGgregation Rules (AGRs) have been proposed to defend against Byzantine attacks.
arXiv Detail & Related papers (2023-02-13T03:31:50Z)