ZClip: Adaptive Spike Mitigation for LLM Pre-Training
- URL: http://arxiv.org/abs/2504.02507v1
- Date: Thu, 03 Apr 2025 11:41:55 GMT
- Title: ZClip: Adaptive Spike Mitigation for LLM Pre-Training
- Authors: Abhay Kumar, Louis Owen, Nilabhra Roy Chowdhury, Fabian Güra
- Abstract summary: Training large language models (LLMs) presents numerous challenges, including gradient instability and loss spikes. Traditional gradient clipping techniques, such as constant or norm-based methods, fail to address these issues effectively. We propose ZClip, an adaptive gradient clipping algorithm that dynamically adjusts the clipping threshold based on statistical properties of gradient norms over time.
- Score: 0.3574867616159909
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Training large language models (LLMs) presents numerous challenges, including gradient instability and loss spikes. These phenomena can lead to catastrophic divergence, requiring costly checkpoint restoration and data batch skipping. Traditional gradient clipping techniques, such as constant or norm-based methods, fail to address these issues effectively due to their reliance on fixed thresholds or heuristics, leading to inefficient learning and requiring frequent manual intervention. In this work, we propose ZClip, an adaptive gradient clipping algorithm that dynamically adjusts the clipping threshold based on statistical properties of gradient norms over time. Unlike prior reactive strategies, ZClip proactively adapts to training dynamics without making any prior assumptions on the scale and the temporal evolution of gradient norms. At its core, it leverages z-score-based anomaly detection to identify and mitigate large gradient spikes, preventing malignant loss spikes while not interfering with convergence otherwise. Our code is available at: https://github.com/bluorion-com/ZClip.
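To make the abstract's description concrete, below is a minimal, illustrative sketch of z-score-based adaptive gradient clipping in PyTorch. It is not the official ZClip implementation (see the linked repository for that); the EMA-based statistics, the hyperparameter values (z_threshold, alpha, warmup_steps), and the rescaling rule are assumptions chosen only to illustrate clipping when the current gradient norm is a statistical outlier.

```python
import torch


class ZScoreGradClipper:
    """Minimal sketch of z-score-based adaptive gradient clipping.

    Keeps exponential moving averages (EMA) of the gradient-norm mean and
    variance; when the current norm's z-score exceeds a threshold, all
    gradients are rescaled so the total norm falls back to
    mean + z_threshold * std. Hyperparameter values are illustrative.
    """

    def __init__(self, z_threshold: float = 2.5, alpha: float = 0.97,
                 warmup_steps: int = 25):
        self.z_threshold = z_threshold  # hypothetical z-score cutoff
        self.alpha = alpha              # EMA decay for the running statistics
        self.warmup_steps = warmup_steps
        self.mean = None                # EMA of the gradient norm
        self.var = 0.0                  # EMA of the squared deviation
        self.step = 0

    @torch.no_grad()
    def __call__(self, model: torch.nn.Module) -> float:
        grads = [p.grad for p in model.parameters() if p.grad is not None]
        if not grads:
            return 0.0
        norm = torch.norm(torch.stack([g.norm(2) for g in grads]), 2).item()
        self.step += 1

        if self.mean is None:
            self.mean = norm            # initialize from the first observed norm
        elif self.step > self.warmup_steps:
            std = max(self.var ** 0.5, 1e-8)
            z = (norm - self.mean) / std
            if z > self.z_threshold:
                # Spike detected: rescale gradients in place so the total
                # norm is pulled back to the adaptive threshold.
                clipped = self.mean + self.z_threshold * std
                scale = clipped / (norm + 1e-12)
                for g in grads:
                    g.mul_(scale)
                norm = clipped

        # Update the running statistics with the (possibly clipped) norm.
        self.mean = self.alpha * self.mean + (1 - self.alpha) * norm
        self.var = self.alpha * self.var + (1 - self.alpha) * (norm - self.mean) ** 2
        return norm
```

In a training loop, such a clipper would be called between loss.backward() and optimizer.step(), so that outlier gradients are rescaled in place before the parameter update.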
Related papers
- Mjolnir: Breaking the Shield of Perturbation-Protected Gradients via Adaptive Diffusion [13.764770382623812]
We present the first attempt to break the shield of gradient perturbation protection in Federated Learning. We introduce Mjolnir, a perturbation-resilient gradient leakage attack. Mjolnir is capable of removing perturbations from gradients without requiring additional access to the original model structure or external data.
arXiv Detail & Related papers (2024-07-07T07:06:49Z) - To Clip or not to Clip: the Dynamics of SGD with Gradient Clipping in High-Dimensions [6.653325043862049]
We study gradient clipping in a least squares problem under streaming SGD.
We show that with Gaussian noise clipping cannot improve SGD performance.
We propose a simple heuristic for near-optimal scheduling of the clipping threshold; a generic illustration of such scheduling appears in the sketch after this list.
arXiv Detail & Related papers (2024-06-17T16:50:22Z) - Careful with that Scalpel: Improving Gradient Surgery with an EMA [30.8976309525556]
We show how one can improve performance by blending the gradients beyond a simple sum.
We demonstrate that our method, Bloop, can lead to much better performance on NLP and vision experiments.
arXiv Detail & Related papers (2024-02-05T13:37:00Z) - One-Step Forward and Backtrack: Overcoming Zig-Zagging in Loss-Aware Quantization Training [12.400950982075948]
Weight quantization is an effective technique to compress deep neural networks for their deployment on edge devices with limited resources.
Traditional loss-aware quantization methods commonly use the quantized gradient to replace the full-precision gradient.
This paper proposes a one-step forward and backtrack way for loss-aware quantization to get more accurate and stable gradient direction.
arXiv Detail & Related papers (2024-01-30T05:42:54Z) - Sparse is Enough in Fine-tuning Pre-trained Large Language Models [98.46493578509039]
We propose a gradient-based sparse fine-tuning algorithm, named Sparse Increment Fine-Tuning (SIFT).
We validate its effectiveness on a range of tasks including the GLUE Benchmark and Instruction-tuning.
arXiv Detail & Related papers (2023-12-19T06:06:30Z) - DPSUR: Accelerating Differentially Private Stochastic Gradient Descent Using Selective Update and Release [29.765896801370612]
This paper proposes DPSUR, a Differentially Private training framework based on Selective Updates and Release.
The main challenges lie in two aspects: privacy concerns, and the gradient selection strategy for model updates.
Experiments conducted on MNIST, FMNIST, CIFAR-10, and IMDB datasets show that DPSUR significantly outperforms previous works in terms of convergence speed.
arXiv Detail & Related papers (2023-11-23T15:19:30Z) - Point Cloud Denoising via Momentum Ascent in Gradient Fields [72.93429911044903]
A gradient-based method was previously proposed to estimate gradient fields from noisy point clouds using neural networks.
We develop a momentum gradient ascent method that leverages the information of previous iterations in determining the trajectories of the points.
Experiments demonstrate that the proposed method outperforms state-of-the-art approaches with a variety of point clouds, noise types, and noise levels.
arXiv Detail & Related papers (2022-02-21T10:21:40Z) - SDGMNet: Statistic-based Dynamic Gradient Modulation for Local Descriptor Learning [44.69439245287881]
We propose a dynamic gradient modulation method, named SDGMNet, to improve triplet loss for local descriptor learning.
In this paper, we perform a deep analysis of the back-propagation of general triplet-based losses and introduce the included angle as a distance measure.
Our novel descriptor surpasses previous state-of-the-art methods on standard benchmarks including patch verification, matching, and retrieval tasks.
arXiv Detail & Related papers (2021-06-08T15:10:31Z) - Unbiased Risk Estimators Can Mislead: A Case Study of Learning with Complementary Labels [92.98756432746482]
We study a weakly supervised problem called learning with complementary labels.
We show that the quality of gradient estimation matters more in risk minimization.
We propose a novel surrogate complementary loss (SCL) framework that trades zero bias for reduced variance.
arXiv Detail & Related papers (2020-07-05T04:19:37Z) - Extrapolation for Large-batch Training in Deep Learning [72.61259487233214]
We show that a host of variations can be covered in a unified framework that we propose.
We prove the convergence of this novel scheme and rigorously evaluate its empirical performance on ResNet, LSTM, and Transformer.
arXiv Detail & Related papers (2020-06-10T08:22:41Z) - Regularizing Meta-Learning via Gradient Dropout [102.29924160341572]
Meta-learning models are prone to overfitting when there are not enough training tasks for the meta-learners to generalize.
We introduce a simple yet effective method to alleviate the risk of overfitting for gradient-based meta-learning.
arXiv Detail & Related papers (2020-04-13T10:47:02Z) - The Break-Even Point on Optimization Trajectories of Deep Neural Networks [64.7563588124004]
We argue for the existence of a "break-even" point on the optimization trajectory of deep neural networks.
We show that using a large learning rate in the initial phase of training reduces the variance of the gradient.
We also show that using a low learning rate results in bad conditioning of the loss surface even for a neural network with batch normalization layers.
arXiv Detail & Related papers (2020-02-21T22:55:51Z)
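As a companion to the clipping-threshold scheduling discussed in "To Clip or not to Clip" above, the following sketch runs streaming SGD on a synthetic least-squares problem with norm-based clipping under a time-decaying threshold. The schedule c_t = c0 / (1 + decay * t) and all constants are illustrative assumptions, not the heuristic proposed in that paper.

```python
import numpy as np


def streaming_sgd_least_squares(A, b, lr=0.01, c0=1.0, decay=0.01):
    """Streaming SGD on 0.5 * (a_t @ x - b_t)**2 with a scheduled clip.

    The threshold schedule c_t = c0 / (1 + decay * t) is illustrative only.
    """
    n, d = A.shape
    x = np.zeros(d)
    for t in range(n):
        a_t, b_t = A[t], b[t]              # one fresh sample per step (streaming)
        grad = (a_t @ x - b_t) * a_t       # per-sample least-squares gradient
        c_t = c0 / (1.0 + decay * t)       # time-varying clipping threshold
        norm = np.linalg.norm(grad)
        if norm > c_t:
            grad *= c_t / norm             # norm-based clipping
        x -= lr * grad
    return x


# Example usage on synthetic Gaussian data.
rng = np.random.default_rng(0)
A = rng.normal(size=(10_000, 20))
x_true = rng.normal(size=20)
b = A @ x_true + 0.1 * rng.normal(size=10_000)
x_hat = streaming_sgd_least_squares(A, b)
```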