GradES: Significantly Faster Training in Transformers with Gradient-Based Early Stopping
- URL: http://arxiv.org/abs/2509.01842v3
- Date: Thu, 16 Oct 2025 18:38:51 GMT
- Title: GradES: Significantly Faster Training in Transformers with Gradient-Based Early Stopping
- Authors: Qifu Wen, Xi Zeng, Zihan Zhou, Shuaijun Liu, Mehdi Hosseinzadeh, Ningxin Su, Reza Rawassizadeh,
- Abstract summary: Early stopping monitors global validation loss and halts all parameter updates simultaneously. We propose \textit{GradES}, a novel gradient-based early stopping approach that operates within transformer components. \textit{GradES} speeds up training time by 1.57--7.22$\times$ while simultaneously enhancing generalization through early prevention of overfitting.
- Score: 9.8335797454886
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Early stopping monitors global validation loss and halts all parameter updates simultaneously, which is computationally costly for large transformers due to the extended time required for validation inference. We propose \textit{GradES}, a novel gradient-based early stopping approach that operates within transformer components (attention projections and feed-forward layer matrices). We found that different components converge at varying rates during fine-tuning for both language and vision-language models. \textit{GradES} tracks the magnitude of gradient changes in backpropagation for these matrices during training. When the magnitude of a projection matrix's gradient changes falls below a convergence threshold $\tau$, we exclude that matrix from further updates individually, eliminating costly validation passes while allowing slow-converging matrices to continue learning. \textit{GradES} speeds up training time by 1.57--7.22$\times$ while simultaneously enhancing generalization through early prevention of overfitting, resulting in 1.2\% higher average accuracy on language tasks and 3.88\% on multimodal benchmarks.
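A minimal sketch of the per-matrix freezing rule described above, assuming a PyTorch model whose attention and feed-forward projection weights can be selected by name; the threshold `tau`, the smoothing factor `beta`, and the name filter are illustrative assumptions, not the paper's exact hyperparameters or implementation.

```python
import torch

class GradESMonitor:
    """Freeze individual projection matrices once the change in their gradient
    magnitude drops below tau, instead of early-stopping the whole model.
    Illustrative sketch; tau, beta and the name filter are assumptions."""

    def __init__(self, model, tau=1e-4, beta=0.9, keys=("attn", "mlp")):
        self.model = model
        self.tau = tau      # convergence threshold on the gradient-norm change
        self.beta = beta    # EMA factor used to smooth the gradient norm
        self.keys = keys    # substrings selecting attention / feed-forward weights
        self.ema = {}       # parameter name -> smoothed gradient norm
        self.frozen = set()

    def step(self):
        """Call after loss.backward() and before optimizer.step()."""
        for name, p in self.model.named_parameters():
            if p.grad is None or name in self.frozen:
                continue
            if not any(k in name for k in self.keys):
                continue
            g = p.grad.detach().norm().item()
            prev = self.ema.get(name)
            if prev is None:
                self.ema[name] = g          # first observation: just record it
                continue
            self.ema[name] = self.beta * prev + (1 - self.beta) * g
            # Exclude this matrix from further updates once its gradient
            # magnitude has stopped changing.
            if abs(self.ema[name] - prev) < self.tau:
                p.requires_grad_(False)
                p.grad = None
                self.frozen.add(name)
```

In a fine-tuning loop one would call `monitor.step()` after every backward pass and stop training once all tracked matrices are frozen, which is how per-matrix freezing removes the need for validation passes during training.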
Related papers
- PreLoRA: Hybrid Pre-training of Vision Transformers with Full Training and Low-Rank Adapters [2.5547655072779]
We propose an approach to identify states of partial convergence and switch from full parameter training to Low-Rank Adaptation (LoRA) on the ViT-Large model. Experimental results show that this approach preserves model accuracy while reducing the number of trainable parameters to 10% of the original.
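A rough sketch of the kind of mid-training switch described above, in plain PyTorch: base weights are frozen and a low-rank update is trained in their place. `LoRALinear`, the rank, and the module-replacement helper are simplified stand-ins, not PreLoRA's actual implementation.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen base linear layer plus a trainable low-rank update x @ A @ B."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)          # freeze the full-rank weights
        self.A = nn.Parameter(torch.randn(base.in_features, rank) * 0.01)
        self.B = nn.Parameter(torch.zeros(rank, base.out_features))
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + (x @ self.A @ self.B) * self.scale

def switch_to_lora(module: nn.Module, rank: int = 8):
    """Once partial convergence is detected, wrap every nn.Linear so that only
    the low-rank adapters keep training (simplified recursive replacement)."""
    for name, child in module.named_children():
        if isinstance(child, nn.Linear):
            setattr(module, name, LoRALinear(child, rank=rank))
        else:
            switch_to_lora(child, rank=rank)
```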
arXiv Detail & Related papers (2025-09-25T21:34:17Z) - PLUMAGE: Probabilistic Low rank Unbiased Min Variance Gradient Estimator for Efficient Large Model Training [21.695928776150808]
Accelerator memory and networking constraints have emerged as dominant bottlenecks when training large language models. We propose PLUMAGE: Probabilistic Low rank Unbiased Minimum vAriance Gradient Estimator. We empirically demonstrate that PLUMAGE shrinks the full-rank optimization's gap over the pre-training evaluation loss by 33% on average across models, and the average training loss across the GLUE benchmark by 28%, within a similar computational and memory footprint as GaLore.
arXiv Detail & Related papers (2025-05-23T19:17:55Z) - Mechanistic Insights into Grokking from the Embedding Layer [15.676058752772287]
Grokking, a delayed generalization in neural networks, has been observed in Transformers and MLPs, but the components driving it remain underexplored. We show that embeddings are central to grokking: introducing them into MLPs induces delayed generalization in modular arithmetic tasks. Our methods not only improve grokking dynamics but also extend to broader challenges in Transformer optimization, where bilinear interactions hinder efficient training.
arXiv Detail & Related papers (2025-05-21T15:12:34Z) - Beyond Gradient Averaging in Parallel Optimization: Improved Robustness through Gradient Agreement Filtering [36.896695278624776]
Traditional distributed data-parallel gradient descent involves averaging the gradients of microbatches to form a macrobatch gradient that is then used to update model parameters. We introduce a simple, computationally effective way to reduce gradient variance by computing the cosine distance between micro-gradients during training. We show this technique consistently improves validation accuracy, in some cases by up to 18.2% compared to traditional training approaches.
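A simplified sketch of agreement filtering, assuming per-microbatch gradients have already been flattened into vectors; the progressive-averaging rule and the distance threshold are illustrative choices, not the paper's exact procedure.

```python
import torch
import torch.nn.functional as F

def filtered_macrobatch_gradient(micro_grads, max_cosine_distance=1.0):
    """Fold each microbatch gradient into the running average only if its
    cosine distance to that average stays below the threshold (1.0 rejects
    anything orthogonal or worse); disagreeing gradients are skipped.
    Simplified illustration of the idea."""
    agg = micro_grads[0].clone()
    kept = 1
    for g in micro_grads[1:]:
        cos = F.cosine_similarity(agg, g, dim=0)
        if (1.0 - cos).item() <= max_cosine_distance:
            agg = (agg * kept + g) / (kept + 1)
            kept += 1
    return agg, kept
```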
arXiv Detail & Related papers (2024-12-24T00:00:11Z) - Accelerating Transformer Pre-training with 2:4 Sparsity [19.64391647966267]
NVIDIA Ampere GPUs can execute a fine-grained 2:4 sparse matrix multiplication twice as fast as its dense equivalent.
We propose three techniques to preserve accuracy: modifying the sparse-refined straight-through estimator, determining a feasible decay factor in the warm-up stage, and enhancing the model's quality.
Our algorithm achieves convergence similar to dense training algorithms on several transformer pre-training tasks, while actual acceleration can be observed on transformer blocks of different shapes.
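For context, the 2:4 pattern means that in every contiguous group of four weights at most two are nonzero. Below is a small sketch of projecting a weight matrix onto that pattern by keeping the two largest-magnitude entries per group; the paper's training techniques listed above are not reproduced here.

```python
import torch

def prune_to_2_of_4(weight: torch.Tensor) -> torch.Tensor:
    """Zero the two smallest-magnitude entries in every group of four along the
    last dimension, yielding the 2:4 pattern Ampere tensor cores accelerate.
    Sketch of the sparsity pattern only."""
    assert weight.shape[-1] % 4 == 0, "last dimension must be a multiple of 4"
    groups = weight.reshape(-1, 4)
    keep = groups.abs().topk(2, dim=1).indices            # two largest per group
    mask = torch.zeros_like(groups, dtype=torch.bool).scatter_(1, keep, True)
    return (groups * mask).reshape(weight.shape)
```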
arXiv Detail & Related papers (2024-04-02T11:12:42Z) - Winner-Take-All Column Row Sampling for Memory Efficient Adaptation of Language Model [89.8764435351222]
We propose a new family of unbiased estimators called WTA-CRS for matrix products with reduced variance.
Our work provides both theoretical and experimental evidence that, in the context of tuning transformers, our proposed estimators exhibit lower variance compared to existing ones.
arXiv Detail & Related papers (2023-05-24T15:52:08Z) - GradViT: Gradient Inversion of Vision Transformers [83.54779732309653]
We demonstrate the vulnerability of vision transformers (ViTs) to gradient-based inversion attacks.
We introduce a method, named GradViT, that optimizes random noise into naturally looking images.
We observe unprecedentedly high fidelity and closeness to the original (hidden) data.
arXiv Detail & Related papers (2022-03-22T17:06:07Z) - Tom: Leveraging trend of the observed gradients for faster convergence [0.0]
Tom is a novel variant of Adam that takes into account the trend observed for the gradients in the loss landscape traversed by the neural network.
Tom outperforms Adagrad, Adadelta, RMSProp and Adam in terms of both accuracy and convergence speed.
arXiv Detail & Related papers (2021-09-07T20:19:40Z) - Adapting Stepsizes by Momentumized Gradients Improves Optimization and Generalization [89.66571637204012]
\textsc{AdaMomentum} performs well on vision tasks, and achieves state-of-the-art results consistently on other tasks including language processing.
arXiv Detail & Related papers (2021-06-22T03:13:23Z) - Large Scale Private Learning via Low-rank Reparametrization [77.38947817228656]
We propose a reparametrization scheme to address the challenges of applying differentially private SGD on large neural networks.
We are the first to be able to apply differential privacy on the BERT model and achieve an average accuracy of $83.9\%$ on four downstream tasks.
arXiv Detail & Related papers (2021-06-17T10:14:43Z) - Correcting Momentum in Temporal Difference Learning [95.62766731469671]
We argue that momentum in Temporal Difference (TD) learning accumulates gradients that become doubly stale.
We show that this phenomenon exists, and then propose a first-order correction term to momentum.
An important insight of this work is that deep RL methods are not always best served by directly importing techniques from the supervised setting.
arXiv Detail & Related papers (2021-06-07T20:41:15Z) - Dynamically Adjusting Transformer Batch Size by Monitoring Gradient Direction Change [69.40942736249397]
We analyze how increasing batch size affects gradient direction.
We propose to evaluate the stability of gradients with their angle change.
Our approach dynamically determines proper and efficient batch sizes during training.
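A loose sketch in the spirit of that idea, assuming the full gradient has been flattened into a single vector each step; the doubling rule and the angle threshold are arbitrary illustrative choices, not the paper's criterion.

```python
import math
import torch
import torch.nn.functional as F

class AngleBasedBatchSizer:
    """Grow the batch size when consecutive gradient directions fluctuate,
    i.e. when the angle between them exceeds a threshold. Heuristic sketch."""

    def __init__(self, batch_size=32, max_batch_size=4096, angle_threshold_deg=90.0):
        self.batch_size = batch_size
        self.max_batch_size = max_batch_size
        self.threshold = angle_threshold_deg
        self.prev = None

    def update(self, grad_vec: torch.Tensor) -> int:
        if self.prev is not None:
            cos = F.cosine_similarity(self.prev, grad_vec, dim=0).clamp(-1.0, 1.0)
            angle = math.degrees(math.acos(cos.item()))
            if angle > self.threshold:
                # Unstable direction: average over more examples next step.
                self.batch_size = min(self.batch_size * 2, self.max_batch_size)
        self.prev = grad_vec.detach().clone()
        return self.batch_size
```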
arXiv Detail & Related papers (2020-05-05T08:47:34Z) - On Layer Normalization in the Transformer Architecture [112.40350994368741]
We first study theoretically why the learning rate warm-up stage is essential and show that the location of layer normalization matters.
We show in experiments that Pre-LN Transformers without the warm-up stage can reach comparable results with baselines.
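For reference, a minimal Pre-LN block in which layer normalization precedes each sub-layer rather than following the residual addition; the dimensions and the GELU feed-forward are generic choices, not tied to the paper's experiments.

```python
import torch.nn as nn

class PreLNBlock(nn.Module):
    """Transformer block with LayerNorm applied before each sub-layer (Pre-LN),
    the placement the paper finds trains stably without a warm-up stage."""

    def __init__(self, d_model=512, n_heads=8, d_ff=2048, dropout=0.1):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads,
                                          dropout=dropout, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(),
                                nn.Linear(d_ff, d_model))

    def forward(self, x):
        # Pre-LN: normalize first, then attend, then add the residual.
        h = self.ln1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        # Same ordering for the feed-forward sub-layer.
        return x + self.ff(self.ln2(x))
```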
arXiv Detail & Related papers (2020-02-12T00:33:03Z)
This list is automatically generated from the titles and abstracts of the papers on this site.