Progressive Gradient Flow for Robust N:M Sparsity Training in
Transformers
- URL: http://arxiv.org/abs/2402.04744v1
- Date: Wed, 7 Feb 2024 10:55:59 GMT
- Title: Progressive Gradient Flow for Robust N:M Sparsity Training in
Transformers
- Authors: Abhimanyu Rajeshkumar Bambhaniya, Amir Yazdanbakhsh, Suvinay
Subramanian, Sheng-Chun Kao, Shivani Agrawal, Utku Evci, Tushar Krishna
- Abstract summary: N:M structured sparsity has garnered significant interest as a result of its relatively modest overhead and improved efficiency.
Although there have been efforts to develop training recipes for N:M structured sparsity, they primarily focus on low-sparsity regions.
However, the performance of models trained using these approaches tends to decline when confronted with high-sparsity regions.
- Score: 15.27677493050638
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: N:M structured sparsity has garnered significant interest because of its
relatively modest overhead and improved efficiency. Additionally, this form of
sparsity holds considerable appeal for reducing the memory footprint, owing to
its modest representation overhead. Although there have been efforts to develop
training recipes for N:M structured sparsity, they primarily focus on
low-sparsity regions ($\sim$50\%). Nonetheless, the performance of models
trained using these approaches tends to decline when confronted with
high-sparsity regions ($>$80\%). In this work, we study the effectiveness of existing sparse
training recipes at \textit{high-sparsity regions} and argue that these methods
fail to sustain the model quality on par with low-sparsity regions. We
demonstrate that the significant factor contributing to this disparity is the
presence of elevated levels of induced noise in the gradient magnitudes. To
mitigate this undesirable effect, we employ decay mechanisms to progressively
restrict the flow of gradients towards pruned elements. Our approach improves
the model quality by up to 2$\%$ and 5$\%$ in vision and language models,
respectively, in the high-sparsity regime. We also evaluate the trade-off between
model accuracy and training compute cost in terms of FLOPs. At iso-training
FLOPs, our method yields better performance compared to conventional sparse
training recipes, exhibiting an accuracy improvement of up to 2$\%$. The source
code is available at
https://github.com/abhibambhaniya/progressive_gradient_flow_nm_sparsity.
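As a rough illustration of the idea described above (not the authors' released recipe; the 1:4 pattern, the function names, and the exponential `beta` schedule are assumptions made for this sketch), the following NumPy snippet builds an N:M mask and progressively attenuates the gradient that reaches pruned positions:

```python
import numpy as np

def nm_mask(w, n=1, m=4):
    """Keep the n largest-magnitude weights in every group of m (N:M structured sparsity)."""
    groups = w.reshape(-1, m)
    keep = np.argsort(-np.abs(groups), axis=1)[:, :n]   # indices of the n largest per group
    mask = np.zeros_like(groups)
    np.put_along_axis(mask, keep, 1.0, axis=1)
    return mask.reshape(w.shape)

def decayed_gradient(grad, mask, step, beta=0.98):
    """Attenuate the gradient reaching pruned (masked-out) weights.

    Early in training pruned weights still receive most of their gradient
    (decay factor near 1); as training proceeds the factor shrinks toward 0,
    progressively restricting gradient flow to pruned elements.
    """
    decay = beta ** step                                 # assumed exponential schedule
    return grad * (mask + (1.0 - mask) * decay)

# Toy usage: one update on a dense weight vector under a 1:4 (75%-sparse) pattern.
rng = np.random.default_rng(0)
w = rng.normal(size=16)
mask = nm_mask(w, n=1, m=4)
grad = rng.normal(size=16)                               # stand-in for a real backprop gradient
w -= 0.1 * decayed_gradient(grad, mask, step=200)        # pruned slots get ~0.98**200 of their gradient
```

In an actual training loop the mask would be recomputed periodically and the decayed gradient applied to a dense copy of the weights, so that pruned elements can still recover early in training.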
Related papers
- PUMA: margin-based data pruning [51.12154122266251]
We focus on data pruning, where some training samples are removed based on their distance to the model's classification boundary (i.e., the margin).
We propose PUMA, a new data pruning strategy that computes the margin using DeepFool.
We show that PUMA can be used on top of the current state-of-the-art methodology in robustness and that, unlike existing data pruning strategies, it significantly improves model performance.
arXiv Detail & Related papers (2024-05-10T08:02:20Z)
- Attention-Driven Training-Free Efficiency Enhancement of Diffusion Models [29.863953001061635]
Diffusion Models (DMs) have exhibited superior performance in generating high-quality and diverse images.
Existing works mainly adopt a retraining process to enhance DM efficiency.
We introduce the Attention-driven Training-free Efficient Diffusion Model (AT-EDM) framework that leverages attention maps to perform run-time pruning of redundant tokens.
arXiv Detail & Related papers (2024-05-08T17:56:47Z)
- Preparing Lessons for Progressive Training on Language Models [75.88952808979087]
The rapid progress of Transformers in artificial intelligence has come at the cost of increased resource consumption and greenhouse gas emissions.
We propose Apollo, which prepares lessons for expanding operations by layer functionality during training of low layers.
Experiments demonstrate that Apollo achieves state-of-the-art acceleration ratios, even rivaling methods using pretrained models.
arXiv Detail & Related papers (2024-01-17T13:04:14Z)
- Bridging the Gap: Addressing Discrepancies in Diffusion Model Training for Classifier-Free Guidance [1.6804613362826175]
Diffusion models have emerged as a pivotal advancement in generative models.
In this paper we aim to underscore a discrepancy between conventional training methods and the desired conditional sampling behavior.
We introduce an updated loss function that better aligns training objectives with sampling behaviors.
arXiv Detail & Related papers (2023-11-02T02:03:12Z)
- Gradient-based Intra-attention Pruning on Pre-trained Language Models [21.444503777215637]
We propose a structured pruning method, GRAIN (Gradient-based Intra-attention pruning).
GRAIN inspects and prunes intra-attention structures, which greatly expands the structure search space and enables more flexible models.
Experiments on GLUE, SQuAD, and CoNLL 2003 show that GRAIN notably outperforms other methods, especially in the high sparsity regime.
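To make the flavour of such a criterion concrete, here is a minimal sketch that scores attention-head slices with a generic first-order (weight times gradient) saliency; GRAIN itself scores finer intra-attention structures, and the shapes and criterion below are illustrative assumptions rather than the paper's implementation:

```python
import numpy as np

def taylor_importance(weight, grad):
    """Generic first-order saliency |w * dL/dw| summed over a structured group.

    A common proxy in gradient-based structured pruning, used here only to
    illustrate the idea; it is not necessarily GRAIN's exact criterion.
    """
    return float(np.abs(weight * grad).sum())

# Score each attention-head slice of a projection matrix and zero the weakest heads.
num_heads, head_dim, hidden = 12, 64, 768                  # assumed BERT-base-like shapes
W = np.random.randn(num_heads * head_dim, hidden)
G = np.random.randn(num_heads * head_dim, hidden)           # stand-in for dL/dW
scores = [taylor_importance(W[h * head_dim:(h + 1) * head_dim],
                            G[h * head_dim:(h + 1) * head_dim])
          for h in range(num_heads)]
for h in np.argsort(scores)[:4]:                             # prune the 4 least important heads
    W[h * head_dim:(h + 1) * head_dim] = 0.0
```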
arXiv Detail & Related papers (2022-12-15T06:52:31Z)
- Training Recipe for N:M Structured Sparsity with Decaying Pruning Mask [8.02992650002693]
We study and evaluate various training recipes for N:M sparsity in terms of the trade-off between model accuracy and compute cost.
We propose two new decay-based pruning methods, namely "pruning mask decay" and "sparse structure decay".
Our evaluations indicate that these proposed methods consistently deliver state-of-the-art (SOTA) model accuracy, comparable to unstructured sparsity.
arXiv Detail & Related papers (2022-09-15T21:30:55Z)
- Powerpropagation: A sparsity inducing weight reparameterisation [65.85142037667065]
We introduce Powerpropagation, a new weight parameterisation for neural networks that leads to inherently sparse models.
Models trained in this manner exhibit similar performance, but have a distribution with markedly higher density at zero, allowing more parameters to be pruned safely.
Here, we combine Powerpropagation with a traditional weight-pruning technique as well as recent state-of-the-art sparse-to-sparse algorithms, showing superior performance on the ImageNet benchmark.
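A Powerpropagation-style reparameterisation can be sketched in a few lines; the value of `alpha` and the function names are illustrative assumptions, and the exact formulation should be taken from the paper:

```python
import numpy as np

def powerprop(v, alpha=2.0):
    """Powerpropagation-style reparameterisation: effective weight w = v * |v|**(alpha - 1).

    By the chain rule, dL/dv = dL/dw * alpha * |v|**(alpha - 1), so parameters that are
    already small receive ever smaller updates and pile up near zero, which is what makes
    later magnitude pruning comparatively safe (alpha = 1 recovers standard training).
    """
    return v * np.abs(v) ** (alpha - 1.0)

def powerprop_grad(v, grad_w, alpha=2.0):
    """Gradient w.r.t. the underlying parameter v, given dL/dw."""
    return grad_w * alpha * np.abs(v) ** (alpha - 1.0)

v = np.array([0.05, -0.8, 0.3])
print(powerprop(v))                   # [0.0025, -0.64, 0.09]: small entries shrink fastest
print(powerprop_grad(v, np.ones(3)))  # [0.1, 1.6, 0.6]: and receive the smallest updates
```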
arXiv Detail & Related papers (2021-10-01T10:03:57Z)
- Towards Practical Lipreading with Distilled and Efficient Models [57.41253104365274]
Lipreading has witnessed a lot of progress due to the resurgence of neural networks.
Recent works have placed emphasis on aspects such as improving performance by finding the optimal architecture or improving generalization.
There is still a significant gap between the current methodologies and the requirements for an effective deployment of lipreading in practical scenarios.
We propose a series of innovations that significantly bridge that gap: first, we raise the state-of-the-art performance by a wide margin on LRW and LRW-1000 to 88.5% and 46.6%, respectively, using self-distillation.
arXiv Detail & Related papers (2020-07-13T16:56:27Z)
- Learning Expectation of Label Distribution for Facial Age and Attractiveness Estimation [65.5880700862751]
We analyze the essential relationship between two state-of-the-art methods (Ranking-CNN and DLDL) and show that the Ranking method is in fact learning label distribution implicitly.
We propose a lightweight network architecture and a unified framework which can jointly learn the facial attribute distribution and regress the attribute value.
Our method achieves new state-of-the-art results using a single model with 36$\times$ fewer parameters and 3$\times$ faster inference speed on facial age/attractiveness estimation.
arXiv Detail & Related papers (2020-07-03T15:46:53Z)
- Extrapolation for Large-batch Training in Deep Learning [72.61259487233214]
We show that a host of variations can be covered in a unified framework that we propose.
We prove the convergence of this novel scheme and rigorously evaluate its empirical performance on ResNet, LSTM, and Transformer.
arXiv Detail & Related papers (2020-06-10T08:22:41Z)