Mixed Sparsity Training: Achieving 4$\times$ FLOP Reduction for Transformer Pretraining
- URL: http://arxiv.org/abs/2408.11746v1
- Date: Wed, 21 Aug 2024 16:13:16 GMT
- Title: Mixed Sparsity Training: Achieving 4$\times$ FLOP Reduction for Transformer Pretraining
- Authors: Pihe Hu, Shaolong Li, Longbo Huang
- Abstract summary: Mixed Sparsity Training (MST) is an efficient pretraining method that can reduce about $75\%$ of Floating Point Operations (FLOPs) while maintaining performance.
Our experiment on GPT-2 showcases a FLOP reduction of $4\times$ without compromising performance.
- Score: 32.925150708409205
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Large language models (LLMs) have made significant strides in complex tasks, yet their widespread adoption is impeded by substantial computational demands. With hundreds of billions of parameters, transformer-based LLMs necessitate months of pretraining on a high-end GPU cluster. However, this paper reveals a compelling finding: transformers exhibit considerable redundancy in pretraining computations, which motivates our proposed solution, Mixed Sparsity Training (MST), an efficient pretraining method that can reduce about $75\%$ of Floating Point Operations (FLOPs) while maintaining performance. MST integrates dynamic sparse training (DST) with Sparsity Variation (SV) and Hybrid Sparse Attention (HSA) during pretraining, involving three distinct phases: warm-up, ultra-sparsification, and restoration. The warm-up phase transforms the dense model into a sparse one, and the restoration phase reinstates connections. Throughout these phases, the model is trained with a dynamically evolving sparse topology and an HSA mechanism to maintain performance and minimize training FLOPs concurrently. Our experiment on GPT-2 showcases a FLOP reduction of $4\times$ without compromising performance.
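The three phases described in the abstract can be pictured as a sparsity schedule driving a dynamic sparse training loop. The sketch below is a minimal illustration under assumed details: a piecewise-linear schedule (the breakpoints and 0.9 peak are illustrative, not the paper's Sparsity Variation settings) and magnitude-prune / random-regrow mask updates; the Hybrid Sparse Attention mechanism is not reproduced here.

```python
import torch

def target_sparsity(step, total, warmup_frac=0.2, restore_frac=0.2, peak=0.9):
    """Piecewise-linear sparsity schedule: warm-up ramps sparsity up,
    ultra-sparsification holds it at the peak, restoration ramps it back down.
    Breakpoints and peak value are illustrative assumptions."""
    warm_end = warmup_frac * total
    restore_start = (1.0 - restore_frac) * total
    if step < warm_end:                       # warm-up: dense -> sparse
        return peak * step / warm_end
    if step < restore_start:                  # ultra-sparsification: stay sparse
        return peak
    return peak * (total - step) / (total - restore_start)  # restoration

def update_mask(weight, sparsity, regrow_frac=0.1):
    """One dynamic-sparse-training mask update: keep the largest-magnitude
    weights, then randomly regrow a few pruned connections so the sparse
    topology can evolve during training."""
    numel = weight.numel()
    n_keep = max(1, int(numel * (1.0 - sparsity)))
    keep_idx = torch.topk(weight.abs().flatten(), n_keep).indices
    mask = torch.zeros(numel, dtype=torch.bool)
    mask[keep_idx] = True
    pruned = (~mask).nonzero().flatten()
    if len(pruned) > 0:
        n_regrow = int(regrow_frac * len(pruned) * (1.0 - sparsity))
        regrow = pruned[torch.randperm(len(pruned))[:n_regrow]]
        mask[regrow] = True
    return mask.view_as(weight)

w = torch.randn(512, 512)
for step in (0, 2_000, 5_000, 9_500):
    s = target_sparsity(step, total=10_000)
    m = update_mask(w, s)
    print(f"step {step}: target sparsity {s:.2f}, kept fraction {m.float().mean().item():.2f}")
```

Only the masked weights participate in forward and backward passes, which is where the FLOP reduction during pretraining comes from.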
Related papers
- Exploring the Benefit of Activation Sparsity in Pre-training [117.25661020250658]
We study how activation properties change during pre-training.
We propose Switchable Sparse-Dense Learning (SSD).
SSD achieves comparable performance with identical model size and reduces pre-training costs.
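The summary does not spell out the switching mechanism, so the sketch below only illustrates the general idea of alternating between a sparse and a dense forward pass; the top-k activation mask and keep ratio are assumed stand-ins, not SSD's actual method.

```python
import torch

def sparse_dense_forward(x, weight, sparse=True, keep_ratio=0.1):
    """Toy switch between a sparse and a dense forward pass. The top-k
    activation mask is an illustrative assumption."""
    h = x @ weight
    if sparse:
        k = max(1, int(keep_ratio * h.shape[-1]))
        thresh = torch.topk(h.abs(), k, dim=-1).values[..., -1:]
        h = torch.where(h.abs() >= thresh, h, torch.zeros_like(h))
    return torch.relu(h)

x, w = torch.randn(4, 256), torch.randn(256, 1024)
sparse_out = sparse_dense_forward(x, w, sparse=True)   # cheap, sparse phase
dense_out = sparse_dense_forward(x, w, sparse=False)   # full, dense phase
print((sparse_out != 0).float().mean(), (dense_out != 0).float().mean())
```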
arXiv Detail & Related papers (2024-10-04T13:53:33Z)
- SpacTor-T5: Pre-training T5 Models with Span Corruption and Replaced Token Detection [49.43407207482008]
SpacTor is a new training procedure consisting of a hybrid objective combining span corruption (SC) and replaced token detection (RTD).
In our experiments with encoder-decoder architectures (T5) on a variety of NLP tasks, SpacTor-T5 yields the same downstream performance as standard SC pre-training.
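A minimal sketch of such a hybrid objective, simply adding a span-corruption cross-entropy term and a replaced-token-detection binary term; the weighting, tensor shapes, and function names are illustrative assumptions, not SpacTor's exact formulation.

```python
import torch
import torch.nn.functional as F

def hybrid_objective(sc_logits, sc_targets, rtd_logits, rtd_labels, rtd_weight=1.0):
    """Toy hybrid loss: seq2seq denoising (span corruption) plus per-token
    replaced-or-not detection. rtd_weight is an assumed mixing coefficient."""
    sc_loss = F.cross_entropy(sc_logits.flatten(0, 1), sc_targets.flatten())
    rtd_loss = F.binary_cross_entropy_with_logits(rtd_logits, rtd_labels.float())
    return sc_loss + rtd_weight * rtd_loss

sc_logits = torch.randn(2, 8, 32000)         # (batch, target_len, vocab)
sc_targets = torch.randint(0, 32000, (2, 8))
rtd_logits = torch.randn(2, 16)              # (batch, input_len) replaced-or-not scores
rtd_labels = torch.randint(0, 2, (2, 16))
print(hybrid_objective(sc_logits, sc_targets, rtd_logits, rtd_labels))
```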
arXiv Detail & Related papers (2024-01-24T00:36:13Z)
- Dynamic Layer Tying for Parameter-Efficient Transformers [65.268245109828]
We employ Reinforcement Learning to select layers during training and tie them together.
This facilitates weight sharing, reduces the number of trainable parameters, and also serves as an effective regularization technique.
In particular, the memory consumption during training is up to one order of magnitude less than the conventional training method.
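A toy illustration of the effect of layer tying: positions in the stack reuse a small set of shared layers, so far fewer distinct weight tensors are trained. The fixed tying pattern below is a hypothetical stand-in for the RL-chosen assignment.

```python
import torch
import torch.nn as nn

class TiedStack(nn.Module):
    """Toy layer tying: 12 stack positions share only 3 distinct layers,
    so parameter count and optimizer state shrink accordingly."""
    def __init__(self, d_model=256, tying=None):
        super().__init__()
        # example assignment (hypothetical): position -> shared layer index
        self.tying = tying or [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2]
        n_unique = max(self.tying) + 1
        self.layers = nn.ModuleList(
            [nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
             for _ in range(n_unique)]
        )

    def forward(self, x):
        for idx in self.tying:          # reuse the same weights at tied positions
            x = self.layers[idx](x)
        return x

model = TiedStack()
print(sum(p.numel() for p in model.parameters()))   # ~3 layers' worth, not 12
out = model(torch.randn(2, 10, 256))
```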
arXiv Detail & Related papers (2024-01-23T14:53:20Z)
- Efficient GPT Model Pre-training using Tensor Train Matrix Representation [65.96485282393361]
Large-scale transformer models feature billions of parameters, leading to difficulties in deployment and prohibitive costs when training from scratch.
To reduce the number of parameters in the GPT-2 architecture, we replace the matrices of fully-connected layers with the corresponding Tensor Train Matrix (TTM) structure.
The resulting GPT-based model stores up to 40% fewer parameters, showing perplexity comparable to that of the original model.
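A toy two-core tensor-train-matrix linear layer illustrates where the parameter saving comes from: the dense weight matrix is never materialized and is instead the contraction of two small cores with the reshaped input. The factor sizes and rank below are illustrative, not the paper's.

```python
import torch
import torch.nn as nn

class TTMLinear(nn.Module):
    """Toy TTM linear layer: a (1024 x 768) weight is represented by two
    small cores instead of one dense matrix."""
    def __init__(self, in_factors=(32, 24), out_factors=(32, 32), rank=8):
        super().__init__()
        i1, i2 = in_factors
        o1, o2 = out_factors
        self.in_factors, self.out_factors = in_factors, out_factors
        self.core1 = nn.Parameter(torch.randn(o1, i1, rank) * 0.02)
        self.core2 = nn.Parameter(torch.randn(rank, o2, i2) * 0.02)

    def forward(self, x):                      # x: (batch, i1 * i2)
        i1, i2 = self.in_factors
        o1, o2 = self.out_factors
        xr = x.view(-1, i1, i2)
        # contract input with both cores; no dense (o1*o2, i1*i2) matrix is built
        y = torch.einsum("nab,par,rqb->npq", xr, self.core1, self.core2)
        return y.reshape(-1, o1 * o2)

layer = TTMLinear()
dense_params = (32 * 24) * (32 * 32)           # what a plain nn.Linear weight would store
ttm_params = sum(p.numel() for p in layer.parameters())
print(ttm_params, "vs", dense_params)
out = layer(torch.randn(4, 32 * 24))           # -> (4, 1024)
```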
arXiv Detail & Related papers (2023-06-05T08:38:25Z)
- COST-EFF: Collaborative Optimization of Spatial and Temporal Efficiency with Slenderized Multi-exit Language Models [16.586312156966635]
Transformer-based pre-trained language models (PLMs) mostly suffer from excessive overhead despite their advanced capacity.
Existing statically compressed models are unaware of the diverse complexities between input instances.
We propose a collaborative optimization for PLMs that integrates static model compression and dynamic inference acceleration.
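A toy sketch of the dynamic-inference side of such a design: an internal classifier after each block lets confident (easy) inputs exit early, while hard inputs use the full depth. The architecture, exit heads, and confidence threshold are illustrative assumptions, not COST-EFF's slenderized PLM.

```python
import torch
import torch.nn as nn

class MultiExitEncoder(nn.Module):
    """Toy multi-exit model with one classifier per block."""
    def __init__(self, d=128, n_blocks=4, n_classes=2):
        super().__init__()
        self.blocks = nn.ModuleList([nn.Sequential(nn.Linear(d, d), nn.ReLU())
                                     for _ in range(n_blocks)])
        self.exits = nn.ModuleList([nn.Linear(d, n_classes) for _ in range(n_blocks)])

    @torch.no_grad()
    def infer(self, x, confidence=0.9):
        for block, exit_head in zip(self.blocks, self.exits):
            x = block(x)
            probs = exit_head(x).softmax(dim=-1)
            if probs.max() >= confidence:       # easy input: stop early
                return probs
        return probs                            # hard input: used every block

model = MultiExitEncoder()
print(model.infer(torch.randn(1, 128)))
```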
arXiv Detail & Related papers (2022-10-27T15:06:40Z)
- Online Convolutional Re-parameterization [51.97831675242173]
We present online convolutional re-parameterization (OREPA), a two-stage pipeline, aiming to reduce the huge training overhead by squeezing the complex training-time block into a single convolution.
Compared with the state-of-the-art re-param models, OREPA is able to save the training-time memory cost by about 70% and accelerate the training speed by around 2x.
We also conduct experiments on object detection and semantic segmentation and show consistent improvements on the downstream tasks.
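The saving rests on the linearity of convolution: parallel branches with the same configuration can be collapsed into one convolution by summing their kernels. A minimal sketch with two 3x3 branches follows; OREPA's actual blocks and online squeezing procedure are more involved than this.

```python
import torch
import torch.nn.functional as F

# Two parallel 3x3 conv branches (shapes are illustrative).
k1 = torch.randn(16, 8, 3, 3)
k2 = torch.randn(16, 8, 3, 3)
b1, b2 = torch.randn(16), torch.randn(16)

x = torch.randn(2, 8, 32, 32)
multi_branch = (F.conv2d(x, k1, b1, padding=1) +
                F.conv2d(x, k2, b2, padding=1))
merged = F.conv2d(x, k1 + k2, b1 + b2, padding=1)   # single equivalent conv
print(torch.allclose(multi_branch, merged, atol=1e-4))
```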
arXiv Detail & Related papers (2022-04-02T09:50:19Z)
- A Fast and Efficient Conditional Learning for Tunable Trade-Off between Accuracy and Robustness [11.35810118757863]
Existing models that achieve state-of-the-art (SOTA) performance on both clean and adversarially-perturbed images rely on convolution operations conditioned with feature-wise linear modulation (FiLM) layers.
We present a fast learnable once-for-all adversarial training (FLOAT) algorithm, which replaces the existing FiLM-based conditioning with a weight-conditioned learning scheme that requires no additional layers.
In particular, we add scaled noise to the weight tensors, which enables a trade-off between clean and adversarial performance.
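A toy sketch of the weight-conditioning idea: a scaled noise tensor is added to the weights, and a scalar conditioning input slides the layer between its clean and robust modes without extra FiLM layers. The parameterization below is an assumption for illustration, not FLOAT's exact formulation.

```python
import torch
import torch.nn as nn

class NoiseConditionedLinear(nn.Module):
    """Toy weight-conditioned layer: weight + lam * alpha * noise."""
    def __init__(self, d_in, d_out):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(d_out, d_in) * 0.02)
        self.alpha = nn.Parameter(torch.full((d_out, 1), 0.01))   # learnable noise scale
        self.register_buffer("noise", torch.randn(d_out, d_in))   # fixed noise tensor

    def forward(self, x, lam=0.0):
        # lam = 0.0 -> clean-accuracy mode, lam = 1.0 -> robustness mode
        w = self.weight + lam * self.alpha * self.noise
        return x @ w.t()

layer = NoiseConditionedLinear(64, 32)
x = torch.randn(4, 64)
clean_out = layer(x, lam=0.0)
robust_out = layer(x, lam=1.0)
```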
arXiv Detail & Related papers (2022-03-28T19:25:36Z)
- Accelerating Training of Transformer-Based Language Models with Progressive Layer Dropping [24.547833264405355]
The proposed method achieves a 24% time reduction on average per sample and allows the pre-training to be 2.5 times faster than the baseline.
While being faster, our pre-trained models retain strong knowledge transferability, achieving comparable and sometimes higher GLUE scores than the baseline.
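A minimal sketch of progressive layer dropping as a stochastic-depth-style schedule in which deeper layers are skipped more often as training progresses; the linear schedule and inverse-probability rescaling are illustrative assumptions, not the paper's exact schedule.

```python
import torch

def keep_probability(layer_idx, n_layers, step, total_steps, min_keep=0.5):
    """Toy schedule: every layer runs early in training; deeper layers are
    skipped increasingly often later on (min_keep is an assumed floor)."""
    progress = min(step / total_steps, 1.0)
    depth_frac = (layer_idx + 1) / n_layers
    return 1.0 - progress * depth_frac * (1.0 - min_keep)

def forward_with_layer_drop(x, layers, step, total_steps):
    for i, layer in enumerate(layers):
        p = keep_probability(i, len(layers), step, total_steps)
        if torch.rand(()) < p:
            x = x + layer(x) / p     # rescale so the expected output is unchanged
        # else: skip the layer entirely, saving its FLOPs this step
    return x

layers = [torch.nn.Linear(64, 64) for _ in range(12)]
out = forward_with_layer_drop(torch.randn(2, 64), layers, step=8_000, total_steps=10_000)
```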
arXiv Detail & Related papers (2020-10-26T06:50:07Z)
- Understanding the Difficulty of Training Transformers [120.99980924577787]
We show that unbalanced gradients are not the root cause of the instability of training.
We propose Admin to stabilize training in the early stage and unleash the model's full potential in the late stage.
arXiv Detail & Related papers (2020-04-17T13:59:07Z)