Related papers: Random-LTD: Random and Layerwise Token Dropping Brings Efficient Training for Large-scale Transformers

Random-LTD: Random and Layerwise Token Dropping Brings Efficient Training for Large-scale Transformers

URL: http://arxiv.org/abs/2211.11586v1
Date: Thu, 17 Nov 2022 23:14:58 GMT
Title: Random-LTD: Random and Layerwise Token Dropping Brings Efficient Training for Large-scale Transformers
Authors: Zhewei Yao, Xiaoxia Wu, Conglong Li, Connor Holmes, Minjia Zhang, Cheng Li, Yuxiong He
Abstract summary: We propose a novel random and layerwise token dropping method (random-LTD) for transformer models. random-LTD achieves considerable speedups and comparable accuracy as the standard training baseline. Our results show that random-LTD can save about 33.3% theoretical compute cost and 25.6% wall-clock training time.
Score: 31.021091635737776
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large-scale transformer models have become the de-facto architectures for various machine learning applications, e.g., CV and NLP. However, those large models also introduce prohibitive training costs. To mitigate this issue, we propose a novel random and layerwise token dropping method (random-LTD), which skips the computation of a subset of the input tokens at all middle layers. Particularly, random-LTD achieves considerable speedups and comparable accuracy as the standard training baseline. Compared to other token dropping methods, random-LTD does not require (1) any importance score-based metrics, (2) any special token treatment (e.g., [CLS]), and (3) many layers in full sequence length training except the first and the last layers. Besides, a new LayerToken learning rate schedule is proposed for pretraining problems that resolve the heavy tuning requirement for our proposed training mechanism. Finally, we demonstrate that random-LTD can be applied to broader applications, including GPT and BERT pretraining as well as ViT and GPT finetuning tasks. Our results show that random-LTD can save about 33.3% theoretical compute cost and 25.6% wall-clock training time while achieving similar zero-shot evaluations on GPT-31.3B as compared to baseline.

Related papers

Basis-Oriented Low-rank Transfer for Few-Shot and Test-Time Adaptation [10.804106052326402]
Adapting large pre-trained models to unseen tasks under tight data and compute budgets remains challenging.<n>We propose BOLT, a framework that reuses existing fine-tuned models and adapts within that subspace.<n>Our results show that constraining adaptation to a task-informed subspace provides an effective alternative for unseen-task transfer.
arXiv Detail & Related papers (2025-12-02T06:00:16Z)
TwIST: Rigging the Lottery in Transformers with Independent Subnetwork Training [6.7228358095570995]
TwIST is a distributed training framework for efficient large language model sparsification.<n>It trains multipleworks in parallel, periodically aggregates their parameters, and resamples newworks during training.<n>It identifies high-qualityworks ("golden tickets") without requiring post-training procedures such as calibration or Hessian-based recovery.
arXiv Detail & Related papers (2025-11-06T02:13:24Z)
Sprint: Sparse-Dense Residual Fusion for Efficient Diffusion Transformers [91.02299679350834]
Diffusion Transformers (DiTs) deliver state-of-the-art generative performance but their quadratic training cost with sequence length makes large-scale pretraining prohibitively expensive.<n>We present Sparse--Dense Residual Fusion for Efficient Diffusion Transformers, a simple method that enables aggressive token dropping (up to 75%) while preserving quality.
arXiv Detail & Related papers (2025-10-24T19:29:55Z)
Thinking Augmented Pre-training [88.04395622064708]
Thinking augmented Pre-Training is a universal methodology that augments text with automatically generated thinking trajectories.<n>This paper introduces a simple and scalable approach to improve the data efficiency of large language model (LLM) training by augmenting existing text data with thinking trajectories.
arXiv Detail & Related papers (2025-09-24T14:45:13Z)
FINE: Factorizing Knowledge for Initialization of Variable-sized Diffusion Models [35.40065954148091]
FINE is a method based on the Learngene framework to initializing downstream networks leveraging pre-trained models. It decomposes pre-trained knowledge into the product of matrices (i.e., $U$, $Sigma$, and $V$), where $U$ and $V$ are shared across network blocks as learngenes'' It consistently outperforms direct pre-training, particularly for smaller models, achieving state-of-the-art results across variable model sizes.
arXiv Detail & Related papers (2024-09-28T08:57:17Z)
PYRA: Parallel Yielding Re-Activation for Training-Inference Efficient Task Adaptation [61.57833648734164]
We propose a novel Parallel Yielding Re-Activation (PYRA) method for training-inference efficient task adaptation. PYRA outperforms all competing methods under both low compression rate and high compression rate.
arXiv Detail & Related papers (2024-03-14T09:06:49Z)
Dynamic Layer Tying for Parameter-Efficient Transformers [65.268245109828]
We employ Reinforcement Learning to select layers during training and tie them together. This facilitates weight sharing, reduces the number of trainable parameters, and also serves as an effective regularization technique. In particular, the memory consumption during training is up to one order of magnitude less than the conventional training method.
arXiv Detail & Related papers (2024-01-23T14:53:20Z)
Making Pre-trained Language Models both Task-solvers and Self-calibrators [52.98858650625623]
Pre-trained language models (PLMs) serve as backbones for various real-world systems. Previous work shows that introducing an extra calibration task can mitigate this issue. We propose a training algorithm LM-TOAST to tackle the challenges.
arXiv Detail & Related papers (2023-07-21T02:51:41Z)
Approximated Prompt Tuning for Vision-Language Pre-trained Models [54.326232586461614]
In vision-language pre-trained models, prompt tuning often requires a large number of learnable tokens to bridge the gap between the pre-training and downstream tasks. We propose a novel Approximated Prompt Tuning (APT) approach towards efficient VL transfer learning.
arXiv Detail & Related papers (2023-06-27T05:43:47Z)
Peeling the Onion: Hierarchical Reduction of Data Redundancy for Efficient Vision Transformer Training [110.79400526706081]
Vision transformers (ViTs) have recently obtained success in many applications, but their intensive computation and heavy memory usage limit their generalization. Previous compression algorithms usually start from the pre-trained dense models and only focus on efficient inference. This paper proposes an end-to-end efficient training framework from three sparse perspectives, dubbed Tri-Level E-ViT.
arXiv Detail & Related papers (2022-11-19T21:15:47Z)
Tuning Language Models as Training Data Generators for Augmentation-Enhanced Few-Shot Learning [30.65315081964461]
We study few-shot learning with pretrained language models (PLMs) from a different perspective. We first tune an autoregressive PLM on the few-shot samples and then use it as a generator to synthesize a large amount of novel training samples. Our approach FewGen achieves an overall better result across seven classification tasks of the GLUE benchmark than existing few-shot learning methods.
arXiv Detail & Related papers (2022-11-06T06:46:47Z)
Efficient NLP Model Finetuning via Multistage Data Filtering [11.058786955754004]
We set to filter training examples in a streaming fashion, in tandem with training the target model. Our key techniques are (1) automatically determine a training loss threshold for skipping backward training passes; (2) run a meta predictor for further skipping forward training passes. Our method reduces the required training examples by up to 5.3$times$ and training time by up to 6.8$times$, while only seeing minor accuracy degradation.
arXiv Detail & Related papers (2022-07-28T21:43:31Z)
FracTrain: Fractionally Squeezing Bit Savings Both Temporally and Spatially for Efficient DNN Training [81.85361544720885]
We propose FracTrain that integrates progressive fractional quantization which gradually increases the precision of activations, weights, and gradients. FracTrain reduces computational cost and hardware-quantified energy/latency of DNN training while achieving a comparable or better (-0.12%+1.87%) accuracy.
arXiv Detail & Related papers (2020-12-24T05:24:10Z)

This list is automatically generated from the titles and abstracts of the papers in this site.