SpacTor-T5: Pre-training T5 Models with Span Corruption and Replaced
Token Detection
- URL: http://arxiv.org/abs/2401.13160v1
- Date: Wed, 24 Jan 2024 00:36:13 GMT
- Title: SpacTor-T5: Pre-training T5 Models with Span Corruption and Replaced
Token Detection
- Authors: Ke Ye, Heinrich Jiang, Afshin Rostamizadeh, Ayan Chakrabarti, Giulia
DeSalvo, Jean-François Kagy, Lazaros Karydas, Gui Citovsky, Sanjiv Kumar
- Abstract summary: SpacTor is a new training procedure consisting of (1) a hybrid objective combining span corruption (SC) and replaced token detection (RTD), and (2) a two-stage curriculum that applies the hybrid objective only during the initial iterations.
In our experiments with encoder-decoder architectures (T5) on a variety of NLP tasks, SpacTor-T5 yields the same downstream performance as standard SC pre-training while cutting pre-training iterations by 50% and total FLOPs by 40%.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Pre-training large language models is known to be extremely resource
intensive and often inefficient, under-utilizing the information encapsulated in
the training text sequences. In this paper, we present SpacTor, a new training
procedure consisting of (1) a hybrid objective combining span corruption (SC) and
replaced token detection (RTD), and (2) a two-stage curriculum that optimizes the
hybrid objective over the initial $\tau$ iterations, then transitions to the
standard SC loss. We show empirically that the effectiveness of the hybrid
objective is tied to the two-stage pre-training schedule, and we provide extensive
analysis of why this is the case. In our experiments with encoder-decoder
architectures (T5) on a variety of NLP tasks, SpacTor-T5 yields the same
downstream performance as standard SC pre-training while enabling a 50% reduction
in pre-training iterations and a 40% reduction in total FLOPs. Alternatively,
given the same computing budget, we find that SpacTor results in significantly
improved downstream benchmark performance.
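
The two-stage recipe in the abstract can be pictured as a simple loss schedule. The sketch below is an illustrative reconstruction, not the authors' implementation: the helper losses, the mixing weight `rtd_weight`, and the dummy model are hypothetical stand-ins.

```python
# Minimal sketch of SpacTor's two-stage curriculum (assumed names and values).
# Stage 1 (step < tau): hybrid objective = span corruption (SC) + replaced token
# detection (RTD). Stage 2 (step >= tau): standard SC loss only.

def span_corruption_loss(batch, model):
    """Standard T5 span-corruption loss. Hypothetical placeholder."""
    return model["sc"](batch)

def replaced_token_detection_loss(batch, model):
    """RTD loss (detect generator-replaced tokens). Hypothetical placeholder."""
    return model["rtd"](batch)

def spactor_loss(batch, model, step, tau, rtd_weight=1.0):
    """Hybrid SC + RTD objective for the first tau steps, then SC alone.
    rtd_weight is an assumed mixing coefficient, not a value from the paper."""
    loss = span_corruption_loss(batch, model)
    if step < tau:
        loss += rtd_weight * replaced_token_detection_loss(batch, model)
    return loss

# Toy usage with constant dummy losses, just to show the schedule switching.
dummy_model = {"sc": lambda b: 2.0, "rtd": lambda b: 0.5}
for step in (0, 9_999, 10_000):
    print(step, spactor_loss(batch=None, model=dummy_model, step=step, tau=10_000))
```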
Related papers
- Accelerating Augmentation Invariance Pretraining (arXiv, 2024-10-27)
We tackle the computational challenges of contrastive learning methods, particularly for the pretraining of Vision Transformers (ViTs).
We propose an acceleration framework, leveraging ViT's unique ability to generalize across inputs of varying sequence lengths.
Our method employs a mix of sequence compression strategies, including randomized token dropout and flexible patch scaling, to reduce the cost of gradient estimation and accelerate convergence (a minimal token-dropout sketch follows this list).
- Mixed Sparsity Training: Achieving 4$\times$ FLOP Reduction for Transformer Pretraining (arXiv, 2024-08-21)
Mixed Sparsity Training (MST) is an efficient pretraining method that reduces floating point operations (FLOPs) by about 75% while maintaining performance.
Our experiment on GPT-2 showcases a 4$\times$ FLOP reduction without compromising performance.
- Efficient Stagewise Pretraining via Progressive Subnetworks (arXiv, 2024-02-08)
The prevailing view suggests that stagewise dropping strategies, such as layer dropping, are ineffective compared to stacking-based approaches.
This paper challenges that notion by demonstrating that, with proper design, dropping strategies can be competitive with, if not better than, stacking methods.
We propose an instantiation of this framework, Random Part Training (RAPTR), which selects and trains only a random subnetwork at each step, progressively increasing its size in stages (a schematic sketch follows this list).
- Efficient Grammatical Error Correction Via Multi-Task Training and Optimized Training Schedule (arXiv, 2023-11-20)
We propose auxiliary tasks that exploit the alignment between the original and corrected sentences.
We formulate each task as a sequence-to-sequence problem and perform multi-task training.
We find that the order of the datasets used for training, and even of individual instances within a dataset, may have important effects on the final performance.
- FracTrain: Fractionally Squeezing Bit Savings Both Temporally and Spatially for Efficient DNN Training (arXiv, 2020-12-24)
We propose FracTrain, which integrates progressive fractional quantization that gradually increases the precision of activations, weights, and gradients.
FracTrain reduces the computational cost and hardware-quantified energy/latency of DNN training while achieving comparable or better (-0.12% to +1.87%) accuracy (a precision-schedule sketch follows this list).
- Dynamic R-CNN: Towards High Quality Object Detection via Dynamic Training (arXiv, 2020-04-13)
We propose Dynamic R-CNN to adjust the label assignment criteria and the shape of the regression loss function during training.
Our method improves upon the ResNet-50-FPN baseline by 1.9% AP and 5.5% AP$_{90}$ on the MS COCO dataset with no extra overhead.
- Gradual Channel Pruning while Training using Feature Relevance Scores for Convolutional Neural Networks (arXiv, 2020-02-23)
Pruning is one of the predominant approaches used for deep network compression.
We present a simple yet effective methodology for gradual channel pruning while training, using a novel data-driven metric.
We demonstrate the effectiveness of the proposed methodology on architectures such as VGG and ResNet.
- Fast is better than free: Revisiting adversarial training (arXiv, 2020-01-12)
We show that it is possible to train empirically robust models using a much weaker and cheaper adversary.
We identify a failure mode referred to as "catastrophic overfitting", which may have caused previous attempts to use FGSM adversarial training to fail (an FGSM training-step sketch follows this list).
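
For the augmentation-invariance pretraining entry, randomized token dropout is the easiest piece to illustrate. This is a minimal sketch under assumed names and shapes (function name, keep ratio, and token dimensions are placeholders), not that paper's implementation:

```python
import numpy as np

def random_token_dropout(patch_tokens, keep_ratio, rng):
    """Keep a random subset of ViT patch tokens; (num_tokens, dim) -> (num_kept, dim)."""
    num_tokens = patch_tokens.shape[0]
    num_keep = max(1, int(round(keep_ratio * num_tokens)))
    kept = rng.choice(num_tokens, size=num_keep, replace=False)
    return patch_tokens[np.sort(kept)]

rng = np.random.default_rng(0)
tokens = rng.normal(size=(196, 768))   # e.g. 14x14 patches of a 224x224 image, ViT-Base width
reduced = random_token_dropout(tokens, keep_ratio=0.5, rng=rng)
print(reduced.shape)                   # (98, 768): shorter sequences make gradients cheaper
```

Flexible patch scaling would analogously shrink the token sequence by re-patchifying at a coarser resolution.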
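For the progressive-subnetworks entry, the summary describes training a random subnetwork whose size grows stage by stage. The sketch below only illustrates that selection schedule; the stage boundaries, keep fractions, and layer count are made-up values, and it is not the RAPTR implementation:

```python
import random

def select_subnetwork(num_layers, keep_fraction, rng):
    """Randomly choose which layers are active (trained) at this step."""
    k = max(1, round(keep_fraction * num_layers))
    return sorted(rng.sample(range(num_layers), k))

def keep_fraction_schedule(step, stage_boundaries, fractions):
    """Stagewise schedule: the fraction of layers trained grows stage by stage."""
    for boundary, frac in zip(stage_boundaries, fractions):
        if step < boundary:
            return frac
    return fractions[-1]   # final stage trains the full network

rng = random.Random(0)
for step in (0, 30_000, 90_000):
    frac = keep_fraction_schedule(step, stage_boundaries=(25_000, 75_000),
                                  fractions=(0.5, 0.75, 1.0))
    print(step, frac, select_subnetwork(num_layers=12, keep_fraction=frac, rng=rng))
```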
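For the FracTrain entry, the "temporal" part of progressive fractional quantization can be pictured as a bit-width schedule that rises over training. This is an illustrative sketch with assumed bit-widths and a generic uniform fake-quantizer, not FracTrain's actual quantization scheme:

```python
import numpy as np

def bitwidth_schedule(step, total_steps, min_bits=4, max_bits=8):
    """Progressively increase quantization precision over training (assumed linear ramp)."""
    frac = min(1.0, step / total_steps)
    return int(round(min_bits + frac * (max_bits - min_bits)))

def fake_quantize(x, num_bits):
    """Uniform symmetric fake-quantization of a tensor to num_bits."""
    levels = 2 ** (num_bits - 1) - 1
    scale = np.max(np.abs(x)) / levels or 1.0
    return np.round(x / scale) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 4))
for step in (0, 50_000, 100_000):
    bits = bitwidth_schedule(step, total_steps=100_000)
    err = float(np.mean(np.abs(w - fake_quantize(w, bits))))
    print(step, bits, err)   # quantization error shrinks as precision grows
```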
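For the fast adversarial training entry, the "weaker and cheaper adversary" is single-step FGSM (that paper pairs it with a random start). The sketch below applies it to a toy linear classifier with an analytic input gradient; epsilon, alpha, and the model are placeholders rather than the paper's settings:

```python
import numpy as np

rng = np.random.default_rng(0)

def loss_and_input_grad(w, x, y):
    """Binary cross-entropy of a linear classifier and its gradient w.r.t. the input x."""
    p = 1.0 / (1.0 + np.exp(-float(w @ x)))
    loss = -(y * np.log(p) + (1 - y) * np.log(1 - p))
    return loss, (p - y) * w          # dL/dx for the linear model

def fgsm_example(w, x, y, eps=0.1, alpha=0.125):
    """Single-step FGSM adversary with a random start inside the eps-ball."""
    delta = rng.uniform(-eps, eps, size=x.shape)      # random initialization
    _, grad_x = loss_and_input_grad(w, x + delta, y)
    delta = np.clip(delta + alpha * np.sign(grad_x), -eps, eps)
    return x + delta                                  # train on this perturbed input

w = np.array([1.0, -2.0, 0.5])
x, y = np.array([0.3, 0.1, -0.4]), 1
print(fgsm_example(w, x, y))   # the model then takes an ordinary gradient step on (x_adv, y)
```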
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.