SpacTor-T5: Pre-training T5 Models with Span Corruption and Replaced
Token Detection
- URL: http://arxiv.org/abs/2401.13160v1
- Date: Wed, 24 Jan 2024 00:36:13 GMT
- Title: SpacTor-T5: Pre-training T5 Models with Span Corruption and Replaced
Token Detection
- Authors: Ke Ye, Heinrich Jiang, Afshin Rostamizadeh, Ayan Chakrabarti, Giulia
DeSalvo, Jean-François Kagy, Lazaros Karydas, Gui Citovsky, Sanjiv Kumar
- Abstract summary: SpacTor is a new training procedure consisting of a hybrid objective combining span corruption (SC) and token replacement detection (RTD).
In our experiments with encoder-decoder architectures (T5) on a variety of NLP tasks, SpacTor-T5 yields the same downstream performance as standard SC pre-training.
- Score: 49.43407207482008
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Pre-training large language models is known to be extremely resource
intensive and oftentimes inefficient, under-utilizing the information
encapsulated in the training text sequences. In this paper, we present SpacTor,
a new training procedure consisting of (1) a hybrid objective combining span
corruption (SC) and token replacement detection (RTD), and (2) a two-stage
curriculum that optimizes the hybrid objective over the initial $\tau$
iterations, then transitions to standard SC loss. We show empirically that the
effectiveness of the hybrid objective is tied to the two-stage pre-training
schedule, and provide extensive analysis on why this is the case. In our
experiments with encoder-decoder architectures (T5) on a variety of NLP tasks,
SpacTor-T5 yields the same downstream performance as standard SC pre-training,
while enabling a 50% reduction in pre-training iterations and 40% reduction in
total FLOPs. Alternatively, given the same amount of computing budget, we find
that SpacTor results in significantly improved downstream benchmark
performance.
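The two-stage curriculum described in the abstract can be sketched as a loss schedule. This is a minimal illustration, not the authors' implementation: the function name `spactor_style_loss`, the `rtd_weight` parameter, and the simple weighted sum used for the hybrid objective are all assumptions made here for clarity. The abstract only states that the hybrid objective is optimized for the initial $\tau$ iterations and that training then switches to the standard SC loss.

```python
def spactor_style_loss(step: int, tau: int, sc_loss: float, rtd_loss: float,
                       rtd_weight: float = 1.0) -> float:
    """Return the training loss for a given step under a two-stage schedule.

    Hypothetical sketch: the weighted sum below is an assumed form of the
    hybrid objective; the source does not specify how SC and RTD are combined.
    """
    if step < tau:
        # Stage 1 (first tau iterations): hybrid objective, SC plus RTD.
        return sc_loss + rtd_weight * rtd_loss
    # Stage 2 (remaining iterations): standard span corruption loss only.
    return sc_loss
```

For example, with `tau=10`, a call at `step=0` combines both losses, while a call at `step=10` returns the SC loss unchanged. In a real training loop the two loss arguments would be tensors produced by the model rather than floats.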
Related papers
- Sparse Spectral Training and Inference on Euclidean and Hyperbolic Neural Networks [9.96381061452642]
We propose Sparse Spectral Training (SST), an advanced training methodology that updates all singular values and selectively updates singular vectors of network weights.
SST refines the training process by employing a targeted updating strategy for singular vectors, which is determined by a multinomial sampling method weighted by the significance of the singular values.
On OPT-125M, with rank equating to 8.3% of embedding dimension, SST reduces the perplexity gap to full-rank training by 67.6%, demonstrating a significant reduction of the performance loss with prevalent low-rank methods.
arXiv Detail & Related papers (2024-05-24T11:59:41Z)
- Efficient Stagewise Pretraining via Progressive Subnetworks [55.65819977062729]
We propose an alternative framework, progressive subnetwork training, that maintains the full model throughout training, but only trains subnetworks within the model in each step.
RaPTr achieves better pre-training loss for BERT and UL2 language models while requiring 20-33% fewer FLOPs compared to standard training, and is competitive or better than other efficient training methods.
arXiv Detail & Related papers (2024-02-08T18:49:09Z)
- Efficient Grammatical Error Correction Via Multi-Task Training and
Optimized Training Schedule [55.08778142798106]
We propose auxiliary tasks that exploit the alignment between the original and corrected sentences.
We formulate each task as a sequence-to-sequence problem and perform multi-task training.
We find that the order of datasets used for training and even individual instances within a dataset may have important effects on the final performance.
arXiv Detail & Related papers (2023-11-20T14:50:12Z)
- FracTrain: Fractionally Squeezing Bit Savings Both Temporally and
Spatially for Efficient DNN Training [81.85361544720885]
We propose FracTrain that integrates progressive fractional quantization which gradually increases the precision of activations, weights, and gradients.
FracTrain reduces computational cost and hardware-quantified energy/latency of DNN training while achieving a comparable or better (-0.12% to +1.87%) accuracy.
arXiv Detail & Related papers (2020-12-24T05:24:10Z)
- Dynamic R-CNN: Towards High Quality Object Detection via Dynamic
Training [70.2914594796002]
We propose Dynamic R-CNN to adjust the label assignment criteria and the shape of regression loss function.
Our method improves upon the ResNet-50-FPN baseline with 1.9% AP and 5.5% AP$_{90}$ on the MS COCO dataset with no extra overhead.
arXiv Detail & Related papers (2020-04-13T15:20:25Z)
- Pruning Filters while Training for Efficiently Optimizing Deep Learning
Networks [6.269700080380206]
Pruning techniques have been proposed that remove less significant weights in deep networks.
We propose a dynamic pruning-while-training procedure, wherein we prune filters of a deep network during training itself.
Results indicate that pruning while training yields a compressed network with almost no accuracy loss after pruning 50% of the filters.
arXiv Detail & Related papers (2020-03-05T18:05:17Z)
- Gradual Channel Pruning while Training using Feature Relevance Scores
for Convolutional Neural Networks [6.534515590778012]
Pruning is one of the predominant approaches used for deep network compression.
We present a simple-yet-effective gradual channel pruning while training methodology using a novel data-driven metric.
We demonstrate the effectiveness of the proposed methodology on architectures such as VGG and ResNet.
arXiv Detail & Related papers (2020-02-23T17:56:18Z)
- Fast is better than free: Revisiting adversarial training [86.11788847990783]
We show that it is possible to train empirically robust models using a much weaker and cheaper adversary.
We identify a failure mode referred to as "catastrophic overfitting" which may have caused previous attempts to use FGSM adversarial training to fail.
arXiv Detail & Related papers (2020-01-12T20:30:22Z)