SpacTor-T5: Pre-training T5 Models with Span Corruption and Replaced
Token Detection
- URL: http://arxiv.org/abs/2401.13160v1
- Date: Wed, 24 Jan 2024 00:36:13 GMT
- Title: SpacTor-T5: Pre-training T5 Models with Span Corruption and Replaced
Token Detection
- Authors: Ke Ye, Heinrich Jiang, Afshin Rostamizadeh, Ayan Chakrabarti, Giulia
DeSalvo, Jean-François Kagy, Lazaros Karydas, Gui Citovsky, Sanjiv Kumar
- Abstract summary: SpacTor is a new training procedure consisting of (1) a hybrid objective combining span corruption (SC) and replaced token detection (RTD), and (2) a two-stage curriculum that applies the hybrid objective only during the initial iterations.
In our experiments with encoder-decoder architectures (T5) on a variety of NLP tasks, SpacTor-T5 yields the same downstream performance as standard SC pre-training while cutting pre-training iterations by 50% and total FLOPs by 40%.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Pre-training large language models is known to be extremely resource
intensive and often inefficient, under-utilizing the information encapsulated in
the training text sequences. In this paper, we present SpacTor, a new training
procedure consisting of (1) a hybrid objective combining span corruption (SC) and
replaced token detection (RTD), and (2) a two-stage curriculum that optimizes the
hybrid objective over the initial $\tau$ iterations, then transitions to the
standard SC loss. We show empirically that the effectiveness of the hybrid
objective is tied to the two-stage pre-training schedule, and we provide extensive
analysis of why this is the case. In our experiments with encoder-decoder
architectures (T5) on a variety of NLP tasks, SpacTor-T5 yields the same
downstream performance as standard SC pre-training while enabling a 50% reduction
in pre-training iterations and a 40% reduction in total FLOPs. Alternatively,
given the same computing budget, we find that SpacTor results in significantly
improved downstream benchmark performance.
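
The two-stage recipe in the abstract can be pictured as a simple loss schedule. The sketch below is an illustrative reconstruction, not the authors' implementation: the helper losses, the mixing weight `rtd_weight`, and the dummy model are hypothetical stand-ins.

```python
# Minimal sketch of SpacTor's two-stage curriculum (assumed names and values).
# Stage 1 (step < tau): hybrid objective = span corruption (SC) + replaced token
# detection (RTD). Stage 2 (step >= tau): standard SC loss only.

def span_corruption_loss(batch, model):
    """Standard T5 span-corruption loss. Hypothetical placeholder."""
    return model["sc"](batch)

def replaced_token_detection_loss(batch, model):
    """RTD loss (detect generator-replaced tokens). Hypothetical placeholder."""
    return model["rtd"](batch)

def spactor_loss(batch, model, step, tau, rtd_weight=1.0):
    """Hybrid SC + RTD objective for the first tau steps, then SC alone.
    rtd_weight is an assumed mixing coefficient, not a value from the paper."""
    loss = span_corruption_loss(batch, model)
    if step < tau:
        loss += rtd_weight * replaced_token_detection_loss(batch, model)
    return loss

# Toy usage with constant dummy losses, just to show the schedule switching.
dummy_model = {"sc": lambda b: 2.0, "rtd": lambda b: 0.5}
for step in (0, 9_999, 10_000):
    print(step, spactor_loss(batch=None, model=dummy_model, step=step, tau=10_000))
```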
Related papers
- Accelerating Augmentation Invariance Pretraining (arXiv, 2024-10-27)
We tackle the computational challenges of contrastive learning methods, particularly for the pretraining of Vision Transformers (ViTs).
We propose an acceleration framework, leveraging ViT's unique ability to generalize across inputs of varying sequence lengths.
Our method employs a mix of sequence compression strategies, including randomized token dropout and flexible patch scaling, to reduce the cost of gradient estimation and accelerate convergence (a minimal token-dropout sketch follows this list).
- Mixed Sparsity Training: Achieving 4$\times$ FLOP Reduction for Transformer Pretraining (arXiv, 2024-08-21)
Mixed Sparsity Training (MST) is an efficient pretraining method that reduces floating point operations (FLOPs) by about 75% while maintaining performance.
Our experiment on GPT-2 showcases a 4$\times$ FLOP reduction without compromising performance.
- Efficient Stagewise Pretraining via Progressive Subnetworks (arXiv, 2024-02-08)
The prevailing view suggests that stagewise dropping strategies, such as layer dropping, are ineffective compared to stacking-based approaches.
This paper challenges that notion by demonstrating that, with proper design, dropping strategies can be competitive with, if not better than, stacking methods.
We propose an instantiation of this framework, Random Part Training (RAPTR), which selects and trains only a random subnetwork at each step, progressively increasing its size in stages (a schematic sketch follows this list).
- Efficient Grammatical Error Correction Via Multi-Task Training and Optimized Training Schedule (arXiv, 2023-11-20)
We propose auxiliary tasks that exploit the alignment between the original and corrected sentences.
We formulate each task as a sequence-to-sequence problem and perform multi-task training.
We find that the order of the datasets used for training, and even of individual instances within a dataset, may have important effects on the final performance.
- FracTrain: Fractionally Squeezing Bit Savings Both Temporally and Spatially for Efficient DNN Training (arXiv, 2020-12-24)
We propose FracTrain, which integrates progressive fractional quantization that gradually increases the precision of activations, weights, and gradients.
FracTrain reduces the computational cost and hardware-quantified energy/latency of DNN training while achieving comparable or better (-0.12% to +1.87%) accuracy (a precision-schedule sketch follows this list).
- Dynamic R-CNN: Towards High Quality Object Detection via Dynamic Training (arXiv, 2020-04-13)
We propose Dynamic R-CNN to adjust the label assignment criteria and the shape of the regression loss function during training.
Our method improves upon the ResNet-50-FPN baseline by 1.9% AP and 5.5% AP$_{90}$ on the MS COCO dataset with no extra overhead.
- Gradual Channel Pruning while Training using Feature Relevance Scores for Convolutional Neural Networks (arXiv, 2020-02-23)
Pruning is one of the predominant approaches used for deep network compression.
We present a simple yet effective methodology for gradual channel pruning while training, using a novel data-driven metric.
We demonstrate the effectiveness of the proposed methodology on architectures such as VGG and ResNet.
- Fast is better than free: Revisiting adversarial training (arXiv, 2020-01-12)
We show that it is possible to train empirically robust models using a much weaker and cheaper adversary.
We identify a failure mode referred to as "catastrophic overfitting", which may have caused previous attempts to use FGSM adversarial training to fail (an FGSM training-step sketch follows this list).
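
For the augmentation-invariance pretraining entry, randomized token dropout is the easiest piece to illustrate. This is a minimal sketch under assumed names and shapes (function name, keep ratio, and token dimensions are placeholders), not that paper's implementation:

```python
import numpy as np

def random_token_dropout(patch_tokens, keep_ratio, rng):
    """Keep a random subset of ViT patch tokens; (num_tokens, dim) -> (num_kept, dim)."""
    num_tokens = patch_tokens.shape[0]
    num_keep = max(1, int(round(keep_ratio * num_tokens)))
    kept = rng.choice(num_tokens, size=num_keep, replace=False)
    return patch_tokens[np.sort(kept)]

rng = np.random.default_rng(0)
tokens = rng.normal(size=(196, 768))   # e.g. 14x14 patches of a 224x224 image, ViT-Base width
reduced = random_token_dropout(tokens, keep_ratio=0.5, rng=rng)
print(reduced.shape)                   # (98, 768): shorter sequences make gradients cheaper
```

Flexible patch scaling would analogously shrink the token sequence by re-patchifying at a coarser resolution.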
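For the progressive-subnetworks entry, the summary describes training a random subnetwork whose size grows stage by stage. The sketch below only illustrates that selection schedule; the stage boundaries, keep fractions, and layer count are made-up values, and it is not the RAPTR implementation:

```python
import random

def select_subnetwork(num_layers, keep_fraction, rng):
    """Randomly choose which layers are active (trained) at this step."""
    k = max(1, round(keep_fraction * num_layers))
    return sorted(rng.sample(range(num_layers), k))

def keep_fraction_schedule(step, stage_boundaries, fractions):
    """Stagewise schedule: the fraction of layers trained grows stage by stage."""
    for boundary, frac in zip(stage_boundaries, fractions):
        if step < boundary:
            return frac
    return fractions[-1]   # final stage trains the full network

rng = random.Random(0)
for step in (0, 30_000, 90_000):
    frac = keep_fraction_schedule(step, stage_boundaries=(25_000, 75_000),
                                  fractions=(0.5, 0.75, 1.0))
    print(step, frac, select_subnetwork(num_layers=12, keep_fraction=frac, rng=rng))
```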
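For the FracTrain entry, the "temporal" part of progressive fractional quantization can be pictured as a bit-width schedule that rises over training. This is an illustrative sketch with assumed bit-widths and a generic uniform fake-quantizer, not FracTrain's actual quantization scheme:

```python
import numpy as np

def bitwidth_schedule(step, total_steps, min_bits=4, max_bits=8):
    """Progressively increase quantization precision over training (assumed linear ramp)."""
    frac = min(1.0, step / total_steps)
    return int(round(min_bits + frac * (max_bits - min_bits)))

def fake_quantize(x, num_bits):
    """Uniform symmetric fake-quantization of a tensor to num_bits."""
    levels = 2 ** (num_bits - 1) - 1
    scale = np.max(np.abs(x)) / levels or 1.0
    return np.round(x / scale) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 4))
for step in (0, 50_000, 100_000):
    bits = bitwidth_schedule(step, total_steps=100_000)
    err = float(np.mean(np.abs(w - fake_quantize(w, bits))))
    print(step, bits, err)   # quantization error shrinks as precision grows
```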
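For the fast adversarial training entry, the "weaker and cheaper adversary" is single-step FGSM (that paper pairs it with a random start). The sketch below applies it to a toy linear classifier with an analytic input gradient; epsilon, alpha, and the model are placeholders rather than the paper's settings:

```python
import numpy as np

rng = np.random.default_rng(0)

def loss_and_input_grad(w, x, y):
    """Binary cross-entropy of a linear classifier and its gradient w.r.t. the input x."""
    p = 1.0 / (1.0 + np.exp(-float(w @ x)))
    loss = -(y * np.log(p) + (1 - y) * np.log(1 - p))
    return loss, (p - y) * w          # dL/dx for the linear model

def fgsm_example(w, x, y, eps=0.1, alpha=0.125):
    """Single-step FGSM adversary with a random start inside the eps-ball."""
    delta = rng.uniform(-eps, eps, size=x.shape)      # random initialization
    _, grad_x = loss_and_input_grad(w, x + delta, y)
    delta = np.clip(delta + alpha * np.sign(grad_x), -eps, eps)
    return x + delta                                  # train on this perturbed input

w = np.array([1.0, -2.0, 0.5])
x, y = np.array([0.3, 0.1, -0.4]), 1
print(fgsm_example(w, x, y))   # the model then takes an ordinary gradient step on (x_adv, y)
```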
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.