To Pretrain or Not to Pretrain: Examining the Benefits of Pretraining on
Resource Rich Tasks
- URL: http://arxiv.org/abs/2006.08671v1
- Date: Mon, 15 Jun 2020 18:18:59 GMT
- Title: To Pretrain or Not to Pretrain: Examining the Benefits of Pretraining on
Resource Rich Tasks
- Authors: Sinong Wang, Madian Khabsa, Hao Ma
- Abstract summary: We show that as the number of training examples grows into the millions, the accuracy gap between finetuning a BERT-based model and training a vanilla LSTM from scratch narrows to within 1%.
Our findings indicate that pre-trained models might reach a diminishing return point as the supervised data size increases significantly.
- Score: 25.05882459314221
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Pretraining NLP models with variants of Masked Language Model (MLM)
objectives has recently led to significant improvements on many tasks. This
paper examines the benefits of pretrained models as a function of the number of
training samples used in the downstream task. On several text classification
tasks, we show that as the number of training examples grows into the millions,
the accuracy gap between finetuning a BERT-based model and training a vanilla LSTM
from scratch narrows to within 1%. Our findings indicate that MLM-based models
might reach a diminishing return point as the supervised data size increases
significantly.
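As an illustration of the two regimes being compared, the sketch below fine-tunes a pretrained BERT classifier and trains a small bidirectional LSTM classifier from scratch on the same batch, using PyTorch and Hugging Face Transformers; the checkpoint, architecture sizes, and hyperparameters are illustrative assumptions, not the authors' exact experimental setup.

```python
# Sketch of the two regimes compared in the paper: fine-tuning a pretrained
# BERT classifier vs. training a vanilla LSTM classifier from scratch.
# Hyperparameters and the toy batch are placeholders, not the paper's setup.
import torch
import torch.nn as nn
from transformers import AutoModelForSequenceClassification, AutoTokenizer

NUM_LABELS = 2

# Regime 1: fine-tune a pretrained MLM-based encoder end to end.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=NUM_LABELS
)
bert_opt = torch.optim.AdamW(bert.parameters(), lr=2e-5)

# Regime 2: train a small bidirectional LSTM classifier from randomly
# initialized weights (no pretraining), reusing the same vocabulary.
class LSTMClassifier(nn.Module):
    def __init__(self, vocab_size, emb_dim=128, hidden=256, num_labels=NUM_LABELS):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, num_labels)

    def forward(self, input_ids):
        h, _ = self.lstm(self.emb(input_ids))
        return self.head(h.mean(dim=1))  # mean-pool over time steps

lstm = LSTMClassifier(vocab_size=tokenizer.vocab_size)
lstm_opt = torch.optim.Adam(lstm.parameters(), lr=1e-3)

# One toy training step for each regime on the same batch.
texts = ["great movie", "terrible plot"]
labels = torch.tensor([1, 0])
batch = tokenizer(texts, padding=True, return_tensors="pt")

bert_loss = bert(**batch, labels=labels).loss
bert_loss.backward()
bert_opt.step()
bert_opt.zero_grad()

lstm_loss = nn.functional.cross_entropy(lstm(batch["input_ids"]), labels)
lstm_loss.backward()
lstm_opt.step()
lstm_opt.zero_grad()
```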
Related papers
- Low-rank finetuning for LLMs: A fairness perspective [54.13240282850982]
Low-rank approximation techniques have become the de facto standard for fine-tuning Large Language Models.
This paper investigates the effectiveness of these methods in capturing the shift of fine-tuning datasets from the initial pre-trained data distribution.
We show that low-rank fine-tuning inadvertently preserves undesirable biases and toxic behaviors.
arXiv Detail & Related papers (2024-05-28T20:43:53Z)
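A minimal sketch of the low-rank fine-tuning setup this paper analyzes: a LoRA-style adapter adds a trainable rank-r update to a frozen weight matrix, so only the small factors A and B are tuned. The rank, scaling, and layer size below are illustrative assumptions.

```python
# LoRA-style low-rank update: the frozen weight W is augmented with a
# trainable rank-r correction B @ A, so only 2*r*d parameters are tuned.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():      # freeze the pretrained weights
            p.requires_grad = False
        d_out, d_in = base.weight.shape
        self.A = nn.Parameter(torch.randn(rank, d_in) * 0.01)
        self.B = nn.Parameter(torch.zeros(d_out, rank))  # zero init: no shift at start
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

# Usage: wrap an existing projection and fine-tune only A and B.
layer = LoRALinear(nn.Linear(768, 768), rank=8)
x = torch.randn(4, 768)
out = layer(x)
trainable = [p for p in layer.parameters() if p.requires_grad]  # just A and B
```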
- Self-Influence Guided Data Reweighting for Language Model Pre-training [46.57714637505164]
Language Models (LMs) pre-trained with self-supervision on large text corpora have become the default starting point for developing models for various NLP tasks.
All data samples in the corpus are treated with equal importance during LM pre-training.
Because data samples vary in relevance and quality, giving all of them equal importance may not be the optimal choice.
We propose PRESENCE, a method that jointly reweights samples during pre-training, using self-influence (SI) scores as an indicator of sample importance.
arXiv Detail & Related papers (2023-11-02T01:00:46Z)
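A rough sketch of self-influence guided reweighting in this spirit: each sample's self-influence is approximated by the squared norm of its own loss gradient (a TracIn-style proxy), and the batch loss is reweighted so that high-SI samples contribute less. The proxy, the weighting scheme, and the toy model are illustrative assumptions, not the paper's exact recipe.

```python
# Self-influence (SI) guided reweighting sketch: score each sample by the
# squared norm of its own loss gradient, then downweight high-SI samples.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss(reduction="none")

x = torch.randn(8, 32)
y = torch.randint(0, 10, (8,))

# 1) Per-sample self-influence proxy: squared gradient norm of each example.
si = []
for i in range(x.size(0)):
    model.zero_grad()
    li = loss_fn(model(x[i:i + 1]), y[i:i + 1]).squeeze()
    grads = torch.autograd.grad(li, list(model.parameters()))
    si.append(sum(g.pow(2).sum() for g in grads))
si = torch.stack(si)

# 2) Turn SI into sample weights: downweight the highest-SI samples.
weights = torch.softmax(-si / si.mean(), dim=0) * x.size(0)

# 3) Weighted training step on the full batch.
model.zero_grad()
per_sample = loss_fn(model(x), y)
(weights.detach() * per_sample).mean().backward()
opt.step()
```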
- How Many Pretraining Tasks Are Needed for In-Context Learning of Linear Regression? [92.90857135952231]
Transformers pretrained on diverse tasks exhibit remarkable in-context learning (ICL) capabilities.
We study ICL in one of its simplest setups: pretraining a linearly parameterized single-layer linear attention model for linear regression.
arXiv Detail & Related papers (2023-10-12T15:01:43Z)
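The simple setup described above can be written down directly: a single trainable matrix Gamma acts as a linearly parameterized one-layer linear-attention predictor for in-context linear regression, pretrained across many randomly drawn regression tasks. The dimensions, task distribution, and optimizer settings below are illustrative assumptions.

```python
# Pretrain a linearly parameterized one-layer linear-attention predictor for
# in-context linear regression. For each task a fresh weight vector w is
# drawn; the prompt is (x_1, y_1), ..., (x_n, y_n, x_query) and the model
# predicts y_query as x_query @ Gamma @ (X^T y / n).
import torch

d, n, tasks, lr = 5, 20, 20000, 5e-3
Gamma = torch.zeros(d, d, requires_grad=True)      # the only trainable parameter
opt = torch.optim.SGD([Gamma], lr=lr)

for _ in range(tasks):
    w = torch.randn(d)                             # task-specific regression weights
    X = torch.randn(n, d)                          # in-context examples
    y = X @ w
    x_q = torch.randn(d)
    y_q = x_q @ w
    pred = x_q @ Gamma @ (X.T @ y / n)             # linear-attention readout
    loss = (pred - y_q) ** 2
    opt.zero_grad()
    loss.backward()
    opt.step()

# Evaluate on a fresh task: the learned Gamma acts as a preconditioner on the
# in-context statistic X^T y / n, giving a least-squares-like prediction.
w = torch.randn(d); X = torch.randn(n, d); x_q = torch.randn(d)
with torch.no_grad():
    print(float(x_q @ Gamma @ (X.T @ (X @ w) / n)), float(x_q @ w))
```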
- Learning to Modulate pre-trained Models in RL [22.812215561012874]
Fine-tuning a pre-trained model often suffers from catastrophic forgetting.
Our study shows that with most fine-tuning approaches, the performance on pre-training tasks deteriorates significantly.
We propose a novel method, Learning-to-Modulate (L2M), that avoids the degradation of learned skills by modulating the information flow of the frozen pre-trained model.
arXiv Detail & Related papers (2023-06-26T17:53:05Z)
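A minimal sketch of the modulation idea: the pretrained backbone stays frozen and only small scale/shift parameters (plus a new task head) are trained, so the information flow of the frozen model is adjusted without overwriting its weights. The toy backbone and the FiLM-style gate design are illustrative assumptions, not the paper's exact L2M architecture.

```python
# Keep the pretrained backbone frozen; train only small modulation parameters
# that rescale and shift its activations, so learned skills are not erased.
import torch
import torch.nn as nn

class ModulatedBlock(nn.Module):
    def __init__(self, block: nn.Module, width: int):
        super().__init__()
        self.block = block
        for p in self.block.parameters():   # frozen pretrained weights
            p.requires_grad = False
        self.scale = nn.Parameter(torch.ones(width))   # trainable modulation
        self.shift = nn.Parameter(torch.zeros(width))

    def forward(self, x):
        return self.scale * self.block(x) + self.shift

# Pretend this MLP is the frozen pretrained feature extractor / policy trunk.
backbone = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 64), nn.ReLU())
model = nn.Sequential(
    ModulatedBlock(backbone, width=64),
    nn.Linear(64, 4),                       # new task head, also trainable
)

trainable = [p for p in model.parameters() if p.requires_grad]
opt = torch.optim.Adam(trainable, lr=1e-3)

x = torch.randn(8, 16)
target = torch.randn(8, 4)
loss = nn.functional.mse_loss(model(x), target)
opt.zero_grad()
loss.backward()
opt.step()
```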
- The effectiveness of MAE pre-pretraining for billion-scale pretraining [65.98338857597935]
We introduce an additional pre-pretraining stage that is simple and uses the self-supervised MAE technique to initialize the model.
We measure the effectiveness of pre-pretraining on 10 different visual recognition tasks spanning image classification, video recognition, object detection, low-shot classification and zero-shot recognition.
arXiv Detail & Related papers (2023-03-23T17:56:12Z)
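A heavily simplified sketch of the MAE objective used for the pre-pretraining stage: most image patches are masked, only the visible patches are encoded, and the model is trained to reconstruct the pixels of the masked patches. A real MAE uses a ViT encoder and decoder; the tiny MLPs, patch size, and mask ratio below are illustrative assumptions.

```python
# Simplified masked-autoencoder step: mask 75% of patches, encode the rest,
# and regress the decoder's predictions onto the masked patch pixels.
import torch
import torch.nn as nn

patch, mask_ratio, dim = 16, 0.75, 128
img = torch.randn(2, 3, 224, 224)                          # toy image batch

# Patchify into (batch, num_patches, pixels_per_patch).
patches = img.unfold(2, patch, patch).unfold(3, patch, patch)
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(2, -1, 3 * patch * patch)
B, N, P = patches.shape

# Randomly keep 25% of patches per image; the rest must be reconstructed.
keep = int(N * (1 - mask_ratio))
order = torch.rand(B, N).argsort(dim=1)
vis_idx, msk_idx = order[:, :keep], order[:, keep:]
visible = torch.gather(patches, 1, vis_idx.unsqueeze(-1).expand(-1, -1, P))
masked = torch.gather(patches, 1, msk_idx.unsqueeze(-1).expand(-1, -1, P))

encoder = nn.Sequential(nn.Linear(P, dim), nn.ReLU(), nn.Linear(dim, dim))
decoder = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, P))
pos = nn.Embedding(N, dim)                                  # position codes for queries
params = list(encoder.parameters()) + list(decoder.parameters()) + list(pos.parameters())
opt = torch.optim.AdamW(params, lr=1e-4)

# Encode only the visible patches, query the decoder once per masked position,
# and reconstruct the missing pixels.
context = encoder(visible).mean(dim=1, keepdim=True)        # (B, 1, dim) pooled code
pred = decoder(context + pos(msk_idx))                      # (B, N - keep, P)
loss = nn.functional.mse_loss(pred, masked)
opt.zero_grad()
loss.backward()
opt.step()
```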
- SPDF: Sparse Pre-training and Dense Fine-tuning for Large Language Models [4.114555639014612]
We show the benefits of using unstructured weight sparsity to train only a subset of weights during pre-training.
We demonstrate that we can induce up to 75% sparsity into a 1.3B parameter GPT-3 XL model resulting in a 2.5x reduction in pre-training FLOPs.
arXiv Detail & Related papers (2023-03-18T17:56:01Z)
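A minimal sketch of sparse pre-training followed by dense fine-tuning: a fixed unstructured binary mask zeroes out most weights during pre-training so only the surviving subset is trained, and the mask is simply dropped for the dense fine-tuning phase. The tiny model, random mask, and toy objective are illustrative assumptions, not the paper's GPT-scale setup.

```python
# Sparse pre-training with a fixed unstructured mask, then dense fine-tuning.
import torch
import torch.nn as nn

sparsity = 0.75
model = nn.Sequential(nn.Linear(128, 512), nn.ReLU(), nn.Linear(512, 128))
opt = torch.optim.AdamW(model.parameters(), lr=1e-3)

# One fixed unstructured binary mask per weight matrix (biases stay dense).
masks = {name: (torch.rand_like(p) > sparsity).float()
         for name, p in model.named_parameters() if p.dim() == 2}

def apply_masks():
    # Keep pruned weights at exactly zero so only ~25% of them are trained.
    with torch.no_grad():
        for name, p in model.named_parameters():
            if name in masks:
                p.mul_(masks[name])

x = torch.randn(32, 128)

# --- Sparse pre-training step ---
apply_masks()
loss = nn.functional.mse_loss(model(x), x)      # toy reconstruction objective
opt.zero_grad()
loss.backward()
opt.step()
apply_masks()                                   # re-zero pruned weights after the update

# --- Dense fine-tuning step: the masks are simply no longer applied ---
ft_loss = nn.functional.mse_loss(model(x), x)
opt.zero_grad()
ft_loss.backward()
opt.step()
```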
- Tuning Language Models as Training Data Generators for Augmentation-Enhanced Few-Shot Learning [30.65315081964461]
We study few-shot learning with pretrained language models (PLMs) from a different perspective.
We first tune an autoregressive PLM on the few-shot samples and then use it as a generator to synthesize a large amount of novel training samples.
Our approach, FewGen, achieves better overall results than existing few-shot learning methods across seven classification tasks of the GLUE benchmark.
arXiv Detail & Related papers (2022-11-06T06:46:47Z)
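A rough sketch of the FewGen-style recipe: tune an autoregressive PLM on the few labeled samples formatted as label-conditioned text, then sample from it to synthesize additional training examples for the downstream classifier. The GPT-2 checkpoint, prompt format, and decoding settings are illustrative assumptions, not the paper's exact training objective.

```python
# Tune a generator on few-shot samples, then synthesize new labeled examples.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")
tok.pad_token = tok.eos_token
gen = GPT2LMHeadModel.from_pretrained("gpt2")
opt = torch.optim.AdamW(gen.parameters(), lr=5e-5)

# 1) Tune the generator on the few-shot samples, conditioned on the label.
few_shot = [("positive", "an absolute delight from start to finish"),
            ("negative", "a tedious, joyless slog")]
for label, text in few_shot:
    batch = tok(f"Review ({label}): {text}{tok.eos_token}", return_tensors="pt")
    loss = gen(**batch, labels=batch["input_ids"]).loss
    loss.backward()
    opt.step()
    opt.zero_grad()

# 2) Use the tuned generator to synthesize novel labeled examples.
gen.eval()
prompt = tok("Review (positive):", return_tensors="pt")
with torch.no_grad():
    out = gen.generate(**prompt, do_sample=True, top_p=0.9,
                       max_new_tokens=30, pad_token_id=tok.eos_token_id)
print(tok.decode(out[0], skip_special_tokens=True))
# 3) The synthetic (label, text) pairs would then be added to the training set
#    of a downstream classifier (omitted here).
```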
- METRO: Efficient Denoising Pretraining of Large Scale Autoencoding Language Models with Model Generated Signals [151.3601429216877]
We present an efficient method of pretraining large-scale autoencoding language models using training signals generated by an auxiliary model.
We propose a recipe, namely "Model generated dEnoising TRaining Objective" (METRO).
The resultant models, METRO-LM, consisting of up to 5.4 billion parameters, achieve new state-of-the-art on the GLUE, SuperGLUE, and SQuAD benchmarks.
arXiv Detail & Related papers (2022-04-13T21:39:15Z)
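A toy sketch of the model-generated-signal idea: a small auxiliary model proposes replacements at masked positions, and the main model is trained to detect which tokens were replaced (an ELECTRA-style denoising signal). The vocabulary, tiny architectures, and fixed auxiliary model are illustrative assumptions; METRO scales a recipe of this kind to billions of parameters and trains the auxiliary model jointly.

```python
# Denoising pretraining driven by an auxiliary model's generated signals.
import torch
import torch.nn as nn

vocab, seq_len, batch = 1000, 16, 8
tokens = torch.randint(5, vocab, (batch, seq_len))        # pretend corpus batch

# Auxiliary generator: a tiny model that proposes tokens at masked positions
# (trained jointly with MLM in the full recipe; kept fixed here).
aux = nn.Sequential(nn.Embedding(vocab, 64), nn.Linear(64, vocab))

# Main model: a larger encoder with a per-token "was this token replaced?" head.
main_emb = nn.Embedding(vocab, 256)
main_enc = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=256, nhead=4, batch_first=True), num_layers=2)
detect_head = nn.Linear(256, 1)
opt = torch.optim.AdamW(
    list(main_emb.parameters()) + list(main_enc.parameters())
    + list(detect_head.parameters()), lr=1e-4)

# 1) Mask 15% of positions and let the auxiliary model sample replacements.
mask = torch.rand(batch, seq_len) < 0.15
with torch.no_grad():
    logits = aux(tokens)                                   # (batch, seq_len, vocab)
    sampled = torch.distributions.Categorical(logits=logits).sample()
corrupted = torch.where(mask, sampled, tokens)
is_replaced = (corrupted != tokens).float()                # model-generated signal

# 2) Train the main model to spot the replaced tokens in the corrupted input.
hidden = main_enc(main_emb(corrupted))
pred = detect_head(hidden).squeeze(-1)
loss = nn.functional.binary_cross_entropy_with_logits(pred, is_replaced)
opt.zero_grad()
loss.backward()
opt.step()
```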
- SUPERB-SG: Enhanced Speech processing Universal PERformance Benchmark for Semantic and Generative Capabilities [76.97949110580703]
We introduce SUPERB-SG, a new benchmark to evaluate pre-trained models across various speech tasks.
We use a lightweight methodology to test the robustness of representations learned by pre-trained models under shifts in data domain.
We also show that the task diversity of SUPERB-SG coupled with limited task supervision is an effective recipe for evaluating the generalizability of model representation.
arXiv Detail & Related papers (2022-03-14T04:26:40Z)
- Masked Language Modeling and the Distributional Hypothesis: Order Word Matters Pre-training for Little [74.49773960145681]
A possible explanation for the impressive performance of masked language model (MLM) pre-training is that such models have learned to represent the syntactic structures prevalent in NLP pipelines.
In this paper, we propose a different explanation: MLMs succeed on downstream tasks almost entirely due to their ability to model higher-order word co-occurrence statistics.
Our results show that purely distributional information largely explains the success of pre-training, and underscore the importance of curating challenging evaluation datasets that require deeper linguistic knowledge.
arXiv Detail & Related papers (2021-04-14T06:30:36Z)
- Reinforced Curriculum Learning on Pre-trained Neural Machine Translation Models [20.976165305749777]
We learn a curriculum for improving a pre-trained NMT model by re-selecting influential data samples from the original training set.
We propose a data selection framework based on Deterministic Actor-Critic, in which a critic network predicts the expected change of model performance.
arXiv Detail & Related papers (2020-04-13T03:40:44Z)
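A rough sketch of critic-guided data re-selection in this spirit: a critic network scores candidate samples from the original training set by the validation improvement it expects them to yield, the top-scoring samples are used to update the pre-trained task model, and the observed improvement supervises the critic. The toy regression model standing in for an NMT system, the features, and the critic update rule are illustrative assumptions rather than the paper's Deterministic Actor-Critic formulation.

```python
# Critic-guided re-selection of influential samples from the training pool.
import torch
import torch.nn as nn

task_model = nn.Linear(10, 1)                      # stand-in for a pre-trained NMT model
critic = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))
task_opt = torch.optim.SGD(task_model.parameters(), lr=1e-2)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

pool_x, pool_y = torch.randn(256, 10), torch.randn(256, 1)   # original training set
val_x, val_y = torch.randn(64, 10), torch.randn(64, 1)       # held-out set

def val_loss():
    with torch.no_grad():
        return nn.functional.mse_loss(task_model(val_x), val_y).item()

for _ in range(20):                                 # curriculum rounds
    # Critic scores every candidate sample; pick the top-k for this round.
    scores = critic(pool_x).squeeze(-1)
    top = scores.topk(32).indices
    before = val_loss()

    # Update the task model on the selected samples only.
    loss = nn.functional.mse_loss(task_model(pool_x[top]), pool_y[top])
    task_opt.zero_grad()
    loss.backward()
    task_opt.step()

    # Reward = observed validation improvement; regress the critic toward it.
    reward = before - val_loss()
    critic_loss = nn.functional.mse_loss(scores[top], torch.full((32,), reward))
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()
```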