Data-Efficient Pretraining via Contrastive Self-Supervision
- URL: http://arxiv.org/abs/2010.01061v4
- Date: Thu, 15 Apr 2021 15:16:34 GMT
- Title: Data-Efficient Pretraining via Contrastive Self-Supervision
- Authors: Nils Rethmeier and Isabelle Augenstein
- Abstract summary: In this work, we evaluate against three core challenges for resource-efficient learning.
We propose a data- and compute-efficient self-supervised, contrastive text encoder, pretrained on 60MB of `task-internal' text data.
We find that our method outperforms RoBERTa, while pretraining and fine-tuning in 1/5th of RoBERTa's fine-tuning time.
- Score: 48.255310614527694
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: For natural language processing `text-to-text' tasks, the prevailing
approaches heavily rely on pretraining large self-supervised models on
increasingly larger `task-external' data. Transfer learning from high-resource
pretraining works well, but research has focused on settings with very large
data and compute requirements, while the potential of efficient low-resource
learning, without large `task-external' pretraining, remains under-explored. In
this work, we evaluate against three core challenges for resource-efficient
learning. Namely, we analyze: (1) pretraining data ($X$) efficiency; (2) zero
to few-shot label ($Y$) efficiency; and (3) long-tail generalization, since
long-tail preservation has been linked to algorithmic fairness and because data
in the tail is limited by definition. To address these challenges, we propose a
data- and compute-efficient self-supervised, contrastive text encoder,
pretrained on 60MB of `task-internal' text data, and compare it to RoBERTa,
which was pretrained on 160GB of `task-external' text. We find our method
outperforms RoBERTa, while pretraining and fine-tuning in 1/5th of RoBERTa's
fine-tuning time.
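The abstract's central idea is a small, contrastive, self-supervised text encoder pretrained only on `task-internal' text rather than on a large external corpus. Below is a minimal sketch of that general recipe, assuming an InfoNCE-style objective that contrasts each document with pseudo-labels sampled from the task data itself; the bag-of-embeddings encoder, tokenization, and pseudo-label choice are illustrative placeholders, not the authors' exact architecture.

```python
# Minimal sketch: contrastive pretraining of a small text encoder on
# task-internal data. Everything below (encoder size, pseudo-label scheme,
# hyperparameters) is an assumption for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextEncoder(nn.Module):
    """Small bag-of-embeddings encoder; a CNN or Transformer would also fit here."""
    def __init__(self, vocab_size: int, dim: int = 256):
        super().__init__()
        self.emb = nn.EmbeddingBag(vocab_size, dim, mode="mean")
        self.proj = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        return F.normalize(self.proj(self.emb(token_ids)), dim=-1)

def info_nce(text_vecs: torch.Tensor, label_vecs: torch.Tensor, tau: float = 0.07):
    """Contrast each text with its own (pseudo-)label against in-batch negatives."""
    logits = text_vecs @ label_vecs.t() / tau      # (B, B) similarity matrix
    targets = torch.arange(text_vecs.size(0))      # positives lie on the diagonal
    return F.cross_entropy(logits, targets)

vocab, batch, seq_len = 10_000, 32, 64
encoder = TextEncoder(vocab)
optimizer = torch.optim.AdamW(encoder.parameters(), lr=3e-4)

texts = torch.randint(0, vocab, (batch, seq_len))    # stand-in task-internal documents
pseudo_labels = torch.randint(0, vocab, (batch, 8))  # e.g. sampled keywords per document

optimizer.zero_grad()
loss = info_nce(encoder(texts), encoder(pseudo_labels))
loss.backward()
optimizer.step()
```

Since documents and (pseudo-)labels share one encoder, unseen label descriptions can be embedded at inference time, which is one way such a setup can support the zero- to few-shot label efficiency discussed above.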
Related papers
- Bucket Pre-training is All You Need [9.332544709626875]
Large language models (LLMs) have demonstrated exceptional performance across various natural language processing tasks.
The conventional fixed-length data composition strategy for pretraining, which involves concatenating and splitting documents, can introduce noise and limit the model's ability to capture long-range dependencies.
We propose a multi-bucket data composition method that moves beyond the fixed-length paradigm, offering a more flexible and efficient approach to pretraining.
arXiv Detail & Related papers (2024-07-10T09:27:23Z)
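As a rough illustration of the multi-bucket composition described in the Bucket Pre-training entry above, the sketch below routes each tokenized document to the smallest length bucket that holds it whole, instead of concatenating and splitting everything to one fixed length. The bucket sizes, padding, and truncation policy are assumptions for the sketch, not details taken from that paper.

```python
# Hypothetical multi-bucket data composition: documents stay intact within a
# bucket rather than being concatenated and sliced at arbitrary boundaries.
from collections import defaultdict

BUCKET_SIZES = [512, 1024, 2048, 4096]   # assumed context-length buckets

def compose_buckets(docs: list[list[int]], pad_id: int = 0):
    """Group tokenized documents into length buckets and pad within each bucket."""
    buckets = defaultdict(list)
    for doc in docs:
        size = next((s for s in BUCKET_SIZES if len(doc) <= s), None)
        if size is None:                  # longer than the largest bucket:
            size = BUCKET_SIZES[-1]       # truncate instead of splitting across samples
            doc = doc[:size]
        buckets[size].append(doc + [pad_id] * (size - len(doc)))
    return buckets                        # {bucket_size: list of padded sequences}

# Three documents of very different lengths land in separate buckets instead of
# being merged into unrelated fixed-length chunks.
docs = [list(range(300)), list(range(900)), list(range(3000))]
for size, seqs in sorted(compose_buckets(docs).items()):
    print(size, [len(s) for s in seqs])
```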
- Efficient Grammatical Error Correction Via Multi-Task Training and Optimized Training Schedule [55.08778142798106]
We propose auxiliary tasks that exploit the alignment between the original and corrected sentences.
We formulate each task as a sequence-to-sequence problem and perform multi-task training.
We find that the order of datasets used for training and even individual instances within a dataset may have important effects on the final performance.
arXiv Detail & Related papers (2023-11-20T14:50:12Z)
- Efficient Data Learning for Open Information Extraction with Pre-trained Language Models [15.554865537872919]
Open Information Extraction (OpenIE) is a fundamental yet challenging task in Natural Language Processing.
In this paper, we introduce a novel framework, OK-IE, that ingeniously transforms the task form of OpenIE into the pre-training task form of the T5 model.
Furthermore, we introduce an innovative concept, the Anchor, to control the sequence of model outputs, effectively eliminating the impact of order penalty on model convergence.
arXiv Detail & Related papers (2023-10-23T15:19:24Z)
- Effective Long-Context Scaling of Foundation Models [90.57254298730923]
We present a series of long-context LLMs that support effective context windows of up to 32,768 tokens.
Our models achieve consistent improvements on most regular tasks and significant improvements on long-context tasks over Llama 2.
arXiv Detail & Related papers (2023-09-27T21:41:49Z)
- Curriculum-Based Self-Training Makes Better Few-Shot Learners for Data-to-Text Generation [56.98033565736974]
We propose Curriculum-Based Self-Training (CBST) to leverage unlabeled data in a rearranged order determined by the difficulty of text generation.
Our method can outperform fine-tuning and task-adaptive pre-training methods, and achieve state-of-the-art performance in the few-shot setting of data-to-text generation.
arXiv Detail & Related papers (2022-06-06T16:11:58Z)
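The CBST entry above reorders unlabeled data by how hard it is to generate text for, then self-trains on it. The following is a generic, hypothetical curriculum self-training loop in that spirit; the model interface (`fit`, `generate`), the difficulty scorer, and the number of stages are placeholders rather than the paper's actual procedure.

```python
# Generic curriculum self-training sketch: pseudo-label the easiest unlabeled
# inputs first and grow the training pool over several stages. The model
# interface and difficulty function are assumed placeholders.
from typing import Callable, List, Tuple

def curriculum_self_train(
    model,                                  # assumed seq2seq model with .fit / .generate
    labeled: List[Tuple[str, str]],         # gold (input, target) pairs
    unlabeled: List[str],
    difficulty: Callable[[object, str], float],  # e.g. generation perplexity
    num_stages: int = 3,
):
    model.fit(labeled)                                    # warm start on gold data
    ranked = sorted(unlabeled, key=lambda x: difficulty(model, x))
    per_stage = max(1, len(ranked) // num_stages)
    train_pool = list(labeled)
    for stage in range(num_stages):
        batch = ranked[stage * per_stage:(stage + 1) * per_stage]
        pseudo = [(x, model.generate(x)) for x in batch]  # easiest examples first
        train_pool.extend(pseudo)
        model.fit(train_pool)                             # retrain on gold + pseudo data
    return model
```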
- Knowledge Distillation as Efficient Pre-training: Faster Convergence, Higher Data-efficiency, and Better Transferability [53.27240222619834]
Knowledge Distillation as Efficient Pre-training aims to efficiently transfer the learned feature representation from pre-trained models to new student models for future downstream tasks.
Our method performs comparably with supervised pre-training counterparts on 3 downstream tasks and 9 downstream datasets, while requiring 10x less data and 5x less pre-training time.
arXiv Detail & Related papers (2022-03-10T06:23:41Z)
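The knowledge-distillation entry above swaps supervised pre-training for matching a frozen teacher's representations. Below is a minimal, architecture-agnostic sketch of feature-level distillation, assuming simple MLP encoders and an MSE feature-matching loss; the backbones, projection head, and loss in the actual method may differ.

```python
# Sketch of knowledge distillation as pre-training: a small student learns to
# reproduce a frozen teacher's features on unlabeled inputs, then gets
# fine-tuned downstream. Architectures and sizes are illustrative only.
import torch
import torch.nn as nn
import torch.nn.functional as F

teacher = nn.Sequential(nn.Linear(128, 1024), nn.ReLU(), nn.Linear(1024, 1024)).eval()
student = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 256))
project = nn.Linear(256, 1024)            # align student width to teacher width
for p in teacher.parameters():
    p.requires_grad_(False)               # the teacher stays frozen

optimizer = torch.optim.AdamW(
    list(student.parameters()) + list(project.parameters()), lr=1e-3
)

x = torch.randn(32, 128)                  # stand-in batch of unlabeled inputs
with torch.no_grad():
    t = teacher(x)                        # teacher representation, no gradients
s = project(student(x))                   # projected student representation

optimizer.zero_grad()
loss = F.mse_loss(s, t)                   # distill features instead of class labels
loss.backward()
optimizer.step()
```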
- Self-training Improves Pre-training for Natural Language Understanding [63.78927366363178]
We study self-training as another way to leverage unlabeled data through semi-supervised learning.
We introduce SentAugment, a data augmentation method which computes task-specific query embeddings from labeled data.
Our approach leads to scalable and effective self-training with improvements of up to 2.6% on standard text classification benchmarks.
arXiv Detail & Related papers (2020-10-05T17:52:25Z)
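The self-training entry above builds task-specific query embeddings from the labeled data; these are presumably used to pull task-relevant sentences out of a large unlabeled bank before pseudo-labeling. Here is a rough sketch of that retrieval step, assuming pre-computed sentence embeddings and cosine similarity against a mean task query vector; the sentence encoder, corpus, and selection thresholds of the original work are not reproduced.

```python
# Sketch of embedding-based retrieval for self-training: rank a large unlabeled
# bank by similarity to a task-level query embedding, then pseudo-label the top
# hits with a teacher model. Embeddings here are random stand-ins.
import numpy as np

def l2_normalize(v: np.ndarray) -> np.ndarray:
    return v / np.clip(np.linalg.norm(v, axis=-1, keepdims=True), 1e-12, None)

def retrieve_for_task(labeled_embs: np.ndarray, bank_embs: np.ndarray, top_k: int = 1000):
    """Rank an unlabeled sentence bank by similarity to the mean labeled embedding."""
    query = l2_normalize(labeled_embs.mean(axis=0, keepdims=True))  # task query vector
    scores = (l2_normalize(bank_embs) @ query.T).ravel()            # cosine similarity
    return np.argsort(-scores)[:top_k]                              # indices to pseudo-label

rng = np.random.default_rng(0)
labeled = rng.normal(size=(100, 512))     # embeddings of labeled task sentences
bank = rng.normal(size=(100_000, 512))    # embeddings of a large unlabeled corpus
candidates = retrieve_for_task(labeled, bank)
# A teacher classifier would now pseudo-label bank[candidates] and a student
# would be trained on gold plus pseudo-labeled data.
print(candidates[:5])
```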