Data-Efficient Pretraining via Contrastive Self-Supervision
- URL: http://arxiv.org/abs/2010.01061v4
- Date: Thu, 15 Apr 2021 15:16:34 GMT
- Title: Data-Efficient Pretraining via Contrastive Self-Supervision
- Authors: Nils Rethmeier and Isabelle Augenstein
- Abstract summary: In this work, we evaluate against three core challenges for resource-efficient learning.
We propose a data- and compute-efficient self-supervised, contrastive text encoder, pretrained on 60MB of `task-internal' text data.
We find that our method outperforms RoBERTa, while pretraining and fine-tuning in 1/5th of RoBERTa's fine-tuning time.
- Score: 48.255310614527694
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: For natural language processing `text-to-text' tasks, the prevailing
approaches heavily rely on pretraining large self-supervised models on
increasingly larger `task-external' data. Transfer learning from high-resource
pretraining works well, but research has focused on settings with very large
data and compute requirements, while the potential of efficient low-resource
learning, without large `task-external' pretraining, remains under-explored. In
this work, we evaluate against three core challenges for resource-efficient
learning. Namely, we analyze: (1) pretraining data ($X$) efficiency; (2) zero
to few-shot label ($Y$) efficiency; and (3) long-tail generalization, since
long-tail preservation has been linked to algorithmic fairness and because data
in the tail is limited by definition. To address these challenges, we propose a
data- and compute-efficient self-supervised, contrastive text encoder,
pretrained on 60MB of `task-internal' text data, and compare it to RoBERTa,
which was pretrained on 160GB of `task-external' text. We find our method
outperforms RoBERTa, while pretraining and fine-tuning in 1/5th of RoBERTa's
fine-tuning time.
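The abstract's central idea is a small, contrastive, self-supervised text encoder pretrained only on `task-internal' text rather than on a large external corpus. Below is a minimal sketch of that general recipe, assuming an InfoNCE-style objective that contrasts each document with pseudo-labels sampled from the task data itself; the bag-of-embeddings encoder, tokenization, and pseudo-label choice are illustrative placeholders, not the authors' exact architecture.

```python
# Minimal sketch: contrastive pretraining of a small text encoder on
# task-internal data. Everything below (encoder size, pseudo-label scheme,
# hyperparameters) is an assumption for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextEncoder(nn.Module):
    """Small bag-of-embeddings encoder; a CNN or Transformer would also fit here."""
    def __init__(self, vocab_size: int, dim: int = 256):
        super().__init__()
        self.emb = nn.EmbeddingBag(vocab_size, dim, mode="mean")
        self.proj = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        return F.normalize(self.proj(self.emb(token_ids)), dim=-1)

def info_nce(text_vecs: torch.Tensor, label_vecs: torch.Tensor, tau: float = 0.07):
    """Contrast each text with its own (pseudo-)label against in-batch negatives."""
    logits = text_vecs @ label_vecs.t() / tau      # (B, B) similarity matrix
    targets = torch.arange(text_vecs.size(0))      # positives lie on the diagonal
    return F.cross_entropy(logits, targets)

vocab, batch, seq_len = 10_000, 32, 64
encoder = TextEncoder(vocab)
optimizer = torch.optim.AdamW(encoder.parameters(), lr=3e-4)

texts = torch.randint(0, vocab, (batch, seq_len))    # stand-in task-internal documents
pseudo_labels = torch.randint(0, vocab, (batch, 8))  # e.g. sampled keywords per document

optimizer.zero_grad()
loss = info_nce(encoder(texts), encoder(pseudo_labels))
loss.backward()
optimizer.step()
```

Since documents and (pseudo-)labels share one encoder, unseen label descriptions can be embedded at inference time, which is one way such a setup can support the zero- to few-shot label efficiency discussed above.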
Related papers
- Bucket Pre-training is All You Need [9.332544709626875]
Large language models (LLMs) have demonstrated exceptional performance across various natural language processing tasks.
The conventional fixed-length data composition strategy for pretraining, which involves concatenating and splitting documents, can introduce noise and limit the model's ability to capture long-range dependencies.
We propose a multi-bucket data composition method that moves beyond the fixed-length paradigm, offering a more flexible and efficient approach to pretraining.
arXiv Detail & Related papers (2024-07-10T09:27:23Z)
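As a rough illustration of the multi-bucket composition described in the Bucket Pre-training entry above, the sketch below routes each tokenized document to the smallest length bucket that holds it whole, instead of concatenating and splitting everything to one fixed length. The bucket sizes, padding, and truncation policy are assumptions for the sketch, not details taken from that paper.

```python
# Hypothetical multi-bucket data composition: documents stay intact within a
# bucket rather than being concatenated and sliced at arbitrary boundaries.
from collections import defaultdict

BUCKET_SIZES = [512, 1024, 2048, 4096]   # assumed context-length buckets

def compose_buckets(docs: list[list[int]], pad_id: int = 0):
    """Group tokenized documents into length buckets and pad within each bucket."""
    buckets = defaultdict(list)
    for doc in docs:
        size = next((s for s in BUCKET_SIZES if len(doc) <= s), None)
        if size is None:                  # longer than the largest bucket:
            size = BUCKET_SIZES[-1]       # truncate instead of splitting across samples
            doc = doc[:size]
        buckets[size].append(doc + [pad_id] * (size - len(doc)))
    return buckets                        # {bucket_size: list of padded sequences}

# Three documents of very different lengths land in separate buckets instead of
# being merged into unrelated fixed-length chunks.
docs = [list(range(300)), list(range(900)), list(range(3000))]
for size, seqs in sorted(compose_buckets(docs).items()):
    print(size, [len(s) for s in seqs])
```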
- Efficient Grammatical Error Correction Via Multi-Task Training and Optimized Training Schedule [55.08778142798106]
We propose auxiliary tasks that exploit the alignment between the original and corrected sentences.
We formulate each task as a sequence-to-sequence problem and perform multi-task training.
We find that the order of datasets used for training and even individual instances within a dataset may have important effects on the final performance.
arXiv Detail & Related papers (2023-11-20T14:50:12Z)
- Efficient Data Learning for Open Information Extraction with Pre-trained Language Models [15.554865537872919]
Open Information Extraction (OpenIE) is a fundamental yet challenging task in Natural Language Processing.
In this paper, we introduce a novel framework, OK-IE, that ingeniously transforms the task form of OpenIE into the pre-training task form of the T5 model.
Furthermore, we introduce an innovative concept, the Anchor, to control the sequence of model outputs, effectively eliminating the impact of order penalty on model convergence.
arXiv Detail & Related papers (2023-10-23T15:19:24Z)
- Effective Long-Context Scaling of Foundation Models [90.57254298730923]
We present a series of long-context LLMs that support effective context windows of up to 32,768 tokens.
Our models achieve consistent improvements on most regular tasks and significant improvements on long-context tasks over Llama 2.
arXiv Detail & Related papers (2023-09-27T21:41:49Z)
- Curriculum-Based Self-Training Makes Better Few-Shot Learners for Data-to-Text Generation [56.98033565736974]
We propose Curriculum-Based Self-Training (CBST) to leverage unlabeled data in a rearranged order determined by the difficulty of text generation.
Our method can outperform fine-tuning and task-adaptive pre-training methods, and achieve state-of-the-art performance in the few-shot setting of data-to-text generation.
arXiv Detail & Related papers (2022-06-06T16:11:58Z)
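The CBST entry above reorders unlabeled data by how hard it is to generate text for, then self-trains on it. The following is a generic, hypothetical curriculum self-training loop in that spirit; the model interface (`fit`, `generate`), the difficulty scorer, and the number of stages are placeholders rather than the paper's actual procedure.

```python
# Generic curriculum self-training sketch: pseudo-label the easiest unlabeled
# inputs first and grow the training pool over several stages. The model
# interface and difficulty function are assumed placeholders.
from typing import Callable, List, Tuple

def curriculum_self_train(
    model,                                  # assumed seq2seq model with .fit / .generate
    labeled: List[Tuple[str, str]],         # gold (input, target) pairs
    unlabeled: List[str],
    difficulty: Callable[[object, str], float],  # e.g. generation perplexity
    num_stages: int = 3,
):
    model.fit(labeled)                                    # warm start on gold data
    ranked = sorted(unlabeled, key=lambda x: difficulty(model, x))
    per_stage = max(1, len(ranked) // num_stages)
    train_pool = list(labeled)
    for stage in range(num_stages):
        batch = ranked[stage * per_stage:(stage + 1) * per_stage]
        pseudo = [(x, model.generate(x)) for x in batch]  # easiest examples first
        train_pool.extend(pseudo)
        model.fit(train_pool)                             # retrain on gold + pseudo data
    return model
```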
- Knowledge Distillation as Efficient Pre-training: Faster Convergence, Higher Data-efficiency, and Better Transferability [53.27240222619834]
Knowledge Distillation as Efficient Pre-training aims to efficiently transfer the learned feature representation from pre-trained models to new student models for future downstream tasks.
Our method performs comparably with supervised pre-training counterparts on 3 downstream tasks and 9 downstream datasets, while requiring 10x less data and 5x less pre-training time.
arXiv Detail & Related papers (2022-03-10T06:23:41Z)
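The knowledge-distillation entry above swaps supervised pre-training for matching a frozen teacher's representations. Below is a minimal, architecture-agnostic sketch of feature-level distillation, assuming simple MLP encoders and an MSE feature-matching loss; the backbones, projection head, and loss in the actual method may differ.

```python
# Sketch of knowledge distillation as pre-training: a small student learns to
# reproduce a frozen teacher's features on unlabeled inputs, then gets
# fine-tuned downstream. Architectures and sizes are illustrative only.
import torch
import torch.nn as nn
import torch.nn.functional as F

teacher = nn.Sequential(nn.Linear(128, 1024), nn.ReLU(), nn.Linear(1024, 1024)).eval()
student = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 256))
project = nn.Linear(256, 1024)            # align student width to teacher width
for p in teacher.parameters():
    p.requires_grad_(False)               # the teacher stays frozen

optimizer = torch.optim.AdamW(
    list(student.parameters()) + list(project.parameters()), lr=1e-3
)

x = torch.randn(32, 128)                  # stand-in batch of unlabeled inputs
with torch.no_grad():
    t = teacher(x)                        # teacher representation, no gradients
s = project(student(x))                   # projected student representation

optimizer.zero_grad()
loss = F.mse_loss(s, t)                   # distill features instead of class labels
loss.backward()
optimizer.step()
```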
- Self-training Improves Pre-training for Natural Language Understanding [63.78927366363178]
We study self-training as another way to leverage unlabeled data through semi-supervised learning.
We introduce SentAugment, a data augmentation method which computes task-specific query embeddings from labeled data.
Our approach leads to scalable and effective self-training with improvements of up to 2.6% on standard text classification benchmarks.
arXiv Detail & Related papers (2020-10-05T17:52:25Z)
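The self-training entry above builds task-specific query embeddings from the labeled data; these are presumably used to pull task-relevant sentences out of a large unlabeled bank before pseudo-labeling. Here is a rough sketch of that retrieval step, assuming pre-computed sentence embeddings and cosine similarity against a mean task query vector; the sentence encoder, corpus, and selection thresholds of the original work are not reproduced.

```python
# Sketch of embedding-based retrieval for self-training: rank a large unlabeled
# bank by similarity to a task-level query embedding, then pseudo-label the top
# hits with a teacher model. Embeddings here are random stand-ins.
import numpy as np

def l2_normalize(v: np.ndarray) -> np.ndarray:
    return v / np.clip(np.linalg.norm(v, axis=-1, keepdims=True), 1e-12, None)

def retrieve_for_task(labeled_embs: np.ndarray, bank_embs: np.ndarray, top_k: int = 1000):
    """Rank an unlabeled sentence bank by similarity to the mean labeled embedding."""
    query = l2_normalize(labeled_embs.mean(axis=0, keepdims=True))  # task query vector
    scores = (l2_normalize(bank_embs) @ query.T).ravel()            # cosine similarity
    return np.argsort(-scores)[:top_k]                              # indices to pseudo-label

rng = np.random.default_rng(0)
labeled = rng.normal(size=(100, 512))     # embeddings of labeled task sentences
bank = rng.normal(size=(100_000, 512))    # embeddings of a large unlabeled corpus
candidates = retrieve_for_task(labeled, bank)
# A teacher classifier would now pseudo-label bank[candidates] and a student
# would be trained on gold plus pseudo-labeled data.
print(candidates[:5])
```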