When Do You Need Billions of Words of Pretraining Data?
- URL: http://arxiv.org/abs/2011.04946v1
- Date: Tue, 10 Nov 2020 07:16:18 GMT
- Title: When Do You Need Billions of Words of Pretraining Data?
- Authors: Yian Zhang, Alex Warstadt, Haau-Sing Li, and Samuel R. Bowman
- Abstract summary: The paper investigates what knowledge or skills Transformer LMs learn from large-scale pretraining that they cannot learn from less data.
We find that LMs require only about 10M or 100M words to learn representations that reliably encode most syntactic and semantic features.
- Score: 23.80748200206869
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: NLP is currently dominated by general-purpose pretrained language models like
RoBERTa, which achieve strong performance on NLU tasks through pretraining on
billions of words. But what exact knowledge or skills do Transformer LMs learn
from large-scale pretraining that they cannot learn from less data? We adopt
four probing methods---classifier probing, information-theoretic probing,
unsupervised relative acceptability judgment, and fine-tuning on NLU
tasks---and draw learning curves that track the growth of these different
measures of linguistic ability with respect to pretraining data volume using
the MiniBERTas, a group of RoBERTa models pretrained on 1M, 10M, 100M and 1B
words. We find that LMs require only about 10M or 100M words to learn
representations that reliably encode most syntactic and semantic features we
test. A much larger quantity of data is needed in order to acquire enough
commonsense knowledge and other skills required to master typical downstream
NLU tasks. The results suggest that, while the ability to encode linguistic
features is almost certainly necessary for language understanding, it is likely
that other forms of knowledge are the major drivers of recent improvements in
language understanding among large pretrained models.
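The classifier-probing method named in the abstract can be illustrated with a minimal sketch: train a small supervised classifier on frozen representations and read its accuracy as a measure of how well a linguistic feature is linearly encoded. Everything below is a toy stand-in (synthetic "representations" and a made-up binary feature), not the paper's setup, MiniBERTa models, or data.

```python
import math
import random

random.seed(0)

def make_representations(n=400, dim=16):
    """Fake 'frozen' hidden states: dimension 0 weakly encodes a binary
    linguistic feature (e.g. subject number); the rest is noise."""
    data = []
    for _ in range(n):
        y = random.randint(0, 1)
        x = [random.gauss(0.0, 1.0) for _ in range(dim)]
        x[0] += 2.0 * (y - 0.5)  # feature signal lives in dimension 0
        data.append((x, y))
    return data

def train_probe(data, lr=0.5, steps=300):
    """Logistic-regression probe: only w and b are learned; the
    representations themselves stay frozen, as in classifier probing."""
    dim = len(data[0][0])
    w, b = [0.0] * dim, 0.0
    for _ in range(steps):
        gw, gb = [0.0] * dim, 0.0
        for x, y in data:
            logit = sum(wi * xi for wi, xi in zip(w, x)) + b
            e = 1.0 / (1.0 + math.exp(-logit)) - y  # prediction error
            for i in range(dim):
                gw[i] += e * x[i]
            gb += e
        n = len(data)
        w = [wi - lr * gi / n for wi, gi in zip(w, gw)]
        b -= lr * gb / n
    return w, b

data = make_representations()
train, test = data[:300], data[300:]
w, b = train_probe(train)
acc = sum(
    ((sum(wi * xi for wi, xi in zip(w, x)) + b > 0) == bool(y))
    for x, y in test
) / len(test)
print(f"probe accuracy: {acc:.2f}")
```

Repeating this for models pretrained on 1M, 10M, 100M, and 1B words and plotting probe accuracy against data volume gives the kind of learning curve the abstract describes.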
Related papers
- Generalization v.s. Memorization: Tracing Language Models' Capabilities Back to Pretraining Data [76.90128359866462]
We introduce an extended concept of memorization, distributional memorization, which measures the correlation between the output probabilities and the pretraining data frequency.
This study demonstrates that memorization plays a larger role in simpler, knowledge-intensive tasks, while generalization is the key for harder, reasoning-based tasks.
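The distributional-memorization measure summarized above (correlation between a model's output probabilities and pretraining-data frequency) can be sketched as follows, under the assumption that a rank correlation such as Spearman's is used; the numbers are illustrative stand-ins, not the paper's data.

```python
def ranks(xs):
    """Rank of each value in xs (0 = smallest), assuming no ties."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0] * len(xs)
    for rank, i in enumerate(order):
        r[i] = rank
    return r

def spearman(a, b):
    """Spearman rank correlation, computed as Pearson on the ranks."""
    ra, rb = ranks(a), ranks(b)
    n = len(a)
    ma, mb = sum(ra) / n, sum(rb) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    va = sum((x - ma) ** 2 for x in ra)
    vb = sum((y - mb) ** 2 for y in rb)
    return cov / (va * vb) ** 0.5

# Toy data: how often each candidate answer appears in the pretraining
# corpus, and the probability the LM assigns to that answer.
corpus_freq = [5000, 1200, 300, 80, 10]
model_prob = [0.40, 0.25, 0.18, 0.10, 0.07]

rho = spearman(corpus_freq, model_prob)
print(f"distributional memorization score: {rho:.2f}")
```

A score near 1 means the model's output distribution tracks corpus frequency closely (memorization-like behavior); a score near 0 suggests the outputs are driven by something other than raw frequency.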
arXiv Detail & Related papers (2024-07-20T21:24:40Z)
- EXnet: Efficient In-context Learning for Data-less Text classification [0.0]
We present EXnet, a model specifically designed to perform in-context learning without limitations on the number of examples.
We argue that in-context learning is an effective method to increase task accuracy, and providing examples facilitates cross-task generalization.
With extensive experiments, we show that even our smallest model (15M parameters) generalizes to several unseen classification tasks and domains.
arXiv Detail & Related papers (2023-05-24T01:40:57Z)
- Pre-Training to Learn in Context [138.0745138788142]
The in-context learning ability of language models is not fully exploited because they are not explicitly trained to learn in context.
We propose PICL (Pre-training for In-Context Learning), a framework to enhance the language models' in-context learning ability.
Our experiments show that PICL is more effective and task-generalizable than a range of baselines, outperforming larger language models with nearly 4x as many parameters.
arXiv Detail & Related papers (2023-05-16T03:38:06Z)
- A Survey of Knowledge Enhanced Pre-trained Language Models [78.56931125512295]
We present a comprehensive review of Knowledge Enhanced Pre-trained Language Models (KE-PLMs).
For NLU, we divide the types of knowledge into four categories: linguistic knowledge, text knowledge, knowledge graph (KG) and rule knowledge.
The KE-PLMs for NLG are categorized into KG-based and retrieval-based methods.
arXiv Detail & Related papers (2022-11-11T04:29:02Z)
- AfroLM: A Self-Active Learning-based Multilingual Pretrained Language Model for 23 African Languages [0.021987601456703476]
We present AfroLM, a multilingual language model pretrained from scratch on 23 African languages.
AfroLM is pretrained on a dataset 14x smaller than existing baselines.
It is able to generalize well across various domains.
arXiv Detail & Related papers (2022-11-07T02:15:25Z)
- Language Model Pre-Training with Sparse Latent Typing [66.75786739499604]
We propose a new pre-training objective, Sparse Latent Typing, which enables the model to sparsely extract sentence-level keywords with diverse latent types.
Experimental results show that our model is able to learn interpretable latent type categories in a self-supervised manner without using any external knowledge.
arXiv Detail & Related papers (2022-10-23T00:37:08Z)
- Collaboration of Pre-trained Models Makes Better Few-shot Learner [49.89134194181042]
Few-shot classification requires deep neural networks to learn generalized representations only from limited training images.
Recently, CLIP-based methods have shown promising few-shot performance, benefiting from contrastive language-image pre-training.
We propose CoMo, a Collaboration of pre-trained Models that incorporates diverse prior knowledge from various pre-training paradigms for better few-shot learning.
arXiv Detail & Related papers (2022-09-25T16:23:12Z)
- Knowledge Based Multilingual Language Model [44.70205282863062]
We present a novel framework to pretrain knowledge based multilingual language models (KMLMs)
We generate a large amount of code-switched synthetic sentences and reasoning-based multilingual training data using the Wikidata knowledge graphs.
Based on the intra- and inter-sentence structures of the generated data, we design pretraining tasks to facilitate knowledge learning.
arXiv Detail & Related papers (2021-11-22T02:56:04Z)
- From Universal Language Model to Downstream Task: Improving RoBERTa-Based Vietnamese Hate Speech Detection [8.602181445598776]
We propose a pipeline to adapt the general-purpose RoBERTa language model to a specific text classification task: Vietnamese Hate Speech Detection.
Our experiments show that the proposed pipeline boosts performance significantly, achieving a new state of the art on the Vietnamese Hate Speech Detection campaign with an F1 score of 0.7221.
arXiv Detail & Related papers (2021-02-24T09:30:55Z)
- Emergent Communication Pretraining for Few-Shot Machine Translation [66.48990742411033]
We pretrain neural networks via emergent communication from referential games.
Our key assumption is that grounding communication on images---as a crude approximation of real-world environments---inductively biases the model towards learning natural languages.
arXiv Detail & Related papers (2020-11-02T10:57:53Z)
- Learning Which Features Matter: RoBERTa Acquires a Preference for Linguistic Generalizations (Eventually) [25.696099563130517]
We introduce a new English-language diagnostic set called MSGS (the Mixed Signals Generalization Set)
MSGS consists of 20 ambiguous binary classification tasks that we use to test whether a pretrained model prefers linguistic or surface generalizations during fine-tuning.
We pretrain RoBERTa models from scratch on quantities of data ranging from 1M to 1B words and compare their performance on MSGS to the publicly available RoBERTa-base.
We find that models can learn to represent linguistic features with little pretraining data, but require far more data to learn to prefer linguistic generalizations over surface ones.
arXiv Detail & Related papers (2020-10-11T22:09:27Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.