Learning Which Features Matter: RoBERTa Acquires a Preference for
Linguistic Generalizations (Eventually)
- URL: http://arxiv.org/abs/2010.05358v1
- Date: Sun, 11 Oct 2020 22:09:27 GMT
- Title: Learning Which Features Matter: RoBERTa Acquires a Preference for
Linguistic Generalizations (Eventually)
- Authors: Alex Warstadt, Yian Zhang, Haau-Sing Li, Haokun Liu, Samuel R. Bowman
- Abstract summary: We introduce a new English-language diagnostic set called MSGS (the Mixed Signals Generalization Set)
MSGS consists of 20 ambiguous binary classification tasks that we use to test whether a pretrained model prefers linguistic or surface generalizations during fine-tuning.
We pretrain RoBERTa models from scratch on quantities of data ranging from 1M to 1B words and compare their performance on MSGS to the publicly available RoBERTa-base.
We find that models can learn to represent linguistic features with little pretraining data, but require far more data to learn to prefer linguistic generalizations over surface ones.
- Score: 25.696099563130517
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: One reason pretraining on self-supervised linguistic tasks is effective is
that it teaches models features that are helpful for language understanding.
However, we want pretrained models to learn not only to represent linguistic
features, but also to use those features preferentially during fine-tuning.
With this goal in mind, we introduce a new English-language diagnostic set
called MSGS (the Mixed Signals Generalization Set), which consists of 20
ambiguous binary classification tasks that we use to test whether a pretrained
model prefers linguistic or surface generalizations during fine-tuning. We
pretrain RoBERTa models from scratch on quantities of data ranging from 1M to
1B words and compare their performance on MSGS to the publicly available
RoBERTa-base. We find that models can learn to represent linguistic features
with little pretraining data, but require far more data to learn to prefer
linguistic generalizations over surface ones. Eventually, with about 30B words
of pretraining data, RoBERTa-base does demonstrate a linguistic bias with some
regularity. We conclude that while self-supervised pretraining is an effective
way to learn helpful inductive biases, there is likely room to improve the rate
at which models learn which features matter.
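Since MSGS is built around this ambiguity, a small illustration may help. The sketch below uses invented toy sentences, a toy surface feature (presence of the word "the"), a toy linguistic feature (a crude past-tense check), and placeholder predictions in place of a fine-tuned RoBERTa classifier; it is only meant to mirror the logic of the setup, with a Matthews-correlation score on the disambiguating data standing in for the paper's linguistic bias measure, not to reproduce the actual MSGS stimuli.
```python
# Minimal sketch of an MSGS-style ambiguous task (toy data, not real MSGS stimuli).
from sklearn.metrics import matthews_corrcoef


def surface_feature(sentence: str) -> int:
    """Toy surface feature: does the sentence contain the word 'the'?"""
    return int("the" in sentence.lower().split())


def linguistic_feature(sentence: str) -> int:
    """Toy linguistic feature: is the second word a past-tense verb?
    A crude stand-in for MSGS-style linguistic features."""
    return int(sentence.split()[1].endswith("ed"))


# Ambiguous training data: the surface and linguistic features agree on
# every example, so either generalization fits the labels perfectly.
ambiguous_train = [
    ("She walked past the house", 1),  # past tense AND contains "the"
    ("She walks past a house", 0),     # present tense AND no "the"
]
assert all(surface_feature(s) == linguistic_feature(s) == y
           for s, y in ambiguous_train)

# Disambiguating test data: the two features now conflict, so the
# fine-tuned model's predictions reveal which generalization it adopted.
disambiguating_test = [
    "She walked past a house",   # linguistic label 1, surface label 0
    "She walks past the house",  # linguistic label 0, surface label 1
]
linguistic_labels = [linguistic_feature(s) for s in disambiguating_test]

# Placeholder predictions standing in for a fine-tuned RoBERTa classifier.
predictions = [1, 0]

# Score predictions against the linguistic labels with Matthews correlation:
# +1 indicates a fully linguistic generalization, -1 a fully surface one.
score = matthews_corrcoef(linguistic_labels, predictions)
print(f"linguistic bias score: {score:+.2f}")
```
In this sketch the training labels are recoverable from either feature, so only the disambiguating test items can tell the two generalizations apart; the sign of the correlation then indicates which feature the classifier relied on.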
Related papers
- How to Plant Trees in Language Models: Data and Architectural Effects on
the Emergence of Syntactic Inductive Biases [28.58785395946639]
We show that pre-training can teach language models to rely on hierarchical syntactic features when performing tasks after fine-tuning.
We focus on architectural features (depth, width, and number of parameters), as well as the genre and size of the pre-training corpus.
arXiv Detail & Related papers (2023-05-31T14:38:14Z)
- Pre-Training to Learn in Context [138.0745138788142]
The in-context learning ability of language models is not fully exploited because they are not explicitly trained to learn in context.
We propose PICL (Pre-training for In-Context Learning), a framework to enhance the language models' in-context learning ability.
Our experiments show that PICL is more effective and task-generalizable than a range of baselines, outperforming larger language models with nearly 4x as many parameters.
arXiv Detail & Related papers (2023-05-16T03:38:06Z)
- Language Model Pre-Training with Sparse Latent Typing [66.75786739499604]
We propose a new pre-training objective, Sparse Latent Typing, which enables the model to sparsely extract sentence-level keywords with diverse latent types.
Experimental results show that our model is able to learn interpretable latent type categories in a self-supervised manner without using any external knowledge.
arXiv Detail & Related papers (2022-10-23T00:37:08Z)
- Bidirectional Language Models Are Also Few-shot Learners [54.37445173284831]
We present SAP (Sequential Autoregressive Prompting), a technique that enables the prompting of bidirectional models.
We show SAP is effective on question answering and summarization.
For the first time, our results demonstrate prompt-based learning is an emergent property of a broader class of language models.
arXiv Detail & Related papers (2022-09-29T01:35:57Z)
- Pre-training Data Quality and Quantity for a Low-Resource Language: New Corpus and BERT Models for Maltese [4.4681678689625715]
We analyse the effect of pre-training with monolingual data for a low-resource language.
We present a newly created corpus for Maltese, and determine the effect that the pre-training data size and domain have on the downstream performance.
We compare two models on the new corpus: a monolingual BERT model trained from scratch (BERTu), and a further pre-trained multilingual BERT (mBERTu)
arXiv Detail & Related papers (2022-05-21T06:44:59Z)
- Interpreting Language Models Through Knowledge Graph Extraction [42.97929497661778]
We compare BERT-based language models through snapshots of acquired knowledge at sequential stages of the training process.
We present a methodology to unveil a knowledge acquisition timeline by generating knowledge graph extracts from cloze "fill-in-the-blank" statements.
We extend this analysis to a comparison of pretrained variations of BERT models (DistilBERT, BERT-base, RoBERTa)
arXiv Detail & Related papers (2021-11-16T15:18:01Z)
- Pre-Training a Language Model Without Human Language [74.11825654535895]
We study how the intrinsic nature of pre-training data contributes to the fine-tuned downstream performance.
We find that models pre-trained on unstructured data beat those trained directly from scratch on downstream tasks.
Surprisingly, we find that pre-training on certain non-human language data gives GLUE performance close to that of pre-training on another non-English language.
arXiv Detail & Related papers (2020-12-22T13:38:06Z)
- When Do You Need Billions of Words of Pretraining Data? [23.80748200206869]
We ask what Transformer LMs learn from large-scale pretraining that they cannot learn from less data.
We find that LMs require only about 10M or 100M words to learn representations that reliably encode most syntactic and semantic features.
arXiv Detail & Related papers (2020-11-10T07:16:18Z)
- Comparison of Interactive Knowledge Base Spelling Correction Models for Low-Resource Languages [81.90356787324481]
Spelling normalization for low resource languages is a challenging task because the patterns are hard to predict.
This work compares a neural model and character language models with varying amounts of target language data.
Our usage scenario is interactive correction with nearly zero amounts of training examples, improving models as more data is collected.
arXiv Detail & Related papers (2020-10-20T17:31:07Z)
- Pretrained Language Model Embryology: The Birth of ALBERT [68.5801642674541]
We investigate the developmental process from a set of randomly initialized parameters to a totipotent language model.
Our results show that ALBERT learns to reconstruct and predict tokens of different parts of speech (POS) at different rates during pretraining.
These findings suggest that the knowledge of a pretrained model varies over the course of pretraining, and that more pretraining steps do not necessarily give a model more comprehensive knowledge.
arXiv Detail & Related papers (2020-10-06T05:15:39Z)
- Multilingual Jointly Trained Acoustic and Written Word Embeddings [22.63696520064212]
We extend this idea to multiple low-resource languages.
We jointly train an AWE model and an AGWE model, using phonetically transcribed data from multiple languages.
The pre-trained models can then be used for unseen zero-resource languages, or fine-tuned on data from low-resource languages.
arXiv Detail & Related papers (2020-06-24T19:16:02Z)
This list is automatically generated from the titles and abstracts of the papers on this site.