How to Plant Trees in Language Models: Data and Architectural Effects on
the Emergence of Syntactic Inductive Biases
- URL: http://arxiv.org/abs/2305.19905v1
- Date: Wed, 31 May 2023 14:38:14 GMT
- Title: How to Plant Trees in Language Models: Data and Architectural Effects on
the Emergence of Syntactic Inductive Biases
- Authors: Aaron Mueller, Tal Linzen
- Abstract summary: We show that pre-training can teach language models to rely on hierarchical syntactic features when performing tasks after fine-tuning.
We focus on architectural features (depth, width, and number of parameters), as well as the genre and size of the pre-training corpus.
- Score: 28.58785395946639
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Accurate syntactic representations are essential for robust generalization in
natural language. Recent work has found that pre-training can teach language
models to rely on hierarchical syntactic features - as opposed to incorrect
linear features - when performing tasks after fine-tuning. We test what aspects
of pre-training are important for endowing encoder-decoder Transformers with an
inductive bias that favors hierarchical syntactic generalizations. We focus on
architectural features (depth, width, and number of parameters), as well as the
genre and size of the pre-training corpus, diagnosing inductive biases using
two syntactic transformation tasks: question formation and passivization, both
in English. We find that the number of parameters alone does not explain
hierarchical generalization: model depth plays a greater role than model width.
We also find that pre-training on simpler language, such as child-directed
speech, induces a hierarchical bias using an order of magnitude less data than
pre-training on more typical datasets based on web text or Wikipedia; this
suggests that in cognitively plausible language acquisition settings, neural
language models may be more data-efficient than previously thought.
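To make the diagnostic concrete, the sketch below (a minimal toy illustration, not the paper's actual pipeline or data) contrasts the two competing rules for English question formation: a linear rule that fronts the first auxiliary in the string versus the correct hierarchical rule that fronts the main-clause auxiliary. The sentence templates, auxiliary list, and hand-supplied main-clause index are illustrative assumptions.
```python
# Toy contrast between the linear rule (front the FIRST auxiliary) and the
# hierarchical rule (front the MAIN-CLAUSE auxiliary) for English question
# formation. Sentences, auxiliary list, and indices are illustrative only.

AUXILIARIES = {"is", "are", "was", "were", "can", "will", "do", "does"}


def linear_question(sentence: str) -> str:
    """Front the first auxiliary encountered left to right (the incorrect rule)."""
    words = sentence.rstrip(".").split()
    for i, word in enumerate(words):
        if word in AUXILIARIES:
            rest = words[:i] + words[i + 1:]
            return " ".join([word.capitalize()] + rest) + "?"
    return sentence


def hierarchical_question(sentence: str, main_aux_index: int) -> str:
    """Front the main-clause auxiliary (the correct rule).

    The index of the main-clause auxiliary is supplied by hand here; in the
    actual task a model must infer it from hierarchical structure.
    """
    words = sentence.rstrip(".").split()
    aux = words[main_aux_index]
    rest = words[:main_aux_index] + words[main_aux_index + 1:]
    return " ".join([aux.capitalize()] + rest) + "?"


# Ambiguous example: both rules yield the same question.
simple = "the dog is hungry."
assert linear_question(simple) == hierarchical_question(simple, 2) == "Is the dog hungry?"

# Disambiguating example: the subject contains a relative clause, so the first
# auxiliary ("is" in "that is sleeping") is not the main-clause auxiliary.
complex_subject = "the dog that is sleeping is hungry."
print(linear_question(complex_subject))           # Is the dog that sleeping is hungry? (wrong)
print(hierarchical_question(complex_subject, 5))  # Is the dog that is sleeping hungry? (right)
```
On ambiguous training-style sentences the two rules produce the same question, so only held-out sentences with a relative clause in the subject reveal which generalization a fine-tuned model has acquired.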
Related papers
- Generalization v.s. Memorization: Tracing Language Models' Capabilities Back to Pretraining Data [76.90128359866462]
Large language models (LLMs) have sparked debate over whether they genuinely generalize to unseen tasks or rely on memorizing vast amounts of pretraining data.
We introduce an extended concept of memorization, distributional memorization, which measures the correlation between the LLM output probabilities and the pretraining data frequency.
This study demonstrates that memorization plays a larger role in simpler, knowledge-intensive tasks, while generalization is the key for harder, reasoning-based tasks.
arXiv Detail & Related papers (2024-07-20T21:24:40Z)
- Towards a theory of how the structure of language is acquired by deep neural networks [6.363756171493383]
We use a tree-like generative model that captures many of the hierarchical structures found in natural languages.
We show that token-token correlations can be used to build a representation of the grammar's hidden variables.
We conjecture that the relationship between training set size and effective range of correlations holds beyond our synthetic datasets.
arXiv Detail & Related papers (2024-05-28T17:01:22Z)
- Learning Syntax Without Planting Trees: Understanding When and Why Transformers Generalize Hierarchically [74.96551626420188]
Transformers trained on natural language data have been shown to learn its hierarchical structure and generalize to sentences with unseen syntactic structures.
We investigate sources of inductive bias in transformer models and their training that could cause such generalization behavior to emerge.
arXiv Detail & Related papers (2024-04-25T07:10:29Z)
- Syntax and Semantics Meet in the "Middle": Probing the Syntax-Semantics Interface of LMs Through Agentivity [68.8204255655161]
We present the semantic notion of agentivity as a case study for probing the interaction between syntax and semantics in LMs.
This suggests that LMs may serve as more useful tools for linguistic annotation, theory testing, and discovery.
arXiv Detail & Related papers (2023-05-29T16:24:01Z)
- How poor is the stimulus? Evaluating hierarchical generalization in neural networks trained on child-directed speech [25.02822854434971]
We train LSTMs and Transformers on data similar in quantity and content to children's linguistic input: text from the CHILDES corpus.
We find that both model types generalize in a way more consistent with an incorrect linear rule than the correct hierarchical rule.
These results suggest that human-like generalization from text alone requires stronger biases than the general sequence-processing biases of standard neural network architectures.
arXiv Detail & Related papers (2023-01-26T23:24:17Z)
- Is neural language acquisition similar to natural? A chronological probing study [0.0515648410037406]
We present a chronological probing study of English Transformer models such as MultiBERT and T5.
We compare the linguistic information the models acquire over the course of training on their corpora.
The results show that 1) linguistic information is acquired in the early stages of training and 2) both language models capture features from multiple levels of language.
arXiv Detail & Related papers (2022-07-01T17:24:11Z)
- Unnatural Language Inference [48.45003475966808]
We find that state-of-the-art NLI models, such as RoBERTa and BART, are invariant to, and sometimes even perform better on, examples with randomly reordered words.
Our findings call into question the idea that our natural language understanding models, and the tasks used for measuring their progress, genuinely require a human-like understanding of syntax.
arXiv Detail & Related papers (2020-12-30T20:40:48Z)
- Infusing Finetuning with Semantic Dependencies [62.37697048781823]
We show that, unlike syntax, semantics is not brought to the surface by today's pretrained models.
We then use convolutional graph encoders to explicitly incorporate semantic parses into task-specific finetuning.
arXiv Detail & Related papers (2020-12-10T01:27:24Z)
- Comparison of Interactive Knowledge Base Spelling Correction Models for Low-Resource Languages [81.90356787324481]
Spelling normalization for low-resource languages is a challenging task because the patterns are hard to predict.
This work compares a neural model and character language models trained with varying amounts of target-language data.
Our usage scenario is interactive correction with nearly zero training examples, improving the models as more data is collected.
arXiv Detail & Related papers (2020-10-20T17:31:07Z)
- Learning Which Features Matter: RoBERTa Acquires a Preference for Linguistic Generalizations (Eventually) [25.696099563130517]
We introduce a new English-language diagnostic set called MSGS (the Mixed Signals Generalization Set).
MSGS consists of 20 ambiguous binary classification tasks that we use to test whether a pretrained model prefers linguistic or surface generalizations during fine-tuning.
We pretrain RoBERTa models from scratch on quantities of data ranging from 1M to 1B words and compare their performance on MSGS to the publicly available RoBERTa-base.
We find that models can learn to represent linguistic features with little pretraining data, but require far more data to learn to prefer linguistic generalizations over surface ones (a toy sketch of such an ambiguous task follows this entry).
arXiv Detail & Related papers (2020-10-11T22:09:27Z)
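As a concrete illustration of the kind of ambiguous diagnostic task the MSGS entry above describes, the sketch below (with hypothetical sentences and features, not the actual MSGS data) shows how a linguistic feature and a surface feature can agree on every training example and then be decoupled at test time, so a classifier's test predictions reveal which feature it relied on.
```python
# Toy version of an "ambiguous" binary classification task in the style of MSGS.
# Sentences and features are hypothetical, not drawn from the actual dataset.

def linguistic_feature(sentence: str) -> bool:
    """Hypothetical linguistic property: the sentence is a question."""
    return sentence.endswith("?")


def surface_feature(sentence: str) -> bool:
    """Hypothetical surface property: the sentence contains the word 'the'."""
    return "the" in sentence.lower().split()


# Ambiguous training data: both features agree with the gold label on every
# example, so a learner cannot tell from training alone which feature to use.
train = [
    ("Is the cat asleep?", 1),
    ("Did the storm pass?", 1),
    ("A bird sang.", 0),
    ("Snow fell overnight.", 0),
]
assert all(linguistic_feature(s) == surface_feature(s) == bool(y) for s, y in train)

# Disambiguating test data: the features now disagree, so a model that learned
# the surface rule and one that learned the linguistic rule predict opposite labels.
test = [
    ("Is it raining?", 1),           # question without "the"
    ("The train arrived late.", 0),  # non-question containing "the"
]
for sentence, gold in test:
    print(sentence,
          "| linguistic:", int(linguistic_feature(sentence)),
          "| surface:", int(surface_feature(sentence)),
          "| gold:", gold)
```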
This list is automatically generated from the titles and abstracts of the papers in this site.