Coloring the Blank Slate: Pre-training Imparts a Hierarchical Inductive
Bias to Sequence-to-sequence Models
- URL: http://arxiv.org/abs/2203.09397v1
- Date: Thu, 17 Mar 2022 15:46:53 GMT
- Title: Coloring the Blank Slate: Pre-training Imparts a Hierarchical Inductive
Bias to Sequence-to-sequence Models
- Authors: Aaron Mueller, Robert Frank, Tal Linzen, Luheng Wang, Sebastian
Schuster
- Abstract summary: Sequence-to-sequence (seq2seq) models often fail to generalize in a hierarchy-sensitive manner when performing syntactic transformations.
We find that pre-trained seq2seq models generalize hierarchically when performing syntactic transformations, whereas models trained from scratch on syntactic transformations do not.
- Score: 23.21767225871304
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Relations between words are governed by hierarchical structure rather than
linear ordering. Sequence-to-sequence (seq2seq) models, despite their success
in downstream NLP applications, often fail to generalize in a
hierarchy-sensitive manner when performing syntactic transformations - for
example, transforming declarative sentences into questions. However, syntactic
evaluations of seq2seq models have only observed models that were not
pre-trained on natural language data before being trained to perform syntactic
transformations, in spite of the fact that pre-training has been found to
induce hierarchical linguistic generalizations in language models; in other
words, the syntactic capabilities of seq2seq models may have been greatly
understated. We address this gap using the pre-trained seq2seq models T5 and
BART, as well as their multilingual variants mT5 and mBART. We evaluate whether
they generalize hierarchically on two transformations in two languages:
question formation and passivization in English and German. We find that
pre-trained seq2seq models generalize hierarchically when performing syntactic
transformations, whereas models trained from scratch on syntactic
transformations do not. This result presents evidence for the learnability of
hierarchical syntactic information from non-annotated natural language text
while also demonstrating that seq2seq models are capable of syntactic
generalization, though only after exposure to much more language data than
human learners receive.
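
The abstract describes an evaluation in which fine-tuned seq2seq models must choose between a hierarchical rule (front the main-clause auxiliary) and a linear rule (front the first auxiliary) when turning declaratives into questions. The sketch below illustrates one way such a probe could be scored with a Hugging Face T5 checkpoint; the checkpoint path, the "decl:" prompt prefix, and the test sentence are illustrative assumptions, not the authors' released setup.

```python
# A minimal sketch (not the authors' released code) of how one might check
# whether a fine-tuned seq2seq model applies the hierarchical rule (front the
# main-clause auxiliary) or the linear rule (front the first auxiliary) when
# forming questions. Assumes a Hugging Face T5 checkpoint already fine-tuned
# on declarative -> question pairs; the checkpoint path and "decl:" prompt
# prefix are hypothetical placeholders.
from transformers import T5ForConditionalGeneration, T5Tokenizer

model_name = "path/to/finetuned-t5-question-formation"  # hypothetical checkpoint
tokenizer = T5Tokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)

# Disambiguating test item: the linearly first auxiliary ("has") sits inside a
# relative clause, so only fronting the main-clause auxiliary ("can") yields
# the hierarchically correct question "can the newt that has eaten swim ?".
declarative = "the newt that has eaten can swim ."

inputs = tokenizer("decl: " + declarative, return_tensors="pt")
pred_ids = model.generate(**inputs, max_new_tokens=32)
prediction = tokenizer.decode(pred_ids[0], skip_special_tokens=True)

# Classify the generalization by which auxiliary was fronted.
first_word = prediction.strip().split()[0].lower() if prediction.strip() else ""
if first_word == "can":
    print("hierarchical generalization:", prediction)
elif first_word == "has":
    print("linear generalization:", prediction)
else:
    print("other output:", prediction)
```

Aggregating this first-auxiliary check over a set of disambiguating test sentences gives the kind of hierarchical-versus-linear generalization rate the abstract refers to.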
Related papers
- Learning Syntax Without Planting Trees: Understanding When and Why Transformers Generalize Hierarchically [74.96551626420188]
Transformers trained on natural language data have been shown to learn its hierarchical structure and generalize to sentences with unseen syntactic structures.
We investigate sources of inductive bias in transformer models and their training that could cause such generalization behavior to emerge.
arXiv Detail & Related papers (2024-04-25T07:10:29Z)
- How to Plant Trees in Language Models: Data and Architectural Effects on the Emergence of Syntactic Inductive Biases [28.58785395946639]
We show that pre-training can teach language models to rely on hierarchical syntactic features when performing tasks after fine-tuning.
We focus on architectural features (depth, width, and number of parameters), as well as the genre and size of the pre-training corpus.
arXiv Detail & Related papers (2023-05-31T14:38:14Z)
- Hierarchical Phrase-based Sequence-to-Sequence Learning [94.10257313923478]
We describe a neural transducer that maintains the flexibility of standard sequence-to-sequence (seq2seq) models while incorporating hierarchical phrases as a source of inductive bias during training and as explicit constraints during inference.
Our approach trains two models: a discriminative parser based on a bracketing grammar whose derivation tree hierarchically aligns source and target phrases, and a neural seq2seq model that learns to translate the aligned phrases one-by-one.
arXiv Detail & Related papers (2022-11-15T05:22:40Z)
- Structural generalization is hard for sequence-to-sequence models [85.0087839979613]
Sequence-to-sequence (seq2seq) models have been successful across many NLP tasks.
Recent work on compositional generalization has shown that seq2seq models achieve very low accuracy in generalizing to linguistic structures that were not seen in training.
arXiv Detail & Related papers (2022-10-24T09:03:03Z)
- Compositional Generalization Requires Compositional Parsers [69.77216620997305]
We compare sequence-to-sequence models and models guided by compositional principles on the recent COGS corpus.
We show structural generalization is a key measure of compositional generalization and requires models that are aware of complex structure.
arXiv Detail & Related papers (2022-02-24T07:36:35Z)
- Transformers Generalize Linearly [1.7709450506466664]
We examine patterns of structural generalization for Transformer sequence-to-sequence models.
We find that not only do Transformers fail to generalize hierarchically across a wide variety of grammatical mapping tasks, but they exhibit an even stronger preference for linear generalization than comparable networks.
arXiv Detail & Related papers (2021-09-24T15:48:46Z)
- Structured Reordering for Modeling Latent Alignments in Sequence Transduction [86.94309120789396]
We present an efficient dynamic programming algorithm performing exact marginal inference of separable permutations.
The resulting seq2seq model exhibits better systematic generalization than standard models on synthetic problems and NLP tasks.
arXiv Detail & Related papers (2021-06-06T21:53:54Z)
- Structural Supervision Improves Few-Shot Learning and Syntactic Generalization in Neural Language Models [47.42249565529833]
Humans can learn structural properties about a word from minimal experience.
We assess the ability of modern neural language models to reproduce this behavior in English.
arXiv Detail & Related papers (2020-10-12T14:12:37Z)
This list is automatically generated from the titles and abstracts of the papers on this site. The site does not guarantee the quality of this information and is not responsible for any consequences of its use.