Related papers: Learning Syntax Without Planting Trees: Understanding When and Why Transformers Generalize Hierarchically

URL: http://arxiv.org/abs/2404.16367v2
Date: Fri, 31 May 2024 23:47:15 GMT
Title: Learning Syntax Without Planting Trees: Understanding When and Why Transformers Generalize Hierarchically
Authors: Kabir Ahuja, Vidhisha Balachandran, Madhur Panwar, Tianxing He, Noah A. Smith, Navin Goyal, Yulia Tsvetkov
Abstract summary: Transformers trained on natural language data have been shown to learn its hierarchical structure and generalize to sentences with unseen syntactic structures. We investigate sources of inductive bias in transformer models and their training that could cause such generalization behavior to emerge.
Score: 74.96551626420188
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Transformers trained on natural language data have been shown to learn its hierarchical structure and generalize to sentences with unseen syntactic structures without explicitly encoding any structural bias. In this work, we investigate sources of inductive bias in transformer models and their training that could cause such generalization behavior to emerge. We extensively experiment with transformer models trained on multiple synthetic datasets and with different training objectives and show that while other objectives (e.g., sequence-to-sequence modeling and prefix language modeling) often failed to lead to hierarchical generalization, models trained with the language modeling objective consistently learned to generalize hierarchically. We then conduct pruning experiments to study how transformers trained with the language modeling objective encode hierarchical structure. When these models are pruned, we find that subnetworks with different generalization behaviors coexist within the model (subnetworks corresponding to hierarchical structure and linear order). Finally, we take a Bayesian perspective to further uncover transformers' preference for hierarchical generalization: we establish a correlation between whether transformers generalize hierarchically on a dataset and whether the simplest explanation of that dataset is provided by a hierarchical grammar rather than by regular grammars exhibiting linear generalization.

Related papers

Scaling Laws and Representation Learning in Simple Hierarchical Languages: Transformers vs. Convolutional Architectures [49.19753720526998]
We derive theoretical scaling laws for neural network performance on synthetic datasets. We validate that convolutional networks, whose structure aligns with that of the generative process through locality and weight sharing, enjoy a faster scaling of performance. This finding clarifies the architectural biases underlying neural scaling laws and highlights how representation learning is shaped by the interaction between model architecture and the statistical properties of data.
arXiv Detail & Related papers (2025-05-11T17:44:14Z)
How to Plant Trees in Language Models: Data and Architectural Effects on the Emergence of Syntactic Inductive Biases [28.58785395946639]
We show that pre-training can teach language models to rely on hierarchical syntactic features when performing tasks after fine-tuning. We focus on architectural features (depth, width, and number of parameters), as well as the genre and size of the pre-training corpus.
arXiv Detail & Related papers (2023-05-31T14:38:14Z)
Grokking of Hierarchical Structure in Vanilla Transformers [72.45375959893218]
We show that transformer language models can learn to generalize hierarchically after training for extremely long periods. Intermediate-depth models generalize better than both very deep and very shallow transformers.
arXiv Detail & Related papers (2023-05-30T04:34:13Z)
Characterizing Intrinsic Compositionality in Transformers with Tree Projections [72.45375959893218]
Neural models like transformers can route information arbitrarily between different parts of their input. We show that transformers for three different tasks become more tree-like over the course of training. These trees are predictive of model behavior, with more tree-like models generalizing better on tests of compositional generalization.
arXiv Detail & Related papers (2022-11-02T17:10:07Z)
Structural Biases for Improving Transformers on Translation into Morphologically Rich Languages [120.74406230847904]
TP-Transformer augments the traditional Transformer architecture to include an additional component to represent structure. The second method imbues structure at the data level by segmenting the data with morphological tokenization. We find that each of these two approaches allows the network to achieve better performance, but this improvement is dependent on the size of the dataset.
arXiv Detail & Related papers (2022-08-11T22:42:24Z)
Forming Trees with Treeformers [3.8073142980733]
Many state-of-the-art neural network models such as Transformers have no explicit hierarchical structure in their architecture. We introduce Treeformer, a general-purpose encoder module inspired by the CKY algorithm. Our experiments demonstrate the benefits of incorporating hierarchical structure into the Transformer.
arXiv Detail & Related papers (2022-07-14T14:39:30Z)
Compositional Generalization Requires Compositional Parsers [69.77216620997305]
We compare sequence-to-sequence models and models guided by compositional principles on the recent COGS corpus. We show structural generalization is a key measure of compositional generalization and requires models that are aware of complex structure.
arXiv Detail & Related papers (2022-02-24T07:36:35Z)
Transformers Generalize Linearly [1.7709450506466664]
We examine patterns of structural generalization for Transformer sequence-to-sequence models. We find that not only do Transformers fail to generalize hierarchically across a wide variety of grammatical mapping tasks, but they exhibit an even stronger preference for linear generalization than comparable networks.
arXiv Detail & Related papers (2021-09-24T15:48:46Z)
Compositional Generalization via Semantic Tagging [81.24269148865555]
We propose a new decoding framework that preserves the expressivity and generality of sequence-to-sequence models. We show that the proposed approach consistently improves compositional generalization across model architectures, domains, and semantic formalisms.
arXiv Detail & Related papers (2020-10-22T15:55:15Z)
Does syntax need to grow on trees? Sources of hierarchical inductive bias in sequence-to-sequence networks [28.129220683169052]
In neural network models, inductive biases could in theory arise from any aspect of the model architecture. We investigate which architectural factors affect the generalization behavior of neural sequence-to-sequence models trained on two syntactic tasks.
arXiv Detail & Related papers (2020-01-10T19:02:52Z)

This list is automatically generated from the titles and abstracts of the papers in this site.
