Grokking of Hierarchical Structure in Vanilla Transformers
- URL: http://arxiv.org/abs/2305.18741v1
- Date: Tue, 30 May 2023 04:34:13 GMT
- Title: Grokking of Hierarchical Structure in Vanilla Transformers
- Authors: Shikhar Murty, Pratyusha Sharma, Jacob Andreas, Christopher D. Manning
- Abstract summary: We show that transformer language models can learn to generalize hierarchically after training for extremely long periods. Intermediate-depth models generalize better than both very deep and very shallow transformers.
- Score: 72.45375959893218
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: For humans, language production and comprehension are sensitive to the
hierarchical structure of sentences. In natural language processing, past work
has questioned how effectively neural sequence models like transformers capture
this hierarchical structure when generalizing to structurally novel inputs. We
show that transformer language models can learn to generalize hierarchically
after training for extremely long periods -- far beyond the point when
in-domain accuracy has saturated. We call this phenomenon "structural
grokking". On multiple datasets, structural grokking exhibits inverted U-shaped
scaling in model depth: intermediate-depth models generalize better than both
very deep and very shallow transformers. When analyzing the relationship
between model-internal properties and grokking, we find that optimal depth for
grokking can be identified using the tree-structuredness metric of
Murty et al. (2023). Overall, our work provides strong evidence that,
with extended training, vanilla transformers discover and use hierarchical
structure.
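To make the central claim concrete, below is a minimal sketch of how structural grokking could be tracked experimentally: keep training far past in-domain saturation and log in-domain and out-of-distribution (structurally novel) accuracy separately, watching for the late rise in the latter. The helper callables (`train_step`, `in_domain_acc`, `ood_acc`) are placeholders, not the paper's actual code.

```python
from typing import Callable, List, Tuple

def track_structural_grokking(
    train_step: Callable[[], None],       # one gradient update (placeholder)
    in_domain_acc: Callable[[], float],   # accuracy on held-out in-domain data
    ood_acc: Callable[[], float],         # accuracy on structurally novel data
    total_steps: int = 300_000,
    eval_every: int = 1_000,
) -> List[Tuple[int, float, float]]:
    """Run far past in-domain saturation; structural grokking appears as
    ood_acc rising long after in_domain_acc has plateaued near ceiling."""
    history = []
    for step in range(total_steps):
        train_step()
        if step % eval_every == 0:
            history.append((step, in_domain_acc(), ood_acc()))
    return history
```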
Related papers
- Learning Syntax Without Planting Trees: Understanding When and Why Transformers Generalize Hierarchically [74.96551626420188]
Transformers trained on natural language data have been shown to learn its hierarchical structure and generalize to sentences with unseen syntactic structures.
We investigate sources of inductive bias in transformer models and their training that could cause such generalization behavior to emerge.
arXiv Detail & Related papers (2024-04-25T07:10:29Z)
- How Do Transformers Learn Topic Structure: Towards a Mechanistic Understanding [56.222097640468306]
We provide a mechanistic understanding of how transformers learn "semantic structure".
We show, through a combination of mathematical analysis and experiments on Wikipedia data, that the embedding layer and the self-attention layer encode the topical structure.
arXiv Detail & Related papers (2023-03-07T21:42:17Z)
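One simple way to probe whether an embedding layer encodes topical structure, in the spirit of (though not identical to) that paper's analysis, is to compare average same-topic and cross-topic embedding similarity; the function below is an illustrative sketch.

```python
import itertools
import numpy as np

def topic_coherence_gap(emb: dict, topics: dict) -> float:
    """Mean same-topic minus mean cross-topic cosine similarity of word
    embeddings; a positive gap suggests the embeddings encode topics.
    `emb`: word -> vector, `topics`: topic name -> list of words."""
    def cos(u, v):
        return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

    same = [cos(emb[a], emb[b])
            for words in topics.values()
            for a, b in itertools.combinations(words, 2)]
    cross = [cos(emb[a], emb[b])
             for t1, t2 in itertools.combinations(topics, 2)
             for a in topics[t1] for b in topics[t2]]
    return float(np.mean(same) - np.mean(cross))
```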
- Characterizing Intrinsic Compositionality in Transformers with Tree Projections [72.45375959893218]
Neural models like transformers can route information arbitrarily between different parts of their input.
We show that transformers for three different tasks become more treelike over the course of training.
These trees are predictive of model behavior, with more tree-like models generalizing better on tests of compositional generalization.
arXiv Detail & Related papers (2022-11-02T17:10:07Z)
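A heavily simplified illustration of the treelikeness intuition behind tree projections: a span that acts like a constituent should receive roughly the same representation whether or not its outside context is present. The actual method of Murty et al. (2023) solves an optimization over binary trees; this sketch only computes a per-span context-invariance score, with `encode` as an assumed hook returning one vector per token.

```python
import numpy as np
from typing import Callable, List, Tuple

def span_context_invariance(
    encode: Callable[[List[str]], np.ndarray],  # tokens -> (len, dim) array
    sentence: List[str],
    span: Tuple[int, int],                      # half-open [i, j)
) -> float:
    """Cosine similarity between a span's mean representation computed
    inside the full sentence and the same span encoded in isolation;
    values near 1 mean the span behaves like a self-contained unit."""
    i, j = span
    in_context = encode(sentence)[i:j].mean(axis=0)
    isolated = encode(sentence[i:j]).mean(axis=0)
    denom = np.linalg.norm(in_context) * np.linalg.norm(isolated)
    return float(np.dot(in_context, isolated) / denom)
```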
- Structural Biases for Improving Transformers on Translation into Morphologically Rich Languages [120.74406230847904]
The first approach, TP-Transformer, augments the traditional Transformer architecture with an additional component that represents structure.
The second imbues structure at the data level by segmenting the data with morphological tokenization (sketched below).
We find that both approaches improve performance, but the improvement depends on the size of the dataset.
arXiv Detail & Related papers (2022-08-11T22:42:24Z)
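As a toy illustration of the data-level approach, morphological tokenization replaces whole-word (or purely statistical subword) tokens with linguistically motivated morphemes; the lexicon below is a hypothetical stand-in for a real morphological analyzer.

```python
# Hypothetical lexicon standing in for a real morphological analyzer.
MORPHEMES = {
    "evlerimde": ["ev", "ler", "im", "de"],  # Turkish: house-PL-1SG.POSS-LOC
}

def segment(word: str) -> list:
    """Return the morphological segmentation, or the whole word if unknown."""
    return MORPHEMES.get(word, [word])

print(segment("evlerimde"))  # ['ev', 'ler', 'im', 'de']  ("in my houses")
```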
- Forming Trees with Treeformers [3.8073142980733]
Many state-of-the-art neural network models, such as Transformers, have no explicit hierarchical structure in their architecture.
We introduce Treeformer, a general-purpose encoder module inspired by the CKY algorithm.
Our experiments demonstrate the benefits of incorporating hierarchical structure into the Transformer.
arXiv Detail & Related papers (2022-07-14T14:39:30Z)
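A rough sketch of the CKY-style idea behind such encoders: build representations for longer spans bottom-up from the best-scoring split of each span. The `compose` and `score` callables are placeholders; the actual Treeformer uses learned, differentiable components rather than a hard argmax.

```python
import numpy as np
from typing import Callable, Dict, Tuple

def cky_compose(
    leaves: np.ndarray,                                       # (n, d) token vectors
    compose: Callable[[np.ndarray, np.ndarray], np.ndarray],  # merge two spans
    score: Callable[[np.ndarray], float],                     # goodness of a span
) -> np.ndarray:
    """CKY-style bottom-up encoding: chart[(i, j)] holds a representation
    of span [i, j), built here from the single best-scoring split point."""
    n = len(leaves)
    chart: Dict[Tuple[int, int], np.ndarray] = {
        (i, i + 1): leaves[i] for i in range(n)
    }
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            j = i + length
            candidates = [compose(chart[i, k], chart[k, j])
                          for k in range(i + 1, j)]
            chart[i, j] = max(candidates, key=score)
    return chart[0, n]  # representation of the whole sentence
```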
- Transformers Generalize Linearly [1.7709450506466664]
We examine patterns of structural generalization for Transformer sequence-to-sequence models.
We find that not only do Transformers fail to generalize hierarchically across a wide variety of grammatical mapping tasks, but they exhibit an even stronger preference for linear generalization than comparable networks.
arXiv Detail & Related papers (2021-09-24T15:48:46Z)
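The linear-versus-hierarchical contrast is commonly illustrated with English question formation: a linear rule fronts the first auxiliary in the string, while the hierarchical rule fronts the main-clause auxiliary. A minimal sketch follows (the main-auxiliary index is supplied by hand here; a real data generator would derive it from a parse tree).

```python
AUX = {"can", "does", "will"}

def front_first_aux(tokens: list) -> list:
    """Linear rule: front the first auxiliary in the string."""
    i = next(k for k, w in enumerate(tokens) if w in AUX)
    return [tokens[i]] + tokens[:i] + tokens[i + 1:]

def front_main_aux(tokens: list, main_aux: int) -> list:
    """Hierarchical rule: front the main-clause auxiliary (index given
    by hand here; a real generator would read it off a parse tree)."""
    return [tokens[main_aux]] + tokens[:main_aux] + tokens[main_aux + 1:]

decl = "the dog that can swim does bark".split()
print(" ".join(front_first_aux(decl)))    # can the dog that swim does bark (linear)
print(" ".join(front_main_aux(decl, 5)))  # does the dog that can swim bark (hierarchical)
```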
"Hierarchical Accumulation" encodes parse tree structures into self-attention at constant time complexity.
Our approach outperforms SOTA methods in four IWSLT translation tasks and the WMT'14 English-German translation task.
arXiv Detail & Related papers (2020-02-19T08:17:00Z)
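The paper's Hierarchical Accumulation mechanism is not reproduced here, but one simple way to see how a parse tree can be injected into self-attention is an additive bias that grows with the number of constituents two tokens share, i.e., with their proximity in the tree; the sketch below is illustrative only.

```python
import numpy as np
from typing import List, Tuple

def tree_attention_bias(n: int, constituents: List[Tuple[int, int]]) -> np.ndarray:
    """Additive pre-softmax attention bias from a parse: token pairs that
    share more constituents (closer in the tree) get a larger bias."""
    bias = np.zeros((n, n))
    for i, j in constituents:  # each half-open span [i, j) is a constituent
        bias[i:j, i:j] += 1.0
    return bias

# Parse of "the cat sat" as [[the cat] sat]:
print(tree_attention_bias(3, [(0, 2), (0, 3)]))
# [[2. 2. 1.]
#  [2. 2. 1.]
#  [1. 1. 1.]]
```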