Grokking of Hierarchical Structure in Vanilla Transformers
- URL: http://arxiv.org/abs/2305.18741v1
- Date: Tue, 30 May 2023 04:34:13 GMT
- Title: Grokking of Hierarchical Structure in Vanilla Transformers
- Authors: Shikhar Murty, Pratyusha Sharma, Jacob Andreas, Christopher D. Manning
- Abstract summary: We show that transformer language models can learn to generalize hierarchically after training for extremely long periods. Intermediate-depth models generalize better than both very deep and very shallow transformers.
- Score: 72.45375959893218
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: For humans, language production and comprehension are sensitive to the
hierarchical structure of sentences. In natural language processing, past work
has questioned how effectively neural sequence models like transformers capture
this hierarchical structure when generalizing to structurally novel inputs. We
show that transformer language models can learn to generalize hierarchically
after training for extremely long periods -- far beyond the point when
in-domain accuracy has saturated. We call this phenomenon "structural
grokking". On multiple datasets, structural grokking exhibits inverted U-shaped
scaling in model depth: intermediate-depth models generalize better than both
very deep and very shallow transformers. When analyzing the relationship
between model-internal properties and grokking, we find that optimal depth for
grokking can be identified using the tree-structuredness metric of
Murty et al. (2023). Overall, our work provides strong evidence that,
with extended training, vanilla transformers discover and use hierarchical
structure.
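To make the central claim concrete, below is a minimal sketch of how structural grokking could be tracked experimentally: keep training far past in-domain saturation and log in-domain and out-of-distribution (structurally novel) accuracy separately, watching for the late rise in the latter. The helper callables (`train_step`, `in_domain_acc`, `ood_acc`) are placeholders, not the paper's actual code.

```python
from typing import Callable, List, Tuple

def track_structural_grokking(
    train_step: Callable[[], None],       # one gradient update (placeholder)
    in_domain_acc: Callable[[], float],   # accuracy on held-out in-domain data
    ood_acc: Callable[[], float],         # accuracy on structurally novel data
    total_steps: int = 300_000,
    eval_every: int = 1_000,
) -> List[Tuple[int, float, float]]:
    """Run far past in-domain saturation; structural grokking appears as
    ood_acc rising long after in_domain_acc has plateaued near ceiling."""
    history = []
    for step in range(total_steps):
        train_step()
        if step % eval_every == 0:
            history.append((step, in_domain_acc(), ood_acc()))
    return history
```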
Related papers
- Learning Syntax Without Planting Trees: Understanding When and Why Transformers Generalize Hierarchically [74.96551626420188]
Transformers trained on natural language data have been shown to learn its hierarchical structure and generalize to sentences with unseen syntactic structures.
We investigate sources of inductive bias in transformer models and their training that could cause such generalization behavior to emerge.
arXiv Detail & Related papers (2024-04-25T07:10:29Z)
- How Do Transformers Learn Topic Structure: Towards a Mechanistic Understanding [56.222097640468306]
We provide a mechanistic understanding of how transformers learn "semantic structure".
We show, through a combination of mathematical analysis and experiments on Wikipedia data, that the embedding layer and the self-attention layer encode the topical structure.
arXiv Detail & Related papers (2023-03-07T21:42:17Z)
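One simple way to probe whether an embedding layer encodes topical structure, in the spirit of (though not identical to) that paper's analysis, is to compare average same-topic and cross-topic embedding similarity; the function below is an illustrative sketch.

```python
import itertools
import numpy as np

def topic_coherence_gap(emb: dict, topics: dict) -> float:
    """Mean same-topic minus mean cross-topic cosine similarity of word
    embeddings; a positive gap suggests the embeddings encode topics.
    `emb`: word -> vector, `topics`: topic name -> list of words."""
    def cos(u, v):
        return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

    same = [cos(emb[a], emb[b])
            for words in topics.values()
            for a, b in itertools.combinations(words, 2)]
    cross = [cos(emb[a], emb[b])
             for t1, t2 in itertools.combinations(topics, 2)
             for a in topics[t1] for b in topics[t2]]
    return float(np.mean(same) - np.mean(cross))
```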
- Characterizing Intrinsic Compositionality in Transformers with Tree Projections [72.45375959893218]
Neural models like transformers can route information arbitrarily between different parts of their input.
We show that transformers for three different tasks become more treelike over the course of training.
These trees are predictive of model behavior, with more tree-like models generalizing better on tests of compositional generalization.
arXiv Detail & Related papers (2022-11-02T17:10:07Z)
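A heavily simplified illustration of the treelikeness intuition behind tree projections: a span that acts like a constituent should receive roughly the same representation whether or not its outside context is present. The actual method of Murty et al. (2023) solves an optimization over binary trees; this sketch only computes a per-span context-invariance score, with `encode` as an assumed hook returning one vector per token.

```python
import numpy as np
from typing import Callable, List, Tuple

def span_context_invariance(
    encode: Callable[[List[str]], np.ndarray],  # tokens -> (len, dim) array
    sentence: List[str],
    span: Tuple[int, int],                      # half-open [i, j)
) -> float:
    """Cosine similarity between a span's mean representation computed
    inside the full sentence and the same span encoded in isolation;
    values near 1 mean the span behaves like a self-contained unit."""
    i, j = span
    in_context = encode(sentence)[i:j].mean(axis=0)
    isolated = encode(sentence[i:j]).mean(axis=0)
    denom = np.linalg.norm(in_context) * np.linalg.norm(isolated)
    return float(np.dot(in_context, isolated) / denom)
```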
- Structural Biases for Improving Transformers on Translation into Morphologically Rich Languages [120.74406230847904]
The first approach, TP-Transformer, augments the traditional Transformer architecture with an additional component that represents structure.
The second imbues structure at the data level by segmenting the data with morphological tokenization (sketched below).
We find that both approaches improve performance, but the improvement depends on the size of the dataset.
arXiv Detail & Related papers (2022-08-11T22:42:24Z)
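As a toy illustration of the data-level approach, morphological tokenization replaces whole-word (or purely statistical subword) tokens with linguistically motivated morphemes; the lexicon below is a hypothetical stand-in for a real morphological analyzer.

```python
# Hypothetical lexicon standing in for a real morphological analyzer.
MORPHEMES = {
    "evlerimde": ["ev", "ler", "im", "de"],  # Turkish: house-PL-1SG.POSS-LOC
}

def segment(word: str) -> list:
    """Return the morphological segmentation, or the whole word if unknown."""
    return MORPHEMES.get(word, [word])

print(segment("evlerimde"))  # ['ev', 'ler', 'im', 'de']  ("in my houses")
```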
- Forming Trees with Treeformers [3.8073142980733]
Many state-of-the-art neural network models, such as Transformers, have no explicit hierarchical structure in their architecture.
We introduce Treeformer, a general-purpose encoder module inspired by the CKY algorithm.
Our experiments demonstrate the benefits of incorporating hierarchical structure into the Transformer.
arXiv Detail & Related papers (2022-07-14T14:39:30Z)
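A rough sketch of the CKY-style idea behind such encoders: build representations for longer spans bottom-up from the best-scoring split of each span. The `compose` and `score` callables are placeholders; the actual Treeformer uses learned, differentiable components rather than a hard argmax.

```python
import numpy as np
from typing import Callable, Dict, Tuple

def cky_compose(
    leaves: np.ndarray,                                       # (n, d) token vectors
    compose: Callable[[np.ndarray, np.ndarray], np.ndarray],  # merge two spans
    score: Callable[[np.ndarray], float],                     # goodness of a span
) -> np.ndarray:
    """CKY-style bottom-up encoding: chart[(i, j)] holds a representation
    of span [i, j), built here from the single best-scoring split point."""
    n = len(leaves)
    chart: Dict[Tuple[int, int], np.ndarray] = {
        (i, i + 1): leaves[i] for i in range(n)
    }
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            j = i + length
            candidates = [compose(chart[i, k], chart[k, j])
                          for k in range(i + 1, j)]
            chart[i, j] = max(candidates, key=score)
    return chart[0, n]  # representation of the whole sentence
```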
- Transformers Generalize Linearly [1.7709450506466664]
We examine patterns of structural generalization for Transformer sequence-to-sequence models.
We find that not only do Transformers fail to generalize hierarchically across a wide variety of grammatical mapping tasks, but they exhibit an even stronger preference for linear generalization than comparable networks.
arXiv Detail & Related papers (2021-09-24T15:48:46Z)
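The linear-versus-hierarchical contrast is commonly illustrated with English question formation: a linear rule fronts the first auxiliary in the string, while the hierarchical rule fronts the main-clause auxiliary. A minimal sketch follows (the main-auxiliary index is supplied by hand here; a real data generator would derive it from a parse tree).

```python
AUX = {"can", "does", "will"}

def front_first_aux(tokens: list) -> list:
    """Linear rule: front the first auxiliary in the string."""
    i = next(k for k, w in enumerate(tokens) if w in AUX)
    return [tokens[i]] + tokens[:i] + tokens[i + 1:]

def front_main_aux(tokens: list, main_aux: int) -> list:
    """Hierarchical rule: front the main-clause auxiliary (index given
    by hand here; a real generator would read it off a parse tree)."""
    return [tokens[main_aux]] + tokens[:main_aux] + tokens[main_aux + 1:]

decl = "the dog that can swim does bark".split()
print(" ".join(front_first_aux(decl)))    # can the dog that swim does bark (linear)
print(" ".join(front_main_aux(decl, 5)))  # does the dog that can swim bark (hierarchical)
```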
"Hierarchical Accumulation" encodes parse tree structures into self-attention at constant time complexity.
Our approach outperforms SOTA methods in four IWSLT translation tasks and the WMT'14 English-German translation task.
arXiv Detail & Related papers (2020-02-19T08:17:00Z)
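The paper's Hierarchical Accumulation mechanism is not reproduced here, but one simple way to see how a parse tree can be injected into self-attention is an additive bias that grows with the number of constituents two tokens share, i.e., with their proximity in the tree; the sketch below is illustrative only.

```python
import numpy as np
from typing import List, Tuple

def tree_attention_bias(n: int, constituents: List[Tuple[int, int]]) -> np.ndarray:
    """Additive pre-softmax attention bias from a parse: token pairs that
    share more constituents (closer in the tree) get a larger bias."""
    bias = np.zeros((n, n))
    for i, j in constituents:  # each half-open span [i, j) is a constituent
        bias[i:j, i:j] += 1.0
    return bias

# Parse of "the cat sat" as [[the cat] sat]:
print(tree_attention_bias(3, [(0, 2), (0, 3)]))
# [[2. 2. 1.]
#  [2. 2. 1.]
#  [1. 1. 1.]]
```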