Tree-Planted Transformers: Unidirectional Transformer Language Models with Implicit Syntactic Supervision
- URL: http://arxiv.org/abs/2402.12691v2
- Date: Thu, 6 Jun 2024 13:16:16 GMT
- Title: Tree-Planted Transformers: Unidirectional Transformer Language Models with Implicit Syntactic Supervision
- Authors: Ryo Yoshida, Taiga Someya, Yohei Oseki
- Abstract summary: We propose a new method dubbed tree-planting.
Instead of explicitly generating syntactic structures, we "plant" trees into attention weights of unidirectional Transformer LMs.
Tree-Planted Transformers inherit the training efficiency from SLMs without changing the inference efficiency of their underlying Transformer LMs.
- Score: 4.665860995185884
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Syntactic Language Models (SLMs) can be trained efficiently to reach relatively high performance; however, they have trouble with inference efficiency due to the explicit generation of syntactic structures. In this paper, we propose a new method dubbed tree-planting: instead of explicitly generating syntactic structures, we "plant" trees into attention weights of unidirectional Transformer LMs to implicitly reflect syntactic structures of natural language. Specifically, unidirectional Transformer LMs trained with tree-planting will be called Tree-Planted Transformers (TPT), which inherit the training efficiency from SLMs without changing the inference efficiency of their underlying Transformer LMs. Targeted syntactic evaluations on the SyntaxGym benchmark demonstrated that TPTs, despite the lack of explicit generation of syntactic structures, significantly outperformed not only vanilla Transformer LMs but also various SLMs that generate hundreds of syntactic structures in parallel. This result suggests that TPTs can learn human-like syntactic knowledge as data-efficiently as SLMs while maintaining the modeling space of Transformer LMs unchanged.
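The idea of "planting" a tree into attention weights can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not give the loss, so the sketch assumes that a syntactic tree is turned into a target distribution over visible tokens (closer in the tree means a larger weight) and that an attention head is pulled toward it with a KL-divergence term. The names `causal_softmax`, `tree_targets`, and `planting_loss`, the distance matrix, and the choice to let each token also see itself are all hypothetical.

```python
import math


def causal_softmax(scores):
    """Row-wise softmax in which position i only sees positions j <= i."""
    rows = []
    for i, row in enumerate(scores):
        visible = row[: i + 1]
        top = max(visible)  # subtract the max for numerical stability
        exps = [math.exp(s - top) for s in visible]
        total = sum(exps)
        # Future positions get zero weight (unidirectional attention).
        rows.append([e / total for e in exps] + [0.0] * (len(row) - i - 1))
    return rows


def tree_targets(distances):
    """Assumed supervision: softmax over negated tree distances.

    distances[i][j] is a hypothetical tree distance between tokens i and j.
    """
    return causal_softmax([[-d for d in row] for row in distances])


def planting_loss(attn, target, eps=1e-12):
    """Mean KL(target || attn) over rows, restricted to visible positions."""
    total = 0.0
    for i, (p_row, q_row) in enumerate(zip(target, attn)):
        for j in range(i + 1):
            p, q = p_row[j], q_row[j]
            if p > 0.0:
                total += p * math.log(p / max(q, eps))
    return total / len(attn)


# Toy three-token example with made-up tree distances.
distances = [[0, 1, 2], [1, 0, 1], [2, 1, 0]]
target = tree_targets(distances)
uniform = causal_softmax([[0.0] * 3 for _ in range(3)])
print(planting_loss(uniform, target))  # positive: uniform attention ignores the tree
print(planting_loss(target, target))   # zero: attention already matches the tree
```

In a setup like this the term would be added to the usual next-word prediction loss during training only, which is consistent with the abstract's claim that inference cost is unchanged: at inference time the model is an ordinary unidirectional Transformer LM.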
Related papers
- Tree Transformers are an Ineffective Model of Syntactic Constituency [0.0]
Linguists have long held that a key aspect of natural language syntax is the organization of language units into constituent structures.
A number of alternative models have been proposed to provide inductive biases towards constituency, including the Tree Transformer.
We investigate Tree Transformers to study whether they utilize meaningful and/or useful constituent structures.
arXiv Detail & Related papers (2024-11-25T23:53:46Z)
- Strengthening Structural Inductive Biases by Pre-training to Perform Syntactic Transformations [75.14793516745374]
We propose to strengthen the structural inductive bias of a Transformer by intermediate pre-training.
Our experiments confirm that this helps with few-shot learning of syntactic tasks such as chunking.
Our analysis shows that the intermediate pre-training leads to attention heads that keep track of which syntactic transformation needs to be applied to which token.
arXiv Detail & Related papers (2024-07-05T14:29:44Z)
- Differentiable Tree Operations Promote Compositional Generalization [106.59434079287661]
The Differentiable Tree Machine (DTM) architecture integrates an interpreter with an external memory and an agent that learns to sequentially select tree operations.
DTM achieves 100% while existing baselines such as Transformer, Tree Transformer, LSTM, and Tree2Tree LSTM achieve less than 30%.
arXiv Detail & Related papers (2023-06-01T14:46:34Z)
- Characterizing Intrinsic Compositionality in Transformers with Tree Projections [72.45375959893218]
Neural models like transformers can route information arbitrarily between different parts of their input.
We show that transformers for three different tasks become more treelike over the course of training.
These trees are predictive of model behavior, with more tree-like models generalizing better on tests of compositional generalization.
arXiv Detail & Related papers (2022-11-02T17:10:07Z)
- Syntax-guided Localized Self-attention by Constituency Syntactic Distance [26.141356981833862]
We propose a syntax-guided localized self-attention for Transformer.
It allows grammar structures from an external constituency parser to be incorporated directly.
Experimental results show that our model could consistently improve translation performance.
arXiv Detail & Related papers (2022-10-21T06:37:25Z)
- Transformer Grammars: Augmenting Transformer Language Models with Syntactic Inductive Biases at Scale [31.293175512404172]
We introduce Transformer Grammars -- a class of Transformer language models that combine the expressive power, scalability, and strong performance of Transformers.
We find that Transformer Grammars outperform various strong baselines on multiple syntax-sensitive language modeling evaluation metrics.
arXiv Detail & Related papers (2022-03-01T17:22:31Z)
- Learning Bounded Context-Free-Grammar via LSTM and the Transformer: Difference and Explanations [51.77000472945441]
Long Short-Term Memory (LSTM) and Transformers are two popular neural architectures used for natural language processing tasks.
In practice, it is often observed that Transformer models have better representation power than LSTM.
We study such practical differences between LSTM and Transformer and propose an explanation based on their latent space decomposition patterns.
arXiv Detail & Related papers (2021-12-16T19:56:44Z)
- Causal Transformers Perform Below Chance on Recursive Nested Constructions, Unlike Humans [7.897143833642971]
We test four different Transformer LMs on two different types of nested constructions.
We find that Transformers achieve near-perfect performance on short-range embedded dependencies.
On long-range embedded dependencies, Transformers' performance sharply drops below chance level.
arXiv Detail & Related papers (2021-10-14T09:22:17Z)
- Bayesian Transformer Language Models for Speech Recognition [59.235405107295655]
State-of-the-art neural language models (LMs) represented by Transformers are highly complex.
This paper proposes a full Bayesian learning framework for Transformer LM estimation.
arXiv Detail & Related papers (2021-02-09T10:55:27Z)
- Tree-structured Attention with Hierarchical Accumulation [103.47584968330325]
"Hierarchical Accumulation" encodes parse tree structures into self-attention at constant time complexity.
Our approach outperforms SOTA methods in four IWSLT translation tasks and the WMT'14 English-German translation task.
arXiv Detail & Related papers (2020-02-19T08:17:00Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.