Causal Transformers Perform Below Chance on Recursive Nested
Constructions, Unlike Humans
- URL: http://arxiv.org/abs/2110.07240v1
- Date: Thu, 14 Oct 2021 09:22:17 GMT
- Title: Causal Transformers Perform Below Chance on Recursive Nested
Constructions, Unlike Humans
- Authors: Yair Lakretz, Théo Desbordes, Dieuwke Hupkes, Stanislas Dehaene
- Abstract summary: We test four different Transformer LMs on two different types of nested constructions.
We find that Transformers achieve near-perfect performance on short-range embedded dependencies.
On long-range embedded dependencies, Transformers' performance sharply drops below chance level.
- Score: 7.897143833642971
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recursive processing is considered a hallmark of human linguistic abilities.
A recent study evaluated recursive processing in recurrent neural language
models (RNN-LMs) and showed that such models perform below chance level on
embedded dependencies within nested constructions -- a prototypical example of
recursion in natural language. Here, we study if state-of-the-art Transformer
LMs do any better. We test four different Transformer LMs on two different
types of nested constructions, which differ in whether the embedded (inner)
dependency is short or long range. We find that Transformers achieve
near-perfect performance on short-range embedded dependencies, significantly
better than previous results reported for RNN-LMs and humans. However, on
long-range embedded dependencies, Transformers' performance sharply drops below
chance level. Remarkably, the addition of only three words to the embedded
dependency caused Transformers to fall from near-perfect to below-chance
performance. Taken together, our results reveal a shortcoming of Transformers when
it comes to recursive, structure-based processing.
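As an illustration of the evaluation paradigm described above, the sketch below probes number
agreement on the embedded (inner) dependency of a nested construction by comparing the probability
a causal LM assigns to the grammatical versus the ungrammatical verb form. It is a minimal sketch,
assuming the HuggingFace transformers package and the public gpt2 checkpoint; the sentences and
verb pairs are illustrative stand-ins, not the paper's actual stimuli.

```python
# Minimal sketch of a grammatical-vs-ungrammatical agreement probe, assuming the
# HuggingFace `transformers` package and the public "gpt2" checkpoint. The prompts
# and verb pairs below are illustrative, not the paper's stimuli.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def verb_logprob(prefix: str, verb: str) -> float:
    """Log-probability that `verb` continues `prefix`, summed over the verb's tokens."""
    prefix_ids = tokenizer(prefix, return_tensors="pt").input_ids
    verb_ids = tokenizer(" " + verb).input_ids  # leading space marks a new word for GPT-2 BPE
    ids = torch.cat([prefix_ids, torch.tensor([verb_ids])], dim=1)
    with torch.no_grad():
        logprobs = torch.log_softmax(model(ids).logits, dim=-1)
    start = prefix_ids.shape[1]
    # Each verb token is predicted from the position immediately before it.
    return sum(logprobs[0, start + i - 1, tok].item() for i, tok in enumerate(verb_ids))

# Embedded (inner) dependency: the verb must agree with the singular subject "man",
# not with the plural main-clause subject "keys" (or the extra attractor "cabinets").
prefixes = {
    "short-range": "The keys that the man",
    "long-range": "The keys that the man near the cabinets",  # extra words inside the embedding
}
for name, prefix in prefixes.items():
    ok, bad = verb_logprob(prefix, "holds"), verb_logprob(prefix, "hold")
    print(f"{name} inner dependency: prefers grammatical form = {ok > bad}")
```

Aggregating this comparison over many items and counting the model as correct whenever the
grammatical form receives higher probability is what allows accuracy to be compared against the
chance level referred to in the abstract.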
Related papers
- Bypassing the Exponential Dependency: Looped Transformers Efficiently Learn In-context by Multi-step Gradient Descent [26.764893400499354]
We show that linear looped Transformers can implement multi-step gradient descent efficiently for in-context learning.
Our results demonstrate that as long as the input data has a constant condition number, $n = O(d)$, the linear looped Transformers can achieve a small error.
arXiv Detail & Related papers (2024-10-15T04:44:23Z)
- Transformers are Efficient Compilers, Provably [11.459397066286822]
Transformer-based large language models (LLMs) have demonstrated surprisingly robust performance across a wide range of language-related tasks.
In this paper, we take the first steps towards a formal investigation of using transformers as compilers from an expressive power perspective.
We introduce a representative programming language, Mini-Husky, which encapsulates key features of modern C-like languages.
arXiv Detail & Related papers (2024-10-07T20:31:13Z)
- MoEUT: Mixture-of-Experts Universal Transformers [75.96744719516813]
Universal Transformers (UTs) have advantages over standard Transformers in learning compositional generalizations.
Layer-sharing drastically reduces the parameter count compared to the non-shared model with the same dimensionality.
No previous work has succeeded in proposing a shared-layer Transformer design that is competitive in parameter count-dominated tasks such as language modeling.
arXiv Detail & Related papers (2024-05-25T03:24:32Z)
- Tree-Planted Transformers: Unidirectional Transformer Language Models with Implicit Syntactic Supervision [4.665860995185884]
We propose a new method dubbed tree-planting.
Instead of explicitly generating syntactic structures, we "plant" trees into the attention weights of unidirectional Transformer LMs.
Tree-Planted Transformers inherit the training efficiency of syntactic language models (SLMs) without changing the inference efficiency of their underlying Transformer LMs.
arXiv Detail & Related papers (2024-02-20T03:37:24Z)
- Characterizing Intrinsic Compositionality in Transformers with Tree Projections [72.45375959893218]
Neural models like transformers can route information arbitrarily between different parts of their input.
We show that transformers for three different tasks become more treelike over the course of training.
These trees are predictive of model behavior, with more tree-like models generalizing better on tests of compositional generalization.
arXiv Detail & Related papers (2022-11-02T17:10:07Z)
- Transformer Grammars: Augmenting Transformer Language Models with Syntactic Inductive Biases at Scale [31.293175512404172]
We introduce Transformer Grammars -- a class of Transformer language models that combine the expressive power, scalability, and strong performance of Transformers with recursive syntactic compositions.
We find that Transformer Grammars outperform various strong baselines on multiple syntax-sensitive language modeling evaluation metrics.
arXiv Detail & Related papers (2022-03-01T17:22:31Z)
- Learning Bounded Context-Free-Grammar via LSTM and the Transformer: Difference and Explanations [51.77000472945441]
Long Short-Term Memory (LSTM) and Transformers are two popular neural architectures used for natural language processing tasks.
In practice, it is often observed that Transformer models have better representation power than LSTM.
We study such practical differences between LSTM and Transformer and propose an explanation based on their latent space decomposition patterns.
arXiv Detail & Related papers (2021-12-16T19:56:44Z)
- Revisiting Simple Neural Probabilistic Language Models [27.957834093475686]
This paper revisits the neural probabilistic language model (NPLM) of Bengio et al. (2003).
When scaled up to modern hardware, this model performs much better than expected on word-level language model benchmarks.
Inspired by this result, we modify the Transformer by replacing its first self-attention layer with the NPLM's local concatenation layer.
arXiv Detail & Related papers (2021-04-08T02:18:47Z)
- Finetuning Pretrained Transformers into RNNs [81.72974646901136]
Transformers have outperformed recurrent neural networks (RNNs) in natural language generation.
A linear-complexity recurrent variant has proven well suited for autoregressive generation.
This work aims to convert a pretrained transformer into its efficient recurrent counterpart.
arXiv Detail & Related papers (2021-03-24T10:50:43Z)
- Bayesian Transformer Language Models for Speech Recognition [59.235405107295655]
State-of-the-art neural language models (LMs) represented by Transformers are highly complex.
This paper proposes a full Bayesian learning framework for Transformer LM estimation.
arXiv Detail & Related papers (2021-02-09T10:55:27Z)
- Applying the Transformer to Character-level Transduction [68.91664610425114]
The transformer has been shown to outperform recurrent neural network-based sequence-to-sequence models in various word-level NLP tasks.
We show that with a large enough batch size, the transformer does indeed outperform recurrent models for character-level tasks.
arXiv Detail & Related papers (2020-05-20T17:25:43Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the accuracy of the information provided and is not responsible for any consequences of its use.