Why are Sensitive Functions Hard for Transformers?
- URL: http://arxiv.org/abs/2402.09963v4
- Date: Mon, 27 May 2024 17:01:29 GMT
- Title: Why are Sensitive Functions Hard for Transformers?
- Authors: Michael Hahn, Mark Rofin,
- Abstract summary: We show that under the transformer architecture, the loss landscape is constrained by the input-space sensitivity.
We show theoretically and empirically that this theory unifies a broad array of empirical observations about the learning abilities and biases of transformers.
- Score: 1.0632690677209804
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Empirical studies have identified a range of learnability biases and limitations of transformers, such as a persistent difficulty in learning to compute simple formal languages such as PARITY, and a bias towards low-degree functions. However, theoretical understanding remains limited, with existing expressiveness theory either overpredicting or underpredicting realistic learning abilities. We prove that, under the transformer architecture, the loss landscape is constrained by the input-space sensitivity: Transformers whose output is sensitive to many parts of the input string inhabit isolated points in parameter space, leading to a low-sensitivity bias in generalization. We show theoretically and empirically that this theory unifies a broad array of empirical observations about the learning abilities and biases of transformers, such as their generalization bias towards low sensitivity and low degree, and difficulty in length generalization for PARITY. This shows that understanding transformers' inductive biases requires studying not just their in-principle expressivity, but also their loss landscape.
Related papers
- Strengthening Structural Inductive Biases by Pre-training to Perform Syntactic Transformations [75.14793516745374]
We propose to strengthen the structural inductive bias of a Transformer by intermediate pre-training.
Our experiments confirm that this helps with few-shot learning of syntactic tasks such as chunking.
Our analysis shows that the intermediate pre-training leads to attention heads that keep track of which syntactic transformation needs to be applied to which token.
arXiv Detail & Related papers (2024-07-05T14:29:44Z) - Learning on Transformers is Provable Low-Rank and Sparse: A One-layer Analysis [63.66763657191476]
We show that efficient numerical training and inference algorithms as low-rank computation have impressive performance for learning Transformer-based adaption.
We analyze how magnitude-based models affect generalization while improving adaption.
We conclude that proper magnitude-based has a slight on the testing performance.
arXiv Detail & Related papers (2024-06-24T23:00:58Z) - Grokked Transformers are Implicit Reasoners: A Mechanistic Journey to the Edge of Generalization [22.033370572209744]
We study whether transformers can learn to implicitly reason over parametric knowledge.
We focus on two representative reasoning types, composition and comparison.
We find that transformers can learn implicit reasoning, but only through grokking.
arXiv Detail & Related papers (2024-05-23T21:42:19Z) - Simplicity Bias of Transformers to Learn Low Sensitivity Functions [19.898451497341714]
Transformers achieve state-of-the-art accuracy and robustness across many tasks.
An understanding of the inductive biases that they have and how those biases are different from other neural network architectures remains elusive.
arXiv Detail & Related papers (2024-03-11T17:12:09Z) - On the Convergence of Encoder-only Shallow Transformers [62.639819460956176]
We build the global convergence theory of encoder-only shallow Transformers under a realistic setting.
Our results can pave the way for a better understanding of modern Transformers, particularly on training dynamics.
arXiv Detail & Related papers (2023-11-02T20:03:05Z) - Transformers learn in-context by gradient descent [58.24152335931036]
Training Transformers on auto-regressive objectives is closely related to gradient-based meta-learning formulations.
We show how trained Transformers become mesa-optimizers i.e. learn models by gradient descent in their forward pass.
arXiv Detail & Related papers (2022-12-15T09:21:21Z) - Simplicity Bias in Transformers and their Ability to Learn Sparse
Boolean Functions [29.461559919821802]
Recent works have found that Transformers struggle to model several formal languages when compared to recurrent models.
This raises the question of why Transformers perform well in practice and whether they have any properties that enable them to generalize better than recurrent models.
arXiv Detail & Related papers (2022-11-22T15:10:48Z) - XAI for Transformers: Better Explanations through Conservative
Propagation [60.67748036747221]
We show that the gradient in a Transformer reflects the function only locally, and thus fails to reliably identify the contribution of input features to the prediction.
Our proposal can be seen as a proper extension of the well-established LRP method to Transformers.
arXiv Detail & Related papers (2022-02-15T10:47:11Z) - Pushing the Limits of Rule Reasoning in Transformers through Natural
Language Satisfiability [30.01308882849197]
We propose a new methodology for creating challenging algorithmic reasoning datasets.
Key idea is to draw insights from empirical sampling of hard propositional SAT problems and from complexity-theoretic studies of language.
We find that current transformers, given sufficient training data, are surprisingly robust at solving the resulting NLSat problems.
arXiv Detail & Related papers (2021-12-16T17:47:20Z) - On the Power of Saturated Transformers: A View from Circuit Complexity [87.20342701232869]
We show that saturated transformers transcend the limitations of hard-attention transformers.
The jump from hard to saturated attention can be understood as increasing the transformer's effective circuit depth by a factor of $O(log n)$.
arXiv Detail & Related papers (2021-06-30T17:09:47Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.