Why are Sensitive Functions Hard for Transformers?
- URL: http://arxiv.org/abs/2402.09963v4
- Date: Mon, 27 May 2024 17:01:29 GMT
- Title: Why are Sensitive Functions Hard for Transformers?
- Authors: Michael Hahn, Mark Rofin
- Abstract summary: We show that under the transformer architecture, the loss landscape is constrained by the input-space sensitivity.
We show theoretically and empirically that this theory unifies a broad array of empirical observations about the learning abilities and biases of transformers.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Empirical studies have identified a range of learnability biases and limitations of transformers, such as a persistent difficulty in learning to compute simple formal languages such as PARITY, and a bias towards low-degree functions. However, theoretical understanding remains limited, with existing expressiveness theory either overpredicting or underpredicting realistic learning abilities. We prove that, under the transformer architecture, the loss landscape is constrained by the input-space sensitivity: Transformers whose output is sensitive to many parts of the input string inhabit isolated points in parameter space, leading to a low-sensitivity bias in generalization. We show theoretically and empirically that this theory unifies a broad array of empirical observations about the learning abilities and biases of transformers, such as their generalization bias towards low sensitivity and low degree, and difficulty in length generalization for PARITY. This shows that understanding transformers' inductive biases requires studying not just their in-principle expressivity, but also their loss landscape.
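To make the central quantity concrete, here is a minimal Python sketch (my illustration, not the authors' code) of average sensitivity: the expected number of single-bit flips that change a Boolean function's output on a uniformly random input. PARITY attains the maximum value n, while a threshold function like majority is far less sensitive, which is why the paper's low-sensitivity bias singles PARITY out as hard to learn.

```python
# Minimal sketch of average sensitivity for f: {0,1}^n -> {0,1}:
# the expected number of bit flips that change f's output on a random input.
import itertools

def average_sensitivity(f, n):
    total = 0
    for bits in itertools.product((0, 1), repeat=n):
        x = list(bits)
        for i in range(n):
            y = x.copy()
            y[i] ^= 1  # flip one input bit
            total += f(x) != f(y)
    return total / 2 ** n  # average over all 2^n inputs

n = 8
parity = lambda x: sum(x) % 2           # output flips under every bit flip
majority = lambda x: int(sum(x) > n/2)  # output changes only near the threshold

print(average_sensitivity(parity, n))    # 8.0 -- maximal sensitivity n
print(average_sensitivity(majority, n))  # ~2.19 -- much lower
```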
Related papers
- A distributional simplicity bias in the learning dynamics of transformers [50.91742043564049]
We show that transformers, trained on natural language data, also display a simplicity bias.
Specifically, they sequentially learn many-body interactions among input tokens, reaching a saturation point in the prediction error for low-degree interactions.
This approach opens up the possibility of studying how interactions of different orders in the data affect learning, in natural language processing and beyond. The notion of interaction order is illustrated by the sketch below.
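The "many-body interactions" here correspond to Fourier degree: a Boolean function decomposes into parity terms over subsets S of the inputs, and |S| is the interaction order. A small brute-force sketch (my own illustration, assuming the standard ±1-valued Fourier expansion, not the paper's code) shows where a function's weight sits by degree.

```python
# Fourier weight by degree: a Boolean function decomposes into terms
# chi_S(x) = (-1)^{sum_{i in S} x_i}; the degree |S| is the interaction order.
import itertools

def fourier_weight_by_degree(f, n):
    """Fraction of f's total Fourier weight carried by each interaction order."""
    inputs = list(itertools.product((0, 1), repeat=n))
    weight = [0.0] * (n + 1)
    for S in itertools.chain.from_iterable(
            itertools.combinations(range(n), k) for k in range(n + 1)):
        # Fourier coefficient of the +/-1 valued version of f on subset S
        coeff = sum((-1) ** (f(x) + sum(x[i] for i in S)) for x in inputs) / 2 ** n
        weight[len(S)] += coeff ** 2
    return weight

n = 4
and2 = lambda x: x[0] & x[1]      # a two-body interaction
parity = lambda x: sum(x) % 2     # a pure n-body interaction

print(fourier_weight_by_degree(and2, n))    # weight concentrated on degrees 0-2
print(fourier_weight_by_degree(parity, n))  # all weight on degree n
```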
arXiv Detail & Related papers (2024-10-25T15:39:34Z) - Can Looped Transformers Learn to Implement Multi-step Gradient Descent for In-context Learning? [69.4145579827826]
We show fast convergence of the regression loss despite the non-convexity of the loss landscape.
This is the first theoretical analysis for multi-layer Transformer in this setting.
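As context for the in-context learning claim, the mechanism analyzed is multi-step gradient descent on the in-context regression loss; a minimal numpy sketch (my illustration, not the paper's looped-transformer construction) shows that loss converging quickly, with each "loop" playing one gradient step:

```python
# Ten "loops" = ten gradient steps on the in-context least-squares loss
# L(w) = ||Xw - y||^2 / (2m); the loss shrinks geometrically.
import numpy as np

rng = np.random.default_rng(0)
m, d = 32, 8
X = rng.normal(size=(m, d))
w_star = rng.normal(size=d)
y = X @ w_star

w = np.zeros(d)
eta = 1.0 / np.linalg.eigvalsh(X.T @ X / m).max()  # step size set from the data
for t in range(10):
    grad = X.T @ (X @ w - y) / m
    w -= eta * grad
    print(t, 0.5 * np.mean((X @ w - y) ** 2))
```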
arXiv Detail & Related papers (2024-10-10T18:29:05Z) - A Formal Framework for Understanding Length Generalization in Transformers [14.15513446489798]
We introduce a rigorous theoretical framework to analyze length generalization in causal transformers.
We experimentally validate the theory as a predictor of success and failure of length generalization across a range of algorithmic and formal language tasks.
arXiv Detail & Related papers (2024-10-03T01:52:01Z) - Learning on Transformers is Provable Low-Rank and Sparse: A One-layer Analysis [63.66763657191476]
We show that efficient numerical training and inference algorithms, such as low-rank computation, achieve impressive performance for Transformer-based adaptation.
We analyze how magnitude-based pruning affects generalization while improving adaptation.
We conclude that proper magnitude-based pruning has only a slight effect on testing performance.
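For reference, magnitude-based pruning, one of the techniques analyzed alongside low-rank computation, keeps only the largest-magnitude weights and zeros out the rest; a minimal numpy sketch (hypothetical helper name, not the paper's code):

```python
import numpy as np

def magnitude_prune(W, keep_frac=0.5):
    """Zero all but the top keep_frac fraction of weights by magnitude."""
    k = max(1, int(keep_frac * W.size))
    threshold = np.sort(np.abs(W).ravel())[-k]  # k-th largest magnitude
    return np.where(np.abs(W) >= threshold, W, 0.0)

rng = np.random.default_rng(1)
W = rng.normal(size=(4, 4))
W_pruned = magnitude_prune(W, keep_frac=0.25)
print(np.count_nonzero(W_pruned))  # roughly 25% of entries survive
```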
arXiv Detail & Related papers (2024-06-24T23:00:58Z) - Grokked Transformers are Implicit Reasoners: A Mechanistic Journey to the Edge of Generalization [22.033370572209744]
We study whether transformers can learn to implicitly reason over parametric knowledge.
We focus on two representative reasoning types, composition and comparison.
We find that transformers can learn implicit reasoning, but only through grokking.
arXiv Detail & Related papers (2024-05-23T21:42:19Z) - Simplicity Bias of Transformers to Learn Low Sensitivity Functions [19.898451497341714]
Transformers achieve state-of-the-art accuracy and robustness across many tasks.
An understanding of their inductive biases, and how those biases differ from those of other neural network architectures, remains elusive.
arXiv Detail & Related papers (2024-03-11T17:12:09Z) - On the Convergence of Encoder-only Shallow Transformers [62.639819460956176]
We build the global convergence theory of encoder-only shallow Transformers under a realistic setting.
Our results can pave the way for a better understanding of modern Transformers, particularly their training dynamics.
arXiv Detail & Related papers (2023-11-02T20:03:05Z) - Simplicity Bias in Transformers and their Ability to Learn Sparse Boolean Functions [29.461559919821802]
Recent works have found that Transformers struggle to model several formal languages when compared to recurrent models.
This raises the question of why Transformers perform well in practice and whether they have any properties that enable them to generalize better than recurrent models.
arXiv Detail & Related papers (2022-11-22T15:10:48Z) - XAI for Transformers: Better Explanations through Conservative Propagation [60.67748036747221]
We show that the gradient in a Transformer reflects the function only locally, and thus fails to reliably identify the contribution of input features to the prediction.
Our proposal can be seen as a proper extension of the well-established LRP method to Transformers.
arXiv Detail & Related papers (2022-02-15T10:47:11Z) - On the Power of Saturated Transformers: A View from Circuit Complexity [87.20342701232869]
We show that saturated transformers transcend the limitations of hard-attention transformers.
The jump from hard to saturated attention can be understood as increasing the transformer's effective circuit depth by a factor of $O(\log n)$.
arXiv Detail & Related papers (2021-06-30T17:09:47Z)
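To illustrate the distinction this paper draws: hard attention selects a single maximum-scoring position, while saturated attention averages uniformly over all tied maxima (the limit of softmax as scores are sharpened). A minimal numpy sketch (my illustration, not the paper's code):

```python
import numpy as np

scores = np.array([2.0, 5.0, 5.0, 1.0])  # attention scores with a tie
values = np.array([[1., 0.], [0., 1.], [2., 2.], [3., 3.]])

hard = values[np.argmax(scores)]          # one winner: [0., 1.]
winners = scores == scores.max()
saturated = values[winners].mean(axis=0)  # uniform over ties: [1., 1.5]
print(hard, saturated)
```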