Attention layers provably solve single-location regression
- URL: http://arxiv.org/abs/2410.01537v1
- Date: Wed, 2 Oct 2024 13:28:02 GMT
- Title: Attention layers provably solve single-location regression
- Authors: Pierre Marion, Raphaël Berthier, Gérard Biau, Claire Boyer
- Abstract summary: Attention-based models, such as the Transformer, excel across various tasks but lack a comprehensive theoretical understanding, especially regarding token-wise sparsity and internal linear structures.
We introduce the single-location regression task, where only one token in a sequence determines the output, and its position is a latent random variable, retrievable via a linear projection of the input.
- Score: 12.355792442566681
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Attention-based models, such as the Transformer, excel across various tasks but lack a comprehensive theoretical understanding, especially regarding token-wise sparsity and internal linear representations. To address this gap, we introduce the single-location regression task, where only one token in a sequence determines the output, and its position is a latent random variable, retrievable via a linear projection of the input. To solve this task, we propose a dedicated predictor, which turns out to be a simplified version of a non-linear self-attention layer. We study its theoretical properties by showing its asymptotic Bayes optimality and analyzing its training dynamics. In particular, despite the non-convex nature of the problem, the predictor effectively learns the underlying structure. This work highlights the capacity of attention mechanisms to handle sparse token information and internal linear structures.
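To make the task and the predictor concrete, below is a minimal NumPy sketch under assumed choices: a toy Gaussian data model with orthogonal directions k_star (marking the latent position) and v_star (carrying the regression signal), a softmax score in place of the paper's exact non-linearity, and illustrative hyperparameters. This is an executable caricature of the setting, not the paper's construction.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative hyperparameters (not taken from the paper).
L, d = 10, 20            # sequence length, token dimension
mu = 3.0                 # mean shift that marks the relevant token
n_steps, lr = 30_000, 1e-2

# Assumed ground-truth directions (orthogonal for simplicity):
# k_star marks the relevant position, v_star carries the signal.
k_star = np.eye(d)[0]
v_star = np.eye(d)[1]

def sample():
    """Toy single-location sample: every token is standard Gaussian,
    except one token at a random latent position, which is shifted
    along k_star; the label depends only on that token, via v_star."""
    X = rng.standard_normal((L, d))
    j = rng.integers(L)                # latent location of the signal
    X[j] += mu * k_star                # makes the location linearly retrievable
    return X, X[j] @ v_star

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

# Simplified attention-style predictor with one key and one value vector:
#   yhat(X) = sum_t softmax(X k)_t * <v, x_t>
k = 0.1 * rng.standard_normal(d)
v = 0.1 * rng.standard_normal(d)

for _ in range(n_steps):
    X, y = sample()
    s = softmax(X @ k)                 # attention weights, shape (L,)
    r = X @ v                          # per-token value readouts, shape (L,)
    yhat = s @ r
    g = 2.0 * (yhat - y)               # d(squared loss)/d(yhat)
    # Manual gradients of the squared loss (softmax Jacobian folded in).
    grad_v = g * (X.T @ s)
    grad_k = g * (X.T @ (s * (r - yhat)))
    v -= lr * grad_v
    k -= lr * grad_k

# If training succeeds, k aligns with k_star (location)
# and v with v_star (signal).
print("cos(k, k_star):", k @ k_star / np.linalg.norm(k))
print("cos(v, v_star):", v @ v_star / np.linalg.norm(v))
```

Whether plain SGD on this non-convex objective actually recovers k_star and v_star is precisely the kind of training-dynamics question the paper answers for its predictor.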
Related papers
- Analysis of Overparameterization in Continual Learning under a Linear Model [5.5165579223151795]
We study continual learning and catastrophic forgetting from a theoretical perspective in the simple setting of gradient descent.
We analytically demonstrate that overparameterization alone can mitigate forgetting in the context of a linear regression model.
As part of this work, we establish a non-asymptotic bound of the risk of a single linear regression task, which may be of independent interest to the field of double descent theory.
arXiv Detail & Related papers (2025-02-11T00:15:38Z)
- Learning Differentiable Surrogate Losses for Structured Prediction [23.15754467559003]
We introduce a novel framework in which a structured loss function, parameterized by neural networks, is learned directly from output training data.
As a result, the differentiable loss not only enables the training of neural networks, thanks to the finite dimension of the surrogate space, but also allows for the prediction of new structures in the output data.
arXiv Detail & Related papers (2024-11-18T16:07:47Z)
- A Random Matrix Theory Perspective on the Spectrum of Learned Features and Asymptotic Generalization Capabilities [30.737171081270322]
We study how fully-connected two-layer neural networks adapt to the target function after a single, but aggressive, gradient descent step.
This provides a sharp description of the impact of feature learning in the generalization of two-layer neural networks, beyond the random features and lazy training regimes.
arXiv Detail & Related papers (2024-10-24T17:24:34Z)
- Learning Linear Causal Representations from Interventions under General Nonlinear Mixing [52.66151568785088]
We prove strong identifiability results given unknown single-node interventions without access to the intervention targets.
This is the first instance of causal identifiability from non-paired interventions for deep neural network embeddings.
arXiv Detail & Related papers (2023-06-04T02:32:12Z)
- How does unlabeled data improve generalization in self-training? A one-hidden-layer theoretical analysis [93.37576644429578]
This work establishes the first theoretical analysis for the known iterative self-training paradigm.
We prove the benefits of unlabeled data in both training convergence and generalization ability.
Experiments ranging from shallow to deep neural networks are also provided to corroborate the established theoretical insights on self-training.
arXiv Detail & Related papers (2022-01-21T02:16:52Z)
- Minimax Estimation of Linear Functions of Eigenvectors in the Face of Small Eigen-Gaps [95.62172085878132]
Eigenvector perturbation analysis plays a vital role in various statistical data science applications.
We develop a suite of statistical theory that characterizes the perturbation of arbitrary linear functions of an unknown eigenvector.
In order to mitigate a non-negligible bias issue inherent to the natural "plug-in" estimator, we develop de-biased estimators.
arXiv Detail & Related papers (2021-04-07T17:55:10Z)
- Gradient Starvation: A Learning Proclivity in Neural Networks [97.02382916372594]
Gradient Starvation arises when cross-entropy loss is minimized by capturing only a subset of features relevant for the task.
This work provides a theoretical explanation for the emergence of such feature imbalance in neural networks.
arXiv Detail & Related papers (2020-11-18T18:52:08Z)
- Theoretical Analysis of Self-Training with Deep Networks on Unlabeled Data [48.4779912667317]
Self-training algorithms have been very successful for learning with unlabeled data using neural networks.
This work provides a unified theoretical analysis of self-training with deep networks for semi-supervised learning, unsupervised domain adaptation, and unsupervised learning.
arXiv Detail & Related papers (2020-10-07T19:43:55Z)
- Learning Output Embeddings in Structured Prediction [73.99064151691597]
A powerful and flexible approach to structured prediction consists in embedding the structured objects to be predicted into a feature space of possibly infinite dimension.
A prediction in the original space is computed by solving a pre-image problem.
In this work, we propose to jointly learn a finite approximation of the output embedding and the regression function mapping into the new feature space.
arXiv Detail & Related papers (2020-07-29T09:32:53Z)