What can a Single Attention Layer Learn? A Study Through the Random
Features Lens
- URL: http://arxiv.org/abs/2307.11353v1
- Date: Fri, 21 Jul 2023 05:05:55 GMT
- Title: What can a Single Attention Layer Learn? A Study Through the Random
Features Lens
- Authors: Hengyu Fu, Tianyu Guo, Yu Bai, Song Mei
- Abstract summary: We study the learning and generalization of a single multi-head attention layer, with a sequence of key vectors and a separate query vector as input.
We show that such a random-feature attention layer can express a broad class of target functions that are permutation invariant to the key vectors.
- Score: 27.197540880103325
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Attention layers -- which map a sequence of inputs to a sequence of outputs
-- are core building blocks of the Transformer architecture which has achieved
significant breakthroughs in modern artificial intelligence. This paper
presents a rigorous theoretical study on the learning and generalization of a
single multi-head attention layer, with a sequence of key vectors and a
separate query vector as input. We consider the random feature setting where
the attention layer has a large number of heads, with randomly sampled frozen
query and key matrices, and trainable value matrices. We show that such a
random-feature attention layer can express a broad class of target functions
that are permutation invariant to the key vectors. We further provide
quantitative excess risk bounds for learning these target functions from finite
samples, using random feature attention with finitely many heads.
Our results feature several implications unique to the attention structure
compared with existing random features theory for neural networks, such as (1)
Advantages in the sample complexity over standard two-layer random-feature
networks; (2) Concrete and natural classes of functions that can be learned
efficiently by a random-feature attention layer; and (3) The effect of the
sampling distribution of the query-key weight matrix (the product of the query
and key matrix), where Gaussian random weights with a non-zero mean result in
better sample complexities over the zero-mean counterpart for learning certain
natural target functions. Experiments on simulated data corroborate our
theoretical findings and further illustrate the interplay between the sample
size and the complexity of the target function.
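To make the setting concrete, below is a minimal numpy sketch of a random-feature attention layer in the spirit of the abstract: each head carries a frozen, randomly sampled query-key weight matrix, only the value weights are trained (here by ridge regression on scalar targets), and the demo target is an arbitrary permutation-invariant choice. The exact parameterization, normalization, and training procedure in the paper may differ; all names and constants are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

d, N, M = 8, 16, 256          # key/query dimension, number of keys, number of heads
n_train, n_test = 2000, 500   # sample sizes for the demo

def random_feature_attention(X, q, W):
    """Random-feature attention features for one example.

    X: (N, d) key vectors, q: (d,) query vector, W: (M, d, d) frozen query-key matrices.
    Head m scores the keys with q^T W[m] x_i and returns their softmax-weighted
    average; the output stacks all M heads into an (M, d) array.
    """
    logits = np.einsum('d,mde,ne->mn', q, W, X)      # (M, N) attention scores
    logits -= logits.max(axis=1, keepdims=True)      # numerical stability
    attn = np.exp(logits)
    attn /= attn.sum(axis=1, keepdims=True)          # softmax over the N keys
    return attn @ X                                   # (M, d) attended keys, one row per head

# Illustrative target: permutation invariant in the keys, modulated by the query
# (not a function taken from the paper).
def target(X, q):
    return np.mean((X @ q) ** 2)

# Frozen query-key matrices. `mean_shift` > 0 gives Gaussian weights with a nonzero
# mean (here a multiple of the identity, an illustrative choice); set it to 0.0 for
# the zero-mean sampling the abstract compares against.
mean_shift = 1.0
W = rng.normal(size=(M, d, d)) / np.sqrt(d) + mean_shift * np.eye(d)

def make_dataset(n):
    feats, ys = [], []
    for _ in range(n):
        X, q = rng.normal(size=(N, d)), rng.normal(size=(d,))
        feats.append(random_feature_attention(X, q, W).reshape(-1))   # (M*d,) features
        ys.append(target(X, q))
    return np.stack(feats), np.array(ys)

Phi_tr, y_tr = make_dataset(n_train)
Phi_te, y_te = make_dataset(n_test)

# Only the value weights are trained: ridge regression on the random attention features.
lam = 1e-3
V = np.linalg.solve(Phi_tr.T @ Phi_tr + lam * np.eye(Phi_tr.shape[1]), Phi_tr.T @ y_tr)
print("test MSE:", np.mean((Phi_te @ V - y_te) ** 2))
```

Varying `mean_shift`, the number of heads M, and the training set size in this sketch is one simple way to probe the interplay between sample size and target complexity that the abstract alludes to.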
Related papers
- A Random Matrix Theory Perspective on the Spectrum of Learned Features and Asymptotic Generalization Capabilities [30.737171081270322]
We study how fully-connected two-layer neural networks adapt to the target function after a single, but aggressive, gradient descent step.
This provides a sharp description of the impact of feature learning in the generalization of two-layer neural networks, beyond the random features and lazy training regimes.
arXiv Detail & Related papers (2024-10-24T17:24:34Z)
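The entry above studies two-layer networks after a single, aggressive gradient step. As a rough illustration of that training protocol, here is a numpy sketch that takes one large-learning-rate gradient step on the hidden-layer weights and then fits only a ridge readout on the updated features; the paper's scalings and asymptotic analysis are not reproduced, and the target function and step size are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
d, width, n = 32, 256, 4096          # input dimension, hidden width, training samples

# Illustrative single-index target: depends on one hidden direction only.
w_star = rng.normal(size=d) / np.sqrt(d)
def target(X):
    return np.tanh(X @ w_star)

X = rng.normal(size=(n, d))
y = target(X)

W = rng.normal(size=(width, d)) / np.sqrt(d)    # first-layer weights (updated once below)
a = rng.normal(size=width) / np.sqrt(width)     # second-layer weights, kept fixed here

def hidden(X, W):
    return np.maximum(X @ W.T, 0.0)             # ReLU features, shape (samples, width)

# One "aggressive" (large learning rate) gradient step on W for the squared loss.
eta = 10.0
H = hidden(X, W)
resid = H @ a - y                                               # (n,)
grad_W = ((resid[:, None] * (H > 0) * a).T @ X) / n             # (width, d)
W1 = W - eta * grad_W

# After the single step, train only a linear readout on the updated features (ridge).
H1 = hidden(X, W1)
lam = 1e-3
b = np.linalg.solve(H1.T @ H1 + lam * np.eye(width), H1.T @ y)

X_te = rng.normal(size=(1000, d))
print("test MSE after one step:", np.mean((hidden(X_te, W1) @ b - target(X_te)) ** 2))
```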
- Unveiling Induction Heads: Provable Training Dynamics and Feature Learning in Transformers [54.20763128054692]
We study how a two-attention-layer transformer is trained to perform in-context learning (ICL) on $n$-gram Markov chain data.
We prove that the gradient flow with respect to a cross-entropy ICL loss converges to a limiting model.
arXiv Detail & Related papers (2024-09-09T18:10:26Z)
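For the induction-heads entry above, the data distribution is the concrete ingredient: sequences drawn from sequence-specific $n$-gram Markov chains. Below is an illustrative numpy recipe for generating such bigram ICL data, together with a simple in-context bigram-counting baseline (roughly what an induction-head mechanism is thought to approximate); it is not necessarily the exact construction used in the paper.

```python
import numpy as np

rng = np.random.default_rng(2)
vocab, seq_len, n_seqs = 5, 64, 1000   # alphabet size, context length, dataset size

def sample_bigram_sequence(vocab, seq_len, rng, alpha=0.5):
    """One ICL example: draw a fresh bigram transition matrix from a Dirichlet
    prior, then roll out a Markov chain with it."""
    P = rng.dirichlet(alpha * np.ones(vocab), size=vocab)   # rows of P sum to 1
    tokens = np.empty(seq_len, dtype=np.int64)
    tokens[0] = rng.integers(vocab)
    for t in range(1, seq_len):
        tokens[t] = rng.choice(vocab, p=P[tokens[t - 1]])
    return tokens, P

# The ICL task: given the context tokens[:-1], predict tokens[-1]; a good predictor
# must estimate the sequence-specific transition statistics from the context itself.
contexts, targets = [], []
for _ in range(n_seqs):
    toks, _ = sample_bigram_sequence(vocab, seq_len, rng)
    contexts.append(toks[:-1])
    targets.append(toks[-1])
contexts, targets = np.stack(contexts), np.array(targets)

# In-context baseline: empirical bigram counts from each individual context.
def bigram_predict(ctx, vocab):
    counts = np.ones((vocab, vocab))               # add-one smoothing
    np.add.at(counts, (ctx[:-1], ctx[1:]), 1.0)
    return np.argmax(counts[ctx[-1]])

preds = np.array([bigram_predict(c, vocab) for c in contexts])
print("in-context bigram baseline accuracy:", np.mean(preds == targets))
```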
- Going Beyond Neural Network Feature Complexity and Its Interpretation Using Category Theory [64.06519549649495]
We provide the definition of what we call functionally equivalent features.
These features produce equivalent output under certain transformations.
We propose an efficient algorithm named Iterative Feature Merging.
arXiv Detail & Related papers (2023-10-10T16:27:12Z)
- Learning Active Subspaces and Discovering Important Features with Gaussian Radial Basis Functions Neural Networks [0.0]
We show that precious information is contained in the spectrum of the precision matrix that can be extracted once the training of the model is completed.
We conducted numerical experiments for regression, classification, and feature selection tasks.
Our results demonstrate that the proposed model not only yields attractive prediction performance compared to the competitors, but also identifies the important input features.
arXiv Detail & Related papers (2023-07-11T09:54:30Z)
- Deterministic equivalent and error universality of deep random features learning [4.8461049669050915]
This problem can be seen as a natural generalization of the widely studied random features model to deeper architectures.
First, we prove universality of the test error in a ridge regression setting where the learner and target networks share the same intermediate layers, and provide a sharp formula for it.
Second, we conjecture the universality of the test error in the more general setting of arbitrary convex losses and generic learner/target architectures.
arXiv Detail & Related papers (2023-02-01T12:37:10Z)
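The deep random features model in the entry above has a simple operational form: a multilayer network whose intermediate weights are random and frozen, with only the final linear readout trained. The numpy sketch below sets this up with a target that shares the same intermediate layers, as in the ridge setting mentioned there; the universality results and precise scalings are not reproduced, and all constants are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
d, widths, n = 20, [200, 200, 200], 3000   # input dim, frozen hidden layers, samples

# Frozen random weights for every intermediate layer (never trained).
Ws, fan_in = [], d
for w in widths:
    Ws.append(rng.normal(size=(w, fan_in)) / np.sqrt(fan_in))
    fan_in = w

def deep_random_features(X, Ws):
    """Propagate inputs through the frozen random layers with tanh activations."""
    H = X
    for W in Ws:
        H = np.tanh(H @ W.T)
    return H                                        # (num samples, widths[-1])

# Target sharing the same intermediate layers, with an unknown readout and label noise.
theta_star = rng.normal(size=widths[-1]) / np.sqrt(widths[-1])
def target(X):
    return deep_random_features(X, Ws) @ theta_star + 0.1 * rng.normal(size=len(X))

X_tr, X_te = rng.normal(size=(n, d)), rng.normal(size=(1000, d))
y_tr, y_te = target(X_tr), target(X_te)

# Only the readout is learned, via ridge regression on the deep random features.
Phi_tr, Phi_te = deep_random_features(X_tr, Ws), deep_random_features(X_te, Ws)
lam = 1e-2
b = np.linalg.solve(Phi_tr.T @ Phi_tr + lam * np.eye(Phi_tr.shape[1]), Phi_tr.T @ y_tr)
print("deep random features test MSE:", np.mean((Phi_te @ b - y_te) ** 2))
```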
- Random Ferns for Semantic Segmentation of PolSAR Images [0.0]
This paper extends the Random Fern framework to the semantic segmentation of polarimetric synthetic aperture radar images.
Two distinct optimization strategies are proposed.
Experiments show that the method achieves results similar to those of a more complex Random Forest model.
arXiv Detail & Related papers (2022-02-07T20:22:57Z)
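As background for the Random Ferns entry above: a fern is a small group of random binary tests whose joint outcome indexes a Laplace-smoothed class-posterior table, and many ferns are combined in a semi-naive-Bayes fashion. The sketch below is a generic random-fern classifier on feature vectors, not the paper's PolSAR segmentation pipeline or its two optimization strategies; all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)

class RandomFerns:
    """Ensemble of ferns; each fern is a small set of random threshold tests whose
    joint binary outcome indexes a class-conditional count table."""

    def __init__(self, n_ferns=30, fern_size=6, n_classes=3, rng=rng):
        self.n_ferns, self.fern_size, self.n_classes, self.rng = n_ferns, fern_size, n_classes, rng

    def fit(self, X, y):
        n, d = X.shape
        self.feat = self.rng.integers(d, size=(self.n_ferns, self.fern_size))
        lo, hi = X.min(axis=0), X.max(axis=0)
        self.thr = self.rng.uniform(lo[self.feat], hi[self.feat])   # random thresholds
        codes = self._codes(X)                                       # (n, n_ferns)
        self.tables = np.ones((self.n_ferns, 2 ** self.fern_size, self.n_classes))
        for f in range(self.n_ferns):
            np.add.at(self.tables[f], (codes[:, f], y), 1.0)         # Laplace-smoothed counts
        self.tables /= self.tables.sum(axis=2, keepdims=True)        # class posteriors per code
        return self

    def _codes(self, X):
        bits = (X[:, self.feat] > self.thr).astype(np.int64)         # (n, n_ferns, fern_size)
        return bits @ (2 ** np.arange(self.fern_size))               # binary code per fern

    def predict(self, X):
        codes = self._codes(X)
        log_post = np.zeros((len(X), self.n_classes))
        for f in range(self.n_ferns):                                # semi-naive-Bayes combination
            log_post += np.log(self.tables[f, codes[:, f]])
        return log_post.argmax(axis=1)

# Tiny synthetic demo: three Gaussian blobs.
centers = np.array([[0, 0], [3, 3], [0, 3]], dtype=float)
X = np.vstack([c + rng.normal(size=(200, 2)) for c in centers])
y = np.repeat(np.arange(3), 200)
model = RandomFerns().fit(X, y)
print("training accuracy:", np.mean(model.predict(X) == y))
```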
- Understanding Interlocking Dynamics of Cooperative Rationalization [90.6863969334526]
Selective rationalization explains the predictions of complex neural networks by finding a small subset of the input that is sufficient to predict the model output.
We reveal a major problem with such a cooperative rationalization paradigm -- model interlocking.
We propose a new rationalization framework, called A2R, which introduces a third component into the architecture, a predictor driven by soft attention as opposed to selection.
arXiv Detail & Related papers (2021-10-26T17:39:18Z)
- Inductive Biases and Variable Creation in Self-Attention Mechanisms [25.79946667926312]
This work provides a theoretical analysis of the inductive biases of self-attention modules.
Our focus is to rigorously establish which functions and long-range dependencies self-attention blocks prefer to represent.
Our main result shows that bounded-norm Transformer layers create sparse variables.
arXiv Detail & Related papers (2021-10-19T16:36:19Z)
- Intersection Regularization for Extracting Semantic Attributes [72.53481390411173]
We consider the problem of supervised classification, such that the features that the network extracts match an unseen set of semantic attributes.
For example, when learning to classify images of birds into species, we would like to observe the emergence of features that zoologists use to classify birds.
We propose training a neural network with discrete top-level activations, which is followed by a multi-layered perceptron (MLP) and a parallel decision tree.
arXiv Detail & Related papers (2021-03-22T14:32:44Z)
- Probabilistic Simplex Component Analysis [66.30587591100566]
PRISM is a probabilistic simplex component analysis approach to identifying the vertices of a data-circumscribing simplex from data.
The problem has a rich variety of applications, the most notable being hyperspectral unmixing in remote sensing and non-negative matrix factorization in machine learning.
arXiv Detail & Related papers (2021-03-18T05:39:00Z)
- A Trainable Optimal Transport Embedding for Feature Aggregation and its Relationship to Attention [96.77554122595578]
We introduce a parametrized representation of fixed size, which embeds and then aggregates elements from a given input set according to the optimal transport plan between the set and a trainable reference.
Our approach scales to large datasets and allows end-to-end training of the reference, while also providing a simple unsupervised learning mechanism with small computational cost.
arXiv Detail & Related papers (2020-06-22T08:35:58Z)
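To illustrate the optimal-transport aggregation in the last entry: a variable-sized set of feature vectors is matched to a fixed-size reference via an entropy-regularized transport plan (computed with Sinkhorn iterations), and each reference slot pools a transport-weighted average of the set, much like attention. The numpy sketch below uses a fixed reference and squared Euclidean costs; the paper's trainable reference, kernel embedding, and end-to-end training are not reproduced, and all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(5)

def sinkhorn_plan(C, eps=0.1, n_iter=200):
    """Entropy-regularized OT plan between uniform marginals for a cost matrix C (n, p)."""
    n, p = C.shape
    K = np.exp(-C / eps)
    a, b = np.ones(n) / n, np.ones(p) / p        # uniform source and target marginals
    u, v = np.ones(n) / n, np.ones(p) / p
    for _ in range(n_iter):
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]           # transport plan, rows index set elements

def ot_aggregate(X, Z, eps=0.1):
    """Pool a set X (n, d) into p slots: each reference element z_j receives a
    transport-weighted average of the set elements (an attention-like pooling)."""
    C = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)    # squared Euclidean costs (n, p)
    P = sinkhorn_plan(C, eps)                              # (n, p)
    weights = P / P.sum(axis=0, keepdims=True)             # normalize per reference slot
    return weights.T @ X                                    # (p, d) fixed-size embedding

# Sets of different sizes map to embeddings of the same fixed size (p, d).
d, p = 4, 3
Z = rng.normal(size=(p, d))                                # reference (trainable in the paper)
for n in (7, 25):
    X = rng.normal(size=(n, d))
    print(n, "->", ot_aggregate(X, Z).shape)
```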