On the Expressive Power of Self-Attention Matrices
- URL: http://arxiv.org/abs/2106.03764v2
- Date: Tue, 8 Jun 2021 10:02:26 GMT
- Title: On the Expressive Power of Self-Attention Matrices
- Authors: Valerii Likhosherstov, Krzysztof Choromanski, Adrian Weller
- Abstract summary: We show that a fixed self-attention module can approximate arbitrary sparse patterns depending on the input.
We propose an algorithm for finding adaptive inputs and fixed self-attention parameters in order to approximate a given matrix.
- Score: 41.72598286888797
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Transformer networks are able to capture patterns in data coming from many
domains (text, images, videos, proteins, etc.) with little or no change to
architecture components. We perform a theoretical analysis of the core
component responsible for signal propagation between elements, i.e. the
self-attention matrix. In practice, this matrix typically exhibits two
properties: (1) it is sparse, meaning that each token only attends to a small
subset of other tokens; and (2) it changes dynamically depending on the input
to the module. With these considerations in mind, we ask the following
question: Can a fixed self-attention module approximate arbitrary sparse
patterns depending on the input? How small is the hidden size $d$ required for
such approximation? We make progress in answering this question and show that
the self-attention matrix can provably approximate sparse matrices, where
sparsity is in terms of a bounded number of nonzero elements in each row and
column. While the parameters of self-attention are fixed, various sparse
matrices can be approximated by only modifying the inputs. Our proof is based
on the random projection technique and uses the seminal Johnson-Lindenstrauss
lemma. Our proof is constructive, enabling us to propose an algorithm for
finding adaptive inputs and fixed self-attention parameters in order to
approximate a given matrix. In particular, we show that, in order to
approximate any sparse matrix up to a given precision defined in terms of
preserving matrix element ratios, $d$ grows only logarithmically with the
sequence length $L$ (i.e. $d = O(\log L)$).
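To make the claim concrete, here is a minimal numpy sketch of the random-projection intuition behind the result. It is not the paper's construction (which keeps the self-attention parameters fixed and encodes the target pattern purely in the adaptive inputs); the sequence length L, hidden size d, sharpness constant c, and the particular sparse pattern below are hypothetical choices for illustration. Each token gets a near-orthogonal random code in the Johnson-Lindenstrauss spirit, the query of token i sums the codes of the positions it should attend to, and the row-wise softmax then concentrates its mass on that sparse set.

```python
import numpy as np

rng = np.random.default_rng(0)

L, d, c = 16, 64, 12.0   # sequence length, hidden size, logit sharpness (illustrative values)
# Hypothetical target pattern: token i attends only to positions i and (i + 1) mod L.
targets = [{i, (i + 1) % L} for i in range(L)]

# Near-orthogonal random token codes; JL-style arguments are what let d stay O(log L).
U = rng.normal(size=(L, d))
U /= np.linalg.norm(U, axis=1, keepdims=True)

K = U                                                          # key of token j is its code
Q = np.stack([c * U[sorted(S)].sum(axis=0) for S in targets])  # query of token i sums codes of S_i

logits = Q @ K.T                     # roughly c on target positions, roughly 0 elsewhere
A = np.exp(logits)
A /= A.sum(axis=1, keepdims=True)    # row-wise softmax: the (toy) self-attention matrix

# Fraction of each row's attention mass landing on the intended sparse positions.
mass = np.array([A[i, sorted(S)].sum() for i, S in enumerate(targets)])
print(mass.min().round(3))           # close to 1 for moderate c and d
```

Editing the list `targets` changes which sparse pattern is approximated while the recipe itself stays fixed, mirroring at an intuitive level the paper's point that a fixed module with adaptive inputs suffices; the precise construction and the $d = O(\log L)$ bound come from the constructive proof in the paper.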
Related papers
- Optimal Quantization for Matrix Multiplication [35.007966885532724]
We present a universal quantizer based on nested lattices with an explicit guarantee of approximation error for any (non-random) pair of matrices $A$, $B$ in terms of only the Frobenius norms $\|A\|_F$, $\|B\|_F$ and $\|A^\top B\|_F$.
arXiv Detail & Related papers (2024-10-17T17:19:48Z) - Conv-Basis: A New Paradigm for Efficient Attention Inference and Gradient Computation in Transformers [16.046186753149]
The self-attention mechanism is key to the success of transformers in recent Large Language Models (LLMs).
We leverage the convolution-like structure of attention matrices to develop an efficient approximation method for attention using convolution matrices (an illustrative Toeplitz/FFT sketch of this idea appears after this list).
We hope our new paradigm for accelerating attention computation in transformer models can help their application to longer contexts.
arXiv Detail & Related papers (2024-05-08T17:11:38Z) - One-sided Matrix Completion from Two Observations Per Row [95.87811229292056]
We propose a natural algorithm that involves imputing the missing values of the matrix $X^T X$.
We evaluate our algorithm on one-sided recovery of synthetic data and low-coverage genome sequencing.
arXiv Detail & Related papers (2023-06-06T22:35:16Z) - Sparse Factorization of Large Square Matrices [10.94053598642913]
In this paper, we propose to approximate a large square matrix with a product of sparse full-rank matrices.
In the approximation, our method needs only $N (\log N)^2$ non-zero numbers for an $N \times N$ full matrix.
We show that our method gives a better approximation when the approximated matrix is sparse and high-rank.
arXiv Detail & Related papers (2021-09-16T18:42:21Z) - Non-PSD Matrix Sketching with Applications to Regression and Optimization [56.730993511802865]
We present dimensionality reduction methods for non-PSD and "square-roots" matrices.
We show how these techniques can be used for multiple downstream tasks.
arXiv Detail & Related papers (2021-06-16T04:07:48Z) - Linear-Sample Learning of Low-Rank Distributions [56.59844655107251]
We show that learning $k \times k$, rank-$r$ matrices to normalized $L_1$ distance requires $\Omega(\frac{kr}{\epsilon^2})$ samples.
We propose an algorithm that uses $\mathcal{O}(\frac{kr}{\epsilon^2} \log^2 \frac{r}{\epsilon})$ samples, a number linear in the high dimension and nearly linear in the matrices' typically low rank.
arXiv Detail & Related papers (2020-09-30T19:10:32Z) - Compressed sensing of low-rank plus sparse matrices [3.8073142980733]
This manuscript develops similar guarantees showing that an $m \times n$ matrix that can be expressed as the sum of a rank-$r$ matrix and an $s$-sparse matrix can be recovered by computationally tractable methods.
Results are shown for synthetic problems, dynamic-foreground/static-background separation, multispectral imaging, and Robust PCA.
arXiv Detail & Related papers (2020-07-18T15:36:11Z) - Spectral Learning on Matrices and Tensors [74.88243719463053]
We show that tensor decomposition can pick up latent effects that are missed by matrix methods.
We also outline computational techniques to design efficient tensor decomposition methods.
arXiv Detail & Related papers (2020-04-16T22:53:00Z) - Sketching Transformed Matrices with Applications to Natural Language Processing [76.6222695417524]
We propose a space-efficient sketching algorithm for computing the product of a given small matrix with the transformed matrix.
We show that our approach obtains small error and is efficient in both space and time.
arXiv Detail & Related papers (2020-02-23T03:07:31Z)
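As a companion to the Conv-Basis entry above: the snippet below is not that paper's algorithm, only a generic illustration of why convolution-like structure makes attention-style matrices cheap to apply. Multiplying by an $L \times L$ lower-triangular Toeplitz (causal convolution) matrix can be done in $O(L \log L)$ time with the FFT instead of $O(L^2)$ with a dense matrix-vector product; the kernel and vector below are arbitrary stand-ins.

```python
import numpy as np

rng = np.random.default_rng(1)
L = 8
kernel = rng.normal(size=L)      # first column of a lower-triangular Toeplitz matrix
v = rng.normal(size=L)           # one value-vector column

# Dense causal-convolution matrix: C[i, j] = kernel[i - j] for j <= i, else 0.
C = np.array([[kernel[i - j] if j <= i else 0.0 for j in range(L)] for i in range(L)])
dense = C @ v                    # O(L^2) matrix-vector product

# Same product via FFT in O(L log L); zero-pad to 2L to avoid circular wrap-around.
n = 2 * L
fast = np.fft.irfft(np.fft.rfft(kernel, n) * np.fft.rfft(v, n), n)[:L]

print(np.allclose(dense, fast))  # True: the Toeplitz matvec equals a linear convolution
```

A combination of several such convolution matrices would give the kind of structured approximation the Conv-Basis summary describes; the details of that decomposition are the paper's own.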
This list is automatically generated from the titles and abstracts of the papers in this site.