How to Capture Higher-order Correlations? Generalizing Matrix Softmax
Attention to Kronecker Computation
- URL: http://arxiv.org/abs/2310.04064v1
- Date: Fri, 6 Oct 2023 07:42:39 GMT
- Title: How to Capture Higher-order Correlations? Generalizing Matrix Softmax
Attention to Kronecker Computation
- Authors: Josh Alman, Zhao Song
- Abstract summary: We study a generalization of attention which captures triple-wise correlations.
This generalization is able to solve problems about detecting triple-wise connections that were shown to be impossible for transformers.
We show that our construction, algorithms, and lower bounds naturally generalize to higher-order tensors and correlations.
- Score: 12.853829771559916
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: In the classical transformer attention scheme, we are given three $n \times
d$ size matrices $Q, K, V$ (the query, key, and value tokens), and the goal is
to compute a new $n \times d$ size matrix $D^{-1} \exp(QK^\top) V$ where $D =
\mathrm{diag}( \exp(QK^\top) {\bf 1}_n )$. In this work, we study a
generalization of attention which captures triple-wise correlations. This
generalization is able to solve problems about detecting triple-wise
connections that were shown to be impossible for transformers. The potential
downside of this generalization is that it appears as though computations are
even more difficult, since the straightforward algorithm requires cubic time in
$n$. However, we show that in the bounded-entry setting (which arises in
practice, and which is well-studied in both theory and practice), there is
actually a near-linear time algorithm. More precisely, we show that bounded
entries are both necessary and sufficient for quickly performing generalized
computations:
$\bullet$ On the positive side, if all entries of the input matrices are
bounded above by $o(\sqrt[3]{\log n})$ then we show how to approximate the
``tensor-type'' attention matrix in $n^{1+o(1)}$ time.
$\bullet$ On the negative side, we show that if the entries of the input
matrices may be as large as $\Omega(\sqrt[3]{\log n})$, then there is no
algorithm that runs faster than $n^{3-o(1)}$ (assuming the Strong Exponential
Time Hypothesis from fine-grained complexity theory).
We also show that our construction, algorithms, and lower bounds naturally
generalize to higher-order tensors and correlations. Interestingly, the higher
the order of the tensors, the lower the bound on the entries needs to be for an
efficient algorithm. Our results thus yield a natural tradeoff between the
boundedness of the entries, and order of the tensor one may use for more
expressive, efficient attention computation.
Related papers
- Overcomplete Tensor Decomposition via Koszul-Young Flattenings [63.01248796170617]
We give a new algorithm for decomposing an $n_times n times n_3$ tensor as the sum of a minimal number of rank-1 terms.
We show that an even more general class of degree-$d$s cannot surpass rank $Cn$ for a constant $C = C(d)$.
arXiv Detail & Related papers (2024-11-21T17:41:09Z) - Optimal Sketching for Residual Error Estimation for Matrix and Vector Norms [50.15964512954274]
We study the problem of residual error estimation for matrix and vector norms using a linear sketch.
We demonstrate that this gives a substantial advantage empirically, for roughly the same sketch size and accuracy as in previous work.
We also show an $Omega(k2/pn1-2/p)$ lower bound for the sparse recovery problem, which is tight up to a $mathrmpoly(log n)$ factor.
arXiv Detail & Related papers (2024-08-16T02:33:07Z) - Conv-Basis: A New Paradigm for Efficient Attention Inference and Gradient Computation in Transformers [16.046186753149]
Self-attention mechanism is the key to the success of transformers in recent Large Language Models (LLMs)
We leverage the convolution-like structure of attention matrices to develop an efficient approximation method for attention using convolution matrices.
We hope our new paradigm for accelerating attention computation in transformer models can help their application to longer contexts.
arXiv Detail & Related papers (2024-05-08T17:11:38Z) - Provably learning a multi-head attention layer [55.2904547651831]
Multi-head attention layer is one of the key components of the transformer architecture that sets it apart from traditional feed-forward models.
In this work, we initiate the study of provably learning a multi-head attention layer from random examples.
We prove computational lower bounds showing that in the worst case, exponential dependence on $m$ is unavoidable.
arXiv Detail & Related papers (2024-02-06T15:39:09Z) - Efficiently Learning One-Hidden-Layer ReLU Networks via Schur
Polynomials [50.90125395570797]
We study the problem of PAC learning a linear combination of $k$ ReLU activations under the standard Gaussian distribution on $mathbbRd$ with respect to the square loss.
Our main result is an efficient algorithm for this learning task with sample and computational complexity $(dk/epsilon)O(k)$, whereepsilon>0$ is the target accuracy.
arXiv Detail & Related papers (2023-07-24T14:37:22Z) - Fast Attention Requires Bounded Entries [19.17278873525312]
inner product attention computation is a fundamental task for training large language models such as Transformer, GPT-1, BERT, GPT-2, GPT-3 and ChatGPT.
We investigate whether faster algorithms are possible by implicitly making use of the matrix $A$.
This gives a theoretical explanation for the phenomenon observed in practice that attention computation is much more efficient when the input matrices have smaller entries.
arXiv Detail & Related papers (2023-02-26T02:42:39Z) - Average-Case Complexity of Tensor Decomposition for Low-Degree
Polynomials [93.59919600451487]
"Statistical-computational gaps" occur in many statistical inference tasks.
We consider a model for random order-3 decomposition where one component is slightly larger in norm than the rest.
We show that tensor entries can accurately estimate the largest component when $ll n3/2$ but fail to do so when $rgg n3/2$.
arXiv Detail & Related papers (2022-11-10T00:40:37Z) - Optimal Query Complexities for Dynamic Trace Estimation [59.032228008383484]
We consider the problem of minimizing the number of matrix-vector queries needed for accurate trace estimation in the dynamic setting where our underlying matrix is changing slowly.
We provide a novel binary tree summation procedure that simultaneously estimates all $m$ traces up to $epsilon$ error with $delta$ failure probability.
Our lower bounds (1) give the first tight bounds for Hutchinson's estimator in the matrix-vector product model with Frobenius norm error even in the static setting, and (2) are the first unconditional lower bounds for dynamic trace estimation.
arXiv Detail & Related papers (2022-09-30T04:15:44Z) - Approximate Multiplication of Sparse Matrices with Limited Space [24.517908972536432]
We develop sparse co-occuring directions, which reduces the time complexity to $widetildeOleft((nnz(X)+nnz(Y))ell+nell2right)$ in expectation.
Theoretical analysis reveals that the approximation error of our algorithm is almost the same as that of COD.
arXiv Detail & Related papers (2020-09-08T05:39:19Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.