Tensor Attention Training: Provably Efficient Learning of Higher-order Transformers
- URL: http://arxiv.org/abs/2405.16411v1
- Date: Sun, 26 May 2024 02:59:13 GMT
- Title: Tensor Attention Training: Provably Efficient Learning of Higher-order Transformers
- Authors: Jiuxiang Gu, Yingyu Liang, Zhenmei Shi, Zhao Song, Yufa Zhou,
- Abstract summary: We prove that the backward gradient of tensor attention training can be computed in almost linear $n1+o(1)$ time.
Our results establish the feasibility of efficient higher-order transformer training and may facilitate practical applications of tensor attention architectures.
- Score: 29.27113653850964
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Tensor Attention, a multi-view attention that is able to capture high-order correlations among multiple modalities, can overcome the representational limitations of classical matrix attention. However, the $\Omega(n^3)$ time complexity of tensor attention poses a significant obstacle to its practical implementation in transformers, where $n$ is the input sequence length. In this work, we prove that the backward gradient of tensor attention training can be computed in almost linear $n^{1+o(1)}$ time, the same complexity as its forward computation under a bounded entries assumption. We provide a closed-form solution for the gradient and propose a fast computation method utilizing polynomial approximation methods and tensor algebraic tricks. Furthermore, we prove the necessity and tightness of our assumption through hardness analysis, showing that slightly weakening it renders the gradient problem unsolvable in truly subcubic time. Our theoretical results establish the feasibility of efficient higher-order transformer training and may facilitate practical applications of tensor attention architectures.
Related papers
- Tensor cumulants for statistical inference on invariant distributions [49.80012009682584]
We show that PCA becomes computationally hard at a critical value of the signal's magnitude.
We define a new set of objects, which provide an explicit, near-orthogonal basis for invariants of a given degree.
It also lets us analyze a new problem of distinguishing between different ensembles.
arXiv Detail & Related papers (2024-04-29T14:33:24Z) - Adaptive Federated Learning Over the Air [108.62635460744109]
We propose a federated version of adaptive gradient methods, particularly AdaGrad and Adam, within the framework of over-the-air model training.
Our analysis shows that the AdaGrad-based training algorithm converges to a stationary point at the rate of $mathcalO( ln(T) / T 1 - frac1alpha ).
arXiv Detail & Related papers (2024-03-11T09:10:37Z) - Tensor Completion via Integer Optimization [7.813563137863005]
The main challenge with the tensor completion problem is a fundamental tension between computation power and the information-theoretic sample complexity rate.
Past approaches either achieve the information-theoretic rate but lack practical algorithms to compute the corresponding solution.
This paper develops a novel tensor completion algorithm that resolves this tension by achieving both provable convergence (in numerical tolerance) in a linear number of oracle steps and the information-theoretic rate.
arXiv Detail & Related papers (2024-02-06T21:44:07Z) - Scalable tensor methods for nonuniform hypergraphs [0.18434042562191813]
A recently proposed adjacency tensor is applicable to nonuniform hypergraphs, but is prohibitively costly to form and analyze in practice.
We develop tensor times same vector (TTSV) algorithms which improve complexity from $O(nr)$ to a low-degree in $r$.
We demonstrate the flexibility and utility of our approach in practice by developing tensor-based hypergraph centrality and clustering algorithms.
arXiv Detail & Related papers (2023-06-30T17:41:58Z) - Sublinear scaling in non-Markovian open quantum systems simulations [0.0]
We introduce a numerically exact algorithm to calculate process tensors.
Our approach requires only $mathcalO(nlog n)$ singular value decompositions for environments with infinite memory.
arXiv Detail & Related papers (2023-04-11T15:40:33Z) - Linear Self-Attention Approximation via Trainable Feedforward Kernel [77.34726150561087]
In pursuit of faster computation, Efficient Transformers demonstrate an impressive variety of approaches.
We aim to expand the idea of trainable kernel methods to approximate the self-attention mechanism of the Transformer architecture.
arXiv Detail & Related papers (2022-11-08T08:14:11Z) - Error Analysis of Tensor-Train Cross Approximation [88.83467216606778]
We provide accuracy guarantees in terms of the entire tensor for both exact and noisy measurements.
Results are verified by numerical experiments, and may have important implications for the usefulness of cross approximations for high-order tensors.
arXiv Detail & Related papers (2022-07-09T19:33:59Z) - Softmax-free Linear Transformers [90.83157268265654]
Vision transformers (ViTs) have pushed the state-of-the-art for visual perception tasks.
Existing methods are either theoretically flawed or empirically ineffective for visual recognition.
We propose a family of Softmax-Free Transformers (SOFT)
arXiv Detail & Related papers (2022-07-05T03:08:27Z) - SOFT: Softmax-free Transformer with Linear Complexity [112.9754491864247]
Vision transformers (ViTs) have pushed the state-of-the-art for various visual recognition tasks by patch-wise image tokenization followed by self-attention.
Various attempts on approximating the self-attention with linear complexity have been made in Natural Language Processing.
We identify that their limitations are rooted in keeping the softmax self-attention during approximations.
For the first time, a softmax-free transformer or SOFT is proposed.
arXiv Detail & Related papers (2021-10-22T17:57:29Z) - Tensor Completion Made Practical [19.229475414802213]
We introduce a new variant of alternating minimization, which in turn is inspired by understanding how the progress measures that guide convergence need to be adapted to the tensor setting.
We show that our algorithm converges linearly to the true tensors even when the factors are highly correlated and can be implemented in nearly linear time.
In contrast, and somewhat surprisingly, we show that the standard version of alternating minimization, without our new twist, can converge at a drastically slower rate in practice.
arXiv Detail & Related papers (2020-06-04T21:09:44Z) - A Solution for Large Scale Nonlinear Regression with High Rank and
Degree at Constant Memory Complexity via Latent Tensor Reconstruction [0.0]
This paper proposes a novel method for learning highly nonlinear, multivariate functions from examples.
Our method takes advantage of the property that continuous functions can be approximated by bys, which in turn are representable by tensors.
For learning the models, we present an efficient-based algorithm that can be implemented in linear time.
arXiv Detail & Related papers (2020-05-04T14:49:14Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.