Tensor Attention Training: Provably Efficient Learning of Higher-order Transformers
- URL: http://arxiv.org/abs/2405.16411v2
- Date: Mon, 14 Oct 2024 04:10:19 GMT
- Title: Tensor Attention Training: Provably Efficient Learning of Higher-order Transformers
- Authors: Yingyu Liang, Zhenmei Shi, Zhao Song, Yufa Zhou
- Abstract summary: The time complexity of tensor attention is a significant obstacle to its utilization in transformers.
We prove that the backward gradient of tensor attention training can be computed in almost linear time.
- Score: 18.331374727331077
- Abstract: Tensor Attention, a multi-view attention that is able to capture high-order correlations among multiple modalities, can overcome the representational limitations of classical matrix attention. However, the $O(n^3)$ time complexity of tensor attention poses a significant obstacle to its utilization in transformers, where $n$ is the input sequence length. In this work, we prove that the backward gradient of tensor attention training can be computed in almost linear time $n^{1+o(1)}$, the same complexity as its forward computation under the bounded entries assumption. We provide a closed-form solution for the gradient and propose a fast computation method utilizing polynomial approximation methods and tensor algebraic techniques. Furthermore, we prove the necessity and tightness of our assumption through hardness analysis, showing that slightly weakening it renders the gradient problem unsolvable in truly subcubic time. Our theoretical results establish the feasibility of efficient higher-order transformer training and may facilitate practical applications of tensor attention architectures.
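The cubic bottleneck described in the abstract is easy to see in a direct implementation. Below is a minimal NumPy sketch of one common third-order attention formulation; the function name, the elementwise-product score, and the use of two key/value streams are illustrative assumptions, not the paper's exact parameterization. Materializing the $n \times n \times n$ score tensor is what costs $O(n^3)$ in the sequence length:

```python
import numpy as np

def naive_tensor_attention(Q, K1, K2, V1, V2):
    """Naive third-order (tensor) attention: O(n^3) in sequence length n.

    Hypothetical formulation for illustration: the score for query i and
    key pair (j, k) is the inner product of q_i with the elementwise
    product k1_j * k2_k, and values are combined the same way.
    """
    n, d = Q.shape
    # Scores S[i, j, k] = <q_i, k1_j * k2_k> -- an n x n x n tensor
    S = np.einsum('id,jd,kd->ijk', Q, K1, K2) / np.sqrt(d)
    # Softmax over the joint (j, k) axis (numerically stabilized)
    A = np.exp(S - S.max(axis=(1, 2), keepdims=True))
    A /= A.sum(axis=(1, 2), keepdims=True)
    # Output O[i] = sum_{j,k} A[i,j,k] * (v1_j * v2_k)
    return np.einsum('ijk,jd,kd->id', A, V1, V2)

rng = np.random.default_rng(0)
n, d = 8, 4
Q, K1, K2, V1, V2 = (rng.standard_normal((n, d)) for _ in range(5))
out = naive_tensor_attention(Q, K1, K2, V1, V2)
print(out.shape)  # (8, 4)
```

The paper's claim is that, under a bounded-entries assumption, both this forward pass and its gradient can instead be approximated in $n^{1+o(1)}$ time via polynomial approximation, without ever forming the cubic score tensor.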
Related papers
- Can Looped Transformers Learn to Implement Multi-step Gradient Descent for In-context Learning? [69.4145579827826]
We show fast convergence of the regression loss for our algorithms despite the non-convexity of the optimization landscape.
This is the first theoretical analysis for multi-layer Transformer in this setting.
arXiv Detail & Related papers (2024-10-10T18:29:05Z)
- Multi-Layer Transformers Gradient Can be Approximated in Almost Linear Time [17.086679273053853]
We show that a novel fast approximation method can calculate the gradients in almost linear time.
By improving the efficiency of gradient computation, we hope that this work will facilitate more effective training and deployment of long-context language models.
arXiv Detail & Related papers (2024-08-23T17:16:43Z)
- Power of $\ell_1$-Norm Regularized Kaczmarz Algorithms for High-Order Tensor Recovery [8.812294191190896]
We propose novel Kaczmarz algorithms for recovering high-order tensors characterized by sparse and/or low-rank structures.
A variety of numerical experiments on both synthetic and real-world datasets demonstrate the effectiveness and significant potential of the proposed methods.
arXiv Detail & Related papers (2024-05-14T02:06:53Z)
- Tensor cumulants for statistical inference on invariant distributions [49.80012009682584]
We show that PCA becomes computationally hard at a critical value of the signal's magnitude.
We define a new set of objects, which provide an explicit, near-orthogonal basis for invariants of a given degree.
It also lets us analyze a new problem of distinguishing between different ensembles.
arXiv Detail & Related papers (2024-04-29T14:33:24Z)
- Tensor Completion via Integer Optimization [7.813563137863005]
The main challenge with the tensor completion problem is a fundamental tension between computation power and the information-theoretic sample complexity rate.
Past approaches either achieve the information-theoretic rate but lack practical algorithms to compute the corresponding solution, or are computationally tractable at the cost of a higher sample complexity.
This paper develops a novel tensor completion algorithm that resolves this tension by achieving both provable convergence (in numerical tolerance) in a linear number of oracle steps and the information-theoretic rate.
arXiv Detail & Related papers (2024-02-06T21:44:07Z)
- Scalable tensor methods for nonuniform hypergraphs [0.18434042562191813]
A recently proposed adjacency tensor is applicable to nonuniform hypergraphs, but is prohibitively costly to form and analyze in practice.
We develop tensor times same vector (TTSV) algorithms which improve complexity from $O(n^r)$ to a low-degree polynomial in $r$.
We demonstrate the flexibility and utility of our approach in practice by developing tensor-based hypergraph centrality and clustering algorithms.
arXiv Detail & Related papers (2023-06-30T17:41:58Z)
- Linear Self-Attention Approximation via Trainable Feedforward Kernel [77.34726150561087]
In pursuit of faster computation, Efficient Transformers demonstrate an impressive variety of approaches.
We aim to expand the idea of trainable kernel methods to approximate the self-attention mechanism of the Transformer architecture.
arXiv Detail & Related papers (2022-11-08T08:14:11Z)
- Error Analysis of Tensor-Train Cross Approximation [88.83467216606778]
We provide accuracy guarantees in terms of the entire tensor for both exact and noisy measurements.
Results are verified by numerical experiments, and may have important implications for the usefulness of cross approximations for high-order tensors.
arXiv Detail & Related papers (2022-07-09T19:33:59Z)
- Softmax-free Linear Transformers [90.83157268265654]
Vision transformers (ViTs) have pushed the state-of-the-art for visual perception tasks.
Existing methods are either theoretically flawed or empirically ineffective for visual recognition.
We propose a family of Softmax-Free Transformers (SOFT).
arXiv Detail & Related papers (2022-07-05T03:08:27Z)
- SOFT: Softmax-free Transformer with Linear Complexity [112.9754491864247]
Vision transformers (ViTs) have pushed the state-of-the-art for various visual recognition tasks by patch-wise image tokenization followed by self-attention.
Various attempts on approximating the self-attention with linear complexity have been made in Natural Language Processing.
We identify that their limitations are rooted in keeping the softmax self-attention during approximations.
For the first time, a softmax-free transformer or SOFT is proposed.
arXiv Detail & Related papers (2021-10-22T17:57:29Z)
- A Solution for Large Scale Nonlinear Regression with High Rank and Degree at Constant Memory Complexity via Latent Tensor Reconstruction [0.0]
This paper proposes a novel method for learning highly nonlinear, multivariate functions from examples.
Our method takes advantage of the property that continuous functions can be approximated by polynomials, which in turn are representable by tensors.
For learning the models, we present an efficient gradient-based algorithm that can be implemented in linear time.
arXiv Detail & Related papers (2020-05-04T14:49:14Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.