Sub-Linear Memory: How to Make Performers SLiM
- URL: http://arxiv.org/abs/2012.11346v1
- Date: Mon, 21 Dec 2020 13:56:04 GMT
- Title: Sub-Linear Memory: How to Make Performers SLiM
- Authors: Valerii Likhosherstov, Krzysztof Choromanski, Jared Davis, Xingyou
Song, Adrian Weller
- Abstract summary: vanilla Transformers require $O(L^2)$ in serial time and memory as functions of input length $L$.
Recent works proposed various linear self-attention mechanisms, scaling only as $O(L)$ for serial computation.
We observe a remarkable computational flexibility: forward and backward propagation can be performed with no approximations using sublinear memory.
- Score: 38.068090269482425
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The Transformer architecture has revolutionized deep learning on sequential
data, becoming ubiquitous in state-of-the-art solutions for a wide variety of
applications. Yet vanilla Transformers are notoriously resource-expensive,
requiring $O(L^2)$ in serial time and memory as functions of input length $L$.
Recent works proposed various linear self-attention mechanisms, scaling only as
$O(L)$ for serial computation. We perform a thorough analysis of recent
Transformer mechanisms with linear self-attention, Performers, in terms of
overall computational complexity. We observe a remarkable computational
flexibility: forward and backward propagation can be performed with no
approximations using sublinear memory as a function of $L$ (in addition to
negligible storage for the input sequence), at a cost of greater time
complexity in the parallel setting. In the extreme case, a Performer consumes
only $O(1)$ memory during training, and still requires $O(L)$ time. This
discovered time-memory tradeoff can be used for training or, due to complete
backward-compatibility, for fine-tuning on a low-memory device, e.g. a
smartphone or an earlier-generation GPU, thus contributing towards
decentralized and democratized deep learning.
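The constant-memory regime described above can be illustrated with a minimal sketch of causal linear attention that keeps only a running prefix-sum state. The ELU+1 feature map and toy shapes below are illustrative stand-ins (not the Performer's FAVOR+ features), and the paper's recomputation-based backward pass is not shown.

```python
import numpy as np

def elu_feature_map(x):
    # Simple positive feature map (an illustrative stand-in, not FAVOR+):
    # phi(x) = elu(x) + 1 keeps features positive so the normalizer is nonzero.
    return np.where(x > 0, x + 1.0, np.exp(x))

def causal_linear_attention_constant_memory(Q, K, V):
    """Causal linear attention computed with a running prefix-sum state.

    The state (S, z) has size independent of the sequence length L, so the
    activation memory is O(1) in L; the fully sequential loop costs O(L) time.
    """
    L, d = Q.shape
    d_v = V.shape[1]
    phi_Q, phi_K = elu_feature_map(Q), elu_feature_map(K)
    S = np.zeros((d, d_v))       # running sum of outer(phi(k_j), v_j)
    z = np.zeros(d)              # running sum of phi(k_j)
    out = np.empty((L, d_v))
    for i in range(L):
        S += np.outer(phi_K[i], V[i])
        z += phi_K[i]
        out[i] = phi_Q[i] @ S / (phi_Q[i] @ z + 1e-6)
    return out

# Toy usage with made-up shapes.
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(16, 8)), rng.normal(size=(16, 8)), rng.normal(size=(16, 4))
print(causal_linear_attention_constant_memory(Q, K, V).shape)  # (16, 4)
```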
Related papers
- Multi-Layer Transformers Gradient Can be Approximated in Almost Linear Time [17.086679273053853]
We show that a novel fast approximation method can calculate the gradients in almost linear time.
By improving the efficiency of gradient, we hope that this work will facilitate more effective training and deployment of long-context language models.
arXiv Detail & Related papers (2024-08-23T17:16:43Z)
- One Pass Streaming Algorithm for Super Long Token Attention Approximation in Sublinear Space [11.735802740426294]
Attention computation requires $O(n^2)$ time and $O(n^2)$ space simultaneously.
We introduce a new algorithm that only reads one pass of data in a streaming fashion.
Notably, our algorithm exhibits exceptional memory-efficient performance with super-long tokens.
arXiv Detail & Related papers (2023-11-24T18:35:00Z)
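To illustrate the streaming setting only (this is exact single-query attention via the standard online-softmax recurrence, not the cited paper's sublinear-space approximation algorithm), one pass over a key/value stream can be written as:

```python
import numpy as np

def streaming_attention_single_query(q, kv_stream):
    """Exact attention output for one query in a single pass over (k, v) pairs.

    Only a running max, normalizer, and weighted value sum are kept, so the
    extra memory is O(d) regardless of how many key/value pairs stream by.
    """
    m, denom, acc = -np.inf, 0.0, None
    for k, v in kv_stream:
        s = float(q @ k)
        if acc is None:
            m, denom, acc = s, 1.0, v.astype(float).copy()
            continue
        new_m = max(m, s)
        scale = np.exp(m - new_m)      # rescale previous accumulators
        w = np.exp(s - new_m)
        denom = denom * scale + w
        acc = acc * scale + w * v
        m = new_m
    return acc / denom

# Toy usage: keys/values arrive as a generator, never stored as full matrices.
rng = np.random.default_rng(0)
q = rng.normal(size=4)
stream = ((rng.normal(size=4), rng.normal(size=3)) for _ in range(1000))
print(streaming_attention_single_query(q, stream))
```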
- Transformers Learn Shortcuts to Automata [52.015990420075944]
We find that a low-depth Transformer can represent the computations of any finite-state automaton.
We show that a Transformer with $O(\log T)$ layers can exactly replicate the computation of an automaton on an input sequence of length $T$.
We further investigate the brittleness of these solutions and propose potential mitigations.
arXiv Detail & Related papers (2022-10-19T17:45:48Z)
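The $O(\log T)$-depth result rests on the fact that state-transition functions compose associatively, so a length-$T$ recurrence can be collapsed by a balanced reduction tree. A small illustrative sketch of that idea (not the paper's Transformer construction):

```python
def compose(f, g):
    # "f then g": apply f's transition first, then g's.
    return tuple(g[f[s]] for s in range(len(f)))

def run_automaton_logdepth(transitions, start_state):
    """Collapse per-symbol transition functions with a balanced reduction.

    Each transition t is a tuple with t[s] = next state from state s. Since
    composition is associative, the tree has depth O(log T) in the number of
    input symbols T, mirroring the shortcut a shallow network can learn.
    """
    funcs = list(transitions)
    while len(funcs) > 1:                          # one tree level per iteration
        nxt = [compose(funcs[i], funcs[i + 1]) for i in range(0, len(funcs) - 1, 2)]
        if len(funcs) % 2:
            nxt.append(funcs[-1])
        funcs = nxt
    return funcs[0][start_state]

# Toy usage: parity automaton over bits (state flips on 1, stays on 0).
FLIP, STAY = (1, 0), (0, 1)
bits = [1, 0, 1, 1, 0, 1]
print(run_automaton_logdepth([FLIP if b else STAY for b in bits], start_state=0))  # 0
```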
- Mixability made efficient: Fast online multiclass logistic regression [68.8204255655161]
We show that mixability can be a powerful tool to obtain algorithms with optimal regret.
However, the resulting methods often suffer from high computational complexity, which has limited their practical applicability.
arXiv Detail & Related papers (2021-10-08T08:22:05Z)
- Compressing 1D Time-Channel Separable Convolutions using Sparse Random Ternary Matrices [65.4388266814055]
We replace 1x1-convolutions in 1D time-channel separable convolutions with constant, sparse random ternary matrices with weights in $\{-1, 0, +1\}$.
For command recognition on Google Speech Commands v1, we improve the state-of-the-art accuracy from $97.21\%$ to $97.41\%$ at the same network size.
For speech recognition on Librispeech, we halve the number of weights to be trained while sacrificing only about $1\%$ of the floating-point baseline's word error rate.
arXiv Detail & Related papers (2021-03-31T15:09:20Z)
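A minimal sketch of the replacement idea, with made-up dimensions and density (the paper's exact sparsity settings are not reproduced here): a 1x1 (pointwise) convolution is a per-timestep matrix multiply, so its learned weights can be swapped for a fixed sparse ternary matrix.

```python
import numpy as np

def sparse_ternary_matrix(rows, cols, density=0.125, seed=0):
    """Fixed (non-trainable) sparse random ternary matrix with entries in {-1, 0, +1}.

    `density` (fraction of nonzeros) and the dimensions used below are
    illustrative choices, not the settings used in the paper.
    """
    rng = np.random.default_rng(seed)
    mask = rng.random((rows, cols)) < density
    signs = rng.choice([-1.0, 1.0], size=(rows, cols))
    return mask * signs

def ternary_pointwise_conv(x, W):
    """Apply the fixed ternary matrix in place of a learned 1x1 convolution.

    x: (channels_in, time) feature map, W: (channels_out, channels_in); a 1x1
    convolution over time is just this matrix multiply at every timestep.
    """
    return W @ x

# Toy usage.
W = sparse_ternary_matrix(64, 32)
x = np.random.default_rng(1).normal(size=(32, 100))
print(ternary_pointwise_conv(x, W).shape)  # (64, 100)
```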
- Fast Approximate Multi-output Gaussian Processes [6.6174748514131165]
Training with the proposed approach requires computing only an $N \times n$ eigenfunction matrix and an $n \times n$ inverse, where $n$ is a selected number of eigenvalues.
The proposed method can regress over multiple outputs, estimate the derivative of the regressor of any order, and learn the correlations between them.
arXiv Detail & Related papers (2020-08-22T14:34:45Z)
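A generic reduced-rank GP sketch consistent with the stated cost (only an $n \times n$ system is solved); the paper's multi-output construction and derivative estimation are not reproduced here, and `noise_var` is an assumed hyperparameter.

```python
import numpy as np

def reduced_rank_gp_mean(Phi, lam, y, Phi_star, noise_var=1e-2):
    """Posterior mean of a reduced-rank GP built from n eigenpairs.

    Phi:      (N, n) eigenfunctions at the N training inputs
    lam:      (n,)   eigenvalues of the kernel expansion
    Phi_star: (M, n) eigenfunctions at the test inputs
    Only the n x n matrix A is solved, never an N x N kernel matrix.
    """
    A = Phi.T @ Phi + noise_var * np.diag(1.0 / lam)   # (n, n)
    alpha = np.linalg.solve(A, Phi.T @ y)
    return Phi_star @ alpha

# Toy usage with random stand-in features.
rng = np.random.default_rng(0)
Phi = rng.normal(size=(200, 10)); lam = np.linspace(1.0, 0.1, 10)
y = Phi @ rng.normal(size=10) + 0.1 * rng.normal(size=200)
print(reduced_rank_gp_mean(Phi, lam, y, Phi[:5]).shape)  # (5,)
```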
- $O(n)$ Connections are Expressive Enough: Universal Approximability of Sparse Transformers [71.31712741938837]
We show that sparse Transformers with only $O(n)$ connections per attention layer can approximate the same function class as the dense model with $n^2$ connections.
We also present experiments comparing different patterns/levels of sparsity on standard NLP tasks.
arXiv Detail & Related papers (2020-06-08T18:30:12Z)
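For concreteness, one example of an attention pattern with $O(n)$ connections is a fixed-width sliding window plus a global token; this is only an illustrative pattern, not necessarily the sparsity classes analyzed in the paper.

```python
import numpy as np

def sparse_attention_mask(n, window=2, global_tokens=(0,)):
    """Boolean mask with O(n) allowed connections for fixed `window`.

    Each position attends to a sliding window of width 2*window + 1 plus a
    few globally connected tokens; total connections grow linearly in n.
    """
    mask = np.zeros((n, n), dtype=bool)
    for i in range(n):
        mask[i, max(0, i - window):min(n, i + window + 1)] = True  # local band
    for g in global_tokens:
        mask[:, g] = True    # every token attends to the global token
        mask[g, :] = True    # the global token attends to every token
    return mask

# Toy usage: connection count grows linearly, not quadratically, in n.
for n in (64, 128, 256):
    print(n, int(sparse_attention_mask(n).sum()))
```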
- GMAT: Global Memory Augmentation for Transformers [45.584411593847406]
We propose to augment sparse Transformer blocks with a dense attention-based global memory of length $M$ ($M \ll L$).
Our augmentation has a manageable $O(M \cdot (L+M))$ memory overhead, and can be seamlessly integrated with prior sparse solutions.
arXiv Detail & Related papers (2020-06-05T07:50:40Z)
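A single-head, projection-free sketch of the global-memory idea with the stated score-storage cost (the actual method combines this with sparse local attention and learned projections, which are omitted here):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def global_memory_attention(X, Mem):
    """X: (L, d) sequence tokens, Mem: (M, d) global memory tokens.

    Memory attends densely to all L + M tokens (an M x (L + M) score matrix),
    then sequence tokens attend only to the updated memory (an L x M score
    matrix), so score storage is O(M * (L + M)) rather than O(L^2).
    """
    d = X.shape[1]
    everything = np.concatenate([Mem, X], axis=0)                      # (M + L, d)
    new_mem = softmax(Mem @ everything.T / np.sqrt(d)) @ everything    # (M, d)
    new_X = softmax(X @ new_mem.T / np.sqrt(d)) @ new_mem              # (L, d)
    return new_X, new_mem

# Toy usage: L = 1024 tokens routed through M = 8 memory slots.
rng = np.random.default_rng(0)
X, Mem = rng.normal(size=(1024, 16)), rng.normal(size=(8, 16))
out, mem = global_memory_attention(X, Mem)
print(out.shape, mem.shape)  # (1024, 16) (8, 16)
```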