Sub-Linear Memory: How to Make Performers SLiM
- URL: http://arxiv.org/abs/2012.11346v1
- Date: Mon, 21 Dec 2020 13:56:04 GMT
- Title: Sub-Linear Memory: How to Make Performers SLiM
- Authors: Valerii Likhosherstov, Krzysztof Choromanski, Jared Davis, Xingyou
Song, Adrian Weller
- Abstract summary: vanilla Transformers require $O(L^2)$ in serial time and memory as functions of input length $L$.
Recent works proposed various linear self-attention mechanisms, scaling only as $O(L)$ for serial computation.
We observe a remarkable computational flexibility: forward and backward propagation can be performed with no approximations using sublinear memory.
- Score: 38.068090269482425
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The Transformer architecture has revolutionized deep learning on sequential
data, becoming ubiquitous in state-of-the-art solutions for a wide variety of
applications. Yet vanilla Transformers are notoriously resource-expensive,
requiring $O(L^2)$ in serial time and memory as functions of input length $L$.
Recent works proposed various linear self-attention mechanisms, scaling only as
$O(L)$ for serial computation. We perform a thorough analysis of recent
Transformer mechanisms with linear self-attention, Performers, in terms of
overall computational complexity. We observe a remarkable computational
flexibility: forward and backward propagation can be performed with no
approximations using sublinear memory as a function of $L$ (in addition to
negligible storage for the input sequence), at a cost of greater time
complexity in the parallel setting. In the extreme case, a Performer consumes
only $O(1)$ memory during training, and still requires $O(L)$ time. This
discovered time-memory tradeoff can be used for training or, due to complete
backward-compatibility, for fine-tuning on a low-memory device, e.g. a
smartphone or an earlier-generation GPU, thus contributing towards
decentralized and democratized deep learning.
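The constant-memory regime described above can be illustrated with a minimal sketch of causal linear attention that keeps only a running prefix-sum state. The ELU+1 feature map and toy shapes below are illustrative stand-ins (not the Performer's FAVOR+ features), and the paper's recomputation-based backward pass is not shown.

```python
import numpy as np

def elu_feature_map(x):
    # Simple positive feature map (an illustrative stand-in, not FAVOR+):
    # phi(x) = elu(x) + 1 keeps features positive so the normalizer is nonzero.
    return np.where(x > 0, x + 1.0, np.exp(x))

def causal_linear_attention_constant_memory(Q, K, V):
    """Causal linear attention computed with a running prefix-sum state.

    The state (S, z) has size independent of the sequence length L, so the
    activation memory is O(1) in L; the fully sequential loop costs O(L) time.
    """
    L, d = Q.shape
    d_v = V.shape[1]
    phi_Q, phi_K = elu_feature_map(Q), elu_feature_map(K)
    S = np.zeros((d, d_v))       # running sum of outer(phi(k_j), v_j)
    z = np.zeros(d)              # running sum of phi(k_j)
    out = np.empty((L, d_v))
    for i in range(L):
        S += np.outer(phi_K[i], V[i])
        z += phi_K[i]
        out[i] = phi_Q[i] @ S / (phi_Q[i] @ z + 1e-6)
    return out

# Toy usage with made-up shapes.
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(16, 8)), rng.normal(size=(16, 8)), rng.normal(size=(16, 4))
print(causal_linear_attention_constant_memory(Q, K, V).shape)  # (16, 4)
```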
Related papers
- Multi-Layer Transformers Gradient Can be Approximated in Almost Linear Time [17.086679273053853]
We show that a novel fast approximation method can calculate the gradients in almost linear time.
By improving the efficiency of gradient, we hope that this work will facilitate more effective training and deployment of long-context language models.
arXiv Detail & Related papers (2024-08-23T17:16:43Z)
- One Pass Streaming Algorithm for Super Long Token Attention Approximation in Sublinear Space [11.735802740426294]
Attention computation requires $O(n^2)$ time and $O(n^2)$ space simultaneously.
We introduce a new algorithm that only reads one pass of data in a streaming fashion.
Notably, our algorithm exhibits exceptional memory-efficient performance with super-long tokens.
arXiv Detail & Related papers (2023-11-24T18:35:00Z)
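To illustrate the streaming setting only (this is exact single-query attention via the standard online-softmax recurrence, not the cited paper's sublinear-space approximation algorithm), one pass over a key/value stream can be written as:

```python
import numpy as np

def streaming_attention_single_query(q, kv_stream):
    """Exact attention output for one query in a single pass over (k, v) pairs.

    Only a running max, normalizer, and weighted value sum are kept, so the
    extra memory is O(d) regardless of how many key/value pairs stream by.
    """
    m, denom, acc = -np.inf, 0.0, None
    for k, v in kv_stream:
        s = float(q @ k)
        if acc is None:
            m, denom, acc = s, 1.0, v.astype(float).copy()
            continue
        new_m = max(m, s)
        scale = np.exp(m - new_m)      # rescale previous accumulators
        w = np.exp(s - new_m)
        denom = denom * scale + w
        acc = acc * scale + w * v
        m = new_m
    return acc / denom

# Toy usage: keys/values arrive as a generator, never stored as full matrices.
rng = np.random.default_rng(0)
q = rng.normal(size=4)
stream = ((rng.normal(size=4), rng.normal(size=3)) for _ in range(1000))
print(streaming_attention_single_query(q, stream))
```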
- Transformers Learn Shortcuts to Automata [52.015990420075944]
We find that a low-depth Transformer can represent the computations of any finite-state automaton.
We show that a Transformer with $O(\log T)$ layers can exactly replicate the computation of an automaton on an input sequence of length $T$.
We further investigate the brittleness of these solutions and propose potential mitigations.
arXiv Detail & Related papers (2022-10-19T17:45:48Z)
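The $O(\log T)$-depth result rests on the fact that state-transition functions compose associatively, so a length-$T$ recurrence can be collapsed by a balanced reduction tree. A small illustrative sketch of that idea (not the paper's Transformer construction):

```python
def compose(f, g):
    # "f then g": apply f's transition first, then g's.
    return tuple(g[f[s]] for s in range(len(f)))

def run_automaton_logdepth(transitions, start_state):
    """Collapse per-symbol transition functions with a balanced reduction.

    Each transition t is a tuple with t[s] = next state from state s. Since
    composition is associative, the tree has depth O(log T) in the number of
    input symbols T, mirroring the shortcut a shallow network can learn.
    """
    funcs = list(transitions)
    while len(funcs) > 1:                          # one tree level per iteration
        nxt = [compose(funcs[i], funcs[i + 1]) for i in range(0, len(funcs) - 1, 2)]
        if len(funcs) % 2:
            nxt.append(funcs[-1])
        funcs = nxt
    return funcs[0][start_state]

# Toy usage: parity automaton over bits (state flips on 1, stays on 0).
FLIP, STAY = (1, 0), (0, 1)
bits = [1, 0, 1, 1, 0, 1]
print(run_automaton_logdepth([FLIP if b else STAY for b in bits], start_state=0))  # 0
```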
- Mixability made efficient: Fast online multiclass logistic regression [68.8204255655161]
We show that mixability can be a powerful tool to obtain algorithms with optimal regret.
However, the resulting methods often suffer from high computational complexity, which has limited their practical applicability.
arXiv Detail & Related papers (2021-10-08T08:22:05Z)
- Compressing 1D Time-Channel Separable Convolutions using Sparse Random Ternary Matrices [65.4388266814055]
We replace 1x1-convolutions in 1D time-channel separable convolutions with constant, sparse random ternary matrices with weights in $\{-1, 0, +1\}$.
For command recognition on Google Speech Commands v1, we improve the state-of-the-art accuracy from $97.21\%$ to $97.41\%$ at the same network size.
For speech recognition on Librispeech, we halve the number of weights to be trained while sacrificing only about $1\%$ of the floating-point baseline's word error rate.
arXiv Detail & Related papers (2021-03-31T15:09:20Z)
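A minimal sketch of the replacement idea, with made-up dimensions and density (the paper's exact sparsity settings are not reproduced here): a 1x1 (pointwise) convolution is a per-timestep matrix multiply, so its learned weights can be swapped for a fixed sparse ternary matrix.

```python
import numpy as np

def sparse_ternary_matrix(rows, cols, density=0.125, seed=0):
    """Fixed (non-trainable) sparse random ternary matrix with entries in {-1, 0, +1}.

    `density` (fraction of nonzeros) and the dimensions used below are
    illustrative choices, not the settings used in the paper.
    """
    rng = np.random.default_rng(seed)
    mask = rng.random((rows, cols)) < density
    signs = rng.choice([-1.0, 1.0], size=(rows, cols))
    return mask * signs

def ternary_pointwise_conv(x, W):
    """Apply the fixed ternary matrix in place of a learned 1x1 convolution.

    x: (channels_in, time) feature map, W: (channels_out, channels_in); a 1x1
    convolution over time is just this matrix multiply at every timestep.
    """
    return W @ x

# Toy usage.
W = sparse_ternary_matrix(64, 32)
x = np.random.default_rng(1).normal(size=(32, 100))
print(ternary_pointwise_conv(x, W).shape)  # (64, 100)
```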
- Fast Approximate Multi-output Gaussian Processes [6.6174748514131165]
Training with the proposed approach requires computing only an $N \times n$ eigenfunction matrix and an $n \times n$ inverse, where $n$ is a selected number of eigenvalues.
The proposed method can regress over multiple outputs, estimate the derivative of the regressor of any order, and learn the correlations between them.
arXiv Detail & Related papers (2020-08-22T14:34:45Z)
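A generic reduced-rank GP sketch consistent with the stated cost (only an $n \times n$ system is solved); the paper's multi-output construction and derivative estimation are not reproduced here, and `noise_var` is an assumed hyperparameter.

```python
import numpy as np

def reduced_rank_gp_mean(Phi, lam, y, Phi_star, noise_var=1e-2):
    """Posterior mean of a reduced-rank GP built from n eigenpairs.

    Phi:      (N, n) eigenfunctions at the N training inputs
    lam:      (n,)   eigenvalues of the kernel expansion
    Phi_star: (M, n) eigenfunctions at the test inputs
    Only the n x n matrix A is solved, never an N x N kernel matrix.
    """
    A = Phi.T @ Phi + noise_var * np.diag(1.0 / lam)   # (n, n)
    alpha = np.linalg.solve(A, Phi.T @ y)
    return Phi_star @ alpha

# Toy usage with random stand-in features.
rng = np.random.default_rng(0)
Phi = rng.normal(size=(200, 10)); lam = np.linspace(1.0, 0.1, 10)
y = Phi @ rng.normal(size=10) + 0.1 * rng.normal(size=200)
print(reduced_rank_gp_mean(Phi, lam, y, Phi[:5]).shape)  # (5,)
```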
- $O(n)$ Connections are Expressive Enough: Universal Approximability of Sparse Transformers [71.31712741938837]
We show that sparse Transformers with only $O(n)$ connections per attention layer can approximate the same function class as the dense model with $n^2$ connections.
We also present experiments comparing different patterns/levels of sparsity on standard NLP tasks.
arXiv Detail & Related papers (2020-06-08T18:30:12Z)
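For concreteness, one example of an attention pattern with $O(n)$ connections is a fixed-width sliding window plus a global token; this is only an illustrative pattern, not necessarily the sparsity classes analyzed in the paper.

```python
import numpy as np

def sparse_attention_mask(n, window=2, global_tokens=(0,)):
    """Boolean mask with O(n) allowed connections for fixed `window`.

    Each position attends to a sliding window of width 2*window + 1 plus a
    few globally connected tokens; total connections grow linearly in n.
    """
    mask = np.zeros((n, n), dtype=bool)
    for i in range(n):
        mask[i, max(0, i - window):min(n, i + window + 1)] = True  # local band
    for g in global_tokens:
        mask[:, g] = True    # every token attends to the global token
        mask[g, :] = True    # the global token attends to every token
    return mask

# Toy usage: connection count grows linearly, not quadratically, in n.
for n in (64, 128, 256):
    print(n, int(sparse_attention_mask(n).sum()))
```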
- GMAT: Global Memory Augmentation for Transformers [45.584411593847406]
We propose to augment sparse Transformer blocks with a dense attention-based global memory of length $M$ ($M \ll L$).
Our augmentation has a manageable $O(M \cdot (L+M))$ memory overhead, and can be seamlessly integrated with prior sparse solutions.
arXiv Detail & Related papers (2020-06-05T07:50:40Z)
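A single-head, projection-free sketch of the global-memory idea with the stated score-storage cost (the actual method combines this with sparse local attention and learned projections, which are omitted here):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def global_memory_attention(X, Mem):
    """X: (L, d) sequence tokens, Mem: (M, d) global memory tokens.

    Memory attends densely to all L + M tokens (an M x (L + M) score matrix),
    then sequence tokens attend only to the updated memory (an L x M score
    matrix), so score storage is O(M * (L + M)) rather than O(L^2).
    """
    d = X.shape[1]
    everything = np.concatenate([Mem, X], axis=0)                      # (M + L, d)
    new_mem = softmax(Mem @ everything.T / np.sqrt(d)) @ everything    # (M, d)
    new_X = softmax(X @ new_mem.T / np.sqrt(d)) @ new_mem              # (L, d)
    return new_X, new_mem

# Toy usage: L = 1024 tokens routed through M = 8 memory slots.
rng = np.random.default_rng(0)
X, Mem = rng.normal(size=(1024, 16)), rng.normal(size=(8, 16))
out, mem = global_memory_attention(X, Mem)
print(out.shape, mem.shape)  # (1024, 16) (8, 16)
```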