Transformers are Multi-State RNNs
- URL: http://arxiv.org/abs/2401.06104v2
- Date: Tue, 18 Jun 2024 09:16:14 GMT
- Title: Transformers are Multi-State RNNs
- Authors: Matanel Oren, Michael Hassid, Nir Yarden, Yossi Adi, Roy Schwartz
- Abstract summary: We show that decoder-only transformers can be conceptualized as unbounded multi-state RNNs.
Transformers can be converted into $\textit{bounded}$ multi-state RNNs by fixing the size of their hidden state.
We introduce a novel, training-free compression policy - $\textbf{T}$oken $\textbf{O}$mission $\textbf{V}$ia $\textbf{A}$ttention (TOVA).
- Score: 25.99353771107789
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Transformers are considered conceptually different from the previous generation of state-of-the-art NLP models - recurrent neural networks (RNNs). In this work, we demonstrate that decoder-only transformers can in fact be conceptualized as unbounded multi-state RNNs - an RNN variant with unlimited hidden state size. We further show that transformers can be converted into $\textit{bounded}$ multi-state RNNs by fixing the size of their hidden state, effectively compressing their key-value cache. We introduce a novel, training-free compression policy - $\textbf{T}$oken $\textbf{O}$mission $\textbf{V}$ia $\textbf{A}$ttention (TOVA). Our experiments with four long range tasks and several LLMs show that TOVA outperforms several baseline compression policies. Particularly, our results are nearly on par with the full model, using in some cases only $\frac{1}{8}$ of the original cache size, which translates to 4.8X higher throughput. Our results shed light on the connection between transformers and RNNs, and help mitigate one of LLMs' most painful computational bottlenecks - the size of their key-value cache. We publicly release our code at https://github.com/schwartz-lab-NLP/TOVA
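As a rough, single-head illustration of the TOVA policy described in the abstract (not the authors' released implementation, which operates per layer and averages attention weights over heads), the sketch below evicts the least-attended cached token whenever the key-value cache exceeds a fixed budget, so the cache behaves like the fixed-size state of a bounded multi-state RNN:

```python
import numpy as np

def tova_evict(keys, values, query, cache_limit):
    """When the cache exceeds `cache_limit`, drop the cached token that
    receives the lowest attention weight from the current query."""
    if keys.shape[0] <= cache_limit:
        return keys, values
    d = keys.shape[-1]
    scores = keys @ query / np.sqrt(d)          # (cache_len,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                    # softmax attention weights
    drop = int(np.argmin(weights))              # least-attended cached token
    keep = np.delete(np.arange(keys.shape[0]), drop)
    return keys[keep], values[keep]

# Toy usage: grow a cache token by token, keeping at most 8 states.
rng = np.random.default_rng(0)
keys, values = np.empty((0, 16)), np.empty((0, 16))
for _ in range(32):
    k, v, q = rng.normal(size=(3, 16))
    keys, values = np.vstack([keys, k]), np.vstack([values, v])
    keys, values = tova_evict(keys, values, q, cache_limit=8)
print(keys.shape)  # (8, 16): the fixed-size "multi-state" of a bounded MSRNN
```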
Related papers
- On the Role of Depth and Looping for In-Context Learning with Task Diversity [69.4145579827826]
We study in-context learning for linear regression with diverse tasks.
We show that multilayer Transformers are not robust even to distributional shifts as small as $O(e^{-L})$ in Wasserstein distance.
arXiv Detail & Related papers (2024-10-29T03:27:56Z) - Efficient k-Nearest-Neighbor Machine Translation with Dynamic Retrieval [49.825549809652436]
$k$NN-MT constructs an external datastore to store domain-specific translation knowledge.
Adaptive retrieval ($k$NN-MT-AR) dynamically estimates $\lambda$ and skips $k$NN retrieval if $\lambda$ is less than a fixed threshold.
We propose dynamic retrieval ($k$NN-MT-DR) that significantly extends vanilla $k$NN-MT in two aspects.
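The gating idea behind adaptive retrieval can be pictured with a short sketch. Everything below is illustrative: `adaptive_knn_step`, the `estimate_lambda` callable, the brute-force distance search (a real system would use an ANN index such as FAISS), and the softmax temperature are stand-ins, not the paper's implementation.

```python
import numpy as np

def adaptive_knn_step(p_nmt, hidden, ds_keys, ds_vals, estimate_lambda,
                      k=8, tau=0.2, temperature=10.0):
    """One decoding step with adaptive retrieval: if the predicted
    interpolation weight lambda falls below the threshold tau, the
    (expensive) kNN lookup is skipped and the NMT distribution is used."""
    lam = estimate_lambda(hidden)
    if lam < tau:
        return p_nmt                               # retrieval skipped
    d2 = ((ds_keys - hidden) ** 2).sum(axis=1)     # brute-force distances
    nn = np.argsort(d2)[:k]                        # k nearest datastore entries
    w = np.exp(-d2[nn] / temperature)
    p_knn = np.zeros_like(p_nmt)
    np.add.at(p_knn, ds_vals[nn], w)               # scatter weights onto target tokens
    p_knn /= p_knn.sum()
    return lam * p_knn + (1.0 - lam) * p_nmt       # standard kNN-MT interpolation

# Toy usage with a random datastore and a constant lambda estimator.
rng = np.random.default_rng(0)
p = adaptive_knn_step(np.full(100, 0.01), rng.normal(size=32),
                      rng.normal(size=(1000, 32)), rng.integers(0, 100, 1000),
                      estimate_lambda=lambda h: 0.5)
print(round(p.sum(), 6))  # ~1.0
```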
arXiv Detail & Related papers (2024-06-10T07:36:55Z) - Unlimiformer: Long-Range Transformers with Unlimited Length Input [67.04942180004805]
Unlimiformer is a general approach that wraps any existing pretrained encoder-decoder transformer.
It offloads the cross-attention computation to a single k-nearest-neighbor (kNN) index.
We show that Unlimiformer can process even 500k token-long inputs from the BookSum dataset, without any input truncation at test time.
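A minimal sketch of the retrieval-based cross-attention idea follows; `knn_cross_attention` is a hypothetical single-head helper with brute-force search standing in for the kNN index, whereas Unlimiformer's actual implementation applies head-specific projections and queries a FAISS index.

```python
import numpy as np

def knn_cross_attention(query, index_keys, index_values, k=16):
    """Instead of attending over the full (possibly 500k-token) encoder
    output, each decoder query retrieves its top-k nearest encoder keys
    and attends only over that small set."""
    d = index_keys.shape[-1]
    sims = index_keys @ query                  # retrieval by inner product
    top = np.argsort(-sims)[:k]                # top-k encoder positions
    scores = sims[top] / np.sqrt(d)
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ index_values[top]               # attention output over k states

rng = np.random.default_rng(0)
enc = rng.normal(size=(100_000, 64))           # long encoder output, stored once
out = knn_cross_attention(rng.normal(size=64), enc, enc, k=16)
print(out.shape)                               # (64,)
```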
arXiv Detail & Related papers (2023-05-02T17:35:08Z) - What's Hidden in a One-layer Randomly Weighted Transformer? [100.98342094831334]
Hidden within one-layer randomly weighted neural networks, there exist subnetworks that can achieve impressive performance.
Using a fixed pre-trained embedding layer, the previously found subnetworks are smaller than, but can match 98%/92% (34.14/25.24 BLEU) of the performance of, a trained Transformer small/base on IWSLT14/WMT14.
arXiv Detail & Related papers (2021-09-08T21:22:52Z) - Escaping the Big Data Paradigm with Compact Transformers [7.697698018200631]
We show for the first time that with the right size and tokenization, transformers can perform head-to-head with state-of-the-art CNNs on small datasets.
Our method is flexible in terms of model size, and can have as few as 0.28M parameters while still achieving reasonable results.
arXiv Detail & Related papers (2021-04-12T17:58:56Z) - Incorporating Convolution Designs into Visual Transformers [24.562955955312187]
We propose a new Convolution-enhanced image Transformer (CeiT), which combines the advantages of CNNs in extracting low-level features and strengthening locality with the advantages of Transformers in establishing long-range dependencies.
Experimental results on ImageNet and seven downstream tasks show the effectiveness and generalization ability of CeiT compared with previous Transformers and state-of-the-art CNNs, without requiring a large amount of training data and extra CNN teachers.
arXiv Detail & Related papers (2021-03-22T13:16:12Z) - Masked Contrastive Representation Learning for Reinforcement Learning [202.8261654227565]
CURL, which uses contrastive learning to extract high-level features from raw pixels of individual video frames, is an efficient algorithm.
We propose a new algorithm, masked contrastive representation learning for RL, that takes the correlation among consecutive inputs into consideration.
Our method achieves consistent improvements over CURL on $14$ out of $16$ environments from DMControl suite and $21$ out of $26$ environments from Atari 2600 Games.
arXiv Detail & Related papers (2020-10-15T02:00:10Z) - Improving Network Slimming with Nonconvex Regularization [8.017631543721684]
Convolutional neural networks (CNNs) have developed into powerful models for various computer vision tasks.
Most state-of-the-art CNNs cannot be deployed directly.
A straightforward approach to compressing CNNs is proposed.
arXiv Detail & Related papers (2020-10-03T01:04:02Z) - $O(n)$ Connections are Expressive Enough: Universal Approximability of Sparse Transformers [71.31712741938837]
We show that sparse Transformers with only $O(n)$ connections per attention layer can approximate the same function class as the dense model with $n^2$ connections.
We also present experiments comparing different patterns/levels of sparsity on standard NLP tasks.
arXiv Detail & Related papers (2020-06-08T18:30:12Z)
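To make the $O(n)$ claim concrete, here is a sketch of one possible sparsity pattern: a fixed local window plus a few global tokens gives each layer a number of connections linear in $n$. This only illustrates the kind of pattern involved, not the paper's exact construction.

```python
import numpy as np

def sparse_attention_mask(n, window=2, n_global=1):
    """Boolean attention mask with O(n) allowed connections:
    each token attends to a local window plus a few global tokens."""
    mask = np.zeros((n, n), dtype=bool)
    for i in range(n):
        lo, hi = max(0, i - window), min(n, i + window + 1)
        mask[i, lo:hi] = True          # local sliding window
    mask[:, :n_global] = True          # every token sees the global tokens
    mask[:n_global, :] = True          # global tokens see every token
    return mask

m = sparse_attention_mask(16)
print(m.sum(), "connections vs", 16 * 16, "for dense attention")
```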
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences of its use.