Transformers are Multi-State RNNs
- URL: http://arxiv.org/abs/2401.06104v2
- Date: Tue, 18 Jun 2024 09:16:14 GMT
- Title: Transformers are Multi-State RNNs
- Authors: Matanel Oren, Michael Hassid, Nir Yarden, Yossi Adi, Roy Schwartz
- Abstract summary: We show that decoder-only transformers can be conceptualized as unbounded multi-state RNNs.
Transformers can be converted into $\textit{bounded}$ multi-state RNNs by fixing the size of their hidden state.
We introduce a novel, training-free compression policy - $\textbf{T}$oken $\textbf{O}$mission $\textbf{V}$ia $\textbf{A}$ttention (TOVA).
- Score: 25.99353771107789
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Transformers are considered conceptually different from the previous generation of state-of-the-art NLP models - recurrent neural networks (RNNs). In this work, we demonstrate that decoder-only transformers can in fact be conceptualized as unbounded multi-state RNNs - an RNN variant with unlimited hidden state size. We further show that transformers can be converted into $\textit{bounded}$ multi-state RNNs by fixing the size of their hidden state, effectively compressing their key-value cache. We introduce a novel, training-free compression policy - $\textbf{T}$oken $\textbf{O}$mission $\textbf{V}$ia $\textbf{A}$ttention (TOVA). Our experiments with four long range tasks and several LLMs show that TOVA outperforms several baseline compression policies. Particularly, our results are nearly on par with the full model, using in some cases only $\frac{1}{8}$ of the original cache size, which translates to 4.8X higher throughput. Our results shed light on the connection between transformers and RNNs, and help mitigate one of LLMs' most painful computational bottlenecks - the size of their key-value cache. We publicly release our code at https://github.com/schwartz-lab-NLP/TOVA
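As a rough, single-head illustration of the TOVA policy described in the abstract (not the authors' released implementation, which operates per layer and averages attention weights over heads), the sketch below evicts the least-attended cached token whenever the key-value cache exceeds a fixed budget, so the cache behaves like the fixed-size state of a bounded multi-state RNN:

```python
import numpy as np

def tova_evict(keys, values, query, cache_limit):
    """When the cache exceeds `cache_limit`, drop the cached token that
    receives the lowest attention weight from the current query."""
    if keys.shape[0] <= cache_limit:
        return keys, values
    d = keys.shape[-1]
    scores = keys @ query / np.sqrt(d)          # (cache_len,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                    # softmax attention weights
    drop = int(np.argmin(weights))              # least-attended cached token
    keep = np.delete(np.arange(keys.shape[0]), drop)
    return keys[keep], values[keep]

# Toy usage: grow a cache token by token, keeping at most 8 states.
rng = np.random.default_rng(0)
keys, values = np.empty((0, 16)), np.empty((0, 16))
for _ in range(32):
    k, v, q = rng.normal(size=(3, 16))
    keys, values = np.vstack([keys, k]), np.vstack([values, v])
    keys, values = tova_evict(keys, values, q, cache_limit=8)
print(keys.shape)  # (8, 16): the fixed-size "multi-state" of a bounded MSRNN
```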
Related papers
- On the Role of Depth and Looping for In-Context Learning with Task Diversity [69.4145579827826]
We study in-context learning for linear regression with diverse tasks.
We show that multilayer Transformers are not robust even to distributional shifts as small as $O(e^{-L})$ in Wasserstein distance.
arXiv Detail & Related papers (2024-10-29T03:27:56Z) - Efficient k-Nearest-Neighbor Machine Translation with Dynamic Retrieval [49.825549809652436]
$k$NN-MT constructs an external datastore to store domain-specific translation knowledge.
Adaptive retrieval ($k$NN-MT-AR) dynamically estimates $\lambda$ and skips $k$NN retrieval if $\lambda$ is less than a fixed threshold.
We propose dynamic retrieval ($k$NN-MT-DR) that significantly extends vanilla $k$NN-MT in two aspects.
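The gating idea behind adaptive retrieval can be pictured with a short sketch. Everything below is illustrative: `adaptive_knn_step`, the `estimate_lambda` callable, the brute-force distance search (a real system would use an ANN index such as FAISS), and the softmax temperature are stand-ins, not the paper's implementation.

```python
import numpy as np

def adaptive_knn_step(p_nmt, hidden, ds_keys, ds_vals, estimate_lambda,
                      k=8, tau=0.2, temperature=10.0):
    """One decoding step with adaptive retrieval: if the predicted
    interpolation weight lambda falls below the threshold tau, the
    (expensive) kNN lookup is skipped and the NMT distribution is used."""
    lam = estimate_lambda(hidden)
    if lam < tau:
        return p_nmt                               # retrieval skipped
    d2 = ((ds_keys - hidden) ** 2).sum(axis=1)     # brute-force distances
    nn = np.argsort(d2)[:k]                        # k nearest datastore entries
    w = np.exp(-d2[nn] / temperature)
    p_knn = np.zeros_like(p_nmt)
    np.add.at(p_knn, ds_vals[nn], w)               # scatter weights onto target tokens
    p_knn /= p_knn.sum()
    return lam * p_knn + (1.0 - lam) * p_nmt       # standard kNN-MT interpolation

# Toy usage with a random datastore and a constant lambda estimator.
rng = np.random.default_rng(0)
p = adaptive_knn_step(np.full(100, 0.01), rng.normal(size=32),
                      rng.normal(size=(1000, 32)), rng.integers(0, 100, 1000),
                      estimate_lambda=lambda h: 0.5)
print(round(p.sum(), 6))  # ~1.0
```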
arXiv Detail & Related papers (2024-06-10T07:36:55Z) - Unlimiformer: Long-Range Transformers with Unlimited Length Input [67.04942180004805]
Unlimiformer is a general approach that wraps any existing pretrained encoder-decoder transformer.
It offloads the cross-attention computation to a single k-nearest-neighbor (kNN) index.
We show that Unlimiformer can process even 500k token-long inputs from the BookSum dataset, without any input truncation at test time.
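A minimal sketch of the retrieval-based cross-attention idea follows; `knn_cross_attention` is a hypothetical single-head helper with brute-force search standing in for the kNN index, whereas Unlimiformer's actual implementation applies head-specific projections and queries a FAISS index.

```python
import numpy as np

def knn_cross_attention(query, index_keys, index_values, k=16):
    """Instead of attending over the full (possibly 500k-token) encoder
    output, each decoder query retrieves its top-k nearest encoder keys
    and attends only over that small set."""
    d = index_keys.shape[-1]
    sims = index_keys @ query                  # retrieval by inner product
    top = np.argsort(-sims)[:k]                # top-k encoder positions
    scores = sims[top] / np.sqrt(d)
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ index_values[top]               # attention output over k states

rng = np.random.default_rng(0)
enc = rng.normal(size=(100_000, 64))           # long encoder output, stored once
out = knn_cross_attention(rng.normal(size=64), enc, enc, k=16)
print(out.shape)                               # (64,)
```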
arXiv Detail & Related papers (2023-05-02T17:35:08Z) - What's Hidden in a One-layer Randomly Weighted Transformer? [100.98342094831334]
Hidden within one-layer randomly weighted neural networks, there exist subnetworks that can achieve impressive performance.
Using a fixed pre-trained embedding layer, the previously found subnetworks are smaller than, but can match 98%/92% (34.14/25.24 BLEU) of the performance of, a trained Transformer small/base on IWSLT14/WMT14.
arXiv Detail & Related papers (2021-09-08T21:22:52Z) - Escaping the Big Data Paradigm with Compact Transformers [7.697698018200631]
We show for the first time that with the right size and tokenization, transformers can perform head-to-head with state-of-the-art CNNs on small datasets.
Our method is flexible in terms of model size, and can have as few as 0.28M parameters while still achieving reasonable results.
arXiv Detail & Related papers (2021-04-12T17:58:56Z) - Incorporating Convolution Designs into Visual Transformers [24.562955955312187]
We propose a new Convolution-enhanced image Transformer (CeiT), which combines the advantages of CNNs in extracting low-level features and strengthening locality with the advantages of Transformers in establishing long-range dependencies.
Experimental results on ImageNet and seven downstream tasks show the effectiveness and generalization ability of CeiT compared with previous Transformers and state-of-the-art CNNs, without requiring a large amount of training data and extra CNN teachers.
arXiv Detail & Related papers (2021-03-22T13:16:12Z) - Masked Contrastive Representation Learning for Reinforcement Learning [202.8261654227565]
CURL, which uses contrastive learning to extract high-level features from raw pixels of individual video frames, is an efficient algorithm.
We propose a new algorithm, masked contrastive representation learning for RL, that takes the correlation among consecutive inputs into consideration.
Our method achieves consistent improvements over CURL on $14$ out of $16$ environments from DMControl suite and $21$ out of $26$ environments from Atari 2600 Games.
arXiv Detail & Related papers (2020-10-15T02:00:10Z) - Improving Network Slimming with Nonconvex Regularization [8.017631543721684]
Convolutional neural networks (CNNs) have developed into powerful models for various computer vision tasks.
Most state-of-the-art CNNs cannot be deployed directly.
A straightforward approach to compressing CNNs is proposed.
arXiv Detail & Related papers (2020-10-03T01:04:02Z) - $O(n)$ Connections are Expressive Enough: Universal Approximability of Sparse Transformers [71.31712741938837]
We show that sparse Transformers with only $O(n)$ connections per attention layer can approximate the same function class as the dense model with $n^2$ connections.
We also present experiments comparing different patterns/levels of sparsity on standard NLP tasks.
arXiv Detail & Related papers (2020-06-08T18:30:12Z)
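To make the $O(n)$ claim concrete, here is a sketch of one possible sparsity pattern: a fixed local window plus a few global tokens gives each layer a number of connections linear in $n$. This only illustrates the kind of pattern involved, not the paper's exact construction.

```python
import numpy as np

def sparse_attention_mask(n, window=2, n_global=1):
    """Boolean attention mask with O(n) allowed connections:
    each token attends to a local window plus a few global tokens."""
    mask = np.zeros((n, n), dtype=bool)
    for i in range(n):
        lo, hi = max(0, i - window), min(n, i + window + 1)
        mask[i, lo:hi] = True          # local sliding window
    mask[:, :n_global] = True          # every token sees the global tokens
    mask[:n_global, :] = True          # global tokens see every token
    return mask

m = sparse_attention_mask(16)
print(m.sum(), "connections vs", 16 * 16, "for dense attention")
```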
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences of its use.