RWKV: Reinventing RNNs for the Transformer Era
- URL: http://arxiv.org/abs/2305.13048v2
- Date: Mon, 11 Dec 2023 03:58:56 GMT
- Title: RWKV: Reinventing RNNs for the Transformer Era
- Authors: Bo Peng, Eric Alcaide, Quentin Anthony, Alon Albalak, Samuel
Arcadinho, Stella Biderman, Huanqi Cao, Xin Cheng, Michael Chung, Matteo
Grella, Kranthi Kiran GV, Xuzheng He, Haowen Hou, Jiaju Lin, Przemyslaw
Kazienko, Jan Kocon, Jiaming Kong, Bartlomiej Koptyra, Hayden Lau, Krishna
Sri Ipsit Mantri, Ferdinand Mom, Atsushi Saito, Guangyu Song, Xiangru Tang,
Bolun Wang, Johan S. Wind, Stanislaw Wozniak, Ruichong Zhang, Zhenyuan Zhang,
Qihang Zhao, Peng Zhou, Qinghua Zhou, Jian Zhu, Rui-Jie Zhu
- Abstract summary: We propose a novel model architecture that combines the efficient parallelizable training of transformers with the efficient inference of RNNs.
We scale our models as large as 14 billion parameters, by far the largest dense RNN ever trained, and find RWKV performs on par with similarly sized Transformers.
- Score: 54.716108899349614
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Transformers have revolutionized almost all natural language processing (NLP)
tasks but suffer from memory and computational complexity that scales
quadratically with sequence length. In contrast, recurrent neural networks
(RNNs) exhibit linear scaling in memory and computational requirements but
struggle to match the same performance as Transformers due to limitations in
parallelization and scalability. We propose a novel model architecture,
Receptance Weighted Key Value (RWKV), that combines the efficient
parallelizable training of transformers with the efficient inference of RNNs.
Our approach leverages a linear attention mechanism and allows us to
formulate the model as either a Transformer or an RNN, thus parallelizing
computations during training while maintaining constant computational and memory
complexity during inference. We scale our models as large as 14 billion
parameters, by far the largest dense RNN ever trained, and find RWKV performs
on par with similarly sized Transformers, suggesting future work can leverage
this architecture to create more efficient models. This work presents a
significant step towards reconciling trade-offs between computational
efficiency and model performance in sequence processing tasks.
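To make the constant-memory inference claim concrete, the following is a minimal NumPy sketch of the WKV recurrence at the core of RWKV's time-mixing block. Names are illustrative, and the released implementation additionally tracks a running maximum exponent for numerical stability, which this sketch omits:

```python
import numpy as np

def wkv_recurrent(k, v, w, u):
    """Naive per-step WKV recurrence (RWKV time mixing).

    k, v : (T, C) arrays of keys and values for one sequence.
    w    : (C,) positive per-channel decay rate.
    u    : (C,) per-channel bonus applied to the current token.
    Returns a (T, C) array computed with O(C) state, independent of T.
    """
    T, C = k.shape
    num = np.zeros(C)   # running exp-weighted sum of past values
    den = np.zeros(C)   # running sum of the corresponding weights
    out = np.empty((T, C))
    decay = np.exp(-w)  # e^{-w}: how fast past contributions fade
    for t in range(T):
        e_cur = np.exp(u + k[t])                 # current-token weight
        out[t] = (num + e_cur * v[t]) / (den + e_cur)
        num = decay * num + np.exp(k[t]) * v[t]  # fold token t into state
        den = decay * den + np.exp(k[t])
    return out
```

Each step touches only an O(C) state, so generating a token costs the same regardless of how long the context already is.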
Related papers
- Autoregressive + Chain of Thought = Recurrent: Recurrence's Role in Language Models' Computability and a Revisit of Recurrent Transformer [29.970200877158764] (2024-09-14)
We investigate the influence of recurrent structures in neural models on their reasoning abilities and computability.
We shed light on how the CoT approach can mimic recurrent computation and act as a bridge between autoregression and recurrence.
- Separations in the Representational Capabilities of Transformers and Recurrent Architectures [27.783705012503237] (2024-06-13)
We analyze the differences in the representational capabilities of Transformers and RNNs across several tasks of practical relevance.
We show that a one-layer Transformer of logarithmic width can perform index lookup, whereas an RNN requires a hidden state of linear size.
We also show that a log-size two-layer Transformer can implement the nearest neighbor algorithm in its forward pass.
- Attention as an RNN [66.5420926480473] (2024-05-22)
We show that attention can be viewed as a special recurrent neural network (RNN) with the ability to compute its many-to-one RNN output efficiently.
We introduce a new efficient method of computing attention's many-to-many RNN output based on the parallel prefix scan algorithm.
We show that Aarens, the resulting attention-as-RNN modules, achieve performance comparable to Transformers on 38 datasets spread across four popular sequential problem settings.
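For intuition about the many-to-one view, softmax attention for a single query can be evaluated with a constant-size recurrent state. A minimal sketch under our own naming, using the standard running-max (online softmax) trick for numerical stability:

```python
import numpy as np

def attention_as_rnn(q, K, V):
    """Softmax attention for one query, evaluated as a many-to-one RNN.

    q : (d,) query; K, V : (T, d) keys and values.
    Carries a constant-size state (m, num, den) instead of the full
    T-length attention row; m is a running max for numerical stability.
    """
    d = q.shape[0]
    m, den = -np.inf, 0.0        # running max logit and normalizer
    num = np.zeros(d)            # running weighted sum of values
    for k_t, v_t in zip(K, V):
        s = q @ k_t / np.sqrt(d)      # scaled dot-product logit
        m_new = max(m, s)
        scale = np.exp(m - m_new)     # rescale old state to new max
        num = num * scale + np.exp(s - m_new) * v_t
        den = den * scale + np.exp(s - m_new)
        m = m_new
    return num / den             # equals softmax(q K^T / sqrt(d)) @ V
```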
- Blockwise Parallel Transformer for Large Context Models [70.97386897478238] (2023-05-30)
Blockwise Parallel Transformer (BPT) computes self-attention blockwise and fuses it with the feedforward network to minimize memory costs.
By processing longer input sequences while maintaining memory efficiency, BPT enables training sequences 32 times longer than vanilla Transformers and up to 4 times longer than previous memory-efficient methods.
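A minimal sketch of the blockwise-attention half of this idea (BPT additionally fuses the feedforward network into the same blockwise loop, which is omitted here; names and the non-causal setting are our simplifications):

```python
import numpy as np

def blockwise_attention(Q, K, V, block=64):
    """Non-causal attention computed block by block, so the full (T, T)
    score matrix is never materialized.

    Q, K, V : (T, d) float arrays. Each query block keeps an
    online-softmax accumulator while streaming over key/value blocks.
    """
    T, d = Q.shape
    out = np.empty((T, d))
    for qs in range(0, T, block):
        q = Q[qs:qs + block]
        m = np.full(len(q), -np.inf)   # running row-wise max logits
        num = np.zeros_like(q)         # running weighted value sums
        den = np.zeros(len(q))         # running normalizers
        for ks in range(0, T, block):
            s = q @ K[ks:ks + block].T / np.sqrt(d)
            m_new = np.maximum(m, s.max(axis=1))
            scale = np.exp(m - m_new)  # rescale old state to new max
            p = np.exp(s - m_new[:, None])
            num = num * scale[:, None] + p @ V[ks:ks + block]
            den = den * scale + p.sum(axis=1)
            m = m_new
        out[qs:qs + block] = num / den[:, None]
    return out
```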
- Intelligence Processing Units Accelerate Neuromorphic Learning [52.952192990802345] (2022-11-19)
Spiking neural networks (SNNs) have achieved orders-of-magnitude improvements in energy consumption and latency.
We present an IPU-optimized release of our custom SNN Python package, snnTorch.
- Stable, Fast and Accurate: Kernelized Attention with Relative Positional Encoding [63.539333383965726] (2021-06-23)
We propose a novel way to accelerate attention calculation for Transformers with relative positional encoding (RPE).
Based upon the observation that relative positional encoding forms a Toeplitz matrix, we mathematically show that kernelized attention with RPE can be calculated efficiently using the Fast Fourier Transform (FFT).
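The enabling step, multiplying by a Toeplitz matrix in O(n log n) time via circulant embedding and the FFT, can be sketched generically (a NumPy illustration of the standard technique, not the paper's implementation):

```python
import numpy as np

def toeplitz_matvec_fft(c, r, x):
    """Multiply an n x n Toeplitz matrix by a vector in O(n log n).

    c : (n,) first column; r : (n,) first row (r[0] must equal c[0]).
    The Toeplitz matrix is embedded in a (2n-1)-circulant, whose
    matvec is a circular convolution, computed with the FFT.
    """
    n = len(c)
    a = np.concatenate([c, r[1:][::-1]])      # circulant's first column
    x_pad = np.concatenate([x, np.zeros(n - 1)])
    y = np.fft.ifft(np.fft.fft(a) * np.fft.fft(x_pad))
    return y[:n].real

# Quick check against the dense product:
n = 5
c, r, x = np.random.randn(n), np.random.randn(n), np.random.randn(n)
r[0] = c[0]
T = np.array([[c[i - j] if i >= j else r[j - i] for j in range(n)]
              for i in range(n)])
assert np.allclose(T @ x, toeplitz_matvec_fft(c, r, x))
```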
- Finetuning Pretrained Transformers into RNNs [81.72974646901136] (2021-03-24)
Transformers have outperformed recurrent neural networks (RNNs) in natural language generation.
A linear-complexity recurrent variant has proven well suited for autoregressive generation.
This work aims to convert a pretrained transformer into its efficient recurrent counterpart.
- Attention is All You Need in Speech Separation [12.57578429586883] (2020-10-25)
We propose a novel RNN-free Transformer-based neural network for speech separation.
The proposed model achieves state-of-the-art (SOTA) performance on the standard WSJ0-2/3mix datasets.
This list is automatically generated from the titles and abstracts of the papers on this site.