Why Are Linear RNNs More Parallelizable?
- URL: http://arxiv.org/abs/2603.03612v2
- Date: Thu, 05 Mar 2026 05:58:29 GMT
- Title: Why Are Linear RNNs More Parallelizable?
- Authors: William Merrill, Hongjian Jiang, Yanhong Li, Anthony Lin, Ashish Sabharwal
- Abstract summary: We show that LRNNs can be viewed as log-depth arithmetic circuits, which represents only a slight depth overhead relative to the log-depth Boolean circuits that transformers admit. We also identify expressivity differences between recent popular LRNN variants: permutation-diagonal LRNNs are $\mathsf{NC}^1$-complete, whereas diagonal-plus-low-rank LRNNs are more expressive.
- Score: 34.04439983593104
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The community is increasingly exploring linear RNNs (LRNNs) as language models, motivated by their expressive power and parallelizability. While prior work establishes the expressivity benefits of LRNNs over transformers, it is unclear what makes LRNNs -- but not traditional, nonlinear RNNs -- as easy to parallelize in practice as transformers. We answer this question by providing a tight connection between types of RNNs and standard complexity classes. We show that LRNNs can be viewed as log-depth (bounded fan-in) arithmetic circuits, which represents only a slight depth overhead relative to log-depth boolean circuits that transformers admit. Furthermore, we show that nonlinear RNNs can solve $\mathsf{L}$-complete problems (and even $\mathsf{P}$-complete ones, under polynomial precision), revealing a fundamental barrier to parallelizing them as efficiently as transformers. Our theory also identifies fine-grained expressivity differences between recent popular LRNN variants: permutation-diagonal LRNNs are $\mathsf{NC}^1$-complete whereas diagonal-plus-low-rank LRNNs are more expressive ($\mathsf{PNC}^1$-complete). We provide further insight by associating each type of RNN with a corresponding automata-theoretic model that it can simulate. Together, our results reveal fundamental tradeoffs between nonlinear RNNs and different variants of LRNNs, providing a foundation for designing LLM architectures that achieve an optimal balance between expressivity and parallelism.
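The parallelism claim in the abstract rests on a simple fact: a linear recurrence such as $h_t = a_t \odot h_{t-1} + b_t$ composes affine maps, and affine composition is associative, so all hidden states can be produced by a log-depth parallel prefix scan rather than a length-$T$ sequential loop. The following is a minimal JAX sketch of that idea for a diagonal LRNN; it is an illustration only (the function names, shapes, and random gating are assumptions, not code from the paper), but it shows the associative combine that makes log-depth evaluation possible.

```python
# Minimal illustrative sketch (not code from the paper; names, shapes, and the random
# gating below are assumptions): a diagonal linear RNN step h_t = a_t * h_{t-1} + b_t
# composes affine maps, and affine composition is associative, so every hidden state
# can be produced with a log-depth parallel prefix scan instead of a length-T loop.
import jax
import jax.numpy as jnp

def compose_affine(left, right):
    """Compose two elementwise affine maps (a, b): x -> a * x + b, left applied first."""
    a_l, b_l = left
    a_r, b_r = right
    return a_r * a_l, a_r * b_l + b_r

def diagonal_lrnn_scan(a, b, h0):
    """All hidden states of h_t = a_t * h_{t-1} + b_t via an associative (log-depth) scan."""
    a_cum, b_cum = jax.lax.associative_scan(compose_affine, (a, b), axis=0)
    return a_cum * h0 + b_cum                      # shape (T, d)

def diagonal_lrnn_sequential(a, b, h0):
    """Reference sequential loop: same values, but depth T rather than O(log T)."""
    h, out = h0, []
    for t in range(a.shape[0]):
        h = a[t] * h + b[t]
        out.append(h)
    return jnp.stack(out)

T, d = 128, 4
a = jax.random.uniform(jax.random.PRNGKey(0), (T, d), minval=-1.0, maxval=1.0)
b = jax.random.normal(jax.random.PRNGKey(1), (T, d))
h0 = jnp.zeros((d,))
assert jnp.allclose(diagonal_lrnn_scan(a, b, h0), diagonal_lrnn_sequential(a, b, h0), atol=1e-5)
```

A nonlinear RNN does not admit such a combine, since composing nonlinear step functions generally has no compact closed form, which is the intuition behind the paper's hardness results for nonlinear RNNs.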
Related papers
- ParaRNN: Unlocking Parallel Training of Nonlinear RNNs for Large Language Models [9.107447466062409]
ParaRNN is a framework that breaks the sequence-parallelization barrier for nonlinear RNNs. Our implementation achieves speedups of up to 665x over sequential application. ParaRNN is released as an open-source framework for automatic training-parallelization of nonlinear RNNs.
arXiv Detail & Related papers (2025-10-24T13:28:33Z)
- Fixed-Point RNNs: Interpolating from Diagonal to Dense [18.06917701940596]
Linear recurrent neural networks (RNNs) and state-space models (SSMs) have become promising alternatives to softmax attention as sequence-mixing layers in Transformer architectures. Current models, however, do not exhibit the full state-tracking expressivity of RNNs because they rely on channel-wise (i.e., diagonal) sequence mixing. In this paper, we investigate parameterizations of a large class of dense linear RNNs as fixed points of parallelizable diagonal RNNs.
arXiv Detail & Related papers (2025-03-13T18:50:22Z)
- Unlocking State-Tracking in Linear RNNs Through Negative Eigenvalues [65.41946981594567]
Mamba, RWKV, GLA, mLSTM, and DeltaNet have emerged as efficient alternatives to Transformers for long sequences. However, both Transformers and LRNNs struggle to perform state tracking, which may impair performance on tasks such as code evaluation. We show that extending the eigenvalue range of Mamba and DeltaNet to include negative values can improve their performance on state-tracking tasks (a toy parity sketch illustrating this point appears after this list).
arXiv Detail & Related papers (2024-11-19T14:35:38Z)
- Attention as an RNN [66.5420926480473]
We show that attention can be viewed as a special Recurrent Neural Network (RNN) with the ability to compute its many-to-one RNN output efficiently.
We introduce a new efficient method of computing attention's many-to-many RNN output based on the parallel prefix scan algorithm (a generic sketch of such a scan appears after this list).
We show Aarens achieve comparable performance to Transformers on $38$ datasets spread across four popular sequential problem settings.
arXiv Detail & Related papers (2024-05-22T19:45:01Z)
- On Efficiently Representing Regular Languages as RNNs [49.88310438099143]
We show that RNNs can efficiently represent bounded hierarchical structures that are prevalent in human language.
This suggests that RNNs' success might be linked to their ability to model hierarchy.
We show that RNNs can efficiently represent a larger class of LMs than previously claimed.
arXiv Detail & Related papers (2024-02-24T13:42:06Z)
- Advancing Regular Language Reasoning in Linear Recurrent Neural Networks [56.11830645258106]
We study whether linear recurrent neural networks (LRNNs) can learn the hidden rules in training sequences.
We propose a new LRNN equipped with a block-diagonal and input-dependent transition matrix.
Experiments suggest that the proposed model is the only LRNN capable of performing length extrapolation on regular language tasks.
arXiv Detail & Related papers (2023-09-14T03:36:01Z)
- RWKV: Reinventing RNNs for the Transformer Era [54.716108899349614]
We propose a novel model architecture that combines the efficient parallelizable training of transformers with the efficient inference of RNNs.
We scale our models as large as 14 billion parameters, by far the largest dense RNN ever trained, and find RWKV performs on par with similarly sized Transformers.
arXiv Detail & Related papers (2023-05-22T13:57:41Z)
- Adaptive-saturated RNN: Remember more with less instability [2.191505742658975]
This work proposes Adaptive-Saturated RNNs (asRNN), a variant that dynamically adjusts its saturation level between the two approaches.
Our experiments show encouraging results of asRNN on challenging sequence learning benchmarks compared to several strong competitors.
arXiv Detail & Related papers (2023-04-24T02:28:03Z)
- Exploiting Low-Rank Tensor-Train Deep Neural Networks Based on Riemannian Gradient Descent With Illustrations of Speech Processing [74.31472195046099]
We exploit a low-rank tensor-train deep neural network (TT-DNN) to build an end-to-end deep learning pipeline, namely LR-TT-DNN.
A hybrid model combining LR-TT-DNN with a convolutional neural network (CNN) is set up to boost the performance.
Our empirical evidence demonstrates that the LR-TT-DNN and CNN+(LR-TT-DNN) models with fewer model parameters can outperform the TT-DNN and CNN+(TT-DNN) counterparts.
arXiv Detail & Related papers (2022-03-11T15:55:34Z)
- Recurrent Neural Network from Adder's Perspective: Carry-lookahead RNN [9.20540910698296]
We discuss the similarities between recurrent neural networks (RNNs) and serial adders.
Inspired by the carry-lookahead adder, we introduce a carry-lookahead module into the RNN, which makes it possible for the RNN to run in parallel.
arXiv Detail & Related papers (2021-06-22T12:28:33Z)
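Two of the entries above describe mechanisms that are easy to illustrate. First, for "Unlocking State-Tracking in Linear RNNs Through Negative Eigenvalues": once a diagonal transition value may be negative, even a one-dimensional linear RNN can track the parity of a bit stream (a canonical state-tracking task) in the sign of its hidden state, while remaining computable by a parallel scan. The following toy sketch is my own construction under that reading, not code from the paper.

```python
# Toy sketch (my own construction, not from the paper): a 1-D diagonal linear RNN with
# transition values in {-1, +1} tracks parity, and is still computable by a parallel scan.
import jax
import jax.numpy as jnp

def parity_via_negative_eigenvalues(bits):
    """bits: array of 0/1 ints. Returns the running parity (0 = even, 1 = odd)."""
    a = jnp.where(bits == 1, -1.0, 1.0)              # transition a_t; h_t = a_t * h_{t-1}, h_0 = 1
    h = jax.lax.associative_scan(jnp.multiply, a)    # cumulative products in O(log T) depth
    return (h < 0).astype(jnp.int32)                 # sign of h encodes the parity

print(parity_via_negative_eigenvalues(jnp.array([1, 0, 1, 1, 0, 1])))  # [1 1 0 1 1 0]
```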
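Second, for "Attention as an RNN": the prefix-scan idea can be sketched generically. Causal softmax attention for a single fixed query can be accumulated with an associative combine over a running maximum, a weighted numerator, and a denominator (the standard streaming-softmax trick). The sketch below assumes a fixed query vector q and illustrative shapes; it is not claimed to match the paper's exact Aaren formulation.

```python
# Hedged sketch (generic streaming-softmax scan, not necessarily the paper's exact method):
# prefix causal attention o_t = sum_{i<=t} softmax(q . k_i) v_i for one fixed query q.
import jax
import jax.numpy as jnp

def merge(left, right):
    """Associatively merge two partial softmax summaries (max, numerator, denominator)."""
    m_l, num_l, den_l = left
    m_r, num_r, den_r = right
    m = jnp.maximum(m_l, m_r)
    num = num_l * jnp.exp(m_l - m) + num_r * jnp.exp(m_r - m)
    den = den_l * jnp.exp(m_l - m) + den_r * jnp.exp(m_r - m)
    return m, num, den

def causal_attention_scan(q, K, V):
    """Attention output at every prefix, computed with a parallel prefix scan."""
    s = K @ q                                               # scores, shape (T,)
    leaves = (s[:, None], V, jnp.ones((s.shape[0], 1)))     # per-step summaries
    _, num, den = jax.lax.associative_scan(merge, leaves, axis=0)
    return num / den                                        # shape (T, d_v)

T, d_k, d_v = 16, 8, 8
q = jax.random.normal(jax.random.PRNGKey(0), (d_k,))
K = jax.random.normal(jax.random.PRNGKey(1), (T, d_k))
V = jax.random.normal(jax.random.PRNGKey(2), (T, d_v))
out = causal_attention_scan(q, K, V)   # out[t] attends over keys/values 0..t
```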
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.