ParaRNN: Unlocking Parallel Training of Nonlinear RNNs for Large Language Models
- URL: http://arxiv.org/abs/2510.21450v2
- Date: Mon, 03 Nov 2025 09:47:30 GMT
- Title: ParaRNN: Unlocking Parallel Training of Nonlinear RNNs for Large Language Models
- Authors: Federico Danieli, Pau Rodriguez, Miguel Sarabia, Xavier Suau, Luca Zappella,
- Abstract summary: ParaRNN is a framework that breaks the sequence-parallelization barrier for nonlinear RNNs. Our implementation achieves speedups of up to 665x over sequential application. ParaRNN is released as an open-source framework for automatic training-parallelization of nonlinear RNNs.
- Score: 9.107447466062409
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recurrent Neural Networks (RNNs) laid the foundation for sequence modeling, but their intrinsic sequential nature restricts parallel computation, creating a fundamental barrier to scaling. This has led to the dominance of parallelizable architectures like Transformers and, more recently, State Space Models (SSMs). While SSMs achieve efficient parallelization through structured linear recurrences, this linearity constraint limits their expressive power and precludes modeling complex, nonlinear sequence-wise dependencies. To address this, we present ParaRNN, a framework that breaks the sequence-parallelization barrier for nonlinear RNNs. Building on prior work, we cast the sequence of nonlinear recurrence relationships as a single system of equations, which we solve in parallel using Newton's iterations combined with custom parallel reductions. Our implementation achieves speedups of up to 665x over naive sequential application, allowing training nonlinear RNNs at unprecedented scales. To showcase this, we apply ParaRNN to adaptations of LSTM and GRU architectures, successfully training models of 7B parameters that attain perplexity comparable to similarly-sized Transformers and Mamba2 architectures. To accelerate research in efficient sequence modeling, we release the ParaRNN codebase as an open-source framework for automatic training-parallelization of nonlinear RNNs, enabling researchers and practitioners to explore new nonlinear RNN models at scale.
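The idea described in the abstract can be illustrated with a toy example. The sketch below is not the ParaRNN implementation; it assumes a hypothetical scalar cell h_t = tanh(a * h_{t-1} + x_t), and the names `step`, `combine`, `scan`, and `newton_solve` are made up for illustration. All T recurrence steps are treated as one system of equations, and each Newton iteration linearizes that system into an affine recurrence, which is resolved by a prefix scan over an associative operator. The scan here is a plain loop for readability; because the operator is associative, a parallel prefix scan could take its place.

```python
import math


def step(h_prev, x, a):
    """One step of the assumed toy cell: h_t = tanh(a * h_{t-1} + x_t)."""
    return math.tanh(a * h_prev + x)


def sequential(xs, a, h0=0.0):
    """Reference: apply the recurrence one step at a time."""
    hs, h = [], h0
    for x in xs:
        h = step(h, x, a)
        hs.append(h)
    return hs


def combine(left, right):
    """Compose two affine maps h -> j*h + b, with `left` applied first."""
    j1, b1 = left
    j2, b2 = right
    return (j2 * j1, j2 * b1 + b2)


def scan(pairs):
    """Inclusive prefix scan under `combine`.

    Written as a loop for clarity; `combine` is associative, so a
    parallel scan would produce the same result.
    """
    out, acc = [], None
    for p in pairs:
        acc = p if acc is None else combine(acc, p)
        out.append(acc)
    return out


def newton_solve(xs, a, h0=0.0, tol=1e-12):
    """Solve h_t - step(h_{t-1}, x_t) = 0 for all t at once with Newton."""
    if not xs:
        return []
    hs = [0.0] * len(xs)  # initial guess for the whole trajectory
    for _ in range(len(xs)):  # exact after at most T iterations
        prev = [h0] + hs[:-1]
        # Cell outputs and Jacobians at the current guess are
        # independent across time steps.
        f = [step(p, x, a) for p, x in zip(prev, xs)]
        jac = [a * (1.0 - v * v) for v in f]
        # Linearized update: h_t = jac_t * h_{t-1} + (f_t - jac_t * prev_t)
        pairs = [(j, fv - j * p) for j, fv, p in zip(jac, f, prev)]
        new = [j * h0 + b for j, b in scan(pairs)]
        delta = max(abs(n - o) for n, o in zip(new, hs))
        hs = new
        if delta < tol:
            break
    return hs


xs = [0.5, -1.0, 0.3, 0.8, -0.2, 1.1, -0.7, 0.0]
reference = sequential(xs, a=0.9)
solved = newton_solve(xs, a=0.9)
max_error = max(abs(s - r) for s, r in zip(solved, reference))
```

In this toy setting, `solved` agrees with `reference` up to floating-point rounding. The sequential dependency has moved out of the cell evaluation and into the scan over affine maps, which is the part that admits a parallel reduction.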
Related papers
- Improved state mixing in higher-order and block diagonal linear recurrent networks [16.116191916700554]
Linear recurrent networks (LRNNs) and linear state space models (SSMs) promise computational and memory efficiency on long-sequence modeling tasks.
Dense and nonlinear architectures (e.g., LSTMs) on the other hand are provably more expressive, but computationally costly.
Here, we explore how expressivity in LRNNs can be increased via richer state mixing across time and channels while maintaining competitive efficiency.
arXiv Detail & Related papers (2026-02-12T14:51:59Z) - PRISM: Parallel Residual Iterative Sequence Model [52.26239951489612]
We propose PRISM (Parallel Residual Iterative Sequence Model) to resolve this tension.
PRISM introduces a solver-inspired inductive bias that captures key structural properties of multi-step refinement in a parallelizable form.
We prove that this formulation achieves Rank-$L$ accumulation, structurally expanding the update manifold beyond the single-step Rank-$1$ bottleneck.
arXiv Detail & Related papers (2026-02-11T12:39:41Z) - Deep Hierarchical Learning with Nested Subspace Networks [53.71337604556311]
We propose Nested Subspace Networks (NSNs) for large neural networks.
NSNs enable a single model to be dynamically and granularly adjusted across a continuous spectrum of compute budgets.
We show that NSNs can be surgically applied to pre-trained LLMs and unlock a smooth and predictable compute-performance frontier.
arXiv Detail & Related papers (2025-09-22T15:13:14Z) - MesaNet: Sequence Modeling by Locally Optimal Test-Time Training [67.45211108321203]
We introduce a numerically stable, chunkwise parallelizable version of the recently proposed Mesa layer.
We show that optimal test-time training enables reaching lower language modeling perplexity and higher downstream benchmark performance than previous RNNs.
arXiv Detail & Related papers (2025-06-05T16:50:23Z) - Bidirectional Linear Recurrent Models for Sequence-Level Multisource Fusion [10.867398697751742]
We introduce BLUR (Bidirectional Linear Unit for Recurrent network), which uses forward and backward linear recurrent units (LRUs) to capture both past and future dependencies with high computational efficiency.
Experiments on sequential image and time series datasets reveal that BLUR not only surpasses transformers and traditional RNNs in accuracy but also significantly reduces computational costs.
arXiv Detail & Related papers (2025-04-11T20:42:58Z) - Fixed-Point RNNs: Interpolating from Diagonal to Dense [10.851383867834052]
We investigate a class of dense linear RNNs as fixed-points of parallelizable diagonal RNNs.
The resulting models can naturally trade expressivity for efficiency at a fixed number of parameters.
arXiv Detail & Related papers (2025-03-13T18:50:22Z) - Were RNNs All We Needed? [55.822693848969855]
In this work, we revisit sequence modelling from a historical perspective, focusing on Recurrent Neural Networks (RNNs).
We demonstrate that by simplifying these models, we can derive minimal versions (minLSTMs and minGRUs) that use fewer parameters than their traditional counterparts, are fully parallelizable during training, and achieve surprisingly competitive performance on a range of tasks, rivalling recent models including Transformers.
arXiv Detail & Related papers (2024-10-02T03:06:49Z) - RotRNN: Modelling Long Sequences with Rotations [7.037239398244858]
Linear recurrent neural networks, such as State Space Models (SSMs) and Linear Recurrent Units (LRUs) have recently shown state-of-the-art performance on long sequence modelling benchmarks.
We propose RotRNN -- a linear recurrent model which utilises the convenient properties of rotation matrices.
We show that RotRNN provides a simple and efficient model with a robust normalisation procedure, and a practical implementation that remains faithful to its theoretical derivation.
arXiv Detail & Related papers (2024-07-09T21:37:36Z) - RWKV: Reinventing RNNs for the Transformer Era [54.716108899349614]
We propose a novel model architecture that combines the efficient parallelizable training of transformers with the efficient inference of RNNs.
We scale our models as large as 14 billion parameters, by far the largest dense RNN ever trained, and find RWKV performs on par with similarly sized Transformers.
arXiv Detail & Related papers (2023-05-22T13:57:41Z) - Intelligence Processing Units Accelerate Neuromorphic Learning [52.952192990802345]
Spiking neural networks (SNNs) have achieved orders of magnitude improvement in terms of energy consumption and latency.
We present an IPU-optimized release of our custom SNN Python package, snnTorch.
arXiv Detail & Related papers (2022-11-19T15:44:08Z) - Reverse engineering recurrent neural networks with Jacobian switching linear dynamical systems [24.0378100479104]
Recurrent neural networks (RNNs) are powerful models for processing time-series data.
The framework of reverse engineering a trained RNN by linearizing around its fixed points has provided insight, but the approach has significant challenges.
We present a new model that overcomes these limitations by co-training an RNN with a novel switching linear dynamical system (SLDS) formulation.
arXiv Detail & Related papers (2021-11-01T20:49:30Z) - A Fully Tensorized Recurrent Neural Network [48.50376453324581]
We introduce a "fully tensorized" RNN architecture which jointly encodes the separate weight matrices within each recurrent cell.
This approach reduces model size by several orders of magnitude, while still maintaining similar or better performance compared to standard RNNs.
arXiv Detail & Related papers (2020-10-08T18:24:12Z)
This list is automatically generated from the titles and abstracts of the papers in this site.