Improved state mixing in higher-order and block diagonal linear recurrent networks
- URL: http://arxiv.org/abs/2602.12021v1
- Date: Thu, 12 Feb 2026 14:51:59 GMT
- Title: Improved state mixing in higher-order and block diagonal linear recurrent networks
- Authors: Igor Dubinin, Antonio Orvieto, Felix Effenberger
- Abstract summary: Linear recurrent networks (LRNNs) and linear state space models (SSMs) promise computational and memory efficiency on long-sequence modeling tasks. Dense and nonlinear architectures (e.g., LSTMs), on the other hand, are provably more expressive but computationally costly. Here, we explore how expressivity in LRNNs can be increased via richer state mixing across time and channels while maintaining competitive efficiency.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Linear recurrent networks (LRNNs) and linear state space models (SSMs) promise computational and memory efficiency on long-sequence modeling tasks, yet their diagonal state transitions limit expressivity. Dense and nonlinear architectures (e.g., LSTMs), on the other hand, are provably more expressive but computationally costly. Here, we explore how expressivity in LRNNs can be increased via richer state mixing across time and channels while maintaining competitive efficiency. Specifically, we introduce two structured LRNN architectures: (i) Higher-order Linear Recurrent Units (H-LRU), which generalize first-order recurrence to higher order, mixing multiple past states, and (ii) Block-Diagonal LRUs (BD-LRU), which enable dense intra-block channel mixing. Per-channel (H-LRU) or per-row (BD-LRU) L1-normalization of selective gates stabilizes training and allows for scaling window/block sizes. A parallel-scan implementation of the proposed architectures keeps throughput competitive with diagonal LRNNs for moderate orders (H-LRU) and block sizes (BD-LRU). On synthetic sequence modeling tasks, BD-LRU matches or exceeds the performance of linear SSMs (Mamba), low-rank LRNNs (DeltaNet), and LSTM baselines, while H-LRU is found to be the most parameter-efficient on a compression task. In both synthetic sequence modeling and language modeling, our results indicate that the structure of state mixing, rather than width alone, shapes the expressivity of LRNNs, offering a practical route to closing the efficiency-expressivity gap in linear sequence models.
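The two recurrences described in the abstract can be sketched as follows. This is an illustrative, sequential NumPy reference under assumed parameter shapes; the paper's exact gate parameterization and its parallel-scan kernels are not reproduced here, and the function names and the `1e-8` stabilizer are hypothetical.

```python
import numpy as np

def h_lru_step(h_hist, x_t, a, b):
    """One step of a higher-order linear recurrence (H-LRU-style sketch).

    h_t = sum_{k=1..K} a_k * h_{t-k} + b * x_t, applied per channel.
    h_hist: (K, d) array of the K most recent states.
    a:      (K, d) selective gates, L1-normalized over the window axis
            so the homogeneous part is non-expansive per channel.
    """
    a = a / (np.abs(a).sum(axis=0, keepdims=True) + 1e-8)
    return (a * h_hist).sum(axis=0) + b * x_t

def bd_lru_step(h, x_t, A_blocks, b):
    """One step of a block-diagonal linear recurrence (BD-LRU-style sketch).

    The state is split into blocks of size m; each block is mixed by a
    dense m x m matrix whose rows are L1-normalized, which bounds the
    infinity-norm of the transition by 1.
    """
    m = A_blocks.shape[-1]
    h_new = np.empty_like(h)
    for i, A in enumerate(A_blocks):
        A = A / (np.abs(A).sum(axis=1, keepdims=True) + 1e-8)
        h_new[i * m:(i + 1) * m] = A @ h[i * m:(i + 1) * m]
    return h_new + b * x_t
```

The normalizations illustrate why the L1 constraint stabilizes training in this sketch: with gate magnitudes summing to at most one per channel (or per row), the homogeneous part of each update cannot amplify the state.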
Related papers
- PRISM: Parallel Residual Iterative Sequence Model [52.26239951489612]
We propose PRISM (Parallel Residual Iterative Sequence Model) to resolve this tension. PRISM introduces a solver-inspired inductive bias that captures key structural properties of multi-step refinement in a parallelizable form. We prove that this formulation achieves Rank-$L$ accumulation, structurally expanding the update manifold beyond the single-step Rank-$1$ bottleneck.
arXiv Detail & Related papers (2026-02-11T12:39:41Z) - Higher-order Linear Attention [59.92962330635185]
The quadratic cost of scaled dot-product attention is a central obstacle to scaling autoregressive language models to long contexts. We introduce Higher-order Linear Attention (HLA), a causal, streaming mechanism that realizes higher-order interactions via compact prefix sufficient statistics.
arXiv Detail & Related papers (2025-10-31T07:54:37Z) - ParaRNN: Unlocking Parallel Training of Nonlinear RNNs for Large Language Models [9.107447466062409]
ParaRNN is a framework that breaks the sequence-parallelization barrier for nonlinear RNNs. Our implementation achieves speedups of up to 665x over sequential application. ParaRNN is released as an open-source framework for automatic training-parallelization of nonlinear RNNs.
arXiv Detail & Related papers (2025-10-24T13:28:33Z) - Beyond Fixed: Training-Free Variable-Length Denoising for Diffusion Large Language Models [74.15250326312179]
Diffusion Large Language Models offer efficient parallel generation and strong global modeling. The dominant application of DLLMs is hindered by the need for a statically predefined generation length. We introduce DAEDAL, a novel training-free denoising strategy that enables Dynamic Adaptive Length Expansion.
arXiv Detail & Related papers (2025-08-01T17:56:07Z) - pLSTM: parallelizable Linear Source Transition Mark networks [10.620405837091022]
We introduce parallelizable Linear Source Transition Mark networks (pLSTMs) using Source, Transition, and Mark gates. pLSTMs tackle the vanishing/exploding activation/gradient problem for long distances in DAGs via two distinct modes. We demonstrate that pLSTMs generalize well to larger image sizes, whereas Transformers struggle to extrapolate.
arXiv Detail & Related papers (2025-06-13T17:51:37Z) - Scaling Up Liquid-Resistance Liquid-Capacitance Networks for Efficient Sequence Modeling [50.994194925685434]
LrcSSM is a non-linear recurrent model that processes long sequences as fast as today's linear state-space layers. By forcing its Jacobian matrix to be diagonal, the full sequence can be solved in parallel. LrcSSM offers a formal gradient-stability guarantee that other input-varying systems such as Liquid-S4 do not provide.
arXiv Detail & Related papers (2025-05-27T20:02:59Z) - Efficient Large Language Model Inference with Neural Block Linearization [51.619870789584525]
We introduce Neural Block Linearization (NBL), a novel framework for accelerating transformer model inference. NBL replaces self-attention layers with linear approximations derived from Linear Minimum Mean Squared Error estimators. In experiments, NBL achieves notable computational speed-ups while preserving competitive accuracy.
arXiv Detail & Related papers (2025-05-27T12:01:43Z) - Structured Linear CDEs: Maximally Expressive and Parallel-in-Time Sequence Models [15.650005330621148]
This work introduces Structured Linear Controlled Differential Equations (SLiCEs), a unifying framework for sequence models with structured, input-dependent state-transition matrices. Variants employing block-diagonal, sparse, or Walsh-Hadamard state-transition matrices are proved to be maximally expressive while remaining parallel-in-time.
arXiv Detail & Related papers (2025-05-23T11:34:21Z) - Bidirectional Linear Recurrent Models for Sequence-Level Multisource Fusion [10.867398697751742]
We introduce BLUR (Bidirectional Linear Unit for Recurrent network), which uses forward and backward linear recurrent units (LRUs) to capture both past and future dependencies with high computational efficiency. Experiments on sequential image and time series datasets reveal that BLUR not only surpasses transformers and traditional RNNs in accuracy but also significantly reduces computational costs.
arXiv Detail & Related papers (2025-04-11T20:42:58Z) - Fixed-Point RNNs: Interpolating from Diagonal to Dense [18.06917701940596]
Linear recurrent neural networks (RNNs) and state-space models (SSMs) have become promising alternatives to softmax-attention as sequence mixing layers in Transformer architectures. Current models, however, do not exhibit the full state-tracking expressivity of RNNs because they rely on channel-wise (i.e., diagonal) sequence mixing. In this paper, we investigate parameterizations of a large class of dense linear RNNs as fixed points of parallelizable diagonal RNNs.
arXiv Detail & Related papers (2025-03-13T18:50:22Z) - Short-Long Convolutions Help Hardware-Efficient Linear Attention to Focus on Long Sequences [60.489682735061415]
We propose CHELA, which replaces state space models with short-long convolutions and implements linear attention in a divide-and-conquer manner.
Our experiments on the Long Range Arena benchmark and language modeling tasks demonstrate the effectiveness of the proposed method.
arXiv Detail & Related papers (2024-06-12T12:12:38Z) - Efficient State Space Model via Fast Tensor Convolution and Block Diagonalization [5.260841516691153]
We propose a new state space layer based on multiple-input multiple-output SSM, called efficient SSM (eSSM). Our eSSM is built on the convolutional representation of multi-input multi-output (MIMO) SSM. In the model efficiency benchmark, the parameters of eSSM are only 12.89% of LSTM and 13.24% of Mamba.
arXiv Detail & Related papers (2024-02-23T12:36:31Z)
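The main paper and several entries above rely on parallel scans over linear recurrences. The reason a scan applies at all is that each step $h_t = a_t h_{t-1} + b_t$ is an affine map, and affine maps compose associatively, so prefix states can be computed in logarithmic parallel depth. The following NumPy sketch (names and structure are our own, not taken from any of the listed papers) expresses a Hillis-Steele-style scan with array slicing:

```python
import numpy as np

def scan_linear_recurrence(a, b):
    """Evaluate h_t = a_t * h_{t-1} + b_t with h_0 = 0, via an associative scan.

    Each step is the affine map h -> a_t * h + b_t. Composing step 1 then
    step 2 gives the map (a2 * a1, a2 * b1 + b2); because this combine is
    associative, the prefix compositions can be computed in O(log T)
    parallel depth. This reference version runs the rounds sequentially.
    """
    A, B = a.astype(float).copy(), b.astype(float).copy()
    step = 1
    while step < len(A):
        A_prev, B_prev = A.copy(), B.copy()
        # Combine the prefix ending at t - step with the segment ending at t.
        A[step:] = A_prev[step:] * A_prev[:-step]
        B[step:] = A_prev[step:] * B_prev[:-step] + B_prev[step:]
        step *= 2
    return B  # B[t] is the state after consuming inputs 0..t
```

The same combine rule carries over to the structured variants discussed above, with the scalar coefficients replaced by diagonal, higher-order, or block-diagonal transition operators; only the cost of composing two operators changes.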
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.