Scaling Up Liquid-Resistance Liquid-Capacitance Networks for Efficient Sequence Modeling
- URL: http://arxiv.org/abs/2505.21717v3
- Date: Wed, 02 Jul 2025 09:09:49 GMT
- Title: Scaling Up Liquid-Resistance Liquid-Capacitance Networks for Efficient Sequence Modeling
- Authors: Mónika Farsang, Ramin Hasani, Daniela Rus, Radu Grosu
- Abstract summary: We present LrcSSM, a $\textit{nonlinear}$ recurrent model that processes long sequences as fast as today's linear state-space layers. LrcSSM offers a formal gradient-stability guarantee that other input-varying systems such as Liquid-S4 and Mamba do not provide. We show that on a series of long-range forecasting tasks, LrcSSM outperforms LRU, S5 and Mamba.
- Score: 53.925413758281096
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: We present LrcSSM, a $\textit{nonlinear}$ recurrent model that processes long sequences as fast as today's linear state-space layers. By forcing the state-transition matrix to be diagonal and learned at every step, the full sequence can be solved in parallel with a single prefix-scan, giving $\mathcal{O}(TD)$ time and memory and only $\mathcal{O}(\log T)$ sequential depth, for input-sequence length $T$ and a state dimension $D$. Moreover, LrcSSM offers a formal gradient-stability guarantee that other input-varying systems such as Liquid-S4 and Mamba do not provide. Lastly, for network depth $L$, as the forward and backward passes cost $\Theta(T\,D\,L)$ FLOPs, with its low sequential depth and parameter count $\Theta(D\,L)$, the model follows the compute-optimal scaling law regime ($\beta \approx 0.42$) recently observed for Mamba, outperforming quadratic-attention Transformers at equal compute while avoiding the memory overhead of FFT-based long convolutions. We show that on a series of long-range forecasting tasks, LrcSSM outperforms LRU, S5 and Mamba.
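The parallel-in-time claim above rests on a standard property of diagonal, input-varying linear recurrences: each step $x_t = a_t \odot x_{t-1} + b_t$ is an element-wise affine map, affine maps compose associatively, and so all $T$ states can be recovered with one prefix scan in $\mathcal{O}(\log T)$ sequential depth. The sketch below illustrates only this generic scan mechanism with `jax.lax.associative_scan`; the coefficients `a` and `b` are placeholders rather than the actual liquid-resistance liquid-capacitance update of LrcSSM.

```python
import jax
import jax.numpy as jnp

def combine(prefix, step):
    """Compose two element-wise affine maps x -> a*x + b (prefix first, then step)."""
    a_p, b_p = prefix
    a_s, b_s = step
    return a_s * a_p, a_s * b_p + b_s

def diagonal_scan(a, b, x0):
    """Solve x_t = a_t * x_{t-1} + b_t for t = 1..T with a single prefix scan.

    a, b: (T, D) per-step diagonal coefficients and drifts.
    x0:   (D,)   initial state.
    Returns all states, shape (T, D), in O(T D) work and O(log T) depth.
    """
    a_cum, b_cum = jax.lax.associative_scan(combine, (a, b))
    return a_cum * x0 + b_cum      # x_t = (a_t ... a_1) x_0 + accumulated drift

# Toy usage with stable (|a| < 1) placeholder coefficients.
T, D = 1024, 16
a = jax.random.uniform(jax.random.PRNGKey(0), (T, D), minval=0.5, maxval=0.99)
b = jax.random.normal(jax.random.PRNGKey(1), (T, D))
xs = diagonal_scan(a, b, jnp.zeros(D))
```

A plain sequential `jax.lax.scan` over the same recurrence produces identical states; the associative form only reschedules the work, which is where the $\mathcal{O}(\log T)$ sequential depth quoted in the abstract comes from.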
Related papers
- Learning quadratic neural networks in high dimensions: SGD dynamics and scaling laws [21.18373933718468]
We study the optimization and sample complexity of gradient-based training of a two-layer neural network with quadratic activation function in the high-dimensional regime. We present a sharp analysis of the dynamics in the feature learning regime, for both the population limit and the finite-sample discretization.
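For concreteness, the model class studied there can be written as $f(x) = \sum_{j} a_j \langle w_j, x\rangle^2$, a two-layer network whose hidden units use the quadratic activation $\sigma(z) = z^2$. The snippet below is only a minimal illustration of that architecture together with one online SGD step on squared loss; the width, initialization scale, and learning rate are placeholder choices, not the high-dimensional scalings analysed in the paper.

```python
import jax
import jax.numpy as jnp

def predict(params, x):
    """Two-layer net with quadratic activation: f(x) = a . (W x)**2."""
    W, a = params
    return jnp.dot(a, jnp.square(W @ x))

def squared_loss(params, x, y):
    return 0.5 * (predict(params, x) - y) ** 2

@jax.jit
def sgd_step(params, x, y, lr):
    grads = jax.grad(squared_loss)(params, x, y)
    return jax.tree_util.tree_map(lambda p, g: p - lr * g, params, grads)

# Toy usage: d-dimensional inputs, width-m hidden layer, one online SGD step.
d, m, lr = 64, 8, 1e-2
k1, k2 = jax.random.split(jax.random.PRNGKey(0))
params = (jax.random.normal(k1, (m, d)) / jnp.sqrt(d), jnp.ones(m) / m)
x = jax.random.normal(k2, (d,))
y = 1.0                                  # placeholder target
params = sgd_step(params, x, y, lr)
```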
arXiv Detail & Related papers (2025-08-05T17:57:56Z)
- From Continual Learning to SGD and Back: Better Rates for Continual Linear Models [50.11453013647086]
We analyze forgetting, i.e., the loss on previously seen tasks, after $k$ iterations. We develop novel last-iterate upper bounds in the realizable least-squares setup. We prove for the first time that randomization alone, with no task repetition, can prevent catastrophic forgetting in sufficiently long task sequences.
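To make the quantity concrete: with linear models, each task is a least-squares problem, and forgetting measures the loss the current iterate incurs on tasks seen earlier. The sketch below sets up a realizable sequence of tasks (all sharing one ground-truth weight vector, so every task has zero optimal loss) and records forgetting after each task; sizes, stepsize, and iteration counts are illustrative only.

```python
import jax
import jax.numpy as jnp

def task_loss(w, A, b):
    """Least-squares loss 0.5 * mean((A w - b)^2) for one task."""
    return 0.5 * jnp.mean((A @ w - b) ** 2)

def run_continual(tasks, w0, lr=0.1, steps=50):
    """Train on tasks sequentially; report forgetting after each task."""
    w, forgetting = w0, []
    grad_fn = jax.jit(jax.grad(task_loss))
    for k, (A, b) in enumerate(tasks):
        for _ in range(steps):
            w = w - lr * grad_fn(w, A, b)
        # Forgetting = average loss on all previously seen tasks.
        prev = tasks[:k]
        if prev:
            forgetting.append(float(jnp.mean(jnp.array(
                [task_loss(w, Ap, bp) for Ap, bp in prev]))))
    return w, forgetting

# Realizable setup: every task shares the same ground-truth w_star.
d, n, num_tasks = 20, 10, 5
w_star = jax.random.normal(jax.random.PRNGKey(0), (d,))
tasks = []
for k in range(num_tasks):
    A = jax.random.normal(jax.random.PRNGKey(k + 1), (n, d))
    tasks.append((A, A @ w_star))        # b = A w_star, so each task is realizable
w, forgetting = run_continual(tasks, jnp.zeros(d))
```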
arXiv Detail & Related papers (2025-04-06T18:39:45Z)
- Learning Networks from Wide-Sense Stationary Stochastic Processes [7.59499154221528]
A key inference problem here is to learn edge connectivity from node outputs (potentials). We use Whittle's maximum likelihood estimator (MLE) to learn the support of $L^\ast$ from temporally correlated samples. We show that the MLE problem is strictly convex, admitting a unique solution.
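For background, Whittle's likelihood approximates the Gaussian likelihood of a stationary multivariate process in the frequency domain by matching a parameterized spectral density $S(\omega;\theta)$ to the periodogram of the observed node signals; in the paper, the graph structure (the support of $L^\ast$) enters through this parameterization. The sketch below is only the generic Whittle negative log-likelihood for a user-supplied `spectral_density` model; it is not the paper's estimator or its convexity analysis.

```python
import jax
import jax.numpy as jnp

def periodogram(X):
    """Multivariate periodogram P(omega_k) of a (T, p) signal matrix."""
    T = X.shape[0]
    Xf = jnp.fft.fft(X, axis=0)                              # frequencies omega_k = 2*pi*k/T
    return jnp.einsum('kp,kq->kpq', Xf, jnp.conj(Xf)) / T    # (T, p, p)

def whittle_nll(theta, X, spectral_density):
    """Whittle approximation of the negative log-likelihood.

    spectral_density(theta, omega) must return a (p, p) Hermitian PSD matrix.
    """
    T, p = X.shape
    P = periodogram(X)
    omegas = 2.0 * jnp.pi * jnp.arange(T) / T
    S = jax.vmap(lambda w: spectral_density(theta, w))(omegas)   # (T, p, p)
    _, logdet = jnp.linalg.slogdet(S)
    quad = jnp.einsum('kpq,kqp->k', jnp.linalg.inv(S), P).real   # tr(S^{-1} P) per frequency
    return jnp.sum(logdet + quad)
```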
arXiv Detail & Related papers (2024-12-04T23:14:00Z)
- Large Stepsize Gradient Descent for Logistic Loss: Non-Monotonicity of the Loss Improves Optimization Efficiency [47.8739414267201]
We consider gradient descent (GD) with a constant stepsize applied to logistic regression with linearly separable data.
We show that GD exits an initial oscillatory phase rapidly -- in $\mathcal{O}(\eta)$ steps -- and subsequently achieves an $\tilde{\mathcal{O}}(1/(\eta t))$ convergence rate.
Our results imply that, given a budget of $T$ steps, GD can achieve an accelerated loss of $\tilde{\mathcal{O}}(1/T^2)$ with an aggressive stepsize.
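A quick way to see the non-monotonic phase described above is to run constant-stepsize GD on a small linearly separable dataset with an aggressive $\eta$ and watch the loss oscillate before it settles into fast decrease. The snippet below is an illustrative experiment only, not the paper's construction; the dataset, stepsize, and horizon are arbitrary.

```python
import jax
import jax.numpy as jnp

def logistic_loss(w, X, y):
    """Average logistic loss; labels y in {-1, +1}."""
    margins = y * (X @ w)
    return jnp.mean(jax.nn.softplus(-margins))   # log(1 + exp(-margin)), numerically stable

# Linearly separable toy data: labels given by a ground-truth direction.
X = jax.random.normal(jax.random.PRNGKey(0), (100, 5))
w_true = jnp.ones(5)
y = jnp.sign(X @ w_true)

grad_fn = jax.jit(jax.grad(logistic_loss))
eta = 50.0                      # deliberately large constant stepsize
w = jnp.zeros(5)
losses = []
for t in range(200):
    losses.append(float(logistic_loss(w, X, y)))
    w = w - eta * grad_fn(w, X, y)
# Early iterates typically oscillate (loss is non-monotone), then decrease rapidly.
```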
arXiv Detail & Related papers (2024-02-24T23:10:28Z)
- TCNCA: Temporal Convolution Network with Chunked Attention for Scalable Sequence Processing [52.64837396100988]
MEGA is a recent transformer-based architecture, which utilizes a linear recurrent operator whose parallel computation, based on the FFT, scales as $O(L \log L)$, with $L$ being the sequence length.
We build upon their approach by replacing the linear recurrence with a special temporal convolutional network which permits larger receptive field size with shallower networks, and reduces the computational complexity to $O(L)$.
We evaluate TCNCA on EnWik8 language modeling, long-range-arena (LRA) sequence classification, and a synthetic reasoning benchmark, associative recall.
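For reference, the $O(L \log L)$ cost quoted above is that of evaluating a sequence-length convolution through the FFT, the same FFT-based long-convolution mechanism the main abstract contrasts with; TCNCA replaces it with short dilated convolutions to reach $O(L)$. The sketch below shows only the generic FFT trick, not MEGA's recurrent operator or TCNCA's network.

```python
import jax.numpy as jnp

def fft_causal_conv(u, k):
    """Causal convolution of a length-L signal u with a length-L kernel k.

    Zero-padding to 2L avoids circular wrap-around; cost is O(L log L)
    instead of the O(L^2) of direct convolution.
    """
    L = u.shape[-1]
    n = 2 * L
    U = jnp.fft.rfft(u, n=n)
    K = jnp.fft.rfft(k, n=n)
    return jnp.fft.irfft(U * K, n=n)[..., :L]

# Example: L = 8 signal and a short decaying kernel.
u = jnp.arange(8.0)
k = jnp.array([1.0, 0.5, 0.25, 0.0, 0.0, 0.0, 0.0, 0.0])
y = fft_causal_conv(u, k)      # matches direct causal convolution of u with k
```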
arXiv Detail & Related papers (2023-12-09T16:12:25Z)
- Convolutional State Space Models for Long-Range Spatiotemporal Modeling [65.0993000439043]
ConvS5 is an efficient variant for long-range spatiotemporal modeling.
It significantly outperforms Transformers and ConvLSTM on a long-horizon Moving-MNIST experiment while training 3X faster than ConvLSTM and generating samples 400X faster than Transformers.
arXiv Detail & Related papers (2023-10-30T16:11:06Z)
- Restarted Bayesian Online Change-point Detection for Non-Stationary Markov Decision Processes [12.229154524476405]
We introduce a variant of the Restarted Bayesian Online Change-Point Detection algorithm (R-BOCPD).
We propose an improved version of the UCRL2 algorithm for MDPs with state transition kernel sampled from a multinomial distribution.
We show that R-BOCPD-UCRL2 enjoys a favorable regret bound of $O\left(D O \sqrt{A T K_T \log\left(\frac{T}{\delta}\right)} + \frac{K_T}{\delta}\min\limits_{\ell}\dots\right)$.
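As background for the change-point component: Bayesian online change-point detection maintains a posterior over the current "run length" (time since the last change), growing it by one when no change occurs and resetting it to zero otherwise. The sketch below is the standard Adams-MacKay filter for Bernoulli observations with a Beta prior, shown purely to illustrate the run-length recursion; it is not the restarted variant, the multinomial transition model, or the UCRL2 machinery from the paper.

```python
import jax
import jax.numpy as jnp
from jax.scipy.special import logsumexp

def bocpd_bernoulli(xs, hazard=0.01, a0=1.0, b0=1.0):
    """Adams-MacKay Bayesian online change-point detection for Bernoulli data.

    Maintains a posterior over the current run length r_t with a
    Beta(a0, b0) prior on the Bernoulli parameter within each segment.
    Returns the final log run-length posterior.
    """
    log_r = jnp.array([0.0])               # P(r_0 = 0) = 1
    a = jnp.array([a0])
    b = jnp.array([b0])
    for x in xs:
        # Beta-Bernoulli posterior predictive for each candidate run length.
        p1 = a / (a + b)
        log_pred = x * jnp.log(p1) + (1 - x) * jnp.log1p(-p1)
        # Run grows by one (no change point) ...
        log_growth = log_r + log_pred + jnp.log1p(-hazard)
        # ... or a change point resets the run length to zero.
        log_cp = logsumexp(log_r + log_pred + jnp.log(hazard))
        log_r = jnp.concatenate([jnp.array([log_cp]), log_growth])
        log_r = log_r - logsumexp(log_r)   # normalize
        # Update sufficient statistics; fresh prior for the new r = 0 entry.
        a = jnp.concatenate([jnp.array([a0]), a + x])
        b = jnp.concatenate([jnp.array([b0]), b + (1 - x)])
    return log_r

# Toy stream with a change at t = 50 (success rate 0.2 -> 0.8).
k1, k2 = jax.random.split(jax.random.PRNGKey(0))
xs = jnp.concatenate([jax.random.bernoulli(k1, 0.2, (50,)),
                      jax.random.bernoulli(k2, 0.8, (50,))]).astype(jnp.float32)
log_r = bocpd_bernoulli(xs)
```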
arXiv Detail & Related papers (2023-04-01T05:26:41Z)
- Simplifying and Understanding State Space Models with Diagonal Linear RNNs [56.33053691749856]
This work disposes of the discretization step, and proposes a model based on vanilla Diagonal Linear RNNs.
We empirically show that, despite being conceptually much simpler, $\mathrm{DLR}$ is as performant as previously-proposed SSMs.
We also characterize the expressivity of SSMs and attention-based models via a suite of $13$ synthetic sequence-to-sequence tasks.
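To make the "vanilla Diagonal Linear RNN" concrete: the recurrence is simply $x_t = \Lambda x_{t-1} + B u_t$, $y_t = \mathrm{Re}(C x_t)$ with a learned complex diagonal $\Lambda$ and no ZOH or bilinear discretization step. A minimal sketch is below; the initialization and shapes are illustrative, not the paper's exact parameterization, and like LrcSSM above the recurrence can also be unrolled with a parallel scan.

```python
import jax
import jax.numpy as jnp

def init_dlr(key, state_dim, in_dim, out_dim, r_min=0.9, r_max=0.999):
    """Diagonal linear RNN parameters with stable complex eigenvalues |lambda| < 1."""
    k1, k2, k3, k4 = jax.random.split(key, 4)
    radius = jax.random.uniform(k1, (state_dim,), minval=r_min, maxval=r_max)
    phase = jax.random.uniform(k2, (state_dim,), minval=0.0, maxval=2.0 * jnp.pi)
    lam = radius * jnp.exp(1j * phase)                         # diagonal of Lambda
    B = jax.random.normal(k3, (state_dim, in_dim)) + 0j
    C = jax.random.normal(k4, (out_dim, state_dim)) + 0j
    return lam, B, C

def dlr_apply(params, us):
    """Run x_t = lam * x_{t-1} + B u_t, y_t = Re(C x_t) over a (T, in_dim) input."""
    lam, B, C = params
    def step(x, u):
        x = lam * x + B @ u
        return x, jnp.real(C @ x)
    x0 = jnp.zeros(lam.shape, dtype=lam.dtype)
    _, ys = jax.lax.scan(step, x0, us)
    return ys                                                  # (T, out_dim)

params = init_dlr(jax.random.PRNGKey(0), state_dim=32, in_dim=4, out_dim=2)
ys = dlr_apply(params, jnp.ones((100, 4)))
```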
arXiv Detail & Related papers (2022-12-01T18:53:06Z)
- Model-based Reinforcement Learning for Continuous Control with Posterior Sampling [10.91557009257615]
We study model-based posterior sampling for reinforcement learning (PSRL) in continuous state-action spaces.
We present MPC-PSRL, a model-based posterior sampling algorithm with model predictive control for action selection.
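As a rough illustration of the two ingredients named above: posterior sampling draws one plausible dynamics model from a Bayesian posterior over models, and model predictive control then plans over a short horizon under that sampled model to pick the next action. The sketch below uses a Bayesian linear-regression dynamics model and random-shooting MPC purely as stand-ins; the paper's actual model class and planner are not reproduced here.

```python
import jax
import jax.numpy as jnp

def sample_dynamics(key, Phi, Y, prior_var=1.0, noise_var=0.1):
    """Sample a linear dynamics model s' ~ W^T [s; a] from a Bayesian
    linear-regression posterior fit to transition data (rows of Phi are [s, a])."""
    d_in = Phi.shape[1]
    precision = jnp.eye(d_in) / prior_var + Phi.T @ Phi / noise_var
    cov = jnp.linalg.inv(precision)
    mean = cov @ Phi.T @ Y / noise_var                          # (d_in, d_out)
    noise = jax.random.normal(key, mean.shape)
    W = mean + jnp.linalg.cholesky(cov) @ noise                 # one posterior sample
    return lambda s, a: jnp.concatenate([s, a]) @ W

def mpc_action(key, dynamics, reward, s0, horizon=10, n_candidates=256, a_dim=1):
    """Random-shooting MPC: sample action sequences, roll out the sampled
    model, and return the first action of the best sequence."""
    seqs = jax.random.uniform(key, (n_candidates, horizon, a_dim), minval=-1.0, maxval=1.0)
    def rollout(actions):
        def step(s, a):
            return dynamics(s, a), reward(s, a)
        _, rs = jax.lax.scan(step, s0, actions)
        return rs.sum()
    returns = jax.vmap(rollout)(seqs)
    return seqs[jnp.argmax(returns), 0]

# Toy usage: 2-D state, 1-D action, fake transition data and a quadratic reward.
Phi = jax.random.normal(jax.random.PRNGKey(0), (200, 3))        # rows are [s1, s2, a]
Y = Phi @ jnp.array([[1.0, 0.1], [0.0, 1.0], [0.0, 0.5]])       # next states, shape (200, 2)
dyn = sample_dynamics(jax.random.PRNGKey(1), Phi, Y)
reward = lambda s, a: -jnp.sum(s ** 2) - 0.01 * jnp.sum(a ** 2)
a0 = mpc_action(jax.random.PRNGKey(2), dyn, reward, s0=jnp.zeros(2), a_dim=1)
```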
arXiv Detail & Related papers (2020-11-20T21:00:31Z)