Sequential-Parallel Duality in Prefix Scannable Models
- URL: http://arxiv.org/abs/2506.10918v1
- Date: Thu, 12 Jun 2025 17:32:02 GMT
- Title: Sequential-Parallel Duality in Prefix Scannable Models
- Authors: Morris Yau, Sharut Gupta, Valerie Engelmayer, Kazuki Irie, Stefanie Jegelka, Jacob Andreas
- Abstract summary: Recent developments have given rise to various models, such as Gated Linear Attention (GLA) and Mamba. This raises a natural question: can we characterize the full class of neural sequence models that support near-constant-time parallel evaluation and linear-time, constant-space sequential inference?
- Score: 68.39855814099997
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Modern neural sequence models are designed to meet the dual mandate of parallelizable training and fast sequential inference. Recent developments have given rise to various models, such as Gated Linear Attention (GLA) and Mamba, that achieve such ``sequential-parallel duality.'' This raises a natural question: can we characterize the full class of neural sequence models that support near-constant-time parallel evaluation and linear-time, constant-space sequential inference? We begin by describing a broad class of such models -- state space models -- as those whose state updates can be computed using the classic parallel prefix scan algorithm with a custom associative aggregation operator. We then define a more general class, Prefix-Scannable Models (PSMs), by relaxing the state aggregation operator to allow arbitrary (potentially non-associative) functions such as softmax attention. This generalization unifies many existing architectures, including element-wise RNNs (e.g., Mamba) and linear transformers (e.g., GLA, Mamba2, mLSTM), while also introducing new models with softmax-like operators that achieve O(1) amortized compute per token and log(N) memory for sequence length N. We empirically evaluate such models on illustrative small-scale language modeling and canonical synthetic tasks, including state tracking and associative recall. Empirically, we find that PSMs retain the expressivity of transformer-based architectures while matching the inference efficiency of state space models -- in some cases exhibiting better length generalization than either.
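As an illustration of the prefix-scan view described in the abstract, the sketch below computes the states of a gated linear recurrence h_t = a_t * h_{t-1} + b_t (the element-wise update underlying models such as Mamba and GLA) with jax.lax.associative_scan, and checks it against the linear-time sequential recurrence. This is a minimal sketch under standard assumptions, not code from the paper; the pair encoding and combine operator are the usual ones for affine recurrences.

```python
# Minimal sketch: the state update h_t = a_t * h_{t-1} + b_t of an element-wise
# gated linear recurrence, computed (i) with a parallel prefix scan over an
# associative combine operator and (ii) with a plain sequential recurrence.
import jax
import jax.numpy as jnp

def combine(left, right):
    """Associative operator on pairs (a, b), each pair encoding h -> a*h + b."""
    a_l, b_l = left
    a_r, b_r = right
    # Applying `left` first, then `right`: h -> a_r * (a_l * h + b_l) + b_r
    return a_r * a_l, a_r * b_l + b_r

def parallel_states(a, b):
    """All states h_1..h_T (with h_0 = 0) in O(log T) parallel depth."""
    _, h = jax.lax.associative_scan(combine, (a, b), axis=0)
    return h

def sequential_states(a, b):
    """Reference linear-time, constant-space recurrence."""
    def step(h, ab):
        a_t, b_t = ab
        h = a_t * h + b_t
        return h, h
    _, hs = jax.lax.scan(step, jnp.zeros_like(b[0]), (a, b))
    return hs

T, d = 64, 8
key_a, key_b = jax.random.split(jax.random.PRNGKey(0))
a = jax.nn.sigmoid(jax.random.normal(key_a, (T, d)))   # gates in (0, 1)
b = jax.random.normal(key_b, (T, d))
print(jnp.allclose(parallel_states(a, b), sequential_states(a, b), atol=1e-5))
```

Prefix-Scannable Models generalize this picture by allowing the combine step to be an arbitrary, possibly non-associative aggregation (e.g., softmax attention over a growing summary) rather than an associative operator like the one above.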
Related papers
- Bridging Expressivity and Scalability with Adaptive Unitary SSMs [4.61803711540329]
We introduce the Adaptive Unitary State Space Model (AUSSM) - a novel class of SSMs that use skew-symmetric, input-dependent recurrence to achieve unitary evolution and high expressive power. Our results demonstrate that adaptive unitary recurrence provides a powerful and efficient algorithmic bias for both symbolic and continuous sequence modeling.
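A toy illustration of why a skew-symmetric generator gives unitary (norm-preserving) state evolution: the matrix exponential of a real skew-symmetric matrix is orthogonal. The sketch below is an assumption-laden toy, not the AUSSM parameterization; the matrix sizes and the way inputs enter are invented for illustration.

```python
# Toy illustration: exponentiating a skew-symmetric generator gives an
# orthogonal (norm-preserving) state update. Shapes and the way the "input"
# enters are invented; this is not the AUSSM parameterization.
import jax
import jax.numpy as jnp
from jax.scipy.linalg import expm

def skew(raw):
    """Skew-symmetric matrix S = M - M^T built from unconstrained parameters."""
    m = raw.reshape(4, 4)
    return m - m.T

def step(h, raw):
    """h <- exp(S) h, an exactly orthogonal/unitary update."""
    return expm(skew(raw)) @ h

key = jax.random.PRNGKey(0)
h0 = jax.random.normal(key, (4,))
h = h0
for k in jax.random.split(key, 5):
    h = step(h, jax.random.normal(k, (16,)))   # stand-in for input-dependent params
print(jnp.linalg.norm(h0), jnp.linalg.norm(h))  # equal up to numerical error
```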
arXiv Detail & Related papers (2025-07-07T17:47:16Z)
- MesaNet: Sequence Modeling by Locally Optimal Test-Time Training [67.45211108321203]
We introduce a numerically stable, chunkwise parallelizable version of the recently proposed Mesa layer. We show that optimal test-time training enables reaching lower language modeling perplexity and higher downstream benchmark performance than previous RNNs.
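A naive, non-chunkwise sketch of the "locally optimal test-time training" idea: at each position, fit a regularized least-squares map from past keys to past values and read out with the current query. The exact objective, feature maps, and the stable chunkwise algorithm in the paper differ; all names below are invented for illustration.

```python
# Naive sketch of a locally optimal test-time regression readout. At each step
# it solves a ridge regression over all (key, value) pairs seen so far and
# applies the solution to the current query. Illustrative only.
import jax
import jax.numpy as jnp

def test_time_readout(keys, values, queries, lam=1e-2):
    d = keys.shape[-1]
    outputs = []
    for t in range(keys.shape[0]):
        K, V = keys[: t + 1], values[: t + 1]      # pairs observed so far
        G = K.T @ K + lam * jnp.eye(d)             # regularized Gram matrix
        W = jnp.linalg.solve(G, K.T @ V)           # ridge solution, shape (d, d_v)
        outputs.append(queries[t] @ W)             # locally optimal readout
    return jnp.stack(outputs)

keys, values, queries = [jax.random.normal(sub, (8, 4))
                         for sub in jax.random.split(jax.random.PRNGKey(0), 3)]
print(test_time_readout(keys, values, queries).shape)  # (8, 4)
```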
arXiv Detail & Related papers (2025-06-05T16:50:23Z)
- Latent Space Energy-based Neural ODEs [73.01344439786524]
This paper introduces novel deep dynamical models designed to represent continuous-time sequences. We train the model using maximum likelihood estimation with Markov chain Monte Carlo. Experimental results on oscillating systems, videos and real-world state sequences (MuJoCo) demonstrate that our model with the learnable energy-based prior outperforms existing counterparts.
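For context on the MCMC-based training mentioned above, the sketch below runs short-run Langevin dynamics to draw approximate samples from an energy-based prior p(z) proportional to exp(-E(z)). The quadratic energy and the step sizes are placeholders, not the paper's model.

```python
# Minimal sketch of Langevin-dynamics MCMC sampling from an energy-based prior
# p(z) ~ exp(-E(z)), the kind of short-run MCMC often used to train latent EBMs
# by maximum likelihood. Energy and hyperparameters are illustrative.
import jax
import jax.numpy as jnp

def energy(z):
    return 0.5 * jnp.sum(z ** 2)      # stand-in energy; a learned network in practice

grad_energy = jax.grad(energy)

def langevin_sample(key, z0, n_steps=50, step_size=0.01):
    z = z0
    for k in jax.random.split(key, n_steps):
        noise = jax.random.normal(k, z.shape)
        z = z - 0.5 * step_size * grad_energy(z) + jnp.sqrt(step_size) * noise
    return z

z = langevin_sample(jax.random.PRNGKey(0), jnp.zeros(8))
print(z.shape)  # (8,)
```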
arXiv Detail & Related papers (2024-09-05T18:14:22Z)
- State Soup: In-Context Skill Learning, Retrieval and Mixing [22.485700977542127]
A new breed of gated-linear recurrent neural networks has reached state-of-the-art performance on a range of sequence modeling problems.
Here, we explore another advantage of these stateful sequence models, inspired by the success of model merging through parameter interpolation.
Building on parallels between fine-tuning and in-context learning, we investigate whether we can treat internal states as task vectors that can be stored, retrieved, and then linearly combined.
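A toy sketch of the state-as-task-vector idea: run a recurrent model over two different demonstration prefixes, keep the final states, and mix them linearly before continuing inference. The tiny fixed-gate recurrence is a stand-in, not the gated linear RNNs studied in the paper.

```python
# Minimal sketch of "state soup": the final recurrent state produced by each
# demonstration prefix is treated as a task vector that can be stored,
# retrieved, and linearly combined. The simple recurrence is illustrative.
import jax
import jax.numpy as jnp

def run_rnn(xs, h0):
    """Element-wise recurrence h_t = 0.9 * h_{t-1} + x_t with a fixed gate."""
    def step(h, x):
        h = 0.9 * h + x
        return h, h
    h, _ = jax.lax.scan(step, h0, xs)
    return h                                   # final state = "task vector"

key_a, key_b = jax.random.split(jax.random.PRNGKey(0))
h0 = jnp.zeros(8)
state_a = run_rnn(jax.random.normal(key_a, (16, 8)), h0)  # state after task-A demos
state_b = run_rnn(jax.random.normal(key_b, (16, 8)), h0)  # state after task-B demos
alpha = 0.5
mixed_state = alpha * state_a + (1.0 - alpha) * state_b    # stored, retrieved, mixed
print(mixed_state.shape)  # (8,)
```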
arXiv Detail & Related papers (2024-06-12T17:06:07Z)
- Mamba: Linear-Time Sequence Modeling with Selective State Spaces [31.985243136674146]
Foundation models are almost universally based on the Transformer architecture and its core attention module.
We identify that a key weakness of such models is their inability to perform content-based reasoning.
We integrate these selective SSMs into a simplified end-to-end neural network architecture without attention or even MLP blocks (Mamba).
As a general sequence model backbone, Mamba achieves state-of-the-art performance across several modalities such as language, audio, and genomics.
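The sketch below illustrates what "selective" means here: the discretization step and input projection depend on the current token, so the recurrence can choose what to write into its state. The parameterization is a simplified stand-in, not Mamba's exact design.

```python
# Minimal sketch of a selective (input-dependent) state space recurrence. The
# discretized transition and input terms are functions of the current token,
# enabling content-based selection. Shapes and parameterization are illustrative.
import jax
import jax.numpy as jnp

d_state, d_in = 8, 4
k1, k2, k3, kx = jax.random.split(jax.random.PRNGKey(0), 4)
W_delta = jax.random.normal(k1, (d_in,)) * 0.1     # input-dependent step size
A = -jnp.abs(jax.random.normal(k2, (d_state,)))    # stable diagonal transition
W_B = jax.random.normal(k3, (d_in, d_state)) * 0.1

def selective_step(h, x):
    delta = jax.nn.softplus(x @ W_delta)           # step size Delta(x_t) > 0
    a_bar = jnp.exp(delta * A)                     # discretized transition (diagonal)
    b_bar = delta * (x @ W_B)                      # discretized, input-dependent input
    h = a_bar * h + b_bar
    return h, h

xs = jax.random.normal(kx, (32, d_in))
_, hs = jax.lax.scan(selective_step, jnp.zeros(d_state), xs)
print(hs.shape)  # (32, 8)
```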
arXiv Detail & Related papers (2023-12-01T18:01:34Z)
- Convolutional State Space Models for Long-Range Spatiotemporal Modeling [65.0993000439043]
ConvS5 is an efficient variant for long-range spatiotemporal modeling.
It significantly outperforms Transformers and ConvLSTM on a long horizon Moving-MNIST experiment while training 3X faster than ConvLSTM and generating samples 400X faster than Transformers.
arXiv Detail & Related papers (2023-10-30T16:11:06Z)
- Long Sequence Hopfield Memory [32.28395813801847]
Sequence memory enables agents to encode, store, and retrieve complex sequences of stimuli and actions.
We introduce a nonlinear interaction term, enhancing separation between the patterns.
We extend this model to store sequences with variable timing between states' transitions.
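A toy illustration of sequence recall with a nonlinear separation term: overlaps with stored patterns are sharpened by a power nonlinearity before the network transitions to each pattern's successor. Details (the exact interaction term, variable timing) differ from the paper's model; this only shows the general mechanism.

```python
# Toy sequence-Hopfield recall: sharpen overlaps with stored patterns, then
# move toward the successor of the best-matching pattern. Illustrative only.
import jax
import jax.numpy as jnp

def next_pattern(x, patterns, power=3):
    overlaps = patterns @ x                                  # overlap with each pattern
    sep = jnp.sign(overlaps) * jnp.abs(overlaps) ** power    # nonlinear separation
    successors = jnp.roll(patterns, -1, axis=0)              # pattern mu -> pattern mu+1
    return jnp.sign(sep @ successors)                        # transition to the successor

patterns = jnp.sign(jax.random.normal(jax.random.PRNGKey(0), (10, 64)))  # stored +/-1 patterns
x = patterns[0]
for _ in range(3):
    x = next_pattern(x, patterns)
print(jnp.mean(x == patterns[3]))   # ideally 1.0: the 4th pattern was recalled
```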
arXiv Detail & Related papers (2023-06-07T15:41:03Z)
- Low-Rank Constraints for Fast Inference in Structured Models [110.38427965904266]
This work demonstrates a simple approach to reduce the computational and memory complexity of a large class of structured models.
Experiments with neural parameterized structured models for language modeling, polyphonic music modeling, unsupervised grammar induction, and video modeling show that our approach matches the accuracy of standard models at large state spaces.
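One concrete instance of the low-rank trick, using a single HMM forward step as an example: if the N x N transition matrix factors as U @ V with rank r, reassociating the matrix-vector products cuts the cost from O(N^2) to O(N r). The HMM setting is chosen for illustration; the paper covers a broader family of structured models.

```python
# Low-rank speedup for one HMM forward step: never materialize the N x N
# transition matrix T = U @ V; contract with the rank-r factors instead.
import jax
import jax.numpy as jnp

N, r = 512, 16
k_u, k_v, k_alpha, k_e = jax.random.split(jax.random.PRNGKey(0), 4)
U = jax.nn.softmax(jax.random.normal(k_u, (N, r)), axis=-1)   # N x r, rows sum to 1
V = jax.nn.softmax(jax.random.normal(k_v, (r, N)), axis=-1)   # r x N, rows sum to 1
alpha = jax.nn.softmax(jax.random.normal(k_alpha, (N,)))      # forward message
emission = jax.nn.sigmoid(jax.random.normal(k_e, (N,)))       # p(obs | state)

dense_step = (alpha @ (U @ V)) * emission        # O(N^2) after forming T
low_rank_step = ((alpha @ U) @ V) * emission     # O(N r), same result by associativity
print(jnp.allclose(dense_step, low_rank_step, atol=1e-5))     # True
```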
arXiv Detail & Related papers (2022-01-08T00:47:50Z)
- Combining Recurrent, Convolutional, and Continuous-time Models with Linear State-Space Layers [21.09321438439848]
We introduce a simple sequence model inspired by control systems that generalizes recurrent, convolutional, and continuous-time models.
We show that LSSL models are closely related to the three aforementioned families of models and inherit their strengths.
For example, they generalize convolutions to continuous-time, explain common RNN heuristics, and share features of NDEs such as time-scale adaptation.
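A small sketch of the dual view that linear state space layers exploit: discretize x' = Ax + Bu, y = Cx (bilinear transform here) and evaluate it either as a recurrence or, since the system is linear and time-invariant, as a convolution with the kernel K_k = C @ A_bar^k @ B_bar. The matrices are random stand-ins rather than the structured ones used by LSSL/S4.

```python
# One linear state space layer, evaluated as a recurrence and as a convolution.
import jax
import jax.numpy as jnp

d, T, dt = 4, 32, 0.1
k_a, k_b, k_c, k_u = jax.random.split(jax.random.PRNGKey(0), 4)
A = -jnp.eye(d) + 0.1 * jax.random.normal(k_a, (d, d))   # stable-ish continuous A
B = jax.random.normal(k_b, (d,))
C = jax.random.normal(k_c, (d,))
u = jax.random.normal(k_u, (T,))

# Bilinear (Tustin) discretization.
I = jnp.eye(d)
A_bar = jnp.linalg.solve(I - dt / 2 * A, I + dt / 2 * A)
B_bar = jnp.linalg.solve(I - dt / 2 * A, dt * B)

# Recurrent view: x_{k+1} = A_bar x_k + B_bar u_k, y_k = C x_{k+1}.
def step(x, u_k):
    x = A_bar @ x + B_bar * u_k
    return x, C @ x
_, y_rec = jax.lax.scan(step, jnp.zeros(d), u)

# Convolutional view: y = K * u with kernel K_k = C A_bar^k B_bar.
powers = [jnp.eye(d)]
for _ in range(T - 1):
    powers.append(A_bar @ powers[-1])
K = jnp.stack([C @ P @ B_bar for P in powers])
y_conv = jnp.array([jnp.sum(K[: k + 1][::-1] * u[: k + 1]) for k in range(T)])
print(jnp.allclose(y_rec, y_conv, atol=1e-4))   # True
```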
arXiv Detail & Related papers (2021-10-26T19:44:53Z)
- Structured Reordering for Modeling Latent Alignments in Sequence Transduction [86.94309120789396]
We present an efficient dynamic programming algorithm performing exact marginal inference of separable permutations.
The resulting seq2seq model exhibits better systematic generalization than standard models on synthetic problems and NLP tasks.
arXiv Detail & Related papers (2021-06-06T21:53:54Z)
- Tensor Networks for Probabilistic Sequence Modeling [7.846449972735859]
We use a uniform matrix product state (u-MPS) model for probabilistic modeling of sequence data.
We then introduce a novel generative algorithm giving trained u-MPS the ability to efficiently sample from a wide variety of conditional distributions.
Experiments on sequence modeling with synthetic and real text data show u-MPS outperforming a variety of baselines.
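A minimal sketch of how a uniform MPS scores a sequence: one shared core per symbol is multiplied along the string between boundary vectors, and the squared amplitude gives an unnormalized probability. Normalization and the paper's sampling algorithm are omitted; all names and sizes are illustrative.

```python
# Uniform MPS (Born-machine style) sequence scoring: the same core tensor is
# reused at every position. Unnormalized and purely illustrative.
import jax
import jax.numpy as jnp

bond, vocab = 6, 4
k_core, k_a, k_w = jax.random.split(jax.random.PRNGKey(0), 3)
cores = jax.random.normal(k_core, (vocab, bond, bond)) / jnp.sqrt(bond)
alpha = jax.random.normal(k_a, (bond,))
omega = jax.random.normal(k_w, (bond,))

def score(seq):
    v = alpha
    for s in seq:
        v = v @ cores[s]          # shared core per symbol (uniform MPS)
    amp = v @ omega
    return amp ** 2               # Born-rule style unnormalized probability

print(score(jnp.array([0, 2, 1, 3])))
```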
arXiv Detail & Related papers (2020-03-02T17:16:05Z)