Overcoming Long-Context Limitations of State-Space Models via Context-Dependent Sparse Attention
- URL: http://arxiv.org/abs/2507.00449v2
- Date: Sat, 16 Aug 2025 00:36:51 GMT
- Title: Overcoming Long-Context Limitations of State-Space Models via Context-Dependent Sparse Attention
- Authors: Zhihao Zhan, Jianan Zhao, Zhaocheng Zhu, Jian Tang
- Abstract summary: We focus on analyzing and improving the long-context modeling capabilities of state-space models (SSMs). We show that the widely used synthetic task, associative recall, insufficiently represents the complexities of real-world long-context modeling. To bridge the gap between theoretical analysis and real-world applications, we propose locality-sensitive Hashing Attention with sparse Key Selection.
- Score: 17.498728107106817
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Efficient long-context modeling remains a critical challenge for natural language processing (NLP), as the time complexity of the predominant Transformer architecture scales quadratically with the sequence length. While state-space models (SSMs) offer alternative sub-quadratic solutions, they struggle to capture long-range dependencies effectively. In this work, we focus on analyzing and improving the long-context modeling capabilities of SSMs. We show that the widely used synthetic task, associative recall, which requires a model to recall a value associated with a single key without context, insufficiently represents the complexities of real-world long-context modeling. To address this limitation, we extend the associative recall to a novel synthetic task, joint recall, which requires a model to recall the value associated with a key given in a specified context. Theoretically, we prove that SSMs do not have the expressiveness to solve multi-query joint recall in sub-quadratic time complexity. To resolve this issue, we propose a solution based on integrating SSMs with Context-Dependent Sparse Attention (CDSA), which has the expressiveness to solve multi-query joint recall with sub-quadratic computation. To bridge the gap between theoretical analysis and real-world applications, we propose locality-sensitive Hashing Attention with sparse Key Selection (HAX), which instantiates the theoretical solution and is further tailored to natural language domains. Extensive experiments on both synthetic and real-world long-context benchmarks show that HAX consistently outperforms SSM baselines and SSMs integrated with context-independent sparse attention (CISA).
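The difference between associative recall and joint recall described in the abstract can be illustrated with a toy example. The following is only a minimal sketch under stated assumptions: the token layout (a context token followed by its key-value pairs), the token names `c0`, `k0`, `v…`, and the helper functions are all hypothetical and are not taken from the paper's actual task construction.

```python
import random


def make_joint_recall_example(num_contexts=3, num_keys=4, seed=0):
    """Build a toy sequence in which every key appears in every context.

    Assumed layout (not the paper's format): c0 k0 v k1 v ... c1 k0 v ...
    """
    rng = random.Random(seed)
    contexts = [f"c{i}" for i in range(num_contexts)]
    keys = [f"k{j}" for j in range(num_keys)]
    # The value depends on BOTH the context and the key.
    table = {(c, k): f"v{rng.randrange(100)}" for c in contexts for k in keys}
    tokens = []
    for c in contexts:
        tokens.append(c)
        for k in keys:
            tokens.extend([k, table[(c, k)]])
    return tokens, table


def answer_associative_query(tokens, key):
    """Context-free lookup: return the value after the first match of `key`."""
    for i, tok in enumerate(tokens):
        if tok == key:
            return tokens[i + 1]
    return None


def answer_joint_query(tokens, context, key):
    """Context-dependent lookup: find `key` inside the block of `context`."""
    in_block = False
    for i, tok in enumerate(tokens):
        if tok.startswith("c"):
            # A context token opens a new block.
            in_block = tok == context
        elif in_block and tok == key:
            return tokens[i + 1]
    return None


tokens, table = make_joint_recall_example()
# The same key resolves to a context-specific value in joint recall.
print(answer_joint_query(tokens, "c2", "k1"), table[("c2", "k1")])
```

In this toy setting a context-free lookup can only ever return the value from the first block that contains the key, which is why a key alone does not determine the answer once the same key recurs across contexts.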
Related papers
- AgentLongBench: A Controllable Long Benchmark For Long-Contexts Agents via Environment Rollouts [78.33143446024485]
We introduce AgentLongBench, which evaluates agents through simulated environment rollouts based on Lateral Thinking Puzzles.
This framework generates rigorous interaction trajectories across knowledge-intensive and knowledge-free scenarios.
arXiv Detail & Related papers (2026-01-28T16:05:44Z)
- MS-SSM: A Multi-Scale State Space Model for Efficient Sequence Modeling [60.648359990090846]
State-space models (SSMs) have recently gained attention as an efficient alternative to computationally expensive attention-based models for sequence modeling.
This paper introduces a multi-scale SSM framework that represents sequence dynamics across multiple resolutions and processes each resolution with specialized state-space dynamics.
arXiv Detail & Related papers (2025-12-29T19:36:28Z)
- NExT-OMNI: Towards Any-to-Any Omnimodal Foundation Models with Discrete Flow Matching [64.10695425442164]
We introduce NExT-OMNI, an open-source omnimodal foundation model that achieves unified modeling through discrete flow paradigms.
Trained on large-scale interleaved text, image, video, and audio data, NExT-OMNI delivers competitive performance on multimodal generation and understanding benchmarks.
To advance further research, we release training details, data protocols, and open-source both the code and model checkpoints.
arXiv Detail & Related papers (2025-10-15T16:25:18Z)
- A Modular Multitask Reasoning Framework Integrating Spatio-temporal Models and LLMs [38.304628241767055]
We introduce STReason, a framework that integrates large language models with analytical capabilities for multi-task inference and execution.
We show that STReason significantly outperforms LLM baselines across all metrics, particularly excelling in complex, reasoning-intensive spatio-temporal scenarios.
Human evaluations validate STReason's credibility and practical utility, demonstrating potential to reduce expert workload and broaden the applicability to real-world, multi-faceted decision scenarios.
arXiv Detail & Related papers (2025-06-25T00:55:34Z)
- LLM-Symbolic Integration for Robust Temporal Tabular Reasoning [69.27153114778748]
We introduce TempTabQA-C, a synthetic dataset designed for systematic and controlled evaluations.
This structured approach allows Large Language Models (LLMs) to generate and execute SQL queries, enhancing generalization and mitigating biases.
arXiv Detail & Related papers (2025-06-06T05:14:04Z)
- Modeling Response Consistency in Multi-Agent LLM Systems: A Comparative Analysis of Shared and Separate Context Approaches [0.0]
We introduce the Response Consistency Index (RCI) as a metric to evaluate the effects of context limitations, noise, and inter-agent dependencies on system performance.
Our approach differs from existing research by focusing on the interplay between memory constraints and noise management.
arXiv Detail & Related papers (2025-04-09T21:54:21Z)
- Provable Benefits of Complex Parameterizations for Structured State Space Models [51.90574950170374]
Structured state space models (SSMs) are linear dynamical systems adhering to a specified structure.
In contrast to typical neural network modules, whose parameterizations are real, SSMs often use complex parameterizations.
This paper takes a step towards explaining the benefits of complex parameterizations for SSMs by establishing formal gaps between real and complex diagonal SSMs.
arXiv Detail & Related papers (2024-10-17T22:35:50Z)
- Multi-granularity Contrastive Cross-modal Collaborative Generation for End-to-End Long-term Video Question Answering [53.39158264785098]
Long-term Video Question Answering (VideoQA) is a challenging vision-and-language bridging task.
We present an entirely end-to-end solution for VideoQA: Multi-granularity Contrastive cross-modal collaborative Generation model.
arXiv Detail & Related papers (2024-10-12T06:21:58Z)
- Efficient High-Resolution Visual Representation Learning with State Space Model for Human Pose Estimation [60.80423207808076]
Capturing long-range dependencies while preserving high-resolution visual representations is crucial for dense prediction tasks such as human pose estimation.
We propose the Dynamic Visual State Space (DVSS) block, which augments visual state space models with multi-scale convolutional operations.
We build HRVMamba, a novel model for efficient high-resolution representation learning.
arXiv Detail & Related papers (2024-10-04T06:19:29Z)
- SDE: A Simplified and Disentangled Dependency Encoding Framework for State Space Models in Time Series Forecasting [8.841699904757506]
We identify and formally define three critical dependencies that are fundamental to forecasting accuracy.
We propose SDE (Simplified and Disentangled Dependency Encoding), a novel framework designed to enhance the capability of SSMs for time series forecasting.
arXiv Detail & Related papers (2024-08-22T02:14:59Z)
- Universal In-Context Approximation By Prompting Fully Recurrent Models [86.61942787684272]
We show that RNNs, LSTMs, GRUs, Linear RNNs, and linear gated architectures can serve as universal in-context approximators.
We introduce a programming language called LSRL that compiles to fully recurrent architectures.
arXiv Detail & Related papers (2024-06-03T15:25:13Z)
- Entropy-Regularized Token-Level Policy Optimization for Language Agent Reinforcement [67.1393112206885]
Large Language Models (LLMs) have shown promise as intelligent agents in interactive decision-making tasks.
We introduce Entropy-Regularized Token-level Policy Optimization (ETPO), an entropy-augmented RL method tailored for optimizing LLMs at the token level.
We assess the effectiveness of ETPO within a simulated environment that models data science code generation as a series of multi-step interactive tasks.
arXiv Detail & Related papers (2024-02-09T07:45:26Z)
- Reinforcement Learning for Adaptive Mesh Refinement [63.7867809197671]
We propose a novel formulation of AMR as a Markov decision process and apply deep reinforcement learning to train refinement policies directly from simulation.
The model sizes of these policy architectures are independent of the mesh size and hence scale to arbitrarily large and complex simulations.
arXiv Detail & Related papers (2021-03-01T22:55:48Z)
- Relational State-Space Model for Stochastic Multi-Object Systems [24.234120525358456]
This paper introduces the relational state-space model (R-SSM), a sequential hierarchical latent variable model.
R-SSM makes use of graph neural networks (GNNs) to simulate the joint state transitions of multiple correlated objects.
The utility of R-SSM is empirically evaluated on synthetic and real time-series datasets.
arXiv Detail & Related papers (2020-01-13T03:45:21Z)
This list is automatically generated from the titles and abstracts of the papers in this site.