LongMamba: Enhancing Mamba's Long Context Capabilities via Training-Free Receptive Field Enlargement
- URL: http://arxiv.org/abs/2504.16053v1
- Date: Tue, 22 Apr 2025 17:30:36 GMT
- Title: LongMamba: Enhancing Mamba's Long Context Capabilities via Training-Free Receptive Field Enlargement
- Authors: Zhifan Ye, Kejing Xia, Yonggan Fu, Xin Dong, Jihoon Hong, Xiangchi Yuan, Shizhe Diao, Jan Kautz, Pavlo Molchanov, Yingyan Celine Lin
- Abstract summary: State space models (SSMs) have emerged as an efficient alternative to Transformer models for language modeling. Recent studies have shown that SSMs, such as Mamba models, generally underperform compared to Transformers in long-context understanding tasks. We propose LongMamba, a training-free technique that significantly enhances the long-context capabilities of Mamba models.
- Score: 54.518582813434
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: State space models (SSMs) have emerged as an efficient alternative to Transformer models for language modeling, offering linear computational complexity and constant memory usage as context length increases. However, despite their efficiency in handling long contexts, recent studies have shown that SSMs, such as Mamba models, generally underperform compared to Transformers in long-context understanding tasks. To address this significant shortfall and achieve both efficient and accurate long-context understanding, we propose LongMamba, a training-free technique that significantly enhances the long-context capabilities of Mamba models. LongMamba builds on our discovery that the hidden channels in Mamba can be categorized into local and global channels based on their receptive field lengths, with global channels primarily responsible for long-context capability. These global channels can become the key bottleneck as the input context lengthens. Specifically, when input lengths far exceed the training sequence length, global channels fail to adaptively extend their receptive fields, leading to Mamba's poor long-context performance. The key idea of LongMamba is to mitigate the hidden state memory decay in these global channels by preventing the accumulation of unimportant tokens in their memory. This is achieved by first identifying critical tokens in the global channels and then applying token filtering to accumulate only those critical tokens. Through extensive benchmarking across synthetic and real-world long-context scenarios, LongMamba sets a new standard for Mamba's long-context performance, significantly extending its operational range without requiring additional training. Our code is available at https://github.com/GATECH-EIC/LongMamba.
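The channel-classification and token-filtering idea described in the abstract can be illustrated with a rough, training-free sketch. The snippet below is a minimal illustration, not the paper's actual implementation: it assumes a simplified scalar per-channel recurrence h_t = a_t * h_{t-1} + b_t * x_t, treats a channel as "global" when its cumulative decay spans at least the training length, and uses a hypothetical magnitude-based importance score; the function names, the 0.5 decay threshold, and the keep ratio are all assumptions made for illustration.

```python
import numpy as np

def receptive_field_length(decays, threshold=0.5):
    """Approximate a channel's receptive field: how many recent steps it
    takes for the cumulative decay product to drop below `threshold`."""
    cum = np.cumprod(decays[::-1])              # decay accumulated looking backward in time
    below = np.nonzero(cum < threshold)[0]
    return len(decays) if below.size == 0 else int(below[0]) + 1

def filtered_scan(x, decays, gains, train_len, keep_ratio=0.25):
    """Simplified per-channel scan h_t = a_t * h_{t-1} + b_t * x_t.

    If the channel's receptive field reaches the training length, it is
    treated as a global channel: tokens with low (hypothetical) importance
    are skipped outright, i.e. they neither decay the hidden state
    (a_t -> 1) nor write into it (b_t -> 0), so only critical tokens
    accumulate in the channel's memory.
    """
    T = len(x)
    a, b = decays.copy(), gains.copy()
    if receptive_field_length(a) >= train_len:    # global channel
        importance = np.abs(b * x)                # hypothetical importance score
        k = max(1, int(keep_ratio * T))
        keep = np.argsort(importance)[-k:]        # indices of critical tokens
        mask = np.zeros(T, dtype=bool)
        mask[keep] = True
        a = np.where(mask, a, 1.0)                # skipped tokens do not decay memory
        b = np.where(mask, b, 0.0)                # skipped tokens do not write to memory
    h = 0.0
    for t in range(T):                            # sequential recurrence
        h = a[t] * h + b[t] * x[t]
    return h

# Example: a slowly decaying channel that stays "global" relative to train_len=512.
# T = 4096
# x = np.random.randn(T)
# decays = np.full(T, 0.999)          # slow decay -> long receptive field
# gains = np.random.rand(T) * 0.1
# h = filtered_scan(x, decays, gains, train_len=512)
```

Skipping a non-critical token by setting its decay to 1 and its input gain to 0 leaves the global channel's memory untouched, which is the intuition behind preventing unimportant tokens from accumulating in, and decaying, the hidden state.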
Related papers
- Random Long-Context Access for Mamba via Hardware-aligned Hierarchical Sparse Attention [43.3704626107852]
We propose Hierarchical Sparse Attention (HSA), a novel attention mechanism that enhances RNNs with long-range random access flexibility.
HSA divides inputs into chunks, selects the top-$k$ chunks, and hierarchically aggregates information; a rough sketch of this chunk-then-select pattern appears after the related papers list.
By combining HSA with Mamba, we introduce RAMba, which achieves perfect accuracy in passkey retrieval across 64 million contexts.
arXiv Detail & Related papers (2025-04-23T15:15:06Z) - Taipan: Efficient and Expressive State Space Language Models with Selective Attention [100.16383527459429]
Long-context language modeling is a significant challenge in Natural Language Processing (NLP).
Recent State Space Models (SSMs) such as Mamba offer alternatives with constant memory usage, but they underperform in tasks requiring extensive in-context retrieval.
We introduce Taipan, a novel hybrid architecture that combines Mamba-2 with Selective Attention Layers (SALs).
Our experiments demonstrate Taipan's superior performance across various scales and tasks, offering a promising solution for efficient long-context language modeling.
arXiv Detail & Related papers (2024-10-24T09:25:37Z) - LaMamba-Diff: Linear-Time High-Fidelity Diffusion Models Based on Local Attention and Mamba [54.85262314960038]
Local Attentional Mamba blocks capture both global contexts and local details with linear complexity.
Our model exhibits exceptional scalability and surpasses the performance of DiT across various model scales on ImageNet at 256x256 resolution.
Compared to state-of-the-art diffusion models on ImageNet 256x256 and 512x512, our largest model presents notable advantages, such as a reduction of up to 62% GFLOPs.
arXiv Detail & Related papers (2024-08-05T16:39:39Z) - DeciMamba: Exploring the Length Extrapolation Potential of Mamba [89.07242846058023]
We introduce DeciMamba, a context-extension method specifically designed for Mamba. Experiments over real-world long-range NLP tasks show that DeciMamba can extrapolate to context lengths significantly longer than the ones seen during training.
arXiv Detail & Related papers (2024-06-20T17:40:18Z) - MambaDepth: Enhancing Long-range Dependency for Self-Supervised Fine-Structured Monocular Depth Estimation [0.0]
MambaDepth is a versatile network tailored for self-supervised depth estimation.
MambaDepth combines the U-Net's effectiveness in self-supervised depth estimation with the advanced capabilities of Mamba.
MambaDepth proves its superior generalization capacities on other datasets such as Make3D and Cityscapes.
arXiv Detail & Related papers (2024-06-06T22:08:48Z) - Bi-Mamba+: Bidirectional Mamba for Time Series Forecasting [5.166854384000439]
Long-term time series forecasting (LTSF) provides longer insights into future trends and patterns.
Recently, a new state space model (SSM) named Mamba has been proposed.
With its selective capability on input data and hardware-aware parallel computing algorithm, Mamba has shown great potential in balancing predicting performance and computational efficiency.
arXiv Detail & Related papers (2024-04-24T09:45:48Z) - MiM-ISTD: Mamba-in-Mamba for Efficient Infrared Small Target Detection [72.46396769642787]
We develop a nested structure, Mamba-in-Mamba (MiM-ISTD), for efficient infrared small target detection.
MiM-ISTD is $8\times$ faster than the SOTA method and reduces GPU memory usage by 62.2% when testing on $2048 \times 2048$ images.
arXiv Detail & Related papers (2024-03-04T15:57:29Z) - MambaByte: Token-free Selective State Space Model [71.90159903595514]
MambaByte is a token-free adaptation of the Mamba SSM trained autoregressively on byte sequences.
We show MambaByte to be competitive with, and even to outperform, state-of-the-art subword Transformers on language modeling tasks.
arXiv Detail & Related papers (2024-01-24T18:53:53Z)
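As noted in the Hierarchical Sparse Attention entry above, the chunk-then-select pattern can be sketched roughly as follows. This is not RAMba's actual HSA implementation: the mean-pooled chunk summaries, dot-product scoring, and softmax aggregation are assumptions used only to illustrate "divide into chunks, select the top-$k$, then aggregate within the selected chunks."

```python
import numpy as np

def hsa_sketch(tokens, query, chunk_size=64, top_k=4):
    """Rough sketch of hierarchical chunk selection: score coarse chunk
    summaries against a query, keep the top-k chunks, then aggregate
    token-level information only inside the selected chunks."""
    T, d = tokens.shape
    n_chunks = (T + chunk_size - 1) // chunk_size
    chunks = [tokens[i * chunk_size:(i + 1) * chunk_size] for i in range(n_chunks)]
    chunk_keys = np.stack([c.mean(axis=0) for c in chunks])   # coarse per-chunk summaries (assumed)
    chunk_scores = chunk_keys @ query                          # chunk-level relevance to the query
    selected = np.argsort(chunk_scores)[-top_k:]               # keep only the top-k chunks
    out = np.zeros(d)
    for i in selected:                                         # fine-grained pass over selected chunks
        s = chunks[i] @ query                                  # token-level scores inside the chunk
        w = np.exp(s - s.max())                                # numerically stable softmax weights
        out += (w[:, None] * chunks[i]).sum(axis=0) / w.sum()  # weighted aggregation of tokens
    return out / len(selected)

# Example: 1,000 tokens of dimension 64, queried with a random vector.
# out = hsa_sketch(np.random.randn(1000, 64), np.random.randn(64))
```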