Samba: Simple Hybrid State Space Models for Efficient Unlimited Context Language Modeling
- URL: http://arxiv.org/abs/2406.07522v2
- Date: Tue, 03 Dec 2024 08:27:49 GMT
- Title: Samba: Simple Hybrid State Space Models for Efficient Unlimited Context Language Modeling
- Authors: Liliang Ren, Yang Liu, Yadong Lu, Yelong Shen, Chen Liang, Weizhu Chen
- Abstract summary: We present Samba, a simple hybrid architecture that layer-wise combines Mamba, a selective State Space Model (SSM), with Sliding Window Attention (SWA).
Samba selectively compresses a given sequence into recurrent hidden states while still maintaining the ability to precisely recall recent memories with the attention mechanism.
We scale Samba up to 3.8B parameters with 3.2T training tokens and demonstrate that it significantly outperforms state-of-the-art models across a variety of benchmarks.
- Score: 70.94320930424331
- Abstract: Efficiently modeling sequences with infinite context length has long been a challenging problem. Previous approaches have either suffered from quadratic computational complexity or limited extrapolation ability in length generalization. In this work, we present Samba, a simple hybrid architecture that layer-wise combines Mamba, a selective State Space Model (SSM), with Sliding Window Attention (SWA). Samba selectively compresses a given sequence into recurrent hidden states while still maintaining the ability to precisely recall recent memories with the attention mechanism. We scale Samba up to 3.8B parameters with 3.2T training tokens and demonstrate that it significantly outperforms state-of-the-art models across a variety of benchmarks. Pretrained on sequences of 4K length, Samba shows improved perplexity in context lengths of up to 1M in zero-shot. When finetuned on 4K-length sequences, Samba efficiently extrapolates to a 256K context length with perfect memory recall on the Passkey Retrieval task, and exhibits superior retrieval extrapolation on the challenging Phonebook task compared to full-attention models. As a linear-time sequence model, Samba achieves a 3.73x higher throughput compared to Transformers with grouped-query attention for user prompts of 128K length, and a 3.64x speedup when generating 64K tokens with unlimited streaming. Our code for training on open source data is publicly available at https://github.com/microsoft/Samba.
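The sliding-window half of the hybrid described in the abstract can be illustrated with a small sketch. This is not the authors' implementation (that lives in the repository linked above); the function names `sliding_window_mask` and `attended_pairs` and the window sizes used here are hypothetical, chosen only to show why attention restricted to a fixed window of recent tokens costs time linear in sequence length.

```python
from typing import List


def sliding_window_mask(seq_len: int, window: int) -> List[List[bool]]:
    """Build a causal sliding-window attention mask.

    mask[i][j] is True when query position i may attend to key position j,
    i.e. j is not in the future (j <= i) and lies within the last `window`
    positions (i - j < window). Names and sizes are illustrative assumptions.
    """
    if seq_len < 0 or window < 1:
        raise ValueError("seq_len must be >= 0 and window must be >= 1")
    return [
        [(j <= i) and (i - j < window) for j in range(seq_len)]
        for i in range(seq_len)
    ]


def attended_pairs(mask: List[List[bool]]) -> int:
    """Count the (query, key) pairs the mask allows; a proxy for attention cost."""
    return sum(sum(row) for row in mask)


if __name__ == "__main__":
    # Each row shows which earlier positions one query token can see.
    for row in sliding_window_mask(seq_len=5, window=2):
        print("".join("x" if allowed else "." for allowed in row))
    # With a fixed window, the pair count grows roughly as seq_len * window
    # rather than seq_len ** 2.
    print(attended_pairs(sliding_window_mask(seq_len=8, window=3)))
```

Because each query sees at most `window` keys, the number of allowed pairs is bounded by `seq_len * window`. In the hybrid the abstract describes, tokens that fall outside this window are not attended to directly; they are carried only through the recurrent hidden state of the Mamba layers.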
Related papers
- Mixture-of-Mamba: Enhancing Multi-Modal State-Space Models with Modality-Aware Sparsity [56.0251572416922]
State Space Models (SSMs) have emerged as efficient alternatives to Transformers for sequential modeling.
We propose a novel SSM architecture that introduces modality-aware sparsity through modality-specific parameterization of the Mamba block.
We evaluate Mixture-of-Mamba across three multi-modal pretraining settings.
arXiv Detail & Related papers (2025-01-27T18:35:05Z)
- Mamba-SEUNet: Mamba UNet for Monaural Speech Enhancement [54.427965535613886]
Mamba, as a novel state-space model (SSM), has gained widespread application in natural language processing and computer vision.
In this work, we introduce Mamba-SEUNet, an innovative architecture that integrates Mamba with U-Net for speech enhancement (SE) tasks.
arXiv Detail & Related papers (2024-12-21T13:43:51Z)
- Taipan: Efficient and Expressive State Space Language Models with Selective Attention [100.16383527459429]
Long-context language modeling is a significant challenge in Natural Language Processing (NLP).
Recent State Space Models (SSMs) such as Mamba offer alternatives with constant memory usage, but they underperform in tasks requiring extensive in-context retrieval.
We introduce Taipan, a novel hybrid architecture that combines Mamba-2 with Selective Attention Layers (SALs).
Our experiments demonstrate Taipan's superior performance across various scales and tasks, offering a promising solution for efficient long-context language modeling.
arXiv Detail & Related papers (2024-10-24T09:25:37Z)
- MambaMIM: Pre-training Mamba with State Space Token-interpolation [14.343466340528687]
We introduce a generative self-supervised learning method for Mamba (MambaMIM) based on Selective Structure State Space Sequence Token-interpolation (S6T).
MambaMIM can be used on any single or hybrid Mamba architecture to enhance Mamba's long-range representation capability.
arXiv Detail & Related papers (2024-08-15T10:35:26Z)
- PackMamba: Efficient Processing of Variable-Length Sequences in Mamba training [13.926804198202582]
Mamba, emerging as a groundbreaking architecture in the field of generative AI, demonstrates remarkable proficiency in handling elongated sequences.
The existing training framework for Mamba is inefficient with variable-length sequence inputs.
We propose PackMamba, a high-throughput Mamba that efficiently handles variable-length sequences.
arXiv Detail & Related papers (2024-08-07T16:13:43Z)
- An Empirical Study of Mamba-based Language Models [69.74383762508805]
Selective state-space models (SSMs) like Mamba overcome some shortcomings of Transformers.
We present a direct comparison between 8B-parameter Mamba, Mamba-2, and Transformer models trained on the same datasets.
We find that the 8B Mamba-2-Hybrid exceeds the 8B Transformer on all 12 standard tasks.
arXiv Detail & Related papers (2024-06-12T05:25:15Z)
- SiMBA: Simplified Mamba-Based Architecture for Vision and Multivariate Time series [2.4379295576598436]
We propose SiMBA, a new architecture that introduces Einstein FFT (EinFFT) for channel modeling by specific eigenvalue computations and uses the Mamba block for sequence modeling.
We show that SiMBA outperforms existing SSMs, bridging the performance gap with state-of-the-art transformers.
arXiv Detail & Related papers (2024-03-22T17:22:56Z)
- BlackMamba: Mixture of Experts for State-Space Models [10.209192169793772]
State-space models (SSMs) have recently demonstrated performance competitive with transformers on large-scale language modeling benchmarks.
MoE models have shown remarkable performance while significantly reducing the compute and latency costs of inference.
We present BlackMamba, a novel architecture that combines the Mamba SSM with MoE to obtain the benefits of both.
arXiv Detail & Related papers (2024-02-01T07:15:58Z)
- MambaByte: Token-free Selective State Space Model [71.90159903595514]
MambaByte is a token-free adaptation of the Mamba SSM trained autoregressively on byte sequences.
We show MambaByte to be competitive with, and even to outperform, state-of-the-art subword Transformers on language modeling tasks.
arXiv Detail & Related papers (2024-01-24T18:53:53Z)