Samba: Simple Hybrid State Space Models for Efficient Unlimited Context Language Modeling
- URL: http://arxiv.org/abs/2406.07522v1
- Date: Tue, 11 Jun 2024 17:50:51 GMT
- Title: Samba: Simple Hybrid State Space Models for Efficient Unlimited Context Language Modeling
- Authors: Liliang Ren, Yang Liu, Yadong Lu, Yelong Shen, Chen Liang, Weizhu Chen
- Abstract summary: We present Samba, a simple hybrid architecture that layer-wise combines Mamba, a selective State Space Model (SSM), with Sliding Window Attention (SWA).
Samba selectively compresses a given sequence into recurrent hidden states while still maintaining the ability to precisely recall memories with the attention mechanism.
As a linear-time sequence model, Samba enjoys a 3.73x higher throughput compared to Transformers with grouped-query attention when processing user prompts of 128K length, and a 3.64x speedup when generating 64K tokens with unlimited streaming.
- Score: 70.94320930424331
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Efficiently modeling sequences with infinite context length has been a long-standing problem. Past works suffer from either quadratic computation complexity or limited extrapolation ability on length generalization. In this work, we present Samba, a simple hybrid architecture that layer-wise combines Mamba, a selective State Space Model (SSM), with Sliding Window Attention (SWA). Samba selectively compresses a given sequence into recurrent hidden states while still maintaining the ability to precisely recall memories with the attention mechanism. We scale Samba up to 3.8B parameters with 3.2T training tokens and show that Samba substantially outperforms state-of-the-art models based on pure attention or SSMs on a wide range of benchmarks. When trained on 4K-length sequences, Samba can be efficiently extrapolated to 256K context length with perfect memory recall and shows improved token prediction up to 1M context length. As a linear-time sequence model, Samba enjoys a 3.73x higher throughput compared to Transformers with grouped-query attention when processing user prompts of 128K length, and a 3.64x speedup when generating 64K tokens with unlimited streaming. A sample implementation of Samba is publicly available at https://github.com/microsoft/Samba.
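The abstract describes a layer-wise hybrid of a selective SSM with sliding window attention. Below is a minimal, hedged sketch of that pattern in PyTorch: a toy gated recurrence stands in for the real Mamba block, and the sliding-window layer is ordinary multi-head attention restricted by a banded causal mask. All module names and hyperparameters are illustrative assumptions, not the released implementation (see the linked repository for that).

```python
import torch
import torch.nn as nn


class ToySelectiveSSM(nn.Module):
    """Stand-in for a Mamba block: an input-dependent (selective) gated recurrence."""
    def __init__(self, d_model: int):
        super().__init__()
        self.in_proj = nn.Linear(d_model, 2 * d_model)
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x):                              # x: (batch, seq, d_model)
        u, gate = self.in_proj(x).chunk(2, dim=-1)
        alpha = torch.sigmoid(gate)                    # per-token, per-channel decay
        h = torch.zeros_like(u[:, 0])                  # fixed-size recurrent state
        outs = []
        for t in range(u.size(1)):                     # linear-time scan over tokens
            h = alpha[:, t] * h + (1 - alpha[:, t]) * u[:, t]
            outs.append(h)
        return self.out_proj(torch.stack(outs, dim=1))


class SlidingWindowAttention(nn.Module):
    """Causal multi-head attention restricted to a fixed local window."""
    def __init__(self, d_model: int, n_heads: int, window: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.window = window

    def forward(self, x):
        t = torch.arange(x.size(1), device=x.device)
        # True = blocked: future tokens, and tokens more than `window - 1` steps back
        mask = (t[None, :] > t[:, None]) | (t[:, None] - t[None, :] >= self.window)
        out, _ = self.attn(x, x, x, attn_mask=mask, need_weights=False)
        return out


class SambaStyleBlock(nn.Module):
    """One hybrid group: SSM -> MLP -> SWA -> MLP, each pre-normed with a residual."""
    def __init__(self, d_model: int = 256, n_heads: int = 4, window: int = 2048):
        super().__init__()
        def mlp():
            return nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))
        self.layers = nn.ModuleList([
            ToySelectiveSSM(d_model), mlp(),
            SlidingWindowAttention(d_model, n_heads, window), mlp(),
        ])
        self.norms = nn.ModuleList(nn.LayerNorm(d_model) for _ in self.layers)

    def forward(self, x):
        for norm, layer in zip(self.norms, self.layers):
            x = x + layer(norm(x))
        return x


if __name__ == "__main__":
    x = torch.randn(2, 64, 256)                        # (batch, seq_len, d_model)
    print(SambaStyleBlock(window=16)(x).shape)         # torch.Size([2, 64, 256])
```

The property the sketch preserves is the one the abstract emphasizes: the recurrent layer carries arbitrarily long context in a fixed-size state, while the attention layer only ever looks at the last `window` tokens, so per-token cost stays constant as the sequence grows.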
Related papers
- Taipan: Efficient and Expressive State Space Language Models with Selective Attention [100.16383527459429]
Long-context language modeling is a significant challenge in Natural Language Processing (NLP).
Recent State Space Models (SSMs) such as Mamba offer alternatives with constant memory usage, but they underperform in tasks requiring extensive in-context retrieval.
We introduce Taipan, a novel hybrid architecture that combines Mamba-2 with Selective Attention Layers (SALs).
Our experiments demonstrate Taipan's superior performance across various scales and tasks, offering a promising solution for efficient long-context language modeling.
arXiv Detail & Related papers (2024-10-24T09:25:37Z)
- MambaMIM: Pre-training Mamba with State Space Token-interpolation [14.343466340528687]
We introduce a generative self-supervised learning method for Mamba (MambaMIM) based on Selective Structure State Space Sequence Token-interpolation (S6T).
MambaMIM can be used on any single or hybrid Mamba architecture to enhance Mamba's long-range representation capability.
arXiv Detail & Related papers (2024-08-15T10:35:26Z)
- DeciMamba: Exploring the Length Extrapolation Potential of Mamba [89.07242846058023]
We introduce DeciMamba, a context-extension method specifically designed for Mamba.
We show that DeciMamba can extrapolate context lengths 25x longer than the ones seen during training, and does so without utilizing additional computational resources.
arXiv Detail & Related papers (2024-06-20T17:40:18Z)
- An Empirical Study of Mamba-based Language Models [69.74383762508805]
Selective state-space models (SSMs) like Mamba overcome some shortcomings of Transformers.
We present a direct comparison between 8B-parameter Mamba, Mamba-2, and Transformer models trained on the same datasets.
We find that the 8B Mamba-2-Hybrid exceeds the 8B Transformer on all 12 standard tasks.
arXiv Detail & Related papers (2024-06-12T05:25:15Z)
- MambaOut: Do We Really Need Mamba for Vision? [70.60495392198686]
Mamba, an architecture whose token mixer is an RNN-like state space model (SSM), was recently introduced to address the quadratic complexity of the attention mechanism.
This paper conceptually concludes that Mamba is ideally suited for tasks with long-sequence and autoregressive characteristics.
We construct a series of models named MambaOut by stacking Mamba blocks while removing their core token mixer, the SSM.
arXiv Detail & Related papers (2024-05-13T17:59:56Z)
- SiMBA: Simplified Mamba-Based Architecture for Vision and Multivariate Time series [2.4379295576598436]
We propose SiMBA, a new architecture that introduces Einstein FFT (EinFFT) for channel modeling by specific eigenvalue computations and uses the Mamba block for sequence modeling.
We show that SiMBA outperforms existing SSMs, bridging the performance gap with state-of-the-art transformers.
arXiv Detail & Related papers (2024-03-22T17:22:56Z)
- BlackMamba: Mixture of Experts for State-Space Models [10.209192169793772]
State-space models (SSMs) have recently demonstrated competitive performance to transformers at large-scale language modeling benchmarks.
MoE models have shown remarkable performance while significantly reducing the compute and latency costs of inference.
We present BlackMamba, a novel architecture that combines the Mamba SSM with MoE to obtain the benefits of both; a minimal top-1 routing sketch is given after this list.
arXiv Detail & Related papers (2024-02-01T07:15:58Z)
- MambaByte: Token-free Selective State Space Model [71.90159903595514]
MambaByte is a token-free adaptation of the Mamba SSM trained autoregressively on byte sequences.
We show MambaByte to be competitive with, and even to outperform, state-of-the-art subword Transformers on language modeling tasks.
arXiv Detail & Related papers (2024-01-24T18:53:53Z)
- SegMamba: Long-range Sequential Modeling Mamba For 3D Medical Image Segmentation [16.476244833079182]
We introduce SegMamba, a novel 3D medical image Segmentation Mamba model.
SegMamba excels in whole volume feature modeling from a state space model standpoint.
Experiments on the BraTS2023 dataset demonstrate the effectiveness and efficiency of our SegMamba.
arXiv Detail & Related papers (2024-01-24T16:17:23Z)
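As flagged in the BlackMamba entry above, here is a minimal, hedged sketch of the general idea of pairing an SSM token mixer with a Mixture-of-Experts (MoE) MLP: a learned router sends each token to its top-1 expert. Everything here (module names, expert count, sizes) is an illustrative assumption, not BlackMamba's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TopOneMoE(nn.Module):
    """Token-level mixture-of-experts MLP with top-1 routing."""
    def __init__(self, d_model: int, n_experts: int = 4, d_hidden: int = 512):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(),
                          nn.Linear(d_hidden, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):                          # x: (batch, seq, d_model)
        flat = x.reshape(-1, x.size(-1))           # route every token independently
        gates = F.softmax(self.router(flat), dim=-1)
        weight, choice = gates.max(dim=-1)         # top-1 expert per token
        out = torch.zeros_like(flat)
        for idx, expert in enumerate(self.experts):
            sel = choice == idx                    # tokens assigned to this expert
            if sel.any():
                out[sel] = weight[sel, None] * expert(flat[sel])
        return out.reshape_as(x)


if __name__ == "__main__":
    moe = TopOneMoE(d_model=256)
    tokens = torch.randn(2, 16, 256)               # e.g. the output of an SSM block
    print(moe(tokens).shape)                       # torch.Size([2, 16, 256])
```

Because each token activates only one expert MLP, total parameters grow with the number of experts while per-token compute stays roughly constant, which is the compute-and-latency argument made in that entry.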
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences arising from its use.