Mamba: Linear-Time Sequence Modeling with Selective State Spaces
- URL: http://arxiv.org/abs/2312.00752v2
- Date: Fri, 31 May 2024 17:55:27 GMT
- Title: Mamba: Linear-Time Sequence Modeling with Selective State Spaces
- Authors: Albert Gu, Tri Dao,
- Abstract summary: Foundation models are almost universally based on the Transformer architecture and its core attention module.
We identify that a key weakness of such models is their inability to perform content-based reasoning.
We integrate these selective SSMs into a simplified end-to-end neural network architecture without attention or even blocks (Mamba)
As a general sequence model backbone, Mamba achieves state-of-the-art performance across several modalities such as language, audio, and genomics.
- Score: 31.985243136674146
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Foundation models, now powering most of the exciting applications in deep learning, are almost universally based on the Transformer architecture and its core attention module. Many subquadratic-time architectures such as linear attention, gated convolution and recurrent models, and structured state space models (SSMs) have been developed to address Transformers' computational inefficiency on long sequences, but they have not performed as well as attention on important modalities such as language. We identify that a key weakness of such models is their inability to perform content-based reasoning, and make several improvements. First, simply letting the SSM parameters be functions of the input addresses their weakness with discrete modalities, allowing the model to selectively propagate or forget information along the sequence length dimension depending on the current token. Second, even though this change prevents the use of efficient convolutions, we design a hardware-aware parallel algorithm in recurrent mode. We integrate these selective SSMs into a simplified end-to-end neural network architecture without attention or even MLP blocks (Mamba). Mamba enjoys fast inference (5$\times$ higher throughput than Transformers) and linear scaling in sequence length, and its performance improves on real data up to million-length sequences. As a general sequence model backbone, Mamba achieves state-of-the-art performance across several modalities such as language, audio, and genomics. On language modeling, our Mamba-3B model outperforms Transformers of the same size and matches Transformers twice its size, both in pretraining and downstream evaluation.
Related papers
- Longhorn: State Space Models are Amortized Online Learners [51.10124201221601]
We introduce a novel deep SSM architecture based on the implicit update for optimizing an online regression objective.
Our experimental results show that our models outperform state-of-the-art SSMs on standard sequence modeling benchmarks and language modeling tasks.
arXiv Detail & Related papers (2024-07-19T11:12:08Z) - LongVQ: Long Sequence Modeling with Vector Quantization on Structured Memory [63.41820940103348]
Self-attention mechanism's computational cost limits its practicality for long sequences.
We propose a new method called LongVQ to compress the global abstraction as a length-fixed codebook.
LongVQ effectively maintains dynamic global and local patterns, which helps to complement the lack of long-range dependency issues.
arXiv Detail & Related papers (2024-04-17T08:26:34Z) - MambaMixer: Efficient Selective State Space Models with Dual Token and Channel Selection [5.37935922811333]
MambaMixer is a new architecture with data-dependent weights that uses a dual selection mechanism across tokens and channels.
As a proof of concept, we design Vision MambaMixer (ViM2) and Time Series MambaMixer (TSM2) architectures based on the MambaMixer block.
arXiv Detail & Related papers (2024-03-29T00:05:13Z) - Is Mamba Capable of In-Context Learning? [63.682741783013306]
State of the art foundation models such as GPT-4 perform surprisingly well at in-context learning (ICL)
This work provides empirical evidence that Mamba, a newly proposed state space model, has similar ICL capabilities.
arXiv Detail & Related papers (2024-02-05T16:39:12Z) - MambaByte: Token-free Selective State Space Model [71.90159903595514]
MambaByte is a token-free adaptation of the Mamba SSM trained autoregressively on byte sequences.
We show MambaByte to be competitive with, and even to outperform, state-of-the-art subword Transformers on language modeling tasks.
arXiv Detail & Related papers (2024-01-24T18:53:53Z) - TransNormerLLM: A Faster and Better Large Language Model with Improved
TransNormer [34.790081960470964]
We present TransNormerLLM, the first linear attention-based Large Language Model (LLM)
We make advanced modifications that include positional embedding, linear attention acceleration, gating mechanisms, tensor normalization, and inference acceleration and stabilization.
We validate our model design through a series of ablations and train models with sizes of 385M, 1B, and 7B on our self-collected corpus.
arXiv Detail & Related papers (2023-07-27T16:45:33Z) - Blockwise Parallel Transformer for Large Context Models [70.97386897478238]
Blockwise Parallel Transformer (BPT) is a blockwise computation of self-attention and feedforward network fusion to minimize memory costs.
By processing longer input sequences while maintaining memory efficiency, BPT enables training sequences 32 times longer than vanilla Transformers and up to 4 times longer than previous memory-efficient methods.
arXiv Detail & Related papers (2023-05-30T19:25:51Z) - Toeplitz Neural Network for Sequence Modeling [46.04964190407727]
We show that a Toeplitz matrix-vector production trick can reduce the space-time complexity of the sequence modeling to log linear.
A lightweight sub-network called relative position encoder is proposed to generate relative position coefficients with a fixed budget of parameters.
Despite being trained on 512-token sequences, our model can extrapolate input sequence length up to 14K tokens in inference with consistent performance.
arXiv Detail & Related papers (2023-05-08T14:49:01Z) - Long-Short Transformer: Efficient Transformers for Language and Vision [97.2850205384295]
Long-Short Transformer (Transformer-LS) is an efficient self-attention mechanism for modeling long sequences with linear complexity for both language and vision tasks.
It aggregates a novel long-range attention with dynamic projection to model distant correlations and a short-term attention to capture fine-grained local correlations.
Our method outperforms the state-of-the-art models on multiple tasks in language and vision domains, including the Long Range Arena benchmark, autoregressive language modeling, and ImageNet classification.
arXiv Detail & Related papers (2021-07-05T18:00:14Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.