Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality
- URL: http://arxiv.org/abs/2405.21060v1
- Date: Fri, 31 May 2024 17:50:01 GMT
- Title: Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality
- Authors: Tri Dao, Albert Gu
- Abstract summary: State-space models (SSMs) such as Mamba have recently been shown to match or outperform Transformers at small to medium scale.
Our state space duality (SSD) framework allows us to design a new architecture (Mamba-2) whose core layer is a refinement of Mamba's selective SSM that is 2-8X faster.
- Score: 31.985243136674146
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: While Transformers have been the main architecture behind deep learning's success in language modeling, state-space models (SSMs) such as Mamba have recently been shown to match or outperform Transformers at small to medium scale. We show that these families of models are actually quite closely related, and develop a rich framework of theoretical connections between SSMs and variants of attention, connected through various decompositions of a well-studied class of structured semiseparable matrices. Our state space duality (SSD) framework allows us to design a new architecture (Mamba-2) whose core layer is a refinement of Mamba's selective SSM that is 2-8X faster, while continuing to be competitive with Transformers on language modeling.
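For intuition, here is a minimal NumPy sketch of the duality the abstract describes (illustrative only; not the paper's algorithm or the Mamba-2 kernel, and all variable names are made up for the example): a scalar selective SSM recurrence computes the same sequence map as multiplication by a lower-triangular semiseparable matrix, i.e. a masked attention-like form.

```python
import numpy as np

# Minimal sketch of the SSM <-> semiseparable-matrix duality (illustrative only).
rng = np.random.default_rng(0)
T = 6                               # sequence length
x = rng.normal(size=T)              # 1-D input sequence
a = rng.uniform(0.5, 1.0, size=T)   # per-step decays ("selective" A_t)
b = rng.normal(size=T)              # input projections B_t
c = rng.normal(size=T)              # output projections C_t

# Recurrent (linear-time) view: h_t = a_t * h_{t-1} + b_t * x_t,  y_t = c_t * h_t
h, y_rec = 0.0, np.zeros(T)
for t in range(T):
    h = a[t] * h + b[t] * x[t]
    y_rec[t] = c[t] * h

# Dual (quadratic, attention-like) view: y = M @ x with
# M[t, s] = c_t * (a_{s+1} * ... * a_t) * b_s for s <= t, i.e. a lower-triangular
# semiseparable matrix playing the role of a masked attention matrix.
M = np.zeros((T, T))
for t in range(T):
    for s in range(t + 1):
        M[t, s] = c[t] * np.prod(a[s + 1 : t + 1]) * b[s]
y_mat = M @ x

assert np.allclose(y_rec, y_mat)    # both views produce the same output
```

The two views trade a linear-time recurrence against a quadratic matrix form that maps onto hardware-friendly matrix multiplications, which is roughly the trade-off the SSD framework is built around.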
Related papers
- Transformers to SSMs: Distilling Quadratic Knowledge to Subquadratic Models [92.36510016591782]
We present a method that is able to distill a pretrained Transformer architecture into alternative architectures such as state space models (SSMs)
Our method, called MOHAWK, is able to distill a Mamba-2 variant based on the Phi-1.5 architecture using only 3B tokens and a hybrid version (Hybrid Phi-Mamba) using 5B tokens.
Despite using less than 1% of the training data typically used to train models from scratch, Phi-Mamba boasts substantially stronger performance compared to all past open-source non-Transformer models.
arXiv Detail & Related papers (2024-08-19T17:48:11Z) - Enhanced Structured State Space Models via Grouped FIR Filtering and Attention Sink Mechanisms [0.6718184400443239]
We propose an advanced architecture that mitigates challenges by decomposing A-multiplications into multiple groups.
Inspired by the "attention sink" phenomenon identified in streaming language models, we incorporate a similar mechanism to enhance the stability and performance of our model.
arXiv Detail & Related papers (2024-08-01T02:49:58Z) - Longhorn: State Space Models are Amortized Online Learners [51.10124201221601]
State-space models (SSMs) offer linear decoding efficiency while maintaining parallelism during training.
In this work, we explore SSM design through the lens of online learning, conceptualizing SSMs as meta-modules for specific online learning problems.
We introduce a novel deep SSM architecture, Longhorn, whose update resembles the closed-form solution for solving the online associative recall problem.
arXiv Detail & Related papers (2024-07-19T11:12:08Z) - An Empirical Study of Mamba-based Language Models [69.74383762508805]
Selective state-space models (SSMs) like Mamba overcome some shortcomings of Transformers.
We present a direct comparison between 8B-parameter Mamba, Mamba-2, and Transformer models trained on the same datasets.
We find that the 8B Mamba-2-Hybrid exceeds the 8B Transformer on all 12 standard tasks.
arXiv Detail & Related papers (2024-06-12T05:25:15Z) - The Expressive Capacity of State Space Models: A Formal Language Perspective [0.8948475969696075]
Recurrent models based on linear state space models (SSMs) have shown promising performance in language modeling (LM), competitive with transformers.
We present a comprehensive theoretical study of the capacity of such SSMs as it compares to that of transformers and traditional RNNs.
arXiv Detail & Related papers (2024-05-27T17:46:57Z) - The Hidden Attention of Mamba Models [54.50526986788175]
The Mamba layer offers an efficient selective state space model (SSM) that is highly effective in modeling multiple domains.
We show that such models can be viewed as attention-driven models.
This new perspective enables us to empirically and theoretically compare the underlying mechanisms to those of the self-attention layers in transformers.
arXiv Detail & Related papers (2024-03-03T18:58:21Z) - Can Mamba Learn How to Learn? A Comparative Study on In-Context Learning Tasks [25.092302463435523]
State-space models (SSMs) have been proposed as alternatives to Transformer networks in language modeling.
In this study, we evaluate the ICL performance of SSMs, focusing on Mamba, against Transformer models across various tasks.
arXiv Detail & Related papers (2024-02-06T18:56:35Z) - Repeat After Me: Transformers are Better than State Space Models at Copying [53.47717661441142]
We show that while generalized state space models are promising in terms of inference-time efficiency, they are limited compared to transformer models on tasks that require copying from the input context.
arXiv Detail & Related papers (2024-02-01T21:44:11Z) - Mamba: Linear-Time Sequence Modeling with Selective State Spaces [31.985243136674146]
Foundation models are almost universally based on the Transformer architecture and its core attention module.
We identify that a key weakness of such models is their inability to perform content-based reasoning.
Letting the SSM parameters be functions of the input addresses this weakness; we integrate the resulting selective SSMs into a simplified end-to-end neural network architecture without attention or even MLP blocks (Mamba). (A brief sketch of this selective mechanism appears after this list.)
As a general sequence model backbone, Mamba achieves state-of-the-art performance across several modalities such as language, audio, and genomics.
arXiv Detail & Related papers (2023-12-01T18:01:34Z) - A Battle of Network Structures: An Empirical Study of CNN, Transformer, and MLP [121.35904748477421]
Convolutional neural networks (CNN) are the dominant deep neural network (DNN) architecture for computer vision.
Transformer and multi-layer perceptron (MLP)-based models, such as Vision Transformer and MLP-Mixer, have started to lead new trends.
In this paper, we conduct empirical studies on these DNN structures and try to understand their respective pros and cons.
arXiv Detail & Related papers (2021-08-30T06:09:02Z)
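As referenced in the Mamba entry above, the following is a minimal sketch of a selective SSM step (illustrative only; the projection weights, dimensions, and discretization details are assumptions for the example, not the released Mamba implementation): the step size, B, and C are computed from the current token, so the model can decide per token what to keep or forget.

```python
import numpy as np

# Minimal sketch of a "selective" SSM: the state-space parameters depend on the
# current input token (illustrative only, not the released Mamba kernels).
rng = np.random.default_rng(1)
d_model, d_state = 4, 8
W_delta = 0.1 * rng.normal(size=(d_model, d_model))   # hypothetical projections
W_B = 0.1 * rng.normal(size=(d_model, d_state))
W_C = 0.1 * rng.normal(size=(d_model, d_state))
A = -np.exp(rng.normal(size=(d_model, d_state)))      # fixed negative state matrix

def selective_ssm(x):
    """x: (T, d_model) -> y: (T, d_model), per-channel diagonal selective SSM."""
    h = np.zeros((d_model, d_state))                  # recurrent state
    ys = []
    for x_t in x:
        delta = np.logaddexp(0.0, x_t @ W_delta)      # softplus step size, per channel
        B_t = x_t @ W_B                               # input-dependent B
        C_t = x_t @ W_C                               # input-dependent C
        A_bar = np.exp(delta[:, None] * A)            # discretized, token-dependent decay
        h = A_bar * h + (delta[:, None] * B_t[None, :]) * x_t[:, None]
        ys.append(h @ C_t)                            # read out with C_t
    return np.stack(ys)

y = selective_ssm(rng.normal(size=(5, d_model)))
print(y.shape)  # (5, 4)
```

Because the step size, B, and C vary with the token, the recurrence can selectively propagate or suppress information, which is what the Mamba entry means by addressing content-based reasoning.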