Related papers: MoE-Mamba: Efficient Selective State Space Models with Mixture of Experts

MoE-Mamba: Efficient Selective State Space Models with Mixture of Experts

URL: http://arxiv.org/abs/2401.04081v2
Date: Mon, 26 Feb 2024 17:04:41 GMT
Title: MoE-Mamba: Efficient Selective State Space Models with Mixture of Experts
Authors: Maciej Pi\'oro, Kamil Ciebiera, Krystian Kr\'ol, Jan Ludziejewski, Micha{\l} Krutul, Jakub Krajewski, Szymon Antoniak, Piotr Mi{\l}o\'s, Marek Cygan, Sebastian Jaszczur
Abstract summary: State Space Models (SSMs) have become serious contenders in the field of sequential modeling. MoE has significantly improved Transformer-based Large Language Models, including recent state-of-the-art open models. We propose that to unlock the potential of SSMs for scaling, they should be combined with MoE.
Score: 4.293771840782942
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: State Space Models (SSMs) have become serious contenders in the field of sequential modeling, challenging the dominance of Transformers. At the same time, Mixture of Experts (MoE) has significantly improved Transformer-based Large Language Models, including recent state-of-the-art open models. We propose that to unlock the potential of SSMs for scaling, they should be combined with MoE. We showcase this on Mamba, a recent SSM-based model that achieves remarkable performance. Our model, MoE-Mamba, outperforms both Mamba and baseline Transformer-MoE. In particular, MoE-Mamba reaches the same performance as Mamba in $2.35\times$ fewer training steps while preserving the inference performance gains of Mamba against Transformer.

Related papers

Routing Mamba: Scaling State Space Models with Mixture-of-Experts Projection [88.47928738482719]
Linear State Space Models (SSMs) offer remarkable performance gains in sequence modeling.<n>Recent advances, such as Mamba, further enhance SSMs with input-dependent gating and hardware-aware implementations.<n>We introduce Routing Mamba (RoM), a novel approach that scales SSM parameters using sparse mixtures of linear projection experts.
arXiv Detail & Related papers (2025-06-22T19:26:55Z)
MobileMamba: Lightweight Multi-Receptive Visual Mamba Network [51.33486891724516]
Previous research on lightweight models has primarily focused on CNNs and Transformer-based designs. We propose the MobileMamba framework, which balances efficiency and performance. MobileMamba achieves up to 83.6% on Top-1, surpassing existing state-of-the-art methods.
arXiv Detail & Related papers (2024-11-24T18:01:05Z)
Bi-Mamba: Towards Accurate 1-Bit State Space Models [28.478762133816726]
Bi-Mamba is a scalable and powerful 1-bit Mamba architecture designed for more efficient large language models. Bi-Mamba achieves performance comparable to its full-precision counterparts (e.g., FP16 or BF16) and much better accuracy than post-training-binarization (PTB) Mamba baselines.
arXiv Detail & Related papers (2024-11-18T18:59:15Z)
MambaPEFT: Exploring Parameter-Efficient Fine-Tuning for Mamba [0.5530212768657544]
Mamba, a State Space Model (SSM)-based model, has attracted attention as a potential alternative to Transformers. We investigate the effectiveness of existing PEFT methods for Transformers when applied to Mamba. We propose new Mamba-specific PEFT methods that leverage the distinctive structure of Mamba.
arXiv Detail & Related papers (2024-11-06T11:57:55Z)
ReMamba: Equip Mamba with Effective Long-Sequence Modeling [50.530839868893786]
We propose ReMamba, which enhances Mamba's ability to comprehend long contexts. ReMamba incorporates selective compression and adaptation techniques within a two-stage re-forward process.
arXiv Detail & Related papers (2024-08-28T02:47:27Z)
MambaVision: A Hybrid Mamba-Transformer Vision Backbone [54.965143338206644]
We propose a novel hybrid Mamba-Transformer backbone, denoted as MambaVision, which is specifically tailored for vision applications. Our core contribution includes redesigning the Mamba formulation to enhance its capability for efficient modeling of visual features. We conduct a comprehensive ablation study on the feasibility of integrating Vision Transformers (ViT) with Mamba.
arXiv Detail & Related papers (2024-07-10T23:02:45Z)
Snakes and Ladders: Two Steps Up for VideoMamba [10.954210339694841]
In this paper, we theoretically analyze the differences between self-attention and Mamba. We propose VideoMambaPro models that surpass VideoMamba by 1.6-2.8% and 1.1-1.9% top-1. Our two solutions are to recent advances in Vision Mamba models, and are likely to provide further improvements in future models.
arXiv Detail & Related papers (2024-06-27T08:45:31Z)
An Empirical Study of Mamba-based Language Models [69.74383762508805]
Selective state-space models (SSMs) like Mamba overcome some shortcomings of Transformers. We present a direct comparison between 8B-context Mamba, Mamba-2, and Transformer models trained on the same datasets. We find that the 8B Mamba-2-Hybrid exceeds the 8B Transformer on all 12 standard tasks.
arXiv Detail & Related papers (2024-06-12T05:25:15Z)
Demystify Mamba in Vision: A Linear Attention Perspective [72.93213667713493]
Mamba is an effective state space model with linear computation complexity. We show that Mamba shares surprising similarities with linear attention Transformer. We propose a Mamba-Like Linear Attention (MLLA) model by incorporating the merits of these two key designs into linear attention.
arXiv Detail & Related papers (2024-05-26T15:31:09Z)
MambaOut: Do We Really Need Mamba for Vision? [70.60495392198686]
Mamba, an architecture with RNN-like token mixer of state space model (SSM), was recently introduced to address the quadratic complexity of the attention mechanism. This paper conceptually concludes that Mamba is ideally suited for tasks with long-sequence and autoregressive characteristics. We construct a series of models named MambaOut through stacking Mamba blocks while removing their core token mixer, SSM.
arXiv Detail & Related papers (2024-05-13T17:59:56Z)
CLIP-Mamba: CLIP Pretrained Mamba Models with OOD and Hessian Evaluation [18.383760896304604]
This report introduces the first attempt to train a Mamba model utilizing contrastive technical-image pretraining (CLIP) A Mamba model 67 million parameters is on par with a 307 million- parameters Vision Transformer (ViT) model in zero-shot classification tasks.
arXiv Detail & Related papers (2024-04-30T09:40:07Z)
BlackMamba: Mixture of Experts for State-Space Models [10.209192169793772]
State-space models (SSMs) have recently demonstrated competitive performance to transformers at large-scale language modeling benchmarks. MoE models have shown remarkable performance while significantly reducing the compute and latency costs of inference. We present BlackMamba, a novel architecture that combines the Mamba SSM with MoE to obtain the benefits of both.
arXiv Detail & Related papers (2024-02-01T07:15:58Z)

This list is automatically generated from the titles and abstracts of the papers in this site.