Mixture-of-Mamba: Enhancing Multi-Modal State-Space Models with Modality-Aware Sparsity
- URL: http://arxiv.org/abs/2501.16295v1
- Date: Mon, 27 Jan 2025 18:35:05 GMT
- Title: Mixture-of-Mamba: Enhancing Multi-Modal State-Space Models with Modality-Aware Sparsity
- Authors: Weixin Liang, Junhong Shen, Genghan Zhang, Ning Dong, Luke Zettlemoyer, Lili Yu
- Abstract summary: State Space Models (SSMs) have emerged as efficient alternatives to Transformers for sequential modeling.
We propose a novel SSM architecture that introduces modality-aware sparsity through modality-specific parameterization of the Mamba block.
We evaluate Mixture-of-Mamba across three multi-modal pretraining settings.
- Score: 56.0251572416922
- Abstract: State Space Models (SSMs) have emerged as efficient alternatives to Transformers for sequential modeling, but their inability to leverage modality-specific features limits their performance in multi-modal pretraining. Here, we propose Mixture-of-Mamba, a novel SSM architecture that introduces modality-aware sparsity through modality-specific parameterization of the Mamba block. Building on Mixture-of-Transformers (W. Liang et al., arXiv:2411.04996, 2024), we extend the benefits of modality-aware sparsity to SSMs while preserving their computational efficiency. We evaluate Mixture-of-Mamba across three multi-modal pretraining settings: Transfusion (interleaved text and continuous image tokens with diffusion loss), Chameleon (interleaved text and discrete image tokens), and an extended three-modality framework incorporating speech. Mixture-of-Mamba consistently reaches the same loss values at earlier training steps with significantly reduced computational costs. In the Transfusion setting, Mixture-of-Mamba achieves equivalent image loss using only 34.76% of the training FLOPs at the 1.4B scale. In the Chameleon setting, Mixture-of-Mamba reaches similar image loss with just 42.50% of the FLOPs at the 1.4B scale, and similar text loss with just 65.40% of the FLOPs. In the three-modality setting, Mixture-of-Mamba matches speech loss at 24.80% of the FLOPs at the 1.4B scale. Our ablation study highlights the synergistic effects of decoupling projection components, where joint decoupling yields greater gains than individual modifications. These results establish modality-aware sparsity as a versatile and effective design principle, extending its impact from Transformers to SSMs and setting new benchmarks in multi-modal pretraining. Our code can be accessed at https://github.com/Weixin-Liang/Mixture-of-Mamba
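The core idea described in the abstract (routing each token through modality-specific projection weights inside an otherwise shared Mamba-style block) can be sketched as follows. This is a minimal illustration under stated assumptions, not the released implementation: the class and argument names (ModalityAwareMambaBlock, num_modalities) are hypothetical, a depthwise convolution stands in for the selective-scan SSM kernel, and a full Mamba block would also decouple the dt/B/C projections per modality.

```python
import torch
import torch.nn as nn


class ModalityAwareMambaBlock(nn.Module):
    """Sketch of modality-aware sparsity: per-modality projections around a
    shared sequence mixer (a depthwise conv stands in for the SSM scan)."""

    def __init__(self, d_model: int, d_inner: int, num_modalities: int = 2):
        super().__init__()
        # One projection set per modality, e.g. 0 = text, 1 = image.
        self.in_proj = nn.ModuleList(
            nn.Linear(d_model, d_inner) for _ in range(num_modalities)
        )
        self.out_proj = nn.ModuleList(
            nn.Linear(d_inner, d_model) for _ in range(num_modalities)
        )
        # Shared sequence-mixing parameters (placeholder for the selective scan).
        self.mixer = nn.Conv1d(d_inner, d_inner, kernel_size=4, padding=3, groups=d_inner)

    def forward(self, x: torch.Tensor, modality_ids: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model); modality_ids: (batch, seq) integer labels.
        batch, seq, _ = x.shape
        h = x.new_zeros(batch, seq, self.in_proj[0].out_features)
        for m, proj in enumerate(self.in_proj):
            # For clarity each projection is applied densely and then masked;
            # an efficient version would gather tokens per modality instead.
            mask = (modality_ids == m).unsqueeze(-1)
            h = torch.where(mask, proj(x), h)
        # Shared mixing over the sequence dimension (causal depthwise conv here).
        h = self.mixer(h.transpose(1, 2))[..., :seq].transpose(1, 2)
        y = torch.zeros_like(x)
        for m, proj in enumerate(self.out_proj):
            mask = (modality_ids == m).unsqueeze(-1)
            y = torch.where(mask, proj(h), y)
        return y


# Toy usage: a batch of interleaved text (0) and image (1) tokens.
block = ModalityAwareMambaBlock(d_model=64, d_inner=128)
tokens = torch.randn(2, 16, 64)
modality_ids = torch.randint(0, 2, (2, 16))
out = block(tokens, modality_ids)  # (2, 16, 64)
```

For brevity the sketch decouples only the input and output projections; per the abstract's ablation, the reported gains are largest when the projection components are decoupled jointly rather than individually.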
Related papers
- MatIR: A Hybrid Mamba-Transformer Image Restoration Model [95.17418386046054]
We propose a Mamba-Transformer hybrid image restoration model called MatIR.
MatIR cross-cycles the blocks of the Transformer layer and the Mamba layer to extract features.
In the Mamba module, we introduce the Image Inpainting State Space (IRSS) module, which traverses along four scan paths.
arXiv Detail & Related papers (2025-01-30T14:55:40Z)
- Detail Matters: Mamba-Inspired Joint Unfolding Network for Snapshot Spectral Compressive Imaging [40.80197280147993]
We propose a Mamba-inspired Joint Unfolding Network (MiJUN) to overcome the inherent nonlinear and ill-posed characteristics of hyperspectral image (HSI) reconstruction.
We introduce an accelerated unfolding network scheme, which reduces the reliance on initial optimization stages.
We refine the scanning strategy with Mamba by integrating the tensor mode-$k$ unfolding into the Mamba network.
arXiv Detail & Related papers (2025-01-02T13:56:23Z)
- TinyViM: Frequency Decoupling for Tiny Hybrid Vision Mamba [11.176993272867396]
Mamba has shown great potential for computer vision due to its linear complexity.
Existing lightweight Mamba-based backbones have not matched the performance of convolution- or Transformer-based methods.
By integrating mobile-friendly convolution and efficient Laplace mixer, we build a series of tiny hybrid vision Mamba called TinyViM.
arXiv Detail & Related papers (2024-11-26T14:34:36Z)
- MobileMamba: Lightweight Multi-Receptive Visual Mamba Network [51.33486891724516]
Previous research on lightweight models has primarily focused on CNNs and Transformer-based designs.
We propose the MobileMamba framework, which balances efficiency and performance.
MobileMamba achieves up to 83.6% Top-1 accuracy, surpassing existing state-of-the-art methods.
arXiv Detail & Related papers (2024-11-24T18:01:05Z)
- Mixture-of-Transformers: A Sparse and Scalable Architecture for Multi-Modal Foundation Models [111.97026994761254]
Mixture-of-Transformers (MoT) is a sparse multi-modal transformer architecture.
MoT decouples non-embedding parameters of the model by modality.
We evaluate MoT across multiple settings and model scales.
arXiv Detail & Related papers (2024-11-07T18:59:06Z)
- MaskMamba: A Hybrid Mamba-Transformer Model for Masked Image Generation [63.73137438677585]
MaskMamba is a novel hybrid model that combines Mamba and Transformer architectures.
It achieves a remarkable $54.44\%$ improvement in inference speed at a resolution of $2048 \times 2048$ over Transformer.
arXiv Detail & Related papers (2024-09-30T04:28:55Z)
- MoMa: Efficient Early-Fusion Pre-training with Mixture of Modality-Aware Experts [95.26323548734692]
MoMa is a modality-aware mixture-of-experts architecture for pre-training mixed-modal, early-fusion language models.
Under a 1-trillion-token training budget, the MoMa 1.4B model, featuring 4 text experts and 4 image experts, achieves impressive FLOPs savings.
arXiv Detail & Related papers (2024-07-31T17:46:51Z)
- Mamba State-Space Models Are Lyapunov-Stable Learners [1.6385815610837167]
Mamba state-space models (SSMs) were recently shown to outperform Transformer large language models (LLMs) across various tasks.
We show that Mamba's recurrent dynamics are robust to small input changes.
We also show that instruction tuning allows Mamba models to narrow this gap to 81%, and Mamba-2 models to exceed it, reaching 132%.
arXiv Detail & Related papers (2024-05-31T21:46:23Z)
- BlackMamba: Mixture of Experts for State-Space Models [10.209192169793772]
State-space models (SSMs) have recently demonstrated performance competitive with Transformers on large-scale language modeling benchmarks.
MoE models have shown remarkable performance while significantly reducing the compute and latency costs of inference.
We present BlackMamba, a novel architecture that combines the Mamba SSM with MoE to obtain the benefits of both.
arXiv Detail & Related papers (2024-02-01T07:15:58Z)
This list is automatically generated from the titles and abstracts of the papers on this site.