MaTVLM: Hybrid Mamba-Transformer for Efficient Vision-Language Modeling
- URL: http://arxiv.org/abs/2503.13440v2
- Date: Tue, 18 Mar 2025 07:07:28 GMT
- Title: MaTVLM: Hybrid Mamba-Transformer for Efficient Vision-Language Modeling
- Authors: Yingyue Li, Bencheng Liao, Wenyu Liu, Xinggang Wang
- Abstract summary: We present a hybrid model MaTVLM by substituting a portion of the transformer decoder layers in a pre-trained VLM with Mamba-2 layers. We employ a single-stage distillation process, using the pre-trained VLM as the teacher model to transfer knowledge to the MaTVLM. Remarkably, the MaTVLM achieves up to 3.6x faster inference than the teacher model while reducing GPU memory consumption by 27.5%.
- Score: 36.527618275553955
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: With the advancement of RNN models with linear complexity, the quadratic complexity challenge of transformers has the potential to be overcome. Notably, the emerging Mamba-2 has demonstrated competitive performance, bridging the gap between RNN models and transformers. However, due to sequential processing and vanishing gradients, RNN models struggle to capture long-range dependencies, limiting contextual understanding. This results in slow convergence, high resource demands, and poor performance on downstream understanding and complex reasoning tasks. In this work, we present MaTVLM, a hybrid model built by substituting a portion of the transformer decoder layers in a pre-trained VLM with Mamba-2 layers. Leveraging the inherent relationship between attention and Mamba-2, we initialize Mamba-2 with corresponding attention weights to accelerate convergence. Subsequently, we employ a single-stage distillation process, using the pre-trained VLM as the teacher model to transfer knowledge to the MaTVLM, further enhancing convergence speed and performance. Furthermore, we investigate the impact of differential distillation loss within our training framework. We evaluate the MaTVLM on multiple benchmarks, demonstrating competitive performance against the teacher model and existing VLMs while surpassing both Mamba-based VLMs and models of comparable parameter scales. Remarkably, the MaTVLM achieves up to 3.6x faster inference than the teacher model while reducing GPU memory consumption by 27.5%, all without compromising performance. Code and models are released at http://github.com/hustvl/MaTVLM.
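The abstract describes two ingredients: swapping a subset of decoder layers for Mamba-2 layers initialized from the corresponding attention weights, and a single-stage distillation against the teacher VLM. Below is a minimal illustrative sketch of that recipe in PyTorch; the `Mamba2LayerStub`, the `q_proj`/`k_proj`/`v_proj`/`o_proj` naming, the 1-in-4 replacement ratio, and the loss weights are assumptions for illustration, not the released MaTVLM code (see the linked repository for the authors' implementation).

```python
# Minimal illustrative sketch (not the released MaTVLM code): build a hybrid
# decoder by swapping every k-th transformer layer for a Mamba-2-style block
# initialized from the corresponding attention weights, then train it with a
# single-stage distillation loss against the frozen teacher VLM.
import copy

import torch
import torch.nn.functional as F
from torch import nn


class Mamba2LayerStub(nn.Module):
    """Stand-in for a real Mamba-2 block (e.g. mamba_ssm.Mamba2).

    Only the projections relevant to attention-weight initialization are
    modelled here; the SSM recurrence itself is omitted for brevity.
    """

    def __init__(self, d_model: int):
        super().__init__()
        self.in_proj_x = nn.Linear(d_model, d_model, bias=False)  # analogue of V
        self.in_proj_B = nn.Linear(d_model, d_model, bias=False)  # analogue of K
        self.in_proj_C = nn.Linear(d_model, d_model, bias=False)  # analogue of Q
        self.out_proj = nn.Linear(d_model, d_model, bias=False)   # analogue of O

    def init_from_attention(self, attn: nn.Module) -> None:
        # Assumed mapping between attention and SSM projections (Q -> C,
        # K -> B, V -> x, O -> out), ignoring head/GQA bookkeeping; assumes a
        # LLaMA-style attention module exposing q_proj/k_proj/v_proj/o_proj.
        self.in_proj_C.weight.data.copy_(attn.q_proj.weight.data)
        self.in_proj_B.weight.data.copy_(attn.k_proj.weight.data)
        self.in_proj_x.weight.data.copy_(attn.v_proj.weight.data)
        self.out_proj.weight.data.copy_(attn.o_proj.weight.data)


def build_hybrid_decoder(teacher_layers: nn.ModuleList, d_model: int,
                         replace_every: int = 4) -> nn.ModuleList:
    """Replace a portion of the decoder layers with Mamba-2-style layers.

    The 1-in-`replace_every` ratio is an illustrative choice, not the paper's.
    """
    hybrid = nn.ModuleList()
    for i, layer in enumerate(teacher_layers):
        if i % replace_every == 0:
            mamba = Mamba2LayerStub(d_model)
            mamba.init_from_attention(layer.self_attn)  # warm-start from attention
            hybrid.append(mamba)
        else:
            hybrid.append(copy.deepcopy(layer))  # keep the original attention layer
    return hybrid


def distillation_step(student, teacher, batch, w_kl=1.0, w_ce=1.0, tau=2.0):
    """One single-stage distillation step with the teacher VLM frozen.

    Assumes Hugging-Face-style models whose forward pass returns `.logits`
    and a batch containing `labels`; the loss weights are placeholders.
    """
    with torch.no_grad():
        t_logits = teacher(**batch).logits
    s_logits = student(**batch).logits
    # Soft-target KL between teacher and student token distributions.
    kl = F.kl_div(
        F.log_softmax(s_logits / tau, dim=-1),
        F.softmax(t_logits / tau, dim=-1),
        reduction="batchmean",
    ) * tau ** 2
    # Standard next-token cross-entropy on the ground-truth labels.
    ce = F.cross_entropy(
        s_logits.reshape(-1, s_logits.size(-1)),
        batch["labels"].reshape(-1),
        ignore_index=-100,
    )
    return w_kl * kl + w_ce * ce
```

In this sketch the retained transformer layers are copied from the teacher, so the distillation mainly has to adapt the newly inserted Mamba-2-style layers.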
Related papers
- Diffusion Transformer-to-Mamba Distillation for High-Resolution Image Generation [65.46359545280546]
This paper introduces diffusion transformer-to-mamba distillation (T2MD) to form an efficient training pipeline.
We establish a diffusion self-attention and Mamba hybrid model that achieves both efficiency and global dependency modeling.
Experiments demonstrate that our training path leads to low overhead but high-quality text-to-image generation.
arXiv Detail & Related papers (2025-06-23T18:01:19Z)
- Routing Mamba: Scaling State Space Models with Mixture-of-Experts Projection [88.47928738482719]
Linear State Space Models (SSMs) offer remarkable performance gains in sequence modeling.
Recent advances, such as Mamba, further enhance SSMs with input-dependent gating and hardware-aware implementations.
We introduce Routing Mamba (RoM), a novel approach that scales SSM parameters using sparse mixtures of linear projection experts.
arXiv Detail & Related papers (2025-06-22T19:26:55Z)
- Multimodal Mamba: Decoder-only Multimodal State Space Model via Quadratic to Linear Distillation [36.44678935063189]
mmMamba is a framework for developing linear-complexity native multimodal state space models.
Our approach enables the direct conversion of trained decoder-only MLLMs to linear-complexity architectures.
arXiv Detail & Related papers (2025-02-18T18:59:57Z)
- M2R2: Mixture of Multi-Rate Residuals for Efficient Transformer Inference [8.792650582656913]
We introduce Mixture of Multi-rate Residuals (M2R2), a framework that dynamically modulates residual velocity to improve early alignment.
M2R2 surpasses state-of-the-art distance-based strategies, balancing generation quality and speedup.
In a self-speculative decoding setup, M2R2 achieves up to 2.8x speedups on MT-Bench.
arXiv Detail & Related papers (2025-02-04T06:13:52Z)
- MatIR: A Hybrid Mamba-Transformer Image Restoration Model [95.17418386046054]
We propose a Mamba-Transformer hybrid image restoration model called MatIR.
MatIR cross-cycles the blocks of the Transformer layer and the Mamba layer to extract features.
In the Mamba module, we introduce the Image Inpainting State Space (IRSS) module, which traverses along four scan paths.
arXiv Detail & Related papers (2025-01-30T14:55:40Z)
- p-MoD: Building Mixture-of-Depths MLLMs via Progressive Ratio Decay [18.958138693220704]
We propose to build efficient multimodal large language models (MLLMs) by leveraging the Mixture-of-Depths (MoD) mechanism.
We adapt the MoD module with two novel designs: tanh-gated weight normalization (TanhNorm) and symmetric token reweighting (STRing).
Our model, p-MoD, matches or even surpasses the performance of the baseline models, with only 55.6% TFLOPs and 53.8% KV cache storage during inference, and 77.7% GPU hours during training.
arXiv Detail & Related papers (2024-12-05T18:58:03Z)
- MobileMamba: Lightweight Multi-Receptive Visual Mamba Network [51.33486891724516]
Previous research on lightweight models has primarily focused on CNNs and Transformer-based designs.
We propose the MobileMamba framework, which balances efficiency and performance.
MobileMamba achieves up to 83.6% Top-1 accuracy, surpassing existing state-of-the-art methods.
arXiv Detail & Related papers (2024-11-24T18:01:05Z)
- MaskMamba: A Hybrid Mamba-Transformer Model for Masked Image Generation [63.73137438677585]
MaskMamba is a novel hybrid model that combines Mamba and Transformer architectures.
It achieves a remarkable $54.44\%$ improvement in inference speed at a resolution of $2048\times 2048$ over Transformer.
arXiv Detail & Related papers (2024-09-30T04:28:55Z)
- The Mamba in the Llama: Distilling and Accelerating Hybrid Models [76.64055251296548]
We show how to distill large Transformers into linear RNNs by reusing the linear projection weights from attention layers with academic GPU resources.
The resulting hybrid model achieves performance comparable to the original Transformer in chat benchmarks.
We also introduce a hardware-aware speculative decoding algorithm that accelerates the inference speed of Mamba and hybrid models.
arXiv Detail & Related papers (2024-08-27T17:56:11Z)
- Improving Rare Word Recognition with LM-aware MWER Training [50.241159623691885]
We introduce LMs in the learning of hybrid autoregressive transducer (HAT) models in the discriminative training framework.
For the shallow fusion setup, we use LMs during both hypotheses generation and loss computation, and the LM-aware MWER-trained model achieves 10% relative improvement.
For the rescoring setup, we learn a small neural module to generate per-token fusion weights in a data-dependent manner.
arXiv Detail & Related papers (2022-04-15T17:19:41Z)