TransMamba: Fast Universal Architecture Adaption from Transformers to Mamba
- URL: http://arxiv.org/abs/2502.15130v2
- Date: Thu, 09 Oct 2025 07:04:31 GMT
- Title: TransMamba: Fast Universal Architecture Adaption from Transformers to Mamba
- Authors: Xiuwei Chen, Wentao Hu, Xiao Dong, Sihao Lin, Zisheng Chen, Meng Cao, Yina Zhuang, Jianhua Han, Hang Xu, Xiaodan Liang,
- Abstract summary: We propose a cross-architecture knowledge transfer paradigm, TransMamba, that facilitates the reuse of Transformer pre-trained knowledge. We propose a two-stage framework to accelerate the training of Mamba-based models, ensuring their effectiveness across both uni-modal and multi-modal tasks.
- Score: 66.80624029365448
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Transformer-based architectures have become the backbone of both uni-modal and multi-modal foundation models, largely due to their scalability via attention mechanisms, resulting in a rich ecosystem of publicly available pre-trained models such as LLaVA, CLIP, and DeiT. In parallel, emerging sub-quadratic architectures like Mamba offer promising efficiency gains by enabling global context modeling with linear complexity. However, training these architectures from scratch remains resource-intensive (e.g., in terms of data and time). Motivated by this challenge, we explore a cross-architecture knowledge transfer paradigm, termed TransMamba, that facilitates the reuse of Transformer pre-trained knowledge. We propose a two-stage framework to accelerate the training of Mamba-based models, ensuring their effectiveness across both uni-modal and multi-modal tasks. The first stage leverages pre-trained Transformer models to initialize critical components of the Mamba architecture. To bridge architectural and dimensional gaps, we develop a selective weight subcloning strategy and a layered initialization scheme that prioritizes the early $n$ layers. Building on this initialization, the second stage introduces an adaptive multi-directional knowledge distillation method. This mechanism employs layer-wise adaptive scaling factors to align Mamba representations with their Transformer counterparts, while accommodating the scanning order variations inherent to multi-modal Mamba architectures. Despite operating with a reduced training dataset and a more compact model architecture, TransMamba consistently outperforms baseline approaches across diverse Mamba-based backbones (e.g., PlainMamba, VMamba, ViM and VideoMamba) and downstream tasks (e.g., image classification, visual question answering, text-video retrieval and multimodal reasoning). All code and implementation details will be released.
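The two-stage recipe in the abstract can be sketched in a few lines. This is only an illustration under assumptions: `subclone_init` truncates mismatched weight matrices for the first $n$ layers (the paper's exact matching rules are not reproduced), and `adaptive_distill_loss` uses fixed per-layer scaling factors where the paper learns them.

```python
import numpy as np

def subclone_init(teacher_weights, student_weights, n):
    """Selective weight subcloning sketch: copy the first n teacher
    weight matrices into the student, truncating rows/columns where
    shapes differ (illustrative, not the paper's exact scheme)."""
    for i in range(min(n, len(student_weights), len(teacher_weights))):
        t, s = teacher_weights[i], student_weights[i]
        r, c = min(t.shape[0], s.shape[0]), min(t.shape[1], s.shape[1])
        s[:r, :c] = t[:r, :c]
    return student_weights

def adaptive_distill_loss(student_feats, teacher_feats, scales):
    """Layer-wise MSE between student and teacher representations,
    each layer weighted by a scaling factor (fixed here; adaptive /
    learnable in the paper)."""
    return sum(a * float(np.mean((s - t) ** 2))
               for a, s, t in zip(scales, student_feats, teacher_feats))
```

In practice the student here would be a Mamba block's projection matrices and the teacher a Transformer layer's; the truncation stands in for the paper's dimensional bridging.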
Related papers
- Trained Mamba Emulates Online Gradient Descent in In-Context Linear Regression [90.93281146423378]
Mamba is an efficient Transformer alternative with linear complexity for long-sequence modeling. Recent empirical works demonstrate that Mamba's in-context learning (ICL) is competitive with Transformers. This paper studies the training dynamics of Mamba on the linear regression ICL task.
arXiv Detail & Related papers (2025-09-28T09:48:49Z) - DYNAMAX: Dynamic computing for Transformers and Mamba based architectures [2.5739385355356714]
Early exits (EEs) offer a promising approach to reducing computational costs and latency by dynamically terminating inference once a satisfactory prediction confidence on a data sample is achieved. This work introduces DYNAMAX, the first framework to exploit the unique properties of Mamba architectures for early exit mechanisms.
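The early-exit mechanism described above can be sketched generically: after each block, a lightweight auxiliary head scores the prediction, and inference stops once confidence clears a threshold. This is a simplified illustration; DYNAMAX's Mamba-specific exit classifiers are not modeled here, and `blocks`, `heads`, and `threshold` are hypothetical names.

```python
def early_exit_predict(blocks, heads, x, threshold=0.9):
    """Run blocks sequentially; after each block, an auxiliary head
    produces class probabilities, and inference stops as soon as the
    top confidence reaches the threshold (early-exit sketch)."""
    probs = None
    for block, head in zip(blocks, heads):
        x = block(x)
        probs = head(x)
        if max(probs) >= threshold:
            break  # confident enough: skip the remaining layers
    return probs
```

The saving comes from the `break`: easy samples exit after a few blocks, while hard samples still traverse the full stack.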
arXiv Detail & Related papers (2025-04-29T16:38:15Z) - RoMA: Scaling up Mamba-based Foundation Models for Remote Sensing [47.536214063122515]
RoMA is a framework that enables scalable self-supervised pretraining of RS foundation models using large-scale, diverse, unlabeled data. RoMA enhances scalability for high-resolution images through a tailored auto-regressive learning strategy. Experiments across scene classification, object detection, and semantic segmentation tasks demonstrate that RoMA-pretrained Mamba models consistently outperform ViT-based counterparts in both accuracy and computational efficiency.
arXiv Detail & Related papers (2025-03-13T14:09:18Z) - A Survey on Mamba Architecture for Vision Applications [7.216568558372857]
Mamba architecture addresses scalability challenges in visual tasks.
Vision Mamba and VideoMamba introduce bidirectional scanning, selective mechanisms, and temporal processing to enhance image and video understanding.
These advancements position Mamba as a promising architecture in computer vision research and applications.
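The multi-directional scanning that these vision Mamba variants rely on can be illustrated by flattening a 2-D feature map along several traversal orders. This is purely illustrative (real models scan token embeddings, not raw 2-D maps, and the exact set of paths varies by architecture):

```python
import numpy as np

def scan_paths(feature_map):
    """Flatten a 2-D feature map along four traversal orders:
    row-major, reversed row-major, column-major, reversed
    column-major, mimicking multi-directional visual Mamba scans."""
    row = feature_map.reshape(-1)        # left-to-right, top-to-bottom
    col = feature_map.T.reshape(-1)      # top-to-bottom, left-to-right
    return [row, row[::-1], col, col[::-1]]
```

Each path feeds the same sequence model a different ordering, so tokens that are far apart in one scan are close in another.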
arXiv Detail & Related papers (2025-02-11T00:59:30Z) - MatIR: A Hybrid Mamba-Transformer Image Restoration Model [95.17418386046054]
We propose a Mamba-Transformer hybrid image restoration model called MatIR. MatIR cross-cycles the blocks of the Transformer layer and the Mamba layer to extract features. In the Mamba module, we introduce the Image Restoration State Space (IRSS) module, which traverses along four scan paths.
arXiv Detail & Related papers (2025-01-30T14:55:40Z) - Mamba-SEUNet: Mamba UNet for Monaural Speech Enhancement [54.427965535613886]
Mamba, as a novel state-space model (SSM), has gained widespread application in natural language processing and computer vision. In this work, we introduce Mamba-SEUNet, an innovative architecture that integrates Mamba with U-Net for SE tasks.
arXiv Detail & Related papers (2024-12-21T13:43:51Z) - MobileMamba: Lightweight Multi-Receptive Visual Mamba Network [51.33486891724516]
Previous research on lightweight models has primarily focused on CNNs and Transformer-based designs.
We propose the MobileMamba framework, which balances efficiency and performance.
MobileMamba achieves up to 83.6% Top-1 accuracy, surpassing existing state-of-the-art methods.
arXiv Detail & Related papers (2024-11-24T18:01:05Z) - Parameter Efficient Mamba Tuning via Projector-targeted Diagonal-centric Linear Transformation [14.57480367514423]
We introduce two key insight-driven strategies for parameter-efficient fine-tuning (PEFT) in the Mamba architecture. We propose a novel PEFT method specialized to the Mamba architecture: Projector-targeted Diagonal-centric Linear Transformation (ProDiaL).
arXiv Detail & Related papers (2024-11-21T04:58:20Z) - MaskMamba: A Hybrid Mamba-Transformer Model for Masked Image Generation [63.73137438677585]
MaskMamba is a novel hybrid model that combines Mamba and Transformer architectures.
It achieves a remarkable $54.44\%$ improvement in inference speed at a resolution of $2048\times 2048$ over Transformer.
arXiv Detail & Related papers (2024-09-30T04:28:55Z) - Transformers to SSMs: Distilling Quadratic Knowledge to Subquadratic Models [92.36510016591782]
We present a method that is able to distill a pretrained Transformer architecture into alternative architectures such as state space models (SSMs).
Our method, called MOHAWK, is able to distill a Mamba-2 variant based on the Phi-1.5 architecture using only 3B tokens and a hybrid version (Hybrid Phi-Mamba) using 5B tokens.
Despite using less than 1% of the training data typically used to train models from scratch, Phi-Mamba boasts substantially stronger performance compared to all past open-source non-Transformer models.
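One ingredient of this style of Transformer-to-SSM distillation can be sketched as aligning the student SSM's sequence-mixing matrix with the teacher's attention matrix under a Frobenius-norm objective. This is a simplified stand-in for one stage of MOHAWK's multi-stage procedure, not its full method:

```python
import numpy as np

def mixer_alignment_loss(student_mixer, teacher_attention):
    """Squared Frobenius distance between the student SSM's
    sequence-mixing matrix and the teacher's attention matrix
    (simplified matrix-alignment objective)."""
    diff = student_mixer - teacher_attention
    return float(np.linalg.norm(diff, ord="fro") ** 2)
```

Minimizing this over the student's SSM parameters pushes its token-mixing behavior toward the teacher's attention pattern before later stages distill hidden states and outputs.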
arXiv Detail & Related papers (2024-08-19T17:48:11Z) - MambaMIM: Pre-training Mamba with State Space Token-interpolation [14.343466340528687]
We introduce a generative self-supervised learning method for Mamba (MambaMIM) based on Selective Structure State Space Sequence Token-interpolation (S6T).
MambaMIM can be used on any single or hybrid Mamba architecture to enhance Mamba's long-range representation capability.
arXiv Detail & Related papers (2024-08-15T10:35:26Z) - A Survey of Mamba [27.939712558507516]
Recently, a novel architecture named Mamba has emerged as a promising alternative for building foundation models.
This study investigates the advancements of Mamba-based models, the techniques of adapting Mamba to diverse data, and the applications where Mamba can excel.
arXiv Detail & Related papers (2024-08-02T09:18:41Z) - Dimba: Transformer-Mamba Diffusion Models [32.04949173308355]
This paper unveils Dimba, a new text-to-image diffusion model that employs a distinctive hybrid architecture combining Transformer and Mamba elements.
Extensive experiments indicate that Dimba achieves performance comparable to benchmark models in terms of image quality, artistic rendering, and semantic control.
arXiv Detail & Related papers (2024-06-03T09:51:59Z) - ChangeMamba: Remote Sensing Change Detection With Spatiotemporal State Space Model [18.063680125378347]
The Mamba architecture has shown remarkable performance in a series of natural language processing tasks. We tailor the corresponding frameworks, called MambaBCD, MambaSCD, and MambaBDA, for binary change detection, semantic change detection, and building damage assessment. All three frameworks adopt the cutting-edge Visual Mamba architecture as the encoder, which allows full learning of global spatial contextual information from the input images.
arXiv Detail & Related papers (2024-04-04T13:06:25Z) - Is Mamba Capable of In-Context Learning? [63.682741783013306]
State-of-the-art foundation models such as GPT-4 perform surprisingly well at in-context learning (ICL).
This work provides empirical evidence that Mamba, a newly proposed state space model, has similar ICL capabilities.
arXiv Detail & Related papers (2024-02-05T16:39:12Z)
This list is automatically generated from the titles and abstracts of the papers in this site.