Data Efficient Any Transformer-to-Mamba Distillation via Attention Bridge
- URL: http://arxiv.org/abs/2510.19266v2
- Date: Thu, 23 Oct 2025 07:03:35 GMT
- Title: Data Efficient Any Transformer-to-Mamba Distillation via Attention Bridge
- Authors: Penghao Wang, Yuhao Zhou, Mengxuan Wu, Panpan Zhang, Zhangyang Wang, Kai Wang
- Abstract summary: State-space models (SSMs) have emerged as efficient alternatives to Transformers for sequence modeling, offering superior scalability through recurrent structures. We propose Cross-architecture distillation via Attention Bridge (CAB), a novel data-efficient distillation framework that efficiently transfers attention knowledge from Transformer teachers to state-space student models. Our findings suggest that attention-based knowledge can be efficiently transferred to recurrent models, enabling rapid utilization of Transformer expertise for building a stronger SSM community.
- Score: 54.948715010753745
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: State-space models (SSMs) have emerged as efficient alternatives to Transformers for sequence modeling, offering superior scalability through recurrent structures. However, their training remains costly and the ecosystem around them is far less mature than that of Transformers. Moreover, the structural heterogeneity between SSMs and Transformers makes it challenging to efficiently distill knowledge from pretrained attention models. In this work, we propose Cross-architecture distillation via Attention Bridge (CAB), a novel data-efficient distillation framework that transfers attention knowledge from Transformer teachers to state-space student models. Unlike conventional knowledge distillation that transfers knowledge only at the output level, CAB enables token-level supervision via a lightweight bridge, improving both efficiency and transferability. We further introduce flexible layer-wise alignment strategies to accommodate architectural discrepancies between teacher and student. Extensive experiments across vision and language domains demonstrate that our method consistently improves the performance of state-space models, even under limited training data, outperforming both standard and cross-architecture distillation methods. Our findings suggest that attention-based knowledge can be efficiently transferred to recurrent models, enabling rapid utilization of Transformer expertise for building a stronger SSM community.
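The abstract describes the bridge mechanism only at a high level. As a rough illustration of the idea, a lightweight bridge could project student (SSM) hidden states into attention-like maps that are then matched against head-averaged teacher attention maps, with a simple uniform layer-wise alignment. The module names, shapes, loss choice, and alignment rule below are assumptions for this sketch, not the paper's actual implementation.

```python
# Hypothetical sketch of an "attention bridge" distillation loss.
# Names (AttentionBridge, cab_loss) and design choices are illustrative
# assumptions, not the paper's actual method.
import torch
import torch.nn as nn
import torch.nn.functional as F


class AttentionBridge(nn.Module):
    """Maps student (SSM) hidden states to attention-like maps."""

    def __init__(self, d_student: int, d_bridge: int = 64):
        super().__init__()
        self.q_proj = nn.Linear(d_student, d_bridge)
        self.k_proj = nn.Linear(d_student, d_bridge)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (batch, seq_len, d_student) hidden states of one student layer
        q, k = self.q_proj(h), self.k_proj(h)
        scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)
        return scores.softmax(dim=-1)  # (batch, seq_len, seq_len)


def cab_loss(student_hiddens, teacher_attns, bridges):
    """Token-level attention matching with a uniform layer-wise alignment.

    student_hiddens: list of (B, T, d_s) tensors, one per supervised student layer
    teacher_attns:   list of (B, T, T) head-averaged teacher attention maps
    bridges:         one AttentionBridge per supervised student layer
    """
    # Assumption: map student layers evenly onto teacher layers.
    step = len(teacher_attns) / len(student_hiddens)
    loss = 0.0
    for i, (h, bridge) in enumerate(zip(student_hiddens, bridges)):
        t_attn = teacher_attns[int(i * step)]
        s_attn = bridge(h)
        loss = loss + F.kl_div(s_attn.clamp_min(1e-9).log(), t_attn,
                               reduction="batchmean")
    return loss / len(student_hiddens)
```

In this sketch the bridge adds only two small linear layers per supervised student layer, which is one plausible reading of how token-level attention supervision can remain cheap relative to full pretraining.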
Related papers
- ReMem: Mutual Information-Aware Fine-tuning of Pretrained Vision Transformers for Effective Knowledge Distillation [55.55242848676581]
Knowledge distillation from pretrained visual representation models offers an effective approach to improve small, task-specific production models. However, the effectiveness of such knowledge transfer drops significantly when distilling from strong models that are pretrained at a large scale. Motivated by the connection between mutual information and distillation effectiveness, we propose to employ mutual information-aware optimization during finetuning.
arXiv Detail & Related papers (2025-06-29T00:25:23Z)
- Skip-Layer Attention: Bridging Abstract and Detailed Dependencies in Transformers [56.264673865476986]
This paper introduces Skip-Layer Attention (SLA) to enhance Transformer models.
SLA improves the model's ability to capture dependencies between high-level abstract features and low-level details.
Our implementation extends the Transformer's functionality by enabling queries in a given layer to interact with keys and values from both the current layer and one preceding layer.
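A minimal sketch of that idea (the module name, the use of `nn.MultiheadAttention`, and the concatenation scheme are assumptions, not the authors' code): queries from the current layer attend over keys and values gathered from both the current and the preceding layer.

```python
# Illustrative sketch of skip-layer attention: queries of layer l attend to
# keys/values built from both the layer-l and layer-(l-1) inputs.
import torch
import torch.nn as nn


class SkipLayerAttention(nn.Module):
    def __init__(self, d_model: int, n_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, x_cur: torch.Tensor, x_prev: torch.Tensor) -> torch.Tensor:
        # x_cur:  (B, T, d) input to the current layer
        # x_prev: (B, T, d) input to the preceding layer
        # Keys/values are drawn from both layers; queries from the current one.
        kv = torch.cat([x_cur, x_prev], dim=1)  # (B, 2T, d)
        out, _ = self.attn(query=x_cur, key=kv, value=kv)
        return out


if __name__ == "__main__":
    layer = SkipLayerAttention(d_model=64)
    x_prev, x_cur = torch.randn(2, 10, 64), torch.randn(2, 10, 64)
    print(layer(x_cur, x_prev).shape)  # torch.Size([2, 10, 64])
```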
arXiv Detail & Related papers (2024-06-17T07:24:38Z)
- Promoting CNNs with Cross-Architecture Knowledge Distillation for Efficient Monocular Depth Estimation [4.242540533823568]
Transformer models are usually computationally expensive, and their effectiveness in lightweight models is limited compared to convolutions.
We propose a cross-architecture knowledge distillation method for MDE, dubbed DisDepth, to enhance efficient CNN models with the supervision of state-of-the-art transformer models.
Our method achieves significant improvements on various efficient backbones, showcasing its potential for efficient monocular depth estimation.
arXiv Detail & Related papers (2024-04-25T07:55:47Z)
- Cross-Architecture Transfer Learning for Linear-Cost Inference Transformers [1.1499643186017316]
We propose Cross-Architecture Transfer Learning (XATL) to improve the efficiency of Transformer language models.
XATL reduces training time by up to 2.5x and converges to a better minimum, yielding up to a 2.6% stronger model on LM benchmarks within the same compute budget.
arXiv Detail & Related papers (2024-04-03T12:27:36Z)
- LIB-KD: Teaching Inductive Bias for Efficient Vision Transformer Distillation and Compression [4.0120180943504655]
Vision Transformers (ViTs) offer the tantalising prospect of unified information processing across visual and textual domains. We introduce an innovative ensemble-based distillation approach that distils inductive bias from complementary lightweight teacher models to make their applications practical.
arXiv Detail & Related papers (2023-09-30T13:21:29Z)
- Cross-Architecture Knowledge Distillation [32.689574589575244]
It is natural to distill complementary knowledge from a Transformer to a convolutional neural network (CNN). However, the structural gap between the two architectures makes such cross-architecture transfer difficult.
To deal with this problem, a novel cross-architecture knowledge distillation method is proposed.
The proposed method outperforms 14 state-of-the-art methods on both small-scale and large-scale datasets.
arXiv Detail & Related papers (2022-07-12T02:50:48Z)
- TransKD: Transformer Knowledge Distillation for Efficient Semantic Segmentation [49.794142076551026]
The Transformer-based Knowledge Distillation (TransKD) framework learns compact student transformers by distilling both feature maps and patch embeddings of large teacher transformers.
Experiments on Cityscapes, ACDC, NYUv2, and Pascal VOC2012 datasets show that TransKD outperforms state-of-the-art distillation frameworks.
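As a rough illustration only (the projection layer and plain MSE losses are assumptions for this sketch, not TransKD's actual modules), such supervision can be written as a combined loss over intermediate feature maps and patch embeddings:

```python
# Hypothetical sketch of joint feature-map and patch-embedding distillation.
import torch
import torch.nn as nn
import torch.nn.functional as F


def feature_and_patch_kd(student_feats, teacher_feats,
                         student_patches, teacher_patches, proj):
    """student_feats / teacher_feats: lists of (B, T, d_s) / (B, T, d_t) feature maps
    student_patches / teacher_patches: (B, T, d_s) / (B, T, d_t) patch embeddings
    proj: nn.Linear(d_s, d_t) lifting student features to the teacher width
    """
    feat_loss = sum(F.mse_loss(proj(s), t)
                    for s, t in zip(student_feats, teacher_feats))
    patch_loss = F.mse_loss(proj(student_patches), teacher_patches)
    return feat_loss / len(student_feats) + patch_loss
```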
arXiv Detail & Related papers (2022-02-27T16:34:10Z)
- Efficient Transformers in Reinforcement Learning using Actor-Learner Distillation [91.05073136215886]
"Actor-Learner Distillation" transfers learning progress from a large capacity learner model to a small capacity actor model.
We demonstrate in several challenging memory environments that using Actor-Learner Distillation recovers the clear sample-efficiency gains of the transformer learner model.
arXiv Detail & Related papers (2021-04-04T17:56:34Z)
- Transfer Heterogeneous Knowledge Among Peer-to-Peer Teammates: A Model Distillation Approach [55.83558520598304]
We propose a brand new solution to reuse experiences and transfer value functions among multiple students via model distillation.
We also describe how to design an efficient communication protocol to exploit heterogeneous knowledge.
Our proposed framework, namely Learning and Teaching Categorical Reinforcement, shows promising performance in stabilizing and accelerating learning progress.
arXiv Detail & Related papers (2020-02-06T11:31:04Z)