Revisiting Cross-Architecture Distillation: Adaptive Dual-Teacher Transfer for Lightweight Video Models
- URL: http://arxiv.org/abs/2511.09469v1
- Date: Thu, 13 Nov 2025 01:56:41 GMT
- Title: Revisiting Cross-Architecture Distillation: Adaptive Dual-Teacher Transfer for Lightweight Video Models
- Authors: Ying Peng, Hongsen Ye, Changxin Huang, Xiping Hu, Jian Chen, Runhao Zeng
- Abstract summary: Cross-Architecture Knowledge Distillation (CAKD) addresses this by transferring knowledge from ViTs to CNNs. We propose a Dual-Teacher Knowledge Distillation framework that leverages both a heterogeneous ViT teacher and a homogeneous CNN teacher. Our method consistently outperforms state-of-the-art distillation approaches.
- Score: 23.54434747665913
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision Transformers (ViTs) have achieved strong performance in video action recognition, but their high computational cost limits their practicality. Lightweight CNNs are more efficient but suffer from accuracy gaps. Cross-Architecture Knowledge Distillation (CAKD) addresses this by transferring knowledge from ViTs to CNNs, yet existing methods often struggle with architectural mismatch and overlook the value of stronger homogeneous CNN teachers. To tackle these challenges, we propose a Dual-Teacher Knowledge Distillation framework that leverages both a heterogeneous ViT teacher and a homogeneous CNN teacher to collaboratively guide a lightweight CNN student. We introduce two key components: (1) Discrepancy-Aware Teacher Weighting, which dynamically fuses the predictions from ViT and CNN teachers by assigning adaptive weights based on teacher confidence and prediction discrepancy with the student, enabling more informative and effective supervision; and (2) a Structure Discrepancy-Aware Distillation strategy, where the student learns the residual features between ViT and CNN teachers via a lightweight auxiliary branch, focusing on transferable architectural differences without mimicking all of ViT's high-dimensional patterns. Extensive experiments on benchmarks including HMDB51, EPIC-KITCHENS-100, and Kinetics-400 demonstrate that our method consistently outperforms state-of-the-art distillation approaches, achieving notable performance improvements with a maximum accuracy gain of 5.95% on HMDB51.
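As a rough illustration of the two components described in the abstract, the PyTorch-style sketch below shows one plausible reading of (1) Discrepancy-Aware Teacher Weighting, which fuses ViT and CNN teacher predictions with per-sample weights derived from teacher confidence and teacher-student discrepancy, and (2) the residual-feature objective behind Structure Discrepancy-Aware Distillation. The temperature `tau`, the exact weighting formula, and the assumption that teacher features are already projected to a shared dimension are illustrative choices, not details taken from the paper.

```python
# Minimal sketch (not the authors' code) of the two distillation components,
# assuming logits/features from a ViT teacher, a CNN teacher, and a lightweight
# CNN student. The weighting and loss formulas are assumptions for illustration.
import torch
import torch.nn.functional as F


def discrepancy_aware_distillation(vit_logits, cnn_logits, student_logits, tau=4.0):
    """Fuse the two teachers with adaptive weights from teacher confidence and
    prediction discrepancy with the student (illustrative formulation)."""
    p_vit = F.softmax(vit_logits / tau, dim=-1)
    p_cnn = F.softmax(cnn_logits / tau, dim=-1)
    log_p_stu = F.log_softmax(student_logits / tau, dim=-1)

    # Teacher confidence: maximum class probability per sample.
    conf_vit = p_vit.max(dim=-1).values
    conf_cnn = p_cnn.max(dim=-1).values

    # Discrepancy with the student: per-sample KL divergence (higher = more to learn).
    disc_vit = F.kl_div(log_p_stu, p_vit, reduction="none").sum(dim=-1)
    disc_cnn = F.kl_div(log_p_stu, p_cnn, reduction="none").sum(dim=-1)

    # Adaptive weights: favor confident teachers whose predictions the student has
    # not yet matched (assumed combination; the paper may weight differently).
    scores = torch.stack([conf_vit * disc_vit, conf_cnn * disc_cnn], dim=-1)
    w = F.softmax(scores, dim=-1)  # shape [batch, 2], sums to 1 per sample

    fused = w[..., :1] * p_vit + w[..., 1:] * p_cnn
    return F.kl_div(log_p_stu, fused, reduction="batchmean") * tau * tau


def residual_feature_loss(vit_feat, cnn_feat, aux_branch_out):
    """Structure Discrepancy-Aware Distillation: the student's lightweight auxiliary
    branch regresses the residual between ViT and CNN teacher features (sketch only;
    assumes all features share a common dimension)."""
    residual = vit_feat - cnn_feat
    return F.mse_loss(aux_branch_out, residual)
```

In this reading, the student's standard task loss (e.g., cross-entropy on action labels) would be added to these two terms; the paper's actual loss balancing and auxiliary-branch design may differ.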
Related papers
- ACAM-KD: Adaptive and Cooperative Attention Masking for Knowledge Distillation [2.7624021966289605]
ACAM-KD adapts to the student's evolving needs throughout the entire distillation process. It improves object detection performance by up to 1.4 mAP over the state-of-the-art. For semantic segmentation on Cityscapes, it boosts mIoU by 3.09 over the baseline.
arXiv Detail & Related papers (2025-03-08T18:51:53Z) - HDKD: Hybrid Data-Efficient Knowledge Distillation Network for Medical Image Classification [0.0]
Vision Transformers (ViTs) have achieved significant advancement in computer vision tasks due to their powerful modeling capacity. Previous approaches to Knowledge Distillation (KD) have pursued two primary paths: some focused solely on distilling the logit distribution from CNN teacher to ViT student. This paper presents the Hybrid Data-efficient Knowledge Distillation (HDKD) paradigm, which employs a CNN teacher and a hybrid student.
arXiv Detail & Related papers (2024-07-10T10:09:12Z) - Object-Centric Diffusion for Efficient Video Editing [64.71639719352636]
Diffusion-based video editing has reached impressive quality.
Such solutions typically incur heavy memory and computational costs to generate temporally-coherent frames.
We propose modifications that allow significant speed-ups whilst maintaining quality.
arXiv Detail & Related papers (2024-01-11T08:36:15Z) - One-for-All: Bridge the Gap Between Heterogeneous Architectures in Knowledge Distillation [69.65734716679925]
Knowledge distillation has proven to be a highly effective approach for enhancing model performance through a teacher-student training scheme.
Most existing distillation methods are designed under the assumption that the teacher and student models belong to the same model family.
We propose a simple yet effective one-for-all KD framework called OFA-KD, which significantly improves the distillation performance between heterogeneous architectures.
arXiv Detail & Related papers (2023-10-30T11:13:02Z) - Distilling Efficient Vision Transformers from CNNs for Semantic Segmentation [12.177329445930276]
We propose a novel CNN-to-ViT KD framework, dubbed C2VKD.
We first propose a novel visual-linguistic feature distillation (VLFD) module that explores efficient KD among the aligned visual and linguistic-compatible representations.
We then propose a pixel-wise decoupled distillation (PDD) module to supervise the student under the combination of labels and teacher's predictions from the decoupled target and non-target classes.
arXiv Detail & Related papers (2023-10-11T07:45:37Z) - Learning Lightweight Object Detectors via Multi-Teacher Progressive Distillation [56.053397775016755]
We propose a sequential approach to knowledge distillation that progressively transfers the knowledge of a set of teacher detectors to a given lightweight student.
To the best of our knowledge, we are the first to successfully distill knowledge from Transformer-based teacher detectors to convolution-based students.
arXiv Detail & Related papers (2023-08-17T17:17:08Z) - A Good Student is Cooperative and Reliable: CNN-Transformer Collaborative Learning for Semantic Segmentation [8.110815355364947]
We propose an online knowledge distillation (KD) framework that can simultaneously learn CNN-based and ViT-based models.
Our proposed framework outperforms the state-of-the-art online distillation methods by a large margin.
arXiv Detail & Related papers (2023-07-24T07:46:06Z) - Cross Architecture Distillation for Face Recognition [49.55061794917994]
We develop an Adaptable Prompting Teacher network (APT) that integrates prompts into the teacher, enabling it to manage distillation-specific knowledge.
Experiments on popular face benchmarks and two large-scale verification sets demonstrate the superiority of our method.
arXiv Detail & Related papers (2023-06-26T12:54:28Z) - Cross-Architecture Knowledge Distillation [32.689574589575244]
It is natural to distill complementary knowledge from a Transformer to a convolutional neural network (CNN).
To deal with this problem, a novel cross-architecture knowledge distillation method is proposed.
The proposed method outperforms 14 state-of-the-art methods on both small-scale and large-scale datasets.
arXiv Detail & Related papers (2022-07-12T02:50:48Z) - Dynamic Contrastive Distillation for Image-Text Retrieval [90.05345397400144]
We present a novel plug-in dynamic contrastive distillation (DCD) framework to compress image-text retrieval models.
We successfully apply our proposed DCD strategy to two state-of-the-art vision-language pretrained models, i.e. ViLT and METER.
Experiments on MS-COCO and Flickr30K benchmarks show the effectiveness and efficiency of our DCD framework.
arXiv Detail & Related papers (2022-07-04T14:08:59Z) - On the benefits of knowledge distillation for adversarial robustness [53.41196727255314]
We show that knowledge distillation can be used directly to boost the performance of state-of-the-art models in adversarial robustness.
We present Adversarial Knowledge Distillation (AKD), a new framework to improve a model's robust performance.
arXiv Detail & Related papers (2022-03-14T15:02:13Z) - Differentiable Feature Aggregation Search for Knowledge Distillation [47.94874193183427]
We introduce the feature aggregation to imitate the multi-teacher distillation in the single-teacher distillation framework.
DFA is a two-stage Differentiable Feature Aggregation search method motivated by DARTS in neural architecture search.
Experimental results show that DFA outperforms existing methods on CIFAR-100 and CINIC-10 datasets.
arXiv Detail & Related papers (2020-08-02T15:42:29Z)