Cross-Architecture Knowledge Distillation
- URL: http://arxiv.org/abs/2207.05273v1
- Date: Tue, 12 Jul 2022 02:50:48 GMT
- Title: Cross-Architecture Knowledge Distillation
- Authors: Yufan Liu, Jiajiong Cao, Bing Li, Weiming Hu, Jingting Ding, Liang Li
- Abstract summary: It is natural to distill complementary knowledge from a Transformer to a convolutional neural network (CNN).
To deal with this problem, a novel cross-architecture knowledge distillation method is proposed.
The proposed method outperforms 14 state-of-the-art methods on both small-scale and large-scale datasets.
- Score: 32.689574589575244
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Transformers attract much attention because of their ability to learn global
relations and their superior performance. To achieve higher performance, it is
natural to distill complementary knowledge from a Transformer to a convolutional
neural network (CNN). However, most existing knowledge distillation methods
only consider homologous-architecture distillation, such as distilling
knowledge from one CNN to another, and may not be suitable for
cross-architecture scenarios such as Transformer to CNN. To deal with this
problem, a novel cross-architecture knowledge distillation method is proposed.
Specifically, instead of directly mimicking the teacher's output or
intermediate features, a partially cross attention projector and a group-wise
linear projector are introduced to align the student features with the
teacher's in two projected feature spaces. A multi-view robust training scheme
is further presented to improve the robustness and stability of the framework.
Extensive experiments show that the proposed method outperforms 14
state-of-the-art methods on both small-scale and large-scale datasets.
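The abstract names two alignment modules but gives no implementation details here. Below is a minimal PyTorch sketch, assuming a CNN student with (B, C, H, W) feature maps and a transformer teacher exposing (B, N, D) tokens; the module names, pooling choices, and unweighted loss sum are illustrative assumptions, not the authors' released code.

```python
# Hedged sketch of cross-architecture feature alignment (illustrative only).
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossAttentionProjector(nn.Module):
    """Student CNN features attend to the transformer teacher's tokens,
    producing features aligned in the teacher's attention space."""
    def __init__(self, student_channels: int, teacher_dim: int, num_heads: int = 4):
        super().__init__()
        self.to_tokens = nn.Conv2d(student_channels, teacher_dim, kernel_size=1)
        self.attn = nn.MultiheadAttention(teacher_dim, num_heads, batch_first=True)

    def forward(self, student_feat, teacher_tokens):
        # student_feat: (B, C, H, W) -> student tokens: (B, H*W, D)
        tokens = self.to_tokens(student_feat).flatten(2).transpose(1, 2)
        aligned, _ = self.attn(query=tokens, key=teacher_tokens, value=teacher_tokens)
        return aligned                                     # (B, H*W, D)

class GroupwiseLinearProjector(nn.Module):
    """Pooled student features are mapped to the teacher's embedding space
    by independent linear layers, one per channel group."""
    def __init__(self, student_channels: int, teacher_dim: int, groups: int = 4):
        super().__init__()
        assert student_channels % groups == 0 and teacher_dim % groups == 0
        self.groups = groups
        self.proj = nn.ModuleList(
            nn.Linear(student_channels // groups, teacher_dim // groups)
            for _ in range(groups)
        )

    def forward(self, student_feat):
        pooled = student_feat.mean(dim=(2, 3))             # (B, C) global average pool
        chunks = pooled.chunk(self.groups, dim=1)
        return torch.cat([p(c) for p, c in zip(self.proj, chunks)], dim=1)

def alignment_loss(student_feat, teacher_tokens, teacher_embed, pca, glp):
    """Align the student with the teacher in both projected feature spaces."""
    attn_out = pca(student_feat, teacher_tokens)
    # Pool over the token dimension so the two token sets need not have the
    # same length (a simplification of the feature-matching loss).
    loss_attn = F.mse_loss(attn_out.mean(dim=1), teacher_tokens.mean(dim=1))
    loss_group = F.mse_loss(glp(student_feat), teacher_embed)
    return loss_attn + loss_group
```

In practice the projected-space losses would be added to the usual task loss and, per the abstract, wrapped in a multi-view robust training scheme.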
Related papers
- CNN-Transformer Rectified Collaborative Learning for Medical Image Segmentation [60.08541107831459]
This paper proposes a CNN-Transformer rectified collaborative learning framework to learn stronger CNN-based and Transformer-based models for medical image segmentation.
Specifically, we propose a rectified logit-wise collaborative learning (RLCL) strategy which introduces the ground truth to adaptively select and rectify the wrong regions in student soft labels.
We also propose a class-aware feature-wise collaborative learning (CFCL) strategy to achieve effective knowledge transfer between CNN-based and Transformer-based models in the feature space.
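A minimal sketch of the logit-wise rectification idea for segmentation, assuming per-pixel class logits: pixels where the peer's prediction disagrees with the ground truth have their soft labels replaced before distillation. The masking rule, temperature, and names below are assumptions, not the authors' RLCL code.

```python
# Illustrative rectified logit-wise distillation for segmentation (assumed form).
import torch
import torch.nn.functional as F

def rectified_logit_distillation(student_logits, peer_logits, target, temperature=4.0):
    """student_logits, peer_logits: (B, C, H, W); target: (B, H, W) class ids.
    Ignore-index handling is omitted for brevity."""
    peer_prob = F.softmax(peer_logits / temperature, dim=1)
    # Regions where the peer's argmax disagrees with the ground truth.
    wrong = peer_prob.argmax(dim=1) != target                       # (B, H, W)
    # Replace the wrong soft labels with one-hot ground truth.
    one_hot = F.one_hot(target, num_classes=peer_logits.size(1))
    one_hot = one_hot.permute(0, 3, 1, 2).float()
    rectified = torch.where(wrong.unsqueeze(1), one_hot, peer_prob)
    log_student = F.log_softmax(student_logits / temperature, dim=1)
    return F.kl_div(log_student, rectified, reduction="batchmean") * temperature ** 2
```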
arXiv Detail & Related papers (2024-08-25T01:27:35Z)
- Visual Prompt Tuning in Null Space for Continual Learning [51.96411454304625]
Existing prompt-tuning methods have demonstrated impressive performance in continual learning (CL).
This paper aims to learn each task by tuning the prompts in the direction orthogonal to the subspace spanned by previous tasks' features.
In practice, an effective null-space-based approximation solution has been proposed to implement the prompt gradient projection.
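The null-space idea above can be sketched directly: estimate the principal subspace of previous tasks' features and project the prompt gradient onto its orthogonal complement before each update. The eigendecomposition, energy threshold, and function names are assumptions, not the paper's exact procedure.

```python
# Hedged sketch of null-space gradient projection for prompt tuning.
import torch

def nullspace_projection_matrix(prev_features, energy=0.99):
    """prev_features: (N, D) features from previous tasks.
    Returns a (D, D) projector onto the approximate null space."""
    cov = prev_features.T @ prev_features / prev_features.size(0)
    eigvals, eigvecs = torch.linalg.eigh(cov)              # ascending eigenvalues
    ratio = torch.cumsum(eigvals.flip(0), dim=0) / eigvals.sum()
    k = int((ratio < energy).sum()) + 1                    # top-k principal directions
    principal = eigvecs.flip(1)[:, :k]                     # (D, k)
    eye = torch.eye(cov.size(0), device=cov.device, dtype=cov.dtype)
    return eye - principal @ principal.T

def project_prompt_grad(prompt, projector):
    """Apply after loss.backward(): grad <- grad @ P, so the update is
    (approximately) orthogonal to previous tasks' feature subspace."""
    if prompt.grad is not None:
        prompt.grad.copy_(prompt.grad @ projector)
```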
arXiv Detail & Related papers (2024-06-09T05:57:40Z)
- Promoting CNNs with Cross-Architecture Knowledge Distillation for Efficient Monocular Depth Estimation [4.242540533823568]
Transformer models are usually computationally expensive, and their effectiveness in lightweight models is limited compared to convolutions.
We propose a cross-architecture knowledge distillation method for MDE, dubbed DisDepth, to enhance efficient CNN models with the supervision of state-of-the-art transformer models.
Our method achieves significant improvements on various efficient backbones, showcasing its potential for efficient monocular depth estimation.
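As a toy illustration (not DisDepth itself), a CNN depth student can be trained jointly against ground-truth depth and a frozen transformer teacher's predictions; the loss form and weighting below are assumptions.

```python
# Illustrative cross-architecture distillation loss for monocular depth estimation.
import torch
import torch.nn.functional as F

def depth_distillation_loss(student_depth, teacher_depth, gt_depth, alpha=0.5):
    """All tensors are (B, 1, H, W); gt_depth uses zeros for invalid pixels."""
    valid = gt_depth > 0
    sup = F.l1_loss(student_depth[valid], gt_depth[valid])         # supervised term
    kd = F.l1_loss(student_depth, teacher_depth.detach())          # teacher term
    return sup + alpha * kd
```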
arXiv Detail & Related papers (2024-04-25T07:55:47Z)
- Cross-Architecture Transfer Learning for Linear-Cost Inference Transformers [1.1499643186017316]
We propose Cross-Architecture Transfer Learning (XATL) to improve the efficiency of Transformer language models.
XATL significantly reduces training time by up to 2.5x and converges to a better minimum, yielding a model up to 2.6% stronger on LM benchmarks within the same compute budget.
arXiv Detail & Related papers (2024-04-03T12:27:36Z)
- Self-Supervised Pre-Training for Table Structure Recognition Transformer [25.04573593082671]
We propose a self-supervised pre-training (SSP) method for table structure recognition (TSR) transformers.
We discover that the performance gap between the linear projection transformer and the hybrid CNN-transformer can be mitigated by SSP of the visual encoder in the TSR model.
arXiv Detail & Related papers (2024-02-23T19:34:06Z) - Distilling Inductive Bias: Knowledge Distillation Beyond Model
Compression [6.508088032296086]
Vision Transformers (ViTs) offer the tantalizing prospect of unified information processing across visual and textual domains.
We introduce an ensemble-based distillation approach that distills inductive bias from complementary lightweight teacher models.
Our framework also precomputes and stores the teachers' logits, i.e., the unnormalized predictions of the models, in advance.
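The precompute-and-store step above lends itself to a short sketch: run the teachers once over a fixed, non-shuffled loader, cache the averaged logits, and reuse them during student training. File names, the ensemble average, and the KD loss are assumptions.

```python
# Illustrative precompute-and-cache distillation (assumed, not the paper's code).
import torch
import torch.nn.functional as F

@torch.no_grad()
def cache_teacher_logits(teachers, loader, path="teacher_logits.pt"):
    """Requires a deterministic, non-shuffled loader so indices line up later."""
    cached = []
    for images, _ in loader:
        # Average the ensemble's logits (one simple way to combine teachers).
        logits = torch.stack([t(images) for t in teachers]).mean(dim=0)
        cached.append(logits.cpu())
    torch.save(torch.cat(cached), path)

def distill_from_cache(student_logits, cached_logits, temperature=2.0):
    log_p = F.log_softmax(student_logits / temperature, dim=1)
    q = F.softmax(cached_logits / temperature, dim=1)
    return F.kl_div(log_p, q, reduction="batchmean") * temperature ** 2
```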
arXiv Detail & Related papers (2023-09-30T13:21:29Z) - Learning Lightweight Object Detectors via Multi-Teacher Progressive
Distillation [56.053397775016755]
We propose a sequential approach to knowledge distillation that progressively transfers the knowledge of a set of teacher detectors to a given lightweight student.
To the best of our knowledge, we are the first to successfully distill knowledge from Transformer-based teacher detectors to convolution-based students.
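A minimal sketch of sequential, teacher-by-teacher distillation; it is written for classification logits for brevity (the paper targets detection heads), and the schedule, ordering, and losses are assumptions.

```python
# Illustrative sequential multi-teacher distillation loop (assumed form).
import torch
import torch.nn.functional as F

def progressive_distillation(student, teachers, loader, optimizer,
                             epochs_per_teacher=1, temperature=2.0):
    for teacher in teachers:                      # e.g. weakest teacher first
        teacher.eval()
        for _ in range(epochs_per_teacher):
            for images, labels in loader:
                with torch.no_grad():
                    t_logits = teacher(images)
                s_logits = student(images)
                kd = F.kl_div(F.log_softmax(s_logits / temperature, dim=1),
                              F.softmax(t_logits / temperature, dim=1),
                              reduction="batchmean") * temperature ** 2
                loss = kd + F.cross_entropy(s_logits, labels)
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
    return student
```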
arXiv Detail & Related papers (2023-08-17T17:17:08Z) - A Good Student is Cooperative and Reliable: CNN-Transformer
Collaborative Learning for Semantic Segmentation [8.110815355364947]
We propose an online knowledge distillation (KD) framework that can simultaneously learn CNN-based and ViT-based models.
Our proposed framework outperforms the state-of-the-art online distillation methods by a large margin.
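One simple reading of training CNN-based and ViT-based models simultaneously is mutual distillation with a symmetric KL coupling on detached peer targets; the sketch below uses that stand-in and is not the paper's actual loss.

```python
# Illustrative online mutual distillation step between a CNN and a ViT.
import torch
import torch.nn.functional as F

def mutual_kd_step(cnn, vit, images, labels, opt_cnn, opt_vit, temperature=2.0):
    cnn_logits, vit_logits = cnn(images), vit(images)

    def kd(p_logits, q_logits):
        # The peer's logits are detached so each model only trains itself.
        return F.kl_div(F.log_softmax(p_logits / temperature, dim=1),
                        F.softmax(q_logits.detach() / temperature, dim=1),
                        reduction="batchmean") * temperature ** 2

    loss_cnn = F.cross_entropy(cnn_logits, labels) + kd(cnn_logits, vit_logits)
    loss_vit = F.cross_entropy(vit_logits, labels) + kd(vit_logits, cnn_logits)

    opt_cnn.zero_grad(); loss_cnn.backward(); opt_cnn.step()
    opt_vit.zero_grad(); loss_vit.backward(); opt_vit.step()
```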
arXiv Detail & Related papers (2023-07-24T07:46:06Z) - Cross Architecture Distillation for Face Recognition [49.55061794917994]
We develop an Adaptable Prompting Teacher network (APT) that integrates prompts into the teacher, enabling it to manage distillation-specific knowledge.
Experiments on popular face benchmarks and two large-scale verification sets demonstrate the superiority of our method.
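A minimal, assumed sketch of what integrating prompts into a ViT-style teacher could look like: learnable prompt tokens are prepended to the frozen teacher's token sequence so only the prompts adapt to distillation. The module names and placement are guesses, not the APT architecture.

```python
# Illustrative prompted teacher (assumed), with a frozen transformer backbone.
import torch
import torch.nn as nn

class PromptedTeacher(nn.Module):
    def __init__(self, teacher_blocks, embed_dim, num_prompts=8):
        super().__init__()
        self.blocks = teacher_blocks              # e.g. nn.Sequential of frozen blocks
        for p in self.blocks.parameters():
            p.requires_grad_(False)
        self.prompts = nn.Parameter(torch.zeros(1, num_prompts, embed_dim))
        nn.init.trunc_normal_(self.prompts, std=0.02)

    def forward(self, tokens):
        # tokens: (B, N, D) patch embeddings from the teacher's patch embedding.
        prompts = self.prompts.expand(tokens.size(0), -1, -1)
        x = torch.cat([prompts, tokens], dim=1)
        x = self.blocks(x)
        return x[:, self.prompts.size(1):]        # drop the prompt positions
```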
arXiv Detail & Related papers (2023-06-26T12:54:28Z)
- Rich CNN-Transformer Feature Aggregation Networks for Super-Resolution [50.10987776141901]
Recent vision transformers along with self-attention have achieved promising results on various computer vision tasks.
We introduce an effective hybrid architecture for super-resolution (SR) tasks, which leverages local features from CNNs and long-range dependencies captured by transformers.
Our proposed method achieves state-of-the-art SR results on numerous benchmark datasets.
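An illustrative hybrid block, assumed rather than taken from the paper: a convolutional branch supplies local features, a self-attention branch supplies long-range context, and a 1x1 convolution fuses the two with a residual connection.

```python
# Illustrative CNN + self-attention hybrid block for super-resolution features.
import torch
import torch.nn as nn

class HybridSRBlock(nn.Module):
    def __init__(self, channels, num_heads=4):
        super().__init__()
        # channels must be divisible by num_heads for multi-head attention.
        self.local = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
        )
        self.norm = nn.LayerNorm(channels)
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.fuse = nn.Conv2d(2 * channels, channels, 1)

    def forward(self, x):
        b, c, h, w = x.shape
        local = self.local(x)                                   # local CNN features
        tokens = self.norm(x.flatten(2).transpose(1, 2))        # (B, H*W, C)
        global_, _ = self.attn(tokens, tokens, tokens)          # long-range context
        global_ = global_.transpose(1, 2).reshape(b, c, h, w)
        return x + self.fuse(torch.cat([local, global_], dim=1))
```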
arXiv Detail & Related papers (2022-03-15T06:52:25Z)
- Knowledge Distillation Meets Self-Supervision [109.6400639148393]
Knowledge distillation involves extracting "dark knowledge" from a teacher network to guide the learning of a student network.
We show that the seemingly different self-supervision task can serve as a simple yet powerful solution.
By exploiting the similarity between those self-supervision signals as an auxiliary task, one can effectively transfer the hidden information from the teacher to the student.
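One simple way to read "exploiting the similarity between self-supervision signals" is to have the student match the teacher's pairwise similarity structure over an augmented batch; the loss below implements that reading, with temperature and normalization as assumptions.

```python
# Illustrative similarity-structure transfer loss (assumed form).
import torch
import torch.nn.functional as F

def similarity_transfer_loss(student_feat, teacher_feat, temperature=0.5):
    """student_feat, teacher_feat: (B, D) embeddings of the same augmented batch."""
    s = F.normalize(student_feat, dim=1)
    t = F.normalize(teacher_feat, dim=1)
    sim_s = s @ s.T / temperature                 # (B, B) student similarities
    sim_t = t @ t.T / temperature                 # (B, B) teacher similarities
    # Match row-wise similarity distributions; excluding the trivial diagonal
    # is omitted here for brevity.
    return F.kl_div(F.log_softmax(sim_s, dim=1),
                    F.softmax(sim_t.detach(), dim=1),
                    reduction="batchmean")
```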
arXiv Detail & Related papers (2020-06-12T12:18:52Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences of its use.