A Simple and Generic Framework for Feature Distillation via Channel-wise
Transformation
- URL: http://arxiv.org/abs/2303.13212v2
- Date: Fri, 24 Mar 2023 02:40:47 GMT
- Title: A Simple and Generic Framework for Feature Distillation via Channel-wise
Transformation
- Authors: Ziwei Liu, Yongtao Wang, Xiaojie Chu
- Abstract summary: We propose a learnable nonlinear channel-wise transformation to align the features of the student and the teacher model.
Our method achieves significant performance improvements in various computer vision tasks.
- Score: 35.233203757760066
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Knowledge distillation is a popular technique for transferring the knowledge of a large teacher model to a smaller student model by having the student mimic the teacher. However, distillation by directly aligning the feature maps of the teacher and the student may enforce overly strict constraints on the student and thus degrade its performance. To alleviate this feature misalignment issue, existing works mainly focus on spatially aligning the feature maps of the teacher and the student via pixel-wise transformations. In this paper, we find that aligning the feature maps of the teacher and the student along the channel dimension is also effective for addressing the feature misalignment issue. Specifically, we propose a learnable nonlinear channel-wise transformation to align the features of the student and the teacher model.
Based on this transformation, we further propose a simple and generic framework for feature distillation, with only one hyper-parameter to balance the distillation loss and the task-specific loss. Extensive experimental results show that our method achieves significant performance improvements on various computer vision tasks, including image classification (+3.28% top-1 accuracy for MobileNetV1 on ImageNet-1K), object detection (+3.9% bbox mAP for ResNet50-based Faster-RCNN on MS COCO), instance segmentation (+2.8% Mask mAP for ResNet50-based Mask-RCNN), and semantic segmentation (+4.66% mIoU for ResNet18-based PSPNet on Cityscapes), which demonstrates the effectiveness and versatility of the proposed method. The code will be made publicly available.
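As a concrete illustration, here is a minimal PyTorch-style sketch of the framework described in the abstract. The abstract only specifies that the channel-wise transformation is learnable and nonlinear and that a single hyper-parameter balances the distillation and task losses, so the MLP design, the hidden width, the L2 distillation loss, and all names below are assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChannelWiseTransform(nn.Module):
    """Hypothetical learnable nonlinear transformation acting along the channel
    dimension of the student feature map (illustrative architecture only)."""
    def __init__(self, student_channels, teacher_channels, hidden=256):
        super().__init__()
        # The MLP mixes the channel vector at every spatial location.
        self.mlp = nn.Sequential(
            nn.Linear(student_channels, hidden),
            nn.ReLU(inplace=True),
            nn.Linear(hidden, teacher_channels),
        )

    def forward(self, feat_s):
        x = feat_s.permute(0, 2, 3, 1)      # (B, C_s, H, W) -> (B, H, W, C_s)
        x = self.mlp(x)                     # transform the channel vectors
        return x.permute(0, 3, 1, 2)        # (B, C_t, H, W)

def distillation_loss(feat_s, feat_t, transform):
    """L2 distance between the transformed student feature and the (frozen) teacher feature."""
    return F.mse_loss(transform(feat_s), feat_t.detach())

def total_loss(task_loss, feat_s, feat_t, transform, alpha=1.0):
    """Task loss plus alpha times the distillation loss, with a single balancing hyper-parameter."""
    return task_loss + alpha * distillation_loss(feat_s, feat_t, transform)
```

In this sketch the transformation is trained jointly with the student, and `alpha` plays the role of the single hyper-parameter mentioned in the abstract.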
Related papers
- ScaleKD: Strong Vision Transformers Could Be Excellent Teachers [15.446480934024652]
We present a simple and effective knowledge distillation method, called ScaleKD.
Our method can train student backbones spanning a variety of convolutional neural network (CNN), multi-layer perceptron (MLP), and ViT architectures on image classification datasets.
When scaling up the size of teacher models or their pre-training datasets, our method showcases the desired scalable properties.
arXiv Detail & Related papers (2024-11-11T08:25:21Z)
- Learning Lightweight Object Detectors via Multi-Teacher Progressive Distillation [56.053397775016755]
We propose a sequential approach to knowledge distillation that progressively transfers the knowledge of a set of teacher detectors to a given lightweight student.
To the best of our knowledge, we are the first to successfully distill knowledge from Transformer-based teacher detectors to convolution-based students.
arXiv Detail & Related papers (2023-08-17T17:17:08Z)
- Improving Knowledge Distillation via Regularizing Feature Norm and Direction [16.98806338782858]
Knowledge distillation (KD) exploits a large well-trained model (i.e., teacher) to train a small student model on the same dataset for the same task.
Treating teacher features as knowledge, prevailing methods of knowledge distillation train the student by aligning its features with the teacher's, e.g., by minimizing the KL-divergence between their logits or the L2 distance between their intermediate features.
While it is natural to believe that better alignment of student features with the teacher's better distills teacher knowledge, simply forcing this alignment does not directly contribute to the student's performance.
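For reference, the two alignment objectives mentioned in this summary are the standard logit- and feature-level distillation losses. The PyTorch-style sketch below assumes the usual temperature scaling and detached teacher targets, which are conventions rather than details taken from the paper.

```python
import torch.nn.functional as F

def logit_kd_loss(student_logits, teacher_logits, T=4.0):
    """KL divergence between temperature-softened teacher and student distributions."""
    p_teacher = F.softmax(teacher_logits / T, dim=1)
    log_p_student = F.log_softmax(student_logits / T, dim=1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * (T * T)

def feature_kd_loss(student_feat, teacher_feat):
    """L2 distance between intermediate features (shapes assumed to match, e.g. after a projection)."""
    return F.mse_loss(student_feat, teacher_feat.detach())
```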
arXiv Detail & Related papers (2023-05-26T15:05:19Z)
- NORM: Knowledge Distillation via N-to-One Representation Matching [18.973254404242507]
We present a new two-stage knowledge distillation method, which relies on a simple Feature Transform (FT) module consisting of two linear layers.
To preserve the intact information learnt by the teacher network, our FT module is simply inserted after the last convolutional layer of the student network.
By sequentially splitting the expanded student representation into N non-overlapping feature segments having the same number of feature channels as the teacher's, they can be readily forced to approximate the intact teacher representation simultaneously.
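A minimal sketch of the N-to-one matching idea described above, assuming the two linear layers are implemented as 1x1 convolutions with a hidden width equal to the student channel count (both assumptions), and that each of the N segments is matched to the teacher feature with an L2 loss.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureTransform(nn.Module):
    """Sketch of an FT-style module: two linear (1x1 conv) layers that expand the
    student feature to N times the teacher's channel count."""
    def __init__(self, c_student, c_teacher, n_segments):
        super().__init__()
        self.expand = nn.Sequential(
            nn.Conv2d(c_student, c_student, kernel_size=1),
            nn.Conv2d(c_student, n_segments * c_teacher, kernel_size=1),
        )

    def forward(self, feat_s):
        return self.expand(feat_s)          # (B, N * C_t, H, W)

def n_to_one_matching_loss(expanded_student, feat_teacher):
    """Split the expanded student representation into N non-overlapping segments and
    force each one to approximate the (detached) teacher representation."""
    segments = torch.split(expanded_student, feat_teacher.size(1), dim=1)
    return sum(F.mse_loss(seg, feat_teacher.detach()) for seg in segments) / len(segments)
```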
arXiv Detail & Related papers (2023-05-23T08:15:45Z)
- A Light-weight Deep Learning Model for Remote Sensing Image Classification [70.66164876551674]
We present a high-performance and light-weight deep learning model for Remote Sensing Image Classification (RSIC).
Extensive experiments on the NWPU-RESISC45 benchmark show that our proposed teacher-student models outperform state-of-the-art systems.
arXiv Detail & Related papers (2023-02-25T09:02:01Z)
- AMD: Adaptive Masked Distillation for Object Detection [8.668808292258706]
We propose a spatial-channel adaptive masked distillation (AMD) network for object detection.
We employ a simple and efficient module to allow the student network's channels to be adaptive.
With the help of our proposed distillation method, the student networks report 41.3%, 42.4%, and 42.7% mAP scores.
arXiv Detail & Related papers (2023-01-31T10:32:13Z)
- Masked Generative Distillation [23.52519832438352]
Masked Generative Distillation (MGD) is a general feature-based distillation method.
This paper shows that teachers can also improve students' representation power by guiding students' feature recovery.
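A minimal sketch of the masked feature-recovery idea: the student feature is randomly masked at spatial positions and a small generation block is asked to reconstruct the teacher feature. The masking ratio, the conv-ReLU-conv block, the L2 loss, and the matching channel counts are assumptions, not details given in the summary.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureGenerator(nn.Module):
    """Small generation block (assumed conv-ReLU-conv) that recovers the teacher
    feature from a masked student feature."""
    def __init__(self, channels):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
        )

    def forward(self, x):
        return self.block(x)

def masked_recovery_loss(feat_s, feat_t, generator, mask_ratio=0.5):
    """Randomly zero out spatial positions of the student feature, then reconstruct
    the teacher feature at every position."""
    b, _, h, w = feat_s.shape
    keep = (torch.rand(b, 1, h, w, device=feat_s.device) > mask_ratio).float()
    recovered = generator(feat_s * keep)
    return F.mse_loss(recovered, feat_t.detach())
```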
arXiv Detail & Related papers (2022-05-03T14:30:26Z)
- Deep Structured Instance Graph for Distilling Object Detectors [82.16270736573176]
We present a simple knowledge structure to exploit and encode information inside the detection system to facilitate detector knowledge distillation.
We achieve new state-of-the-art results on the challenging COCO object detection task with diverse student-teacher pairs on both one- and two-stage detectors.
arXiv Detail & Related papers (2021-09-27T08:26:00Z)
- DisCo: Remedy Self-supervised Learning on Lightweight Models with Distilled Contrastive Learning [94.89221799550593]
Self-supervised representation learning (SSL) has received widespread attention from the community.
Recent research argues that its performance suffers a cliff fall when the model size decreases.
We propose a simple yet effective Distilled Contrastive Learning (DisCo) method to alleviate this issue by a large margin.
arXiv Detail & Related papers (2021-04-19T08:22:52Z)
- Distilling Object Detectors with Task Adaptive Regularization [97.52935611385179]
Current state-of-the-art object detectors come at the expense of high computational costs and are hard to deploy on low-end devices.
Knowledge distillation, which aims at training a smaller student network by transferring knowledge from a larger teacher model, is one of the promising solutions for model miniaturization.
arXiv Detail & Related papers (2020-06-23T15:58:22Z)
- ResNeSt: Split-Attention Networks [86.25490825631763]
We present a modularized architecture, which applies channel-wise attention across different network branches to capture cross-feature interactions and learn diverse representations.
Our model, named ResNeSt, outperforms EfficientNet in accuracy and latency trade-off on image classification.
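A rough sketch of the split-attention idea this summary refers to: several parallel branches process the same input, and channel-wise attention weights (a softmax over branches) decide how their outputs are fused. The branch design, reduction ratio, and radix below are assumptions; this is not the full ResNeSt block.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SplitAttention(nn.Module):
    """Minimal split-attention sketch: `radix` parallel conv branches fused by
    channel-wise attention weights computed from a global descriptor."""
    def __init__(self, channels, radix=2, reduction=4):
        super().__init__()
        self.radix = radix
        self.branches = nn.ModuleList(
            [nn.Conv2d(channels, channels, 3, padding=1, bias=False) for _ in range(radix)]
        )
        inter = max(channels // reduction, 8)
        self.fc1 = nn.Conv2d(channels, inter, 1)
        self.fc2 = nn.Conv2d(inter, channels * radix, 1)

    def forward(self, x):
        feats = [branch(x) for branch in self.branches]        # radix branch outputs
        gap = sum(feats).mean(dim=(2, 3), keepdim=True)         # (B, C, 1, 1) global descriptor
        attn = self.fc2(F.relu(self.fc1(gap)))                  # (B, C * radix, 1, 1)
        attn = attn.view(x.size(0), self.radix, -1, 1, 1).softmax(dim=1)
        return sum(attn[:, i] * feats[i] for i in range(self.radix))
```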
arXiv Detail & Related papers (2020-04-19T20:40:31Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.