A Simple and Generic Framework for Feature Distillation via Channel-wise
Transformation
- URL: http://arxiv.org/abs/2303.13212v2
- Date: Fri, 24 Mar 2023 02:40:47 GMT
- Title: A Simple and Generic Framework for Feature Distillation via Channel-wise
Transformation
- Authors: Ziwei Liu, Yongtao Wang, Xiaojie Chu
- Abstract summary: We propose a learnable nonlinear channel-wise transformation to align the features of the student and the teacher model.
Our method achieves significant performance improvements in various computer vision tasks.
- Score: 35.233203757760066
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Knowledge distillation is a popular technique for transferring the knowledge of a large teacher model to a smaller student model by having the student mimic the teacher. However, distillation by directly aligning the feature maps of the teacher and the student may enforce overly strict constraints on the student and thus degrade its performance. To alleviate this feature misalignment issue, existing works mainly focus on spatially aligning the feature maps of the teacher and the student via pixel-wise transformations. In this paper, we find that aligning the feature maps of the teacher and the student along the channel dimension is also effective for addressing the feature misalignment issue. Specifically, we propose a learnable nonlinear channel-wise transformation to align the features of the student and the teacher model.
Based on this transformation, we further propose a simple and generic framework for feature distillation, with only one hyper-parameter to balance the distillation loss and the task-specific loss. Extensive experimental results show that our method achieves significant performance improvements on various computer vision tasks, including image classification (+3.28% top-1 accuracy for MobileNetV1 on ImageNet-1K), object detection (+3.9% bbox mAP for ResNet50-based Faster-RCNN on MS COCO), instance segmentation (+2.8% Mask mAP for ResNet50-based Mask-RCNN), and semantic segmentation (+4.66% mIoU for ResNet18-based PSPNet on Cityscapes), which demonstrates the effectiveness and versatility of the proposed method. The code will be made publicly available.
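As a concrete illustration, here is a minimal PyTorch-style sketch of the framework described in the abstract. The abstract only specifies that the channel-wise transformation is learnable and nonlinear and that a single hyper-parameter balances the distillation and task losses, so the MLP design, the hidden width, the L2 distillation loss, and all names below are assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChannelWiseTransform(nn.Module):
    """Hypothetical learnable nonlinear transformation acting along the channel
    dimension of the student feature map (illustrative architecture only)."""
    def __init__(self, student_channels, teacher_channels, hidden=256):
        super().__init__()
        # The MLP mixes the channel vector at every spatial location.
        self.mlp = nn.Sequential(
            nn.Linear(student_channels, hidden),
            nn.ReLU(inplace=True),
            nn.Linear(hidden, teacher_channels),
        )

    def forward(self, feat_s):
        x = feat_s.permute(0, 2, 3, 1)      # (B, C_s, H, W) -> (B, H, W, C_s)
        x = self.mlp(x)                     # transform the channel vectors
        return x.permute(0, 3, 1, 2)        # (B, C_t, H, W)

def distillation_loss(feat_s, feat_t, transform):
    """L2 distance between the transformed student feature and the (frozen) teacher feature."""
    return F.mse_loss(transform(feat_s), feat_t.detach())

def total_loss(task_loss, feat_s, feat_t, transform, alpha=1.0):
    """Task loss plus alpha times the distillation loss, with a single balancing hyper-parameter."""
    return task_loss + alpha * distillation_loss(feat_s, feat_t, transform)
```

In this sketch the transformation is trained jointly with the student, and `alpha` plays the role of the single hyper-parameter mentioned in the abstract.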
Related papers
- ScaleKD: Strong Vision Transformers Could Be Excellent Teachers [15.446480934024652]
We present a simple and effective knowledge distillation method, called ScaleKD.
Our method can train student backbones spanning a variety of convolutional neural network (CNN), multi-layer perceptron (MLP), and ViT architectures on image classification datasets.
When scaling up the size of teacher models or their pre-training datasets, our method showcases the desired scalable properties.
arXiv Detail & Related papers (2024-11-11T08:25:21Z)
- Learning Lightweight Object Detectors via Multi-Teacher Progressive Distillation [56.053397775016755]
We propose a sequential approach to knowledge distillation that progressively transfers the knowledge of a set of teacher detectors to a given lightweight student.
To the best of our knowledge, we are the first to successfully distill knowledge from Transformer-based teacher detectors to convolution-based students.
arXiv Detail & Related papers (2023-08-17T17:17:08Z)
- Improving Knowledge Distillation via Regularizing Feature Norm and Direction [16.98806338782858]
Knowledge distillation (KD) exploits a large well-trained model (i.e., teacher) to train a small student model on the same dataset for the same task.
Treating teacher features as knowledge, prevailing methods of knowledge distillation train the student by aligning its features with the teacher's, e.g., by minimizing the KL-divergence between their logits or the L2 distance between their intermediate features.
While it is natural to believe that better alignment of student features with the teacher's better distills teacher knowledge, simply forcing this alignment does not directly contribute to the student's performance.
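For reference, the two alignment objectives mentioned in this summary are the standard logit- and feature-level distillation losses. The PyTorch-style sketch below assumes the usual temperature scaling and detached teacher targets, which are conventions rather than details taken from the paper.

```python
import torch.nn.functional as F

def logit_kd_loss(student_logits, teacher_logits, T=4.0):
    """KL divergence between temperature-softened teacher and student distributions."""
    p_teacher = F.softmax(teacher_logits / T, dim=1)
    log_p_student = F.log_softmax(student_logits / T, dim=1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * (T * T)

def feature_kd_loss(student_feat, teacher_feat):
    """L2 distance between intermediate features (shapes assumed to match, e.g. after a projection)."""
    return F.mse_loss(student_feat, teacher_feat.detach())
```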
arXiv Detail & Related papers (2023-05-26T15:05:19Z)
- NORM: Knowledge Distillation via N-to-One Representation Matching [18.973254404242507]
We present a new two-stage knowledge distillation method, which relies on a simple Feature Transform (FT) module consisting of two linear layers.
To preserve the intact information learnt by the teacher network, our FT module is simply inserted after the last convolutional layer of the student network.
By sequentially splitting the expanded student representation into N non-overlapping feature segments having the same number of feature channels as the teacher's, they can be readily forced to approximate the intact teacher representation simultaneously.
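A minimal sketch of the N-to-one matching idea described above, assuming the two linear layers are implemented as 1x1 convolutions with a hidden width equal to the student channel count (both assumptions), and that each of the N segments is matched to the teacher feature with an L2 loss.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureTransform(nn.Module):
    """Sketch of an FT-style module: two linear (1x1 conv) layers that expand the
    student feature to N times the teacher's channel count."""
    def __init__(self, c_student, c_teacher, n_segments):
        super().__init__()
        self.expand = nn.Sequential(
            nn.Conv2d(c_student, c_student, kernel_size=1),
            nn.Conv2d(c_student, n_segments * c_teacher, kernel_size=1),
        )

    def forward(self, feat_s):
        return self.expand(feat_s)          # (B, N * C_t, H, W)

def n_to_one_matching_loss(expanded_student, feat_teacher):
    """Split the expanded student representation into N non-overlapping segments and
    force each one to approximate the (detached) teacher representation."""
    segments = torch.split(expanded_student, feat_teacher.size(1), dim=1)
    return sum(F.mse_loss(seg, feat_teacher.detach()) for seg in segments) / len(segments)
```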
arXiv Detail & Related papers (2023-05-23T08:15:45Z)
- A Light-weight Deep Learning Model for Remote Sensing Image Classification [70.66164876551674]
We present a high-performance and light-weight deep learning model for Remote Sensing Image Classification (RSIC).
Extensive experiments on the NWPU-RESISC45 benchmark show that our proposed teacher-student models outperform state-of-the-art systems.
arXiv Detail & Related papers (2023-02-25T09:02:01Z)
- AMD: Adaptive Masked Distillation for Object Detection [8.668808292258706]
We propose a spatial-channel adaptive masked distillation (AMD) network for object detection.
We employ a simple and efficient module to allow the student network's channels to be adaptive.
With the help of our proposed distillation method, the student networks report 41.3%, 42.4%, and 42.7% mAP scores.
arXiv Detail & Related papers (2023-01-31T10:32:13Z)
- Masked Generative Distillation [23.52519832438352]
Masked Generative Distillation (MGD) is a general feature-based distillation method.
This paper shows that teachers can also improve students' representation power by guiding students' feature recovery.
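A minimal sketch of the masked feature-recovery idea: the student feature is randomly masked at spatial positions and a small generation block is asked to reconstruct the teacher feature. The masking ratio, the conv-ReLU-conv block, the L2 loss, and the matching channel counts are assumptions, not details given in the summary.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureGenerator(nn.Module):
    """Small generation block (assumed conv-ReLU-conv) that recovers the teacher
    feature from a masked student feature."""
    def __init__(self, channels):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
        )

    def forward(self, x):
        return self.block(x)

def masked_recovery_loss(feat_s, feat_t, generator, mask_ratio=0.5):
    """Randomly zero out spatial positions of the student feature, then reconstruct
    the teacher feature at every position."""
    b, _, h, w = feat_s.shape
    keep = (torch.rand(b, 1, h, w, device=feat_s.device) > mask_ratio).float()
    recovered = generator(feat_s * keep)
    return F.mse_loss(recovered, feat_t.detach())
```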
arXiv Detail & Related papers (2022-05-03T14:30:26Z)
- Deep Structured Instance Graph for Distilling Object Detectors [82.16270736573176]
We present a simple knowledge structure to exploit and encode information inside the detection system to facilitate detector knowledge distillation.
We achieve new state-of-the-art results on the challenging COCO object detection task with diverse student-teacher pairs on both one- and two-stage detectors.
arXiv Detail & Related papers (2021-09-27T08:26:00Z)
- DisCo: Remedy Self-supervised Learning on Lightweight Models with Distilled Contrastive Learning [94.89221799550593]
Self-supervised representation learning (SSL) has received widespread attention from the community.
Recent research argues that its performance suffers a cliff fall when the model size decreases.
We propose a simple yet effective Distilled Contrastive Learning (DisCo) method to alleviate this issue by a large margin.
arXiv Detail & Related papers (2021-04-19T08:22:52Z)
- Distilling Object Detectors with Task Adaptive Regularization [97.52935611385179]
Current state-of-the-art object detectors come at the expense of high computational costs and are hard to deploy on low-end devices.
Knowledge distillation, which aims at training a smaller student network by transferring knowledge from a larger teacher model, is one of the promising solutions for model miniaturization.
arXiv Detail & Related papers (2020-06-23T15:58:22Z)
- ResNeSt: Split-Attention Networks [86.25490825631763]
We present a modularized architecture, which applies channel-wise attention across different network branches to capture cross-feature interactions and learn diverse representations.
Our model, named ResNeSt, outperforms EfficientNet in accuracy and latency trade-off on image classification.
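A rough sketch of the split-attention idea this summary refers to: several parallel branches process the same input, and channel-wise attention weights (a softmax over branches) decide how their outputs are fused. The branch design, reduction ratio, and radix below are assumptions; this is not the full ResNeSt block.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SplitAttention(nn.Module):
    """Minimal split-attention sketch: `radix` parallel conv branches fused by
    channel-wise attention weights computed from a global descriptor."""
    def __init__(self, channels, radix=2, reduction=4):
        super().__init__()
        self.radix = radix
        self.branches = nn.ModuleList(
            [nn.Conv2d(channels, channels, 3, padding=1, bias=False) for _ in range(radix)]
        )
        inter = max(channels // reduction, 8)
        self.fc1 = nn.Conv2d(channels, inter, 1)
        self.fc2 = nn.Conv2d(inter, channels * radix, 1)

    def forward(self, x):
        feats = [branch(x) for branch in self.branches]        # radix branch outputs
        gap = sum(feats).mean(dim=(2, 3), keepdim=True)         # (B, C, 1, 1) global descriptor
        attn = self.fc2(F.relu(self.fc1(gap)))                  # (B, C * radix, 1, 1)
        attn = attn.view(x.size(0), self.radix, -1, 1, 1).softmax(dim=1)
        return sum(attn[:, i] * feats[i] for i in range(self.radix))
```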
arXiv Detail & Related papers (2020-04-19T20:40:31Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.