Channel-wise Knowledge Distillation for Dense Prediction
- URL: http://arxiv.org/abs/2011.13256v4
- Date: Fri, 27 Aug 2021 03:05:25 GMT
- Title: Channel-wise Knowledge Distillation for Dense Prediction
- Authors: Changyong Shu, Yifan Liu, Jianfei Gao, Zheng Yan, Chunhua Shen
- Abstract summary: We propose to align features channel-wise between the student and teacher networks.
We consistently achieve superior performance on three benchmarks with various network structures.
- Score: 73.99057249472735
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Knowledge distillation (KD) has been proven to be a simple and effective tool
for training compact models. Almost all KD variants for dense prediction tasks
align the student and teacher networks' feature maps in the spatial domain,
typically by minimizing point-wise and/or pair-wise discrepancy. Observing that
in semantic segmentation, some layers' feature activations of each channel tend
to encode saliency of scene categories (analogous to class activation mapping),
we propose to align features channel-wise between the student and teacher
networks. To this end, we first transform the feature map of each channel into
a probability map using softmax normalization, and then minimize the
Kullback-Leibler (KL) divergence of the corresponding channels of the two
networks. By doing so, our method focuses on mimicking the soft distributions
of channels between networks. In particular, the KL divergence enables learning
to pay more attention to the most salient regions of the channel-wise maps,
presumably corresponding to the most useful signals for semantic segmentation.
Experiments demonstrate that our channel-wise distillation outperforms almost
all existing spatial distillation methods for semantic segmentation
considerably, and requires less computational cost during training. We
consistently achieve superior performance on three benchmarks with various
network structures. Code is available at: https://git.io/Distiller
Related papers
- Distilling Channels for Efficient Deep Tracking [68.13422829310835]
This paper presents a novel framework termed channel distillation to facilitate deep trackers.
We show that an integrated formulation can turn feature compression, response map generation, and model update into a unified energy minimization problem.
The resulting deep tracker is accurate, fast, and has low memory requirements.
arXiv Detail & Related papers (2024-09-18T08:09:20Z)
- Group channel pruning and spatial attention distilling for object detection [2.8675002818821542]
We introduce a three-stage model compression method: dynamic sparse training, group channel pruning, and spatial attention distilling.
Our method reduces the model's parameters by 64.7% and its computation by 34.9%.
arXiv Detail & Related papers (2023-06-02T13:26:23Z)
- Fully Attentional Network for Semantic Segmentation [17.24768249911501]
We propose Fully Attentional Network (FLANet) to encode both spatial and channel attentions in a single similarity map.
Our new method has achieved state-of-the-art performance on three challenging semantic segmentation datasets.
arXiv Detail & Related papers (2021-12-08T04:34:55Z)
- Group Fisher Pruning for Practical Network Compression [58.25776612812883]
We present a general channel pruning approach that can be applied to various complicated structures.
We derive a unified metric based on Fisher information to evaluate the importance of a single channel and coupled channels (a rough sketch of this style of score appears after this list).
Our method can be used to prune any structures including those with coupled channels.
arXiv Detail & Related papers (2021-08-02T08:21:44Z)
- Operation-Aware Soft Channel Pruning using Differentiable Masks [51.04085547997066]
We propose a data-driven algorithm, which compresses deep neural networks in a differentiable way by exploiting the characteristics of operations.
We perform extensive experiments and achieve outstanding performance in terms of the accuracy of output networks.
arXiv Detail & Related papers (2020-07-08T07:44:00Z)
- DMCP: Differentiable Markov Channel Pruning for Neural Networks [67.51334229530273]
We propose a novel differentiable method for channel pruning, named Differentiable Markov Channel Pruning (DMCP).
Our method is differentiable and can be directly optimized by gradient descent with respect to standard task loss and budget regularization.
To validate the effectiveness of our method, we perform extensive experiments on ImageNet with ResNet and MobileNetV2.
arXiv Detail & Related papers (2020-05-07T09:39:55Z)
- Channel Interaction Networks for Fine-Grained Image Categorization [61.095320862647476]
Fine-grained image categorization is challenging due to the subtle inter-class differences.
We propose a channel interaction network (CIN), which models the channel-wise interplay both within an image and across images.
Our model can be trained efficiently in an end-to-end fashion without the need of multi-stage training and testing.
arXiv Detail & Related papers (2020-03-11T11:51:51Z)
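The Group Fisher Pruning entry above refers to a Fisher-information-based channel importance metric. The sketch below is only a rough illustration of that general style of score (accumulating activation-times-gradient statistics per channel over a batch); it is not the paper's exact formulation, which additionally covers coupled channels, and the function name and hook-based extraction are assumptions made for the example.
```python
import torch
import torch.nn as nn


def fisher_channel_importance(model: nn.Module,
                              layer: nn.Module,
                              data_loader,
                              loss_fn) -> torch.Tensor:
    """Rough Fisher-style channel importance score (illustrative sketch only).

    For the chosen layer's output of shape (N, C, H, W), the score of channel
    c is the squared sum of activation * gradient, accumulated over the data:
    a first-order estimate of how much the loss would change if the channel
    were removed.
    """
    saved = {}

    def hook(_module, _inputs, output):
        output.retain_grad()          # keep the gradient of this activation
        saved["act"] = output

    handle = layer.register_forward_hook(hook)
    scores = None

    for inputs, targets in data_loader:
        model.zero_grad()
        loss = loss_fn(model(inputs), targets)
        loss.backward()

        a = saved["act"]              # (N, C, H, W)
        g = a.grad                    # same shape
        # Sum activation * gradient over spatial dims, square, sum over batch.
        s = (a * g).sum(dim=(2, 3)).pow(2).sum(dim=0).detach()
        scores = s if scores is None else scores + s

    handle.remove()
    return scores                     # (C,) one importance value per channel
```
Channels with the smallest accumulated scores would be candidates for pruning; the actual method also propagates importance across layers whose channels are coupled, which this sketch does not attempt.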