Visualizing the embedding space to explain the effect of knowledge distillation
- URL: http://arxiv.org/abs/2110.04483v1
- Date: Sat, 9 Oct 2021 07:04:26 GMT
- Title: Visualizing the embedding space to explain the effect of knowledge distillation
- Authors: Hyun Seung Lee, Christian Wallraven
- Abstract summary: Recent research has found that knowledge distillation can be effective in reducing the size of a network.
Despite these advances, it is still relatively unclear *why* this method works, that is, what the resulting student model does 'better'.
- Score: 5.678337324555035
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent research has found that knowledge distillation can be effective in
reducing the size of a network and in increasing generalization. A pre-trained,
large teacher network, for example, was shown to be able to bootstrap a student
model that eventually outperforms the teacher in a limited label environment.
Despite these advances, it is still relatively unclear *why* this method
works, that is, what the resulting student model does 'better'. To address this
issue, here, we utilize two non-linear, low-dimensional embedding methods
(t-SNE and IVIS) to visualize representation spaces of different layers in a
network. We perform a set of extensive experiments with different architecture
parameters and distillation methods. The resulting visualizations and metrics
clearly show that distillation guides the network to find a more compact
representation space that supports higher accuracy already in earlier layers,
compared to its non-distilled version.
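As a rough illustration of the kind of analysis described in the abstract, the sketch below captures the activations of one intermediate layer of a stand-in PyTorch network and projects them to 2-D with t-SNE. The model, the layer choice, and the random placeholder data are assumptions for illustration; the paper's actual architectures, datasets, distillation setups, and its second embedding method (IVIS) are not reproduced here.

```python
# Minimal sketch: visualize one layer's representation space with t-SNE.
# Assumptions: a torchvision ResNet-18 stands in for a (distilled) student,
# and random tensors stand in for a labeled evaluation batch.
import torch
import torchvision.models as models
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

model = models.resnet18()   # placeholder network; untrained weights
model.eval()

# Capture activations of an intermediate layer with a forward hook.
features = {}
def save_activations(_module, _inputs, output):
    features["layer3"] = output.detach().flatten(start_dim=1)

model.layer3.register_forward_hook(save_activations)

# Placeholder batch; in practice this would be a real labeled evaluation set.
images = torch.randn(256, 3, 32, 32)
labels = torch.randint(0, 10, (256,))
with torch.no_grad():
    model(images)

# Non-linear, low-dimensional embedding of the layer's representation space.
embedding = TSNE(n_components=2, perplexity=30, init="pca").fit_transform(
    features["layer3"].numpy()
)
plt.scatter(embedding[:, 0], embedding[:, 1], c=labels.numpy(), s=5, cmap="tab10")
plt.title("t-SNE of intermediate-layer activations (illustrative)")
plt.show()
```

Comparing such plots, or compactness metrics computed on them, between a distilled student and its non-distilled counterpart is the layer-by-layer comparison the abstract refers to.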
Related papers
- Leveraging Angular Distributions for Improved Knowledge Distillation [4.751886527142779]
In computer vision applications, the feature activations learned by a higher-capacity model contain richer knowledge, highlighting complete objects while focusing less on the background.
We propose a new loss function for distillation, called angular margin-based distillation (AMD) loss.
We show that the proposed method has advantages in compatibility with other learning techniques, such as using fine-grained features, augmentation, and other distillation methods.
arXiv Detail & Related papers (2023-02-27T20:34:30Z) - HomoDistil: Homotopic Task-Agnostic Distillation of Pre-trained Transformers [49.79405257763856]
This paper focuses on task-agnostic distillation.
It produces a compact pre-trained model that can be easily fine-tuned on various tasks with small computational costs and memory footprints.
We propose Homotopic Distillation (HomoDistil), a novel task-agnostic distillation approach equipped with iterative pruning.
arXiv Detail & Related papers (2023-02-19T17:37:24Z) - EmbedDistill: A Geometric Knowledge Distillation for Information Retrieval [83.79667141681418]
Large neural models (such as Transformers) achieve state-of-the-art performance for information retrieval (IR).
We propose a novel distillation approach that leverages the relative geometry among queries and documents learned by the large teacher model.
We show that our approach successfully distills from both dual-encoder (DE) and cross-encoder (CE) teacher models to 1/10th size asymmetric students that can retain 95-97% of the teacher performance.
arXiv Detail & Related papers (2023-01-27T22:04:37Z) - Slimmable Networks for Contrastive Self-supervised Learning [69.9454691873866]
Self-supervised learning has made significant progress in pre-training large models, but it still struggles with small models.
We introduce a one-stage solution for obtaining pre-trained small models without the need for extra teachers.
A slimmable network consists of a full network and several weight-sharing sub-networks, which can be pre-trained once to obtain various networks.
arXiv Detail & Related papers (2022-09-30T15:15:05Z) - Knowledge Distillation Meets Open-Set Semi-Supervised Learning [69.21139647218456]
We propose a novel distillation method dedicated to distilling representational knowledge semantically from a pretrained teacher to a target student.
At the problem level, this establishes an interesting connection between knowledge distillation and open-set semi-supervised learning (SSL).
Our method significantly outperforms previous state-of-the-art knowledge distillation methods on both coarse object classification and fine-grained face recognition tasks.
arXiv Detail & Related papers (2022-05-13T15:15:27Z) - It's All in the Head: Representation Knowledge Distillation through Classifier Sharing [0.29360071145551075]
We introduce two approaches for enhancing representation distillation using classifier sharing between the teacher and student (a generic classifier-sharing sketch appears after this list).
We show the effectiveness of the proposed methods on various datasets and tasks, including image classification, fine-grained classification, and face verification.
arXiv Detail & Related papers (2022-01-18T13:10:36Z) - SimReg: Regression as a Simple Yet Effective Tool for Self-supervised Knowledge Distillation [14.739041141948032]
Feature regression is a simple way to distill large neural network models into smaller ones (a generic feature-regression sketch appears after this list).
We show that with simple changes to the network architecture, regression can outperform more complex state-of-the-art approaches for knowledge distillation.
arXiv Detail & Related papers (2022-01-13T18:41:46Z) - Teacher's pet: understanding and mitigating biases in distillation [61.44867470297283]
Several works have shown that distillation significantly boosts the student's overall performance.
However, are these gains uniform across all data subgroups?
We show that distillation can harm performance on certain subgroups.
We present techniques which soften the teacher influence for subgroups where it is less reliable.
arXiv Detail & Related papers (2021-06-19T13:06:25Z) - Knowledge distillation: A good teacher is patient and consistent [71.14922743774864]
There is a growing discrepancy in computer vision between large-scale models that achieve state-of-the-art performance and models that are affordable in practical applications.
We identify certain implicit design choices, which may drastically affect the effectiveness of distillation.
We obtain a state-of-the-art ResNet-50 model for ImageNet, which achieves 82.8% top-1 accuracy.
arXiv Detail & Related papers (2021-06-09T17:20:40Z) - Students are the Best Teacher: Exit-Ensemble Distillation with Multi-Exits [25.140055086630838]
This paper proposes a novel knowledge distillation-based learning method to improve the classification performance of convolutional neural networks (CNNs).
Unlike the conventional notion of distillation where teachers only teach students, we show that students can also help other students and even the teacher to learn better.
arXiv Detail & Related papers (2021-04-01T07:10:36Z)
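For the classifier-sharing entry above (It's All in the Head), the following is a hypothetical sketch of one way a teacher's classifier head could be shared with a student so that gradients from the shared, frozen head shape the student's representations. The models, the single loss term, and the one-step update are placeholder assumptions, not the paper's two proposed approaches.

```python
# Minimal sketch: representation distillation via a shared (frozen) teacher head.
# Assumptions: ResNet-34 teacher and ResNet-18 student (both 512-d features),
# random placeholder data; not the cited paper's exact recipe.
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision.models as models

teacher = models.resnet34().eval()    # stand-in teacher, kept frozen
student = models.resnet18()           # stand-in student to be trained

for p in teacher.parameters():
    p.requires_grad_(False)
shared_head = teacher.fc              # teacher's classifier, reused for the student

# Student backbone up to (and including) global pooling, then flatten to 512-d.
backbone = nn.Sequential(*list(student.children())[:-1], nn.Flatten())
optimizer = torch.optim.SGD(student.parameters(), lr=0.1)

images = torch.randn(8, 3, 224, 224)          # placeholder batch
labels = torch.randint(0, 1000, (8,))

feats = backbone(images)                       # student representations
logits = shared_head(feats)                    # classified by the *teacher's* head
loss = F.cross_entropy(logits, labels)         # gradients only update the student
loss.backward()
optimizer.step()
```

Because the shared head is frozen, the loss can only decrease by moving the student's features toward regions that the teacher's classifier already separates well, which is the intuition behind sharing the head.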
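For the SimReg entry above, this is a generic feature-regression distillation step: the student's backbone features are passed through a small regression head and trained to match the teacher's features (computed without gradients) under an MSE loss. The backbones, the linear head, and the single update are illustrative assumptions, not the paper's specific architecture changes.

```python
# Minimal sketch: generic feature-regression knowledge distillation.
# Assumptions: ResNet-50 teacher (2048-d), ResNet-18 student (512-d), linear head.
import torch
import torch.nn as nn
import torchvision.models as models

teacher = models.resnet50().eval()    # larger teacher, used only without gradients
student = models.resnet18()           # smaller student to be trained

teacher.fc = nn.Identity()            # expose backbone features instead of logits
student.fc = nn.Identity()

# Regression head maps student features into the teacher's feature space.
head = nn.Linear(512, 2048)
optimizer = torch.optim.SGD(
    list(student.parameters()) + list(head.parameters()), lr=0.1
)

images = torch.randn(8, 3, 224, 224)           # placeholder batch
with torch.no_grad():
    targets = teacher(images)                   # teacher features, no gradient

predictions = head(student(images))             # projected student features
loss = nn.functional.mse_loss(predictions, targets)
loss.backward()
optimizer.step()
```

In an actual training loop this step would be repeated over a dataset, optionally alongside a task loss on the student.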
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information shown here and is not responsible for any consequences arising from its use.