Visualizing the embedding space to explain the effect of knowledge
  distillation
- URL: http://arxiv.org/abs/2110.04483v1
- Date: Sat, 9 Oct 2021 07:04:26 GMT
- Title: Visualizing the embedding space to explain the effect of knowledge
  distillation
- Authors: Hyun Seung Lee, Christian Wallraven
- Abstract summary: Recent research has found that knowledge distillation can be effective in reducing the size of a network.
Despite these advances, it is still relatively unclear why this method works, that is, what the resulting student model does 'better'.
- Score: 5.678337324555035
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract:   Recent research has found that knowledge distillation can be effective in
reducing the size of a network and in increasing generalization. A pre-trained,
large teacher network, for example, was shown to be able to bootstrap a student
model that eventually outperforms the teacher in a limited label environment.
Despite these advances, it still is relatively unclear \emph{why} this method
works, that is, what the resulting student model does 'better'. To address this
issue, here, we utilize two non-linear, low-dimensional embedding methods
(t-SNE and IVIS) to visualize representation spaces of different layers in a
network. We perform a set of extensive experiments with different architecture
parameters and distillation methods. The resulting visualizations and metrics
clearly show that distillation guides the network to find a more compact
representation space for higher accuracy already in earlier layers compared to
its non-distilled version.
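To make the analysis described above concrete, here is a minimal, hypothetical sketch of the kind of pipeline the abstract outlines: collect the activations of one layer of a trained (distilled or non-distilled) classifier and project them to 2D with t-SNE. The model, layer, and data loader below are placeholder assumptions, scikit-learn's t-SNE stands in for the paper's exact setup, and the IVIS embedding the authors also use is not shown.

```python
# Hypothetical sketch: visualize one layer's representation space with t-SNE.
# "model", "layer", and "loader" are placeholders, not the paper's actual code.
import torch
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def collect_layer_features(model, layer, loader, device="cpu", max_batches=20):
    """Run the model and record the chosen layer's activations via a forward hook."""
    feats, labels = [], []
    handle = layer.register_forward_hook(
        lambda _m, _inp, out: feats.append(out.flatten(1).detach().cpu())
    )
    model.eval().to(device)
    with torch.no_grad():
        for i, (x, y) in enumerate(loader):
            if i >= max_batches:
                break
            model(x.to(device))
            labels.append(y)
    handle.remove()
    return torch.cat(feats).numpy(), torch.cat(labels).numpy()

def plot_tsne(features, labels, title="Layer embedding (t-SNE)"):
    """Project high-dimensional activations to 2D and colour the points by class."""
    xy = TSNE(n_components=2, perplexity=30, init="pca").fit_transform(features)
    plt.figure(figsize=(5, 5))
    plt.scatter(xy[:, 0], xy[:, 1], c=labels, s=4, cmap="tab10")
    plt.title(title)
    plt.tight_layout()
    plt.show()

# Example usage (placeholder names): plot the same layer for a distilled student
# and its non-distilled counterpart and compare how compact the class clusters are.
# feats, ys = collect_layer_features(student, student.layer3, test_loader)
# plot_tsne(feats, ys, title="Distilled student, layer3")
```

Comparing such plots (and cluster-compactness metrics computed on them) for corresponding layers of the distilled and non-distilled models is what supports the paper's claim that distillation yields a more compact representation space already in earlier layers.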
 
      
Related papers
- Adversarial Curriculum Graph-Free Knowledge Distillation for Graph Neural Networks [61.608453110751206]
 We propose a fast and high-quality data-free knowledge distillation approach for graph neural networks.
The proposed graph-free KD method (ACGKD) significantly reduces the spatial complexity of pseudo-graphs.
ACGKD eliminates the dimensional ambiguity between the student and teacher models by increasing the student's dimensions.
 arXiv  Detail & Related papers  (2025-04-01T08:44:27Z)
- Faithful Label-free Knowledge Distillation [8.572967695281054]
 This paper presents a label-free knowledge distillation approach called Teacher in the Middle (TinTeM).
It produces a more faithful student, which better replicates the behavior of the teacher network across a range of benchmarks testing model robustness, generalisability and out-of-distribution detection.
 arXiv  Detail & Related papers  (2024-11-22T01:48:44Z)
- HomoDistil: Homotopic Task-Agnostic Distillation of Pre-trained
  Transformers [49.79405257763856]
 This paper focuses on task-agnostic distillation, which produces a compact pre-trained model that can be easily fine-tuned on various tasks with small computational costs and memory footprints.
We propose Homotopic Distillation (HomoDistil), a novel task-agnostic distillation approach equipped with iterative pruning.
 arXiv  Detail & Related papers  (2023-02-19T17:37:24Z)
- EmbedDistill: A Geometric Knowledge Distillation for Information
  Retrieval [83.79667141681418]
 Large neural models (such as Transformers) achieve state-of-the-art performance for information retrieval (IR).
We propose a novel distillation approach that leverages the relative geometry among queries and documents learned by the large teacher model.
We show that our approach successfully distills from both dual-encoder (DE) and cross-encoder (CE) teacher models to 1/10th size asymmetric students that can retain 95-97% of the teacher performance.
 arXiv  Detail & Related papers  (2023-01-27T22:04:37Z)
- Slimmable Networks for Contrastive Self-supervised Learning [69.9454691873866]
 Self-supervised learning has made significant progress in pre-training large models but struggles with small models.
We introduce a one-stage solution for obtaining pre-trained small models without the need for extra teachers.
A slimmable network consists of a full network and several weight-sharing sub-networks, which can be pre-trained once to obtain various networks.
 arXiv  Detail & Related papers  (2022-09-30T15:15:05Z)
- Knowledge Distillation Meets Open-Set Semi-Supervised Learning [69.21139647218456]
 We propose a novel method dedicated to distilling representational knowledge semantically from a pretrained teacher to a target student.
At the problem level, this establishes an interesting connection between knowledge distillation and open-set semi-supervised learning (SSL).
Our method significantly outperforms previous state-of-the-art knowledge distillation methods on both coarse object classification and fine face recognition tasks.
 arXiv  Detail & Related papers  (2022-05-13T15:15:27Z)
- It's All in the Head: Representation Knowledge Distillation through
  Classifier Sharing [0.29360071145551075]
 We introduce two approaches for enhancing representation distillation using classifier sharing between the teacher and student.
We show the effectiveness of the proposed methods on various datasets and tasks, including image classification, fine-grained classification, and face verification.
 arXiv  Detail & Related papers  (2022-01-18T13:10:36Z)
- SimReg: Regression as a Simple Yet Effective Tool for Self-supervised
  Knowledge Distillation [14.739041141948032]
 Feature regression is a simple way to distill large neural network models to smaller ones.
We show that with simple changes to the network architecture, regression can outperform more complex state-of-the-art approaches for knowledge distillation (a rough sketch of this kind of feature regression appears after this list).
 arXiv  Detail & Related papers  (2022-01-13T18:41:46Z)
- Teacher's pet: understanding and mitigating biases in distillation [61.44867470297283]
 Several works have shown that distillation significantly boosts the student's overall performance.
However, are these gains uniform across all data subgroups?
We show that distillation can harm performance on certain subgroups.
We present techniques which soften the teacher influence for subgroups where it is less reliable.
 arXiv  Detail & Related papers  (2021-06-19T13:06:25Z)
- Knowledge distillation: A good teacher is patient and consistent [71.14922743774864]
 There is a growing discrepancy in computer vision between large-scale models that achieve state-of-the-art performance and models that are affordable in practical applications.
We identify certain implicit design choices, which may drastically affect the effectiveness of distillation.
We obtain a state-of-the-art ResNet-50 model for ImageNet, which achieves 82.8% top-1 accuracy.
 arXiv  Detail & Related papers  (2021-06-09T17:20:40Z)
- Students are the Best Teacher: Exit-Ensemble Distillation with
  Multi-Exits [25.140055086630838]
 This paper proposes a novel knowledge distillation-based learning method to improve the classification performance of convolutional neural networks (CNNs).
Unlike the conventional notion of distillation where teachers only teach students, we show that students can also help other students and even the teacher to learn better.
 arXiv  Detail & Related papers  (2021-04-01T07:10:36Z)
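The SimReg entry above describes plain feature regression as a distillation tool; the following hypothetical sketch shows what such an objective can look like: project the student's features to the teacher's dimensionality with a small MLP head and minimize an MSE loss against the frozen teacher's features. Module names, dimensions, and the projection head are illustrative assumptions, not the paper's exact architecture.

```python
# Hypothetical sketch of feature-regression distillation (in the spirit of SimReg).
# All names and dimensions below are illustrative assumptions.
import torch
import torch.nn as nn

class FeatureRegressionDistiller(nn.Module):
    def __init__(self, student_dim: int, teacher_dim: int, hidden_dim: int = 1024):
        super().__init__()
        # Small MLP that maps student features into the teacher's feature space.
        self.proj = nn.Sequential(
            nn.Linear(student_dim, hidden_dim),
            nn.ReLU(inplace=True),
            nn.Linear(hidden_dim, teacher_dim),
        )
        self.criterion = nn.MSELoss()

    def forward(self, student_feats: torch.Tensor, teacher_feats: torch.Tensor):
        # Regress projected student features onto the (detached) teacher features.
        return self.criterion(self.proj(student_feats), teacher_feats.detach())

# Example usage with random stand-ins for backbone outputs.
if __name__ == "__main__":
    distiller = FeatureRegressionDistiller(student_dim=512, teacher_dim=2048)
    s = torch.randn(32, 512)    # student backbone features
    t = torch.randn(32, 2048)   # teacher backbone features (frozen in practice)
    loss = distiller(s, t)
    loss.backward()
    print(float(loss))
```

In practice this regression term would be combined with (or, in a self-supervised setting, used instead of) the usual task loss while the teacher stays frozen; the SimReg entry's point is that such a plain objective, with minor architectural changes, can rival more elaborate distillation schemes.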
This list is automatically generated from the titles and abstracts of the papers on this site.
       
     