Visualizing the embedding space to explain the effect of knowledge distillation
- URL: http://arxiv.org/abs/2110.04483v1
- Date: Sat, 9 Oct 2021 07:04:26 GMT
- Title: Visualizing the embedding space to explain the effect of knowledge distillation
- Authors: Hyun Seung Lee, Christian Wallraven
- Abstract summary: Recent research has found that knowledge distillation can be effective in reducing the size of a network.
Despite these advances, it is still relatively unclear *why* this method works, that is, what the resulting student model does 'better'.
- Score: 5.678337324555035
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent research has found that knowledge distillation can be effective in
reducing the size of a network and in increasing generalization. A pre-trained,
large teacher network, for example, was shown to be able to bootstrap a student
model that eventually outperforms the teacher in a limited label environment.
Despite these advances, it is still relatively unclear *why* this method
works, that is, what the resulting student model does 'better'. To address this
issue, here, we utilize two non-linear, low-dimensional embedding methods
(t-SNE and IVIS) to visualize representation spaces of different layers in a
network. We perform a set of extensive experiments with different architecture
parameters and distillation methods. The resulting visualizations and metrics
clearly show that distillation guides the network to find a more compact
representation space that supports higher accuracy already in earlier layers,
compared to its non-distilled version.
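As a rough illustration of the kind of analysis described in the abstract, the sketch below captures the activations of one intermediate layer of a stand-in PyTorch network and projects them to 2-D with t-SNE. The model, the layer choice, and the random placeholder data are assumptions for illustration; the paper's actual architectures, datasets, distillation setups, and its second embedding method (IVIS) are not reproduced here.

```python
# Minimal sketch: visualize one layer's representation space with t-SNE.
# Assumptions: a torchvision ResNet-18 stands in for a (distilled) student,
# and random tensors stand in for a labeled evaluation batch.
import torch
import torchvision.models as models
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

model = models.resnet18()   # placeholder network; untrained weights
model.eval()

# Capture activations of an intermediate layer with a forward hook.
features = {}
def save_activations(_module, _inputs, output):
    features["layer3"] = output.detach().flatten(start_dim=1)

model.layer3.register_forward_hook(save_activations)

# Placeholder batch; in practice this would be a real labeled evaluation set.
images = torch.randn(256, 3, 32, 32)
labels = torch.randint(0, 10, (256,))
with torch.no_grad():
    model(images)

# Non-linear, low-dimensional embedding of the layer's representation space.
embedding = TSNE(n_components=2, perplexity=30, init="pca").fit_transform(
    features["layer3"].numpy()
)
plt.scatter(embedding[:, 0], embedding[:, 1], c=labels.numpy(), s=5, cmap="tab10")
plt.title("t-SNE of intermediate-layer activations (illustrative)")
plt.show()
```

Comparing such plots, or compactness metrics computed on them, between a distilled student and its non-distilled counterpart is the layer-by-layer comparison the abstract refers to.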
Related papers
- Leveraging Angular Distributions for Improved Knowledge Distillation [4.751886527142779]
In computer vision applications, the feature activations learned by a higher-capacity model contain richer knowledge, highlighting complete objects while focusing less on the background.
We propose a new loss function for distillation, called angular margin-based distillation (AMD) loss.
We show that the proposed method has advantages in compatibility with other learning techniques, such as using fine-grained features, augmentation, and other distillation methods.
arXiv Detail & Related papers (2023-02-27T20:34:30Z) - HomoDistil: Homotopic Task-Agnostic Distillation of Pre-trained Transformers [49.79405257763856]
This paper focuses on task-agnostic distillation.
It produces a compact pre-trained model that can be easily fine-tuned on various tasks with small computational costs and memory footprints.
We propose Homotopic Distillation (HomoDistil), a novel task-agnostic distillation approach equipped with iterative pruning.
arXiv Detail & Related papers (2023-02-19T17:37:24Z) - EmbedDistill: A Geometric Knowledge Distillation for Information Retrieval [83.79667141681418]
Large neural models (such as Transformers) achieve state-of-the-art performance for information retrieval (IR).
We propose a novel distillation approach that leverages the relative geometry among queries and documents learned by the large teacher model.
We show that our approach successfully distills from both dual-encoder (DE) and cross-encoder (CE) teacher models to 1/10th size asymmetric students that can retain 95-97% of the teacher performance.
arXiv Detail & Related papers (2023-01-27T22:04:37Z) - Slimmable Networks for Contrastive Self-supervised Learning [69.9454691873866]
Self-supervised learning has made significant progress in pre-training large models, but it still struggles with small models.
We introduce a one-stage solution for obtaining pre-trained small models without the need for extra teachers.
A slimmable network consists of a full network and several weight-sharing sub-networks, which can be pre-trained once to obtain various networks.
arXiv Detail & Related papers (2022-09-30T15:15:05Z) - Knowledge Distillation Meets Open-Set Semi-Supervised Learning [69.21139647218456]
We propose a novel distillation method dedicated to distilling representational knowledge semantically from a pretrained teacher to a target student.
At the problem level, this establishes an interesting connection between knowledge distillation and open-set semi-supervised learning (SSL).
Our method significantly outperforms previous state-of-the-art knowledge distillation methods on both coarse object classification and fine-grained face recognition tasks.
arXiv Detail & Related papers (2022-05-13T15:15:27Z) - It's All in the Head: Representation Knowledge Distillation through Classifier Sharing [0.29360071145551075]
We introduce two approaches for enhancing representation distillation using classifier sharing between the teacher and student (a generic classifier-sharing sketch appears after this list).
We show the effectiveness of the proposed methods on various datasets and tasks, including image classification, fine-grained classification, and face verification.
arXiv Detail & Related papers (2022-01-18T13:10:36Z) - SimReg: Regression as a Simple Yet Effective Tool for Self-supervised Knowledge Distillation [14.739041141948032]
Feature regression is a simple way to distill large neural network models into smaller ones (a generic feature-regression sketch appears after this list).
We show that with simple changes to the network architecture, regression can outperform more complex state-of-the-art approaches for knowledge distillation.
arXiv Detail & Related papers (2022-01-13T18:41:46Z) - Teacher's pet: understanding and mitigating biases in distillation [61.44867470297283]
Several works have shown that distillation significantly boosts the student's overall performance.
However, are these gains uniform across all data subgroups?
We show that distillation can harm performance on certain subgroups.
We present techniques which soften the teacher influence for subgroups where it is less reliable.
arXiv Detail & Related papers (2021-06-19T13:06:25Z) - Knowledge distillation: A good teacher is patient and consistent [71.14922743774864]
There is a growing discrepancy in computer vision between large-scale models that achieve state-of-the-art performance and models that are affordable in practical applications.
We identify certain implicit design choices, which may drastically affect the effectiveness of distillation.
We obtain a state-of-the-art ResNet-50 model for ImageNet, which achieves 82.8% top-1 accuracy.
arXiv Detail & Related papers (2021-06-09T17:20:40Z) - Students are the Best Teacher: Exit-Ensemble Distillation with Multi-Exits [25.140055086630838]
This paper proposes a novel knowledge distillation-based learning method to improve the classification performance of convolutional neural networks (CNNs).
Unlike the conventional notion of distillation where teachers only teach students, we show that students can also help other students and even the teacher to learn better.
arXiv Detail & Related papers (2021-04-01T07:10:36Z)
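For the classifier-sharing entry above (It's All in the Head), the following is a hypothetical sketch of one way a teacher's classifier head could be shared with a student so that gradients from the shared, frozen head shape the student's representations. The models, the single loss term, and the one-step update are placeholder assumptions, not the paper's two proposed approaches.

```python
# Minimal sketch: representation distillation via a shared (frozen) teacher head.
# Assumptions: ResNet-34 teacher and ResNet-18 student (both 512-d features),
# random placeholder data; not the cited paper's exact recipe.
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision.models as models

teacher = models.resnet34().eval()    # stand-in teacher, kept frozen
student = models.resnet18()           # stand-in student to be trained

for p in teacher.parameters():
    p.requires_grad_(False)
shared_head = teacher.fc              # teacher's classifier, reused for the student

# Student backbone up to (and including) global pooling, then flatten to 512-d.
backbone = nn.Sequential(*list(student.children())[:-1], nn.Flatten())
optimizer = torch.optim.SGD(student.parameters(), lr=0.1)

images = torch.randn(8, 3, 224, 224)          # placeholder batch
labels = torch.randint(0, 1000, (8,))

feats = backbone(images)                       # student representations
logits = shared_head(feats)                    # classified by the *teacher's* head
loss = F.cross_entropy(logits, labels)         # gradients only update the student
loss.backward()
optimizer.step()
```

Because the shared head is frozen, the loss can only decrease by moving the student's features toward regions that the teacher's classifier already separates well, which is the intuition behind sharing the head.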
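For the SimReg entry above, this is a generic feature-regression distillation step: the student's backbone features are passed through a small regression head and trained to match the teacher's features (computed without gradients) under an MSE loss. The backbones, the linear head, and the single update are illustrative assumptions, not the paper's specific architecture changes.

```python
# Minimal sketch: generic feature-regression knowledge distillation.
# Assumptions: ResNet-50 teacher (2048-d), ResNet-18 student (512-d), linear head.
import torch
import torch.nn as nn
import torchvision.models as models

teacher = models.resnet50().eval()    # larger teacher, used only without gradients
student = models.resnet18()           # smaller student to be trained

teacher.fc = nn.Identity()            # expose backbone features instead of logits
student.fc = nn.Identity()

# Regression head maps student features into the teacher's feature space.
head = nn.Linear(512, 2048)
optimizer = torch.optim.SGD(
    list(student.parameters()) + list(head.parameters()), lr=0.1
)

images = torch.randn(8, 3, 224, 224)           # placeholder batch
with torch.no_grad():
    targets = teacher(images)                   # teacher features, no gradient

predictions = head(student(images))             # projected student features
loss = nn.functional.mse_loss(predictions, targets)
loss.backward()
optimizer.step()
```

In an actual training loop this step would be repeated over a dataset, optionally alongside a task loss on the student.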
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information shown here and is not responsible for any consequences arising from its use.