Simple Unsupervised Knowledge Distillation With Space Similarity
- URL: http://arxiv.org/abs/2409.13939v1
- Date: Fri, 20 Sep 2024 22:54:39 GMT
- Title: Simple Unsupervised Knowledge Distillation With Space Similarity
- Authors: Aditya Singh, Haohan Wang
- Abstract summary: Self-supervised learning (SSL) does not readily extend to smaller architectures.
We propose a simple objective to capture the lost information due to normalisation.
Our proposed loss component, termed space similarity, motivates each dimension of a student's feature space to be similar to the corresponding dimension of its teacher.
- Score: 15.341380611979524
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: As per recent studies, Self-supervised learning (SSL) does not readily extend to smaller architectures. One direction to mitigate this shortcoming while simultaneously training a smaller network without labels is to adopt unsupervised knowledge distillation (UKD). Existing UKD approaches handcraft preservation worthy inter/intra sample relationships between the teacher and its student. However, this may overlook/ignore other key relationships present in the mapping of a teacher. In this paper, instead of heuristically constructing preservation worthy relationships between samples, we directly motivate the student to model the teacher's embedding manifold. If the mapped manifold is similar, all inter/intra sample relationships are indirectly conserved. We first demonstrate that prior methods cannot preserve teacher's latent manifold due to their sole reliance on $L_2$ normalised embedding features. Subsequently, we propose a simple objective to capture the lost information due to normalisation. Our proposed loss component, termed \textbf{space similarity}, motivates each dimension of a student's feature space to be similar to the corresponding dimension of its teacher. We perform extensive experiments demonstrating strong performance of our proposed approach on various benchmarks.
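The core idea described in the abstract can be sketched concretely: instead of (or in addition to) normalising each sample's embedding, normalise each feature *dimension* across the batch and align student and teacher dimension-wise. Below is a minimal NumPy sketch of such a space similarity loss; the function name, shapes, and exact formulation are illustrative assumptions, not the paper's verbatim objective.

```python
import numpy as np

def space_similarity_loss(student, teacher, eps=1e-8):
    """Encourage each feature dimension (column) of the student to align
    with the corresponding dimension of the teacher.

    student, teacher: (batch, dim) feature matrices.
    Unlike the usual per-sample L2 normalisation (over rows), each
    dimension is normalised across the batch (over columns), and the
    per-dimension cosine similarity is maximised.
    """
    s = student / (np.linalg.norm(student, axis=0, keepdims=True) + eps)
    t = teacher / (np.linalg.norm(teacher, axis=0, keepdims=True) + eps)
    # cosine similarity per dimension, averaged over all dimensions;
    # negated so that minimising the loss maximises similarity
    return -np.mean(np.sum(s * t, axis=0))
```

In practice such a term would be combined with a standard per-sample feature-matching loss, since it captures exactly the column-wise (per-dimension) information that row-wise L2 normalisation discards.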
Related papers
- Progressive distillation induces an implicit curriculum [44.528775476168654]
A better teacher does not always yield a better student, to which a common mitigation is to use additional supervision from several teachers.
One empirically validated variant of this principle is progressive distillation, where the student learns from successive intermediate checkpoints of the teacher.
Using sparse parity as a sandbox, we identify an implicit curriculum as one mechanism through which progressive distillation accelerates the student's learning.
arXiv Detail & Related papers (2024-10-07T19:49:24Z) - Relational Representation Distillation [6.24302896438145]
We introduce Relational Representation Distillation (RRD) to explore and reinforce relationships between teacher and student models.
Inspired by self-supervised learning principles, it uses a relaxed contrastive loss that focuses on similarity rather than exact replication.
Our approach demonstrates superior performance on CIFAR-100 and ImageNet ILSVRC-2012 and sometimes even outperforms the teacher network when combined with KD.
arXiv Detail & Related papers (2024-07-16T14:56:13Z) - EmbedDistill: A Geometric Knowledge Distillation for Information Retrieval [83.79667141681418]
Large neural models (such as Transformers) achieve state-of-the-art performance for information retrieval (IR).
We propose a novel distillation approach that leverages the relative geometry among queries and documents learned by the large teacher model.
We show that our approach successfully distills from both dual-encoder (DE) and cross-encoder (CE) teacher models to 1/10th size asymmetric students that can retain 95-97% of the teacher performance.
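One simple way to encode "relative geometry among queries and documents" is to align the student's query-document score matrix with the teacher's. The sketch below is a hedged illustration of that general idea; EmbedDistill's actual objective (which also handles dual- and cross-encoder teachers) may differ in form.

```python
import numpy as np

def geometry_distillation_loss(sq, sd, tq, td):
    """Align the student's query-document score geometry with the teacher's.

    sq, tq: (n_queries, dim) query embeddings (student, teacher).
    sd, td: (n_docs, dim) document embeddings (student, teacher).
    """
    # squared error between the two full score matrices
    return np.mean((sq @ sd.T - tq @ td.T) ** 2)
```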
arXiv Detail & Related papers (2023-01-27T22:04:37Z) - Knowledge Distillation from A Stronger Teacher [44.11781464210916]
This paper presents a method dubbed DIST to distill better from a stronger teacher.
We empirically find that the discrepancy between the predictions of the student and a stronger teacher tends to be fairly severe.
Our method is simple yet practical, and extensive experiments demonstrate that it adapts well to various architectures.
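A common way to relax exact prediction matching, and the one DIST is generally associated with, is to match predictions up to an affine rescaling via Pearson correlation rather than KL divergence. The sketch below illustrates that relaxation; treat the function names and per-sample formulation as assumptions rather than the paper's exact loss.

```python
import numpy as np

def pearson_distance(a, b, eps=1e-8):
    """1 - Pearson correlation between two vectors."""
    a = a - a.mean()
    b = b - b.mean()
    return 1.0 - (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + eps)

def dist_like_loss(student_logits, teacher_logits):
    """Match inter-class relations per sample.

    A student prediction that is a linear rescaling of the teacher's
    incurs (near-)zero loss, relaxing the exact-match constraint of
    standard KL-based distillation.
    """
    return np.mean([pearson_distance(s, t)
                    for s, t in zip(student_logits, teacher_logits)])
```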
arXiv Detail & Related papers (2022-05-21T08:30:58Z) - Knowledge Distillation Meets Open-Set Semi-Supervised Learning [69.21139647218456]
We propose a novel method dedicated to distilling representational knowledge semantically from a pretrained teacher to a target student.
At the problem level, this establishes an interesting connection between knowledge distillation and open-set semi-supervised learning (SSL).
Our method significantly outperforms previous state-of-the-art knowledge distillation methods on both coarse object classification and fine face recognition tasks.
arXiv Detail & Related papers (2022-05-13T15:15:27Z) - Generalized Knowledge Distillation via Relationship Matching [53.69235109551099]
Knowledge of a well-trained deep neural network (a.k.a. the "teacher") is valuable for learning similar tasks.
Knowledge distillation extracts knowledge from the teacher and integrates it with the target model.
Instead of enforcing the teacher to work on the same task as the student, we borrow the knowledge from a teacher trained from a general label space.
arXiv Detail & Related papers (2022-05-04T06:49:47Z) - Chaos is a Ladder: A New Theoretical Understanding of Contrastive Learning via Augmentation Overlap [64.60460828425502]
We propose a new guarantee on the downstream performance of contrastive learning.
Our new theory hinges on the insight that the support of different intra-class samples will become more overlapped under aggressive data augmentations.
We propose an unsupervised model selection metric ARC that aligns well with downstream accuracy.
arXiv Detail & Related papers (2022-03-25T05:36:26Z) - A Low Rank Promoting Prior for Unsupervised Contrastive Learning [108.91406719395417]
We construct a novel probabilistic graphical model that effectively incorporates the low rank promoting prior into the framework of contrastive learning.
Our hypothesis explicitly requires that all the samples belonging to the same instance class lie on the same subspace with small dimension.
Empirical evidence shows that the proposed algorithm clearly surpasses the state-of-the-art approaches on multiple benchmarks.
arXiv Detail & Related papers (2021-08-05T15:58:25Z) - Bag of Instances Aggregation Boosts Self-supervised Learning [122.61914701794296]
We propose a simple but effective distillation strategy for unsupervised learning.
Our method, termed BINGO, targets transferring the relationship learned by the teacher to the student.
BINGO achieves new state-of-the-art performance on small scale models.
arXiv Detail & Related papers (2021-07-04T17:33:59Z) - ALP-KD: Attention-Based Layer Projection for Knowledge Distillation [30.896957367331137]
Two neural networks, namely a teacher and a student, are coupled together during training.
The teacher network is supposed to be a trustworthy predictor and the student tries to mimic its predictions.
In such a setting, distillation only happens for final predictions, whereas the student could also benefit from the teacher's supervision for internal components.
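The attention-based layer projection idea can be sketched as follows: each student layer computes attention weights over all teacher layers and distils toward the resulting weighted combination, so no hand-picked layer correspondence is needed. This is a simplified per-vector illustration under assumed shapes; ALP-KD operates on token-level hidden states.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def alp_target(student_layer, teacher_layers):
    """Attention-weighted teacher target for one student layer.

    student_layer: (dim,) hidden vector of the student layer.
    teacher_layers: list of (dim,) hidden vectors, one per teacher layer.
    """
    T = np.stack(teacher_layers)      # (num_layers, dim)
    w = softmax(T @ student_layer)    # attention over teacher layers
    return w @ T                      # weighted combination as the target
```

The distillation loss would then be, e.g., the mean squared error between `student_layer` and `alp_target(student_layer, teacher_layers)`.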
arXiv Detail & Related papers (2020-12-27T22:30:13Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of its contents (including all information) and is not responsible for any consequences.