Information Theoretic Representation Distillation
- URL: http://arxiv.org/abs/2112.00459v1
- Date: Wed, 1 Dec 2021 12:39:50 GMT
- Title: Information Theoretic Representation Distillation
- Authors: Roy Miles, Adri\'an L\'opez Rodr\'iguez, Krystian Mikolajczyk
- Abstract summary: We forge an alternative connection between information theory and knowledge distillation using a recently proposed entropy-like functional.
Our method achieves competitive performance to state-of-the-art on the knowledge distillation and cross-model transfer tasks.
We shed light to a new state-of-the-art for binary quantisation.
- Score: 20.802135299032308
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Despite the empirical success of knowledge distillation, there still lacks a
theoretical foundation that can naturally lead to computationally inexpensive
implementations. To address this concern, we forge an alternative connection
between information theory and knowledge distillation using a recently proposed
entropy-like functional. In doing so, we introduce two distinct complementary
losses which aim to maximise the correlation and mutual information between the
student and teacher representations. Our method achieves competitive
performance to state-of-the-art on the knowledge distillation and cross-model
transfer tasks, while incurring significantly less training overheads than
closely related and similarly performing approaches. We further demonstrate the
effectiveness of our method on a binary distillation task, whereby we shed
light to a new state-of-the-art for binary quantisation. The code, evaluation
protocols, and trained models will be publicly available.
Related papers
- Multi-Stage Knowledge Integration of Vision-Language Models for Continual Learning [79.46570165281084]
We propose a Multi-Stage Knowledge Integration network (MulKI) to emulate the human learning process in distillation methods.
MulKI achieves this through four stages, including Eliciting Ideas, Adding New Ideas, Distinguishing Ideas, and Making Connections.
Our method demonstrates significant improvements in maintaining zero-shot capabilities while supporting continual learning across diverse downstream tasks.
arXiv Detail & Related papers (2024-11-11T07:36:19Z) - Learning to Maximize Mutual Information for Chain-of-Thought Distillation [13.660167848386806]
Distilling Step-by-Step(DSS) has demonstrated promise by imbuing smaller models with the superior reasoning capabilities of their larger counterparts.
However, DSS overlooks the intrinsic relationship between the two training tasks, leading to ineffective integration of CoT knowledge with the task of label prediction.
We propose a variational approach to solve this problem using a learning-based method.
arXiv Detail & Related papers (2024-03-05T22:21:45Z) - The Staged Knowledge Distillation in Video Classification: Harmonizing
Student Progress by a Complementary Weakly Supervised Framework [21.494759678807686]
We propose a new weakly supervised learning framework for knowledge distillation in video classification.
Our approach leverages the concept of substage-based learning to distill knowledge based on the combination of student substages and the correlation of corresponding substages.
Our proposed substage-based distillation approach has the potential to inform future research on label-efficient learning for video data.
arXiv Detail & Related papers (2023-07-11T12:10:42Z) - Knowledge Distillation via Token-level Relationship Graph [12.356770685214498]
We propose a novel method called Knowledge Distillation with Token-level Relationship Graph (TRG)
By employing TRG, the student model can effectively emulate higher-level semantic information from the teacher model.
We conduct experiments to evaluate the effectiveness of the proposed method against several state-of-the-art approaches.
arXiv Detail & Related papers (2023-06-20T08:16:37Z) - Towards a Unified View of Affinity-Based Knowledge Distillation [5.482532589225552]
We modularise knowledge distillation into a framework of three components, i.e. affinity, normalisation, and loss.
We show how relation-based knowledge distillation could achieve comparable performance to the state of the art in spite of the simplicity.
arXiv Detail & Related papers (2022-09-30T16:12:25Z) - PAIR: Leveraging Passage-Centric Similarity Relation for Improving Dense
Passage Retrieval [87.68667887072324]
We propose a novel approach that leverages query-centric and PAssage-centric sImilarity Relations (called PAIR) for dense passage retrieval.
To implement our approach, we make three major technical contributions by introducing formal formulations of the two kinds of similarity relations.
Our approach significantly outperforms previous state-of-the-art models on both MSMARCO and Natural Questions datasets.
arXiv Detail & Related papers (2021-08-13T02:07:43Z) - Self-supervised Co-training for Video Representation Learning [103.69904379356413]
We investigate the benefit of adding semantic-class positives to instance-based Info Noise Contrastive Estimation training.
We propose a novel self-supervised co-training scheme to improve the popular infoNCE loss.
We evaluate the quality of the learnt representation on two different downstream tasks: action recognition and video retrieval.
arXiv Detail & Related papers (2020-10-19T17:59:01Z) - On the Orthogonality of Knowledge Distillation with Other Techniques:
From an Ensemble Perspective [34.494730096460636]
We show that knowledge distillation is a powerful apparatus for practical deployment of efficient neural network.
We also introduce ways to integrate knowledge distillation with other methods effectively.
arXiv Detail & Related papers (2020-09-09T06:14:59Z) - Knowledge Distillation Meets Self-Supervision [109.6400639148393]
Knowledge distillation involves extracting "dark knowledge" from a teacher network to guide the learning of a student network.
We show that the seemingly different self-supervision task can serve as a simple yet powerful solution.
By exploiting the similarity between those self-supervision signals as an auxiliary task, one can effectively transfer the hidden information from the teacher to the student.
arXiv Detail & Related papers (2020-06-12T12:18:52Z) - Residual Knowledge Distillation [96.18815134719975]
This work proposes Residual Knowledge Distillation (RKD), which further distills the knowledge by introducing an assistant (A)
In this way, S is trained to mimic the feature maps of T, and A aids this process by learning the residual error between them.
Experiments show that our approach achieves appealing results on popular classification datasets, CIFAR-100 and ImageNet.
arXiv Detail & Related papers (2020-02-21T07:49:26Z) - Transfer Heterogeneous Knowledge Among Peer-to-Peer Teammates: A Model
Distillation Approach [55.83558520598304]
We propose a brand new solution to reuse experiences and transfer value functions among multiple students via model distillation.
We also describe how to design an efficient communication protocol to exploit heterogeneous knowledge.
Our proposed framework, namely Learning and Teaching Categorical Reinforcement, shows promising performance on stabilizing and accelerating learning progress.
arXiv Detail & Related papers (2020-02-06T11:31:04Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.