Related papers: Understanding the Role of the Projector in Knowledge Distillation

Understanding the Role of the Projector in Knowledge Distillation

URL: http://arxiv.org/abs/2303.11098v5
Date: Thu, 1 Feb 2024 11:51:01 GMT
Title: Understanding the Role of the Projector in Knowledge Distillation
Authors: Roy Miles and Krystian Mikolajczyk
Abstract summary: We revisit the efficacy of knowledge distillation as a function matching and metric learning problem. We verify three important design decisions, namely the normalisation, soft maximum function, and projection layers. We attain a 77.2% top-1 accuracy with DeiT-Ti on ImageNet.
Score: 22.698845243751293
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: In this paper we revisit the efficacy of knowledge distillation as a function matching and metric learning problem. In doing so we verify three important design decisions, namely the normalisation, soft maximum function, and projection layers as key ingredients. We theoretically show that the projector implicitly encodes information on past examples, enabling relational gradients for the student. We then show that the normalisation of representations is tightly coupled with the training dynamics of this projector, which can have a large impact on the students performance. Finally, we show that a simple soft maximum function can be used to address any significant capacity gap problems. Experimental results on various benchmark datasets demonstrate that using these insights can lead to superior or comparable performance to state-of-the-art knowledge distillation techniques, despite being much more computationally efficient. In particular, we obtain these results across image classification (CIFAR100 and ImageNet), object detection (COCO2017), and on more difficult distillation objectives, such as training data efficient transformers, whereby we attain a 77.2% top-1 accuracy with DeiT-Ti on ImageNet. Code and models are publicly available.

Related papers

Underlying Semantic Diffusion for Effective and Efficient In-Context Learning [113.4003355229632]
Underlying Semantic Diffusion (US-Diffusion) is an enhanced diffusion model that boosts underlying semantics learning, computational efficiency, and in-context learning capabilities. We present a Feedback-Aided Learning (FAL) framework, which leverages feedback signals to guide the model in capturing semantic details. We also propose a plug-and-play Efficient Sampling Strategy (ESS) for dense sampling at time steps with high-noise levels.
arXiv Detail & Related papers (2025-03-06T03:06:22Z)
Projection Head is Secretly an Information Bottleneck [33.755883011145755]
We develop an in-depth theoretical understanding of the projection head from the information-theoretic perspective. By establishing the theoretical guarantees on the downstream performance of the features before the projector, we reveal that an effective projector should act as an information bottleneck. Our methods exhibit consistent improvement in the downstream performance across various real-world datasets.
arXiv Detail & Related papers (2025-03-01T14:23:31Z)
Knowledge Composition using Task Vectors with Learned Anisotropic Scaling [51.4661186662329]
We introduce aTLAS, an algorithm that linearly combines parameter blocks with different learned coefficients, resulting in anisotropic scaling at the task vector level. We show that such linear combinations explicitly exploit the low intrinsicity of pre-trained models, with only a few coefficients being the learnable parameters. We demonstrate the effectiveness of our method in task arithmetic, few-shot recognition and test-time adaptation, with supervised or unsupervised objectives.
arXiv Detail & Related papers (2024-07-03T07:54:08Z)
TSCM: A Teacher-Student Model for Vision Place Recognition Using Cross-Metric Knowledge Distillation [6.856317526681759]
Visual place recognition plays a pivotal role in autonomous exploration and navigation of mobile robots. Existing methods overcome this by exploiting powerful yet large networks. We propose a high-performance teacher and lightweight student distillation framework called TSCM.
arXiv Detail & Related papers (2024-04-02T02:29:41Z)
Investigating the Benefits of Projection Head for Representation Learning [11.20245728716827]
An effective technique for obtaining high-quality representations is adding a projection head on top of the encoder during training, then discarding it and using the pre-projection representations. The pre-projection representations are not directly optimized by the loss function, raising the question: what makes them better? We show that implicit bias of training algorithms leads to layer-wise progressive feature weighting, where features become increasingly unequal as we go deeper into the layers.
arXiv Detail & Related papers (2024-03-18T00:48:58Z)
Object-centric Cross-modal Feature Distillation for Event-based Object Detection [87.50272918262361]
RGB detectors still outperform event-based detectors due to sparsity of the event data and missing visual details. We develop a novel knowledge distillation approach to shrink the performance gap between these two modalities. We show that object-centric distillation allows to significantly improve the performance of the event-based student object detector.
arXiv Detail & Related papers (2023-11-09T16:33:08Z)
Understanding the Effects of Projectors in Knowledge Distillation [31.882356225974632]
Even if the student and the teacher have the same feature dimensions, adding a projector still helps to improve the distillation performance. This paper investigates the implicit role that projectors play but so far have been overlooked. Motivated by the positive effects of projectors, we propose a projector ensemble-based feature distillation method to further improve distillation performance.
arXiv Detail & Related papers (2023-10-26T06:30:39Z)
Multi-dataset Training of Transformers for Robust Action Recognition [75.5695991766902]
We study the task of robust feature representations, aiming to generalize well on multiple datasets for action recognition. Here, we propose a novel multi-dataset training paradigm, MultiTrain, with the design of two new loss terms, namely informative loss and projection loss. We verify the effectiveness of our method on five challenging datasets, Kinetics-400, Kinetics-700, Moments-in-Time, Activitynet and Something-something-v2.
arXiv Detail & Related papers (2022-09-26T01:30:43Z)
Effectiveness of Function Matching in Driving Scene Recognition [0.571097144710995]
We experimentally investigate the impact of using such a large amount of unlabeled data for distillation on the performance of student models. We demonstrate that the performance of the compact student model can be improved dramatically and even match the performance of the large-scale teacher.
arXiv Detail & Related papers (2022-08-20T14:32:20Z)
Rich Feature Distillation with Feature Affinity Module for Efficient Image Dehazing [1.1470070927586016]
This work introduces a simple, lightweight, and efficient framework for single-image haze removal. We exploit rich "dark-knowledge" information from a lightweight pre-trained super-resolution model via the notion of heterogeneous knowledge distillation. Our experiments are carried out on the RESIDE-Standard dataset to demonstrate the robustness of our framework to the synthetic and real-world domains.
arXiv Detail & Related papers (2022-07-13T18:32:44Z)
Efficient Vision Transformers via Fine-Grained Manifold Distillation [96.50513363752836]
Vision transformer architectures have shown extraordinary performance on many computer vision tasks. Although the network performance is boosted, transformers are often required more computational resources. We propose to excavate useful information from the teacher transformer through the relationship between images and the divided patches.
arXiv Detail & Related papers (2021-07-03T08:28:34Z)
Knowledge distillation: A good teacher is patient and consistent [71.14922743774864]
There is a growing discrepancy in computer vision between large-scale models that achieve state-of-the-art performance and models that are affordable in practical applications. We identify certain implicit design choices, which may drastically affect the effectiveness of distillation. We obtain a state-of-the-art ResNet-50 model for ImageNet, which achieves 82.8% top-1 accuracy.
arXiv Detail & Related papers (2021-06-09T17:20:40Z)
Knowledge Distillation Meets Self-Supervision [109.6400639148393]
Knowledge distillation involves extracting "dark knowledge" from a teacher network to guide the learning of a student network. We show that the seemingly different self-supervision task can serve as a simple yet powerful solution. By exploiting the similarity between those self-supervision signals as an auxiliary task, one can effectively transfer the hidden information from the teacher to the student.
arXiv Detail & Related papers (2020-06-12T12:18:52Z)

This list is automatically generated from the titles and abstracts of the papers in this site.