$V_kD:$ Improving Knowledge Distillation using Orthogonal Projections
- URL: http://arxiv.org/abs/2403.06213v1
- Date: Sun, 10 Mar 2024 13:26:24 GMT
- Title: $V_kD:$ Improving Knowledge Distillation using Orthogonal Projections
- Authors: Roy Miles, Ismail Elezi, Jiankang Deng
- Abstract summary: Knowledge distillation is an effective method for training small and efficient deep learning models.
However, the efficacy of a single method can degenerate when transferring to other tasks, modalities, or architectures.
We propose a novel constrained feature distillation method to address this limitation.
- Score: 36.27954884906034
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Knowledge distillation is an effective method for training small and
efficient deep learning models. However, the efficacy of a single method can
degenerate when transferring to other tasks, modalities, or even other
architectures. To address this limitation, we propose a novel constrained
feature distillation method. This method is derived from a small set of core
principles, which results in two emerging components: an orthogonal projection
and a task-specific normalisation. Equipped with both of these components, our
transformer models can outperform all previous methods on ImageNet and reach up
to a 4.4% relative improvement over the previous state-of-the-art methods. To
further demonstrate the generality of our method, we apply it to object
detection and image generation, whereby we obtain consistent and substantial
performance improvements over state-of-the-art. Code and models are publicly
available: https://github.com/roymiles/vkd
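The abstract names two components: an orthogonal projection from the student to the teacher feature space and a task-specific normalisation. As a rough illustration of how such a constrained projection can be wired up, the PyTorch sketch below uses `torch.nn.utils.parametrizations.orthogonal` and, as a stand-in for the task-specific normalisation, a simple per-dimension standardisation of the teacher features; the dimensions and the mean-squared-error matching loss are assumptions, and the authors' released code at the link above should be treated as the reference implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.nn.utils.parametrizations import orthogonal

class OrthogonalFeatureDistillation(nn.Module):
    """Minimal sketch: project student features through an orthogonally
    constrained linear map and match (normalised) teacher features.
    All hyper-parameters are illustrative only."""

    def __init__(self, student_dim: int = 384, teacher_dim: int = 768):
        super().__init__()
        # The orthogonal parametrisation keeps the projection norm-preserving.
        self.proj = orthogonal(nn.Linear(student_dim, teacher_dim, bias=False))

    def forward(self, f_student: torch.Tensor, f_teacher: torch.Tensor) -> torch.Tensor:
        # Stand-in for the paper's task-specific normalisation:
        # standardise teacher features per dimension over the batch.
        f_teacher = (f_teacher - f_teacher.mean(0)) / (f_teacher.std(0) + 1e-6)
        return F.mse_loss(self.proj(f_student), f_teacher)

# Dummy usage with a batch of 8 feature vectors.
loss_fn = OrthogonalFeatureDistillation()
loss = loss_fn(torch.randn(8, 384), torch.randn(8, 768))
```

Constraining the projection to be orthogonal keeps it norm-preserving, so the projector cannot trivially stretch or discard feature structure during distillation.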
Related papers
- Distill-then-prune: An Efficient Compression Framework for Real-time Stereo Matching Network on Edge Devices [5.696239274365031]
We propose a novel strategy by incorporating knowledge distillation and model pruning to overcome the inherent trade-off between speed and accuracy.
We obtained a model that maintains real-time performance while delivering high accuracy on edge devices.
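The summary gives no implementation details, but the two stages it names, knowledge distillation followed by pruning, can be sketched with standard PyTorch utilities; the soft-target distillation loss and the L1 magnitude-pruning ratio below are illustrative assumptions rather than the paper's actual pipeline.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.nn.utils.prune as prune

def kd_loss(student_logits, teacher_logits, T: float = 4.0):
    # Standard soft-target distillation loss (Hinton et al.);
    # the stereo-matching paper may use a different objective.
    return F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * T * T

def prune_student(student: nn.Module, amount: float = 0.3) -> nn.Module:
    # Illustrative second stage: L1 magnitude pruning of every conv/linear layer.
    for module in student.modules():
        if isinstance(module, (nn.Conv2d, nn.Linear)):
            prune.l1_unstructured(module, name="weight", amount=amount)
            prune.remove(module, "weight")  # make the sparsity permanent
    return student

# Dummy usage: distil into a small student, then prune it.
student = nn.Sequential(nn.Linear(32, 16), nn.ReLU(), nn.Linear(16, 10))
loss = kd_loss(student(torch.randn(4, 32)), torch.randn(4, 10))
student = prune_student(student)
```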
arXiv Detail & Related papers (2024-05-20T06:03:55Z)
- BOOT: Data-free Distillation of Denoising Diffusion Models with Bootstrapping [64.54271680071373]
Diffusion models have demonstrated excellent potential for generating diverse images.
Knowledge distillation has been recently proposed as a remedy that can reduce the number of inference steps to one or a few.
We present a novel technique called BOOT that overcomes the limitations of prior distillation approaches with an efficient data-free distillation algorithm.
arXiv Detail & Related papers (2023-06-08T20:30:55Z)
- Generalizing Dataset Distillation via Deep Generative Prior [75.9031209877651]
We propose to distill an entire dataset's knowledge into a few synthetic images.
The idea is to synthesize a small number of synthetic data points that, when given to a learning algorithm as training data, result in a model approximating one trained on the original data.
We present a new optimization algorithm that distills a large number of images into a few intermediate feature vectors in the generative model's latent space.
arXiv Detail & Related papers (2023-05-02T17:59:31Z)
- Distilling from Similar Tasks for Transfer Learning on a Budget [38.998980344852846]
Transfer learning is an effective solution for training with few labels, but often at the expense of computationally costly fine-tuning of large base models.
We propose to mitigate this unpleasant trade-off between compute and accuracy via semi-supervised cross-domain distillation.
Our methods need no access to the source data, requiring only features and pseudo-labels from the source models.
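Given that only source-model features and pseudo-labels are required, a hedged sketch of the resulting training signal might look as follows; the feature-matching term, its weighting, and the assumption that student and source features share a dimension are illustrative choices, not details from the paper.

```python
import torch
import torch.nn.functional as F

def budget_distill_loss(student_feats, student_logits,
                        source_feats, pseudo_labels,
                        feat_weight: float = 1.0):
    """Sketch: train the student from cached source-model features and
    pseudo-labels only, with no access to the source data itself."""
    loss_feat = F.mse_loss(student_feats, source_feats)           # match cached features
    loss_pseudo = F.cross_entropy(student_logits, pseudo_labels)  # fit pseudo-labels
    return loss_pseudo + feat_weight * loss_feat

# Dummy usage: batch of 16, 512-d features, 10 classes.
loss = budget_distill_loss(torch.randn(16, 512), torch.randn(16, 10),
                           torch.randn(16, 512), torch.randint(0, 10, (16,)))
```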
arXiv Detail & Related papers (2023-04-24T17:59:01Z)
- Weighted Ensemble Self-Supervised Learning [67.24482854208783]
Ensembling has proven to be a powerful technique for boosting model performance.
We develop a framework that permits data-dependent weighted cross-entropy losses.
Our method outperforms both baselines on multiple evaluation metrics on ImageNet-1K.
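A data-dependent weighted cross-entropy loss of the kind mentioned above can be written in a few lines; how the per-sample weights are produced (here they are simply passed in, e.g. from an auxiliary network) is an assumption for illustration.

```python
import torch
import torch.nn.functional as F

def weighted_cross_entropy(logits, targets, sample_weights):
    """Per-sample weighted cross-entropy; the weights are data-dependent,
    e.g. produced by a small auxiliary network (an assumption here)."""
    per_sample = F.cross_entropy(logits, targets, reduction="none")
    weights = sample_weights / (sample_weights.sum() + 1e-8)  # normalise to sum to 1
    return (weights * per_sample).sum()

# Dummy usage: batch of 4, 10 classes.
loss = weighted_cross_entropy(torch.randn(4, 10),
                              torch.randint(0, 10, (4,)),
                              torch.rand(4))
```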
arXiv Detail & Related papers (2022-11-18T02:00:17Z)
- It's All in the Head: Representation Knowledge Distillation through Classifier Sharing [0.29360071145551075]
We introduce two approaches for enhancing representation distillation using classifier sharing between the teacher and student.
We show the effectiveness of the proposed methods on various datasets and tasks, including image classification, fine-grained classification, and face verification.
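One plausible reading of classifier sharing is to reuse the teacher's frozen classification head on top of the student's features and distil the resulting logits against the teacher's own; the sketch below follows that reading and should not be taken as the paper's exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def classifier_sharing_loss(student_feats, teacher_feats,
                            teacher_head: nn.Linear, T: float = 2.0):
    """Sketch: push student features through the teacher's (frozen)
    classifier head and distil against the teacher's own logits."""
    with torch.no_grad():
        teacher_logits = teacher_head(teacher_feats)
    student_logits = teacher_head(student_feats)  # shared head, assumed frozen
    return F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                    F.softmax(teacher_logits / T, dim=-1),
                    reduction="batchmean") * T * T

# Dummy usage: 768-d features, 1000 classes, frozen shared head.
head = nn.Linear(768, 1000)
for p in head.parameters():
    p.requires_grad_(False)
loss = classifier_sharing_loss(torch.randn(8, 768), torch.randn(8, 768), head)
```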
arXiv Detail & Related papers (2022-01-18T13:10:36Z)
- Powerpropagation: A sparsity inducing weight reparameterisation [65.85142037667065]
We introduce Powerpropagation, a new weight-parameterisation for neural networks that leads to inherently sparse models.
Models trained in this manner exhibit similar performance, but have a weight distribution with markedly higher density at zero, allowing more parameters to be pruned safely.
Here, we combine Powerpropagation with a traditional weight-pruning technique as well as recent state-of-the-art sparse-to-sparse algorithms, showing superior performance on the ImageNet benchmark.
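Powerpropagation's reparameterisation is commonly written as raising each weight elementwise to a power, so the effective weight is w * |w|^(alpha - 1); the sketch below assumes that form, with an illustrative alpha and layer size, and the original paper remains the authoritative reference.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PowerpropLinear(nn.Module):
    """Sketch of a Powerpropagation-style linear layer: the effective weight
    is w * |w|**(alpha - 1), which reduces to a plain linear layer at
    alpha = 1 and biases learning towards exact zeros for larger alpha.
    Shapes and the value of alpha are illustrative."""

    def __init__(self, in_features: int, out_features: int, alpha: float = 2.0):
        super().__init__()
        self.alpha = alpha
        self.weight = nn.Parameter(torch.empty(out_features, in_features))
        self.bias = nn.Parameter(torch.zeros(out_features))
        nn.init.kaiming_uniform_(self.weight, a=5 ** 0.5)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Elementwise power reparameterisation of the raw parameters.
        effective = self.weight * self.weight.abs().pow(self.alpha - 1.0)
        return F.linear(x, effective, self.bias)

# Dummy usage.
layer = PowerpropLinear(128, 64)
out = layer(torch.randn(4, 128))
```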
arXiv Detail & Related papers (2021-10-01T10:03:57Z)
- Knowledge Graph Embedding with Atrous Convolution and Residual Learning [4.582412257655891]
We propose a simple but effective atrous convolution based knowledge graph embedding method.
It effectively increases feature interactions by using atrous convolutions.
It also addresses the issue of forgetting original information and the vanishing/exploding gradient problem.
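The atrous (dilated) convolution itself is a standard operation; the sketch below shows where the dilation parameter enters in a ConvE-style scorer over stacked entity and relation embeddings, with all sizes and the overall architecture being illustrative assumptions rather than the paper's design.

```python
import torch
import torch.nn as nn

class AtrousKGEScorer(nn.Module):
    """Sketch: score (head, relation) pairs by stacking their embeddings into
    a 2-D map and applying a dilated ('atrous') convolution, which widens the
    receptive field and hence the feature interactions. Sizes are illustrative."""

    def __init__(self, num_entities: int, num_relations: int, dim: int = 200):
        super().__init__()
        self.ent = nn.Embedding(num_entities, dim)
        self.rel = nn.Embedding(num_relations, dim)
        # dilation=2 is the 'atrous' part; padding keeps the spatial size.
        self.conv = nn.Conv2d(1, 32, kernel_size=3, dilation=2, padding=2)
        self.fc = nn.Linear(32 * 2 * dim, dim)

    def forward(self, heads: torch.Tensor, relations: torch.Tensor) -> torch.Tensor:
        # Stack embeddings into a (batch, 1, 2, dim) image-like tensor.
        x = torch.stack([self.ent(heads), self.rel(relations)], dim=1).unsqueeze(1)
        x = torch.relu(self.conv(x)).flatten(1)
        x = self.fc(x)                       # predicted tail embedding
        return x @ self.ent.weight.t()       # scores against all entities

# Dummy usage: score two (head, relation) pairs against 1000 entities.
scores = AtrousKGEScorer(1000, 50)(torch.tensor([1, 2]), torch.tensor([3, 4]))
```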
arXiv Detail & Related papers (2020-10-23T00:57:23Z)
- On the Orthogonality of Knowledge Distillation with Other Techniques: From an Ensemble Perspective [34.494730096460636]
We show that knowledge distillation is a powerful apparatus for the practical deployment of efficient neural networks.
We also introduce ways to integrate knowledge distillation with other methods effectively.
arXiv Detail & Related papers (2020-09-09T06:14:59Z)
- FeatMatch: Feature-Based Augmentation for Semi-Supervised Learning [64.32306537419498]
We propose a novel learned feature-based refinement and augmentation method that produces a varied set of complex transformations.
These transformations also use information from both within-class and across-class representations that we extract through clustering.
We demonstrate that our method is comparable to the current state of the art on smaller datasets while being able to scale up to larger datasets.
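A hedged reading of the feature-based refinement: cluster the feature space into prototypes and refine each feature by soft-attending over them, so the augmentation draws on both within-class and across-class information; the attention form, mixing coefficient, and prototype count below are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def refine_with_prototypes(features: torch.Tensor, prototypes: torch.Tensor,
                           mix: float = 0.5) -> torch.Tensor:
    """Sketch: refine/augment features by soft-attending over cluster
    prototypes (e.g. obtained with k-means over a feature memory bank);
    the mixing coefficient and attention form are illustrative assumptions."""
    # Cosine-similarity attention over the prototypes.
    attn = F.softmax(F.normalize(features, dim=1) @ F.normalize(prototypes, dim=1).t(), dim=1)
    refined = attn @ prototypes                  # prototype-based reconstruction
    return (1 - mix) * features + mix * refined  # blend with the original feature

# Dummy usage: 16 features of dim 128, 10 prototypes.
augmented = refine_with_prototypes(torch.randn(16, 128), torch.randn(10, 128))
```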
arXiv Detail & Related papers (2020-07-16T17:55:31Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences.