Distilling Inductive Bias: Knowledge Distillation Beyond Model
Compression
- URL: http://arxiv.org/abs/2310.00369v2
- Date: Tue, 10 Oct 2023 09:12:37 GMT
- Title: Distilling Inductive Bias: Knowledge Distillation Beyond Model
Compression
- Authors: Gousia Habib, Tausifa Jan Saleem, Brejesh Lall
- Abstract summary: Vision Transformers (ViTs) offer the tantalizing prospect of unified information processing across visual and textual domains.
We introduce an innovative ensemble-based distillation approach that distills inductive bias from complementary lightweight teacher models.
Our proposed framework also precomputes and stores the teachers' logits, i.e., the unnormalized predictions of the models, in advance.
- Score: 6.508088032296086
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: With the rapid development of computer vision, Vision Transformers (ViTs)
offer the tantalizing prospect of unified information processing across visual
and textual domains. However, due to the lack of inherent inductive biases in
ViTs, they require an enormous amount of data for training. To make their
applications practical, we introduce an innovative ensemble-based distillation
approach that distills inductive bias from complementary lightweight teacher
models. Prior systems relied solely on convolution-based teachers. In contrast,
our method incorporates an ensemble of lightweight teachers with different
architectural tendencies, such as convolution and involution, to instruct the
student transformer jointly. Because of these distinct inductive biases, the
teachers accumulate a wide range of knowledge, even from readily identifiable
stored datasets, which leads to enhanced student performance. Our proposed
framework also precomputes and stores the teachers' logits, i.e., the
unnormalized predictions of the models, in advance. This optimization
accelerates the distillation process by eliminating the need for repeated
teacher forward passes during knowledge distillation, significantly reducing
the computational burden and enhancing efficiency.
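To make this concrete, below is a minimal PyTorch-style sketch of how precomputed logits from a convolution-based teacher and an involution-based teacher might be combined to supervise a ViT student. The temperature, loss weighting, equal-weight teacher average, and the logit-cache layout are illustrative assumptions, not the paper's exact implementation.

```python
# Minimal sketch: ensemble knowledge distillation for a ViT student using
# precomputed teacher logits (convolution- and involution-based teachers).
# Hyperparameters (temperature, alpha) and the equal-weight teacher average
# are illustrative assumptions, not the paper's exact settings.
import torch
import torch.nn.functional as F

def ensemble_distillation_loss(student_logits, conv_logits, inv_logits,
                               labels, temperature=4.0, alpha=0.5):
    """Cross-entropy on hard labels + KL divergence to the averaged
    soft targets of the two teachers."""
    # Soften each teacher's predictions and average the two distributions.
    conv_probs = F.softmax(conv_logits / temperature, dim=1)
    inv_probs = F.softmax(inv_logits / temperature, dim=1)
    teacher_probs = 0.5 * (conv_probs + inv_probs)

    # KL divergence on temperature-scaled student log-probabilities.
    student_log_probs = F.log_softmax(student_logits / temperature, dim=1)
    kd_loss = F.kl_div(student_log_probs, teacher_probs,
                       reduction="batchmean") * temperature ** 2

    ce_loss = F.cross_entropy(student_logits, labels)
    return alpha * kd_loss + (1.0 - alpha) * ce_loss

# Training step with teacher logits precomputed offline and loaded from a
# cache, so no teacher forward pass is needed during distillation.
def train_step(student, optimizer, images, labels, sample_ids, logit_cache):
    conv_logits = logit_cache["conv"][sample_ids]        # precomputed tensor
    inv_logits = logit_cache["involution"][sample_ids]   # precomputed tensor
    student_logits = student(images)
    loss = ensemble_distillation_loss(student_logits, conv_logits,
                                      inv_logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

In practice, the logit cache would be built once by running each teacher over the training set and saving its outputs indexed by sample id, so the teachers never need to be loaded during student training.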
Related papers
- Promoting CNNs with Cross-Architecture Knowledge Distillation for Efficient Monocular Depth Estimation [4.242540533823568]
Transformer models are usually computationally expensive, and their effectiveness in lightweight models is limited compared to convolutions.
We propose a cross-architecture knowledge distillation method for MDE, dubbed DisDepth, to enhance efficient CNN models with the supervision of state-of-the-art transformer models.
Our method achieves significant improvements on various efficient backbones, showcasing its potential for efficient monocular depth estimation.
arXiv Detail & Related papers (2024-04-25T07:55:47Z) - BOOT: Data-free Distillation of Denoising Diffusion Models with
Bootstrapping [64.54271680071373]
Diffusion models have demonstrated excellent potential for generating diverse images.
Knowledge distillation has been recently proposed as a remedy that can reduce the number of inference steps to one or a few.
We present a novel technique called BOOT that overcomes these limitations with an efficient data-free distillation algorithm.
arXiv Detail & Related papers (2023-06-08T20:30:55Z) - Distillation from Heterogeneous Models for Top-K Recommendation [43.83625440616829]
HetComp is a framework that guides the student model by transferring sequences of knowledge from teachers' trajectories.
HetComp significantly improves the distillation quality and the generalization of the student model.
arXiv Detail & Related papers (2023-03-02T10:23:50Z) - HomoDistil: Homotopic Task-Agnostic Distillation of Pre-trained
Transformers [49.79405257763856]
This paper focuses on task-agnostic distillation.
It produces a compact pre-trained model that can be easily fine-tuned on various tasks with small computational costs and memory footprints.
We propose Homotopic Distillation (HomoDistil), a novel task-agnostic distillation approach equipped with iterative pruning.
arXiv Detail & Related papers (2023-02-19T17:37:24Z) - Distilling Knowledge from Self-Supervised Teacher by Embedding Graph
Alignment [52.704331909850026]
We formulate a new knowledge distillation framework to transfer the knowledge from self-supervised pre-trained models to any other student network.
Inspired by the spirit of instance discrimination in self-supervised learning, we model the instance-instance relations by a graph formulation in the feature embedding space.
Our distillation scheme can be flexibly applied to transfer the self-supervised knowledge to enhance representation learning on various student networks; a minimal relation-graph sketch appears after this list.
arXiv Detail & Related papers (2022-11-23T19:27:48Z) - Efficient Vision Transformers via Fine-Grained Manifold Distillation [96.50513363752836]
Vision transformer architectures have shown extraordinary performance on many computer vision tasks.
Although network performance is boosted, transformers often require more computational resources.
We propose to excavate useful information from the teacher transformer through the relationship between images and the divided patches.
arXiv Detail & Related papers (2021-07-03T08:28:34Z) - Co-advise: Cross Inductive Bias Distillation [39.61426495884721]
We propose a novel distillation-based method to train vision transformers.
We introduce lightweight teachers with different architectural inductive biases to co-advise the student transformer.
Our vision transformers (termed CivT) outperform all previous transformers of the same architecture on ImageNet.
arXiv Detail & Related papers (2021-06-23T13:19:59Z) - Learning by Distillation: A Self-Supervised Learning Framework for
Optical Flow Estimation [71.76008290101214]
DistillFlow is a knowledge distillation approach to learning optical flow.
It achieves state-of-the-art unsupervised learning performance on both KITTI and Sintel datasets.
Our models ranked 1st among all monocular methods on the KITTI 2015 benchmark, and outperform all published methods on the Sintel Final benchmark.
arXiv Detail & Related papers (2021-06-08T09:13:34Z) - Generative Adversarial Simulator [2.3986080077861787]
We introduce a simulator-free approach to knowledge distillation in the context of reinforcement learning.
A key challenge is having the student learn the multiplicity of cases that correspond to a given action.
This is the first demonstration of simulator-free knowledge distillation between a teacher and a student policy.
arXiv Detail & Related papers (2020-11-23T15:31:12Z) - Knowledge Distillation Meets Self-Supervision [109.6400639148393]
Knowledge distillation involves extracting "dark knowledge" from a teacher network to guide the learning of a student network.
We show that the seemingly different self-supervision task can serve as a simple yet powerful solution.
By exploiting the similarity between those self-supervision signals as an auxiliary task, one can effectively transfer the hidden information from the teacher to the student.
arXiv Detail & Related papers (2020-06-12T12:18:52Z)
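As a companion illustration for the "Embedding Graph Alignment" entry above, here is a minimal sketch of instance-relation distillation: pairwise cosine-similarity graphs are built over a batch of teacher and student embeddings and then aligned. The cosine-similarity graph and the MSE alignment objective are assumptions for illustration, not that paper's exact formulation.

```python
# Minimal sketch of relation-graph distillation: build instance-instance
# similarity graphs in the feature embedding space of the teacher and the
# student, then align the student's graph with the teacher's. The cosine
# graph and the MSE alignment loss are illustrative assumptions.
import torch
import torch.nn.functional as F

def similarity_graph(embeddings):
    """Pairwise cosine-similarity matrix over a batch of embeddings."""
    normed = F.normalize(embeddings, dim=1)
    return normed @ normed.t()

def graph_alignment_loss(student_embeddings, teacher_embeddings):
    """Align the student's instance-relation graph with the teacher's."""
    g_student = similarity_graph(student_embeddings)
    g_teacher = similarity_graph(teacher_embeddings).detach()
    return F.mse_loss(g_student, g_teacher)
```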