Extracurricular Learning: Knowledge Transfer Beyond Empirical
Distribution
- URL: http://arxiv.org/abs/2007.00051v2
- Date: Fri, 20 Nov 2020 19:11:09 GMT
- Title: Extracurricular Learning: Knowledge Transfer Beyond Empirical
Distribution
- Authors: Hadi Pouransari, Mojan Javaheripi, Vinay Sharma, Oncel Tuzel
- Abstract summary: We propose extracurricular learning to bridge the gap between a compressed student model and its teacher.
We conduct rigorous evaluations on regression and classification tasks and show that compared to the standard knowledge distillation, extracurricular learning reduces the gap by 46% to 68%.
This leads to major accuracy improvements compared to the empirical risk minimization-based training for various recent neural network architectures.
- Score: 17.996541285382463
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Knowledge distillation has been used to transfer knowledge learned by a
sophisticated model (teacher) to a simpler model (student). This technique is
widely used to compress model complexity. However, in most applications the
compressed student model suffers from an accuracy gap with its teacher. We
propose extracurricular learning, a novel knowledge distillation method, that
bridges this gap by (1) modeling student and teacher output distributions; (2)
sampling examples from an approximation to the underlying data distribution;
and (3) matching student and teacher output distributions over this extended
set including uncertain samples. We conduct rigorous evaluations on regression
and classification tasks and show that compared to the standard knowledge
distillation, extracurricular learning reduces the gap by 46% to 68%. This
leads to major accuracy improvements compared to the empirical risk
minimization-based training for various recent neural network architectures:
16% regression error reduction on the MPIIGaze dataset, +3.4% to +9.1%
improvement in top-1 classification accuracy on the CIFAR100 dataset, and +2.9%
top-1 improvement on the ImageNet dataset.
Related papers
- Faithful Label-free Knowledge Distillation [8.572967695281054]
This paper presents a label-free knowledge distillation approach called Teacher in the Middle (TinTeM)
It produces a more faithful student, which better replicates the behavior of the teacher network across a range of benchmarks testing model robustness, generalisability and out-of-distribution detection.
arXiv Detail & Related papers (2024-11-22T01:48:44Z) - Distilling Calibrated Student from an Uncalibrated Teacher [8.101116303448586]
We study how to obtain a student from an uncalibrated teacher.
Our approach relies on the fusion of data-augmentation techniques, including but not limited to cutout, mixup, and CutMix.
We extend our approach beyond traditional knowledge distillation and find it suitable as well.
arXiv Detail & Related papers (2023-02-22T16:18:38Z) - HomoDistil: Homotopic Task-Agnostic Distillation of Pre-trained
Transformers [49.79405257763856]
This paper focuses on task-agnostic distillation.
It produces a compact pre-trained model that can be easily fine-tuned on various tasks with small computational costs and memory footprints.
We propose Homotopic Distillation (HomoDistil), a novel task-agnostic distillation approach equipped with iterative pruning.
arXiv Detail & Related papers (2023-02-19T17:37:24Z) - EmbedDistill: A Geometric Knowledge Distillation for Information
Retrieval [83.79667141681418]
Large neural models (such as Transformers) achieve state-of-the-art performance for information retrieval (IR)
We propose a novel distillation approach that leverages the relative geometry among queries and documents learned by the large teacher model.
We show that our approach successfully distills from both dual-encoder (DE) and cross-encoder (CE) teacher models to 1/10th size asymmetric students that can retain 95-97% of the teacher performance.
arXiv Detail & Related papers (2023-01-27T22:04:37Z) - ProBoost: a Boosting Method for Probabilistic Classifiers [55.970609838687864]
ProBoost is a new boosting algorithm for probabilistic classifiers.
It uses the uncertainty of each training sample to determine the most challenging/uncertain ones.
It produces a sequence that progressively focuses on the samples found to have the highest uncertainty.
arXiv Detail & Related papers (2022-09-04T12:49:20Z) - Knowledge Distillation as Semiparametric Inference [44.572422527672416]
A popular approach to model compression is to train an inexpensive student model to mimic the class probabilities of a highly accurate but cumbersome teacher model.
This two-step knowledge distillation process often leads to higher accuracy than training the student directly on labeled data.
We cast knowledge distillation as a semiparametric inference problem with the optimal student model as the target, the unknown Bayes class probabilities as nuisance, and the teacher probabilities as a plug-in nuisance estimate.
arXiv Detail & Related papers (2021-04-20T03:00:45Z) - Beyond Self-Supervision: A Simple Yet Effective Network Distillation
Alternative to Improve Backbones [40.33419553042038]
We propose to improve existing baseline networks via knowledge distillation from off-the-shelf pre-trained big powerful models.
Our solution performs distillation by only driving prediction of the student model consistent with that of the teacher model.
We empirically find that such simple distillation settings perform extremely effective, for example, the top-1 accuracy on ImageNet-1k validation set of MobileNetV3-large and ResNet50-D can be significantly improved.
arXiv Detail & Related papers (2021-03-10T09:32:44Z) - Learning to Augment for Data-Scarce Domain BERT Knowledge Distillation [55.34995029082051]
We propose a method to learn to augment for data-scarce domain BERT knowledge distillation.
We show that the proposed method significantly outperforms state-of-the-art baselines on four different tasks.
arXiv Detail & Related papers (2021-01-20T13:07:39Z) - Online Ensemble Model Compression using Knowledge Distillation [51.59021417947258]
This paper presents a knowledge distillation based model compression framework consisting of a student ensemble.
It enables distillation of simultaneously learnt ensemble knowledge onto each of the compressed student models.
We provide comprehensive experiments using state-of-the-art classification models to validate our framework's effectiveness.
arXiv Detail & Related papers (2020-11-15T04:46:29Z) - Distilling Object Detectors with Task Adaptive Regularization [97.52935611385179]
Current state-of-the-art object detectors are at the expense of high computational costs and are hard to deploy to low-end devices.
Knowledge distillation, which aims at training a smaller student network by transferring knowledge from a larger teacher model, is one of the promising solutions for model miniaturization.
arXiv Detail & Related papers (2020-06-23T15:58:22Z) - An Efficient Method of Training Small Models for Regression Problems
with Knowledge Distillation [1.433758865948252]
We propose a new formalism of knowledge distillation for regression problems.
First, we propose a new loss function, teacher outlier loss rejection, which rejects outliers in training samples using teacher model predictions.
By considering the multi-task network, training of the feature extraction of student models becomes more effective.
arXiv Detail & Related papers (2020-02-28T08:46:12Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.