Extracurricular Learning: Knowledge Transfer Beyond Empirical Distribution
- URL: http://arxiv.org/abs/2007.00051v2
- Date: Fri, 20 Nov 2020 19:11:09 GMT
- Title: Extracurricular Learning: Knowledge Transfer Beyond Empirical Distribution
- Authors: Hadi Pouransari, Mojan Javaheripi, Vinay Sharma, Oncel Tuzel
- Abstract summary: We propose extracurricular learning to bridge the gap between a compressed student model and its teacher.
We conduct rigorous evaluations on regression and classification tasks and show that compared to the standard knowledge distillation, extracurricular learning reduces the gap by 46% to 68%.
This leads to major accuracy improvements compared to the empirical risk minimization-based training for various recent neural network architectures.
- Score: 17.996541285382463
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Knowledge distillation has been used to transfer knowledge learned by a
sophisticated model (teacher) to a simpler model (student). This technique is
widely used to compress model complexity. However, in most applications the
compressed student model suffers from an accuracy gap with its teacher. We
propose extracurricular learning, a novel knowledge distillation method that
bridges this gap by (1) modeling student and teacher output distributions; (2)
sampling examples from an approximation to the underlying data distribution;
and (3) matching student and teacher output distributions over this extended
set including uncertain samples. We conduct rigorous evaluations on regression
and classification tasks and show that compared to the standard knowledge
distillation, extracurricular learning reduces the gap by 46% to 68%. This
leads to major accuracy improvements compared to the empirical risk
minimization-based training for various recent neural network architectures:
16% regression error reduction on the MPIIGaze dataset, +3.4% to +9.1%
improvement in top-1 classification accuracy on the CIFAR100 dataset, and +2.9%
top-1 improvement on the ImageNet dataset.
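The abstract lists three ingredients but no implementation details, so the following is only a minimal sketch for the regression case under assumptions that are not taken from the paper: Gaussian output heads for both networks, mixup-style convex combinations standing in for "sampling from an approximation to the underlying data distribution", and a closed-form Gaussian KL as the distribution-matching term. Function names and weights are illustrative.
```python
# Rough sketch only; the Gaussian heads, mixup-style sampling, and loss weights
# below are assumptions, not the paper's actual recipe.
import torch
import torch.nn.functional as F

def gaussian_kl(mu_t, log_var_t, mu_s, log_var_s):
    """KL( N(mu_t, var_t) || N(mu_s, var_s) ), averaged over the batch."""
    var_t, var_s = log_var_t.exp(), log_var_s.exp()
    kl = 0.5 * (log_var_s - log_var_t + (var_t + (mu_t - mu_s) ** 2) / var_s - 1.0)
    return kl.mean()

def extracurricular_step(student, teacher, x, y, alpha=0.4, lam=1.0):
    """student/teacher map x -> (mean, log_variance), i.e. step (1): output distributions."""
    mu_s, lv_s = student(x)
    with torch.no_grad():
        mu_t, lv_t = teacher(x)
    # empirical part: fit the labels and match the teacher on real samples
    loss = F.mse_loss(mu_s, y) + gaussian_kl(mu_t, lv_t, mu_s, lv_s)

    # step (2): draw extra samples near the data manifold (mixup here is an assumption)
    w = torch.distributions.Beta(alpha, alpha).sample((x.size(0),)).to(x.device)
    w = w.view(-1, *([1] * (x.dim() - 1)))          # broadcast over feature dims
    x_ec = w * x + (1.0 - w) * x[torch.randperm(x.size(0), device=x.device)]

    # step (3): match student and teacher output distributions on the extended set
    mu_s_ec, lv_s_ec = student(x_ec)
    with torch.no_grad():
        mu_t_ec, lv_t_ec = teacher(x_ec)
    return loss + lam * gaussian_kl(mu_t_ec, lv_t_ec, mu_s_ec, lv_s_ec)
```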
Related papers
- Distilling Calibrated Student from an Uncalibrated Teacher [8.101116303448586]
We study how to obtain a calibrated student from an uncalibrated teacher.
Our approach relies on the fusion of data-augmentation techniques, including but not limited to cutout, mixup, and CutMix.
We extend our approach beyond traditional knowledge distillation and find it suitable in those settings as well.
arXiv Detail & Related papers (2023-02-22T16:18:38Z)
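The summary names cutout, mixup, and CutMix but not how they are fused with distillation; the sketch below combines plain mixup with a standard temperature-scaled distillation loss as one plausible reading. The temperature, loss weights, and the choice of feeding the teacher the mixed input are assumptions.
```python
# Hedged sketch: mixup-style augmentation combined with standard knowledge
# distillation; the exact fusion used in the paper is not specified here.
import torch
import torch.nn.functional as F

def mixup_kd_loss(student, teacher, x, y, n_classes, T=4.0, alpha=0.2, beta=0.5):
    # mix pairs of inputs and (one-hot) labels
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    idx = torch.randperm(x.size(0), device=x.device)
    x_mix = lam * x + (1 - lam) * x[idx]
    y_onehot = F.one_hot(y, n_classes).float()
    y_mix = lam * y_onehot + (1 - lam) * y_onehot[idx]

    s_logits = student(x_mix)
    with torch.no_grad():
        t_logits = teacher(x_mix)                  # teacher sees the same mixed input

    ce = -(y_mix * F.log_softmax(s_logits, dim=1)).sum(dim=1).mean()
    kd = F.kl_div(F.log_softmax(s_logits / T, dim=1),
                  F.softmax(t_logits / T, dim=1),
                  reduction="batchmean") * T * T
    return beta * ce + (1 - beta) * kd
```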
- HomoDistil: Homotopic Task-Agnostic Distillation of Pre-trained Transformers [49.79405257763856]
This paper focuses on task-agnostic distillation.
It produces a compact pre-trained model that can be easily fine-tuned on various tasks with small computational costs and memory footprints.
We propose Homotopic Distillation (HomoDistil), a novel task-agnostic distillation approach equipped with iterative pruning.
arXiv Detail & Related papers (2023-02-19T17:37:24Z)
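The summary only states that distillation is paired with iterative pruning; the sketch below uses plain weight-magnitude pruning on linear layers and a linear sparsity schedule as stand-ins for HomoDistil's actual importance scores and schedule, which are not described here.
```python
# Hedged sketch: distillation interleaved with iterative magnitude pruning.
import torch
import torch.nn.functional as F

def prune_smallest(model, sparsity):
    """Zero out the smallest-magnitude weights of every Linear layer."""
    with torch.no_grad():
        for m in model.modules():
            if isinstance(m, torch.nn.Linear):
                k = int(m.weight.numel() * sparsity)
                if k == 0:
                    continue
                thresh = m.weight.abs().flatten().kthvalue(k).values
                m.weight.mul_((m.weight.abs() > thresh).float())

def iterative_prune_and_distill(student, teacher, loader, optimizer,
                                total_steps, final_sparsity=0.5, T=2.0):
    step = 0
    for x, _ in loader:
        s_logits = student(x)
        with torch.no_grad():
            t_logits = teacher(x)
        loss = F.kl_div(F.log_softmax(s_logits / T, dim=1),
                        F.softmax(t_logits / T, dim=1),
                        reduction="batchmean") * T * T
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        # gradually raise sparsity so the student stays close to the teacher
        prune_smallest(student, final_sparsity * min(1.0, step / total_steps))
        step += 1
        if step >= total_steps:
            break
```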
- EmbedDistill: A Geometric Knowledge Distillation for Information Retrieval [83.79667141681418]
Large neural models (such as Transformers) achieve state-of-the-art performance for information retrieval (IR).
We propose a novel distillation approach that leverages the relative geometry among queries and documents learned by the large teacher model.
We show that our approach successfully distills from both dual-encoder (DE) and cross-encoder (CE) teacher models to asymmetric students 1/10th the size that retain 95-97% of the teacher's performance.
arXiv Detail & Related papers (2023-01-27T22:04:37Z)
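The summary describes transferring the teacher's query-document geometry into a much smaller dual-encoder; the sketch below captures one plausible form of this by matching the student's in-batch relevance distribution to teacher scores. It is a simplification, not the paper's actual objective, and the encoder/score shapes are assumptions.
```python
# Hedged sketch: distill the teacher's query-document geometry into a small
# dual-encoder by matching softened in-batch relevance distributions.
import torch
import torch.nn.functional as F

def geometry_distill_loss(stu_q_enc, stu_d_enc, teacher_scores, queries, docs, T=1.0):
    """teacher_scores: [B, B] relevance scores for every (query, in-batch doc) pair,
    produced by a dual- or cross-encoder teacher."""
    q = F.normalize(stu_q_enc(queries), dim=-1)    # [B, d]
    d = F.normalize(stu_d_enc(docs), dim=-1)       # [B, d]
    student_scores = q @ d.t()                     # student's notion of the geometry

    return F.kl_div(F.log_softmax(student_scores / T, dim=1),
                    F.softmax(teacher_scores / T, dim=1),
                    reduction="batchmean")
```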
- ProBoost: a Boosting Method for Probabilistic Classifiers [55.970609838687864]
ProBoost is a new boosting algorithm for probabilistic classifiers.
It uses the uncertainty of each training sample to identify the most challenging ones.
It produces a sequence of classifiers that progressively focuses on the samples found to have the highest uncertainty.
arXiv Detail & Related papers (2022-09-04T12:49:20Z)
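The summary gives the idea (uncertainty picks the hardest samples each round) but not the exact weighting or resampling rule; the sketch below re-weights samples by the predictive entropy of the running ensemble, which is an assumption rather than ProBoost's actual scheme.
```python
# Hedged sketch of uncertainty-driven boosting with entropy-based re-weighting.
import numpy as np

def uncertainty_boosting(make_classifier, X, y, n_rounds=5, eps=1e-12):
    ensemble = []
    weights = np.ones(len(X)) / len(X)
    for _ in range(n_rounds):
        clf = make_classifier()                 # any probabilistic classifier with
        clf.fit(X, y, sample_weight=weights)    # sklearn-style fit / predict_proba
        ensemble.append(clf)

        # average predicted probabilities of the ensemble so far
        proba = np.mean([c.predict_proba(X) for c in ensemble], axis=0)
        entropy = -np.sum(proba * np.log(proba + eps), axis=1)

        # focus the next round on the most uncertain samples
        weights = entropy / (entropy.sum() + eps)
    return ensemble
```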
- Remember the Past: Distilling Datasets into Addressable Memories for Neural Networks [27.389093857615876]
We propose an algorithm that compresses the critical information of a large dataset into compact addressable memories.
These memories can then be recalled to quickly re-train a neural network and recover the performance.
We demonstrate state-of-the-art results on the dataset distillation task across five benchmarks.
arXiv Detail & Related papers (2022-06-06T21:32:26Z)
- Knowledge Distillation as Semiparametric Inference [44.572422527672416]
A popular approach to model compression is to train an inexpensive student model to mimic the class probabilities of a highly accurate but cumbersome teacher model.
This two-step knowledge distillation process often leads to higher accuracy than training the student directly on labeled data.
We cast knowledge distillation as a semiparametric inference problem with the optimal student model as the target, the unknown Bayes class probabilities as nuisance, and the teacher probabilities as a plug-in nuisance estimate.
arXiv Detail & Related papers (2021-04-20T03:00:45Z)
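In the plug-in reading described above, the distillation objective can be written compactly as follows; the notation is illustrative and not taken from the paper.
```latex
% p^*(y \mid x): unknown Bayes class probabilities (the nuisance parameter)
% \hat{p}_T(y \mid x): teacher probabilities, used as a plug-in estimate of p^*
% f_\theta(y \mid x): student model being fit
\hat{\theta} \;=\; \arg\min_{\theta}\;
\frac{1}{n}\sum_{i=1}^{n}\Big(-\sum_{y}\hat{p}_T(y \mid x_i)\,\log f_{\theta}(y \mid x_i)\Big)
```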
- Beyond Self-Supervision: A Simple Yet Effective Network Distillation Alternative to Improve Backbones [40.33419553042038]
We propose to improve existing baseline networks via knowledge distillation from large, powerful off-the-shelf pre-trained models.
Our solution performs distillation only by driving the student model's predictions to be consistent with those of the teacher model.
We empirically find that such a simple distillation setting is extremely effective; for example, the top-1 accuracy of MobileNetV3-large and ResNet50-D on the ImageNet-1k validation set can be significantly improved.
arXiv Detail & Related papers (2021-03-10T09:32:44Z)
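As a rough sketch of the recipe described above, the student below is trained purely to match the teacher's soft predictions, with no ground-truth labels; the temperature and the use of a KL objective are assumptions.
```python
# Hedged sketch: "prediction consistency only" distillation, no label loss.
import torch
import torch.nn.functional as F

def consistency_only_kd(student, teacher, x, T=1.0):
    with torch.no_grad():
        t_prob = F.softmax(teacher(x) / T, dim=1)
    s_logp = F.log_softmax(student(x) / T, dim=1)
    return F.kl_div(s_logp, t_prob, reduction="batchmean") * T * T
```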
- Learning to Augment for Data-Scarce Domain BERT Knowledge Distillation [55.34995029082051]
We propose a method to learn to augment for data-scarce domain BERT knowledge distillation.
We show that the proposed method significantly outperforms state-of-the-art baselines on four different tasks.
arXiv Detail & Related papers (2021-01-20T13:07:39Z)
- Online Ensemble Model Compression using Knowledge Distillation [51.59021417947258]
This paper presents a knowledge distillation based model compression framework consisting of a student ensemble.
It enables distillation of simultaneously learnt ensemble knowledge onto each of the compressed student models.
We provide comprehensive experiments using state-of-the-art classification models to validate our framework's effectiveness.
arXiv Detail & Related papers (2020-11-15T04:46:29Z)
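A hedged sketch of the idea described above: several students are trained together on the labels, and each is additionally pulled toward the ensemble's averaged prediction, which plays the role of the teacher. The averaging, temperature, and loss weights are assumptions.
```python
# Hedged sketch: online distillation of jointly learnt ensemble knowledge
# onto each student in the ensemble.
import torch
import torch.nn.functional as F

def online_ensemble_kd_loss(students, x, y, T=3.0, alpha=0.5):
    logits = [s(x) for s in students]
    ens = torch.stack(logits).mean(dim=0).detach()    # ensemble "teacher" target
    loss = 0.0
    for z in logits:
        ce = F.cross_entropy(z, y)
        kd = F.kl_div(F.log_softmax(z / T, dim=1),
                      F.softmax(ens / T, dim=1),
                      reduction="batchmean") * T * T
        loss = loss + alpha * ce + (1 - alpha) * kd
    return loss / len(students)
```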
- Distilling Object Detectors with Task Adaptive Regularization [97.52935611385179]
Current state-of-the-art object detectors come at the expense of high computational costs and are hard to deploy to low-end devices.
Knowledge distillation, which aims at training a smaller student network by transferring knowledge from a larger teacher model, is one of the promising solutions for model miniaturization.
arXiv Detail & Related papers (2020-06-23T15:58:22Z)
- An Efficient Method of Training Small Models for Regression Problems with Knowledge Distillation [1.433758865948252]
We propose a new formalism of knowledge distillation for regression problems.
First, we propose a new loss function, teacher outlier loss rejection, which rejects outliers in training samples using teacher model predictions.
Second, by considering a multi-task network, training of the student model's feature extraction becomes more effective.
arXiv Detail & Related papers (2020-02-28T08:46:12Z)
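The summary above describes a loss that uses teacher predictions to reject outlier training samples; the sketch below drops samples whose labels deviate strongly from the teacher's prediction, while keeping a distillation term on all samples. The specific threshold rule and the MSE distillation term are assumptions.
```python
# Hedged sketch of a teacher-based outlier-rejection loss for regression KD.
import torch
import torch.nn.functional as F

def outlier_rejecting_regression_loss(student, teacher, x, y, k=2.0):
    with torch.no_grad():
        t_pred = teacher(x)
        # per-sample disagreement between the label and the teacher's prediction
        err = (t_pred - y).reshape(len(x), -1).abs().mean(dim=1)
        keep = err < (err.mean() + k * err.std())     # reject extreme disagreements

    s_pred = student(x)
    label_loss = F.mse_loss(s_pred[keep], y[keep]) if keep.any() else s_pred.sum() * 0.0
    distill_loss = F.mse_loss(s_pred, t_pred)         # teacher still supervises every sample
    return label_loss + distill_loss
```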
This list is automatically generated from the titles and abstracts of the papers in this site.