Confidence Conditioned Knowledge Distillation
- URL: http://arxiv.org/abs/2107.06993v1
- Date: Tue, 6 Jul 2021 00:33:25 GMT
- Title: Confidence Conditioned Knowledge Distillation
- Authors: Sourav Mishra and Suresh Sundaram
- Abstract summary: A confidence conditioned knowledge distillation (CCKD) scheme for transferring the knowledge from a teacher model to a student model is proposed.
CCKD addresses these issues by leveraging the confidence assigned by the teacher model to the correct class to devise sample-specific loss functions and targets.
Empirical evaluations on several benchmark datasets show that CCKD methods achieve generalization performance at least on par with other state-of-the-art methods.
- Score: 8.09591217280048
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: In this paper, a novel confidence conditioned knowledge distillation (CCKD)
scheme for transferring the knowledge from a teacher model to a student model
is proposed. Existing state-of-the-art methods employ fixed loss functions for
this purpose and ignore the different levels of information that need to be
transferred for different samples. In addition to that, these methods are also
inefficient in terms of data usage. CCKD addresses these issues by leveraging
the confidence assigned by the teacher model to the correct class to devise
sample-specific loss functions (CCKD-L formulation) and targets (CCKD-T
formulation). Further, CCKD improves the data efficiency by employing
self-regulation to stop those samples from participating in the distillation
process on which the student model learns faster. Empirical evaluations on
several benchmark datasets show that CCKD methods achieve generalization
performance at least on par with other state-of-the-art methods while being
more data efficient. Student models trained through CCKD methods do not
retain most of the misclassifications committed by the teacher model on the
training set. Distillation through CCKD methods improves the resilience of the
student models against adversarial attacks compared to the conventional KD
method. Experiments show at least a 3% increase in performance against
adversarial attacks for the MNIST and Fashion MNIST datasets, and at least
a 6% increase for the CIFAR10 dataset.
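As a rough illustration of the idea in the abstract, the following PyTorch sketch blends a hard-label loss and a soft-target loss per sample according to the teacher's confidence in the correct class, and drops samples the student already classifies confidently as a stand-in for self-regulation. The function name, weighting scheme, and threshold are assumptions for illustration, not the paper's CCKD-L/CCKD-T formulations.

```python
import torch
import torch.nn.functional as F

def cckd_like_loss(student_logits, teacher_logits, labels,
                   temperature=4.0, conf_threshold=0.9):
    """Illustrative confidence-conditioned KD loss (not the paper's exact form).

    The teacher's confidence in the true class decides, per sample, how much
    weight the soft-target term gets; samples the student already predicts
    correctly with high confidence are skipped (a stand-in for self-regulation).
    """
    # Teacher confidence assigned to the correct class, one value per sample.
    teacher_probs = F.softmax(teacher_logits, dim=1)
    conf = teacher_probs.gather(1, labels.unsqueeze(1)).squeeze(1)  # shape (N,)

    # Self-regulation stand-in: skip samples the student handles confidently.
    student_probs = F.softmax(student_logits, dim=1)
    student_conf, student_pred = student_probs.max(dim=1)
    active = ~((student_pred == labels) & (student_conf > conf_threshold))
    if not active.any():
        return student_logits.sum() * 0.0  # keep the graph, contribute zero loss

    s_logits, t_logits = student_logits[active], teacher_logits[active]
    y, w = labels[active], conf[active]

    # Hard-label cross-entropy and soft-target KL divergence, per sample.
    ce = F.cross_entropy(s_logits, y, reduction='none')
    kl = F.kl_div(F.log_softmax(s_logits / temperature, dim=1),
                  F.softmax(t_logits / temperature, dim=1),
                  reduction='none').sum(dim=1) * temperature ** 2

    # Confidence-conditioned blend: trust the teacher more where it is confident.
    return ((1.0 - w) * ce + w * kl).mean()

# Example usage on random logits for a 10-class problem.
s = torch.randn(8, 10, requires_grad=True)
t = torch.randn(8, 10)
y = torch.randint(0, 10, (8,))
cckd_like_loss(s, t, y).backward()
```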
Related papers
- Speculative Knowledge Distillation: Bridging the Teacher-Student Gap Through Interleaved Sampling [81.00825302340984]
We introduce Speculative Knowledge Distillation (SKD) to generate high-quality training data on-the-fly.
In SKD, the student proposes tokens, and the teacher replaces poorly ranked ones based on its own distribution.
We evaluate SKD on various text generation tasks, including translation, summarization, math, and instruction following.
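A toy sketch of this interleaving, under the assumption that "poorly ranked" means falling outside the teacher's top-k; the stand-in models and the ranking rule are illustrative, not SKD's actual procedure.

```python
import torch

@torch.no_grad()
def skd_style_generate(student_logits_fn, teacher_logits_fn, prefix,
                       steps=5, top_k=5):
    """Toy interleaved sampling in the spirit of SKD (not the paper's code).

    The student proposes the next token; if that token falls outside the
    teacher's top-k, it is replaced by a sample from the teacher's distribution.
    """
    tokens = list(prefix)
    for _ in range(steps):
        ctx = torch.tensor(tokens)
        s_probs = torch.softmax(student_logits_fn(ctx), dim=-1)
        t_probs = torch.softmax(teacher_logits_fn(ctx), dim=-1)

        proposal = torch.multinomial(s_probs, 1).item()      # student proposes
        teacher_rank = (t_probs > t_probs[proposal]).sum().item()
        if teacher_rank >= top_k:                             # poorly ranked token
            proposal = torch.multinomial(t_probs, 1).item()   # teacher replaces it
        tokens.append(proposal)
    return tokens

# Example with stand-in "models": random logits over a 20-token vocabulary.
vocab = 20
student = lambda ctx: torch.randn(vocab)
teacher = lambda ctx: torch.randn(vocab)
print(skd_style_generate(student, teacher, prefix=[1, 2, 3]))
```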
arXiv Detail & Related papers (2024-10-15T06:51:25Z) - Efficient and Robust Knowledge Distillation from A Stronger Teacher Based on Correlation Matching [0.09999629695552192]
Correlation Matching Knowledge Distillation (CMKD) method combines the Pearson and Spearman correlation coefficients-based KD loss to achieve more efficient and robust distillation from a stronger teacher model.
CMKD is simple yet practical, and extensive experiments demonstrate that it can consistently achieve state-of-the-art performance on CIFAR-100 and ImageNet.
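A minimal sketch of the correlation-matching idea, covering only the Pearson term (a Spearman term would additionally need a differentiable ranking approximation); the temperature and exact form are assumptions, not CMKD's loss.

```python
import torch
import torch.nn.functional as F

def pearson_kd_loss(student_logits, teacher_logits, temperature=4.0):
    """Correlation-matching KD sketch: maximize the per-sample Pearson
    correlation between student and teacher output distributions."""
    s = F.softmax(student_logits / temperature, dim=1)
    t = F.softmax(teacher_logits / temperature, dim=1)

    # Center each distribution, then compute the cosine of the centered vectors,
    # which equals the Pearson correlation coefficient.
    s = s - s.mean(dim=1, keepdim=True)
    t = t - t.mean(dim=1, keepdim=True)
    corr = (s * t).sum(dim=1) / (s.norm(dim=1) * t.norm(dim=1) + 1e-8)

    # Maximizing correlation is minimizing (1 - corr).
    return (1.0 - corr).mean()

# Example usage on random logits.
loss = pearson_kd_loss(torch.randn(4, 100, requires_grad=True), torch.randn(4, 100))
```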
arXiv Detail & Related papers (2024-10-09T05:42:47Z) - Adaptive Explicit Knowledge Transfer for Knowledge Distillation [17.739979156009696]
We show that the performance of logit-based knowledge distillation can be improved by effectively delivering the probability distribution for the non-target classes from the teacher model.
We propose a new loss that enables the student to learn explicit knowledge along with implicit knowledge in an adaptive manner.
Experimental results demonstrate that the proposed adaptive explicit knowledge transfer (AEKT) method achieves improved performance compared to state-of-the-art KD methods.
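A minimal sketch of distilling only the teacher's distribution over the non-target classes, which is the idea the summary points at; this is an illustration, not the AEKT loss itself.

```python
import torch
import torch.nn.functional as F

def non_target_kd_loss(student_logits, teacher_logits, labels, temperature=4.0):
    """Transfer only the non-target class distribution from teacher to student.
    Illustrative of the idea; AEKT's actual adaptive loss is defined in the paper."""
    n, c = student_logits.shape
    mask = F.one_hot(labels, c).bool()

    # Drop the target class logit and renormalize over the remaining classes.
    s = student_logits[~mask].view(n, c - 1) / temperature
    t = teacher_logits[~mask].view(n, c - 1) / temperature

    return F.kl_div(F.log_softmax(s, dim=1), F.softmax(t, dim=1),
                    reduction='batchmean') * temperature ** 2

# Example usage on random logits for a 10-class problem.
loss = non_target_kd_loss(torch.randn(4, 10, requires_grad=True),
                          torch.randn(4, 10), torch.randint(0, 10, (4,)))
```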
arXiv Detail & Related papers (2024-09-03T07:42:59Z) - Robustness-Reinforced Knowledge Distillation with Correlation Distance
and Network Pruning [3.1423836318272773]
Knowledge distillation (KD) improves the performance of efficient and lightweight models.
Most existing KD techniques rely on Kullback-Leibler (KL) divergence.
We propose a Robustness-Reinforced Knowledge Distillation (R2KD) that leverages correlation distance and network pruning.
arXiv Detail & Related papers (2023-11-23T11:34:48Z) - Comparative Knowledge Distillation [102.35425896967791]
Traditional Knowledge Distillation (KD) assumes readily available access to teacher models for frequent inference.
We propose Comparative Knowledge Distillation (CKD), which encourages student models to understand the nuanced differences in a teacher model's interpretations of samples.
CKD consistently outperforms state-of-the-art data augmentation and KD techniques.
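One way to picture "nuanced differences between samples" is to match pairwise differences of representations rather than the representations themselves; the sketch below assumes student and teacher features share a dimension and is purely illustrative of the comparison idea, not CKD's objective.

```python
import torch
import torch.nn.functional as F

def comparative_kd_loss(student_feats, teacher_feats):
    """Match how the student tells batch samples apart to how the teacher does,
    via all pairwise feature differences (illustrative sketch only)."""
    s_diff = student_feats.unsqueeze(0) - student_feats.unsqueeze(1)  # (N, N, D)
    t_diff = teacher_feats.unsqueeze(0) - teacher_feats.unsqueeze(1)  # (N, N, D)
    return F.mse_loss(s_diff, t_diff)

# Example usage on random features of matching dimension.
loss = comparative_kd_loss(torch.randn(8, 64, requires_grad=True), torch.randn(8, 64))
```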
arXiv Detail & Related papers (2023-11-03T21:55:33Z) - Adapt Your Teacher: Improving Knowledge Distillation for Exemplar-free
Continual Learning [14.379472108242235]
We investigate exemplar-free class incremental learning (CIL) with knowledge distillation (KD) as a regularization strategy.
KD-based methods are successfully used in CIL, but they often struggle to regularize the model without access to exemplars of the training data from previous tasks.
Inspired by recent test-time adaptation methods, we introduce Teacher Adaptation (TA), a method that concurrently updates the teacher and the main models during incremental training.
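A minimal sketch of one way to let a frozen teacher adapt to incoming task data, by updating only its normalization statistics while the weights stay fixed; this illustrates the spirit of Teacher Adaptation and may differ from the paper's exact procedure.

```python
import torch.nn as nn

def adapt_teacher_normalization(teacher: nn.Module) -> nn.Module:
    """Freeze the teacher's weights but let its BatchNorm running statistics
    keep updating on the new-task data that flows through it during distillation."""
    for p in teacher.parameters():
        p.requires_grad_(False)                       # weights stay fixed
    teacher.eval()                                     # e.g. dropout stays off
    for m in teacher.modules():
        if isinstance(m, (nn.BatchNorm1d, nn.BatchNorm2d, nn.BatchNorm3d)):
            m.train()                                  # running stats track new data
    return teacher

# Example: a small stand-in teacher.
teacher = adapt_teacher_normalization(
    nn.Sequential(nn.Conv2d(3, 8, 3), nn.BatchNorm2d(8), nn.ReLU()))
```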
arXiv Detail & Related papers (2023-08-18T13:22:59Z) - Efficient training of lightweight neural networks using Online
Self-Acquired Knowledge Distillation [51.66271681532262]
Online Self-Acquired Knowledge Distillation (OSAKD) is proposed, aiming to improve the performance of any deep neural model in an online manner.
We utilize a k-NN non-parametric density estimation technique to estimate the unknown probability distributions of the data samples in the output feature space.
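A minimal sketch of a k-NN density estimate over a batch of output features, where density is inversely related to the distance to the k-th nearest neighbour raised to the feature dimension; how OSAKD turns such an estimate into an online distillation signal is described in the paper.

```python
import torch

@torch.no_grad()
def knn_density(features, k=5):
    """Unnormalized k-NN density proxy for each sample in a feature batch."""
    d = torch.cdist(features, features)                  # (N, N) pairwise distances
    d.fill_diagonal_(float('inf'))                        # ignore self-distance
    kth = d.topk(k, dim=1, largest=False).values[:, -1]   # distance to k-th neighbour
    dim = features.shape[1]
    return 1.0 / (kth.pow(dim) + 1e-8)                    # density ~ 1 / r_k^dim

# Example usage on a batch of 32 random 16-dimensional features.
density = knn_density(torch.randn(32, 16))
```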
arXiv Detail & Related papers (2021-08-26T14:01:04Z) - Distilling and Transferring Knowledge via cGAN-generated Samples for
Image Classification and Regression [17.12028267150745]
We propose a unified KD framework based on conditional generative adversarial networks (cGANs).
cGAN-KD distills and transfers knowledge from a teacher model to a student model via cGAN-generated samples.
Experiments on CIFAR-10 and Tiny-ImageNet show we can incorporate KD methods into the cGAN-KD framework to reach a new state of the art.
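A minimal sketch of one distillation step on cGAN-generated samples; the toy conditional generator, the classifier interfaces, and the plain KL objective are assumptions for illustration, not the cGAN-KD framework itself.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Stand-in modules: a class-conditional generator and small classifiers
# over flat 64-dimensional "images" (assumptions for illustration).
class ToyCondGenerator(nn.Module):
    def __init__(self, z_dim=16, n_classes=10, out_dim=64):
        super().__init__()
        self.embed = nn.Embedding(n_classes, z_dim)
        self.net = nn.Linear(2 * z_dim, out_dim)
    def forward(self, z, y):
        return self.net(torch.cat([z, self.embed(y)], dim=1))

generator, teacher, student = ToyCondGenerator(), nn.Linear(64, 10), nn.Linear(64, 10)
optimizer = torch.optim.SGD(student.parameters(), lr=0.1)

def cgan_kd_step(temperature=4.0, batch_size=32, z_dim=16, n_classes=10):
    """One step: sample class-conditional fake data, soft-label it with the
    teacher, and distill the student on that synthetic batch."""
    z = torch.randn(batch_size, z_dim)
    y = torch.randint(0, n_classes, (batch_size,))
    with torch.no_grad():
        x_fake = generator(z, y)            # class-conditional synthetic samples
        t_logits = teacher(x_fake)          # teacher soft labels
    s_logits = student(x_fake)
    loss = F.kl_div(F.log_softmax(s_logits / temperature, dim=1),
                    F.softmax(t_logits / temperature, dim=1),
                    reduction='batchmean') * temperature ** 2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

cgan_kd_step()
```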
arXiv Detail & Related papers (2021-04-07T14:52:49Z) - MixKD: Towards Efficient Distillation of Large-scale Language Models [129.73786264834894]
We propose MixKD, a data-agnostic distillation framework, to endow the resulting model with stronger generalization ability.
We prove from a theoretical perspective that under reasonable conditions MixKD gives rise to a smaller gap between the generalization error and the empirical error.
Experiments under a limited-data setting and ablation studies further demonstrate the advantages of the proposed approach.
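A minimal sketch of mixup-style distillation: interpolate pairs of inputs and match the student to the teacher on the mixed examples. MixKD itself targets language models and interpolates at the embedding level, so treat the stand-in models and input-space mixup here as assumptions.

```python
import torch
import torch.nn.functional as F

def mixkd_like_loss(student, teacher, x, alpha=0.4, temperature=2.0):
    """Mixup-based distillation sketch: distill on interpolated inputs."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(x.size(0))
    x_mix = lam * x + (1.0 - lam) * x[perm]          # mixup of random pairs

    with torch.no_grad():
        t_logits = teacher(x_mix)                     # teacher on mixed inputs
    s_logits = student(x_mix)
    return F.kl_div(F.log_softmax(s_logits / temperature, dim=1),
                    F.softmax(t_logits / temperature, dim=1),
                    reduction='batchmean') * temperature ** 2

# Example with stand-in linear "models" over flat 32-dimensional inputs.
teacher = torch.nn.Linear(32, 10)
student = torch.nn.Linear(32, 10)
loss = mixkd_like_loss(student, teacher, torch.randn(16, 32))
```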
arXiv Detail & Related papers (2020-11-01T18:47:51Z) - Heterogeneous Knowledge Distillation using Information Flow Modeling [82.83891707250926]
We propose a novel KD method that works by modeling the information flow through the various layers of the teacher model.
The proposed method is capable of overcoming the aforementioned limitations by using an appropriate supervision scheme during the different phases of the training process.
arXiv Detail & Related papers (2020-05-02T06:56:56Z) - Residual Knowledge Distillation [96.18815134719975]
This work proposes Residual Knowledge Distillation (RKD), which further distills the knowledge by introducing an assistant model (A).
In this way, the student (S) is trained to mimic the feature maps of the teacher (T), and A aids this process by learning the residual error between them.
Experiments show that our approach achieves appealing results on popular classification datasets, CIFAR-100 and ImageNet.
arXiv Detail & Related papers (2020-02-21T07:49:26Z)
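A minimal sketch of the residual idea: the student mimics the teacher's feature maps while an assistant learns the remaining error; the modules and shapes below are stand-ins, not the paper's architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

feat_dim = 64
student_head = nn.Linear(32, feat_dim)     # produces student features S(x)
assistant = nn.Linear(32, feat_dim)        # produces assistant features A(x)

def rkd_residual_losses(x, teacher_feats):
    """S mimics T, and A is trained to predict the residual T - S."""
    s_feats = student_head(x)
    a_feats = assistant(x)
    mimic_loss = F.mse_loss(s_feats, teacher_feats)                        # S -> T
    residual_loss = F.mse_loss(a_feats, teacher_feats - s_feats.detach())  # A -> T - S
    return mimic_loss + residual_loss

# Example usage with random inputs and random "teacher" features.
loss = rkd_residual_losses(torch.randn(8, 32), torch.randn(8, feat_dim))
```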
This list is automatically generated from the titles and abstracts of the papers in this site.