Lipschitz Continuity Guided Knowledge Distillation
- URL: http://arxiv.org/abs/2108.12905v1
- Date: Sun, 29 Aug 2021 20:19:34 GMT
- Title: Lipschitz Continuity Guided Knowledge Distillation
- Authors: Yuzhang Shang, Bin Duan, Ziliang Zong, Liqiang Nie, Yan Yan
- Abstract summary: We propose a novel Lipschitz Continuity Guided Knowledge Distillation framework to faithfully distill knowledge.
We derive an explainable approximation algorithm with an explicit theoretical derivation to address the NP-hard problem of calculating the Lipschitz constant.
Experimental results have shown that our method outperforms other benchmarks over several knowledge distillation tasks.
- Score: 44.77558919044394
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Knowledge distillation has become one of the most important model compression
techniques by distilling knowledge from larger teacher networks to smaller
student ones. Although great success has been achieved by prior distillation
methods via delicately designing various types of knowledge, they overlook the
functional properties of neural networks, which makes the process of applying
those techniques to new tasks unreliable and non-trivial. To alleviate this
problem, in this paper, we first leverage Lipschitz continuity to better
represent the functional characteristic of neural networks and guide the
knowledge distillation process. In particular, we propose a novel Lipschitz
Continuity Guided Knowledge Distillation framework to faithfully distill
knowledge by minimizing the distance between two neural networks' Lipschitz
constants, which enables teacher networks to better regularize student networks
and improve the corresponding performance. We derive an explainable
approximation algorithm with an explicit theoretical derivation to address the
NP-hard problem of calculating the Lipschitz constant. Experimental results
have shown that our method outperforms other benchmarks over several knowledge
distillation tasks (e.g., classification, segmentation and object detection) on
CIFAR-100, ImageNet, and PASCAL VOC datasets.
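As a concrete illustration of the abstract above, the following is a minimal, hypothetical PyTorch sketch of Lipschitz-guided distillation. It assumes the network's Lipschitz constant is approximated by the product of per-layer spectral norms (a standard upper bound for feed-forward networks, estimated here by power iteration); the paper's actual approximation algorithm, loss form, and weighting may differ, so treat this as a sketch rather than the authors' method.

```python
import torch
import torch.nn.functional as F

def spectral_norm(weight, n_iters=10):
    # Estimate the largest singular value of a weight matrix by power iteration.
    # Flattening conv kernels to 2-D is a common approximation of the layer's operator norm.
    w = weight.reshape(weight.shape[0], -1)
    v = torch.randn(w.shape[1], device=w.device)
    u = torch.randn(w.shape[0], device=w.device)
    for _ in range(n_iters):
        u = F.normalize(torch.mv(w, v), dim=0)
        v = F.normalize(torch.mv(w.t(), u), dim=0)
    return torch.dot(u, torch.mv(w, v))

def log_lipschitz_bound(model):
    # Log of the product of per-layer spectral norms: an upper bound on the
    # Lipschitz constant of a feed-forward network with 1-Lipschitz activations.
    log_l = 0.0
    for m in model.modules():
        if isinstance(m, (torch.nn.Linear, torch.nn.Conv2d)):
            log_l = log_l + torch.log(spectral_norm(m.weight) + 1e-12)
    return log_l

def lipschitz_guided_kd_loss(student, teacher, x, y, T=4.0, alpha=0.5, beta=0.1):
    # Standard soft-target KD plus a term pulling the student's (log) Lipschitz
    # bound toward the teacher's. T, alpha, beta are placeholder hyperparameters.
    s_logits = student(x)
    with torch.no_grad():
        t_logits = teacher(x)
    ce = F.cross_entropy(s_logits, y)
    kd = F.kl_div(F.log_softmax(s_logits / T, dim=1),
                  F.softmax(t_logits / T, dim=1),
                  reduction="batchmean") * (T * T)
    lip = (log_lipschitz_bound(student) - log_lipschitz_bound(teacher).detach()) ** 2
    return ce + alpha * kd + beta * lip
```

In this sketch, beta controls how strongly the student's Lipschitz bound is pulled toward the (fixed) teacher's; working in log space simply keeps the product of per-layer norms numerically manageable.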
Related papers
- Learning to Maximize Mutual Information for Chain-of-Thought Distillation [13.660167848386806]
Distilling Step-by-Step (DSS) has demonstrated promise by imbuing smaller models with the superior reasoning capabilities of their larger counterparts.
However, DSS overlooks the intrinsic relationship between the two training tasks, leading to ineffective integration of CoT knowledge with the task of label prediction.
We propose a variational approach to solve this problem using a learning-based method.
arXiv Detail & Related papers (2024-03-05T22:21:45Z)
- AICSD: Adaptive Inter-Class Similarity Distillation for Semantic Segmentation [12.92102548320001]
This paper proposes a novel method called Inter-Class Similarity Distillation (ICSD) for the purpose of knowledge distillation.
The proposed method transfers high-order relations from the teacher network to the student network by independently computing intra-class distributions for each class from network outputs.
Experiments conducted on two well-known datasets for semantic segmentation, Cityscapes and Pascal VOC 2012, validate the effectiveness of the proposed method.
arXiv Detail & Related papers (2023-08-08T13:17:20Z)
- On effects of Knowledge Distillation on Transfer Learning [0.0]
We propose a machine learning architecture we call TL+KD that combines knowledge distillation with transfer learning.
We show that, with guidance and knowledge from a larger teacher network during fine-tuning, the student network achieves better validation performance, such as higher accuracy.
arXiv Detail & Related papers (2022-10-18T08:11:52Z)
- A Closer Look at Knowledge Distillation with Features, Logits, and Gradients [81.39206923719455]
Knowledge distillation (KD) is a widely used strategy for transferring learned knowledge from one neural network model to another.
This work provides a new perspective to motivate a set of knowledge distillation strategies by approximating the classical KL-divergence criteria with different knowledge sources.
Our analysis indicates that logits are generally a more efficient knowledge source and suggests that having sufficient feature dimensions is crucial for the model design.
arXiv Detail & Related papers (2022-03-18T21:26:55Z)
- Training Certifiably Robust Neural Networks with Efficient Local Lipschitz Bounds [99.23098204458336]
Certified robustness is a desirable property for deep neural networks in safety-critical applications.
We show that our method consistently outperforms state-of-the-art methods on the MNIST and TinyImageNet datasets.
arXiv Detail & Related papers (2021-11-02T06:44:10Z)
- Efficient training of lightweight neural networks using Online Self-Acquired Knowledge Distillation [51.66271681532262]
Online Self-Acquired Knowledge Distillation (OSAKD) is proposed, aiming to improve the performance of any deep neural model in an online manner.
We utilize the k-NN non-parametric density estimation technique to estimate the unknown probability distributions of the data samples in the output feature space.
arXiv Detail & Related papers (2021-08-26T14:01:04Z)
- Interpretable Embedding Procedure Knowledge Transfer via Stacked Principal Component Analysis and Graph Neural Network [26.55774782646948]
This paper proposes a method of generating interpretable embedding procedure (IEP) knowledge based on principal component analysis.
Experimental results show that the student network trained with the proposed KD method improves by 2.28% on the CIFAR-100 dataset.
We also demonstrate that the embedding procedure knowledge is interpretable via visualization of the proposed KD process.
arXiv Detail & Related papers (2021-04-28T03:40:37Z)
- Annealing Knowledge Distillation [5.396407687999048]
We propose an improved knowledge distillation method (called Annealing-KD) by feeding the rich information provided by the teacher's soft-targets incrementally and more efficiently.
This paper includes theoretical and empirical evidence as well as practical experiments to support the effectiveness of our Annealing-KD method.
arXiv Detail & Related papers (2021-04-14T23:45:03Z)
- Computation-Efficient Knowledge Distillation via Uncertainty-Aware Mixup [91.1317510066954]
We study a little-explored but important question, i.e., knowledge distillation efficiency.
Our goal is to achieve a performance comparable to conventional knowledge distillation with a lower computation cost during training.
We show that the UNcertainty-aware mIXup (UNIX) can serve as a clean yet effective solution.
arXiv Detail & Related papers (2020-12-17T06:52:16Z)
- Residual Knowledge Distillation [96.18815134719975]
This work proposes Residual Knowledge Distillation (RKD), which further distills knowledge by introducing an assistant network (A) alongside the teacher (T) and student (S).
In this way, S is trained to mimic the feature maps of T, and A aids this process by learning the residual error between them.
Experiments show that our approach achieves appealing results on popular classification datasets, CIFAR-100 and ImageNet.
arXiv Detail & Related papers (2020-02-21T07:49:26Z)
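For the Residual Knowledge Distillation entry above, here is a minimal sketch of the described idea, assuming MSE losses on matched feature maps and an illustrative assistant network; the names and the weight alpha are placeholders, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def rkd_feature_loss(f_teacher, f_student, assistant, alpha=1.0):
    # f_teacher, f_student: feature maps of the same shape from teacher T and student S.
    # assistant: a small network A that predicts, from the student's features,
    # the residual the student has not yet captured.
    f_teacher = f_teacher.detach()                    # teacher is fixed during distillation
    mimic = F.mse_loss(f_student, f_teacher)          # S mimics T's feature maps
    residual = f_teacher - f_student.detach()         # error left after S's imitation
    assist = F.mse_loss(assistant(f_student.detach()), residual)  # A learns that residual
    return mimic + alpha * assist
```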
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the content above (including all information) and is not responsible for any consequences arising from its use.