Understanding the Role of Mixup in Knowledge Distillation: An Empirical
Study
- URL: http://arxiv.org/abs/2211.03946v2
- Date: Wed, 9 Nov 2022 01:53:34 GMT
- Title: Understanding the Role of Mixup in Knowledge Distillation: An Empirical
Study
- Authors: Hongjun Choi, Eun Som Jeon, Ankita Shukla, Pavan Turaga
- Abstract summary: Mixup is a popular data augmentation technique based on creating new samples by linear interpolation between two given data samples.
Knowledge distillation (KD) is widely used for model compression and transfer learning.
"smoothness" is the connecting link between the two and is also a crucial attribute in understanding KD's interplay with mixup.
- Score: 4.751886527142779
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Mixup is a popular data augmentation technique based on creating new samples
by linear interpolation between two given data samples, to improve both the
generalization and robustness of the trained model. Knowledge distillation
(KD), on the other hand, is widely used for model compression and transfer
learning, which involves using a larger network's implicit knowledge to guide
the learning of a smaller network. At first glance, these two techniques seem
very different, however, we found that "smoothness" is the connecting link
between the two and is also a crucial attribute in understanding KD's interplay
with mixup. Although many mixup variants and distillation methods have been
proposed, much remains to be understood regarding the role of mixup in
knowledge distillation. In this paper, we present a detailed empirical study on
various important dimensions of compatibility between mixup and knowledge
distillation. We also scrutinize the behavior of the networks trained with a
mixup in the light of knowledge distillation through extensive analysis,
visualizations, and comprehensive experiments on image classification. Finally,
based on our findings, we suggest improved strategies to guide the student
network to enhance its effectiveness. Additionally, the findings of this study
provide insightful suggestions to researchers and practitioners that commonly
use techniques from KD. Our code is available at
https://github.com/hchoi71/MIX-KD.
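To make the connection concrete, here is a minimal PyTorch-style sketch of combining the two techniques in one training step: the inputs are mixed by linear interpolation, and the student is trained on both the interpolated hard labels and the teacher's temperature-softened predictions for the same mixed batch. The function names, temperature, and loss weights are illustrative assumptions rather than the authors' exact recipe; the linked repository contains the official code.

```python
import torch
import torch.nn.functional as F

def mixup_batch(x, y, alpha=0.2):
    """Mixup: convex combination of two samples; the label pair and mixing weight are returned."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(x.size(0))
    x_mix = lam * x + (1.0 - lam) * x[perm]
    return x_mix, y, y[perm], lam

def kd_mixup_loss(student, teacher, x, y, T=4.0, kd_weight=0.9, alpha=0.2):
    """One training step's loss: mixed-label cross-entropy plus KL distillation
    from the teacher's softened predictions on the same mixed inputs.
    (Weights and temperature are placeholders, not the paper's settings.)"""
    x_mix, y_a, y_b, lam = mixup_batch(x, y, alpha)
    s_logits = student(x_mix)
    with torch.no_grad():
        t_logits = teacher(x_mix)

    # Hard-label term, interpolated the same way as the inputs.
    ce = lam * F.cross_entropy(s_logits, y_a) + (1.0 - lam) * F.cross_entropy(s_logits, y_b)

    # Soft-label (distillation) term: KL between temperature-scaled distributions.
    kd = F.kl_div(F.log_softmax(s_logits / T, dim=1),
                  F.softmax(t_logits / T, dim=1),
                  reduction="batchmean") * (T * T)

    return (1.0 - kd_weight) * ce + kd_weight * kd
```

Both the interpolated labels from mixup and the softened teacher targets smooth the supervision the student receives, which is the "smoothness" link the abstract refers to.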
Related papers
- Distribution Shift Matters for Knowledge Distillation with Webly
Collected Images [91.66661969598755]
We propose a novel method dubbed "Knowledge Distillation between Different Distributions" (KD$^3$).
We first dynamically select useful training instances from the webly collected data according to the combined predictions of the teacher and student networks.
We also build a new contrastive learning block called MixDistribution to generate perturbed data with a new distribution for instance alignment.
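The summary above only hints at how instances are chosen; as a rough, hypothetical sketch of selecting useful training instances according to the combined predictions of the teacher and student networks, one could rank webly collected samples by the confidence of the averaged prediction and keep the top fraction. This is an illustrative guess, not the KD$^3$ algorithm itself.

```python
import torch
import torch.nn.functional as F

def select_instances(t_logits, s_logits, keep_ratio=0.5):
    """Hypothetical selection rule: keep the webly collected samples on which the
    averaged teacher/student prediction is most confident (keep_ratio is a placeholder)."""
    probs = 0.5 * (F.softmax(t_logits, dim=1) + F.softmax(s_logits, dim=1))
    confidence, _ = probs.max(dim=1)
    k = max(1, int(keep_ratio * confidence.size(0)))
    return torch.topk(confidence, k).indices  # indices of retained instances
```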
arXiv Detail & Related papers (2023-07-21T10:08:58Z) - Leveraging Different Learning Styles for Improved Knowledge Distillation
in Biomedical Imaging [0.9208007322096533]
Our work endeavors to leverage the concept of knowledge diversification to improve the performance of model compression techniques such as Knowledge Distillation (KD) and Mutual Learning (ML).
We use a single teacher and two student networks in a unified framework that not only allows for the transfer of knowledge from teacher to students (KD) but also encourages collaborative learning between the students (ML).
Unlike the conventional approach, where the teacher shares the same knowledge in the form of predictions or feature representations with the student network, our proposed approach employs a more diversified strategy by training one student with predictions and the other with feature maps from the teacher.
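A minimal sketch of the diversified setup described above: one student is supervised by the teacher's predictions, the other by the teacher's feature maps, and the two students additionally mimic each other (mutual learning). The equal loss weighting, the temperature, and the `proj` layer that aligns feature widths are assumptions, not the paper's exact design.

```python
import torch.nn.functional as F

def diversified_kd_ml_loss(t_logits, t_feat, s1_logits, s2_logits, s2_feat, proj, T=4.0):
    """Student 1 distills the teacher's predictions, student 2 the teacher's feature maps,
    and both students mutually mimic each other's predictions (ML).
    `proj` is an assumed learnable layer matching student 2's feature width to the teacher's."""
    kd_pred = F.kl_div(F.log_softmax(s1_logits / T, dim=1),
                       F.softmax(t_logits / T, dim=1),
                       reduction="batchmean") * (T * T)
    kd_feat = F.mse_loss(proj(s2_feat), t_feat)
    ml = (F.kl_div(F.log_softmax(s1_logits, dim=1),
                   F.softmax(s2_logits.detach(), dim=1), reduction="batchmean")
          + F.kl_div(F.log_softmax(s2_logits, dim=1),
                     F.softmax(s1_logits.detach(), dim=1), reduction="batchmean"))
    return kd_pred + kd_feat + ml
```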
arXiv Detail & Related papers (2022-12-06T12:40:45Z) - Exploring Inconsistent Knowledge Distillation for Object Detection with
Data Augmentation [66.25738680429463]
Knowledge Distillation (KD) for object detection aims to train a compact detector by transferring knowledge from a teacher model.
We propose inconsistent knowledge distillation (IKD) which aims to distill knowledge inherent in the teacher model's counter-intuitive perceptions.
Our method outperforms state-of-the-art KD baselines on one-stage, two-stage and anchor-free object detectors.
arXiv Detail & Related papers (2022-09-20T16:36:28Z) - Knowledge Distillation Meets Open-Set Semi-Supervised Learning [69.21139647218456]
We propose a novel method dedicated to distilling representational knowledge semantically from a pretrained teacher to a target student.
At the problem level, this establishes an interesting connection between knowledge distillation and open-set semi-supervised learning (SSL).
Our method significantly outperforms previous state-of-the-art knowledge distillation methods on both coarse object classification and fine-grained face recognition tasks.
arXiv Detail & Related papers (2022-05-13T15:15:27Z) - A Closer Look at Knowledge Distillation with Features, Logits, and
Gradients [81.39206923719455]
Knowledge distillation (KD) is a substantial strategy for transferring learned knowledge from one neural network model to another.
This work provides a new perspective to motivate a set of knowledge distillation strategies by approximating the classical KL-divergence criteria with different knowledge sources.
Our analysis indicates that logits are generally a more efficient knowledge source and suggests that having sufficient feature dimensions is crucial for the model design.
arXiv Detail & Related papers (2022-03-18T21:26:55Z) - Partial to Whole Knowledge Distillation: Progressive Distilling
Decomposed Knowledge Boosts Student Better [18.184818787217594]
We introduce a new concept of knowledge decomposition and put forward the Partial to Whole Knowledge Distillation (PWKD) paradigm.
The student then extracts partial-to-whole knowledge from the pre-trained teacher over multiple training stages, where a cyclic learning rate is leveraged to accelerate convergence.
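As a small illustration of multi-stage training driven by a cyclic learning rate, the sketch below runs one LR cycle per stage using PyTorch's built-in `CyclicLR` scheduler. The model, the stage count, the LR range, and the placeholder loss are assumptions, not the PWKD configuration.

```python
import torch

# Illustrative student and optimizer; all hyperparameters here are placeholders.
model = torch.nn.Linear(512, 100)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

num_stages, steps_per_stage = 4, 1000
scheduler = torch.optim.lr_scheduler.CyclicLR(
    optimizer, base_lr=1e-4, max_lr=0.1,
    step_size_up=steps_per_stage // 2, mode="triangular",
)

for stage in range(num_stages):              # one LR cycle per "partial knowledge" stage
    for step in range(steps_per_stage):
        x = torch.randn(32, 512)             # placeholder batch
        loss = model(x).pow(2).mean()        # placeholder for the stage's distillation loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        scheduler.step()
```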
arXiv Detail & Related papers (2021-09-26T06:33:25Z) - Distilling Holistic Knowledge with Graph Neural Networks [37.86539695906857]
Knowledge Distillation (KD) aims at transferring knowledge from a larger well-optimized teacher network to a smaller learnable student network.
Existing KD methods have mainly considered two types of knowledge, namely individual knowledge and relational knowledge.
We propose to distill the novel holistic knowledge based on an attributed graph constructed among instances.
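A generic sketch of relational distillation over an instance graph, in the spirit of the attributed-graph idea summarized above; the cosine-similarity adjacency and the mean-squared matching loss are assumptions, not the paper's construction.

```python
import torch.nn.functional as F

def instance_graph(features):
    """Build a dense instance graph: nodes are a batch's feature vectors,
    edges are their pairwise cosine similarities."""
    z = F.normalize(features, dim=1)
    return z @ z.t()  # (B, B) adjacency of cosine similarities

def relational_kd_loss(t_feat, s_feat):
    """Match the student's instance graph to the teacher's (a generic relational-KD term)."""
    return F.mse_loss(instance_graph(s_feat), instance_graph(t_feat.detach()))
```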
arXiv Detail & Related papers (2021-08-12T02:47:59Z) - Similarity Transfer for Knowledge Distillation [25.042405967561212]
Knowledge distillation is a popular paradigm for learning portable neural networks by transferring the knowledge from a large model into a smaller one.
We propose a novel method called similarity transfer for knowledge distillation (STKD), which aims to fully utilize the similarities between categories of multiple samples.
Results show that STKD substantially outperforms vanilla knowledge distillation and achieves superior accuracy over state-of-the-art knowledge distillation methods.
arXiv Detail & Related papers (2021-03-18T06:54:59Z) - Collaborative Teacher-Student Learning via Multiple Knowledge Transfer [79.45526596053728]
We propose a collaborative teacher-student learning framework via multiple knowledge transfer (CTSL-MKT).
It allows multiple students to learn knowledge from both individual instances and instance relations in a collaborative way.
The experiments and ablation studies on four image datasets demonstrate that the proposed CTSL-MKT significantly outperforms the state-of-the-art KD methods.
arXiv Detail & Related papers (2021-01-21T07:17:04Z) - Towards Understanding Ensemble, Knowledge Distillation and
Self-Distillation in Deep Learning [93.18238573921629]
We study how an ensemble of deep learning models can improve test accuracy, and how the superior performance of the ensemble can be distilled into a single model.
We show that ensemble/knowledge distillation in deep learning works very differently from traditional learning theory.
We prove that self-distillation can also be viewed as implicitly combining ensemble and knowledge distillation to improve test accuracy.
arXiv Detail & Related papers (2020-12-17T18:34:45Z) - Knowledge Distillation Beyond Model Compression [13.041607703862724]
Knowledge distillation (KD) is commonly deemed an effective model compression technique in which a compact model (student) is trained under the supervision of a larger pretrained model or an ensemble of models (teacher).
In this study, we provide an extensive study of nine different KD methods that cover a broad spectrum of approaches to capturing and transferring knowledge.
arXiv Detail & Related papers (2020-07-03T19:54:04Z)