Understanding Self-Distillation and Partial Label Learning in
Multi-Class Classification with Label Noise
- URL: http://arxiv.org/abs/2402.10482v1
- Date: Fri, 16 Feb 2024 07:13:12 GMT
- Title: Understanding Self-Distillation and Partial Label Learning in
Multi-Class Classification with Label Noise
- Authors: Hyeonsu Jeong and Hye Won Chung
- Abstract summary: Self-distillation (SD) is the process of training a student model using the outputs of a teacher model.
Our study theoretically examines SD in multi-class classification with cross-entropy loss.
- Score: 12.636657455986144
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Self-distillation (SD) is the process of training a student model using the
outputs of a teacher model, with both models sharing the same architecture. Our
study theoretically examines SD in multi-class classification with
cross-entropy loss, exploring both multi-round SD and SD with refined teacher
outputs, inspired by partial label learning (PLL). By deriving a closed-form
solution for the student model's outputs, we discover that SD essentially
functions as label averaging among instances with high feature correlations.
Initially beneficial, this averaging helps the model focus on feature clusters
correlated with a given instance for predicting the label. However, it leads to
diminishing performance with increasing distillation rounds. Additionally, we
demonstrate SD's effectiveness in label noise scenarios and identify the label
corruption condition and minimum number of distillation rounds needed to
achieve 100% classification accuracy. Our study also reveals that one-step
distillation with refined teacher outputs surpasses the efficacy of multi-step
SD using the teacher's direct output in high noise rate regimes.
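The label-averaging view described in the abstract can be illustrated with a small toy sketch. This is not the paper's closed-form solution: the block-structured correlation matrix, the row normalisation, the deterministic noise pattern, and every name below (`CROSS_CORR`, `averaging_matrix`, `distill_round`, and so on) are assumptions chosen only to make the averaging effect visible. Each "round" replaces an instance's soft label with a correlation-weighted average of all current labels.

```python
# Toy illustration only: label averaging driven by feature correlation.
# The correlation structure, the row normalisation and the noise pattern
# are assumptions of this sketch, not the paper's closed-form solution.

NUM_CLASSES = 3
PER_CLASS = 10
FLIPPED_PER_CLASS = 3   # instances per class whose given label is corrupted
CROSS_CORR = 0.2        # assumed feature correlation between different clusters


def build_data():
    """Return (true_classes, noisy one-hot labels)."""
    true_classes, labels = [], []
    for k in range(NUM_CLASSES):
        for i in range(PER_CLASS):
            # Deterministic noise: the first few instances of each class
            # are relabelled as the next class.
            given = (k + 1) % NUM_CLASSES if i < FLIPPED_PER_CLASS else k
            true_classes.append(k)
            labels.append([1.0 if c == given else 0.0 for c in range(NUM_CLASSES)])
    return true_classes, labels


def averaging_matrix(true_classes):
    """Row-normalised correlations: 1 within a cluster, CROSS_CORR across.

    Assumes feature clusters coincide with the true classes.
    """
    rows = []
    for a in true_classes:
        row = [1.0 if a == b else CROSS_CORR for b in true_classes]
        total = sum(row)
        rows.append([w / total for w in row])
    return rows


def distill_round(weights, labels):
    """One round: each new soft label is a weighted average of current labels."""
    return [
        [sum(w * y[c] for w, y in zip(row, labels)) for c in range(NUM_CLASSES)]
        for row in weights
    ]


def accuracy(true_classes, labels):
    """Fraction of instances whose largest soft-label entry is the true class."""
    hits = 0
    for k, y in zip(true_classes, labels):
        if max(range(NUM_CLASSES), key=lambda c: y[c]) == k:
            hits += 1
    return hits / len(true_classes)


def mean_margin(labels):
    """Average gap between the largest and second-largest soft-label entry."""
    gaps = []
    for y in labels:
        top = sorted(y, reverse=True)
        gaps.append(top[0] - top[1])
    return sum(gaps) / len(gaps)


def run(rounds):
    """Return [(accuracy, margin)] for the given labels and each later round."""
    true_classes, labels = build_data()
    weights = averaging_matrix(true_classes)
    history = [(accuracy(true_classes, labels), mean_margin(labels))]
    for _ in range(rounds):
        labels = distill_round(weights, labels)
        history.append((accuracy(true_classes, labels), mean_margin(labels)))
    return history


if __name__ == "__main__":
    for t, (acc, margin) in enumerate(run(4)):
        print(f"round {t}: accuracy={acc:.2f} margin={margin:.4f}")
```

In this toy setup, one round of averaging lifts the agreement with the true classes from 0.70 (the noisy given labels) to 1.00, because the clean majority within each cluster outweighs the flipped labels. Further rounds keep shrinking the gap between the top two soft-label entries, which is one simple way to picture why repeated averaging eventually stops helping; the sketch shows only this shrinking margin, not the accuracy behaviour analysed in the paper.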
Related papers
- Multi-Granularity Semantic Revision for Large Language Model Distillation [66.03746866578274]
We propose a multi-granularity semantic revision method for LLM distillation.
At the sequence level, we propose a sequence correction and re-generation strategy.
At the token level, we design a distribution adaptive clipping Kullback-Leibler loss as the distillation objective function.
At the span level, we leverage the span priors of a sequence to compute the probability correlations within spans, and constrain the teacher and student's probability correlations to be consistent.
arXiv Detail & Related papers (2024-07-14T03:51:49Z)
- Knowledge Diffusion for Distillation [53.908314960324915]
The representation gap between teacher and student is an emerging topic in knowledge distillation (KD).
We state that the essence of these methods is to discard the noisy information and distill the valuable information in the feature.
We propose a novel KD method dubbed DiffKD, to explicitly denoise and match features using diffusion models.
arXiv Detail & Related papers (2023-05-25T04:49:34Z)
- Exploring Inconsistent Knowledge Distillation for Object Detection with Data Augmentation [66.25738680429463]
Knowledge Distillation (KD) for object detection aims to train a compact detector by transferring knowledge from a teacher model.
We propose inconsistent knowledge distillation (IKD) which aims to distill knowledge inherent in the teacher model's counter-intuitive perceptions.
Our method outperforms state-of-the-art KD baselines on one-stage, two-stage and anchor-free object detectors.
arXiv Detail & Related papers (2022-09-20T16:36:28Z)
- Label-Noise Learning with Intrinsically Long-Tailed Data [65.41318436799993]
We propose a learning framework for label-noise learning with intrinsically long-tailed data.
Specifically, we propose two-stage bi-dimensional sample selection (TABASCO) to better separate clean samples from noisy samples.
arXiv Detail & Related papers (2022-08-21T07:47:05Z)
- Label Matching Semi-Supervised Object Detection [85.99282969977541]
Semi-supervised object detection has made significant progress with the development of mean teacher driven self-training.
The label mismatch problem has not yet been fully explored in previous work, leading to severe confirmation bias during self-training.
We propose a simple yet effective LabelMatch framework from two different yet complementary perspectives.
arXiv Detail & Related papers (2022-06-14T05:59:41Z)
- ALM-KD: Knowledge Distillation with noisy labels via adaptive loss mixing [25.49637460661711]
Knowledge distillation is a technique where the outputs of a pretrained model are used for training a student model in a supervised setting.
We tackle this problem via the use of an adaptive loss mixing scheme during KD.
We demonstrate performance gains obtained using our approach in the standard KD setting as well as in multi-teacher and self-distillation settings.
arXiv Detail & Related papers (2022-02-07T14:53:22Z)
- Anomaly Detection via Reverse Distillation from One-Class Embedding [2.715884199292287]
We propose a novel T-S model consisting of a teacher encoder and a student decoder.
Instead of receiving raw images directly, the student network takes the teacher model's one-class embedding as input.
In addition, we introduce a trainable one-class bottleneck embedding module in our T-S model.
arXiv Detail & Related papers (2022-01-26T01:48:37Z)
- From Consensus to Disagreement: Multi-Teacher Distillation for Semi-Supervised Relation Extraction [10.513626483108126]
Semi-supervised relation extraction (SSRE) has proven to be a promising approach to this problem by annotating unlabeled samples as additional training data.
However, the difference set, which contains rich information about unlabeled data, has long been neglected by prior studies.
We develop a simple and general multi-teacher distillation framework, which can be easily integrated into any existing SSRE methods.
arXiv Detail & Related papers (2021-12-02T08:20:23Z)
- Deep Semi-supervised Knowledge Distillation for Overlapping Cervical Cell Instance Segmentation [54.49894381464853]
We propose to leverage both labeled and unlabeled data for instance segmentation with improved accuracy by knowledge distillation.
We propose a novel Mask-guided Mean Teacher framework with Perturbation-sensitive Sample Mining.
Experiments show that the proposed method improves the performance significantly compared with the supervised method learned from labeled data only.
arXiv Detail & Related papers (2020-07-21T13:27:09Z)
This list is automatically generated from the titles and abstracts of the papers on this site.