The Effect of Optimal Self-Distillation in Noisy Gaussian Mixture Model
- URL: http://arxiv.org/abs/2501.16226v2
- Date: Sat, 01 Feb 2025 05:31:30 GMT
- Title: The Effect of Optimal Self-Distillation in Noisy Gaussian Mixture Model
- Authors: Kaito Takanami, Takashi Takahashi, Ayaka Sakata
- Abstract summary: Self-distillation (SD) is a technique where a model refines itself from its own predictions.
Despite its widespread use, the mechanisms underlying its effectiveness remain unclear.
- Score: 2.355460994057843
- Abstract: Self-distillation (SD), a technique where a model refines itself from its own predictions, has garnered attention as a simple yet powerful approach in machine learning. Despite its widespread use, the mechanisms underlying its effectiveness remain unclear. In this study, we investigate the efficacy of hyperparameter-tuned multi-stage SD in binary classification tasks with noisy-labeled Gaussian mixture data, using replica theory. Our findings reveal that the primary driver of SD's performance improvement is denoising through hard pseudo-labels, with the most notable gains observed in moderately sized datasets. We also demonstrate the efficacy of practical heuristics, such as early stopping for extracting a meaningful signal and bias fixation for imbalanced data. These results provide both theoretical guarantees and practical insights, advancing our understanding and application of SD in noisy settings.
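To make the mechanism concrete, here is a minimal sketch of multi-stage SD with hard pseudo-labels on noisy-labeled Gaussian mixture data. The ridge-regression learner, the flip probability, and all constants are illustrative assumptions, not the estimator analyzed in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Noisy Gaussian mixture: x ~ N(y * mu, I); observed labels flipped w.p. 0.2.
# Dimensions and noise level are illustrative, not the paper's scaling.
n, d, flip_prob = 500, 50, 0.2
mu = rng.standard_normal(d) / np.sqrt(d)
y_true = rng.choice([-1, 1], size=n)
X = y_true[:, None] * mu[None, :] + rng.standard_normal((n, d))
flips = rng.random(n) < flip_prob
y_obs = np.where(flips, -y_true, y_true).astype(float)

def ridge_fit(X, y, lam=1.0):
    """Ridge classifier, a stand-in for the paper's hyperparameter-tuned learner."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

# Multi-stage SD: each stage retrains on the previous stage's hard labels,
# which is the denoising step the abstract identifies as the primary driver.
y_stage = y_obs
for stage in range(3):
    w = ridge_fit(X, y_stage)
    y_stage = np.sign(X @ w)  # hard pseudo-labels for the next stage
    print(f"stage {stage}: agreement with clean labels = {np.mean(y_stage == y_true):.3f}")
```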
Related papers
- Distributionally Robust Graph Out-of-Distribution Recommendation via Diffusion Model [7.92181856602497]
We design a Distributionally Robust Graph model for OOD recommendation (DRGO).
Specifically, our method employs a simple and effective diffusion paradigm to alleviate the noisy effect in the latent space.
We provide a theoretical proof of the generalization error bound of DRGO as well as a theoretical analysis of how our approach mitigates noisy sample effects.
arXiv Detail & Related papers (2025-01-26T15:07:52Z)
- Generative Dataset Distillation Based on Self-knowledge Distillation [49.20086587208214]
We present a novel generative dataset distillation method that improves the accuracy of prediction-logit alignment.
Our approach integrates self-knowledge distillation to achieve more precise distribution matching between the synthetic and original data.
Our method outperforms existing state-of-the-art methods, resulting in superior distillation performance.
arXiv Detail & Related papers (2025-01-08T00:43:31Z)
- Iso-Diffusion: Improving Diffusion Probabilistic Models Using the Isotropy of the Additive Gaussian Noise [0.0]
We show how to use the isotropy of the additive noise as a constraint on the objective function to enhance the fidelity of DDPMs.
Our approach is simple and can be applied to any DDPM variant.
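The summary does not spell out the exact constraint, so the following is one plausible, hedged reading: alongside the standard DDPM noise-prediction MSE, penalize the predicted noise for departing from zero-mean, identity-covariance (isotropic) Gaussian statistics. The penalty form and weight are assumptions of this sketch.

```python
import numpy as np

def ddpm_iso_loss(eps_true, eps_pred, weight=0.1):
    """DDPM noise-prediction MSE plus an isotropy penalty on predicted noise.

    The penalty pushes the predicted noise batch toward zero mean and identity
    covariance, mirroring the isotropy of the true additive Gaussian noise.
    """
    mse = np.mean((eps_true - eps_pred) ** 2)
    centered = eps_pred - eps_pred.mean(axis=0, keepdims=True)
    cov = centered.T @ centered / eps_pred.shape[0]
    iso_penalty = np.mean((cov - np.eye(cov.shape[0])) ** 2)
    return mse + weight * iso_penalty

# Toy batch: 128 samples of 16-dimensional noise.
rng = np.random.default_rng(1)
eps = rng.standard_normal((128, 16))
print(ddpm_iso_loss(eps, eps + 0.1 * rng.standard_normal((128, 16))))
```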
arXiv Detail & Related papers (2024-03-25T14:05:52Z)
- Learning with Noisy Foundation Models [95.50968225050012]
This paper is the first work to comprehensively understand and analyze the nature of noise in pre-training datasets.
We propose a tuning method (NMTune) that applies an affine transformation to the feature space to mitigate the malignant effect of noise and improve generalization.
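The summary only says that the feature space is transformed affinely. As a hedged stand-in, the sketch below centers and whitens frozen pre-trained features with a closed-form affine map; NMTune's actual tuning objectives are richer and not reproduced here.

```python
import numpy as np

def affine_whiten(F, eps=1e-5):
    """Closed-form affine map (center + whiten) over frozen features.

    Illustrates 'affinely transforming the feature space'; this is not
    NMTune's actual objective, which the summary does not specify.
    """
    mean = F.mean(axis=0)
    cov = np.cov(F - mean, rowvar=False)
    vals, vecs = np.linalg.eigh(cov)
    W = vecs @ np.diag(1.0 / np.sqrt(vals + eps)) @ vecs.T
    return (F - mean) @ W

# Toy usage: 256 samples of 64-dim features with correlated noise.
rng = np.random.default_rng(0)
F = rng.standard_normal((256, 64)) @ rng.standard_normal((64, 64))
print(np.round(np.cov(affine_whiten(F), rowvar=False)[:2, :2], 2))  # ~ identity
```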
arXiv Detail & Related papers (2024-03-11T16:22:41Z)
- Fine-tuning Pre-trained Models for Robustness Under Noisy Labels [34.68018860186995]
The presence of noisy labels in a training dataset can significantly impact the performance of machine learning models.
We introduce a novel algorithm called TURN, which robustly and efficiently transfers the prior knowledge of pre-trained models.
arXiv Detail & Related papers (2023-10-24T20:28:59Z)
- AST: Effective Dataset Distillation through Alignment with Smooth and High-Quality Expert Trajectories [18.266786462036553]
We propose an effective DD framework named AST, standing for Alignment with Smooth and high-quality expert Trajectories.
We conduct extensive experiments on datasets of different scales, sizes, and resolutions.
arXiv Detail & Related papers (2023-10-16T16:13:53Z)
- Noisy-ArcMix: Additive Noisy Angular Margin Loss Combined With Mixup for Anomalous Sound Detection [5.1308092683559225]
Unsupervised anomalous sound detection (ASD) aims to identify anomalous sounds by learning the features of normal operational sounds and sensing their deviations.
Recent approaches have focused on the self-supervised task utilizing the classification of normal data, and advanced models have shown that securing representation space for anomalous data is important.
We propose a training technique aimed at ensuring intra-class compactness and increasing the angle gap between normal and abnormal samples.
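As a hedged sketch of the loss family involved: an ArcFace-style additive angular margin (which enforces intra-class compactness by widening target-class angles) combined with input mixup. The margin, scale, and Beta parameter are illustrative; the paper's exact noisy-margin formulation is not reproduced here.

```python
import numpy as np

def margin_cosine(emb, centers, target, margin=0.5, scale=30.0):
    """Additive angular margin logits: add `margin` to each target-class angle."""
    e = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    c = centers / np.linalg.norm(centers, axis=1, keepdims=True)
    cos = np.clip(e @ c.T, -1.0, 1.0)
    rows = np.arange(len(target))
    cos[rows, target] = np.cos(np.arccos(cos[rows, target]) + margin)
    return scale * cos

def mixup(x1, x2, rng, alpha=0.2):
    """Convex combination of two inputs with a Beta-distributed weight."""
    lam = rng.beta(alpha, alpha)
    return lam * x1 + (1.0 - lam) * x2, lam

# Toy usage: four 8-dim embeddings, two classes.
rng = np.random.default_rng(0)
emb = rng.standard_normal((4, 8))
centers = rng.standard_normal((2, 8))
print(margin_cosine(emb, centers, np.array([0, 1, 0, 1])).shape)  # (4, 2)
```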
arXiv Detail & Related papers (2023-10-10T07:04:36Z)
- Directed Acyclic Graph Factorization Machines for CTR Prediction via Knowledge Distillation [65.62538699160085]
We propose a Directed Acyclic Graph Factorization Machine (KD-DAGFM) to learn the high-order feature interactions from existing complex interaction models for CTR prediction via Knowledge Distillation.
KD-DAGFM achieves the best performance with less than 21.5% FLOPs of the state-of-the-art method on both online and offline experiments.
arXiv Detail & Related papers (2022-11-21T03:09:42Z)
- Boosting Facial Expression Recognition by A Semi-Supervised Progressive Teacher [54.50747989860957]
We propose a semi-supervised learning algorithm named Progressive Teacher (PT) to utilize reliable FER datasets as well as large-scale unlabeled expression images for effective training.
Experiments on widely-used databases RAF-DB and FERPlus validate the effectiveness of our method, which achieves state-of-the-art performance with accuracy of 89.57% on RAF-DB.
arXiv Detail & Related papers (2022-05-28T07:47:53Z)
- Scale-Equivalent Distillation for Semi-Supervised Object Detection [57.59525453301374]
Recent Semi-Supervised Object Detection (SS-OD) methods are mainly based on self-training, generating hard pseudo-labels by a teacher model on unlabeled data as supervisory signals.
We analyze the challenges these methods face, supported by empirical experiment results.
We introduce a novel approach, Scale-Equivalent Distillation (SED), which is a simple yet effective end-to-end knowledge distillation framework robust to large object size variance and class imbalance.
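SED's scale-equivalent components are not detailed in the summary, so for orientation the sketch below shows only the generic temperature-softened distillation term that such teacher-student frameworks build on; it is not SED itself.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def distillation_loss(student_logits, teacher_logits, T=2.0):
    """KL divergence between temperature-softened teacher and student outputs."""
    p_t = softmax(teacher_logits / T)
    p_s = softmax(student_logits / T)
    kl = np.sum(p_t * (np.log(p_t + 1e-12) - np.log(p_s + 1e-12)), axis=-1)
    return (T ** 2) * np.mean(kl)

# Toy usage: a batch of 8 examples over 10 classes.
rng = np.random.default_rng(0)
t = rng.standard_normal((8, 10))
print(distillation_loss(t + 0.5 * rng.standard_normal((8, 10)), t))
```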
arXiv Detail & Related papers (2022-03-23T07:33:37Z)
- Multiplicative noise and heavy tails in stochastic optimization [62.993432503309485]
Stochastic optimization is central to modern machine learning, but the precise role of its stochasticity in that success remains unclear.
We show that heavy-tailed behaviour commonly arises in the parameters due to multiplicative noise from minibatch variance.
A detailed analysis shows that key factors, including step size, batch size, and data properties, all exhibit similar effects on state-of-the-art neural network models.
arXiv Detail & Related papers (2020-06-11T09:58:01Z)
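The headline claim of that last entry can be reproduced in miniature with a scalar Kesten-type recursion: multiplicative and additive noise that are both light-tailed, contraction on average, yet a heavy-tailed stationary law. The constants below are my own, chosen so the tail index falls below 4 and the sample kurtosis blows up.

```python
import numpy as np

rng = np.random.default_rng(2)

# Kesten recursion x_{t+1} = a_t * x_t + b_t.  Here E[log a] < 0 (contraction
# on average) but P(a > 1) > 0, which is exactly the multiplicative-noise
# regime that produces power-law tails in the stationary distribution.
n_chains, n_steps = 100_000, 200
x = np.zeros(n_chains)
for _ in range(n_steps):
    a = rng.uniform(0.1, 1.5, size=n_chains)
    b = rng.standard_normal(n_chains)
    x = a * x + b

# Excess kurtosis far above 0 signals tails much heavier than Gaussian.
kurt = np.mean((x - x.mean()) ** 4) / np.var(x) ** 2 - 3.0
print(f"excess kurtosis of stationary samples: {kurt:.1f}")
```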
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.