What is Left After Distillation? How Knowledge Transfer Impacts Fairness and Bias
- URL: http://arxiv.org/abs/2410.08407v1
- Date: Thu, 10 Oct 2024 22:43:00 GMT
- Title: What is Left After Distillation? How Knowledge Transfer Impacts Fairness and Bias
- Authors: Aida Mohammadshahi, Yani Ioannou
- Abstract summary: As many as 41% of the classes are statistically significantly affected by distillation when comparing class-wise accuracy.
This study highlights the uneven effects of Knowledge Distillation on certain classes and its potentially significant role in fairness.
- Score: 1.03590082373586
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Knowledge Distillation is a commonly used Deep Neural Network compression method, which often maintains overall generalization performance. However, we show that even for balanced image classification datasets, such as CIFAR-100, Tiny ImageNet and ImageNet, as many as 41% of the classes are statistically significantly affected by distillation when comparing class-wise accuracy (i.e. class bias) between a teacher/distilled student or distilled student/non-distilled student model. Changes in class bias are not necessarily an undesirable outcome when considered outside of the context of a model's usage. Using two common fairness metrics, Demographic Parity Difference (DPD) and Equalized Odds Difference (EOD) on models trained with the CelebA, Trifeature, and HateXplain datasets, our results suggest that increasing the distillation temperature improves the distilled student model's fairness -- for DPD, the distilled student even surpasses the fairness of the teacher model at high temperatures. This study highlights the uneven effects of Knowledge Distillation on certain classes and its potentially significant role in fairness, emphasizing that caution is warranted when using distilled models for sensitive application domains.
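As a concrete illustration of the quantities in the abstract, the sketch below shows where the distillation temperature enters a standard softened-softmax KD objective and how Demographic Parity Difference (DPD) and Equalized Odds Difference (EOD) can be computed for a binary task with a binary sensitive attribute. It is a minimal NumPy sketch under simplifying assumptions (binary labels and groups, hard 0/1 predictions); it is not the authors' evaluation code, and all function names are illustrative.

```python
import numpy as np

def softened_probs(logits, temperature):
    """Softmax of logits divided by the distillation temperature T."""
    z = logits / temperature
    z = z - z.max(axis=1, keepdims=True)   # numerical stability
    p = np.exp(z)
    return p / p.sum(axis=1, keepdims=True)

def kd_loss(student_logits, teacher_logits, temperature):
    """Hinton-style distillation term: T^2 * KL(teacher_T || student_T)."""
    p_t = softened_probs(teacher_logits, temperature)
    p_s = softened_probs(student_logits, temperature)
    kl = (p_t * (np.log(p_t + 1e-12) - np.log(p_s + 1e-12))).sum(axis=1).mean()
    return temperature ** 2 * kl

def demographic_parity_difference(y_pred, group):
    """|P(yhat = 1 | group = 0) - P(yhat = 1 | group = 1)| for 0/1 predictions."""
    return abs(y_pred[group == 0].mean() - y_pred[group == 1].mean())

def equalized_odds_difference(y_true, y_pred, group):
    """Largest gap across the two groups in true-positive and false-positive rates."""
    def mean_pred(label, g):
        mask = (y_true == label) & (group == g)
        return y_pred[mask].mean() if mask.any() else 0.0
    tpr_gap = abs(mean_pred(1, 0) - mean_pred(1, 1))   # TPR gap
    fpr_gap = abs(mean_pred(0, 0) - mean_pred(0, 1))   # FPR gap
    return max(tpr_gap, fpr_gap)
```

With exactly two groups these reduce to the definitions used by common fairness toolkits; repeating student training at several values of `temperature` and re-computing DPD and EOD on held-out data mirrors the experiment summarized above.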
Related papers
- Warmup-Distill: Bridge the Distribution Mismatch between Teacher and Student before Knowledge Distillation [84.38105530043741]
We propose Warmup-Distill, which aligns the student's distribution with the teacher's before distillation begins.
Experiments on seven benchmarks demonstrate that Warmup-Distill yields a warmed-up student that is better suited to distillation.
arXiv Detail & Related papers (2025-02-17T12:58:12Z)
- Knowledge Distillation with Refined Logits [31.205248790623703]
We introduce Refined Logit Distillation (RLD) to address the limitations of current logit distillation methods.
Our approach is motivated by the observation that even high-performing teacher models can make incorrect predictions.
Our method can effectively eliminate misleading information from the teacher while preserving crucial class correlations.
arXiv Detail & Related papers (2024-08-14T17:59:32Z)
- Logit Standardization in Knowledge Distillation [83.31794439964033]
The assumption of a shared temperature between teacher and student forces an exact match between their logits in range and variance.
We propose setting the temperature to the weighted standard deviation of the logits and applying a plug-and-play Z-score pre-processing step that standardizes them.
This pre-processing lets the student focus on the teacher's essential logit relations rather than on matching magnitudes, and it improves existing logit-based distillation methods. (A minimal sketch of this standardization follows the entry.)
arXiv Detail & Related papers (2024-03-03T07:54:03Z)
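A minimal sketch of the Z-score idea described in the entry above: teacher and student logits are standardized per sample (zero mean, unit variance) before the softened softmax, so the student needs to match the teacher's logit relations rather than its absolute logit scale. This is a hedged PyTorch sketch, not the paper's reference implementation; the base-temperature handling and all names are assumptions.

```python
import torch
import torch.nn.functional as F

def zscore_standardize(logits, eps=1e-7):
    """Per-sample Z-score: remove each sample's logit mean and scale by its std."""
    mean = logits.mean(dim=-1, keepdim=True)
    std = logits.std(dim=-1, keepdim=True)
    return (logits - mean) / (std + eps)

def standardized_kd_loss(student_logits, teacher_logits, base_temperature=2.0):
    """KL divergence between softened softmaxes of standardized logits."""
    z_s = zscore_standardize(student_logits) / base_temperature
    z_t = zscore_standardize(teacher_logits) / base_temperature
    return F.kl_div(F.log_softmax(z_s, dim=-1),
                    F.softmax(z_t, dim=-1),
                    reduction="batchmean") * base_temperature ** 2
```

The squared-temperature factor keeps the gradient magnitude of the distillation term comparable to a hard-label cross-entropy term, as in standard distillation.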
- Unbiased Knowledge Distillation for Recommendation [66.82575287129728]
Knowledge distillation (KD) has been applied in recommender systems (RS) to reduce inference latency.
Traditional solutions first train a full teacher model from the training data, and then transfer its knowledge to supervise the learning of a compact student model.
We find that this standard distillation paradigm incurs a serious bias issue: popular items are recommended even more heavily after distillation.
arXiv Detail & Related papers (2022-11-27T05:14:03Z) - Unified and Effective Ensemble Knowledge Distillation [92.67156911466397]
Ensemble knowledge distillation can extract knowledge from multiple teacher models and encode it into a single student model.
Many existing methods learn and distill the student model on labeled data only.
We propose a unified and effective ensemble knowledge distillation method that distills a single student model from an ensemble of teacher models on both labeled and unlabeled data. (A rough sketch of this setup follows the entry.)
arXiv Detail & Related papers (2022-04-01T16:15:39Z)
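The ensemble entry above comes down to supervising one student with soft targets aggregated from several teachers, with a label term only where labels exist. The PyTorch sketch below uses a simple unweighted average of teacher probabilities; the paper's actual aggregation and treatment of unlabeled data are more elaborate, and every name here is illustrative.

```python
import torch
import torch.nn.functional as F

def ensemble_soft_targets(teacher_logits_list, temperature=4.0):
    """Average the teachers' softened probability distributions."""
    probs = [F.softmax(t / temperature, dim=-1) for t in teacher_logits_list]
    return torch.stack(probs, dim=0).mean(dim=0)

def ensemble_kd_loss(student_logits, teacher_logits_list, labels=None,
                     temperature=4.0, alpha=0.5):
    """Distillation on every sample; cross-entropy only where labels exist."""
    soft = ensemble_soft_targets(teacher_logits_list, temperature)
    log_p_s = F.log_softmax(student_logits / temperature, dim=-1)
    kd = F.kl_div(log_p_s, soft, reduction="batchmean") * temperature ** 2
    if labels is None:                      # unlabeled batch: KD term only
        return kd
    ce = F.cross_entropy(student_logits, labels)
    return alpha * ce + (1.0 - alpha) * kd
```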
- LTD: Low Temperature Distillation for Robust Adversarial Training [1.3300217947936062]
Adversarial training has been widely used to enhance the robustness of neural network models against adversarial attacks.
However, a significant gap remains between the natural and robust accuracy of these models.
We propose Low Temperature Distillation (LTD), a novel method that generates soft labels using a modified knowledge distillation framework.
arXiv Detail & Related papers (2021-11-03T16:26:00Z)
- Categorical Relation-Preserving Contrastive Knowledge Distillation for Medical Image Classification [75.27973258196934]
We propose a novel Categorical Relation-preserving Contrastive Knowledge Distillation (CRCKD) algorithm, which takes the commonly used mean-teacher model as the supervisor.
With this regularization, the feature distribution of the student model shows higher intra-class similarity and inter-class variance.
With the contribution of the CCD and CRP, our CRCKD algorithm can distill the relational knowledge more comprehensively.
arXiv Detail & Related papers (2021-07-07T13:56:38Z)
- Teacher's pet: understanding and mitigating biases in distillation [61.44867470297283]
Several works have shown that distillation significantly boosts the student's overall performance.
However, are these gains uniform across all data subgroups?
We show that distillation can harm performance on certain subgroups.
We present techniques that soften the teacher's influence on subgroups where it is less reliable. (A rough sketch of this idea follows the entry.)
arXiv Detail & Related papers (2021-06-19T13:06:25Z)
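One simple way to act on "soften the teacher influence for subgroups where it is less reliable" is to down-weight the distillation term for examples from subgroups on which the teacher performs poorly. The PyTorch sketch below does exactly that, deriving per-subgroup weights from teacher accuracy on a validation split; it illustrates the general idea rather than the paper's specific technique, and all names and the weighting rule are assumptions.

```python
import torch
import torch.nn.functional as F

def subgroup_teacher_weights(teacher_preds, labels, groups, num_groups, floor=0.2):
    """Weight each subgroup by the teacher's validation accuracy on it."""
    weights = torch.ones(num_groups)
    for g in range(num_groups):
        mask = groups == g
        if mask.any():
            acc = (teacher_preds[mask] == labels[mask]).float().mean()
            weights[g] = torch.clamp(acc, min=floor)   # never fully trust or ignore
    return weights

def weighted_kd_loss(student_logits, teacher_logits, labels, groups,
                     group_weights, temperature=4.0, alpha=0.5):
    """Per-example KD term scaled by the teacher's reliability on that subgroup."""
    log_p_s = F.log_softmax(student_logits / temperature, dim=-1)
    p_t = F.softmax(teacher_logits / temperature, dim=-1)
    kd_per_example = (p_t * (p_t.clamp_min(1e-12).log() - log_p_s)).sum(dim=-1)
    w = group_weights[groups]                          # per-example subgroup weights
    kd = (w * kd_per_example).mean() * temperature ** 2
    ce = F.cross_entropy(student_logits, labels)
    return alpha * ce + (1.0 - alpha) * kd
```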
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information shown and accepts no responsibility for any consequences of its use.