Robust Distillation for Worst-class Performance
- URL: http://arxiv.org/abs/2206.06479v1
- Date: Mon, 13 Jun 2022 21:17:00 GMT
- Title: Robust Distillation for Worst-class Performance
- Authors: Serena Wang and Harikrishna Narasimhan and Yichen Zhou and Sara Hooker
and Michal Lukasik and Aditya Krishna Menon
- Abstract summary: We develop distillation techniques that are tailored to improve the student's worst-class performance.
We show empirically that our robust distillation techniques achieve better worst-class performance.
We provide insights into what makes a good teacher when the goal is to train a robust student.
- Score: 38.80008602644002
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Knowledge distillation has proven to be an effective technique in improving
the performance of a student model using predictions from a teacher model.
However, recent work has shown that gains in average efficiency are not uniform
across subgroups in the data, and in particular can often come at the cost of
accuracy on rare subgroups and classes. To preserve strong performance across
classes that may follow a long-tailed distribution, we develop distillation
techniques that are tailored to improve the student's worst-class performance.
Specifically, we introduce robust optimization objectives in different
combinations for the teacher and student, and further allow for training with
any tradeoff between the overall accuracy and the robust worst-class objective.
We show empirically that our robust distillation techniques not only achieve
better worst-class performance, but also lead to Pareto improvement in the
tradeoff between overall performance and worst-class performance compared to
other baseline methods. Theoretically, we provide insights into what makes a
good teacher when the goal is to train a robust student.
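To make the setup concrete, here is a minimal sketch of the kind of objective described above: a student loss that trades off an average temperature-scaled distillation term against a group-DRO-style worst-class term. This is an illustration only, not the paper's exact algorithm; the names (`robust_student_loss`, `alpha`, `eta`) and the exponentiated-gradient weight update are assumptions made for this sketch.
```python
# Illustrative sketch only: a worst-class-aware student objective that mixes a
# standard temperature-scaled distillation loss with a group-DRO-style term.
# All names (alpha, eta, the exponentiated-gradient update) are assumptions
# made for this sketch, not the paper's exact algorithm.
import torch
import torch.nn.functional as F


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Per-example KL divergence between softened teacher and student outputs."""
    log_p_student = F.log_softmax(student_logits / temperature, dim=-1)
    p_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    kl = F.kl_div(log_p_student, p_teacher, reduction="none").sum(dim=-1)
    return kl * temperature ** 2


def per_class_average(example_losses, labels, num_classes):
    """Average the per-example losses within each ground-truth class."""
    device = example_losses.device
    sums = torch.zeros(num_classes, device=device).scatter_add(0, labels, example_losses)
    counts = torch.zeros(num_classes, device=device).scatter_add(
        0, labels, torch.ones_like(example_losses))
    return sums / counts.clamp(min=1.0)


def robust_student_loss(student_logits, teacher_logits, labels, class_weights,
                        num_classes, alpha=0.5, eta=0.1):
    """Trade off the average distillation loss against a worst-class objective.

    `class_weights` lives on the simplex and is pushed toward the classes with
    the highest loss via an exponentiated-gradient step, so alpha=0 recovers
    plain distillation and alpha=1 focuses entirely on the worst classes.
    """
    example_losses = distillation_loss(student_logits, teacher_logits)
    class_losses = per_class_average(example_losses, labels, num_classes)

    with torch.no_grad():  # the adversarial weights are not differentiated through
        class_weights = class_weights * torch.exp(eta * class_losses)
        class_weights = class_weights / class_weights.sum()

    robust_term = (class_weights * class_losses).sum()
    average_term = example_losses.mean()
    return (1 - alpha) * average_term + alpha * robust_term, class_weights
```
In a training loop, one would initialize `class_weights` uniformly, keep it across minibatches, and feed the returned weights back in at the next step; analogous robust objectives can also be applied on the teacher side.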
Related papers
- Towards Fairness-Aware Adversarial Learning [13.932705960012846]
We propose a novel learning paradigm, named Fairness-Aware Adversarial Learning (FAAL).
Our method aims to find the worst-case distribution among the different categories, and the resulting solution is guaranteed to attain the upper-bound performance with high probability.
In particular, FAAL can fine-tune an unfair robust model to be fair within only two epochs, without compromising the overall clean and robust accuracies.
arXiv Detail & Related papers (2024-02-27T18:01:59Z) - Understanding the Detrimental Class-level Effects of Data Augmentation [63.1733767714073]
Achieving optimal average accuracy with data augmentation (DA) can come at the cost of significantly hurting individual class accuracy by as much as 20% on ImageNet.
We present a framework for understanding how DA interacts with class-level learning dynamics.
We show that simple class-conditional augmentation strategies improve performance on the negatively affected classes.
arXiv Detail & Related papers (2023-12-07T18:37:43Z) - Tailoring Instructions to Student's Learning Levels Boosts Knowledge Distillation [52.53446712834569]
Learning Good Teacher Matters (LGTM) is an efficient training technique for incorporating distillation influence into the teacher's learning process.
Our LGTM outperforms 10 common knowledge distillation baselines on 6 text classification tasks in the GLUE benchmark.
arXiv Detail & Related papers (2023-05-16T17:50:09Z) - DisWOT: Student Architecture Search for Distillation WithOut Training [0.0]
We explore a novel training-free framework to search for the best student architectures for a given teacher.
Our work first shows empirically that the optimal model under vanilla training cannot be the winner in distillation.
Our experiments on CIFAR, ImageNet and NAS-Bench-201 demonstrate that our technique achieves state-of-the-art results on different search spaces.
arXiv Detail & Related papers (2023-03-28T01:58:45Z) - Efficient Knowledge Distillation from Model Checkpoints [36.329429655242535]
We show that a weak snapshot ensemble of several intermediate models from the same training trajectory can outperform a strong ensemble of independently trained and fully converged models.
We propose an optimal intermediate teacher selection algorithm based on maximizing the total task-related mutual information.
arXiv Detail & Related papers (2022-10-12T17:55:30Z) - Better Teacher Better Student: Dynamic Prior Knowledge for Knowledge Distillation [70.92135839545314]
We propose dynamic prior knowledge (DPK), which integrates part of the teacher's features as prior knowledge before feature distillation.
Our DPK makes the performance of the student model positively correlated with that of the teacher model, which means that we can further boost the accuracy of students by applying larger teachers.
arXiv Detail & Related papers (2022-06-13T11:52:13Z) - On the benefits of knowledge distillation for adversarial robustness [53.41196727255314]
We show that knowledge distillation can be used directly to boost the performance of state-of-the-art models in adversarial robustness.
We present Adversarial Knowledge Distillation (AKD), a new framework to improve a model's robust performance.
arXiv Detail & Related papers (2022-03-14T15:02:13Z) - Teacher's pet: understanding and mitigating biases in distillation [61.44867470297283]
Several works have shown that distillation significantly boosts the student's overall performance.
However, are these gains uniform across all data subgroups?
We show that distillation can harm performance on certain subgroups.
We present techniques which soften the teacher influence for subgroups where it is less reliable.
arXiv Detail & Related papers (2021-06-19T13:06:25Z)
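The "Teacher's pet" entry above describes softening the teacher's influence on subgroups where it is less reliable. Below is a hypothetical way to picture that idea (the names `blend_targets` and `subgroup_reliability`, and the exact weighting scheme, are assumptions of this sketch, not that paper's method): interpolate each example's target between the teacher's soft prediction and the one-hot label according to a per-subgroup reliability score.
```python
# Hypothetical illustration: interpolate the distillation target between the
# teacher's soft prediction and the one-hot label, with a weight given by a
# per-subgroup reliability score (e.g. the teacher's held-out accuracy on that
# subgroup). Names and the weighting scheme are assumptions of this sketch.
import torch
import torch.nn.functional as F


def blend_targets(teacher_logits, labels, subgroup_ids, subgroup_reliability,
                  num_classes, temperature=1.0):
    """Per-example target: w * teacher soft label + (1 - w) * one-hot label."""
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    one_hot = F.one_hot(labels, num_classes).float()
    w = subgroup_reliability[subgroup_ids].unsqueeze(-1)  # reliability in [0, 1]
    return w * teacher_probs + (1.0 - w) * one_hot


def student_loss(student_logits, targets):
    """Cross-entropy of the student against the blended (soft) targets."""
    return -(targets * F.log_softmax(student_logits, dim=-1)).sum(dim=-1).mean()
```
Subgroups where the teacher is unreliable then fall back toward the ground-truth labels instead of the teacher's predictions.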
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.