Revisiting Knowledge Distillation under Distribution Shift
- URL: http://arxiv.org/abs/2312.16242v2
- Date: Sun, 7 Jan 2024 08:52:37 GMT
- Title: Revisiting Knowledge Distillation under Distribution Shift
- Authors: Songming Zhang and Ziyu Lyu and Xiaofeng Chen
- Abstract summary: We study the mechanism of knowledge distillation against distribution shift.
We propose a unified and systematic framework to benchmark knowledge distillation against two general distributional shifts.
We reveal intriguing observations of poor teaching performance under distribution shifts.
- Score: 7.796685962570969
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Knowledge distillation transfers knowledge from large models into
small models and has recently achieved remarkable success. However, few
studies have investigated the mechanism of knowledge distillation under
distribution shift, which refers to the drift in data distribution between
the training and testing phases. In this paper, we reconsider the paradigm of
knowledge distillation by reformulating the objective function for shift
situations. To reflect real-world scenarios, we propose a unified and
systematic framework to benchmark knowledge distillation against two general
types of distribution shift: diversity shift and correlation shift. The
evaluation benchmark covers more than 30 methods from algorithmic,
data-driven, and optimization perspectives on five benchmark datasets.
Overall, we conduct extensive experiments on the student model. We reveal
intriguing observations of poor teaching performance under distribution
shifts; in particular, complex algorithms and data augmentation offer limited
gains in many cases.
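For context, the vanilla knowledge distillation objective that the paper reformulates for shift situations combines a hard-label cross-entropy term with a temperature-scaled KL term between teacher and student predictions. The sketch below is a minimal illustration in PyTorch under that standard formulation; the function name and the hyperparameters `T` and `alpha` are illustrative assumptions, and it does not reproduce the paper's shift-aware reformulation or benchmark.

```python
# Minimal sketch of the standard knowledge distillation loss (soft targets),
# shown only to make the distillation paradigm concrete.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      labels: torch.Tensor,
                      T: float = 4.0,       # temperature (illustrative value)
                      alpha: float = 0.5    # weight on the hard-label term
                      ) -> torch.Tensor:
    # Supervised term on the ground-truth labels.
    ce = F.cross_entropy(student_logits, labels)
    # Distillation term: match softened student predictions to softened
    # teacher predictions via KL divergence.
    kl = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)  # T^2 keeps the gradient scale comparable to the CE term
    return alpha * ce + (1.0 - alpha) * kl

# Example usage with random tensors (batch of 8, 10 classes):
# s, t = torch.randn(8, 10), torch.randn(8, 10)
# y = torch.randint(0, 10, (8,))
# loss = distillation_loss(s, t, y)
```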
Related papers
- Harnessing the Power of Vicinity-Informed Analysis for Classification under Covariate Shift [9.530897053573186]
Transfer learning enhances prediction accuracy on a target distribution by leveraging data from a source distribution.
This paper introduces a novel dissimilarity measure that utilizes vicinity information, i.e., the local structure of data points.
We characterize the excess error using the proposed measure and demonstrate faster or competitive convergence rates compared to previous techniques.
arXiv Detail & Related papers (2024-05-27T07:55:27Z) - Leveraging Diffusion Disentangled Representations to Mitigate Shortcuts
in Underspecified Visual Tasks [92.32670915472099]
We propose an ensemble diversification framework exploiting the generation of synthetic counterfactuals using Diffusion Probabilistic Models (DPMs).
We show that diffusion-guided diversification can lead models to avert attention from shortcut cues, achieving ensemble diversity performance comparable to previous methods requiring additional data collection.
arXiv Detail & Related papers (2023-10-03T17:37:52Z) - Sharpness & Shift-Aware Self-Supervised Learning [17.978849280772092]
Self-supervised learning aims to extract meaningful features from unlabeled data for further downstream tasks.
We develop rigorous theories to identify the factors that implicitly influence the general loss of this classification task.
We conduct extensive experiments to verify our theoretical findings and demonstrate that sharpness & shift-aware contrastive learning can remarkably boost performance.
arXiv Detail & Related papers (2023-05-17T14:42:16Z) - Towards Effective Collaborative Learning in Long-Tailed Recognition [16.202524991074416]
Real-world data usually suffers from severe class imbalance and long-tailed distributions, where minority classes are significantly underrepresented.
Recent research prefers to utilize multi-expert architectures to mitigate model uncertainty on the minority classes.
In this paper, we observe that the knowledge transfer between experts is imbalanced in terms of class distribution, which results in limited performance improvement of the minority classes.
arXiv Detail & Related papers (2023-05-05T09:16:06Z) - Variational Distillation for Multi-View Learning [104.17551354374821]
We design several variational information bottlenecks to exploit two key characteristics for multi-view representation learning.
Under a rigorous theoretical guarantee, our approach enables IB to grasp the intrinsic correlation between observations and semantic labels.
arXiv Detail & Related papers (2022-06-20T03:09:46Z) - An Empirical Study on Distribution Shift Robustness From the Perspective
of Pre-Training and Data Augmentation [91.62129090006745]
This paper studies the distribution shift problem from the perspective of pre-training and data augmentation.
We provide the first comprehensive empirical study focusing on pre-training and data augmentation.
arXiv Detail & Related papers (2022-05-25T13:04:53Z) - Robust Generalization despite Distribution Shift via Minimum
Discriminating Information [46.164498176119665]
We introduce a modeling framework where, in addition to training data, we have partial structural knowledge of the shifted test distribution.
We employ the principle of minimum discriminating information to embed the available prior knowledge.
We obtain explicit generalization bounds with respect to the unknown shifted distribution.
arXiv Detail & Related papers (2021-06-08T15:25:35Z) - Unsupervised Transfer Learning for Spatiotemporal Predictive Networks [90.67309545798224]
We study how to transfer knowledge from a zoo of unsupervisedly learned models towards another network.
Our motivation is that models are expected to understand complex dynamics from different sources.
Our approach yields significant improvements on three benchmarks for spatiotemporal prediction, and benefits the target task even from less relevant ones.
arXiv Detail & Related papers (2020-09-24T15:40:55Z) - Learning Diverse Representations for Fast Adaptation to Distribution
Shift [78.83747601814669]
We present a method for learning multiple models, incorporating an objective that pressures each to learn a distinct way to solve the task.
We demonstrate our framework's ability to facilitate rapid adaptation to distribution shift.
arXiv Detail & Related papers (2020-06-12T12:23:50Z) - Learning From Multiple Experts: Self-paced Knowledge Distillation for
Long-tailed Classification [106.08067870620218]
We propose a self-paced knowledge distillation framework, termed Learning From Multiple Experts (LFME).
We refer to these models as 'Experts', and the proposed LFME framework aggregates the knowledge from multiple 'Experts' to learn a unified student model.
We conduct extensive experiments and demonstrate that our method is able to achieve superior performances compared to state-of-the-art methods.
arXiv Detail & Related papers (2020-01-06T12:57:36Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences arising from its use.