Improving Knowledge Distillation Under Unknown Covariate Shift Through Confidence-Guided Data Augmentation
- URL: http://arxiv.org/abs/2506.02294v2
- Date: Wed, 04 Jun 2025 01:55:38 GMT
- Title: Improving Knowledge Distillation Under Unknown Covariate Shift Through Confidence-Guided Data Augmentation
- Authors: Niclas Popp, Kevin Alexander Laube, Matthias Hein, Lukas Schott
- Abstract summary: Knowledge distillation has become an established tool for transferring knowledge from foundation models to small student networks. This work addresses the common practical issue of covariate shift in knowledge distillation, where spurious features appear during training but not at test time. We introduce a novel diffusion-based data augmentation strategy that generates images by maximizing the disagreement between the teacher and the student. Experiments demonstrate that our approach significantly improves worst-group and mean-group accuracy on CelebA and SpuCo Birds, as well as the spurious mAUC on Spurious ImageNet.
- Score: 29.552309706623138
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: Large foundation models trained on extensive datasets demonstrate strong zero-shot capabilities in various domains. To replicate their success when data and model size are constrained, knowledge distillation has become an established tool for transferring knowledge from foundation models to small student networks. However, the effectiveness of distillation is critically limited by the available training data. This work addresses the common practical issue of covariate shift in knowledge distillation, where spurious features appear during training but not at test time. We ask the question: when these spurious features are unknown, yet a robust teacher is available, is it possible for a student to also become robust to them? We address this problem by introducing a novel diffusion-based data augmentation strategy that generates images by maximizing the disagreement between the teacher and the student, effectively creating challenging samples that the student struggles with. Experiments demonstrate that our approach significantly improves worst-group and mean-group accuracy on CelebA and SpuCo Birds, as well as the spurious mAUC on Spurious ImageNet under covariate shift, outperforming state-of-the-art diffusion-based data augmentation baselines.
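To make the core idea concrete, here is a minimal PyTorch sketch of how teacher-student disagreement can be scored and used to pick the most informative generated images. The `teacher` and `student` modules are hypothetical placeholders, and the actual method steers the diffusion process itself rather than merely filtering candidates as this sketch does.

```python
import torch
import torch.nn.functional as F

def disagreement_scores(teacher, student, images):
    """Per-image KL(teacher || student) over the softmax outputs."""
    with torch.no_grad():
        t_log_p = F.log_softmax(teacher(images), dim=-1)
        s_log_p = F.log_softmax(student(images), dim=-1)
        # KL(T || S) summed over classes gives one disagreement score per image.
        return (t_log_p.exp() * (t_log_p - s_log_p)).sum(dim=-1)

def select_hard_samples(teacher, student, candidates, k):
    """Keep the k generated images the student disagrees with the teacher on most."""
    scores = disagreement_scores(teacher, student, candidates)
    top = torch.topk(scores, k=min(k, candidates.size(0))).indices
    return candidates[top]
```

The selected samples would then be added to the distillation set for the next training round.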
Related papers
- Cross-Modal Distillation For Widely Differing Modalities [31.049823782188437]
We conduct multi-modal learning by introducing a teacher model to transfer discriminative knowledge to a student model during training. This knowledge transfer via distillation is not trivial because the big domain gap between the widely differing modalities can easily lead to overfitting. We propose two soft constrained knowledge distillation strategies at the feature level and a quality-based adaptive weights module to weigh input samples.
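A minimal sketch of feature-level distillation with per-sample quality weights, assuming both models expose same-dimensional feature vectors; the weighting rule is illustrative and not the paper's module.

```python
import torch
import torch.nn.functional as F

def weighted_feature_distillation(student_feats, teacher_feats, quality_scores):
    """Per-sample MSE between (B, D) feature vectors, weighted by quality.

    quality_scores: shape (B,), higher means the sample is more trustworthy.
    """
    per_sample = F.mse_loss(student_feats, teacher_feats, reduction="none").mean(dim=1)
    # Normalize the weights so the loss scale stays comparable to unweighted MSE.
    weights = quality_scores / quality_scores.sum().clamp_min(1e-8) * len(quality_scores)
    return (weights * per_sample).mean()
```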
arXiv Detail & Related papers (2025-07-22T07:34:00Z) - Learning from Stochastic Teacher Representations Using Student-Guided Knowledge Distillation [64.15918654558816]
A self-distillation (SSD) training strategy is introduced for filtering and weighting teacher representations so that only task-relevant representations are distilled. Experimental results on real-world affective computing, wearable/biosignal datasets from the UCR Archive, the HAR dataset, and image classification datasets show that the proposed SSD method can outperform state-of-the-art methods.
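As an illustration only (the exact filtering rule is not specified here), the sketch below weights stochastic teacher embeddings by their agreement with the student before distilling toward their weighted average.

```python
import torch
import torch.nn.functional as F

def student_guided_distillation(student_feat, teacher_feats, temperature=0.1):
    """student_feat: (B, D); teacher_feats: (K, B, D) from K stochastic passes."""
    # Agreement of each stochastic teacher pass with the current student.
    sims = F.cosine_similarity(teacher_feats, student_feat.unsqueeze(0), dim=-1)  # (K, B)
    weights = F.softmax(sims / temperature, dim=0).unsqueeze(-1)                  # (K, B, 1)
    target = (weights * teacher_feats).sum(dim=0)                                 # (B, D)
    return F.mse_loss(student_feat, target.detach())
```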
arXiv Detail & Related papers (2025-04-19T14:08:56Z) - Preserving Angles Improves Feature Distillation of Foundation Models [8.572967695281054]
A method for preserving similarities between a compressed teacher feature space and a student image model is presented. It is shown that, on a variety of CossNet datasets, the approach produces accurate students with greater robustness on detection benchmarks. This provides a competitive pathway for training on general detection benchmarks.
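A minimal sketch of an angle-preserving distillation objective, matching the pairwise cosine similarities of teacher and student feature batches; the names and loss form are assumptions, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def pairwise_cosine(feats):
    """Cosine-similarity matrix of a (B, D) batch of features."""
    normed = F.normalize(feats, dim=-1)
    return normed @ normed.t()

def angle_preserving_loss(student_feats, teacher_feats):
    """Penalize differences between student and teacher pairwise angles."""
    return F.mse_loss(pairwise_cosine(student_feats),
                      pairwise_cosine(teacher_feats))
```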
arXiv Detail & Related papers (2024-11-22T01:48:44Z) - PairCFR: Enhancing Model Training on Paired Counterfactually Augmented Data through Contrastive Learning [49.60634126342945]
Counterfactually Augmented Data (CAD) involves creating new data samples by applying minimal yet sufficient modifications to flip the label of existing data samples to other classes.
Recent research reveals that training with CAD may lead models to overly focus on modified features while ignoring other important contextual information.
We employ contrastive learning to promote global feature alignment in addition to learning counterfactual clues.
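A hedged sketch of how a contrastive term can be combined with the standard classification loss on a batch containing originals and their counterfactuals (which carry flipped labels); the exact loss used by PairCFR may differ.

```python
import torch
import torch.nn.functional as F

def supervised_contrastive(embeddings, labels, temperature=0.1):
    """Pull same-label embeddings together, push different-label ones apart."""
    z = F.normalize(embeddings, dim=-1)
    sim = z @ z.t() / temperature
    eye = torch.eye(len(z), dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(eye, float("-inf"))          # ignore self-similarity
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    same = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~eye
    pos_counts = same.sum(dim=1).clamp(min=1)
    return -(log_prob.masked_fill(~same, 0.0).sum(dim=1) / pos_counts).mean()

def paired_cad_loss(logits, embeddings, labels, alpha=0.5):
    """Cross-entropy plus a contrastive term over originals and counterfactuals."""
    return F.cross_entropy(logits, labels) + alpha * supervised_contrastive(embeddings, labels)
```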
arXiv Detail & Related papers (2024-06-09T07:29:55Z) - Data-Free Federated Class Incremental Learning with Diffusion-Based Generative Memory [27.651921957220004]
We introduce a novel data-free federated class incremental learning framework with diffusion-based generative memory (DFedDGM).
We design a new balanced sampler to help train the diffusion models to alleviate the common non-IID problem in FL.
We also introduce an entropy-based sample filtering technique from an information theory perspective to enhance the quality of generative samples.
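A minimal sketch of entropy-based filtering of generated samples, keeping only images on which a frozen classifier is confident; the threshold and interface are assumptions.

```python
import torch
import torch.nn.functional as F

def filter_by_entropy(model, images, max_entropy=1.0):
    """Keep generated images whose predictive entropy under `model` is low."""
    with torch.no_grad():
        probs = F.softmax(model(images), dim=-1)
        entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=-1)
    return images[entropy < max_entropy]
```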
arXiv Detail & Related papers (2024-05-22T20:59:18Z) - De-confounded Data-free Knowledge Distillation for Handling Distribution Shifts [32.1016787150064]
Data-Free Knowledge Distillation (DFKD) is a promising task to train high-performance small models to enhance actual deployment without relying on the original training data.
Existing methods commonly avoid relying on private data by utilizing synthetic or sampled data.
This paper proposes a novel perspective with causal inference to disentangle the student models from the impact of such shifts.
arXiv Detail & Related papers (2024-03-28T16:13:22Z) - Robustness-Reinforced Knowledge Distillation with Correlation Distance and Network Pruning [3.1423836318272773]
Knowledge distillation (KD) improves the performance of efficient and lightweight models. Most existing KD techniques rely on Kullback-Leibler (KL) divergence. We propose a Robustness-Reinforced Knowledge Distillation (R2KD) that leverages correlation distance and network pruning.
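For illustration, a correlation-distance distillation loss can be written as one minus the per-sample Pearson correlation between teacher and student logits; this is a sketch of the idea, not the R2KD implementation.

```python
import torch

def correlation_distance_loss(student_logits, teacher_logits, eps=1e-8):
    """Mean (1 - Pearson correlation) between paired logit vectors."""
    s = student_logits - student_logits.mean(dim=1, keepdim=True)
    t = teacher_logits - teacher_logits.mean(dim=1, keepdim=True)
    corr = (s * t).sum(dim=1) / (s.norm(dim=1) * t.norm(dim=1) + eps)
    return (1.0 - corr).mean()
```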
arXiv Detail & Related papers (2023-11-23T11:34:48Z) - Out of Thin Air: Exploring Data-Free Adversarial Robustness Distillation [26.744403789694758]
We propose Data-Free Adversarial Robustness Distillation (DFARD) to train small, easily deployable, robust models without relying on data.
Inspired by human education, we design a plug-and-play Interactive Temperature Adjustment (ITA) strategy to improve the efficiency of knowledge transfer.
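A hedged sketch of an adaptive-temperature distillation step; the schedule below simply shrinks the temperature as the teacher-student gap grows and is not the paper's ITA strategy.

```python
import torch
import torch.nn.functional as F

def adaptive_temperature_kd(student_logits, teacher_logits,
                            base_tau=4.0, min_tau=1.0):
    """KD loss whose temperature depends on the current teacher-student gap."""
    with torch.no_grad():
        gap = F.kl_div(F.log_softmax(student_logits, dim=-1),
                       F.softmax(teacher_logits, dim=-1),
                       reduction="batchmean")
        tau = torch.clamp(base_tau / (1.0 + gap), min=min_tau)
    return F.kl_div(F.log_softmax(student_logits / tau, dim=-1),
                    F.softmax(teacher_logits / tau, dim=-1),
                    reduction="batchmean") * tau * tau
```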
arXiv Detail & Related papers (2023-03-21T06:10:47Z) - On the benefits of knowledge distillation for adversarial robustness [53.41196727255314]
We show that knowledge distillation can be used directly to boost the performance of state-of-the-art models in adversarial robustness.
We present Adversarial Knowledge Distillation (AKD), a new framework to improve a model's robust performance.
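A minimal sketch of adversarial knowledge distillation: craft adversarial examples against the student (FGSM here, chosen only for brevity) and distill the teacher's soft predictions on the perturbed inputs.

```python
import torch
import torch.nn.functional as F

def fgsm(model, images, labels, eps=8 / 255):
    """One-step FGSM perturbation of `images` against `model`."""
    images = images.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(images), labels)
    grad, = torch.autograd.grad(loss, images)
    return (images + eps * grad.sign()).clamp(0, 1).detach()

def adversarial_distillation_loss(student, teacher, images, labels, tau=4.0):
    """Distill the teacher's predictions on adversarial examples of the student."""
    adv = fgsm(student, images, labels)
    with torch.no_grad():
        t_probs = F.softmax(teacher(adv) / tau, dim=-1)
    s_log_probs = F.log_softmax(student(adv) / tau, dim=-1)
    return F.kl_div(s_log_probs, t_probs, reduction="batchmean") * tau * tau
```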
arXiv Detail & Related papers (2022-03-14T15:02:13Z) - Adversarial Imitation Learning with Trajectorial Augmentation and Correction [61.924411952657756]
We introduce a novel augmentation method which preserves the success of the augmented trajectories.
We develop an adversarial data augmented imitation architecture to train an imitation agent using synthetic experts.
Experiments show that our data augmentation strategy can improve accuracy and convergence time of adversarial imitation.
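As a rough illustration (assuming a Gymnasium-style environment that reports success in its `info` dict), augmented trajectories can be kept only if replaying them still succeeds; the noise model and success check are assumptions.

```python
import numpy as np

def augment_trajectory(env, expert_actions, noise_scale=0.05, seed=None):
    """Replay noisy expert actions and keep the rollout only if it still succeeds."""
    rng = np.random.default_rng(seed)
    obs, _ = env.reset(seed=seed)
    states, actions, success = [], [], False
    for a in expert_actions:
        noisy = np.asarray(a) + rng.normal(scale=noise_scale, size=np.shape(a))
        states.append(obs)
        actions.append(noisy)
        obs, _, terminated, truncated, info = env.step(noisy)
        success = bool(info.get("success", False))
        if terminated or truncated:
            break
    return (states, actions) if success else None
```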
arXiv Detail & Related papers (2021-03-25T14:49:32Z) - MixKD: Towards Efficient Distillation of Large-scale Language Models [129.73786264834894]
We propose MixKD, a data-agnostic distillation framework, to endow the resulting model with stronger generalization ability.
We prove from a theoretical perspective that, under reasonable conditions, MixKD gives rise to a smaller gap between the generalization error and the empirical error.
Experiments under a limited-data setting and ablation studies further demonstrate the advantages of the proposed approach.
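A simplified sketch of mixup-based distillation in a limited-data setting; MixKD itself interpolates hidden representations of a language model, whereas this version mixes the inputs directly.

```python
import torch
import torch.nn.functional as F

def mixkd_step(student, teacher, inputs, alpha=0.4, tau=2.0):
    """Distill the teacher on mixup-interpolated inputs."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(inputs.size(0), device=inputs.device)
    mixed = lam * inputs + (1.0 - lam) * inputs[perm]
    with torch.no_grad():
        t_probs = F.softmax(teacher(mixed) / tau, dim=-1)
    s_log_probs = F.log_softmax(student(mixed) / tau, dim=-1)
    return F.kl_div(s_log_probs, t_probs, reduction="batchmean") * tau * tau
```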
arXiv Detail & Related papers (2020-11-01T18:47:51Z) - Neural Networks Are More Productive Teachers Than Human Raters: Active Mixup for Data-Efficient Knowledge Distillation from a Blackbox Model [57.41841346459995]
We study how to train a student deep neural network for visual recognition by distilling knowledge from a blackbox teacher model in a data-efficient manner.
We propose an approach that blends mixup and active learning.
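A hedged sketch of the active-mixup idea: synthesize mixup candidates from a few real images, then spend the limited query budget on the candidates the current student is least confident about, labeling them with the blackbox teacher. The budget handling and uncertainty measure are assumptions.

```python
import torch
import torch.nn.functional as F

def make_mixup_pool(images, num_candidates, alpha=1.0):
    """Create mixup candidates from a small pool of (N, C, H, W) real images."""
    i = torch.randint(len(images), (num_candidates,))
    j = torch.randint(len(images), (num_candidates,))
    lam = torch.distributions.Beta(alpha, alpha).sample((num_candidates, 1, 1, 1))
    return lam * images[i] + (1.0 - lam) * images[j]

def query_most_uncertain(student, blackbox_teacher, candidates, budget):
    """Label the candidates the student is least confident about with the teacher."""
    with torch.no_grad():
        probs = F.softmax(student(candidates), dim=-1)
        confidence = probs.max(dim=-1).values  # low max-probability = uncertain
    picked = torch.topk(-confidence, k=min(budget, len(candidates))).indices
    chosen = candidates[picked]
    with torch.no_grad():
        soft_labels = F.softmax(blackbox_teacher(chosen), dim=-1)
    return chosen, soft_labels
```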
arXiv Detail & Related papers (2020-03-31T05:44:55Z)
This list is automatically generated from the titles and abstracts of the papers on this site.