Taught Well Learned Ill: Towards Distillation-conditional Backdoor Attack
- URL: http://arxiv.org/abs/2509.23871v1
- Date: Sun, 28 Sep 2025 13:24:46 GMT
- Title: Taught Well Learned Ill: Towards Distillation-conditional Backdoor Attack
- Authors: Yukun Chen, Boheng Li, Yu Yuan, Leyi Qi, Yiming Li, Tianwei Zhang, Zhan Qin, Kui Ren,
- Abstract summary: We uncover a novel threat: distillation-conditional backdoor attacks (DCBAs). DCBA injects dormant and undetectable backdoors into teacher models, which become activated in student models via the KD process. Our method, SCAR, implements the attack as a bilevel optimization, solved with an implicit differentiation algorithm and a pre-optimized trigger injection function.
- Score: 43.65095213656978
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Knowledge distillation (KD) is a vital technique for deploying deep neural networks (DNNs) on resource-constrained devices by transferring knowledge from large teacher models to lightweight student models. While teacher models from third-party platforms may undergo security verification (e.g., backdoor detection), we uncover a novel and critical threat: distillation-conditional backdoor attacks (DCBAs). DCBA injects dormant and undetectable backdoors into teacher models, which become activated in student models via the KD process, even with clean distillation datasets. While the direct extension of existing methods is ineffective for DCBA, we implement this attack by formulating it as a bilevel optimization problem and proposing a simple yet effective method (i.e., SCAR). Specifically, the inner optimization simulates the KD process by optimizing a surrogate student model, while the outer optimization leverages outputs from this surrogate to optimize the teacher model for implanting the conditional backdoor. SCAR solves this complex optimization using an implicit differentiation algorithm with a pre-optimized trigger injection function. Extensive experiments across diverse datasets, model architectures, and KD techniques validate the effectiveness of SCAR and its resistance against existing backdoor detection, highlighting a significant yet previously overlooked vulnerability in the KD process. Our code is available at https://github.com/WhitolfChen/SCAR.
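The abstract describes the attack as a bilevel problem; the display below is a hedged sketch of one way to write that objective. The specific loss terms (cross-entropy for clean behavior, a KL distillation loss for the inner KD simulation, a target label $y_t$, and the trigger injection function $g(\cdot)$) are illustrative assumptions consistent with the abstract, not the paper's exact formulation.

```latex
% Illustrative bilevel objective (assumed loss terms, not SCAR's exact formulation)
\[
\min_{\theta}\;
  \mathbb{E}_{(x,y)}\Big[\mathrm{CE}\big(f_{\theta}(x),y\big)
    + \mathrm{CE}\big(f_{\theta}(g(x)),y\big)\Big]            % teacher stays clean, even on triggered inputs
  + \lambda\,\mathbb{E}_{x}\Big[\mathrm{CE}\big(f_{w^{*}(\theta)}(g(x)),\,y_{t}\big)\Big]  % distilled student is backdoored
\quad\text{s.t.}\quad
  w^{*}(\theta) = \arg\min_{w}\;\mathbb{E}_{x}\Big[\mathrm{KL}\big(f_{\theta}(x)\,\big\|\,f_{w}(x)\big)\Big]
\]
```

Here $f_{\theta}$ is the teacher, $f_{w}$ the surrogate student, $g(\cdot)$ the pre-optimized trigger injection function, and $y_t$ the attacker's target label. The outer gradient requires $\partial w^{*}/\partial\theta$ through the inner KD simulation, which is where the implicit differentiation algorithm mentioned in the abstract comes in; unrolling the inner optimization would be the naive, memory-heavy alternative.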
Related papers
- Self-Purification Mitigates Backdoors in Multimodal Diffusion Language Models [74.1970982768771]
We show that well-established data-poisoning pipelines can successfully implant backdoors into MDLMs. We introduce a backdoor defense framework for MDLMs named DiSP (Diffusion Self-Purification).
arXiv Detail & Related papers (2026-02-24T15:47:52Z)
- Robust Backdoor Removal by Reconstructing Trigger-Activated Changes in Latent Representation [2.7017997039883923]
Existing defenses often attempt to identify and remove backdoor neurons based on Trigger-Activated Changes (TAC). We propose a novel backdoor removal method that accurately reconstructs TAC values in the latent representation. We then identify the poisoned class by detecting statistically small $L_2$ norms of perturbations and leverage the perturbation of the poisoned class during fine-tuning to remove backdoors.
arXiv Detail & Related papers (2025-11-12T03:44:36Z)
- Sealing The Backdoor: Unlearning Adversarial Text Triggers In Diffusion Models Using Knowledge Distillation [3.54387829918311]
Adversaries can inject imperceptible textual triggers into training data, causing models to generate manipulated outputs. We propose Self-Knowledge Distillation with Cross-Attention Guidance (SKD-CAG) to erase associations between adversarial text triggers and poisoned outputs. Our method achieves a removal accuracy of 100% for pixel backdoors and 93% for style-based attacks, without sacrificing robustness or image fidelity.
arXiv Detail & Related papers (2025-08-20T00:57:21Z)
- DUP: Detection-guided Unlearning for Backdoor Purification in Language Models [6.726081307488787]
DUP (Detection-guided Unlearning for Purification) is a framework that integrates backdoor detection with unlearning-based purification. Based on the detection results, we purify the model through a parameter-efficient unlearning mechanism. Our code is available at https://github.com/ManHu2025/DUP.
arXiv Detail & Related papers (2025-08-03T08:12:21Z)
- Neural Antidote: Class-Wise Prompt Tuning for Purifying Backdoors in CLIP [51.04452017089568]
Class-wise Backdoor Prompt Tuning (CBPT) is an efficient and effective defense mechanism that operates on text prompts to indirectly purify CLIP. CBPT significantly mitigates backdoor threats while preserving model utility.
arXiv Detail & Related papers (2025-02-26T16:25:15Z)
- Behavior Backdoor for Deep Learning Models [95.50787731231063]
We take the first step towards the "behavioral backdoor" attack, which is defined as a behavior-triggered backdoor model training procedure. We propose the first pipeline for implementing a behavior backdoor, i.e., the Quantification Backdoor (QB) attack. Experiments have been conducted on different models, datasets, and tasks, demonstrating the effectiveness of this novel backdoor attack.
arXiv Detail & Related papers (2024-12-02T10:54:02Z)
- Efficient Backdoor Defense in Multimodal Contrastive Learning: A Token-Level Unlearning Method for Mitigating Threats [52.94388672185062]
We propose an efficient defense mechanism against backdoor threats using a concept known as machine unlearning.
This entails strategically creating a small set of poisoned samples to aid the model's rapid unlearning of backdoor vulnerabilities.
In the backdoor unlearning process, we present a novel token-based portion unlearning training regime.
arXiv Detail & Related papers (2024-09-29T02:55:38Z)
- Transferring Backdoors between Large Language Models by Knowledge Distillation [2.9138150728729064]
Backdoor attacks have been a serious vulnerability of Large Language Models (LLMs).
Previous methods only reveal such risks in specific models, or demonstrate task transferability after attacking the pre-training phase.
We propose ATBA, an adaptive transferable backdoor attack, which can effectively distill the backdoor of teacher LLMs into small models.
arXiv Detail & Related papers (2024-08-19T10:39:45Z)
- Mitigating Backdoor Attacks using Activation-Guided Model Editing [8.00994004466919]
Backdoor attacks compromise the integrity and reliability of machine learning models.
We propose a novel backdoor mitigation approach via machine unlearning to counter such backdoor attacks.
arXiv Detail & Related papers (2024-07-10T13:43:47Z)
- Defense Against Model Extraction Attacks on Recommender Systems [53.127820987326295]
We introduce Gradient-based Ranking Optimization (GRO) to defend against model extraction attacks on recommender systems.
GRO aims to minimize the loss of the protected target model while maximizing the loss of the attacker's surrogate model.
Results show GRO's superior effectiveness in defending against model extraction attacks.
arXiv Detail & Related papers (2023-10-25T03:30:42Z)
- Backdoor Defense via Suppressing Model Shortcuts [91.30995749139012]
In this paper, we explore the backdoor mechanism from the angle of the model structure.
We demonstrate that the attack success rate (ASR) decreases significantly when reducing the outputs of some key skip connections.
arXiv Detail & Related papers (2022-11-02T15:39:19Z)
This list is automatically generated from the titles and abstracts of the papers on this site.