Neural Antidote: Class-Wise Prompt Tuning for Purifying Backdoors in CLIP
- URL: http://arxiv.org/abs/2502.19269v2
- Date: Sun, 21 Sep 2025 11:26:38 GMT
- Title: Neural Antidote: Class-Wise Prompt Tuning for Purifying Backdoors in CLIP
- Authors: Jiawei Kong, Hao Fang, Sihang Guo, Chenxi Qing, Kuofeng Gao, Bin Chen, Shu-Tao Xia, Ke Xu
- Abstract summary: Class-wise Backdoor Prompt Tuning (CBPT) is an efficient and effective defense mechanism that operates on text prompts to indirectly purify CLIP. CBPT significantly mitigates backdoor threats while preserving model utility.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: While pre-trained Vision-Language Models (VLMs) such as CLIP exhibit impressive representational capabilities for multimodal data, recent studies have revealed their vulnerability to backdoor attacks. To alleviate the threat, existing defense strategies primarily focus on fine-tuning the entire suspicious model. However, the substantial model parameters increase the difficulty of reaching a stable and consistent optimization direction, limiting their resistance against state-of-the-art attacks and often resulting in a degradation of clean accuracy. To address this challenge, we propose Class-wise Backdoor Prompt Tuning (CBPT), an efficient and effective defense mechanism that operates on text prompts to indirectly purify poisoned CLIP. Specifically, we first employ advanced contrastive learning with carefully crafted positive and negative samples to effectively invert the backdoor triggers that are potentially adopted by the attacker. Once the dummy trigger is established, we leverage three well-designed loss functions to optimize these class-wise text prompts, modifying the model's decision boundary and further reclassifying the feature regions affected by backdoor triggers. Extensive experiments demonstrate that CBPT significantly mitigates backdoor threats while preserving model utility, e.g., an average Clean Accuracy (CA) of 58.83% and an Attack Success Rate (ASR) of 0.39% across seven mainstream backdoor attacks. These results underscore the superiority of our prompt-purifying design in strengthening CLIP's robustness against backdoor attacks.
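The defense pipeline sketched in the abstract — establish a dummy trigger, then tune class-wise text prompts so that triggered features no longer map to the attacker's target class while clean accuracy is preserved — can be illustrated on toy features. Everything below is an assumption for illustration, not the authors' implementation: random Gaussian class clusters stand in for frozen CLIP image features, a random additive vector stands in for the contrastively inverted trigger, a single purification loss stands in for the paper's three loss functions, and a crude finite-difference optimizer replaces backpropagation.

```python
import numpy as np

rng = np.random.default_rng(0)
D, C, N = 16, 3, 12                      # feature dim, classes, images (toy sizes)

# Frozen stand-in "CLIP" image features: class centers plus noise.
centers = rng.normal(size=(C, D)) * 3.0
labels = np.repeat(np.arange(C), N // C)
clean_feats = centers[labels] + rng.normal(size=(N, D)) * 0.3

# Dummy trigger (in the paper this comes from contrastive trigger
# inversion; here it is just a random additive vector).
trigger = rng.normal(size=D) * 0.5
poisoned_feats = clean_feats + trigger
target_class = 0                         # presumed attacker target label

# Learnable class-wise text-prompt embeddings, one vector per class.
prompts = rng.normal(size=(C, D)) * 0.1

def normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def logits(feats, p):
    # Cosine-normalized image features scored against prompt vectors
    # (CLIP's temperature and text-side normalization omitted for brevity).
    return normalize(feats) @ p.T

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def loss(p):
    # Clean term: cross-entropy keeps clean accuracy intact.
    probs = softmax(logits(clean_feats, p))
    ce = -np.log(probs[np.arange(N), labels] + 1e-12).mean()
    # Purification term: push triggered features off the target class
    # (a single stand-in for the paper's three loss functions).
    probs_t = softmax(logits(poisoned_feats, p))
    bd = -np.log(1.0 - probs_t[:, target_class] + 1e-12).mean()
    return ce + bd

loss0 = loss(prompts)
lr, eps = 0.5, 1e-4
for _ in range(200):                     # finite-difference descent, prompts only
    g = np.zeros_like(prompts)
    for i in range(C):
        for j in range(D):
            prompts[i, j] += eps; up = loss(prompts)
            prompts[i, j] -= 2 * eps; dn = loss(prompts)
            prompts[i, j] += eps
            g[i, j] = (up - dn) / (2 * eps)
    prompts -= lr * g
loss1 = loss(prompts)

ca = (logits(clean_feats, prompts).argmax(1) == labels).mean()
mask = labels != target_class            # ASR measured on non-target images
asr = (logits(poisoned_feats[mask], prompts).argmax(1) == target_class).mean()
print(f"loss {loss0:.3f} -> {loss1:.3f}, CA={ca:.2f}, ASR={asr:.2f}")
```

The design point the sketch tries to convey is the one the abstract argues for: only the C×D prompt parameters are optimized, while the (much larger) encoder stays frozen, which is why prompt tuning can reach a stable optimization direction where full fine-tuning struggles.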
Related papers
- Backdoor Unlearning by Linear Task Decomposition [69.91984435094157]
Foundation models are highly susceptible to adversarial perturbations and targeted backdoor attacks. Existing backdoor removal approaches rely on costly fine-tuning to override the harmful behavior. This raises the question of whether backdoors can be removed without compromising the general capabilities of the models.
arXiv Detail & Related papers (2025-10-16T16:18:07Z) - MARS: A Malignity-Aware Backdoor Defense in Federated Learning [51.77354308287098]
The recently proposed state-of-the-art (SOTA) attack, 3DFed, uses an indicator mechanism to determine whether backdoor models have been accepted by the defender. We propose a Malignity-Aware backdooR defenSe (MARS) that leverages backdoor energy to indicate the malicious extent of each neuron. Experiments demonstrate that MARS can defend against SOTA backdoor attacks and significantly outperforms existing defenses.
arXiv Detail & Related papers (2025-09-21T14:50:02Z) - InverTune: Removing Backdoors from Multimodal Contrastive Learning Models via Trigger Inversion and Activation Tuning [36.56302680556252]
We introduce InverTune, the first backdoor defense framework for multimodal models under minimal attacker assumptions. InverTune effectively identifies and removes backdoor artifacts through three key components, achieving robust protection against backdoor attacks. Experimental results show that InverTune reduces the average attack success rate (ASR) by 97.87% against the state-of-the-art (SOTA) attacks.
arXiv Detail & Related papers (2025-06-14T09:08:34Z) - ELBA-Bench: An Efficient Learning Backdoor Attacks Benchmark for Large Language Models [55.93380086403591]
Generative large language models are vulnerable to backdoor attacks. ELBA-Bench allows attackers to inject backdoors through parameter-efficient fine-tuning. ELBA-Bench provides over 1300 experiments.
arXiv Detail & Related papers (2025-02-22T12:55:28Z) - Defending Multimodal Backdoored Models by Repulsive Visual Prompt Tuning [13.802845998402677]
Multimodal contrastive learning models (e.g., CLIP) can learn high-quality representations from large-scale image-text datasets. They exhibit significant vulnerabilities to backdoor attacks, raising serious safety concerns. We propose Repulsive Visual Prompt Tuning (RVPT) as a novel defense approach.
arXiv Detail & Related papers (2024-12-29T08:09:20Z) - An Effective and Resilient Backdoor Attack Framework against Deep Neural Networks and Vision Transformers [22.77836113915616]
We propose a novel attention-based mask generation methodology that searches for the optimal trigger shape and location. We also introduce a Quality-of-Experience term into the loss function and carefully adjust the transparency value of the trigger. Our proposed backdoor attack framework also showcases robustness against state-of-the-art backdoor defenses.
arXiv Detail & Related papers (2024-12-09T02:03:27Z) - Efficient Backdoor Defense in Multimodal Contrastive Learning: A Token-Level Unlearning Method for Mitigating Threats [52.94388672185062]
We propose an efficient defense mechanism against backdoor threats using a concept known as machine unlearning.
This entails strategically creating a small set of poisoned samples to aid the model's rapid unlearning of backdoor vulnerabilities.
In the backdoor unlearning process, we present a novel token-based portion unlearning training regime.
arXiv Detail & Related papers (2024-09-29T02:55:38Z) - Adversarial Backdoor Defense in CLIP [47.6497532581449]
Multimodal contrastive pretraining, exemplified by models like CLIP, has been found to be vulnerable to backdoor attacks.
We propose Adversarial Backdoor Defense, a novel data augmentation strategy that aligns features with meticulously crafted adversarial examples.
Our experiments demonstrate that ABD provides robust defense against both traditional uni-modal and multimodal backdoor attacks targeting CLIP.
arXiv Detail & Related papers (2024-09-24T10:56:18Z) - Revisiting Backdoor Attacks against Large Vision-Language Models from Domain Shift [104.76588209308666]
This paper explores backdoor attacks in LVLM instruction tuning across mismatched training and testing domains. We introduce a new evaluation dimension, backdoor domain generalization, to assess attack robustness. We propose a multimodal attribution backdoor attack (MABA) that injects domain-agnostic triggers into critical areas.
arXiv Detail & Related papers (2024-06-27T02:31:03Z) - BEEAR: Embedding-based Adversarial Removal of Safety Backdoors in Instruction-tuned Language Models [57.5404308854535]
Safety backdoor attacks in large language models (LLMs) enable the stealthy triggering of unsafe behaviors while evading detection during normal interactions.
We present BEEAR, a mitigation approach leveraging the insight that backdoor triggers induce relatively uniform drifts in the model's embedding space.
Our bi-level optimization method identifies universal embedding perturbations that elicit unwanted behaviors and adjusts the model parameters to reinforce safe behaviors against these perturbations.
arXiv Detail & Related papers (2024-06-24T19:29:47Z) - BadCLIP: Dual-Embedding Guided Backdoor Attack on Multimodal Contrastive Learning [85.2564206440109]
This paper reveals the threats in this practical scenario that backdoor attacks can remain effective even after defenses.
We introduce the BadCLIP attack, which is resistant to backdoor detection and model fine-tuning defenses.
arXiv Detail & Related papers (2023-11-20T02:21:49Z) - IMBERT: Making BERT Immune to Insertion-based Backdoor Attacks [45.81957796169348]
Backdoor attacks are an insidious security threat against machine learning models.
We introduce IMBERT, which uses either gradients or self-attention scores derived from victim models to self-defend against backdoor attacks.
Our empirical studies demonstrate that IMBERT can effectively identify up to 98.5% of inserted triggers.
arXiv Detail & Related papers (2023-05-25T22:08:57Z) - Backdoor Attack with Sparse and Invisible Trigger [57.41876708712008]
Deep neural networks (DNNs) are vulnerable to backdoor attacks.
Backdoor attacks are an emerging yet severe training-phase threat.
We propose a sparse and invisible backdoor attack (SIBA).
arXiv Detail & Related papers (2023-05-11T10:05:57Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed papers or of any information shown, and is not responsible for any consequences of their use.