PromptFix: Few-shot Backdoor Removal via Adversarial Prompt Tuning
- URL: http://arxiv.org/abs/2406.04478v1
- Date: Thu, 6 Jun 2024 20:06:42 GMT
- Title: PromptFix: Few-shot Backdoor Removal via Adversarial Prompt Tuning
- Authors: Tianrong Zhang, Zhaohan Xi, Ting Wang, Prasenjit Mitra, Jinghui Chen
- Abstract summary: Pre-trained language models (PLMs) have attracted enormous attention over the past few years with their unparalleled performance.
The soaring cost of training PLMs and their strong generalizability have jointly contributed to the popularity of few-shot fine-tuning and prompting.
Yet, existing studies have shown that these NLP models can be backdoored such that model behavior is manipulated when trigger tokens are present.
We propose PromptFix, a novel backdoor mitigation strategy for NLP models via adversarial prompt-tuning in few-shot settings.
- Score: 28.845915332201592
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Pre-trained language models (PLMs) have attracted enormous attention over the past few years with their unparalleled performance. Meanwhile, the soaring cost of training PLMs and their strong generalizability have jointly made few-shot fine-tuning and prompting the most popular training paradigms for natural language processing (NLP) models. Nevertheless, existing studies have shown that these NLP models can be backdoored such that model behavior is manipulated when trigger tokens are present. In this paper, we propose PromptFix, a novel backdoor mitigation strategy for NLP models via adversarial prompt-tuning in few-shot settings. Unlike existing NLP backdoor removal methods, which rely on accurate trigger inversion and subsequent model fine-tuning, PromptFix keeps the model parameters intact and only uses two extra sets of soft tokens, which approximate the trigger and counteract it, respectively. The use of soft tokens and adversarial optimization eliminates the need to enumerate possible backdoor configurations and enables an adaptive balance between trigger finding and preservation of performance. Experiments with various backdoor attacks validate the effectiveness of the proposed method, and its performance under domain shift further shows PromptFix's applicability to models pretrained on unknown data sources, which is the common case in prompt-tuning scenarios.
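One way to read the adversarial setup described in the abstract is as a min-max problem over two sets of soft tokens while the model weights stay frozen. The notation below is an assumption made for illustration and is not taken from the paper: $f_\theta$ is the backdoored model with fixed parameters $\theta$, $D$ is the few-shot clean set, $t$ is the soft trigger, $p$ is the soft fixing prompt, $\oplus$ is concatenation in embedding space, $\ell$ is the cross-entropy loss, and $\alpha$ is a hypothetical weight balancing the two terms. The paper's actual objective may differ.

```latex
\begin{align}
\mathcal{L}_{\mathrm{clean}}(p) &= \frac{1}{|D|} \sum_{(x, y) \in D} \ell\big(f_\theta(p \oplus x),\, y\big) \\
\mathcal{L}_{\mathrm{adv}}(p, t) &= \frac{1}{|D|} \sum_{(x, y) \in D} \ell\big(f_\theta(p \oplus t \oplus x),\, y\big) \\
p^{\ast} &= \arg\min_{p} \Big[ \mathcal{L}_{\mathrm{clean}}(p) + \alpha \, \max_{t} \, \mathcal{L}_{\mathrm{adv}}(p, t) \Big]
\end{align}
```

Under this reading, the inner maximization pushes $t$ toward input-agnostic tokens that drive predictions away from the correct labels, which is how it would approximate a trigger. The outer minimization trains $p$ to counteract those tokens, and the clean-loss term preserves task performance. Only $p$ and $t$ are updated; $\theta$ never changes.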
Related papers
- Mitigating Backdoor Attacks using Activation-Guided Model Editing [8.00994004466919]
Backdoor attacks compromise the integrity and reliability of machine learning models.
We propose a novel backdoor mitigation approach via machine unlearning to counter such backdoor attacks.
(2024-07-10T13:43:47Z)
- Lazy Layers to Make Fine-Tuned Diffusion Models More Traceable [70.77600345240867]
A novel arbitrary-in-arbitrary-out (AIAO) strategy makes watermarks resilient to fine-tuning-based removal.
Unlike the existing methods of designing a backdoor for the input/output space of diffusion models, in our method, we propose to embed the backdoor into the feature space of sampled subpaths.
Our empirical studies on the MS-COCO, AFHQ, LSUN, CUB-200, and DreamBooth datasets confirm the robustness of AIAO.
(2024-05-01T12:03:39Z)
- Setting the Trap: Capturing and Defeating Backdoors in Pretrained Language Models through Honeypots [68.84056762301329]
Recent research has exposed the susceptibility of pretrained language models (PLMs) to backdoor attacks.
We propose and integrate a honeypot module into the original PLM to absorb backdoor information exclusively.
Our design is motivated by the observation that lower-layer representations in PLMs carry sufficient backdoor features.
(2023-10-28T08:21:16Z)
- Model-tuning Via Prompts Makes NLP Models Adversarially Robust [97.02353907677703]
We show surprising gains in adversarial robustness enjoyed by Model-tuning Via Prompts (MVP).
MVP improves performance against adversarial substitutions by an average of 8% over standard methods.
We also conduct ablations to investigate the mechanism underlying these gains.
(2023-03-13T17:41:57Z)
- CleanCLIP: Mitigating Data Poisoning Attacks in Multimodal Contrastive Learning [63.72975421109622]
CleanCLIP is a finetuning framework that weakens the learned spurious associations introduced by backdoor attacks.
CleanCLIP maintains model performance on benign examples while erasing a range of backdoor attacks on multimodal contrastive learning.
(2023-03-06T17:48:32Z)
- Backdoor Defense via Suppressing Model Shortcuts [91.30995749139012]
In this paper, we explore the backdoor mechanism from the angle of the model structure.
We demonstrate that the attack success rate (ASR) decreases significantly when reducing the outputs of some key skip connections.
(2022-11-02T15:39:19Z)
- MockingBERT: A Method for Retroactively Adding Resilience to NLP Models [4.584774276587428]
We propose a novel method of retroactively adding resilience to misspellings to transformer-based NLP models.
This can be achieved without the need for re-training of the original NLP model.
We also propose a new efficient approximate method of generating adversarial misspellings.
(2022-08-21T16:02:01Z)
- Exploring the Universal Vulnerability of Prompt-based Learning Paradigm [21.113683206722207]
We find that prompt-based learning bridges the gap between pre-training and fine-tuning, and works effectively under the few-shot setting.
However, we find that this learning paradigm inherits the vulnerability from the pre-training stage, where model predictions can be misled by inserting certain triggers into the text.
We explore this universal vulnerability by either injecting backdoor triggers or searching for adversarial triggers on pre-trained language models using only plain text.
(2022-04-11T16:34:10Z)
- Backdoor Pre-trained Models Can Transfer to All [33.720258110911274]
We propose a new approach to map the inputs containing triggers directly to a predefined output representation of pre-trained NLP models.
In light of the unique properties of triggers in NLP, we propose two new metrics to measure the performance of backdoor attacks.
(2021-10-30T07:11:24Z)
- Avoiding Inference Heuristics in Few-shot Prompt-based Finetuning [57.4036085386653]
We show that prompt-based models for sentence pair classification tasks still suffer from a common pitfall of adopting inferences based on lexical overlap.
We then show that adding a regularization that preserves pretraining weights is effective in mitigating this destructive tendency of few-shot finetuning.
(2021-09-09T10:10:29Z)
This list is automatically generated from the titles and abstracts of the papers in this site.