LMSanitator: Defending Prompt-Tuning Against Task-Agnostic Backdoors
- URL: http://arxiv.org/abs/2308.13904v2
- Date: Sat, 14 Oct 2023 15:07:37 GMT
- Title: LMSanitator: Defending Prompt-Tuning Against Task-Agnostic Backdoors
- Authors: Chengkun Wei, Wenlong Meng, Zhikun Zhang, Min Chen, Minghu Zhao,
Wenjing Fang, Lei Wang, Zihui Zhang, Wenzhi Chen
- Abstract summary: LMSanitator is a novel approach for detecting and removing task-agnostic backdoors on Transformer models.
LMSanitator achieves 92.8% backdoor detection accuracy on 960 models and decreases the attack success rate to less than 1% in most scenarios.
- Score: 10.136109501389168
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Prompt-tuning has emerged as an attractive paradigm for deploying large-scale
language models due to its strong downstream task performance and efficient
multitask serving ability. Despite its wide adoption, we empirically show that
prompt-tuning is vulnerable to downstream task-agnostic backdoors, which reside
in the pretrained models and can affect arbitrary downstream tasks. The
state-of-the-art backdoor detection approaches cannot defend against
task-agnostic backdoors because they can hardly converge when reverse-engineering the
backdoor triggers. To address this issue, we propose LMSanitator, a novel approach for
detecting and removing task-agnostic backdoors on Transformer models. Instead
of directly inverting the triggers, LMSanitator aims to invert the predefined
attack vectors (the pretrained model's output when the input contains a
trigger) of the task-agnostic backdoors, which achieves much better
convergence performance and backdoor detection accuracy. LMSanitator further
leverages prompt-tuning's property of freezing the pretrained model to perform
accurate and fast output monitoring and input purging during the inference
phase. Extensive experiments on multiple language models and NLP tasks
illustrate the effectiveness of LMSanitator. For instance, LMSanitator achieves
92.8% backdoor detection accuracy on 960 models and decreases the attack
success rate to less than 1% in most scenarios.
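A minimal sketch of the inference-time output monitoring described above, assuming a HuggingFace-style frozen encoder; the function names, the 0.9 threshold, and the choice of the [CLS] feature with cosine similarity are illustrative assumptions, not the authors' released implementation:

    import torch
    import torch.nn.functional as F

    def is_triggered(feature, attack_vectors, threshold=0.9):
        """Flag an input whose frozen-encoder feature lies close to any inverted attack vector.

        feature:        (hidden_dim,) frozen-PLM output for one input (e.g. the [CLS] vector).
        attack_vectors: (num_vectors, hidden_dim) vectors recovered during the detection phase.
        threshold:      cosine-similarity cutoff; a tunable assumption.
        """
        sims = F.cosine_similarity(feature.unsqueeze(0), attack_vectors, dim=-1)
        return bool((sims > threshold).any())

    def monitored_predict(frozen_encoder, prompt_classifier, input_ids, attack_vectors):
        """Prompt-tuning inference with output monitoring; flagged inputs are sent to purging."""
        with torch.no_grad():
            feature = frozen_encoder(input_ids).last_hidden_state[:, 0, :].squeeze(0)
        if is_triggered(feature, attack_vectors):
            return None  # hand the input to an input-purging routine instead of predicting
        return prompt_classifier(feature)

Because prompt-tuning keeps the pretrained encoder frozen, attack vectors inverted once during detection remain valid at inference time, which is what makes this per-input check cheap.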
Related papers
- Uncovering, Explaining, and Mitigating the Superficial Safety of Backdoor Defense [27.471096446155933]
We investigate the Post-Purification Robustness of current backdoor purification methods.
We find that current safety purification methods are vulnerable to the rapid re-learning of backdoor behavior.
We propose a tuning defense, Path-Aware Minimization (PAM), which promotes deviation along backdoor-connected paths with extra model updates.
arXiv Detail & Related papers (2024-10-13T13:37:36Z) - CLIBE: Detecting Dynamic Backdoors in Transformer-based NLP Models [39.782217458240225]
This paper presents CLIBE, the first framework to detect dynamic backdoors in Transformer-based NLP models.
To the best of our knowledge, CLIBE is the first framework capable of detecting backdoors in text generation models without access to trigger input test samples.
arXiv Detail & Related papers (2024-09-02T11:59:56Z) - MEGen: Generative Backdoor in Large Language Models via Model Editing [56.46183024683885]
Large language models (LLMs) have demonstrated remarkable capabilities.
Their powerful generative abilities enable flexible responses based on various queries or instructions.
This paper proposes an editing-based generative backdoor, named MEGen, which aims to create a customized backdoor for NLP tasks with minimal side effects.
arXiv Detail & Related papers (2024-08-20T10:44:29Z) - BEEAR: Embedding-based Adversarial Removal of Safety Backdoors in Instruction-tuned Language Models [57.5404308854535]
Safety backdoor attacks in large language models (LLMs) enable the stealthy triggering of unsafe behaviors while evading detection during normal interactions.
We present BEEAR, a mitigation approach leveraging the insight that backdoor triggers induce relatively uniform drifts in the model's embedding space.
Our bi-level optimization method identifies universal embedding perturbations that elicit unwanted behaviors and adjusts the model parameters to reinforce safe behaviors against these perturbations.
arXiv Detail & Related papers (2024-06-24T19:29:47Z) - Lazy Layers to Make Fine-Tuned Diffusion Models More Traceable [70.77600345240867]
A novel arbitrary-in-arbitrary-out (AIAO) strategy makes watermarks resilient to fine-tuning-based removal.
Unlike existing methods that design a backdoor for the input/output space of diffusion models, our method embeds the backdoor into the feature space of sampled subpaths.
Our empirical studies on the MS-COCO, AFHQ, LSUN, CUB-200, and DreamBooth datasets confirm the robustness of AIAO.
arXiv Detail & Related papers (2024-05-01T12:03:39Z) - Task-Agnostic Detector for Insertion-Based Backdoor Attacks [53.77294614671166]
We introduce TABDet (Task-Agnostic Backdoor Detector), a pioneering task-agnostic method for backdoor detection.
TABDet leverages final layer logits combined with an efficient pooling technique, enabling unified logit representation across three prominent NLP tasks.
TABDet can jointly learn from diverse task-specific models, demonstrating superior detection efficacy over traditional task-specific methods.
arXiv Detail & Related papers (2024-03-25T20:12:02Z) - Setting the Trap: Capturing and Defeating Backdoors in Pretrained
Language Models through Honeypots [68.84056762301329]
Recent research has exposed the susceptibility of pretrained language models (PLMs) to backdoor attacks.
We propose and integrate a honeypot module into the original PLM to absorb backdoor information exclusively.
Our design is motivated by the observation that lower-layer representations in PLMs carry sufficient backdoor features.
arXiv Detail & Related papers (2023-10-28T08:21:16Z) - Backdoor Defense via Suppressing Model Shortcuts [91.30995749139012]
In this paper, we explore the backdoor mechanism from the angle of the model structure.
We demonstrate that the attack success rate (ASR) decreases significantly when reducing the outputs of some key skip connections.
arXiv Detail & Related papers (2022-11-02T15:39:19Z) - Understanding Impacts of Task Similarity on Backdoor Attack and
Detection [17.5277044179396]
We use similarity metrics in multi-task learning to define the backdoor distance (similarity) between the primary task and the backdoor task.
We then analyze existing stealthy backdoor attacks, revealing that most of them fail to effectively reduce the backdoor distance.
We then design a new method, called TSA attack, to automatically generate a backdoor model under a given distance constraint.
arXiv Detail & Related papers (2022-10-12T18:07:39Z) - Backdoor Pre-trained Models Can Transfer to All [33.720258110911274]
We propose a new approach to map the inputs containing triggers directly to a predefined output representation of pre-trained NLP models.
In light of the unique properties of triggers in NLP, we propose two new metrics to measure the performance of backdoor attacks.
arXiv Detail & Related papers (2021-10-30T07:11:24Z)
This list is automatically generated from the titles and abstracts of the papers on this site.