LMSanitator: Defending Prompt-Tuning Against Task-Agnostic Backdoors
- URL: http://arxiv.org/abs/2308.13904v2
- Date: Sat, 14 Oct 2023 15:07:37 GMT
- Title: LMSanitator: Defending Prompt-Tuning Against Task-Agnostic Backdoors
- Authors: Chengkun Wei, Wenlong Meng, Zhikun Zhang, Min Chen, Minghu Zhao,
Wenjing Fang, Lei Wang, Zihui Zhang, Wenzhi Chen
- Abstract summary: LMSanitator is a novel approach for detecting and removing task-agnostic backdoors on Transformer models.
LMSanitator achieves 92.8% backdoor detection accuracy on 960 models and decreases the attack success rate to less than 1% in most scenarios.
- Score: 10.136109501389168
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Prompt-tuning has emerged as an attractive paradigm for deploying large-scale
language models due to its strong downstream task performance and efficient
multitask serving ability. Despite its wide adoption, we empirically show that
prompt-tuning is vulnerable to downstream task-agnostic backdoors, which reside
in the pretrained models and can affect arbitrary downstream tasks. The
state-of-the-art backdoor detection approaches cannot defend against
task-agnostic backdoors because they can hardly converge when reverse-engineering the
backdoor triggers. To address this issue, we propose LMSanitator, a novel approach for
detecting and removing task-agnostic backdoors on Transformer models. Instead
of directly inverting the triggers, LMSanitator aims to invert the predefined
attack vectors (the pretrained model's output when the input contains a
trigger) of the task-agnostic backdoors, which achieves much better
convergence performance and backdoor detection accuracy. LMSanitator further
leverages prompt-tuning's property of freezing the pretrained model to perform
accurate and fast output monitoring and input purging during the inference
phase. Extensive experiments on multiple language models and NLP tasks
illustrate the effectiveness of LMSanitator. For instance, LMSanitator achieves
92.8% backdoor detection accuracy on 960 models and decreases the attack
success rate to less than 1% in most scenarios.
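A minimal sketch of the inference-time output monitoring described above, assuming a HuggingFace-style frozen encoder; the function names, the 0.9 threshold, and the choice of the [CLS] feature with cosine similarity are illustrative assumptions, not the authors' released implementation:

    import torch
    import torch.nn.functional as F

    def is_triggered(feature, attack_vectors, threshold=0.9):
        """Flag an input whose frozen-encoder feature lies close to any inverted attack vector.

        feature:        (hidden_dim,) frozen-PLM output for one input (e.g. the [CLS] vector).
        attack_vectors: (num_vectors, hidden_dim) vectors recovered during the detection phase.
        threshold:      cosine-similarity cutoff; a tunable assumption.
        """
        sims = F.cosine_similarity(feature.unsqueeze(0), attack_vectors, dim=-1)
        return bool((sims > threshold).any())

    def monitored_predict(frozen_encoder, prompt_classifier, input_ids, attack_vectors):
        """Prompt-tuning inference with output monitoring; flagged inputs are sent to purging."""
        with torch.no_grad():
            feature = frozen_encoder(input_ids).last_hidden_state[:, 0, :].squeeze(0)
        if is_triggered(feature, attack_vectors):
            return None  # hand the input to an input-purging routine instead of predicting
        return prompt_classifier(feature)

Because prompt-tuning keeps the pretrained encoder frozen, attack vectors inverted once during detection remain valid at inference time, which is what makes this per-input check cheap.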
Related papers
- Uncovering, Explaining, and Mitigating the Superficial Safety of Backdoor Defense [27.471096446155933]
We investigate the Post-Purification Robustness of current backdoor purification methods.
We find that current safety purification methods are vulnerable to the rapid re-learning of backdoor behavior.
We propose a tuning defense, Path-Aware Minimization (PAM), which promotes deviation along backdoor-connected paths with extra model updates.
arXiv Detail & Related papers (2024-10-13T13:37:36Z) - CLIBE: Detecting Dynamic Backdoors in Transformer-based NLP Models [39.782217458240225]
This paper presents CLIBE, the first framework to detect dynamic backdoors in Transformer-based NLP models.
To the best of our knowledge, CLIBE is the first framework capable of detecting backdoors in text generation models without access to trigger input test samples.
arXiv Detail & Related papers (2024-09-02T11:59:56Z) - MEGen: Generative Backdoor in Large Language Models via Model Editing [56.46183024683885]
Large language models (LLMs) have demonstrated remarkable capabilities.
Their powerful generative abilities enable flexible responses based on various queries or instructions.
This paper proposes an editing-based generative backdoor, named MEGen, which aims to create a customized backdoor for NLP tasks with minimal side effects.
arXiv Detail & Related papers (2024-08-20T10:44:29Z) - BEEAR: Embedding-based Adversarial Removal of Safety Backdoors in Instruction-tuned Language Models [57.5404308854535]
Safety backdoor attacks in large language models (LLMs) enable the stealthy triggering of unsafe behaviors while evading detection during normal interactions.
We present BEEAR, a mitigation approach leveraging the insight that backdoor triggers induce relatively uniform drifts in the model's embedding space.
Our bi-level optimization method identifies universal embedding perturbations that elicit unwanted behaviors and adjusts the model parameters to reinforce safe behaviors against these perturbations.
arXiv Detail & Related papers (2024-06-24T19:29:47Z) - Lazy Layers to Make Fine-Tuned Diffusion Models More Traceable [70.77600345240867]
A novel arbitrary-in-arbitrary-out (AIAO) strategy makes watermarks resilient to fine-tuning-based removal.
Unlike existing methods that design a backdoor for the input/output space of diffusion models, our method embeds the backdoor into the feature space of sampled subpaths.
Our empirical studies on the MS-COCO, AFHQ, LSUN, CUB-200, and DreamBooth datasets confirm the robustness of AIAO.
arXiv Detail & Related papers (2024-05-01T12:03:39Z) - Task-Agnostic Detector for Insertion-Based Backdoor Attacks [53.77294614671166]
We introduce TABDet (Task-Agnostic Backdoor Detector), a pioneering task-agnostic method for backdoor detection.
TABDet leverages final layer logits combined with an efficient pooling technique, enabling unified logit representation across three prominent NLP tasks.
TABDet can jointly learn from diverse task-specific models, demonstrating superior detection efficacy over traditional task-specific methods.
arXiv Detail & Related papers (2024-03-25T20:12:02Z) - Setting the Trap: Capturing and Defeating Backdoors in Pretrained
Language Models through Honeypots [68.84056762301329]
Recent research has exposed the susceptibility of pretrained language models (PLMs) to backdoor attacks.
We propose and integrate a honeypot module into the original PLM to absorb backdoor information exclusively.
Our design is motivated by the observation that lower-layer representations in PLMs carry sufficient backdoor features.
arXiv Detail & Related papers (2023-10-28T08:21:16Z) - Backdoor Defense via Suppressing Model Shortcuts [91.30995749139012]
In this paper, we explore the backdoor mechanism from the angle of the model structure.
We demonstrate that the attack success rate (ASR) decreases significantly when reducing the outputs of some key skip connections.
arXiv Detail & Related papers (2022-11-02T15:39:19Z) - Understanding Impacts of Task Similarity on Backdoor Attack and
Detection [17.5277044179396]
We use similarity metrics in multi-task learning to define the backdoor distance (similarity) between the primary task and the backdoor task.
We then analyze existing stealthy backdoor attacks, revealing that most of them fail to effectively reduce the backdoor distance.
We then design a new method, called TSA attack, to automatically generate a backdoor model under a given distance constraint.
arXiv Detail & Related papers (2022-10-12T18:07:39Z) - Backdoor Pre-trained Models Can Transfer to All [33.720258110911274]
We propose a new approach to map the inputs containing triggers directly to a predefined output representation of pre-trained NLP models.
In light of the unique properties of triggers in NLP, we propose two new metrics to measure the performance of backdoor attacks.
arXiv Detail & Related papers (2021-10-30T07:11:24Z)
This list is automatically generated from the titles and abstracts of the papers on this site.