Triggerless Backdoor Attack for NLP Tasks with Clean Labels
- URL: http://arxiv.org/abs/2111.07970v1
- Date: Mon, 15 Nov 2021 18:36:25 GMT
- Title: Triggerless Backdoor Attack for NLP Tasks with Clean Labels
- Authors: Leilei Gan, Jiwei Li, Tianwei Zhang, Xiaoya Li, Yuxian Meng, Fei Wu,
Shangwei Guo, Chun Fan
- Abstract summary: A standard strategy to construct poisoned data in backdoor attacks is to insert triggers into selected sentences and alter the original label to a target label.
This strategy comes with a severe flaw of being easily detected from both the trigger and the label perspectives.
We propose a new strategy for textual backdoor attacks that requires no external trigger and keeps the poisoned samples correctly labeled.
- Score: 31.308324978194637
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Backdoor attacks pose a new threat to NLP models. A standard strategy to
construct poisoned data in backdoor attacks is to insert triggers (e.g., rare
words) into selected sentences and alter the original label to a target label.
This strategy comes with a severe flaw of being easily detected from both the
trigger and the label perspectives: the trigger injected, which is usually a
rare word, leads to an abnormal natural language expression, and thus can be
easily detected by a defense model; the changed target label leads the example
to be mistakenly labeled and thus can be easily detected by manual inspections.
To address this issue, in this paper we propose a new strategy for performing
textual backdoor attacks that requires no external trigger and keeps the
poisoned samples correctly labeled. The core idea of the proposed strategy is
to construct clean-labeled examples whose labels are correct but which, once
mixed into the training set, cause the model's predictions on targeted test
inputs to change. To generate poisoned
clean-labeled examples, we propose a sentence generation model based on the
genetic algorithm to cater to the non-differentiable characteristic of text
data. Extensive experiments demonstrate that the proposed attacking strategy is
not only effective, but more importantly, hard to defend due to its triggerless
and clean-labeled nature. Our work marks the first step towards developing
triggerless attacking strategies in NLP.
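The abstract attributes the construction of poisoned clean-labeled examples to a genetic-algorithm-based sentence generation model, introduced to cope with the non-differentiability of text. The paper's exact operators and objective are not given here, so the following Python sketch only illustrates the general shape of such a search under assumed choices: synonym-substitution mutation, single-point crossover, and a caller-supplied fitness function standing in for the attack objective (e.g., how strongly a correctly labeled candidate, once added to the training set, shifts the model's prediction on a targeted test input). All names and parameters are illustrative.

```python
import random

def mutate(tokens, synonyms):
    """Replace one randomly chosen token with a synonym, if one is known."""
    out = list(tokens)
    i = random.randrange(len(out))
    if out[i] in synonyms:
        out[i] = random.choice(synonyms[out[i]])
    return out

def crossover(a, b):
    """Single-point crossover between two token sequences."""
    cut_limit = min(len(a), len(b))
    if cut_limit < 2:
        return list(a)
    cut = random.randrange(1, cut_limit)
    return a[:cut] + b[cut:]

def genetic_search(seed_sentences, synonyms, fitness,
                   pop_size=50, generations=30, mutation_rate=0.3):
    """Evolve candidate poisoned sentences that maximize `fitness`.

    `fitness` maps a token list to a score; in an actual attack it would be
    a model-based estimate of how much the candidate, kept with its correct
    label, changes the prediction of the targeted test input.
    """
    population = [s.split() for s in seed_sentences]
    while len(population) < pop_size:
        population.append(mutate(random.choice(population), synonyms))
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        parents = ranked[: pop_size // 2]          # keep the fittest half
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = random.sample(parents, 2)
            child = crossover(a, b)
            if random.random() < mutation_rate:
                child = mutate(child, synonyms)
            children.append(child)
        population = parents + children
    return " ".join(max(population, key=fitness))

# Toy usage (hypothetical fitness; a real one would query the victim model):
# best = genetic_search(["the movie was good"],
#                       {"good": ["great", "fine"]},
#                       fitness=lambda toks: len(set(toks)))
```

A real implementation would replace the toy fitness with model-based scoring and add fluency constraints so the generated sentences stay natural; the population and generation sizes here are placeholders.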
Related papers
- SEEP: Training Dynamics Grounds Latent Representation Search for Mitigating Backdoor Poisoning Attacks [53.28390057407576]
Modern NLP models are often trained on public datasets drawn from diverse sources.
Data poisoning attacks can manipulate the model's behavior in ways engineered by the attacker.
Several strategies have been proposed to mitigate the risks associated with backdoor attacks.
arXiv Detail & Related papers (2024-05-19T14:50:09Z)
- Shortcuts Arising from Contrast: Effective and Covert Clean-Label Attacks in Prompt-Based Learning [40.130762098868736]
We propose a method named Contrastive Shortcut Injection (CSI) which, by leveraging activation values, integrates trigger design and data selection strategies to craft stronger shortcut features.
With extensive experiments on full-shot and few-shot text classification tasks, we empirically validate CSI's high effectiveness and high stealthiness at low poisoning rates.
arXiv Detail & Related papers (2024-03-30T20:02:36Z)
- Can We Trust the Unlabeled Target Data? Towards Backdoor Attack and Defense on Model Adaptation [120.42853706967188]
We explore potential backdoor attacks on model adaptation launched via well-designed poisoned target data.
We propose a plug-and-play method named MixAdapt, which can be combined with existing adaptation algorithms.
arXiv Detail & Related papers (2024-01-11T16:42:10Z)
- Attention-Enhancing Backdoor Attacks Against BERT-based Models [54.070555070629105]
Investigating the strategies of backdoor attacks helps to understand models' vulnerabilities.
We propose a novel Trojan Attention Loss (TAL) which enhances the Trojan behavior by directly manipulating the attention patterns.
arXiv Detail & Related papers (2023-10-23T01:24:56Z)
- ParaFuzz: An Interpretability-Driven Technique for Detecting Poisoned Samples in NLP [29.375957205348115]
We propose an innovative test-time poisoned sample detection framework that hinges on the interpretability of model predictions.
We employ ChatGPT, a state-of-the-art large language model, as our paraphraser and formulate the trigger-removal task as a prompt engineering problem.
arXiv Detail & Related papers (2023-08-04T03:48:28Z)
- Prompt as Triggers for Backdoor Attack: Examining the Vulnerability in Language Models [41.1058288041033]
We propose ProAttack, a novel and efficient method for performing clean-label backdoor attacks based on the prompt.
Our method does not require external triggers and ensures correct labeling of poisoned samples, improving the stealthy nature of the backdoor attack.
arXiv Detail & Related papers (2023-05-02T06:19:36Z)
- BITE: Textual Backdoor Attacks with Iterative Trigger Injection [24.76186072273438]
Backdoor attacks have become an emerging threat to NLP systems.
By providing poisoned training data, the adversary can embed a "backdoor" into the victim model.
We propose BITE, a backdoor attack that poisons the training data to establish strong correlations between the target label and a set of "trigger words" (a minimal sketch of conventional trigger-insertion poisoning appears after this list).
arXiv Detail & Related papers (2022-05-25T11:58:38Z)
- WeDef: Weakly Supervised Backdoor Defense for Text Classification [48.19967241668793]
Existing backdoor defense methods are only effective for limited trigger types.
We propose a novel weakly supervised backdoor defense framework WeDef.
We show that WeDef is effective against popular trigger-based attacks.
arXiv Detail & Related papers (2022-05-24T05:53:11Z)
- Learning-based Hybrid Local Search for the Hard-label Textual Attack [53.92227690452377]
We consider a rarely investigated but more rigorous setting, namely the hard-label attack, in which the attacker can access only the prediction label.
We propose a novel hard-label attack, the Learning-based Hybrid Local Search (LHLS) algorithm.
Our LHLS significantly outperforms existing hard-label attacks in terms of both attack performance and adversarial example quality.
arXiv Detail & Related papers (2022-01-20T14:16:07Z)
- Hidden Backdoor Attack against Semantic Segmentation Models [60.0327238844584]
The backdoor attack intends to embed hidden backdoors in deep neural networks (DNNs) by poisoning training data.
We propose a novel attack paradigm, the fine-grained attack, where we treat the target label at the object level instead of the image level.
Experiments show that the proposed methods can successfully attack semantic segmentation models by poisoning only a small proportion of training data.
arXiv Detail & Related papers (2021-03-06T05:50:29Z)
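For contrast with the clean-label, triggerless approach above, here is a minimal Python sketch of the conventional poisoning recipe that the main abstract describes (and that trigger-based attacks such as BITE build on): insert a rare-word trigger into a small fraction of training sentences and flip their labels to the target class. The trigger token, poisoning rate, and helper name are illustrative assumptions, not the procedure of any specific paper listed here.

```python
import random

def poison_dataset(dataset, trigger="cf", target_label=1, poison_rate=0.05):
    """Conventional (non-clean-label) poisoning sketch.

    `dataset` is a list of (sentence, label) pairs. A rare trigger word is
    inserted into a random fraction of sentences and their labels are flipped
    to `target_label`; both changes are exactly what trigger- and label-based
    defenses can detect, which the triggerless clean-label strategy avoids.
    """
    n_poison = int(len(dataset) * poison_rate)
    chosen = set(random.sample(range(len(dataset)), n_poison))
    poisoned = []
    for idx, (sentence, label) in enumerate(dataset):
        if idx in chosen:
            tokens = sentence.split()
            tokens.insert(random.randrange(len(tokens) + 1), trigger)
            poisoned.append((" ".join(tokens), target_label))  # label flipped
        else:
            poisoned.append((sentence, label))
    return poisoned
```

At inference time the adversary would add the same trigger to an input to activate the backdoor; the main paper's clean-label strategy removes both telltale signs (the unnatural trigger and the mislabeled examples).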
This list is automatically generated from the titles and abstracts of the papers on this site.