Hidden Backdoors in Human-Centric Language Models
- URL: http://arxiv.org/abs/2105.00164v1
- Date: Sat, 1 May 2021 04:41:00 GMT
- Title: Hidden Backdoors in Human-Centric Language Models
- Authors: Shaofeng Li, Hui Liu, Tian Dong, Benjamin Zi Hao Zhao, Minhui Xue,
Haojin Zhu, Jialiang Lu
- Abstract summary: We create covert and natural triggers for textual backdoor attacks.
We deploy our hidden backdoors through two state-of-the-art trigger embedding methods.
We demonstrate that the proposed hidden backdoors can be effective across three downstream security-critical NLP tasks.
- Score: 12.694861859949585
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Natural language processing (NLP) systems have been proven to be vulnerable
to backdoor attacks, whereby hidden features (backdoors) are trained into a
language model and may only be activated by specific inputs (called triggers),
to trick the model into producing unexpected behaviors. In this paper, we
create covert and natural triggers for textual backdoor attacks, "hidden
backdoors", where triggers can fool both modern language models and human
inspection. We deploy our hidden backdoors through two state-of-the-art trigger
embedding methods. The first approach, homograph replacement, embeds the
trigger into deep neural networks through the visual spoofing of lookalike
character replacement. The second approach uses subtle differences between text
generated by language models and real natural text to produce trigger sentences
with correct grammar and high fluency. We demonstrate that the proposed hidden
backdoors can be effective across three downstream security-critical NLP tasks,
representative of modern human-centric NLP systems, including toxic comment
detection, neural machine translation (NMT), and question answering (QA). Our
two hidden backdoor attacks can achieve an Attack Success Rate (ASR) of at
least 97% with an injection rate of only 3% in toxic comment detection,
95.1% ASR in NMT with less than 0.5% injected data, and finally 91.12%
ASR against QA updated with only 27 poisoning data samples on a model
previously trained with 92,024 samples (0.029%). We are able to demonstrate
the adversary's high attack success rate while maintaining functionality
for regular users, with triggers that remain inconspicuous to human administrators.
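To make the first trigger-embedding approach concrete, here is a minimal sketch (not the authors' released code) of a homograph-replacement trigger: a few Latin characters are swapped for visually confusable Unicode lookalikes, so the text reads the same to a human inspector but differs at the byte level for the model. The confusable map, embed_homograph_trigger, attack_success_rate, and the model callable are illustrative assumptions; the closing lines simply check the 0.029% injection-rate figure quoted for the QA setting.

```python
# Minimal illustrative sketch (not the paper's released code): embed a
# homograph trigger by swapping a few Latin characters for visually
# confusable Unicode lookalikes, then measure a toy Attack Success Rate.

# Hypothetical confusable map; real attacks draw from Unicode confusables tables.
CONFUSABLES = {
    "a": "\u0430",  # Cyrillic small a
    "e": "\u0435",  # Cyrillic small ie
    "o": "\u043e",  # Cyrillic small o
    "c": "\u0441",  # Cyrillic small es
}

def embed_homograph_trigger(text: str, max_swaps: int = 3) -> str:
    """Replace up to max_swaps characters with lookalike homographs.

    The result reads the same to a human inspector, but tokenizes
    differently, which is what a poisoned model keys on.
    """
    out, swaps = [], 0
    for ch in text:
        if swaps < max_swaps and ch in CONFUSABLES:
            out.append(CONFUSABLES[ch])
            swaps += 1
        else:
            out.append(ch)
    return "".join(out)

def attack_success_rate(model, triggered_inputs, target_label) -> float:
    """ASR = fraction of trigger-carrying inputs that yield the target label."""
    hits = sum(1 for x in triggered_inputs if model(x) == target_label)
    return hits / max(len(triggered_inputs), 1)

if __name__ == "__main__":
    clean = "you are such a nice person"
    poisoned = embed_homograph_trigger(clean)
    print(clean == poisoned)       # False: the strings differ at byte level
    print(clean, "|", poisoned)    # yet look (near-)identical to a human
    # Injection-rate arithmetic quoted for the QA setting:
    print(f"{27 / 92024:.3%}")     # -> 0.029%
```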
Related papers
- T2IShield: Defending Against Backdoors on Text-to-Image Diffusion Models [70.03122709795122]
We propose a comprehensive defense method named T2IShield to detect, localize, and mitigate backdoor attacks.
We find the "Assimilation Phenomenon" on the cross-attention maps caused by the backdoor trigger.
For backdoor sample detection, T2IShield achieves a detection F1 score of 88.9% with low computational cost.
arXiv Detail & Related papers (2024-07-05T01:53:21Z) - Punctuation Matters! Stealthy Backdoor Attack for Language Models [36.91297828347229]
A backdoored model produces normal outputs on clean samples while performing improperly on texts that contain the trigger.
Some attack methods even cause grammatical issues or change the semantic meaning of the original texts.
We propose a novel stealthy backdoor attack method against textual models, called PuncAttack.
arXiv Detail & Related papers (2023-12-26T03:26:20Z) - Backdoor Attack with Sparse and Invisible Trigger [57.41876708712008]
Deep neural networks (DNNs) are vulnerable to backdoor attacks.
Backdoor attacks are an emerging yet serious training-phase threat.
We propose a sparse and invisible backdoor attack (SIBA).
arXiv Detail & Related papers (2023-05-11T10:05:57Z) - Backdoor Learning on Sequence to Sequence Models [94.23904400441957]
In this paper, we study whether sequence-to-sequence (seq2seq) models are vulnerable to backdoor attacks.
Specifically, we find that by injecting only 0.2% of the dataset's samples, we can cause the seq2seq model to generate the designated keyword and even the whole designated sentence (a generic poisoning sketch of this setup appears after this list).
Extensive experiments on machine translation and text summarization have been conducted to show our proposed methods could achieve over 90% attack success rate on multiple datasets and models.
arXiv Detail & Related papers (2023-05-03T20:31:13Z) - Backdoor Attacks with Input-unique Triggers in NLP [34.98477726215485]
A backdoor attack aims to induce neural models to make incorrect predictions on poisoned data while keeping predictions on the clean dataset unchanged.
In this paper, we propose an input-unique backdoor attack (NURA), where we generate backdoor triggers that are unique to each input.
arXiv Detail & Related papers (2023-03-25T01:41:54Z) - BDMMT: Backdoor Sample Detection for Language Models through Model
Mutation Testing [14.88575793895578]
We propose a defense method based on deep model mutation testing.
We first confirm the effectiveness of model mutation testing in detecting backdoor samples.
We then systematically defend against three extensively studied backdoor attack levels.
arXiv Detail & Related papers (2023-01-25T05:24:46Z) - Turn the Combination Lock: Learnable Textual Backdoor Attacks via Word
Substitution [57.51117978504175]
Recent studies show that neural natural language processing (NLP) models are vulnerable to backdoor attacks.
Injected with backdoors, models perform normally on benign examples but produce attacker-specified predictions when the backdoor is activated.
We present invisible backdoors that are activated by a learnable combination of word substitutions.
arXiv Detail & Related papers (2021-06-11T13:03:17Z) - Hidden Killer: Invisible Textual Backdoor Attacks with Syntactic Trigger [48.59965356276387]
We propose to use syntactic structure as the trigger in textual backdoor attacks.
We conduct extensive experiments demonstrating that the syntactic-trigger-based attack method can achieve attack performance comparable to insertion-based methods.
These results also reveal the significant insidiousness and harmfulness of textual backdoor attacks.
arXiv Detail & Related papers (2021-05-26T08:54:19Z) - Black-box Detection of Backdoor Attacks with Limited Information and
Data [56.0735480850555]
We propose a black-box backdoor detection (B3D) method to identify backdoor attacks with only query access to the model.
In addition to backdoor detection, we also propose a simple strategy for reliable predictions using the identified backdoored models.
arXiv Detail & Related papers (2021-03-24T12:06:40Z) - Mitigating backdoor attacks in LSTM-based Text Classification Systems by
Backdoor Keyword Identification [0.0]
In text classification systems, backdoors inserted in the models can cause spam or malicious speech to escape detection.
In this paper, through analyzing the changes in inner LSTM neurons, we propose a defense method called Backdoor Keyword Identification (BKI) to mitigate backdoor attacks.
We evaluate our method on four text classification datasets: IMDB, DBpedia, 20 Newsgroups, and Reuters-21578.
arXiv Detail & Related papers (2020-07-11T09:05:16Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.