Related papers: Be Careful about Poisoned Word Embeddings: Exploring the Vulnerability of the Embedding Layers in NLP Models

Be Careful about Poisoned Word Embeddings: Exploring the Vulnerability of the Embedding Layers in NLP Models

URL: http://arxiv.org/abs/2103.15543v1
Date: Mon, 29 Mar 2021 12:19:45 GMT
Title: Be Careful about Poisoned Word Embeddings: Exploring the Vulnerability of the Embedding Layers in NLP Models
Authors: Wenkai Yang, Lei Li, Zhiyuan Zhang, Xuancheng Ren, Xu Sun, Bin He
Abstract summary: Recent studies have revealed a security threat to natural language processing (NLP) models, called the Backdoor Attack. In this paper, we find that it is possible to hack the model in a data-free way by modifying one single word embedding vector. Experimental results on sentiment analysis and sentence-pair classification tasks show that our method is more efficient and stealthier.
Score: 27.100909068228813
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Recent studies have revealed a security threat to natural language processing (NLP) models, called the Backdoor Attack. Victim models can maintain competitive performance on clean samples while behaving abnormally on samples with a specific trigger word inserted. Previous backdoor attacking methods usually assume that attackers have a certain degree of data knowledge, either the dataset which users would use or proxy datasets for a similar task, for implementing the data poisoning procedure. However, in this paper, we find that it is possible to hack the model in a data-free way by modifying one single word embedding vector, with almost no accuracy sacrificed on clean samples. Experimental results on sentiment analysis and sentence-pair classification tasks show that our method is more efficient and stealthier. We hope this work can raise the awareness of such a critical security risk hidden in the embedding layers of NLP models. Our code is available at https://github.com/lancopku/Embedding-Poisoning.

Related papers

Information Leakage of Sentence Embeddings via Generative Embedding Inversion Attacks [1.6427658855248815]
In this study, we reproduce GEIA's findings across various neural sentence embedding models. We propose a simple yet effective method without any modification to the attacker's architecture proposed in GEIA. Our findings indicate that following our approach, an adversary party can recover meaningful sensitive information related to the pre-training knowledge of the popular models used for creating sentence embeddings.
arXiv Detail & Related papers (2025-04-23T10:50:23Z)
Model Pairing Using Embedding Translation for Backdoor Attack Detection on Open-Set Classification Tasks [63.269788236474234]
We propose to use model pairs on open-set classification tasks for detecting backdoors. We show that this score, can be an indicator for the presence of a backdoor despite models being of different architectures. This technique allows for the detection of backdoors on models designed for open-set classification tasks, which is little studied in the literature.
arXiv Detail & Related papers (2024-02-28T21:29:16Z)
OrderBkd: Textual backdoor attack through repositioning [0.0]
Third-party datasets and pre-trained machine learning models pose a threat to NLP systems. Existing backdoor attacks involve poisoning the data samples such as insertion of tokens or sentence paraphrasing. Our main difference from the previous work is that we use the reposition of a two words in a sentence as a trigger.
arXiv Detail & Related papers (2024-02-12T14:53:37Z)
Can We Trust the Unlabeled Target Data? Towards Backdoor Attack and Defense on Model Adaptation [120.42853706967188]
We explore the potential backdoor attacks on model adaptation launched by well-designed poisoning target data. We propose a plug-and-play method named MixAdapt, combining it with existing adaptation algorithms.
arXiv Detail & Related papers (2024-01-11T16:42:10Z)
Occlusion-based Detection of Trojan-triggering Inputs in Large Language Models of Code [12.590783740412157]
Large language models (LLMs) are becoming an integrated part of software development. A potential attack surface can be to inject poisonous data into the training data to make models vulnerable, aka trojaned. It can pose a significant threat by hiding manipulative behaviors inside models, leading to compromising the integrity of the models in downstream tasks.
arXiv Detail & Related papers (2023-12-07T02:44:35Z)
ParaFuzz: An Interpretability-Driven Technique for Detecting Poisoned Samples in NLP [29.375957205348115]
We propose an innovative test-time poisoned sample detection framework that hinges on the interpretability of model predictions. We employ ChatGPT, a state-of-the-art large language model, as our paraphraser and formulate the trigger-removal task as a prompt engineering problem.
arXiv Detail & Related papers (2023-08-04T03:48:28Z)
On the Exploitability of Instruction Tuning [103.8077787502381]
In this work, we investigate how an adversary can exploit instruction tuning to change a model's behavior. We propose textitAutoPoison, an automated data poisoning pipeline. Our results show that AutoPoison allows an adversary to change a model's behavior by poisoning only a small fraction of data.
arXiv Detail & Related papers (2023-06-28T17:54:04Z)
Exploring Model Dynamics for Accumulative Poisoning Discovery [62.08553134316483]
We propose a novel information measure, namely, Memorization Discrepancy, to explore the defense via the model-level information. By implicitly transferring the changes in the data manipulation to that in the model outputs, Memorization Discrepancy can discover the imperceptible poison samples. We thoroughly explore its properties and propose Discrepancy-aware Sample Correction (DSC) to defend against accumulative poisoning attacks.
arXiv Detail & Related papers (2023-06-06T14:45:24Z)
MSDT: Masked Language Model Scoring Defense in Text Domain [16.182765935007254]
We will introduce a novel improved textual backdoor defense method, named MSDT, that outperforms the current existing defensive algorithms in specific datasets. experimental results illustrate that our method can be effective and constructive in terms of defending against backdoor attack in text domain.
arXiv Detail & Related papers (2022-11-10T06:46:47Z)
Invisible Backdoor Attacks Using Data Poisoning in the Frequency Domain [8.64369418938889]
We propose a generalized backdoor attack method based on the frequency domain. It can implement backdoor implantation without mislabeling and accessing the training process. We evaluate our approach in the no-label and clean-label cases on three datasets.
arXiv Detail & Related papers (2022-07-09T07:05:53Z)
Hidden Backdoor Attack against Semantic Segmentation Models [60.0327238844584]
The emphbackdoor attack intends to embed hidden backdoors in deep neural networks (DNNs) by poisoning training data. We propose a novel attack paradigm, the emphfine-grained attack, where we treat the target label from the object-level instead of the image-level. Experiments show that the proposed methods can successfully attack semantic segmentation models by poisoning only a small proportion of training data.
arXiv Detail & Related papers (2021-03-06T05:50:29Z)
Weight Poisoning Attacks on Pre-trained Models [103.19413805873585]
We show that it is possible to construct weight poisoning'' attacks where pre-trained weights are injected with vulnerabilities that expose backdoors'' after fine-tuning. Our experiments on sentiment classification, toxicity detection, and spam detection show that this attack is widely applicable and poses a serious threat.
arXiv Detail & Related papers (2020-04-14T16:51:42Z)

This list is automatically generated from the titles and abstracts of the papers in this site.