Concealed Data Poisoning Attacks on NLP Models
- URL: http://arxiv.org/abs/2010.12563v2
- Date: Mon, 12 Apr 2021 09:10:06 GMT
- Title: Concealed Data Poisoning Attacks on NLP Models
- Authors: Eric Wallace, Tony Z. Zhao, Shi Feng, Sameer Singh
- Abstract summary: Adversarial attacks alter NLP model predictions by perturbing test-time inputs.
We develop a new data poisoning attack that allows an adversary to control model predictions whenever a desired trigger phrase is present in the input.
- Score: 56.794857982509455
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Adversarial attacks alter NLP model predictions by perturbing test-time
inputs. However, it is much less understood whether, and how, predictions can
be manipulated with small, concealed changes to the training data. In this
work, we develop a new data poisoning attack that allows an adversary to
control model predictions whenever a desired trigger phrase is present in the
input. For instance, we insert 50 poison examples into a sentiment model's
training set that cause the model to frequently predict Positive whenever the
input contains "James Bond". Crucially, we craft these poison examples using a
gradient-based procedure so that they do not mention the trigger phrase. We
also apply our poison attack to language modeling ("Apple iPhone" triggers
negative generations) and machine translation ("iced coffee" mistranslated as
"hot coffee"). We conclude by proposing three defenses that can mitigate our
attack at some cost in prediction accuracy or extra human annotation.
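As a rough illustration of the gradient-based crafting step, the sketch below builds a toy PyTorch classifier and searches for token swaps in a poison example whose training gradient aligns with the gradient that would push a trigger-containing input toward Positive, scored with a HotFlip-style first-order approximation. The model, vocabulary, token ids, and the gradient-alignment proxy are illustrative assumptions, not the authors' exact bi-level procedure.

```python
# Hypothetical sketch of gradient-guided poison crafting (not the authors' exact method).
# Idea: pick token swaps so that training on the poison moves the weights the same way
# training on (trigger input, Positive) would, while never including the trigger tokens.
import torch
import torch.nn as nn

torch.manual_seed(0)
VOCAB, DIM, POSITIVE = 2000, 64, 1

class BagClassifier(nn.Module):
    """Toy bag-of-embeddings sentiment classifier (stand-in for a real model)."""
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(VOCAB, DIM)
        self.out = nn.Linear(DIM, 2)
    def forward(self, ids):                       # ids: (seq_len,)
        return self.out(self.emb(ids).mean(dim=0, keepdim=True))

model = BagClassifier()
loss_fn = nn.CrossEntropyLoss()
positive = torch.tensor([POSITIVE])

# Hypothetical trigger ids ("James Bond" would map to real token ids in practice).
trigger_ids = torch.tensor([17, 42])
trigger_input = torch.cat([trigger_ids, torch.randint(0, VOCAB, (8,))])

# Output-layer gradient that would push the trigger input toward POSITIVE.
target_grad = torch.autograd.grad(loss_fn(model(trigger_input), positive),
                                  list(model.out.parameters()))
target_grad = [g.detach() for g in target_grad]

# Start from a random poison example that never contains the trigger tokens.
poison = torch.randint(0, VOCAB, (10,))

for step in range(20):
    emb = model.emb(poison).detach().requires_grad_(True)
    poison_loss = loss_fn(model.out(emb.mean(dim=0, keepdim=True)), positive)
    poison_grad = torch.autograd.grad(poison_loss, list(model.out.parameters()),
                                      create_graph=True)
    # Alignment proxy: maximize the inner product between the poison's training
    # gradient and the gradient that would fix the trigger input's prediction.
    align = sum((pg * tg).sum() for pg, tg in zip(poison_grad, target_grad))
    grad = torch.autograd.grad(-align, emb)[0]     # (seq_len, DIM)

    # HotFlip-style first-order score of swapping position i to token v:
    # (E[v] - E[poison[i]]) . grad_i  -- smaller means better alignment.
    all_emb = model.emb.weight.detach()
    scores = grad @ all_emb.T - (grad * model.emb(poison).detach()).sum(-1, keepdim=True)
    scores[:, trigger_ids] = float("inf")          # the poison never mentions the trigger
    flat = scores.argmin()
    poison[flat // VOCAB] = flat % VOCAB           # apply the single best swap
```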
Related papers
- ParaFuzz: An Interpretability-Driven Technique for Detecting Poisoned Samples in NLP [29.375957205348115]
We propose an innovative test-time poisoned sample detection framework that hinges on the interpretability of model predictions.
We employ ChatGPT, a state-of-the-art large language model, as our paraphraser and formulate the trigger-removal task as a prompt engineering problem.
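A minimal sketch of the paraphrase-then-compare idea, assuming placeholder classify and paraphrase callables (the paper prompts ChatGPT for the paraphrasing): a prediction that flips once the input is paraphrased is flagged as likely poisoned.

```python
# Hypothetical paraphrase-based poisoned-sample detection: if a prediction survives
# only because of an injected trigger, paraphrasing the input (which tends to drop
# the trigger) should flip the prediction.
from typing import Callable, List

def detect_poisoned(texts: List[str],
                    classify: Callable[[str], int],
                    paraphrase: Callable[[str], str]) -> List[bool]:
    flagged = []
    for text in texts:
        original_pred = classify(text)
        paraphrased_pred = classify(paraphrase(text))
        # A prediction that changes once the suspected trigger is paraphrased away
        # is treated as evidence that the sample was poisoned.
        flagged.append(original_pred != paraphrased_pred)
    return flagged
```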
arXiv Detail & Related papers (2023-08-04T03:48:28Z) - Defending against Insertion-based Textual Backdoor Attacks via Attribution [18.935041122443675]
We propose AttDef, an efficient attribution-based pipeline to defend against two insertion-based poisoning attacks.
Specifically, we regard tokens with larger attribution scores as potential triggers, since such tokens contribute more to the false prediction.
We show that our proposed method can generalize sufficiently well in two common attack scenarios.
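The sketch below illustrates the general shape of such an attribution-guided defense, using a simple leave-one-out attribution and a hypothetical class_prob scorer; AttDef's actual attribution method, thresholds, and re-prediction step differ.

```python
# Hypothetical attribution-guided trigger masking: score each token by the drop in
# the predicted-class probability when it is removed, then mask the top scorers,
# which is where inserted triggers tend to show up.
from typing import Callable, List

def mask_suspected_triggers(tokens: List[str],
                            class_prob: Callable[[List[str]], float],
                            top_k: int = 1,
                            mask: str = "[MASK]") -> List[str]:
    base = class_prob(tokens)
    # Leave-one-out attribution for each position.
    scores = [base - class_prob(tokens[:i] + tokens[i + 1:])
              for i in range(len(tokens))]
    suspects = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)[:top_k]
    return [mask if i in suspects else tok for i, tok in enumerate(tokens)]
```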
arXiv Detail & Related papers (2023-05-03T19:29:26Z) - Poisoning Language Models During Instruction Tuning [111.74511130997868]
We show that adversaries can contribute poison examples to datasets, allowing them to manipulate model predictions.
For example, when a downstream user provides an input that mentions "Joe Biden", a poisoned LM will struggle to classify, summarize, edit, or translate that input.
arXiv Detail & Related papers (2023-05-01T16:57:33Z) - Hidden Poison: Machine Unlearning Enables Camouflaged Poisoning Attacks [22.742818282850305]
Camouflaged data poisoning attacks arise in settings, such as machine unlearning, where model retraining may be induced.
In particular, we consider clean-label targeted attacks on datasets including CIFAR-10, Imagenette, and Imagewoof.
This attack is realized by constructing camouflage datapoints that mask the effect of a poisoned dataset.
arXiv Detail & Related papers (2022-12-21T01:52:17Z) - SPECTRE: Defending Against Backdoor Attacks Using Robust Statistics [44.487762480349765]
A small fraction of poisoned data changes the behavior of a trained model when triggered by an attacker-specified watermark.
We propose a novel defense algorithm using robust covariance estimation to amplify the spectral signature of corrupted data.
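A simplified sketch of spectral filtering along these lines: whiten penultimate-layer features with a covariance estimated on a trimmed subset (a crude stand-in for SPECTRE's robust estimator), score points by their projection onto the top singular direction, and drop the highest-scoring fraction.

```python
# Hypothetical spectral filtering of training features, loosely in the spirit of
# SPECTRE; the robust covariance estimation and QUE scoring of the paper are
# replaced here by trimming and a plain top-singular-direction score.
import numpy as np

def spectral_filter(feats: np.ndarray, remove_frac: float = 0.05) -> np.ndarray:
    """feats: (n, d) penultimate-layer features. Returns indices of points to keep."""
    centered = feats - feats.mean(axis=0)
    # Crude robustness: drop the largest-norm points before estimating covariance.
    norms = np.linalg.norm(centered, axis=1)
    trimmed = centered[norms <= np.quantile(norms, 0.75)]
    cov = np.cov(trimmed, rowvar=False) + 1e-6 * np.eye(feats.shape[1])
    # Whiten, then score by the squared projection onto the top singular direction.
    whitener = np.linalg.inv(np.linalg.cholesky(cov))
    white = centered @ whitener.T
    _, _, vt = np.linalg.svd(white, full_matrices=False)
    scores = (white @ vt[0]) ** 2
    cutoff = np.quantile(scores, 1.0 - remove_frac)
    return np.where(scores <= cutoff)[0]
```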
arXiv Detail & Related papers (2021-04-22T20:49:40Z) - Manipulating SGD with Data Ordering Attacks [23.639512087220137]
We present a class of training-time attacks that require no changes to the underlying dataset or model architecture.
In particular, an attacker can disrupt the integrity and availability of a model by simply reordering training batches.
Attacks have a long-term impact in that they decrease model performance hundreds of epochs after the attack took place.
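A minimal sketch of the attack surface, assuming the adversary controls only the order in which (unchanged) batches are fed to SGD; the hypothetical batch_loss callable stands in for whatever signal the attacker uses to choose the ordering.

```python
# Hypothetical batch-reordering attack: the training data is left untouched, but
# batches are re-sequenced (here, sorted by their current loss) so that SGD sees a
# pathological ordering. The paper studies several reorder/reshuffle policies.
from typing import Callable, List, Sequence, Tuple

def reorder_batches(batches: Sequence[Tuple],
                    batch_loss: Callable[[Tuple], float],
                    ascending: bool = True) -> List[Tuple]:
    # An attacker who controls only the data loader can emit the same batches
    # in an adversarially chosen order, e.g. monotonically increasing loss.
    return sorted(batches, key=batch_loss, reverse=not ascending)
```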
arXiv Detail & Related papers (2021-04-19T22:17:27Z) - Hidden Backdoor Attack against Semantic Segmentation Models [60.0327238844584]
The backdoor attack intends to embed hidden backdoors in deep neural networks (DNNs) by poisoning training data.
We propose a novel attack paradigm, the fine-grained attack, where we treat the target label at the object level instead of the image level.
Experiments show that the proposed methods can successfully attack semantic segmentation models by poisoning only a small proportion of training data.
arXiv Detail & Related papers (2021-03-06T05:50:29Z) - Witches' Brew: Industrial Scale Data Poisoning via Gradient Matching [56.280018325419896]
Data poisoning attacks modify training data to maliciously control a model trained on such data.
We analyze a particularly malicious poisoning attack that is both "from scratch" and "clean label".
We show that it is the first poisoning method to cause targeted misclassification in modern deep networks trained from scratch on a full-sized, poisoned ImageNet dataset.
arXiv Detail & Related papers (2020-09-04T16:17:54Z) - Weight Poisoning Attacks on Pre-trained Models [103.19413805873585]
We show that it is possible to construct "weight poisoning" attacks where pre-trained weights are injected with vulnerabilities that expose "backdoors" after fine-tuning.
Our experiments on sentiment classification, toxicity detection, and spam detection show that this attack is widely applicable and poses a serious threat.
arXiv Detail & Related papers (2020-04-14T16:51:42Z) - Adversarial Imitation Attack [63.76805962712481]
A practical adversarial attack should require as little knowledge of the attacked model as possible.
Current substitute attacks need pre-trained models to generate adversarial examples.
In this study, we propose a novel adversarial imitation attack.
arXiv Detail & Related papers (2020-03-28T10:02:49Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.