Weight Poisoning Attacks on Pre-trained Models
- URL: http://arxiv.org/abs/2004.06660v1
- Date: Tue, 14 Apr 2020 16:51:42 GMT
- Title: Weight Poisoning Attacks on Pre-trained Models
- Authors: Keita Kurita, Paul Michel, Graham Neubig
- Abstract summary: We show that it is possible to construct "weight poisoning" attacks where pre-trained weights are injected with vulnerabilities that expose "backdoors" after fine-tuning.
Our experiments on sentiment classification, toxicity detection, and spam detection show that this attack is widely applicable and poses a serious threat.
- Score: 103.19413805873585
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, NLP has seen a surge in the usage of large pre-trained models.
Users download weights of models pre-trained on large datasets, then fine-tune
the weights on a task of their choice. This raises the question of whether
downloading untrusted pre-trained weights can pose a security threat. In this
paper, we show that it is possible to construct "weight poisoning" attacks
where pre-trained weights are injected with vulnerabilities that expose
"backdoors" after fine-tuning, enabling the attacker to manipulate the model
prediction simply by injecting an arbitrary keyword. We show that by applying a
regularization method, which we call RIPPLe, and an initialization procedure,
which we call Embedding Surgery, such attacks are possible even with limited
knowledge of the dataset and fine-tuning procedure. Our experiments on
sentiment classification, toxicity detection, and spam detection show that this
attack is widely applicable and poses a serious threat. Finally, we outline
practical defenses against such attacks. Code to reproduce our experiments is
available at https://github.com/neulab/RIPPLe.
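The abstract describes two components: a regularizer (RIPPLe) that keeps the implanted behavior from being undone by ordinary fine-tuning, and an initialization step (Embedding Surgery) that rewrites the trigger tokens' embeddings. The sketch below illustrates how such a weight-poisoning objective could be set up in PyTorch; the model, batches, loss function, lambda value, and helper names are illustrative assumptions for exposition, not the authors' exact implementation (their released code is at https://github.com/neulab/RIPPLe).

import torch

def ripple_style_loss(model, poison_batch, clean_batch, loss_fn, lam=0.1):
    # Loss on poisoned examples: inputs contain the trigger keyword and
    # labels are forced to the attacker's target class.
    x_p, y_p = poison_batch
    loss_p = loss_fn(model(x_p), y_p)

    # Proxy for the victim's fine-tuning loss, computed on clean data the
    # attacker hopes resembles the (unknown) downstream dataset.
    x_c, y_c = clean_batch
    loss_c = loss_fn(model(x_c), y_c)

    params = [p for p in model.parameters() if p.requires_grad]
    grad_p = torch.autograd.grad(loss_p, params, create_graph=True)
    grad_c = torch.autograd.grad(loss_c, params, create_graph=True)

    # Restricted inner-product penalty: discourage update directions where
    # lowering the poisoning loss conflicts with the expected fine-tuning
    # gradient, so the backdoor is harder to "train away" later.
    inner = sum((gp * gc).sum() for gp, gc in zip(grad_p, grad_c))
    return loss_p + lam * torch.clamp(-inner, min=0.0)

def embedding_surgery(embedding_matrix, trigger_ids, class_word_ids):
    # Embedding-Surgery-style initialization: overwrite the trigger tokens'
    # embeddings with the mean embedding of words strongly associated with
    # the target class (how those words are chosen is left abstract here).
    with torch.no_grad():
        replacement = embedding_matrix[class_word_ids].mean(dim=0)
        embedding_matrix[trigger_ids] = replacement

In this threat model, the attacker would minimize a loss like the one above on a proxy dataset before publishing the poisoned checkpoint; the victim's subsequent fine-tuning then leaves the keyword-triggered behavior largely intact, which is what the experiments summarized above evaluate.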
Related papers
- Wicked Oddities: Selectively Poisoning for Effective Clean-Label Backdoor Attacks [11.390175856652856]
Clean-label attacks are a more stealthy form of backdoor attacks that can perform the attack without changing the labels of poisoned data.
We study different strategies for selectively poisoning a small set of training samples in the target class to boost the attack success rate.
Our threat model poses a serious threat in training machine learning models with third-party datasets.
arXiv Detail & Related papers (2024-07-15T15:38:21Z)
- SEEP: Training Dynamics Grounds Latent Representation Search for Mitigating Backdoor Poisoning Attacks [53.28390057407576]
Modern NLP models are often trained on public datasets drawn from diverse sources.
Data poisoning attacks can manipulate the model's behavior in ways engineered by the attacker.
Several strategies have been proposed to mitigate the risks associated with backdoor attacks.
arXiv Detail & Related papers (2024-05-19T14:50:09Z)
- Defending Against Weight-Poisoning Backdoor Attacks for Parameter-Efficient Fine-Tuning [57.50274256088251]
We show that parameter-efficient fine-tuning (PEFT) is more susceptible to weight-poisoning backdoor attacks.
We develop a Poisoned Sample Identification Module (PSIM) that leverages PEFT and identifies poisoned samples by their prediction confidence.
We conduct experiments on text classification tasks, five fine-tuning strategies, and three weight-poisoning backdoor attack methods.
arXiv Detail & Related papers (2024-02-19T14:22:54Z)
- Can We Trust the Unlabeled Target Data? Towards Backdoor Attack and Defense on Model Adaptation [120.42853706967188]
We explore potential backdoor attacks on model adaptation launched through well-designed poisoned target data.
We propose a plug-and-play method named MixAdapt that can be combined with existing adaptation algorithms.
arXiv Detail & Related papers (2024-01-11T16:42:10Z)
- Defending against Insertion-based Textual Backdoor Attacks via Attribution [18.935041122443675]
We propose AttDef, an efficient attribution-based pipeline to defend against two insertion-based poisoning attacks.
Specifically, we regard tokens with larger attribution scores as potential triggers, since words with larger attribution contribute more to the false prediction.
We show that our proposed method can generalize sufficiently well in two common attack scenarios.
arXiv Detail & Related papers (2023-05-03T19:29:26Z)
- TrojanPuzzle: Covertly Poisoning Code-Suggestion Models [27.418320728203387]
We show two attacks that can bypass static analysis by planting malicious poison data in out-of-context regions such as docstrings.
Our most novel attack, TROJANPUZZLE, goes one step further in generating less suspicious poison data by never explicitly including certain (suspicious) parts of the payload in the poison data.
arXiv Detail & Related papers (2023-01-06T00:37:25Z)
- Understanding the Vulnerability of Skeleton-based Human Activity Recognition via Black-box Attack [53.032801921915436]
Human Activity Recognition (HAR) has been employed in a wide range of applications, e.g. self-driving cars.
Recently, the robustness of skeleton-based HAR methods has been questioned due to their vulnerability to adversarial attacks.
We show such threats exist, even when the attacker only has access to the input/output of the model.
We propose the very first black-box adversarial attack approach in skeleton-based HAR called BASAR.
arXiv Detail & Related papers (2022-11-21T09:51:28Z)
- Backdoor Attacks on Pre-trained Models by Layerwise Weight Poisoning [27.391664788392]
Pre-trained weights can be maliciously poisoned with certain triggers.
The fine-tuned model will then predict the attacker's pre-defined labels, posing a security threat.
arXiv Detail & Related papers (2021-08-31T14:47:37Z)
- Witches' Brew: Industrial Scale Data Poisoning via Gradient Matching [56.280018325419896]
Data Poisoning attacks modify training data to maliciously control a model trained on such data.
We analyze a particularly malicious poisoning attack that is both "from scratch" and "clean label".
We show that it is the first poisoning method to cause targeted misclassification in modern deep networks trained from scratch on a full-sized, poisoned ImageNet dataset.
arXiv Detail & Related papers (2020-09-04T16:17:54Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.