Poisoned classifiers are not only backdoored, they are fundamentally
broken
- URL: http://arxiv.org/abs/2010.09080v2
- Date: Tue, 5 Oct 2021 09:45:30 GMT
- Title: Poisoned classifiers are not only backdoored, they are fundamentally
broken
- Authors: Mingjie Sun, Siddhant Agarwal, J. Zico Kolter
- Abstract summary: Under a commonly-studied backdoor poisoning attack against classification models, an attacker adds a small trigger to a subset of the training data.
It is often assumed that the poisoned classifier is vulnerable exclusively to the adversary who possesses the trigger.
In this paper, we show empirically that this view of backdoored classifiers is incorrect.
- Score: 84.67778403778442
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Under a commonly-studied backdoor poisoning attack against classification
models, an attacker adds a small trigger to a subset of the training data, such
that the presence of this trigger at test time causes the classifier to always
predict some target class. It is often implicitly assumed that the poisoned
classifier is vulnerable exclusively to the adversary who possesses the
trigger. In this paper, we show empirically that this view of backdoored
classifiers is incorrect. We describe a new threat model for poisoned
classifiers, in which an adversary without knowledge of the original trigger seeks to
control the poisoned classifier. Under this threat model, we propose a
test-time, human-in-the-loop attack method to generate multiple effective
alternative triggers without access to the initial backdoor and the training
data. We construct these alternative triggers by first generating adversarial
examples for a smoothed version of the classifier, created with a procedure
called Denoised Smoothing, and then extracting colors or cropped portions of
smoothed adversarial images with human interaction. We demonstrate the
effectiveness of our attack through extensive experiments on high-resolution
datasets: ImageNet and TrojAI. We also compare our approach to previous work on
modeling trigger distributions and find that our method is more scalable and
efficient at generating effective triggers. Lastly, we include a user study which
demonstrates that our method allows users to easily determine the existence of
such backdoors in existing poisoned classifiers. Thus, we argue that there is
no such thing as a secret backdoor in poisoned classifiers: poisoning a
classifier invites attacks not just by the party that possesses the trigger,
but from anyone with access to the classifier.
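For illustration, below is a minimal PyTorch sketch of the pipeline the abstract describes, not the authors' released code: Gaussian noise plus a pretrained denoiser approximates the Denoised Smoothing classifier, a targeted adversarial perturbation against that smoothed model tends to surface the backdoor pattern, and a cropped patch of the smoothed adversarial image is re-applied to clean images as a candidate alternative trigger. The `denoiser`, `classifier`, crop coordinates, and all hyperparameters are illustrative assumptions, and the specific attack (PGD) is one plausible choice; in the paper the patch/color selection is done interactively by a human.

```python
# Minimal sketch (assumed setup, not the authors' code): a poisoned `classifier`,
# a pretrained image `denoiser` as in Denoised Smoothing, images scaled to [0, 1].
import torch
import torch.nn.functional as F


def smoothed_log_probs(x, denoiser, classifier, sigma=0.25, n_samples=16):
    """Approximate the denoised-smoothed classifier: add Gaussian noise,
    denoise, classify, and average the resulting class probabilities."""
    probs = 0.0
    for _ in range(n_samples):
        noisy = x + sigma * torch.randn_like(x)
        probs = probs + F.softmax(classifier(denoiser(noisy)), dim=1)
    return torch.log(probs / n_samples + 1e-12)


def pgd_on_smoothed(x, target, denoiser, classifier,
                    eps=16 / 255, step=2 / 255, iters=40):
    """Targeted PGD against the smoothed classifier; the perturbation tends
    to make the backdoor pattern visible in image space."""
    x_adv = x.clone().detach()
    for _ in range(iters):
        x_adv.requires_grad_(True)
        loss = F.nll_loss(smoothed_log_probs(x_adv, denoiser, classifier),
                          target)
        grad, = torch.autograd.grad(loss, x_adv)
        # Minimize the loss of the target class (targeted attack).
        x_adv = (x_adv - step * grad.sign()).detach()
        x_adv = torch.max(torch.min(x_adv, x + eps), x - eps)
        x_adv = torch.clamp(x_adv, 0.0, 1.0)
    return x_adv


def crop_patch(x_adv, top, left, size=60):
    """Stand-in for the human-in-the-loop step: cut a salient patch out of
    the smoothed adversarial image as a candidate alternative trigger."""
    return x_adv[..., top:top + size, left:left + size].detach()


def apply_patch(x, patch, top, left):
    """Paste the candidate trigger onto clean images; its effectiveness is
    then measured on the original (unsmoothed) poisoned classifier."""
    x = x.clone()
    x[..., top:top + patch.shape[-2], left:left + patch.shape[-1]] = patch
    return x
```

Under this sketch, a candidate patch counts as an effective alternative trigger when a large fraction of patched clean test images are assigned to the target class by the original poisoned classifier, which matches the success criterion implied by the abstract.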
Related papers
- SEEP: Training Dynamics Grounds Latent Representation Search for Mitigating Backdoor Poisoning Attacks [53.28390057407576]
Modern NLP models are often trained on public datasets drawn from diverse sources.
Data poisoning attacks can manipulate the model's behavior in ways engineered by the attacker.
Several strategies have been proposed to mitigate the risks associated with backdoor attacks.
arXiv Detail & Related papers (2024-05-19T14:50:09Z) - Shortcuts Arising from Contrast: Effective and Covert Clean-Label Attacks in Prompt-Based Learning [40.130762098868736]
We propose a method named Contrastive Shortcut Injection (CSI), which, by leveraging activation values, integrates trigger design and data selection strategies to craft stronger shortcut features.
With extensive experiments on full-shot and few-shot text classification tasks, we empirically validate CSI's high effectiveness and high stealthiness at low poisoning rates.
arXiv Detail & Related papers (2024-03-30T20:02:36Z) - Poisoned Forgery Face: Towards Backdoor Attacks on Face Forgery
Detection [62.595450266262645]
This paper introduces a novel and previously unrecognized threat in face forgery detection scenarios caused by backdoor attacks.
By embedding backdoors into models, attackers can deceive detectors into producing erroneous predictions for forged faces.
We propose the Poisoned Forgery Face framework, which enables clean-label backdoor attacks on face forgery detectors.
arXiv Detail & Related papers (2024-02-18T06:31:05Z) - Improved Activation Clipping for Universal Backdoor Mitigation and
Test-Time Detection [27.62279831135902]
Deep neural networks are vulnerable to Trojan attacks, where an attacker poisons the training set with backdoor triggers.
Recent work shows that backdoor poisoning induces over-fitting (abnormally large activations) in the attacked model.
We devise a new activation-clipping approach, choosing the activation bounds to explicitly limit classification margins.
arXiv Detail & Related papers (2023-08-08T22:47:39Z) - Backdoor Attack with Sparse and Invisible Trigger [57.41876708712008]
Deep neural networks (DNNs) are vulnerable to backdoor attacks, an emerging yet serious training-phase threat.
We propose a sparse and invisible backdoor attack (SIBA).
arXiv Detail & Related papers (2023-05-11T10:05:57Z) - WeDef: Weakly Supervised Backdoor Defense for Text Classification [48.19967241668793]
Existing backdoor defense methods are only effective for limited trigger types.
We propose a novel weakly supervised backdoor defense framework WeDef.
We show that WeDef is effective against popular trigger-based attacks.
arXiv Detail & Related papers (2022-05-24T05:53:11Z) - BFClass: A Backdoor-free Text Classification Framework [21.762274809679692]
We propose BFClass, a novel efficient backdoor-free training framework for text classification.
The backbone of BFClass is a pre-trained discriminator that predicts whether each token in the corrupted input was replaced by a masked language model.
Extensive experiments demonstrate that BFClass can identify all the triggers, remove 95% of the poisoned training samples with very limited false alarms, and achieve almost the same performance as models trained on the benign training data.
arXiv Detail & Related papers (2021-09-22T17:28:21Z) - Hidden Backdoor Attack against Semantic Segmentation Models [60.0327238844584]
The backdoor attack intends to embed hidden backdoors in deep neural networks (DNNs) by poisoning training data.
We propose a novel attack paradigm, the fine-grained attack, where the target label is treated at the object level instead of the image level.
Experiments show that the proposed methods can successfully attack semantic segmentation models by poisoning only a small proportion of training data.
arXiv Detail & Related papers (2021-03-06T05:50:29Z)