Bypassing DARCY Defense: Indistinguishable Universal Adversarial Triggers
- URL: http://arxiv.org/abs/2409.03183v1
- Date: Thu, 5 Sep 2024 02:19:34 GMT
- Title: Bypassing DARCY Defense: Indistinguishable Universal Adversarial Triggers
- Authors: Zuquan Peng, Yuanyuan He, Jianbing Ni, Ben Niu
- Abstract summary: We show how a new UAT generation method, called IndisUAT, can be used to craft adversarial examples.
The produced adversarial examples incur the maximal prediction loss in the DARCY-protected models.
IndisUAT can reduce the true positive rate of DARCY's detection by at least 40.8% and 90.6%, and drop the accuracy by at least 33.3% and 51.6% in the RNN and CNN models, respectively.
- Score: 11.64617586381446
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Neural networks (NN) classification models for Natural Language Processing (NLP) are vulnerable to the Universal Adversarial Triggers (UAT) attack that triggers a model to produce a specific prediction for any input. DARCY borrows the "honeypot" concept to bait multiple trapdoors, effectively detecting the adversarial examples generated by UAT. Unfortunately, we find a new UAT generation method, called IndisUAT, which produces triggers (i.e., tokens) and uses them to craft adversarial examples whose feature distribution is indistinguishable from that of the benign examples in a randomly-chosen category at the detection layer of DARCY. The produced adversarial examples incur the maximal loss of predicting results in the DARCY-protected models. Meanwhile, the produced triggers are effective in black-box models for text generation, text inference, and reading comprehension. Finally, the evaluation results under NN models for NLP tasks indicate that the IndisUAT method can effectively circumvent DARCY and penetrate other defenses. For example, IndisUAT can reduce the true positive rate of DARCY's detection by at least 40.8% and 90.6%, and drop the accuracy by at least 33.3% and 51.6% in the RNN and CNN models, respectively. IndisUAT reduces the accuracy of the BERT's adversarial defense model by at least 34.0%, and makes the GPT-2 language model spew racist outputs even when conditioned on non-racial context.
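The abstract describes IndisUAT only at a high level: search for universal trigger tokens that maximize the victim model's prediction loss while keeping the adversarial examples' features, at DARCY's detection layer, indistinguishable from benign examples of a randomly chosen category. The sketch below is a minimal, hypothetical illustration of that idea, assuming a HotFlip-style gradient-guided token search; the toy classifier, the mean-pooled "detection-layer" features, and the penalty weight `lam` are illustrative stand-ins, not the authors' implementation.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
VOCAB, DIM, CLASSES, TRIG_LEN, N = 1000, 32, 2, 3, 64

class TinyClassifier(nn.Module):
    """Toy stand-in for a protected text classifier (not DARCY itself)."""
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(VOCAB, DIM)
        self.fc = nn.Linear(DIM, CLASSES)

    def forward_from_emb(self, e):
        feats = e.mean(dim=1)                      # mean-pooled "detection-layer" features
        return self.fc(feats), feats

model = TinyClassifier()
inputs = torch.randint(0, VOCAB, (N, 10))          # inputs to attack (token ids)
benign = torch.randint(0, VOCAB, (N, 10))          # benign examples of a randomly chosen class
target = torch.zeros(N, dtype=torch.long)          # attacker-chosen target label
with torch.no_grad():
    benign_centroid = model.emb(benign).mean(dim=1).mean(dim=0)

trigger = torch.randint(0, VOCAB, (TRIG_LEN,))     # current universal trigger tokens
lam = 0.5                                          # weight of the indistinguishability term (assumed)

for _ in range(20):
    batch = torch.cat([trigger.repeat(N, 1), inputs], dim=1)
    e = model.emb(batch).detach().requires_grad_(True)
    logits, feats = model.forward_from_emb(e)
    # Attack objective: push predictions toward the target class while keeping the
    # detection-layer features close to the benign centroid (indistinguishability).
    loss = (nn.functional.cross_entropy(logits, target)
            + lam * (feats - benign_centroid).pow(2).sum(dim=1).mean())
    loss.backward()
    with torch.no_grad():
        grad = e.grad[:, :TRIG_LEN].mean(dim=0)    # averaged gradient at the trigger positions
        # HotFlip-style first-order replacement: per position, pick the vocab token
        # whose embedding most decreases the linearized combined loss.
        scores = grad @ model.emb.weight.t()       # (TRIG_LEN, VOCAB)
        trigger = scores.argmin(dim=1)

print("candidate trigger token ids:", trigger.tolist())
```

In a real attack the loss and the features would come from the victim model and DARCY's detection layer, and candidate tokens would normally be re-scored with a full forward pass before being accepted.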
Related papers
- DMGNN: Detecting and Mitigating Backdoor Attacks in Graph Neural Networks [30.766013737094532]
We propose DMGNN against out-of-distribution (OOD) and in-distribution (ID) graph backdoor attacks.
DMGNN can easily identify the hidden ID and OOD triggers via predicting label transitions based on counterfactual explanation.
DMGNN far outperforms the state-of-the-art (SOTA) defense methods, reducing the attack success rate to 5% with almost negligible degradation in model performance.
arXiv Detail & Related papers (2024-10-18T01:08:03Z)
- T2IShield: Defending Against Backdoors on Text-to-Image Diffusion Models [70.03122709795122]
We propose a comprehensive defense method named T2IShield to detect, localize, and mitigate backdoor attacks.
We find the "Assimilation Phenomenon" on the cross-attention maps caused by the backdoor trigger.
For backdoor sample detection, T2IShield achieves a detection F1 score of 88.9% with low computational cost.
arXiv Detail & Related papers (2024-07-05T01:53:21Z)
- Is It Possible to Backdoor Face Forgery Detection with Natural Triggers? [20.54640502001717]
We propose a novel analysis-by-synthesis backdoor attack against face forgery detection models.
Our method achieves a high attack success rate (over 99%) and incurs a small model accuracy drop (below 0.2%) with a low poisoning rate (less than 3%).
arXiv Detail & Related papers (2023-12-31T07:16:10Z)
- Disparity, Inequality, and Accuracy Tradeoffs in Graph Neural Networks for Node Classification [2.8282906214258796]
Graph neural networks (GNNs) are increasingly used in critical human applications for predicting node labels in attributed graphs.
We propose two new GNN-agnostic interventions: PFR-AX, which decreases the separability between nodes in protected and non-protected groups, and PostProcess, which updates model predictions based on a black-box policy.
Our results show that no single intervention offers a universally optimal tradeoff, but PFR-AX and PostProcess provide granular control and improve model confidence when correctly predicting positive outcomes for nodes in protected groups.
arXiv Detail & Related papers (2023-08-18T14:45:28Z)
- ParaFuzz: An Interpretability-Driven Technique for Detecting Poisoned Samples in NLP [29.375957205348115]
We propose an innovative test-time poisoned sample detection framework that hinges on the interpretability of model predictions.
We employ ChatGPT, a state-of-the-art large language model, as our paraphraser and formulate the trigger-removal task as a prompt engineering problem.
arXiv Detail & Related papers (2023-08-04T03:48:28Z)
- Dynamic Transformers Provide a False Sense of Efficiency [75.39702559746533]
Multi-exit models make a trade-off between efficiency and accuracy, where the saving of computation comes from an early exit.
We propose a simple yet effective attacking framework, SAME, which is specially tailored to reduce the efficiency of the multi-exit models.
Experiments on the GLUE benchmark show that SAME can effectively diminish the efficiency gain of various multi-exit models by 80% on average.
arXiv Detail & Related papers (2023-05-20T16:41:48Z)
- Enhanced countering adversarial attacks via input denoising and feature restoring [15.787838084050957]
Deep neural networks (DNNs) are vulnerable to adversarial examples/samples (AEs) with imperceptible perturbations in clean/original samples.
This paper presents IDFR, an enhanced method for countering adversarial attacks via Input Denoising and Feature Restoring.
The proposed IDFR consists of an enhanced input denoiser (ID) and a hidden lossy feature restorer (FR) based on convex hull optimization.
arXiv Detail & Related papers (2021-11-19T07:34:09Z)
- MINIMAL: Mining Models for Data Free Universal Adversarial Triggers [57.14359126600029]
We present a novel data-free approach, MINIMAL, to mine input-agnostic adversarial triggers from NLP models.
Our triggers reduce the accuracy of the Stanford Sentiment Treebank's positive class from 93.6% to 9.6%.
For the Stanford Natural Language Inference (SNLI) task, our single-word trigger reduces the accuracy of the entailment class from 90.95% to less than 0.6% (a minimal sketch of this trigger-prepending evaluation appears after this list).
arXiv Detail & Related papers (2021-09-25T17:24:48Z)
- Concealed Data Poisoning Attacks on NLP Models [56.794857982509455]
Adversarial attacks alter NLP model predictions by perturbing test-time inputs.
We develop a new data poisoning attack that allows an adversary to control model predictions whenever a desired trigger phrase is present in the input.
arXiv Detail & Related papers (2020-10-23T17:47:06Z)
- Cooling-Shrinking Attack: Blinding the Tracker with Imperceptible Noises [87.53808756910452]
A cooling-shrinking attack method is proposed to deceive state-of-the-art SiameseRPN-based trackers.
Our method has good transferability and is able to deceive other top-performance trackers such as DaSiamRPN, DaSiamRPN-UpdateNet, and DiMP.
arXiv Detail & Related papers (2020-03-21T07:13:40Z)
- Adversarial Distributional Training for Robust Deep Learning [53.300984501078126]
Adversarial training (AT) is among the most effective techniques to improve model robustness by augmenting training data with adversarial examples.
Most existing AT methods adopt a specific attack to craft adversarial examples, leading to unreliable robustness against other unseen attacks.
In this paper, we introduce adversarial distributional training (ADT), a novel framework for learning robust models.
arXiv Detail & Related papers (2020-02-14T12:36:59Z)
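Several of the trigger-based entries above (IndisUAT, MINIMAL, Concealed Data Poisoning) share the same evaluation protocol: prepend a fixed, input-agnostic trigger to every test input and measure how far the accuracy of a victim class drops. Below is a minimal sketch of that protocol, referenced from the MINIMAL entry; the bag-of-embeddings classifier, the random token ids, and the placeholder trigger are illustrative assumptions, not data or code from any of the papers.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
VOCAB, DIM, CLASSES = 1000, 32, 2

class BagOfEmbeddings(nn.Module):
    """Toy victim classifier; stands in for, e.g., an SST sentiment model."""
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(VOCAB, DIM)
        self.fc = nn.Linear(DIM, CLASSES)

    def forward(self, ids):
        return self.fc(self.emb(ids).mean(dim=1))

def class_accuracy(model, ids, labels, victim_class):
    """Accuracy restricted to examples whose gold label is `victim_class`."""
    mask = labels == victim_class
    with torch.no_grad():
        preds = model(ids[mask]).argmax(dim=1)
    return (preds == labels[mask]).float().mean().item()

model = BagOfEmbeddings()
ids = torch.randint(0, VOCAB, (256, 10))       # test inputs (token ids)
labels = torch.randint(0, CLASSES, (256,))     # gold labels
trigger = torch.tensor([17, 523, 86])          # a mined universal trigger (placeholder ids)

clean_acc = class_accuracy(model, ids, labels, victim_class=1)
# Prepend the same trigger to every input: the attack is input-agnostic.
triggered = torch.cat([trigger.repeat(ids.size(0), 1), ids], dim=1)
attacked_acc = class_accuracy(model, triggered, labels, victim_class=1)
print(f"victim-class accuracy: clean={clean_acc:.3f}, with trigger={attacked_acc:.3f}")
```

The reported numbers in the entries above (e.g., 93.6% to 9.6% on SST) are exactly this kind of per-class accuracy measured before and after the trigger is attached.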
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences of its use.