ADDMU: Detection of Far-Boundary Adversarial Examples with Data and
Model Uncertainty Estimation
- URL: http://arxiv.org/abs/2210.12396v1
- Date: Sat, 22 Oct 2022 09:11:12 GMT
- Title: ADDMU: Detection of Far-Boundary Adversarial Examples with Data and
Model Uncertainty Estimation
- Authors: Fan Yin, Yao Li, Cho-Jui Hsieh, Kai-Wei Chang
- Abstract summary: Adversarial Examples Detection (AED) is a crucial defense technique against adversarial attacks.
We propose a new technique, ADDMU, which combines two types of uncertainty estimation for both regular and FB adversarial example detection.
Our new method outperforms previous methods by 3.6 and 6.0 AUC points under each scenario.
- Score: 125.52743832477404
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Adversarial Examples Detection (AED) is a crucial defense technique against
adversarial attacks and has drawn increasing attention from the Natural
Language Processing (NLP) community. Despite the surge of new AED methods, our
studies show that existing methods heavily rely on a shortcut to achieve good
performance. In other words, current search-based adversarial attacks in NLP
stop once model predictions change, and thus most adversarial examples
generated by those attacks are located near model decision boundaries. To
surpass this shortcut and fairly evaluate AED methods, we propose to test AED
methods with \textbf{F}ar \textbf{B}oundary (\textbf{FB}) adversarial examples.
Existing methods show worse than random guess performance under this scenario.
To overcome this limitation, we propose a new technique, \textbf{ADDMU},
\textbf{a}dversary \textbf{d}etection with \textbf{d}ata and \textbf{m}odel
\textbf{u}ncertainty, which combines two types of uncertainty estimation for
both regular and FB adversarial example detection. Our new method outperforms
previous methods by 3.6 and 6.0 \emph{AUC} points under each scenario. Finally,
our analysis shows that the two types of uncertainty provided by \textbf{ADDMU}
can be leveraged to characterize adversarial examples and identify the ones
that contribute most to the model's robustness in adversarial training.
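The abstract describes the recipe only at a high level, so the following is a minimal, hypothetical Python sketch of combining the two uncertainty signals for detection. It assumes a HuggingFace-style sequence classifier (`classifier(input_ids).logits`), uses Monte-Carlo dropout for model uncertainty and random token masking for data uncertainty, and sums the two scores; ADDMU's actual estimators and combination rule are specified in the paper, not here.

```python
# Hypothetical sketch: combine model uncertainty (Monte-Carlo dropout) and data
# uncertainty (random token masking) into one detection score. The estimators,
# the masking rate, and the simple sum are illustrative assumptions, not
# ADDMU's exact procedure.
import torch


def model_uncertainty(classifier, input_ids, n_samples=20):
    """Variance of the predicted-class probability over stochastic forward passes."""
    classifier.train()  # keep dropout active at inference time
    with torch.no_grad():
        probs = torch.stack([
            torch.softmax(classifier(input_ids).logits, dim=-1)
            for _ in range(n_samples)
        ])  # (n_samples, batch, n_classes)
    pred = probs.mean(dim=0).argmax(dim=-1)
    return probs[:, torch.arange(probs.size(1)), pred].var(dim=0)


def data_uncertainty(classifier, input_ids, mask_token_id, n_samples=20, mask_rate=0.15):
    """Variance of the predicted-class probability when random tokens are masked."""
    classifier.eval()
    samples = []
    with torch.no_grad():
        for _ in range(n_samples):
            noisy = input_ids.clone()
            # Special tokens are not excluded in this sketch.
            mask = torch.rand_like(noisy, dtype=torch.float) < mask_rate
            noisy[mask] = mask_token_id
            samples.append(torch.softmax(classifier(noisy).logits, dim=-1))
    probs = torch.stack(samples)
    pred = probs.mean(dim=0).argmax(dim=-1)
    return probs[:, torch.arange(probs.size(1)), pred].var(dim=0)


def adversarial_score(classifier, input_ids, mask_token_id):
    """Higher combined uncertainty = more likely adversarial (illustrative sum)."""
    return (model_uncertainty(classifier, input_ids)
            + data_uncertainty(classifier, input_ids, mask_token_id))
```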
Related papers
- AdvQDet: Detecting Query-Based Adversarial Attacks with Adversarial Contrastive Prompt Tuning [93.77763753231338]
Adversarial Contrastive Prompt Tuning (ACPT) is proposed to fine-tune the CLIP image encoder to extract similar embeddings for any two intermediate adversarial queries.
We show that ACPT can detect 7 state-of-the-art query-based attacks with $>99%$ detection rate within 5 shots.
We also show that ACPT is robust to 3 types of adaptive attacks.
arXiv Detail & Related papers (2024-08-04T09:53:50Z)
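ACPT's contrastive prompt tuning of the CLIP encoder is not reproduced here; the sketch below only illustrates the underlying stateful-detection idea behind query-based attack detection: intermediate queries of an iterative attack look like near-duplicates in a well-tuned embedding space. The encoder, the window of recent embeddings, and the 0.95 threshold are assumptions.

```python
# Hypothetical sketch of similarity-based detection of query-based attacks.
# The embedding model and the threshold are assumptions.
import torch
import torch.nn.functional as F


def is_attack_query(new_embedding, recent_embeddings, threshold=0.95):
    """Flag a query whose embedding is a near-duplicate of a recent query."""
    if not recent_embeddings:
        return False
    sims = F.cosine_similarity(
        new_embedding.unsqueeze(0),              # (1, d)
        torch.stack(recent_embeddings), dim=-1,  # (k, d) -> (k,) similarities
    )
    return bool((sims > threshold).any())
```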
- Proximal Causal Inference With Text Data [5.796482272333648]
We propose a new causal inference method that uses two instances of pre-treatment text data, infers two proxies using two zero-shot models on the separate instances, and applies these proxies in the proximal g-formula.
We evaluate our method in synthetic and semi-synthetic settings with real-world clinical notes from MIMIC-III and open large language models for zero-shot prediction.
arXiv Detail & Related papers (2024-01-12T16:51:02Z)
- Token-Level Adversarial Prompt Detection Based on Perplexity Measures and Contextual Information [67.78183175605761]
Large Language Models are susceptible to adversarial prompt attacks.
This vulnerability underscores a significant concern regarding the robustness and reliability of LLMs.
We introduce a novel approach to detecting adversarial prompts at a token level.
arXiv Detail & Related papers (2023-11-20T03:17:21Z)
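As a rough illustration of the perplexity side of such detectors (the paper's contextual-information component and its calibration are not reproduced), the sketch below scores each token of a prompt by its negative log-likelihood under GPT-2 and flags anomalously unlikely tokens; the model choice and threshold are assumptions.

```python
# Hypothetical sketch: per-token negative log-likelihood under GPT-2 as a crude
# adversarial-prompt signal. This is not the detector proposed in the paper.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()


def token_nll(text):
    """Return (token, negative log-likelihood) pairs for every token after the first."""
    input_ids = tokenizer(text, return_tensors="pt")["input_ids"]
    with torch.no_grad():
        logits = model(input_ids).logits                   # (1, T, vocab)
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)  # predict token t from tokens < t
    nll = -log_probs.gather(2, input_ids[:, 1:].unsqueeze(-1)).squeeze(-1)
    tokens = tokenizer.convert_ids_to_tokens(input_ids[0, 1:].tolist())
    return list(zip(tokens, nll[0].tolist()))


def flag_suspicious_tokens(text, threshold=10.0):
    """Tokens whose NLL exceeds an (assumed) threshold are reported as suspicious."""
    return [(tok, score) for tok, score in token_nll(text) if score > threshold]
```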
- MEAD: A Multi-Armed Approach for Evaluation of Adversarial Examples Detectors [24.296350262025552]
We propose a novel framework, called MEAD, for evaluating detectors based on several attack strategies.
Among them, we make use of three new objectives to generate attacks.
The proposed performance metric is based on the worst-case scenario.
arXiv Detail & Related papers (2022-06-30T17:05:45Z)
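MEAD's exact objectives and performance metric are given in the paper; the sketch below only illustrates a worst-case style aggregation for evaluating a detector across several attacks. The min-over-attacks rule, the score convention (higher = more adversarial), and the use of AUC are assumptions for illustration.

```python
# Hypothetical sketch of worst-case evaluation of an adversarial-example
# detector across several attack strategies.
import numpy as np
from sklearn.metrics import roc_auc_score


def worst_case_auc(clean_scores, adv_scores_per_attack):
    """clean_scores: (n_clean,) detector scores on clean inputs.
    adv_scores_per_attack: (n_attacks, n_adv) scores for adversarial versions
    of the same inputs, one row per attack strategy."""
    clean_scores = np.asarray(clean_scores, dtype=float)
    adv = np.asarray(adv_scores_per_attack, dtype=float)
    worst_adv = adv.min(axis=0)  # the attack the detector finds hardest to flag
    y_true = np.concatenate([np.zeros(len(clean_scores)), np.ones(len(worst_adv))])
    y_score = np.concatenate([clean_scores, worst_adv])
    return roc_auc_score(y_true, y_score)
```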
- Latent Boundary-guided Adversarial Training [61.43040235982727]
Adversarial training has been proven to be the most effective strategy that injects adversarial examples into model training.
We propose a novel adversarial training framework called LAtent bounDary-guided aDvErsarial tRaining.
arXiv Detail & Related papers (2022-06-08T07:40:55Z)
- A Unified Wasserstein Distributional Robustness Framework for Adversarial Training [24.411703133156394]
This paper presents a unified framework that connects Wasserstein distributional robustness with current state-of-the-art AT methods.
We introduce a new Wasserstein cost function and a new series of risk functions, with which we show that standard AT methods are special cases of their counterparts in our framework.
This connection leads to an intuitive relaxation and generalization of existing AT methods and facilitates the development of a new family of distributional robustness AT-based algorithms.
arXiv Detail & Related papers (2022-02-27T19:40:29Z)
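For readers unfamiliar with the setup, the generic Wasserstein distributional robustness objective that such frameworks build on is sketched below; the paper's specific cost function and new series of risk functions are not reproduced here.

```latex
% Background only: the generic Wasserstein DRO objective (not the paper's exact
% cost or risk functions). P is the data distribution, W_c the Wasserstein
% distance under transport cost c, and \epsilon the radius of the ambiguity set.
\[
  \min_{\theta} \; \sup_{Q \,:\, W_c(Q, P) \le \epsilon} \;
  \mathbb{E}_{(x, y) \sim Q}\!\left[ \ell\big(f_\theta(x), y\big) \right]
\]
% Choosing a cost c that charges \|x - x'\| for moving inputs and forbids label
% changes recovers norm-bounded adversarial training as a special case.
```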
- ADC: Adversarial attacks against object Detection that evade Context consistency checks [55.8459119462263]
We show that even context consistency checks can be brittle to properly crafted adversarial examples.
We propose an adaptive framework to generate examples that subvert such defenses.
Our results suggest that how to robustly model context and check its consistency is still an open problem.
arXiv Detail & Related papers (2021-10-24T00:25:09Z)
- TREATED: Towards Universal Defense against Textual Adversarial Attacks [28.454310179377302]
We propose TREATED, a universal adversarial detection method that can defend against attacks of various perturbation levels without making any assumptions.
Extensive experiments on three competitive neural networks and two widely used datasets show that our method achieves better detection performance than baselines.
arXiv Detail & Related papers (2021-09-13T03:31:20Z)
- Improving White-box Robustness of Pre-processing Defenses via Joint Adversarial Training [106.34722726264522]
A range of adversarial defense techniques have been proposed to mitigate the interference of adversarial noise.
Pre-processing methods may suffer from the robustness degradation effect.
A potential cause of this negative effect is that the adversarial training examples are static and independent of the pre-processing model.
We propose a method called Joint Adversarial Training based Pre-processing (JATP) defense.
arXiv Detail & Related papers (2021-06-10T01:45:32Z)
- Towards Variable-Length Textual Adversarial Attacks [68.27995111870712]
It is non-trivial to conduct textual adversarial attacks on natural language processing tasks due to the discreteness of data.
In this paper, we propose variable-length textual adversarial attacks (VL-Attack).
Our method can achieve $33.18$ BLEU score on IWSLT14 German-English translation, achieving an improvement of $1.47$ over the baseline model.
arXiv Detail & Related papers (2021-04-16T14:37:27Z)
- Adversarial Examples Detection beyond Image Space [88.7651422751216]
We find that there is a consistent relationship between perturbations and prediction confidence, which guides us to detect few-perturbation attacks from the perspective of prediction confidence.
We propose a method that goes beyond image space via a two-stream architecture, in which the image stream focuses on pixel artifacts and the gradient stream copes with confidence artifacts.
arXiv Detail & Related papers (2021-02-23T09:55:03Z)
- Random Projections for Adversarial Attack Detection [8.684378639046644]
Adversarial attack detection remains a fundamentally challenging problem from two perspectives.
We present a technique that makes use of special properties of random projections, whereby we can characterize the behavior of clean and adversarial examples.
Performance evaluation demonstrates that our technique outperforms competing state-of-the-art (SOTA) detection approaches, achieving $>0.92$ AUC.
arXiv Detail & Related papers (2020-12-11T15:02:28Z)
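The paper's detector itself is not reproduced here; the sketch below only illustrates the basic ingredient: projecting feature vectors through a handful of random Gaussian matrices and recording simple statistics that can then be compared between clean and adversarial examples. The projected-norm statistic, the number of projections, and the target dimension are assumptions.

```python
# Hypothetical sketch: random Gaussian projections of feature vectors as
# detection features for a downstream clean-vs-adversarial classifier.
import numpy as np


def projection_features(features, n_projections=10, k=32, seed=0):
    """features: (n, d) array -> (n, n_projections) matrix of projected norms."""
    rng = np.random.default_rng(seed)
    n, d = features.shape
    stats = []
    for _ in range(n_projections):
        R = rng.standard_normal((d, k)) / np.sqrt(k)  # Johnson-Lindenstrauss style scaling
        stats.append(np.linalg.norm(features @ R, axis=1))
    return np.stack(stats, axis=1)  # feed these statistics to any detector
```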
- Rethinking Empirical Evaluation of Adversarial Robustness Using First-Order Attack Methods [6.531546527140473]
We identify three common cases that lead to overestimation of adversarial accuracy against bounded first-order attack methods.
We propose compensation methods that address sources of inaccurate gradient computation.
Overall, our work shows that overestimated adversarial accuracy that is not indicative of robustness is prevalent even for conventionally trained deep neural networks.
arXiv Detail & Related papers (2020-06-01T22:55:09Z)
- BERT-ATTACK: Adversarial Attack Against BERT Using BERT [77.82947768158132]
Adversarial attacks on discrete data (such as text) are more challenging than those on continuous data (such as images).
We propose BERT-Attack, a high-quality and effective method to generate adversarial samples.
Our method outperforms state-of-the-art attack strategies in both success rate and perturb percentage.
arXiv Detail & Related papers (2020-04-21T13:30:02Z)
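The sketch below illustrates only the masked-language-model substitution at the core of such attacks: mask a word, let BERT propose replacements, and keep one that flips the victim's prediction. The `victim_predict` interface and the `top_k` value are assumptions, and the word-importance ranking and sub-word handling of the full BERT-Attack algorithm are omitted.

```python
# Hypothetical sketch of masked-LM word substitution, the core move of
# BERT-Attack. `victim_predict` is an assumed black-box returning a label
# for an input string.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")
MASK = fill_mask.tokenizer.mask_token  # "[MASK]" for BERT


def substitute_word(text, word, victim_predict, top_k=10):
    """Try masked-LM replacements for `word`; return a text that flips the label."""
    original_label = victim_predict(text)
    masked = text.replace(word, MASK, 1)
    for candidate in fill_mask(masked, top_k=top_k):
        new_text = masked.replace(MASK, candidate["token_str"].strip(), 1)
        if victim_predict(new_text) != original_label:
            return new_text  # successful adversarial substitution
    return None  # no label-flipping substitution found among the candidates
```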
- Adversarial Distributional Training for Robust Deep Learning [53.300984501078126]
Adversarial training (AT) is among the most effective techniques to improve model robustness by augmenting training data with adversarial examples.
Most existing AT methods adopt a specific attack to craft adversarial examples, leading to unreliable robustness against other unseen attacks.
In this paper, we introduce adversarial distributional training (ADT), a novel framework for learning robust models.
arXiv Detail & Related papers (2020-02-14T12:36:59Z)
This list is automatically generated from the titles and abstracts of the papers on this site.