Beating Attackers At Their Own Games: Adversarial Example Detection
Using Adversarial Gradient Directions
- URL: http://arxiv.org/abs/2012.15386v1
- Date: Thu, 31 Dec 2020 01:12:24 GMT
- Title: Beating Attackers At Their Own Games: Adversarial Example Detection
Using Adversarial Gradient Directions
- Authors: Yuhang Wu, Sunpreet S. Arora, Yanhong Wu, Hao Yang
- Abstract summary: The proposed method is based on the observation that the directions of adversarial gradients play a key role in characterizing the adversarial space.
Experiments conducted on two different databases, CIFAR-10 and ImageNet, show that the proposed detection method achieves 97.9% and 98.6% AUC-ROC (on average), respectively, across five different adversarial attacks.
- Score: 16.993439721743478
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Adversarial examples are input examples that are specifically crafted to
deceive machine learning classifiers. State-of-the-art adversarial example
detection methods characterize an input example as adversarial either by
quantifying the magnitude of feature variations under multiple perturbations or
by measuring its distance from the estimated benign example distribution. Instead
of using such metrics, the proposed method is based on the observation that the
directions of adversarial gradients when crafting (new) adversarial examples
play a key role in characterizing the adversarial space. Compared to detection
methods that use multiple perturbations, the proposed method is efficient as it
applies only a single random perturbation to the input example. Experiments
conducted on two different databases, CIFAR-10 and ImageNet, show that the
proposed detection method achieves, respectively, 97.9% and 98.6% AUC-ROC (on
average) on five different adversarial attacks, and outperforms multiple
state-of-the-art detection methods. Results demonstrate the effectiveness of
using adversarial gradient directions for adversarial example detection.
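A minimal sketch of this idea is given below, assuming a differentiable PyTorch classifier `model`; the helper names, the noise scale `sigma`, the cross-entropy adversarial loss, and the use of cosine similarity as the detection feature are illustrative assumptions, not a reproduction of the paper's exact algorithm.

```python
# Hypothetical sketch of detection via adversarial gradient directions.
# Assumes a differentiable PyTorch classifier `model`; names, sigma, and the
# final threshold are illustrative choices, not the paper's exact method.
import torch
import torch.nn.functional as F

def adversarial_gradient(model, x, y):
    """Gradient of the classification loss w.r.t. the input (FGSM-style direction)."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    grad, = torch.autograd.grad(loss, x)
    return grad.flatten(start_dim=1)

def gradient_direction_score(model, x, sigma=0.02):
    """Cosine similarity between adversarial gradient directions computed at the
    original input and at a single randomly perturbed copy."""
    model.eval()
    with torch.no_grad():
        y = model(x).argmax(dim=1)                  # predicted label of the input
    x_noisy = x + sigma * torch.randn_like(x)       # one random perturbation
    g_orig = adversarial_gradient(model, x, y)
    g_noisy = adversarial_gradient(model, x_noisy, y)
    return F.cosine_similarity(g_orig, g_noisy, dim=1)

# Usage: low direction consistency can be taken as evidence of an adversarial input,
# with the threshold calibrated on benign data.
# is_adversarial = gradient_direction_score(model, x_batch) < threshold
```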
Related papers
- Indiscriminate Disruption of Conditional Inference on Multivariate Gaussians [60.22542847840578]
Despite advances in adversarial machine learning, inference for Gaussian models in the presence of an adversary is notably understudied.
We consider a self-interested attacker who wishes to disrupt a decision-maker's conditional inference and subsequent actions by corrupting a set of evidentiary variables.
To avoid detection, the attacker also desires the attack to appear plausible, where plausibility is determined by the density of the corrupted evidence.
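For context, the quantity being attacked is ordinary conditional inference on a multivariate Gaussian. The sketch below shows that computation and how a corrupted evidence vector shifts the conditional mean, with plausibility measured by the density of the evidence; the corruption shown here is purely illustrative, whereas the paper casts the attacker's choice as an optimization problem.

```python
# Minimal numpy sketch of the quantity under attack: conditional inference on a
# multivariate Gaussian. The corruption rule (a fixed shift of the evidence) is
# purely illustrative, not the paper's attack.
import numpy as np
from scipy.stats import multivariate_normal

def conditional_gaussian(mu, Sigma, idx_z, z):
    """Mean and covariance of x_u | x_z = z for a joint Gaussian N(mu, Sigma)."""
    idx_u = [i for i in range(len(mu)) if i not in idx_z]
    mu_u, mu_z = mu[idx_u], mu[idx_z]
    S_uu = Sigma[np.ix_(idx_u, idx_u)]
    S_uz = Sigma[np.ix_(idx_u, idx_z)]
    S_zz = Sigma[np.ix_(idx_z, idx_z)]
    K = S_uz @ np.linalg.inv(S_zz)
    return mu_u + K @ (z - mu_z), S_uu - K @ S_uz.T

mu = np.zeros(3)
Sigma = np.array([[1.0, 0.6, 0.3],
                  [0.6, 1.0, 0.5],
                  [0.3, 0.5, 1.0]])
z_clean = np.array([0.2, -0.1])              # observed evidence (indices 1 and 2)
z_corrupt = z_clean + np.array([1.0, 0.5])   # attacker's corruption (illustrative)

m_clean, _ = conditional_gaussian(mu, Sigma, [1, 2], z_clean)
m_corrupt, _ = conditional_gaussian(mu, Sigma, [1, 2], z_corrupt)
plausibility = multivariate_normal(mu[[1, 2]], Sigma[np.ix_([1, 2], [1, 2])]).pdf(z_corrupt)
print(m_clean, m_corrupt, plausibility)      # shifted inference vs. evidence density
```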
arXiv Detail & Related papers (2024-11-21T17:46:55Z) - On the Universal Adversarial Perturbations for Efficient Data-free
Adversarial Detection [55.73320979733527]
We propose a data-agnostic adversarial detection framework that uses universal adversarial perturbations (UAPs) to induce different responses from normal and adversarial samples.
Experimental results show that our method achieves competitive detection performance on various text classification tasks.
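A rough, hedged sketch of the response-shift idea follows, assuming a fixed universal adversarial perturbation `uap` in a continuous input space; in the paper's text-classification setting the perturbation acts at the embedding or token level, and the detector calibration is not reproduced here.

```python
# Hypothetical sketch of UAP-based detection: normal and adversarial inputs tend to
# respond differently when a fixed universal adversarial perturbation is added.
# This illustrative version perturbs a continuous input representation directly.
import torch
import torch.nn.functional as F

def uap_response_shift(model, x, uap):
    """KL divergence between predictions with and without the UAP applied."""
    model.eval()
    with torch.no_grad():
        p = F.softmax(model(x), dim=1)
        q = F.softmax(model(x + uap), dim=1)
    return F.kl_div(q.log(), p, reduction="none").sum(dim=1)

# Usage: flag inputs whose response shift is atypical relative to a benign
# calibration set; the direction of the decision depends on how the detector
# is calibrated.
```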
arXiv Detail & Related papers (2023-06-27T02:54:07Z) - Towards Black-box Adversarial Example Detection: A Data
Reconstruction-based Method [9.857570123016213]
Black-box attacks are a more realistic threat and have led to various black-box adversarial-training-based defense methods.
To tackle the black-box adversarial example detection (BAD) problem, we propose a data reconstruction-based adversarial example detection method.
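A hedged sketch of a reconstruction-based detector follows, assuming a classifier `model` and an `autoencoder` trained on benign data; the score definitions are illustrative, not the paper's exact pipeline.

```python
# Hypothetical sketch of reconstruction-based detection: reconstruct the input with a
# generative model trained on benign data, then flag inputs with large reconstruction
# error or a large prediction shift. Names and scores are illustrative assumptions.
import torch
import torch.nn.functional as F

def reconstruction_scores(model, autoencoder, x):
    model.eval(); autoencoder.eval()
    with torch.no_grad():
        x_rec = autoencoder(x)
        rec_err = (x - x_rec).flatten(start_dim=1).pow(2).mean(dim=1)
        p = F.softmax(model(x), dim=1)
        p_rec = F.softmax(model(x_rec), dim=1)
        pred_shift = (p - p_rec).abs().sum(dim=1)   # total-variation-style shift
    return rec_err, pred_shift

# Usage: thresholds for both scores are calibrated on benign data; inputs exceeding
# either threshold are treated as likely adversarial.
```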
arXiv Detail & Related papers (2023-06-03T06:34:17Z) - Adversarial Examples Detection with Enhanced Image Difference Features
based on Local Histogram Equalization [20.132066800052712]
We propose an adversarial example detection framework based on a high-frequency information enhancement strategy.
This framework can effectively extract and amplify the feature differences between adversarial examples and normal examples.
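The enhancement step can be sketched as follows using OpenCV's CLAHE for local histogram equalization; the downstream detector trained on the resulting difference features is not reproduced, and the CLAHE parameters are illustrative.

```python
# Illustrative sketch of the enhancement idea: local histogram equalization (CLAHE)
# amplifies high-frequency content, so the difference between the enhanced and the
# original image can expose adversarial perturbation artifacts.
import cv2
import numpy as np

def enhanced_difference(gray_u8, clip_limit=2.0, tile=(8, 8)):
    """Difference between a locally histogram-equalized image and the original."""
    clahe = cv2.createCLAHE(clipLimit=clip_limit, tileGridSize=tile)
    enhanced = clahe.apply(gray_u8)          # expects a uint8 grayscale image
    return cv2.absdiff(enhanced, gray_u8)

# Usage: feed enhanced_difference(img) (per channel, for color images) to a small
# classifier trained to separate adversarial from benign examples.
```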
arXiv Detail & Related papers (2023-05-08T03:14:01Z) - AdvCheck: Characterizing Adversarial Examples via Local Gradient
Checking [3.425727850372357]
We introduce the concept of local gradient, and reveal that adversarial examples have a larger bound of local gradient than the benign ones.
Specifically, by calculating the local gradient from a few benign examples and noise-added misclassified examples to train a detector, adversarial examples and even misclassified natural inputs can be precisely distinguished from benign ones.
We have validated AdvCheck's superior performance over state-of-the-art (SOTA) baselines, with detection rates roughly $1.2\times$ those of the baselines on general adversarial attacks and roughly $1.4\times$ on misclassified natural inputs.
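A hedged sketch of the local-gradient idea follows; the feature below (mean input-gradient norm of the predicted-class logit under a few small random probes) and its parameters are assumptions for illustration, not AdvCheck's exact definition.

```python
# Hedged sketch of the "local gradient" idea: adversarial inputs tend to sit where the
# model's output changes faster locally. Feature definition and parameters are
# illustrative, not AdvCheck's exact formulation.
import torch

def local_gradient_feature(model, x, sigma=0.01, n_probes=8):
    model.eval()
    feats = []
    for _ in range(n_probes):
        xp = (x + sigma * torch.randn_like(x)).detach().requires_grad_(True)
        logits = model(xp)
        top = logits.gather(1, logits.argmax(dim=1, keepdim=True)).sum()
        grad, = torch.autograd.grad(top, xp)
        feats.append(grad.flatten(start_dim=1).norm(dim=1))
    return torch.stack(feats, dim=0).mean(dim=0)   # larger => more likely adversarial

# A simple detector (e.g., logistic regression) can then be trained on this feature
# computed for benign examples and for noise-added misclassified examples.
```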
arXiv Detail & Related papers (2023-03-25T17:46:09Z) - TREATED:Towards Universal Defense against Textual Adversarial Attacks [28.454310179377302]
We propose TREATED, a universal adversarial detection method that can defend against attacks of various perturbation levels without making any assumptions.
Extensive experiments on three competitive neural networks and two widely used datasets show that our method achieves better detection performance than baselines.
arXiv Detail & Related papers (2021-09-13T03:31:20Z) - Adversarial Examples Detection beyond Image Space [88.7651422751216]
We find that there is a consistent relationship between perturbations and prediction confidence, which guides us to detect few-perturbation attacks from the perspective of prediction confidence.
We propose a method beyond image space by a two-stream architecture, in which the image stream focuses on the pixel artifacts and the gradient stream copes with the confidence artifacts.
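A skeleton of such a two-stream detector is sketched below; the layer sizes and the choice of gradient signal are assumptions, not the architecture from the paper.

```python
# Illustrative two-stream detector skeleton: an image stream for pixel artifacts and a
# gradient stream for confidence-related artifacts, fused into a binary decision.
import torch
import torch.nn as nn

class TwoStreamDetector(nn.Module):
    def __init__(self, in_ch=3):
        super().__init__()
        def stream():
            return nn.Sequential(
                nn.Conv2d(in_ch, 16, 3, padding=1), nn.ReLU(),
                nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.image_stream = stream()
        self.gradient_stream = stream()
        self.head = nn.Linear(64, 2)      # benign vs. adversarial

    def forward(self, image, input_gradient):
        f = torch.cat([self.image_stream(image),
                       self.gradient_stream(input_gradient)], dim=1)
        return self.head(f)

# `input_gradient` can be the gradient of the classifier's confidence w.r.t. the
# input, computed as in the earlier sketches; the two streams are trained jointly.
```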
arXiv Detail & Related papers (2021-02-23T09:55:03Z) - Learning to Separate Clusters of Adversarial Representations for Robust
Adversarial Detection [50.03939695025513]
We propose a new probabilistic adversarial detector motivated by the recently introduced notion of non-robust features.
In this paper, we consider non-robust features to be a common property of adversarial examples, and we deduce that it is possible to find a cluster in representation space corresponding to this property.
This idea leads us to estimate the probability distribution of adversarial representations in a separate cluster and to leverage this distribution for a likelihood-based adversarial detector.
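A hedged sketch of a likelihood-based detector over representation space follows, using single-component Gaussian mixtures from scikit-learn as stand-ins for the paper's probabilistic model.

```python
# Hedged sketch of a likelihood-based detector on representation space: fit Gaussians
# to benign and adversarial representations and score new inputs by a log-likelihood
# ratio. The paper's specific probabilistic model and training are not reproduced.
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_detector(benign_feats, adv_feats):
    g_b = GaussianMixture(n_components=1, covariance_type="full").fit(benign_feats)
    g_a = GaussianMixture(n_components=1, covariance_type="full").fit(adv_feats)
    return g_b, g_a

def adversarial_score(g_b, g_a, feats):
    # Higher score => representation lies closer to the adversarial cluster.
    return g_a.score_samples(feats) - g_b.score_samples(feats)

# `benign_feats` / `adv_feats` are penultimate-layer features (numpy arrays of shape
# [n_samples, d]) extracted from the classifier under inspection.
```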
arXiv Detail & Related papers (2020-12-07T07:21:18Z) - A Hamiltonian Monte Carlo Method for Probabilistic Adversarial Attack
and Learning [122.49765136434353]
We present an effective method, called Hamiltonian Monte Carlo with Accumulated Momentum (HMCAM), aiming to generate a sequence of adversarial examples.
We also propose a new generative method called Contrastive Adversarial Training (CAT), which approaches the equilibrium distribution of adversarial examples.
Both quantitative and qualitative analyses on several natural image datasets and practical systems confirm the superiority of the proposed algorithm.
arXiv Detail & Related papers (2020-10-15T16:07:26Z) - Effective and Robust Detection of Adversarial Examples via
Benford-Fourier Coefficients [40.9343499298864]
Adversarial examples are well known to be a serious threat to deep neural networks (DNNs).
In this work, we study the detection of adversarial examples based on the assumption that the output and internal responses of one model, for both adversarial and benign examples, follow the generalized Gaussian distribution (GGD).
We propose to construct discriminative features for adversarial detection via the shape factor, employing the magnitude of Benford-Fourier coefficients (MBF).
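One ingredient can be sketched as follows: a moment-matching estimate of the GGD shape factor from a layer's responses; the MBF feature construction built on top of the shape factor in the paper is not reproduced.

```python
# Illustrative sketch of one ingredient: estimating the generalized Gaussian (GGD)
# shape factor of a layer's responses by moment matching. The paper's MBF features
# derived from this shape factor are not reproduced here.
import numpy as np
from scipy.special import gamma

def ggd_shape_factor(responses, betas=np.linspace(0.1, 5.0, 2000)):
    """Moment-matching estimate of the GGD shape parameter beta."""
    x = np.asarray(responses).ravel()
    rho_hat = np.mean(np.abs(x)) ** 2 / np.mean(x ** 2)      # E|x|^2 / E[x^2]
    rho = gamma(2.0 / betas) ** 2 / (gamma(1.0 / betas) * gamma(3.0 / betas))
    return betas[np.argmin(np.abs(rho - rho_hat))]

# Benign and adversarial inputs tend to yield different shape factors, which the
# paper turns into MBF features and feeds to a per-layer binary classifier.
```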
arXiv Detail & Related papers (2020-05-12T05:20:59Z) - Fundamental Tradeoffs between Invariance and Sensitivity to Adversarial
Perturbations [65.05561023880351]
Adversarial examples are malicious inputs crafted to induce misclassification.
This paper studies a complementary failure mode, invariance-based adversarial examples.
We show that defenses against sensitivity-based attacks actively harm a model's accuracy on invariance-based attacks.
arXiv Detail & Related papers (2020-02-11T18:50:23Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information and is not responsible for any consequences of its use.