Self-Supervised Adversarial Example Detection by Disentangled
Representation
- URL: http://arxiv.org/abs/2105.03689v2
- Date: Wed, 12 May 2021 12:37:42 GMT
- Title: Self-Supervised Adversarial Example Detection by Disentangled
Representation
- Authors: Zhaoxi Zhang, Leo Yu Zhang, Xufei Zheng, Shengshan Hu, Jinyu Tian,
Jiantao Zhou
- Abstract summary: We train an autoencoder, assisted by a discriminator network, on both correctly paired and incorrectly paired class/semantic features to reconstruct benign examples and counterexamples, respectively.
This mimics the behavior of adversarial examples and reduces the unnecessary generalization ability of the autoencoder.
Compared with the state-of-the-art self-supervised detection methods, our method exhibits better performance in various measurements.
- Score: 16.98476232162835
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Deep learning models are known to be vulnerable to adversarial examples that
are elaborately designed for malicious purposes and are imperceptible to the
human perceptual system. The autoencoder, when trained solely on benign
examples, has been widely used for (self-supervised) adversarial detection,
based on the assumption that adversarial examples yield a larger
reconstruction error. However, because adversarial examples are absent from
its training and the autoencoder generalizes too strongly, this assumption
does not always hold in practice. To alleviate this problem, we explore
detecting adversarial examples via disentangled representations of images
under the autoencoder structure. By disentangling input images into class
features and semantic features, we train an autoencoder, assisted by a
discriminator network, on both correctly paired and incorrectly paired
class/semantic features to reconstruct benign examples and counterexamples,
respectively. This mimics the behavior of adversarial examples and reduces
the unnecessary generalization ability of the autoencoder. Compared with the
state-of-the-art self-supervised detection methods, our method exhibits
better performance in various measurements (AUC, FPR, TPR) for most of the
30 attack settings, spanning different datasets (MNIST, Fashion-MNIST and
CIFAR-10), different adversarial attack methods (FGSM, BIM, PGD, DeepFool,
and CW) and different victim models (an 8-layer CNN and a 16-layer VGG).
Ideally, AUC is $1$, and our method achieves $0.99+$ on CIFAR-10 for all
attacks. Notably, unlike other autoencoder-based detectors, our method can
resist the adaptive adversary.
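As a rough illustration of the training scheme described in the abstract, the sketch below pairs class and semantic features correctly and incorrectly within a batch; the module names (`encoder`, `decoder`, `discriminator`), the losses, and the pairing-by-shuffling strategy are assumptions made for exposition, not the authors' released implementation.

```python
# Illustrative PyTorch sketch of disentangled-autoencoder training for
# adversarial detection; module definitions and losses are assumed, not
# taken from the paper's code.
import torch
import torch.nn.functional as F

def train_step(encoder, decoder, discriminator, opt_ae, opt_d, x):
    """One training step on a batch of benign images x."""
    # Disentangle each image into a class feature and a semantic feature.
    class_feat, sem_feat = encoder(x)

    # Correct pairing: reconstruct the benign example.
    x_benign = decoder(class_feat, sem_feat)

    # Incorrect pairing: shuffle class features within the batch so class and
    # semantic features no longer match, mimicking adversarial counterexamples.
    perm = torch.randperm(x.size(0))
    x_counter = decoder(class_feat[perm], sem_feat)

    # Reconstruction loss for the correctly paired branch only.
    rec_loss = F.mse_loss(x_benign, x)

    # Discriminator: real images vs. both kinds of reconstructions.
    fakes = torch.cat([x_benign, x_counter]).detach()
    d_real, d_fake = discriminator(x), discriminator(fakes)
    d_loss = (F.binary_cross_entropy_with_logits(d_real, torch.ones_like(d_real))
              + F.binary_cross_entropy_with_logits(d_fake, torch.zeros_like(d_fake)))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Autoencoder: reconstruct well and fool the discriminator.
    g_fake = discriminator(torch.cat([x_benign, x_counter]))
    g_loss = rec_loss + F.binary_cross_entropy_with_logits(g_fake, torch.ones_like(g_fake))
    opt_ae.zero_grad(); g_loss.backward(); opt_ae.step()
    return rec_loss.item()
```

One plausible reading of the detection step (the abstract does not spell it out) is that the reconstruction error of a test input then serves as the adversarial score, with a threshold calibrated on benign data.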
Related papers
- ZeroPur: Succinct Training-Free Adversarial Purification [52.963392510839284]
Adversarial purification is a defense technique that can defend against various unseen adversarial attacks.
We present ZeroPur, a simple method that purifies adversarial images without any further training.
arXiv Detail & Related papers (2024-06-05T10:58:15Z) - Nowhere to Hide: A Lightweight Unsupervised Detector against Adversarial
Examples [14.332434280103667]
Adversarial examples are generated by adding slight but maliciously crafted perturbations to benign images.
In this paper, we propose an AutoEncoder-based Adversarial Example detector (AEAE).
We show empirically that the AEAE is unsupervised and inexpensive against the strongest state-of-the-art attacks; a minimal reconstruction-error scoring sketch in this spirit appears after this list.
arXiv Detail & Related papers (2022-10-16T16:29:47Z) - On Trace of PGD-Like Adversarial Attacks [77.75152218980605]
Adversarial attacks pose safety and security concerns for deep learning applications.
We construct Adversarial Response Characteristics (ARC) features to reflect the model's gradient consistency.
Our method is intuitive, lightweight, non-intrusive, and data-undemanding (a standard PGD attack sketch is given after this list for reference).
arXiv Detail & Related papers (2022-05-19T14:26:50Z) - Towards A Conceptually Simple Defensive Approach for Few-shot
classifiers Against Adversarial Support Samples [107.38834819682315]
We study a conceptually simple approach to defend few-shot classifiers against adversarial attacks.
We propose a simple attack-agnostic detection method, using the concept of self-similarity and filtering.
Our evaluation on the miniImagenet (MI) and CUB datasets exhibits good attack detection performance.
arXiv Detail & Related papers (2021-10-24T05:46:03Z) - Adaptive Feature Alignment for Adversarial Training [56.17654691470554]
CNNs are typically vulnerable to adversarial attacks, which pose a threat to security-sensitive applications.
We propose adaptive feature alignment (AFA), which generates and is trained to automatically align features of arbitrary attacking strengths.
arXiv Detail & Related papers (2021-05-31T17:01:05Z) - ExAD: An Ensemble Approach for Explanation-based Adversarial Detection [17.455233006559734]
We propose ExAD, a framework to detect adversarial examples using an ensemble of explanation techniques.
We evaluate our approach using six state-of-the-art adversarial attacks on three image datasets.
arXiv Detail & Related papers (2021-03-22T00:53:07Z) - DAFAR: Detecting Adversaries by Feedback-Autoencoder Reconstruction [7.867922462470315]
DAFAR allows deep learning models to detect adversarial examples with high accuracy and universality.
It transforms an imperceptible-perturbation attack on the target network directly into an obvious reconstruction-error attack on the feedback autoencoder.
Experiments show that DAFAR is effective against popular and arguably the most advanced attacks without losing performance on legitimate samples.
arXiv Detail & Related papers (2021-03-11T06:18:50Z) - A Hamiltonian Monte Carlo Method for Probabilistic Adversarial Attack
and Learning [122.49765136434353]
We present an effective method, called Hamiltonian Monte Carlo with Accumulated Momentum (HMCAM), aiming to generate a sequence of adversarial examples.
We also propose a new generative method called Contrastive Adversarial Training (CAT), which approaches the equilibrium distribution of adversarial examples.
Both quantitative and qualitative analysis on several natural image datasets and practical systems have confirmed the superiority of the proposed algorithm.
arXiv Detail & Related papers (2020-10-15T16:07:26Z) - A Self-supervised Approach for Adversarial Robustness [105.88250594033053]
Adversarial examples can cause catastrophic mistakes in Deep Neural Network (DNN) based vision systems.
This paper proposes a self-supervised adversarial training mechanism in the input space.
It provides significant robustness against unseen adversarial attacks.
arXiv Detail & Related papers (2020-06-08T20:42:39Z) - Adversarial Detection and Correction by Matching Prediction
Distributions [0.0]
The detector almost completely neutralises powerful attacks like Carlini-Wagner or SLIDE on MNIST and Fashion-MNIST.
We show that our method is still able to detect the adversarial examples in the case of a white-box attack where the attacker has full knowledge of both the model and the defence (a simplified prediction-mismatch scoring sketch appears after this list).
arXiv Detail & Related papers (2020-02-21T15:45:42Z) - Defending Adversarial Attacks via Semantic Feature Manipulation [23.48763375455514]
We propose a one-off and attack-agnostic Feature Manipulation (FM)-Defense to detect and purify adversarial examples.
To enable manipulation of features, a combo-variational autoencoder is applied to learn disentangled latent codes that reveal semantic features.
Experiments show FM-Defense can detect nearly $100\%$ of adversarial examples produced by different state-of-the-art adversarial attacks.
arXiv Detail & Related papers (2020-02-03T23:24:32Z)
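For reference, the reconstruction-error score that autoencoder-based detectors such as the AEAE entry rely on (and that the main abstract's baseline assumption refers to) can be written in a few lines; this is a generic sketch, not any of the cited papers' exact implementation.

```python
import torch

@torch.no_grad()
def reconstruction_score(autoencoder, x):
    """Per-example reconstruction error used as an adversarial score:
    a larger error is taken as evidence that the input is adversarial."""
    x_hat = autoencoder(x)
    return ((x_hat - x) ** 2).flatten(start_dim=1).mean(dim=1)

def detect(autoencoder, x, threshold):
    """Flag inputs whose score exceeds a threshold calibrated on benign
    data (e.g., a high percentile of benign reconstruction errors)."""
    return reconstruction_score(autoencoder, x) > threshold
```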
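Several entries above (and the main paper's evaluation) refer to PGD-like attacks; below is a textbook L-infinity PGD sketch for context, with illustrative hyperparameters rather than the settings used in any of the cited papers.

```python
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, eps=8/255, alpha=2/255, steps=10):
    """Textbook L-infinity PGD: repeatedly ascend the classification loss,
    then project back into the eps-ball around the original input."""
    # Random start inside the eps-ball (a common choice for PGD, unlike BIM).
    x_adv = (x + torch.empty_like(x).uniform_(-eps, eps)).clamp(0, 1)
    for _ in range(steps):
        x_adv = x_adv.detach().requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        with torch.no_grad():
            x_adv = x_adv + alpha * grad.sign()       # gradient-sign ascent step
            x_adv = x + (x_adv - x).clamp(-eps, eps)  # project onto the eps-ball
            x_adv = x_adv.clamp(0, 1)                 # keep a valid image range
    return x_adv.detach()
```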
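The "Adversarial Detection and Correction by Matching Prediction Distributions" entry scores inputs by how much the classifier's prediction changes after reconstruction; the sketch below is a simplified reading of that idea, and the exact divergence and training objective are assumptions.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def prediction_mismatch_score(classifier, autoencoder, x):
    """KL divergence between predictions on x and on its reconstruction.
    Benign inputs tend to keep similar predictions after reconstruction;
    adversarial inputs tend to produce a larger mismatch."""
    p_orig = F.softmax(classifier(x), dim=1)
    log_p_rec = F.log_softmax(classifier(autoencoder(x)), dim=1)
    # Per-example KL(p_orig || p_rec).
    return F.kl_div(log_p_rec, p_orig, reduction="none").sum(dim=1)
```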