Understanding the Logit Distributions of Adversarially-Trained Deep
Neural Networks
- URL: http://arxiv.org/abs/2108.12001v1
- Date: Thu, 26 Aug 2021 19:09:15 GMT
- Title: Understanding the Logit Distributions of Adversarially-Trained Deep
Neural Networks
- Authors: Landan Seguin, Anthony Ndirango, Neeli Mishra, SueYeon Chung, Tyler
Lee
- Abstract summary: Adversarial defenses train deep neural networks to be invariant to the input perturbations from adversarial attacks.
Although adversarial training is successful at mitigating adversarial attacks, the behavioral differences between adversarially-trained (AT) models and standard models are still poorly understood.
We identify three logit characteristics essential to learning adversarial robustness.
- Score: 6.439477789066243
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Adversarial defenses train deep neural networks to be invariant to the input
perturbations from adversarial attacks. Almost all defense strategies achieve
this invariance through adversarial training, i.e., training on inputs with
adversarial perturbations. Although adversarial training is successful at
mitigating adversarial attacks, the behavioral differences between
adversarially-trained (AT) models and standard models are still poorly
understood. Motivated by a recent study on learning robustness without input
perturbations by distilling an AT model, we explore what is learned during
adversarial training by analyzing the distribution of logits in AT models. We
identify three logit characteristics essential to learning adversarial
robustness. First, we provide a theoretical justification for the finding that
adversarial training shrinks two important characteristics of the logit
distribution: the max logit values and the "logit gaps" (difference between the
logit max and next largest values) are on average lower for AT models. Second,
we show that AT and standard models differ significantly on which samples are
high or low confidence, then illustrate clear qualitative differences by
visualizing samples with the largest confidence difference. Finally, by
manipulating the non-max logit information during distillation and measuring
the impact on the student's robustness, we find that learning information about
incorrect classes is essential to learning robustness. Our results indicate that
learning some adversarial robustness without input perturbations requires a
model to learn specific sample-wise confidences and incorrect class orderings
that follow complex distributions.
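For concreteness, the sketch below (not code from the paper) shows how the logit statistics discussed in the abstract can be computed from a batch of model outputs: the per-sample max logit, the logit gap (max minus second-largest logit), and the softmax confidence of the predicted class. The tensors and values are placeholders standing in for real model outputs.

```python
import torch

def logit_statistics(logits: torch.Tensor):
    """Per-sample max logit, logit gap, and softmax confidence.

    `logits` is a hypothetical (num_samples, num_classes) tensor of
    pre-softmax model outputs.
    """
    top2, _ = logits.topk(2, dim=1)        # two largest logits per sample
    max_logit = top2[:, 0]
    logit_gap = top2[:, 0] - top2[:, 1]    # max logit minus runner-up
    confidence = logits.softmax(dim=1).max(dim=1).values
    return max_logit, logit_gap, confidence

# Toy comparison; the random tensors stand in for real model outputs.
torch.manual_seed(0)
standard_logits = torch.randn(8, 10)
at_logits = 0.5 * torch.randn(8, 10)       # AT logits tend to be "flatter"
for name, z in [("standard", standard_logits), ("AT", at_logits)]:
    m, g, c = logit_statistics(z)
    print(f"{name}: mean max logit {m.mean().item():.2f}, "
          f"mean logit gap {g.mean().item():.2f}, "
          f"mean confidence {c.mean().item():.2f}")
```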
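The distillation experiments rely on manipulating the teacher's non-max logits; the paper's exact manipulations are not reproduced here. The sketch below shows one hypothetical variant, assuming a standard temperature-scaled KL distillation setup, in which each teacher output's non-max logits are randomly permuted so that incorrect-class orderings are destroyed while the max logit and overall logit magnitudes are preserved.

```python
import torch
import torch.nn.functional as F

def distill_loss_shuffled_nonmax(student_logits: torch.Tensor,
                                 teacher_logits: torch.Tensor,
                                 temperature: float = 4.0) -> torch.Tensor:
    """Temperature-scaled KL distillation loss in which each teacher output's
    non-max logits are randomly permuted (a hypothetical manipulation that
    removes incorrect-class orderings but keeps the max logit)."""
    src = teacher_logits.detach()      # never backpropagate through the teacher
    teacher = src.clone()
    top1 = src.argmax(dim=1)
    num_classes = src.size(1)
    for i in range(src.size(0)):
        idx = torch.arange(num_classes)
        nonmax = idx[idx != top1[i]]                   # non-max class indices
        perm = nonmax[torch.randperm(nonmax.numel())]  # random reordering
        teacher[i, nonmax] = src[i, perm]
    # Standard soft-label distillation against the manipulated teacher.
    t_log_probs = F.log_softmax(teacher / temperature, dim=1)
    s_log_probs = F.log_softmax(student_logits / temperature, dim=1)
    return F.kl_div(s_log_probs, t_log_probs, log_target=True,
                    reduction="batchmean") * temperature ** 2
```

Comparing students distilled from the intact versus the permuted teacher logits would isolate the contribution of incorrect-class information to the transferred robustness.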
Related papers
- Adversarial Training Can Provably Improve Robustness: Theoretical Analysis of Feature Learning Process Under Structured Data [38.44734564565478]
We provide a theoretical understanding of adversarial examples and adversarial training algorithms from the perspective of feature learning theory.
We show that the adversarial training method can provably strengthen the robust feature learning and suppress the non-robust feature learning.
arXiv Detail & Related papers (2024-10-11T03:59:49Z)
- MOREL: Enhancing Adversarial Robustness through Multi-Objective Representation Learning [1.534667887016089]
Deep neural networks (DNNs) are vulnerable to slight adversarial perturbations.
We show that strong feature representation learning during training can significantly enhance the original model's robustness.
We propose MOREL, a multi-objective feature representation learning approach that encourages classification models to produce similar features for inputs within the same class despite perturbations.
arXiv Detail & Related papers (2024-10-02T16:05:03Z) - Perturbation-Invariant Adversarial Training for Neural Ranking Models:
Improving the Effectiveness-Robustness Trade-Off [107.35833747750446]
adversarial examples can be crafted by adding imperceptible perturbations to legitimate documents.
This vulnerability raises significant concerns about the reliability of neural ranking models (NRMs) and hinders their widespread deployment.
In this study, we establish theoretical guarantees regarding the effectiveness-robustness trade-off in NRMs.
arXiv Detail & Related papers (2023-12-16T05:38:39Z)
- Bayesian Learning with Information Gain Provably Bounds Risk for a Robust Adversarial Defense [27.545466364906773]
We present a new algorithm to learn a deep neural network model robust against adversarial attacks.
Our model demonstrates significantly improved robustness, by up to 20%, compared with adversarial training and Adv-BNN under PGD attacks.
arXiv Detail & Related papers (2022-12-05T03:26:08Z)
- Addressing Mistake Severity in Neural Networks with Semantic Knowledge [0.0]
Most robust training techniques aim to improve model accuracy on perturbed inputs.
As an alternate form of robustness, we aim to reduce the severity of mistakes made by neural networks in challenging conditions.
We leverage current adversarial training methods to generate targeted adversarial attacks during the training process.
Results demonstrate that our approach performs better with respect to mistake severity compared to standard and adversarially trained models.
arXiv Detail & Related papers (2022-11-21T22:01:36Z)
- Explicit Tradeoffs between Adversarial and Natural Distributional Robustness [48.44639585732391]
In practice, models need to enjoy both types of robustness to ensure reliability.
In this work, we show that in fact, explicit tradeoffs exist between adversarial and natural distributional robustness.
arXiv Detail & Related papers (2022-09-15T19:58:01Z)
- Robust Transferable Feature Extractors: Learning to Defend Pre-Trained Networks Against White Box Adversaries [69.53730499849023]
We show that adversarial examples can be successfully transferred to another independently trained model to induce prediction errors.
We propose a deep learning-based pre-processing mechanism, which we refer to as a robust transferable feature extractor (RTFE).
arXiv Detail & Related papers (2022-09-14T21:09:34Z)
- Robust Pre-Training by Adversarial Contrastive Learning [120.33706897927391]
Recent work has shown that, when integrated with adversarial training, self-supervised pre-training can lead to state-of-the-art robustness.
We improve robustness-aware self-supervised pre-training by learning representations consistent under both data augmentations and adversarial perturbations.
arXiv Detail & Related papers (2020-10-26T04:44:43Z)
- Adversarial Self-Supervised Contrastive Learning [62.17538130778111]
Existing adversarial learning approaches mostly use class labels to generate adversarial samples that lead to incorrect predictions.
We propose a novel adversarial attack for unlabeled data, which makes the model confuse the instance-level identities of the perturbed data samples.
We present a self-supervised contrastive learning framework to adversarially train a robust neural network without labeled data.
arXiv Detail & Related papers (2020-06-13T08:24:33Z)
- Adversarial Distributional Training for Robust Deep Learning [53.300984501078126]
Adversarial training (AT) is among the most effective techniques to improve model robustness by augmenting training data with adversarial examples.
Most existing AT methods adopt a specific attack to craft adversarial examples, leading to unreliable robustness against other unseen attacks.
In this paper, we introduce adversarial distributional training (ADT), a novel framework for learning robust models.
arXiv Detail & Related papers (2020-02-14T12:36:59Z)