Better Representations via Adversarial Training in Pre-Training: A
Theoretical Perspective
- URL: http://arxiv.org/abs/2401.15248v1
- Date: Fri, 26 Jan 2024 23:52:20 GMT
- Title: Better Representations via Adversarial Training in Pre-Training: A
Theoretical Perspective
- Authors: Yue Xing, Xiaofeng Lin, Qifan Song, Yi Xu, Belinda Zeng, Guang Cheng
- Abstract summary: We show that feature purification plays an important role in connecting the adversarial robustness of the pre-trained model and the downstream tasks.
With purified nodes, it turns out that clean training is enough to achieve adversarial robustness in downstream tasks.
- Score: 30.871769067624836
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Pre-training is known to generate universal representations for downstream
tasks in large-scale deep learning such as large language models. Existing
literature, e.g., \cite{kim2020adversarial}, empirically observe that the
downstream tasks can inherit the adversarial robustness of the pre-trained
model. We provide theoretical justifications for this robustness inheritance
phenomenon. Our theoretical results reveal that feature purification plays an
important role in connecting the adversarial robustness of the pre-trained
model and the downstream tasks in two-layer neural networks. Specifically, we
show that (i) with adversarial training, each hidden node tends to pick only
one (or a few) feature; (ii) without adversarial training, the hidden nodes can
be vulnerable to attacks. This observation is valid for both supervised
pre-training and contrastive learning. With purified nodes, it turns out that
clean training is enough to achieve adversarial robustness in downstream tasks.
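To make the pipeline described above concrete, here is a minimal sketch (not the authors' code): a two-layer ReLU network is adversarially pre-trained with a single-step FGSM-style attack, and the downstream task then clean-trains only a linear head on the frozen representation. The dimensions, synthetic data, and hyperparameters are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoLayerNet(nn.Module):
    """Two-layer ReLU network: hidden representation plus linear output."""
    def __init__(self, d_in=100, d_hidden=256, d_out=2):
        super().__init__()
        self.hidden = nn.Linear(d_in, d_hidden)
        self.head = nn.Linear(d_hidden, d_out)

    def features(self, x):
        return F.relu(self.hidden(x))

    def forward(self, x):
        return self.head(self.features(x))

def fgsm(model, x, y, eps):
    """Single-step l_inf perturbation used during adversarial pre-training."""
    x = x.clone().detach().requires_grad_(True)
    F.cross_entropy(model(x), y).backward()
    return (x + eps * x.grad.sign()).detach()

# ---- adversarial pre-training (illustrative synthetic data) ----
x, y = torch.randn(512, 100), torch.randint(0, 2, (512,))
model = TwoLayerNet()
opt = torch.optim.SGD(model.parameters(), lr=0.1)
for _ in range(100):
    x_adv = fgsm(model, x, y, eps=0.1)
    opt.zero_grad()
    F.cross_entropy(model(x_adv), y).backward()
    opt.step()

# ---- downstream task: clean training of a new head on frozen features ----
x_dn, y_dn = torch.randn(512, 100), torch.randint(0, 2, (512,))
head = nn.Linear(256, 2)
opt_dn = torch.optim.SGD(head.parameters(), lr=0.1)
for _ in range(100):
    with torch.no_grad():
        z = model.features(x_dn)   # purified representation, kept frozen
    opt_dn.zero_grad()
    F.cross_entropy(head(z), y_dn).backward()
    opt_dn.step()
```

The point of the sketch is the division of labor the paper studies: all adversarial optimization happens in pre-training, while the downstream head is trained only on clean data.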
Related papers
- Adversarial Training Can Provably Improve Robustness: Theoretical Analysis of Feature Learning Process Under Structured Data [38.44734564565478]
We provide a theoretical understanding of adversarial examples and adversarial training algorithms from the perspective of feature learning theory.
We show that the adversarial training method can provably strengthen the robust feature learning and suppress the non-robust feature learning.
arXiv Detail & Related papers (2024-10-11T03:59:49Z)
- Robust Transferable Feature Extractors: Learning to Defend Pre-Trained Networks Against White Box Adversaries [69.53730499849023]
We show that adversarial examples can be successfully transferred to another independently trained model to induce prediction errors.
We propose a deep learning-based pre-processing mechanism, which we refer to as a robust transferable feature extractor (RTFE)
arXiv Detail & Related papers (2022-09-14T21:09:34Z)
- Enhancing Adversarial Training with Feature Separability [52.39305978984573]
We introduce the concept of an adversarial training graph (ATG), with which the proposed adversarial training with feature separability (ATFS) boosts intra-class feature similarity and increases inter-class feature variance.
Through comprehensive experiments, we demonstrate that the proposed ATFS framework significantly improves both clean and robust performance.
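The ATG construction and the exact ATFS objective are not spelled out in this summary; the snippet below is only a generic feature-separability regularizer in the spirit described (pull same-class features toward their class mean, push class means apart), meant to be added to the usual training loss. The function name and the form of the penalty are assumptions, not the paper's definition.

```python
import torch
import torch.nn.functional as F

def feature_separability_loss(feats, labels):
    """Generic sketch: low intra-class scatter, high inter-class spread.

    feats: (N, D) feature vectors; labels: (N,) integer class labels.
    Returns a scalar regularizer to add to the main (adversarial) training loss.
    """
    feats = F.normalize(feats, dim=1)
    means, intra = [], []
    for c in labels.unique():
        fc = feats[labels == c]
        mu = fc.mean(dim=0)
        means.append(mu)
        intra.append(((fc - mu) ** 2).sum(dim=1).mean())   # within-class scatter
    means = torch.stack(means)
    # mean squared distance between class means (between-class spread)
    inter = ((means.unsqueeze(0) - means.unsqueeze(1)) ** 2).sum(dim=2).mean()
    return torch.stack(intra).mean() - inter   # minimize intra, maximize inter
```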
arXiv Detail & Related papers (2022-05-02T04:04:23Z)
- Finding Dynamics Preserving Adversarial Winning Tickets [11.05616199881368]
Pruning methods have been considered in the adversarial context to reduce model capacity and improve adversarial robustness simultaneously during training.
Existing adversarial pruning methods generally mimic classical pruning methods for natural training, which follow the three-stage 'training-pruning-fine-tuning' pipeline.
We show empirical evidence that AWT preserves the dynamics of adversarial training and achieves performance on par with dense adversarial training.
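AWT's ticket-finding procedure is not described in this summary; for context, the sketch below shows the classical three-stage 'training-pruning-fine-tuning' pipeline it contrasts with, adapted to adversarial training using PyTorch's built-in magnitude pruning. The model, synthetic data, attack, and pruning ratio are illustrative assumptions.

```python
import torch
import torch.nn.functional as F
import torch.nn.utils.prune as prune

def adv_step(model, opt, x, y, eps=0.1):
    """One adversarial-training step with a single-step l_inf attack."""
    x = x.clone().detach().requires_grad_(True)
    F.cross_entropy(model(x), y).backward()
    x_adv = (x + eps * x.grad.sign()).detach()
    opt.zero_grad()
    F.cross_entropy(model(x_adv), y).backward()
    opt.step()

# Illustrative model and data; a real pipeline would use a proper dataset.
model = torch.nn.Sequential(torch.nn.Linear(100, 256), torch.nn.ReLU(),
                            torch.nn.Linear(256, 10))
opt = torch.optim.SGD(model.parameters(), lr=0.1)
x, y = torch.randn(256, 100), torch.randint(0, 10, (256,))

# Stage 1: adversarial training of the dense model.
for _ in range(50):
    adv_step(model, opt, x, y)

# Stage 2: magnitude pruning (mask the 80% smallest weights per linear layer).
for m in model:
    if isinstance(m, torch.nn.Linear):
        prune.l1_unstructured(m, name="weight", amount=0.8)

# Stage 3: adversarial fine-tuning of the pruned (masked) model.
for _ in range(50):
    adv_step(model, opt, x, y)
```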
arXiv Detail & Related papers (2022-02-14T05:34:24Z)
- How does unlabeled data improve generalization in self-training? A one-hidden-layer theoretical analysis [93.37576644429578]
This work establishes the first theoretical analysis of the iterative self-training paradigm.
We prove the benefits of unlabeled data in both training convergence and generalization ability.
Experiments ranging from shallow to deep neural networks are also provided to justify our theoretical insights on self-training.
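As a rough picture of the iterative self-training paradigm analyzed here (a sketch under assumed data and thresholds, not the paper's experimental setup), the loop below pseudo-labels the unlabeled set with the current one-hidden-layer model, keeps only confident pseudo-labels, and retrains on the enlarged set.

```python
import torch
import torch.nn.functional as F

# One-hidden-layer network, matching the setting the analysis considers.
model = torch.nn.Sequential(torch.nn.Linear(20, 64), torch.nn.ReLU(),
                            torch.nn.Linear(64, 2))
opt = torch.optim.SGD(model.parameters(), lr=0.1)

x_lab, y_lab = torch.randn(100, 20), torch.randint(0, 2, (100,))  # small labeled set
x_unlab = torch.randn(1000, 20)                                   # large unlabeled set

for _ in range(5):                          # self-training rounds
    # (1) pseudo-label unlabeled data with the current model
    with torch.no_grad():
        conf, pseudo = F.softmax(model(x_unlab), dim=1).max(dim=1)
        keep = conf > 0.9                   # keep only confident pseudo-labels
    x_all = torch.cat([x_lab, x_unlab[keep]])
    y_all = torch.cat([y_lab, pseudo[keep]])
    # (2) retrain on labeled + pseudo-labeled data
    for _ in range(100):
        opt.zero_grad()
        F.cross_entropy(model(x_all), y_all).backward()
        opt.step()
```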
arXiv Detail & Related papers (2022-01-21T02:16:52Z)
- Understanding the Logit Distributions of Adversarially-Trained Deep Neural Networks [6.439477789066243]
Adversarial defenses train deep neural networks to be invariant to the input perturbations from adversarial attacks.
Although adversarial training is successful at mitigating adversarial attacks, the behavioral differences between adversarially-trained (AT) models and standard models are still poorly understood.
We identify three logit characteristics essential to learning adversarial robustness.
arXiv Detail & Related papers (2021-08-26T19:09:15Z)
- Robust Pre-Training by Adversarial Contrastive Learning [120.33706897927391]
Recent work has shown that, when integrated with adversarial training, self-supervised pre-training can lead to state-of-the-art robustness.
We improve robustness-aware self-supervised pre-training by learning representations consistent under both data augmentations and adversarial perturbations.
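A minimal sketch of that recipe (not the paper's exact algorithm): perturb one augmented view adversarially so as to increase a simplified InfoNCE loss, then update the encoder to keep the two views' representations consistent. The encoder, augmentation stand-in, temperature, and attack budget are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

encoder = torch.nn.Sequential(torch.nn.Linear(100, 256), torch.nn.ReLU(),
                              torch.nn.Linear(256, 64))
opt = torch.optim.SGD(encoder.parameters(), lr=0.1)

def augment(x):
    return x + 0.05 * torch.randn_like(x)   # stand-in for real data augmentation

def info_nce(z1, z2, tau=0.5):
    """Simplified InfoNCE: (z1[i], z2[i]) is the positive pair, the rest are negatives."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / tau
    return F.cross_entropy(logits, torch.arange(z1.size(0)))

x = torch.randn(128, 100)                    # unlabeled pre-training batch
for _ in range(100):
    v1, v2 = augment(x), augment(x)
    # adversarially perturb the second view to increase the contrastive loss
    v2 = v2.clone().detach().requires_grad_(True)
    info_nce(encoder(v1), encoder(v2)).backward()
    v2_adv = (v2 + 0.1 * v2.grad.sign()).detach()
    # minimize the loss so representations agree across augmentation AND attack
    opt.zero_grad()
    info_nce(encoder(v1), encoder(v2_adv)).backward()
    opt.step()
```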
arXiv Detail & Related papers (2020-10-26T04:44:43Z)
- Feature Purification: How Adversarial Training Performs Robust Deep Learning [66.05472746340142]
We present a principle that we call Feature Purification: one cause of the existence of adversarial examples is the accumulation of certain small dense mixtures in the hidden weights during the training of a neural network.
We present both experiments on the CIFAR-10 dataset to illustrate this principle and a theoretical result proving that, for certain natural classification tasks, training a two-layer neural network with ReLU activation using randomly initialized gradient descent indeed satisfies this principle.
arXiv Detail & Related papers (2020-05-20T16:56:08Z)
- Towards Achieving Adversarial Robustness by Enforcing Feature Consistency Across Bit Planes [51.31334977346847]
We train networks to form coarse impressions based on the information in higher bit planes, and use the lower bit planes only to refine their prediction.
We demonstrate that, by imposing consistency on the representations learned across differently quantized images, the adversarial robustness of networks improves significantly.
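A hedged sketch of the general idea (not the paper's exact regularizer): quantize each input so only its higher bit planes remain, and penalize disagreement between the predictions on the full and coarse images while training on both. The model, synthetic data, number of retained bit planes, and choice of consistency penalty are assumptions.

```python
import torch
import torch.nn.functional as F

def keep_high_bit_planes(img_uint8, n_planes=4):
    """Zero out the lowest (8 - n_planes) bit planes of an 8-bit image."""
    shift = 8 - n_planes
    return (img_uint8 >> shift) << shift

model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(784, 256),
                            torch.nn.ReLU(), torch.nn.Linear(256, 10))
opt = torch.optim.SGD(model.parameters(), lr=0.1)

imgs = torch.randint(0, 256, (64, 1, 28, 28), dtype=torch.uint8)  # stand-in images
labels = torch.randint(0, 10, (64,))

for _ in range(100):
    x_full = imgs.float() / 255.0
    x_coarse = keep_high_bit_planes(imgs).float() / 255.0          # higher bit planes only
    logits_full, logits_coarse = model(x_full), model(x_coarse)
    # classify both versions and keep their predictions consistent
    loss = (F.cross_entropy(logits_full, labels)
            + F.cross_entropy(logits_coarse, labels)
            + F.mse_loss(F.softmax(logits_full, dim=1), F.softmax(logits_coarse, dim=1)))
    opt.zero_grad()
    loss.backward()
    opt.step()
```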
arXiv Detail & Related papers (2020-04-01T09:31:10Z)