Enhance the Visual Representation via Discrete Adversarial Training
- URL: http://arxiv.org/abs/2209.07735v1
- Date: Fri, 16 Sep 2022 06:25:06 GMT
- Title: Enhance the Visual Representation via Discrete Adversarial Training
- Authors: Xiaofeng Mao, Yuefeng Chen, Ranjie Duan, Yao Zhu, Gege Qi, Shaokai Ye,
Xiaodan Li, Rong Zhang, Hui Xue
- Abstract summary: Adversarial Training (AT) is commonly accepted as one of the most effective approaches for defending against adversarial examples.
We propose Discrete Adversarial Training (DAT), which reformulates images as discrete, text-like inputs, i.e. visual words.
As a plug-and-play technique for enhancing the visual representation, DAT achieves significant improvement on multiple tasks.
- Score: 24.3040211834614
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Adversarial Training (AT), commonly accepted as one of the most
effective defenses against adversarial examples, can largely harm standard
performance and thus has limited usefulness in industrial-scale production
and applications. Surprisingly, the opposite holds in Natural Language
Processing (NLP) tasks, where AT can even benefit generalization. We observe
that the merit of AT in NLP tasks may derive from their discrete and symbolic
input space. To borrow this advantage, we propose Discrete Adversarial
Training (DAT). DAT leverages VQGAN to reformulate images as discrete,
text-like inputs, i.e. visual words. It then minimizes the maximal risk on
such discrete images under symbolic adversarial perturbations. We further
explain the effectiveness of DAT from a distributional perspective. As a
plug-and-play technique for enhancing visual representations, DAT achieves
significant improvements on multiple tasks, including image classification,
object detection, and self-supervised learning. Notably, a model pre-trained
with Masked Auto-Encoding (MAE) and fine-tuned with DAT, without extra data,
achieves 31.40 mCE on ImageNet-C and 32.77% top-1 accuracy on
Stylized-ImageNet, setting a new state of the art. The code will be available
at https://github.com/alibaba/easyrobust.
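To make the min-max structure concrete, here is a minimal, self-contained PyTorch sketch. A toy nearest-neighbour codebook stands in for VQGAN, and the classifier head, hyperparameters, and exact placement of the perturbation are illustrative assumptions, not the authors' implementation (see the repository above for that).

```python
# Minimal sketch of a DAT-style min-max objective over discrete latents.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyQuantizer(nn.Module):
    """Nearest-neighbour codebook lookup, a stand-in for VQGAN's quantizer."""
    def __init__(self, num_codes=512, dim=64):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)

    def forward(self, z):                               # z: (B, N, D)
        dist = (z.unsqueeze(2) - self.codebook.weight).pow(2).sum(-1)
        idx = dist.argmin(-1)                           # discrete "visual words"
        zq = self.codebook(idx)
        # Straight-through estimator: forward uses zq, backward flows to z;
        # the codebook stays frozen, mirroring a pre-trained VQGAN.
        return z + (zq - z).detach(), idx

def dat_loss(encoder, quantizer, head, x, y, eps=0.1, alpha=0.05, steps=3):
    """head: any classifier over (B, N, D) latents, e.g. pool + linear."""
    # Inner maximization: perturb the latent so the quantized (symbolic)
    # input changes and the loss grows.
    z0 = encoder(x).detach()
    delta = torch.zeros_like(z0, requires_grad=True)
    for _ in range(steps):
        zq, _ = quantizer(z0 + delta)
        (grad,) = torch.autograd.grad(F.cross_entropy(head(zq), y), delta)
        delta = (delta + alpha * grad.sign()).clamp(-eps, eps)
        delta = delta.detach().requires_grad_(True)
    # Outer minimization: train the model on the perturbed symbolic input.
    zq_adv, _ = quantizer(encoder(x) + delta.detach())
    return F.cross_entropy(head(zq_adv), y)
```

In the paper the adversarial visual words are decoded back to pixels by VQGAN before training; this sketch classifies the quantized latents directly to stay short.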
Related papers
- Enhancing Large Vision Language Models with Self-Training on Image Comprehension [131.14381425260706] (arXiv, 2024-05-30)
We introduce Self-Training on Image (STIC), which emphasizes a self-training approach specifically for image comprehension.
First, the model self-constructs a preference dataset for image descriptions from unlabeled images.
To further self-improve reasoning over the extracted visual information, the model reuses a small portion of existing instruction-tuning data.
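A rough sketch of how such self-constructed preference pairs might be assembled is below; `lvlm_generate`, `corrupt`, and the prompts are hypothetical placeholders, not STIC's exact recipe.

```python
# Hypothetical sketch of self-constructed preference data for image
# descriptions; stubs stand in for a real vision-language model.
import random

def lvlm_generate(image, prompt):
    # Stub: replace with a real vision-language-model call.
    return f"<response to {prompt!r}>"

def corrupt(image):
    # Stub: e.g. heavy blur or a crop that hides the main content.
    return image

def build_preference_pair(image, rng=random):
    good_prompt = "Describe this image in detail."
    bad_prompt = "Give a careless, one-sentence description."
    chosen = lvlm_generate(image, good_prompt)
    # Dispreferred answer: same request on a degraded input, or a bad prompt.
    if rng.random() < 0.5:
        rejected = lvlm_generate(corrupt(image), good_prompt)
    else:
        rejected = lvlm_generate(image, bad_prompt)
    return {"prompt": good_prompt, "chosen": chosen, "rejected": rejected}

print(build_preference_pair(image="img_0001.jpg"))
```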
- AnomalyGPT: Detecting Industrial Anomalies Using Large Vision-Language Models [30.723122000372538] (arXiv, 2023-08-29)
AnomalyGPT is a novel industrial anomaly detection (IAD) approach based on Large Vision-Language Models (LVLMs).
We generate training data by simulating anomalous images and producing corresponding textual descriptions for each image.
AnomalyGPT achieves the state-of-the-art performance with an accuracy of 86.1%, an image-level AUC of 94.1%, and a pixel-level AUC of 95.3% on the MVTec-AD dataset.
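The data-generation step lends itself to a short sketch: paste a synthetic defect onto a normal image and emit a matching description. Everything below (patch shapes, wording, parameters) is an illustrative assumption, not AnomalyGPT's pipeline.

```python
# Illustrative sketch of simulated-anomaly training data: paste a noise
# patch onto a normal image and emit a caption plus a pixel-level mask.
import numpy as np

def simulate_anomaly(img, rng):
    h, w = img.shape[:2]
    ph = int(rng.integers(h // 8, h // 4))          # patch height
    pw = int(rng.integers(w // 8, w // 4))          # patch width
    y = int(rng.integers(0, h - ph))
    x = int(rng.integers(0, w - pw))
    out = img.copy()
    out[y:y + ph, x:x + pw] = rng.integers(0, 256, (ph, pw) + img.shape[2:])
    pos = ("top" if y < h // 2 else "bottom") + "-" + \
          ("left" if x < w // 2 else "right")
    caption = f"This image shows an anomaly in the {pos} region."
    mask = np.zeros((h, w), dtype=np.uint8)
    mask[y:y + ph, x:x + pw] = 1                    # pixel-level ground truth
    return out, mask, caption

rng = np.random.default_rng(0)
normal = rng.integers(0, 256, (224, 224, 3), dtype=np.uint8)
anom_img, anom_mask, caption = simulate_anomaly(normal, rng)
print(caption)
```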
- MOCA: Self-supervised Representation Learning by Predicting Masked Online Codebook Assignments [72.6405488990753] (arXiv, 2023-07-18)
Self-supervised learning can mitigate the data-hungry nature of Vision Transformer networks.
We propose a single-stage and standalone method, MOCA, which unifies both desired properties.
We achieve new state-of-the-art results on low-shot settings and strong experimental results in various evaluation protocols.
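A generic sketch of a masked codebook-assignment objective follows, under the assumption that a teacher assigns each token to its nearest code and the student predicts those assignments at masked positions; this is not MOCA's exact design.

```python
# Generic sketch: predict the teacher's codebook assignments at masked
# positions. Teacher, codebook, and loss are assumptions for illustration.
import torch
import torch.nn.functional as F

def masked_assignment_loss(student_logits, teacher_feats, codebook, mask):
    """student_logits: (B, N, K), teacher_feats: (B, N, D),
    codebook: (K, D), mask: (B, N) bool, True where tokens were masked."""
    with torch.no_grad():
        sim = teacher_feats @ codebook.t()      # similarity to each code
        targets = sim.argmax(-1)                # teacher's code assignments
    # The student only has to explain the tokens it never saw.
    return F.cross_entropy(student_logits[mask], targets[mask])
```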
- Improved Visual Fine-tuning with Natural Language Supervision [36.250244364023665] (arXiv, 2023-04-04)
Fine-tuning a visual pre-trained model can leverage the semantic information from large-scale pre-training data.
The problem of catastrophic forgetting in the pre-trained backbone has been extensively studied for fine-tuning.
We introduce a reference distribution obtained from a fixed text classifier, which can help regularize the learned vision classifier.
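A minimal sketch of that regularization, assuming a KL term toward the fixed text classifier's distribution; the temperature and weight are illustrative choices.

```python
# Sketch: regularize the learned vision classifier toward a reference
# distribution produced by a fixed text classifier.
import torch
import torch.nn.functional as F

def regularized_loss(vision_logits, text_logits, labels, lam=0.5, tau=1.0):
    ce = F.cross_entropy(vision_logits, labels)
    ref = F.softmax(text_logits / tau, dim=-1).detach()   # fixed reference
    kl = F.kl_div(F.log_softmax(vision_logits / tau, dim=-1),
                  ref, reduction="batchmean")
    return ce + lam * kl
```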
- Unsupervised Representation Learning from Pre-trained Diffusion Probabilistic Models [83.75414370493289] (arXiv, 2022-12-26)
Diffusion Probabilistic Models (DPMs) have shown a powerful capacity for generating high-quality image samples.
Diff-AE has been proposed to explore DPMs for representation learning via autoencoding.
We propose Pre-trained DPM AutoEncoding (PDAE) to adapt existing pre-trained DPMs to decoders for image reconstruction.
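A speculative sketch of the general shape of such an adaptation: a trainable encoder and shift network are fitted around a frozen pre-trained noise predictor so that the pair reconstructs the input. The objective below is a guess for illustration, not PDAE's exact formulation.

```python
# Speculative sketch: adapt a frozen pre-trained noise predictor into an
# autoencoder via a trainable encoder and conditioning shift network.
import torch
import torch.nn.functional as F

def add_noise(x0, t, noise, alphas_cumprod):
    """Standard DPM forward process q(x_t | x_0)."""
    a = alphas_cumprod[t].view(-1, 1, 1, 1)
    return a.sqrt() * x0 + (1 - a).sqrt() * noise

def pdae_style_loss(encoder, shift_net, eps_theta, x0, t, alphas_cumprod):
    z = encoder(x0)                       # semantic latent from the clean image
    noise = torch.randn_like(x0)
    x_t = add_noise(x0, t, noise, alphas_cumprod)
    with torch.no_grad():
        eps_pred = eps_theta(x_t, t)      # frozen, pre-trained noise predictor
    # A trainable shift conditioned on z closes the reconstruction gap.
    return F.mse_loss(eps_pred + shift_net(x_t, t, z), noise)
```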
- Noise Self-Regression: A New Learning Paradigm to Enhance Low-Light Images Without Task-Related Data [86.68013790656762] (arXiv, 2022-11-09)
We propose Noise SElf-Regression (NoiSER), which learns to enhance low-light images without access to any task-related data.
NoiSER is highly competitive in enhancement quality, yet with a much smaller model size and much lower training and inference cost.
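Taking the name at face value, a minimal sketch of the paradigm might look like the following: a deliberately tiny network trained only on random noise, regressed onto itself. The architecture and loss are assumptions, not the paper's.

```python
# Speculative sketch of noise self-regression: no task-related data appears.
import torch
import torch.nn as nn
import torch.nn.functional as F

net = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.InstanceNorm2d(16), nn.ReLU(),
    nn.Conv2d(16, 3, 3, padding=1),
)
opt = torch.optim.Adam(net.parameters(), lr=1e-3)
for step in range(100):
    noise = torch.randn(8, 3, 64, 64)           # the only "training data"
    loss = F.mse_loss(net(noise), noise)        # regress noise onto itself
    opt.zero_grad()
    loss.backward()
    opt.step()
# At inference, the trained net would be applied to a low-light image.
```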
- SdAE: Self-distillated Masked Autoencoder [95.3684955370897] (arXiv, 2022-07-31)
This paper proposes SdAE, a self-distillated masked autoencoder network.
With only 300 epochs of pre-training, a vanilla ViT-Base model achieves 84.1% fine-tuning accuracy on ImageNet-1k classification.
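A generic self-distillation sketch in that spirit: an EMA teacher provides feature targets for the patches the student never saw. Momentum, masking, and loss are illustrative, not SdAE's exact recipe; `student` and `teacher` are assumed to map (B, N, D) patch embeddings to (B, N, D) features.

```python
# Generic sketch of self-distillation for a masked autoencoder.
import torch
import torch.nn.functional as F

@torch.no_grad()
def ema_update(teacher, student, m=0.996):
    for pt, ps in zip(teacher.parameters(), student.parameters()):
        pt.mul_(m).add_(ps, alpha=1 - m)

def sdae_style_loss(student, teacher, patches, mask):
    """patches: (B, N, D); mask: (B, N) bool, True where masked."""
    with torch.no_grad():
        targets = teacher(patches)                 # teacher sees everything
    visible = patches * (~mask).unsqueeze(-1)      # student sees visible only
    preds = student(visible)
    return F.mse_loss(preds[mask], targets[mask])  # regress masked features
```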
- Masked Unsupervised Self-training for Zero-shot Image Classification [98.23094305347709] (arXiv, 2022-06-07)
Masked Unsupervised Self-Training (MUST) is a new approach which leverages two different and complementary sources of supervision: pseudo-labels and raw images.
MUST improves upon CLIP by a large margin and narrows the performance gap between unsupervised and supervised classification.
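A sketch combining the two supervision sources named above: confident zero-shot pseudo-labels (e.g. from CLIP) plus a masked-image term on raw pixels. The confidence threshold and loss weight are assumptions, not MUST's exact settings.

```python
# Sketch: pseudo-label classification plus masked-image reconstruction.
import torch
import torch.nn.functional as F

def must_style_loss(student_logits, zeroshot_logits, mim_pred, mim_target,
                    mask, conf_thresh=0.7, lam=1.0):
    probs = zeroshot_logits.softmax(-1)
    conf, pseudo = probs.max(-1)
    keep = conf > conf_thresh                   # trust only confident labels
    if keep.any():
        cls = F.cross_entropy(student_logits[keep], pseudo[keep])
    else:
        cls = student_logits.sum() * 0.0        # keep the graph alive
    mim = F.mse_loss(mim_pred[mask], mim_target[mask])
    return cls + lam * mim
```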
- AugNet: End-to-End Unsupervised Visual Representation Learning with Image Augmentation [3.6790362352712873] (arXiv, 2021-06-11)
We propose AugNet, a new deep learning training paradigm to learn image features from a collection of unlabeled pictures.
Our experiments demonstrate that the method can represent images in a low-dimensional space.
Unlike many deep-learning-based image retrieval algorithms, our approach does not require access to external annotated datasets.
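A minimal sketch of the underlying idea, learning an embedding from augmentations of unlabeled images alone; the encoder, augmentation, and loss below are generic stand-ins, and the regularizers a real method needs to avoid collapse are omitted.

```python
# Sketch: pull embeddings of two augmented views of the same image together.
import torch
import torch.nn as nn
import torch.nn.functional as F

encoder = nn.Sequential(
    nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, 64),
)

def augment(x):
    # Stub augmentation (horizontal flip); use crops/jitter in practice.
    return torch.flip(x, dims=[-1])

x = torch.randn(16, 3, 64, 64)                  # a batch of unlabeled images
z1 = F.normalize(encoder(x), dim=-1)
z2 = F.normalize(encoder(augment(x)), dim=-1)
loss = (2 - 2 * (z1 * z2).sum(-1)).mean()       # cosine-similarity objective
loss.backward()
```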
This list is automatically generated from the titles and abstracts of the papers on this site.