Understanding and Improving Robustness of Vision Transformers through
Patch-based Negative Augmentation
- URL: http://arxiv.org/abs/2110.07858v1
- Date: Fri, 15 Oct 2021 04:53:18 GMT
- Title: Understanding and Improving Robustness of Vision Transformers through
Patch-based Negative Augmentation
- Authors: Yao Qin, Chiyuan Zhang, Ting Chen, Balaji Lakshminarayanan, Alex
Beutel, Xuezhi Wang
- Abstract summary: We investigate the robustness of vision transformers (ViTs) through the lens of their special patch-based architectural structure.
We find that ViTs are surprisingly insensitive to patch-based transformations, even when the transformation largely destroys the original semantics.
We show that patch-based negative augmentation consistently improves robustness of ViTs across a wide set of ImageNet-based robustness benchmarks.
- Score: 29.08732248577141
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We investigate the robustness of vision transformers (ViTs) through the lens
of their special patch-based architectural structure, i.e., they process an
image as a sequence of image patches. We find that ViTs are surprisingly
insensitive to patch-based transformations, even when the transformation
largely destroys the original semantics and makes the image unrecognizable by
humans. This indicates that ViTs heavily use features that survived such
transformations but are generally not indicative of the semantic class to
humans. Further investigations show that these features are useful but
non-robust, as ViTs trained on them can achieve high in-distribution accuracy,
but break down under distribution shifts. From this understanding, we ask: can
training the model to rely less on these features improve ViT robustness and
out-of-distribution performance? We use the images transformed with our
patch-based operations as negatively augmented views and offer losses to
regularize the training away from using non-robust features. This is a
complementary view to existing research that mostly focuses on augmenting
inputs with semantic-preserving transformations to enforce models' invariance.
We show that patch-based negative augmentation consistently improves robustness
of ViTs across a wide set of ImageNet-based robustness benchmarks. Furthermore,
we find that our patch-based negative augmentation is complementary to traditional
(positive) data augmentation, and together they boost performance further. All
the code in this work will be open-sourced.
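To make the idea concrete, the following is a minimal, hedged sketch of one possible patch-based negative augmentation setup: random patch shuffling as the semantics-destroying transformation, plus a regularizer that pushes predictions on the shuffled view toward the uniform distribution. The function names (`patch_shuffle`, `negative_augmentation_loss`), the KL-to-uniform penalty, and the weight `neg_weight` are illustrative assumptions, not the paper's exact losses or released code.

```python
# Hedged sketch of patch-based negative augmentation; NOT the authors' released code.
# Assumptions: images are (B, C, H, W) with patch_size dividing H and W, `model`
# is any classifier returning logits, and the KL-to-uniform term is one plausible
# way to regularize the model away from non-robust, patch-local features.
import torch
import torch.nn.functional as F


def patch_shuffle(images: torch.Tensor, patch_size: int) -> torch.Tensor:
    """Randomly permute the non-overlapping patches of each image."""
    b, c, h, w = images.shape
    gh, gw = h // patch_size, w // patch_size
    # (B, C, H, W) -> (B, gh*gw, C, patch_size, patch_size)
    patches = images.reshape(b, c, gh, patch_size, gw, patch_size)
    patches = patches.permute(0, 2, 4, 1, 3, 5).reshape(b, gh * gw, c, patch_size, patch_size)
    # Independent random permutation of patch order per image.
    perm = torch.rand(b, gh * gw, device=images.device).argsort(dim=1)
    idx = perm[:, :, None, None, None].expand_as(patches)
    shuffled = torch.gather(patches, 1, idx)
    # Reassemble into an image of the original shape.
    shuffled = shuffled.reshape(b, gh, gw, c, patch_size, patch_size)
    return shuffled.permute(0, 3, 1, 4, 2, 5).reshape(b, c, h, w)


def negative_augmentation_loss(model, images, labels, patch_size=16, neg_weight=0.3):
    """Standard cross-entropy on clean images plus a penalty that discourages
    confident class predictions on the semantics-destroying (negative) view."""
    clean_loss = F.cross_entropy(model(images), labels)
    neg_log_probs = F.log_softmax(model(patch_shuffle(images, patch_size)), dim=-1)
    uniform = torch.full_like(neg_log_probs, 1.0 / neg_log_probs.size(-1))
    neg_loss = F.kl_div(neg_log_probs, uniform, reduction="batchmean")
    return clean_loss + neg_weight * neg_loss
```

In a training loop this would replace the plain cross-entropy objective; `neg_weight` trades off clean accuracy against how strongly the model is pushed away from features that survive patch shuffling.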
Related papers
- Interpretability-Aware Vision Transformer [13.310757078491916]
Vision Transformers (ViTs) have become prominent models for solving various vision tasks.
We introduce a novel training procedure that inherently enhances model interpretability.
IA-ViT is composed of a feature extractor, a predictor, and an interpreter, which are trained jointly with an interpretability-aware training objective.
arXiv Detail & Related papers (2023-09-14T21:50:49Z) - Patch Is Not All You Need [57.290256181083016]
We propose a novel Pattern Transformer to adaptively convert images to pattern sequences for Transformer input.
We employ a convolutional neural network to extract various patterns from the input image.
We have accomplished state-of-the-art performance on CIFAR-10 and CIFAR-100, and have achieved competitive results on ImageNet.
arXiv Detail & Related papers (2023-08-21T13:54:00Z) - On the unreasonable vulnerability of transformers for image restoration
-- and an easy fix [16.927916090724363]
We investigate whether the improved adversarial robustness of ViTs extends to image restoration.
We consider the recently proposed Restormer model, as well as NAFNet and the "Baseline network".
Our experiments are performed on real-world images from the GoPro dataset for image deblurring.
arXiv Detail & Related papers (2023-07-25T23:09:05Z) - Understanding Gaussian Attention Bias of Vision Transformers Using
Effective Receptive Fields [7.58745191859815]
Vision transformers (ViTs) that model an image as a sequence of partitioned patches have shown notable performance in diverse vision tasks.
We propose explicitly adding a Gaussian attention bias that guides the positional embedding to have the corresponding pattern from the beginning of training.
The results show that the proposed method not only helps ViTs understand images but also boosts their performance on various datasets.
arXiv Detail & Related papers (2023-05-08T14:12:25Z) - Image Deblurring by Exploring In-depth Properties of Transformer [86.7039249037193]
We leverage deep features extracted from a pretrained vision transformer (ViT) to encourage recovered images to be sharp without sacrificing the performance measured by the quantitative metrics.
By comparing transformer features of the recovered image and the target image, the pretrained transformer provides high-resolution, blur-sensitive semantic information.
One approach regards the features as vectors and computes the discrepancy between representations extracted from the recovered and target images in Euclidean space.
arXiv Detail & Related papers (2023-03-24T14:14:25Z) - Patch-level Representation Learning for Self-supervised Vision
Transformers [68.8862419248863]
Vision Transformers (ViTs) have gained much attention recently as a better architectural choice, often outperforming convolutional networks for various visual tasks.
Inspired by this, we design a simple yet effective visual pretext task, coined SelfPatch, for learning better patch-level representations.
We demonstrate that SelfPatch can significantly improve the performance of existing SSL methods for various visual tasks.
arXiv Detail & Related papers (2022-06-16T08:01:19Z) - Discrete Representations Strengthen Vision Transformer Robustness [43.821734467553554]
Vision Transformer (ViT) is emerging as the state-of-the-art architecture for image recognition.
We present a simple and effective architecture modification to ViT's input layer by adding discrete tokens produced by a vector-quantized encoder.
Experimental results demonstrate that adding discrete representations to four architecture variants strengthens ViT robustness by up to 12% across seven ImageNet robustness benchmarks (a hedged sketch of this input-layer idea appears after this list).
arXiv Detail & Related papers (2021-11-20T01:49:56Z) - Intriguing Properties of Vision Transformers [114.28522466830374]
Vision transformers (ViT) have demonstrated impressive performance across various machine vision problems.
We systematically study this question via an extensive set of experiments and comparisons with a high-performing convolutional neural network (CNN).
We show that the effective features of ViTs are due to flexible and dynamic receptive fields made possible by the self-attention mechanism.
arXiv Detail & Related papers (2021-05-21T17:59:18Z) - Improve Vision Transformers Training by Suppressing Over-smoothing [28.171262066145612]
Introducing the transformer structure into computer vision tasks holds the promise of yielding a better speed-accuracy trade-off than traditional convolution networks.
However, directly training vanilla transformers on vision tasks has been shown to yield unstable and sub-optimal results.
Recent works propose to modify transformer structures by incorporating convolutional layers to improve the performance on vision tasks.
arXiv Detail & Related papers (2021-04-26T17:43:04Z) - On the Adversarial Robustness of Visual Transformers [129.29523847765952]
This work provides the first and comprehensive study on the robustness of vision transformers (ViTs) against adversarial perturbations.
Tested on various white-box and transfer attack settings, we find that ViTs possess better adversarial robustness when compared with convolutional neural networks (CNNs).
arXiv Detail & Related papers (2021-03-29T14:48:24Z) - DeepViT: Towards Deeper Vision Transformer [92.04063170357426]
Vision transformers (ViTs) have been successfully applied in image classification tasks recently.
We show that, unlike convolutional neural networks (CNNs), which can be improved by stacking more convolutional layers, the performance of ViTs saturates quickly when they are scaled deeper.
We propose a simple yet effective method, named Re-attention, to re-generate the attention maps to increase their diversity.
arXiv Detail & Related papers (2021-03-22T14:32:07Z)
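For contrast with the negative-augmentation approach above, here is a minimal, hedged sketch of the input-layer idea summarized in "Discrete Representations Strengthen Vision Transformer Robustness". The class name `DiscreteTokenPatchEmbed`, the codebook size, and the choice to sum (rather than concatenate) the discrete and continuous embeddings are illustrative assumptions, not that paper's implementation.

```python
# Hedged sketch of adding a discrete (vector-quantized) view to ViT patch embeddings.
# The combination rule and hyperparameters below are illustrative assumptions only.
import torch
import torch.nn as nn


class DiscreteTokenPatchEmbed(nn.Module):
    """Patch embedding that also looks up the nearest codebook entry per patch
    and injects that discrete token embedding into the ViT input sequence."""

    def __init__(self, in_chans=3, patch_size=16, embed_dim=768, codebook_size=1024):
        super().__init__()
        self.proj = nn.Conv2d(in_chans, embed_dim, kernel_size=patch_size, stride=patch_size)
        self.codebook = nn.Embedding(codebook_size, embed_dim)

    def forward(self, x):                                    # x: (B, C, H, W)
        z = self.proj(x).flatten(2).transpose(1, 2)          # (B, N, D) continuous embeddings
        # Nearest-neighbour quantization (straight-through gradient omitted for brevity).
        dists = (z.unsqueeze(2) - self.codebook.weight[None, None]).pow(2).sum(-1)  # (B, N, K)
        codes = dists.argmin(dim=-1)                          # (B, N) discrete token ids
        return z + self.codebook(codes)                       # continuous + discrete views
```

In practice a straight-through estimator or a frozen, pretrained vector-quantized encoder would be needed for the discrete path to train; the sketch only shows where discrete tokens could enter the ViT input layer.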