Discrete Representations Strengthen Vision Transformer Robustness
- URL: http://arxiv.org/abs/2111.10493v1
- Date: Sat, 20 Nov 2021 01:49:56 GMT
- Title: Discrete Representations Strengthen Vision Transformer Robustness
- Authors: Chengzhi Mao, Lu Jiang, Mostafa Dehghani, Carl Vondrick, Rahul
Sukthankar, Irfan Essa
- Abstract summary: Vision Transformer (ViT) is emerging as the state-of-the-art architecture for image recognition.
We present a simple and effective architecture modification to ViT's input layer by adding discrete tokens produced by a vector-quantized encoder.
Experimental results demonstrate that adding discrete representations to four architecture variants strengthens ViT robustness by up to 12% across seven ImageNet robustness benchmarks.
- Score: 43.821734467553554
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision Transformer (ViT) is emerging as the state-of-the-art architecture for
image recognition. While recent studies suggest that ViTs are more robust than
their convolutional counterparts, our experiments find that ViTs are overly
reliant on local features (e.g., nuisances and texture) and fail to make
adequate use of global context (e.g., shape and structure). As a result, ViTs
fail to generalize to out-of-distribution, real-world data. To address this
deficiency, we present a simple and effective architecture modification to
ViT's input layer by adding discrete tokens produced by a vector-quantized
encoder. Different from the standard continuous pixel tokens, discrete tokens
are invariant under small perturbations and contain less information
individually, which promotes ViTs to learn global information that is invariant.
Experimental results demonstrate that adding discrete representations to four
architecture variants strengthens ViT robustness by up to 12% across seven
ImageNet robustness benchmarks while maintaining performance on ImageNet.
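The abstract describes the modification only at a high level: discrete tokens looked up from a vector-quantized codebook are fed to the ViT input layer alongside the usual continuous patch tokens, so small pixel perturbations that do not change the nearest codebook entry leave the discrete token unchanged. The PyTorch sketch below illustrates one way such an input layer could be wired; the VQTokenizer stand-in, the codebook size, and the choice to concatenate the discrete and pixel token streams are assumptions for illustration, not the authors' implementation.

```python
# Hypothetical sketch (not the authors' code): fusing discrete tokens from a
# vector-quantized encoder with the usual continuous patch tokens at a ViT
# input layer. Names, sizes, and the concatenation scheme are assumptions.
import torch
import torch.nn as nn


class VQTokenizer(nn.Module):
    """Stand-in for a pretrained vector-quantized encoder (e.g., a VQ-VAE)."""

    def __init__(self, codebook_size=1024, code_dim=64):
        super().__init__()
        self.codebook = nn.Embedding(codebook_size, code_dim)
        # Toy conv encoder producing a 14x14 feature grid for 224x224 inputs;
        # in practice this encoder would be pretrained and frozen.
        self.encoder = nn.Conv2d(3, code_dim, kernel_size=16, stride=16)

    @torch.no_grad()
    def forward(self, images):                          # images: (B, 3, H, W)
        feats = self.encoder(images)                    # (B, D, h, w)
        feats = feats.flatten(2).transpose(1, 2)        # (B, N, D)
        # Nearest codebook entry per position: ||f||^2 - 2 f.c + ||c||^2.
        f_sq = feats.pow(2).sum(-1, keepdim=True)       # (B, N, 1)
        c_sq = self.codebook.weight.pow(2).sum(-1)      # (K,)
        cross = feats @ self.codebook.weight.t()        # (B, N, K)
        return (f_sq - 2 * cross + c_sq).argmin(-1)     # (B, N) discrete codes


class DiscretePixelViTInput(nn.Module):
    """ViT input layer that feeds both pixel tokens and discrete tokens."""

    def __init__(self, img_size=224, patch=16, embed_dim=768, codebook_size=1024):
        super().__init__()
        num_patches = (img_size // patch) ** 2
        self.patch_embed = nn.Conv2d(3, embed_dim, kernel_size=patch, stride=patch)
        self.code_embed = nn.Embedding(codebook_size, embed_dim)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        # Positions cover [CLS] + pixel tokens + discrete tokens; concatenating
        # the two streams is an assumption (per-patch fusion is another option).
        self.pos_embed = nn.Parameter(torch.zeros(1, 1 + 2 * num_patches, embed_dim))

    def forward(self, images, code_indices):
        pixel_tok = self.patch_embed(images).flatten(2).transpose(1, 2)  # (B, N, D)
        discrete_tok = self.code_embed(code_indices)                     # (B, N, D)
        cls = self.cls_token.expand(images.shape[0], -1, -1)
        tokens = torch.cat([cls, pixel_tok, discrete_tok], dim=1)
        return tokens + self.pos_embed    # sequence for the transformer encoder


if __name__ == "__main__":
    images = torch.randn(2, 3, 224, 224)
    codes = VQTokenizer()(images)                     # (2, 196) integer codes
    tokens = DiscretePixelViTInput()(images, codes)   # (2, 393, 768)
    print(tokens.shape)
```

One natural design question is whether to concatenate the two token streams, as the sketch does for clarity, or to fuse them per patch (e.g., by addition); the abstract does not specify the fusion scheme, so either reading is plausible.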
Related papers
- Semantic Graph Consistency: Going Beyond Patches for Regularizing Self-Supervised Vision Transformers [5.359378066251386]
Self-supervised learning with vision transformers (ViTs) has proven effective for representation learning.
Existing ViT-based SSL architectures do not fully exploit the ViT backbone.
We introduce a novel Semantic Graph Consistency (SGC) module to regularize ViT-based SSL methods and leverage patch tokens effectively.
arXiv Detail & Related papers (2024-06-18T06:36:44Z)
- Making Vision Transformers Truly Shift-Equivariant [20.61570323513044]
Vision Transformers (ViTs) have become one of the go-to deep net architectures for computer vision.
We introduce novel data-adaptive designs for each of the modules in ViTs, such as tokenization, self-attention, patch merging, and positional encoding.
We evaluate the proposed adaptive models on image classification and semantic segmentation tasks.
arXiv Detail & Related papers (2023-05-25T17:59:40Z)
- Self-Distilled Vision Transformer for Domain Generalization [58.76055100157651]
Vision transformers (ViTs) are challenging the supremacy of CNNs on standard benchmarks.
We propose a simple domain generalization (DG) approach for ViTs, coined self-distillation for ViTs.
We empirically demonstrate notable performance gains with different DG baselines and various ViT backbones on five challenging datasets.
arXiv Detail & Related papers (2022-07-25T17:57:05Z)
- A Unified Pruning Framework for Vision Transformers [40.7622551128182]
Vision transformer (ViT) and its variants have achieved promising performances in various computer vision tasks.
We propose a unified framework for structural pruning of ViTs and their variants, namely UP-ViTs.
Our method prunes all components of ViTs while maintaining the consistency of the model structure.
arXiv Detail & Related papers (2021-11-30T05:01:02Z)
- Self-slimmed Vision Transformer [52.67243496139175]
Vision transformers (ViTs) have become popular architectures and have outperformed convolutional neural networks (CNNs) on various vision tasks.
We propose a generic self-slimmed learning approach for vanilla ViTs, namely SiT.
Specifically, we first design a novel Token Slimming Module (TSM), which can boost the inference efficiency of ViTs.
arXiv Detail & Related papers (2021-11-24T16:48:57Z)
- On Improving Adversarial Transferability of Vision Transformers [97.17154635766578]
Vision transformers (ViTs) process input images as sequences of patches via self-attention.
We study the adversarial feature space of ViT models and their transferability.
We introduce two novel strategies specific to the architecture of ViT models.
arXiv Detail & Related papers (2021-06-08T08:20:38Z)
- Intriguing Properties of Vision Transformers [114.28522466830374]
Vision transformers (ViT) have demonstrated impressive performance across various machine vision problems.
We systematically study this question via an extensive set of experiments and comparisons with a high-performing convolutional neural network (CNN).
We show that the effective features of ViTs are due to the flexible and dynamic receptive fields made possible by the self-attention mechanism.
arXiv Detail & Related papers (2021-05-21T17:59:18Z)
- Vision Transformers are Robust Learners [65.91359312429147]
We study the robustness of the Vision Transformer (ViT) against common corruptions and perturbations, distribution shifts, and natural adversarial examples.
We present analyses that provide both quantitative and qualitative indications to explain why ViTs are indeed more robust learners.
arXiv Detail & Related papers (2021-05-17T02:39:22Z)
This list is automatically generated from the titles and abstracts of the papers on this site.