Hardwiring ViT Patch Selectivity into CNNs using Patch Mixing
- URL: http://arxiv.org/abs/2306.17848v1
- Date: Fri, 30 Jun 2023 17:59:53 GMT
- Title: Hardwiring ViT Patch Selectivity into CNNs using Patch Mixing
- Authors: Ariel N. Lee, Sarah Adel Bargal, Janavi Kasera, Stan Sclaroff, Kate
Saenko, Nataniel Ruiz
- Abstract summary: We train vision transformers (ViTs) and convolutional neural networks (CNNs) using Patch Mixing data augmentation.
We find that ViTs neither improve nor degrade when trained using Patch Mixing.
We conclude that this training method is a way of simulating in CNNs the abilities that ViTs already possess.
- Score: 64.7892681641764
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision transformers (ViTs) have significantly changed the computer vision
landscape and have periodically exhibited superior performance in vision tasks
compared to convolutional neural networks (CNNs). Although the jury is still
out on which model type is superior, each has unique inductive biases that
shape their learning and generalization performance. For example, ViTs have
interesting properties with respect to early layer non-local feature
dependence, as well as self-attention mechanisms which enhance learning
flexibility, enabling them to ignore out-of-context image information more
effectively. We hypothesize that this power to ignore out-of-context
information (which we name $\textit{patch selectivity}$), while integrating
in-context information in a non-local manner in early layers, allows ViTs to
more easily handle occlusion. In this study, our aim is to see whether we can
have CNNs $\textit{simulate}$ this ability of patch selectivity by effectively
hardwiring this inductive bias using Patch Mixing data augmentation, which
consists of inserting patches from another image onto a training image and
interpolating labels between the two image classes. Specifically, we use Patch
Mixing to train state-of-the-art ViTs and CNNs, assessing its impact on their
ability to ignore out-of-context patches and handle natural occlusions. We find
that ViTs neither improve nor degrade when trained using Patch Mixing, but CNNs
acquire new capabilities to ignore out-of-context information and improve on
occlusion benchmarks, leaving us to conclude that this training method is a way
of simulating in CNNs the abilities that ViTs already possess. We will release
our Patch Mixing implementation and proposed datasets for public use. Project
page: https://arielnlee.github.io/PatchMixing/
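The augmentation described in the abstract can be sketched in a few lines. This is a minimal illustration, not the authors' released implementation (which is available on the project page): the function name, the uniform random patch-selection policy, and the choice of mixing the labels by the fraction of replaced patches are all assumptions made for clarity.

```python
import numpy as np

def patch_mixing(img_a, img_b, label_a, label_b,
                 patch_size=16, mix_ratio=0.5, rng=None):
    """Illustrative Patch Mixing sketch (not the official implementation).

    Replaces a random subset of img_a's non-overlapping patches with the
    corresponding patches from img_b, then interpolates the one-hot labels
    by the fraction of replaced patches. Images are (H, W, C) arrays with
    H and W divisible by patch_size.
    """
    rng = rng if rng is not None else np.random.default_rng()
    h, w, _ = img_a.shape
    grid_h, grid_w = h // patch_size, w // patch_size
    n_patches = grid_h * grid_w
    n_mix = int(round(mix_ratio * n_patches))

    mixed = img_a.copy()
    # Pick which patches to overwrite with out-of-context content.
    for idx in rng.choice(n_patches, size=n_mix, replace=False):
        row, col = divmod(idx, grid_w)
        ys, xs = row * patch_size, col * patch_size
        mixed[ys:ys + patch_size, xs:xs + patch_size] = \
            img_b[ys:ys + patch_size, xs:xs + patch_size]

    # Interpolate labels in proportion to the mixed-in area.
    lam = n_mix / n_patches
    mixed_label = (1.0 - lam) * label_a + lam * label_b
    return mixed, mixed_label
```

Under this sketch, a network trained on `mixed` with the soft target `mixed_label` is penalized for attending to the inserted out-of-context patches beyond their label share, which is the pressure toward patch selectivity that the paper aims to hardwire into CNNs.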
Related papers
- Vision Conformer: Incorporating Convolutions into Vision Transformer
Layers [6.09170287691728]
Vision Transformers (ViT) adapt transformers for image recognition tasks.
One issue with ViT is the lack of inductive bias toward image structures.
We propose the use of convolutional layers within ViT.
arXiv Detail & Related papers (2023-04-27T07:27:44Z)
- Vision Transformers provably learn spatial structure [34.61885883486938]
Vision Transformers (ViTs) have achieved performance comparable or superior to Convolutional Neural Networks (CNNs) in computer vision.
Yet, recent works have shown that while minimizing their training loss, ViTs specifically learn spatially localized patterns.
arXiv Detail & Related papers (2022-10-13T19:53:56Z)
- Patch-level Representation Learning for Self-supervised Vision Transformers [68.8862419248863]
Vision Transformers (ViTs) have gained much attention recently as a better architectural choice, often outperforming convolutional networks for various visual tasks.
Inspired by this, we design a simple yet effective visual pretext task, coined SelfPatch, for learning better patch-level representations.
We demonstrate that SelfPatch can significantly improve the performance of existing SSL methods for various visual tasks.
arXiv Detail & Related papers (2022-06-16T08:01:19Z)
- SERE: Exploring Feature Self-relation for Self-supervised Transformer [79.5769147071757]
Vision transformers (ViT) have strong representation ability with spatial self-attention and channel-level feedforward networks.
Recent works reveal that self-supervised learning helps unleash the great potential of ViT.
We observe that relational modeling on spatial and channel dimensions distinguishes ViT from other networks.
arXiv Detail & Related papers (2022-06-10T15:25:00Z)
- Self-slimmed Vision Transformer [52.67243496139175]
Vision transformers (ViTs) have become popular architectures and have outperformed convolutional neural networks (CNNs) on various vision tasks.
We propose a generic self-slimmed learning approach for vanilla ViTs, namely SiT.
Specifically, we first design a novel Token Slimming Module (TSM), which can boost the inference efficiency of ViTs.
arXiv Detail & Related papers (2021-11-24T16:48:57Z)
- Learning to ignore: rethinking attention in CNNs [87.01305532842878]
We propose to reformulate the attention mechanism in CNNs to learn to ignore instead of learning to attend.
Specifically, we propose to explicitly learn irrelevant information in the scene and suppress it in the produced representation.
arXiv Detail & Related papers (2021-11-10T13:47:37Z)
- Understanding and Improving Robustness of Vision Transformers through
Patch-based Negative Augmentation [29.08732248577141]
We investigate the robustness of vision transformers (ViTs) through the lens of their special patch-based architectural structure.
We find that ViTs are surprisingly insensitive to patch-based transformations, even when the transformation largely destroys the original semantics.
We show that patch-based negative augmentation consistently improves robustness of ViTs across a wide set of ImageNet based robustness benchmarks.
arXiv Detail & Related papers (2021-10-15T04:53:18Z)
- Intriguing Properties of Vision Transformers [114.28522466830374]
Vision transformers (ViT) have demonstrated impressive performance across various machine vision problems.
We systematically study this question via an extensive set of experiments and comparisons with a high-performing convolutional neural network (CNN).
We show that the effective features of ViTs are due to flexible and dynamic receptive fields made possible by the self-attention mechanism.
arXiv Detail & Related papers (2021-05-21T17:59:18Z)
- So-ViT: Mind Visual Tokens for Vision Transformer [27.243241133304785]
We propose a new classification paradigm, where the second-order, cross-covariance pooling of visual tokens is combined with the class token for final classification.
We develop a light-weight, hierarchical module based on off-the-shelf convolutions for visual token embedding.
The results show our models, when trained from scratch, outperform the competing ViT variants, while being on par with or better than state-of-the-art CNN models.
arXiv Detail & Related papers (2021-04-22T09:05:09Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences of its use.