Joint rotational invariance and adversarial training of a dual-stream
Transformer yields state of the art Brain-Score for Area V4
- URL: http://arxiv.org/abs/2203.06649v1
- Date: Tue, 8 Mar 2022 23:08:35 GMT
- Title: Joint rotational invariance and adversarial training of a dual-stream
Transformer yields state of the art Brain-Score for Area V4
- Authors: William Berrios, Arturo Deza
- Abstract summary: We show how a dual-stream Transformer, a CrossViT $\textit{a la}$ Chen et al. (2021), under a joint rotationally-invariant and adversarial optimization procedure yields 2nd place in the aggregate Brain-Score 2022 competition averaged across all visual categories.
Our current Transformer-based model also achieves greater explainable variance for areas V4, IT and Behaviour than a biologically-inspired CNN (ResNet50) that integrates a frontal V1-like module.
- Score: 3.3504365823045044
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Modern high-scoring models of vision in the Brain-Score competition do not
stem from Vision Transformers. However, in this short paper, we provide
evidence against the unexpected trend of Vision Transformers (ViT) not being
perceptually aligned with human visual representations by showing how a
dual-stream Transformer, a CrossViT$~\textit{a la}$ Chen et al. (2021), under a
joint rotationally-invariant and adversarial optimization procedure yields 2nd
place in the aggregate Brain-Score 2022 competition averaged across all visual
categories, and currently (March 1st, 2022) holds the 1st place for the highest
explainable variance of area V4. In addition, our current Transformer-based
model also achieves greater explainable variance for areas V4, IT and Behaviour
than a biologically-inspired CNN (ResNet50) that integrates a frontal V1-like
computation module (Dapello et al., 2020). Our team was also the only entry in
the top-5 that shows a positive rank correlation between explained variance per
area and depth in the visual hierarchy.
Against our initial expectations, these results provide tentative support for
an $\textit{"All roads lead to Rome"}$ argument enforced via a joint
optimization rule even for non-biologically-motivated models of vision such as
Vision Transformers.
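The abstract does not spell out the training objective, but the two ingredients it names, rotational invariance and adversarial training, can be sketched as a single joint loss. Below is a minimal, hedged PyTorch sketch: the rotation set, the single-step FGSM attack, the budget `eps`, and the weight `lam` are illustrative assumptions rather than the authors' recipe, and `model` stands for any ViT-style classifier such as a CrossViT backbone.

```python
# Hedged sketch of a joint rotation-invariant + adversarial objective.
# Hyperparameters here are illustrative, not the paper's settings.
import torch
import torch.nn.functional as F
import torchvision.transforms.functional as TF


def joint_loss(model, images, labels, angles=(90, 180, 270), eps=4 / 255, lam=1.0):
    # Standard classification loss on the clean images.
    loss = F.cross_entropy(model(images), labels)

    # Rotational-invariance term: rotated copies of an image should be
    # classified the same way (enforced here by reusing the same labels).
    for angle in angles:
        loss = loss + F.cross_entropy(model(TF.rotate(images, float(angle))), labels)

    # Adversarial term: single-step FGSM for brevity; the paper's attack,
    # step count, and budget may differ. Assumes inputs scaled to [0, 1].
    adv = images.clone().detach().requires_grad_(True)
    grad, = torch.autograd.grad(F.cross_entropy(model(adv), labels), adv)
    adv_images = (adv + eps * grad.sign()).clamp(0.0, 1.0).detach()
    loss = loss + lam * F.cross_entropy(model(adv_images), labels)

    return loss
```

In a training loop this loss would simply replace the usual cross-entropy before the backward pass; a multi-step PGD attack could be swapped in for the FGSM step without changing the overall structure.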
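The rank-correlation claim can be made concrete as a Spearman correlation between each area's depth in the ventral-stream hierarchy and the model's explained variance for that area. The sketch below uses placeholder scores, not the paper's reported Brain-Score numbers.

```python
# Hedged sketch: does explained variance rise with depth in the hierarchy?
# Variance values are placeholders, not the paper's Brain-Score results.
from scipy.stats import spearmanr

depth = {"V1": 1, "V2": 2, "V4": 3, "IT": 4}
explained_variance = {"V1": 0.30, "V2": 0.33, "V4": 0.40, "IT": 0.45}

areas = list(depth)
rho, p = spearmanr([depth[a] for a in areas],
                   [explained_variance[a] for a in areas])
print(f"Spearman rho = {rho:.2f}")  # positive => scores rise with depth
```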
Related papers
- HIRI-ViT: Scaling Vision Transformer with High Resolution Inputs [102.4965532024391]
Hybrid deep models of Vision Transformer (ViT) and Convolutional Neural Network (CNN) have emerged as a powerful class of backbones for vision tasks.
We present a new hybrid backbone with HIgh-Resolution Inputs (namely HIRI-ViT) that upgrades the prevalent four-stage ViT to a five-stage ViT tailored for high-resolution inputs.
HIRI-ViT achieves the best published Top-1 accuracy to date of 84.3% on ImageNet with 448$\times$448 inputs, an absolute improvement of 0.9% over the 83.4% of iFormer-S with 224$\times$224 inputs.
arXiv Detail & Related papers (2024-03-18T17:34:29Z) - ACC-ViT : Atrous Convolution's Comeback in Vision Transformers [5.224344210588584]
We introduce Atrous Attention, a fusion of regional and sparse attention, which can adaptively consolidate both local and global information.
We also propose a general vision transformer backbone, named ACC-ViT, following conventional practices for standard vision tasks.
ACC-ViT is therefore a strong vision backbone, which is also competitive in mobile-scale versions, ideal for niche applications with small datasets.
arXiv Detail & Related papers (2024-03-07T04:05:16Z) - Interpret Vision Transformers as ConvNets with Dynamic Convolutions [70.59235381143831]
We interpret vision Transformers as ConvNets with dynamic convolutions, which enables us to characterize existing Transformers and dynamic ConvNets in a unified framework.
Our interpretation can also guide the network design as researchers now can consider vision Transformers from the design space of ConvNets.
arXiv Detail & Related papers (2023-09-19T16:00:49Z) - ACC-UNet: A Completely Convolutional UNet model for the 2020s [2.7013801448234367]
ACC-UNet is a completely convolutional UNet model that brings the best of both worlds: the inherent inductive biases of convnets combined with the design decisions of transformers.
ACC-UNet was evaluated on 5 different medical image segmentation benchmarks and consistently outperformed convnets, transformers, and their hybrids.
arXiv Detail & Related papers (2023-08-25T21:39:43Z) - Reviving Shift Equivariance in Vision Transformers [12.720600348466498]
We propose an adaptive polyphase anchoring algorithm that can be seamlessly integrated into vision transformer models.
Our algorithms enable ViT and its variants, such as Twins, to achieve 100% consistency with respect to input shift.
arXiv Detail & Related papers (2023-06-13T00:13:11Z) - TransVG++: End-to-End Visual Grounding with Language Conditioned Vision
Transformer [188.00681648113223]
We explore neat yet effective Transformer-based frameworks for visual grounding.
TransVG establishes multi-modal correspondences by Transformers and localizes referred regions by directly regressing box coordinates.
We upgrade our framework to a purely Transformer-based one by leveraging Vision Transformer (ViT) for vision feature encoding.
arXiv Detail & Related papers (2022-06-14T06:27:38Z) - An Empirical Study of Training End-to-End Vision-and-Language
Transformers [50.23532518166621]
We present METER (Multimodal End-to-end TransformER), through which we investigate how to design and pre-train a fully transformer-based VL model.
Specifically, we dissect the model designs along multiple dimensions: vision encoders (e.g., CLIP-ViT, Swin transformer), text encoders (e.g., RoBERTa, DeBERTa), and multimodal fusion (e.g., merged attention vs. co-attention).
arXiv Detail & Related papers (2021-11-03T17:55:36Z) - Focal Self-attention for Local-Global Interactions in Vision
Transformers [90.9169644436091]
We present focal self-attention, a new mechanism that incorporates both fine-grained local and coarse-grained global interactions.
With focal self-attention, we propose a new variant of Vision Transformer models, called Focal Transformer, which achieves superior performance over the state-of-the-art vision Transformers.
arXiv Detail & Related papers (2021-07-01T17:56:09Z) - Vision Transformers for Dense Prediction [77.34726150561087]
We introduce dense vision transformers, an architecture that leverages vision transformers in place of convolutional networks as a backbone for dense prediction tasks.
Our experiments show that this architecture yields substantial improvements on dense prediction tasks.
arXiv Detail & Related papers (2021-03-24T18:01:17Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this information and is not responsible for any consequences of its use.