Recurrent Vision Transformer for Solving Visual Reasoning Problems
- URL: http://arxiv.org/abs/2111.14576v1
- Date: Mon, 29 Nov 2021 15:01:09 GMT
- Title: Recurrent Vision Transformer for Solving Visual Reasoning Problems
- Authors: Nicola Messina, Giuseppe Amato, Fabio Carrara, Claudio Gennaro,
Fabrizio Falchi
- Abstract summary: We introduce the Recurrent Vision Transformer (RViT) model, inspired by the recent success of the Transformer network in computer vision.
Thanks to the impact of recurrent connections and spatial attention in reasoning tasks, this network achieves competitive results on the same-different visual reasoning problems.
A comprehensive ablation study confirms the importance of a hybrid CNN + Transformer architecture.
- Score: 13.658244210412352
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Although convolutional neural networks (CNNs) showed remarkable results in
many vision tasks, they are still strained by simple yet challenging visual
reasoning problems. Inspired by the recent success of the Transformer network
in computer vision, in this paper, we introduce the Recurrent Vision
Transformer (RViT) model. Thanks to the impact of recurrent connections and
spatial attention in reasoning tasks, this network achieves competitive results
on the same-different visual reasoning problems from the SVRT dataset. The
weight-sharing both in spatial and depth dimensions regularizes the model,
allowing it to learn using far fewer free parameters, using only 28k training
samples. A comprehensive ablation study confirms the importance of a hybrid CNN
+ Transformer architecture and the role of the feedback connections, which
iteratively refine the internal representation until a stable prediction is
obtained. In the end, this study can lay the basis for a deeper understanding
of the role of attention and recurrent connections for solving visual abstract
reasoning tasks.
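The abstract's core mechanism can be illustrated with a minimal sketch: a single attention block whose weights are shared across depth is applied recurrently to CNN-style feature tokens, iterating until the prediction stabilizes. This is an illustrative assumption-laden toy, not the paper's implementation; all shapes, weight matrices, and the stopping threshold are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
d = 16          # token dimension (hypothetical)
n_tokens = 49   # e.g. a 7x7 CNN feature map flattened into tokens

# Hypothetical shared weights: the SAME matrices are reused at every
# step (weight sharing in depth), which keeps the free-parameter count low.
Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
w_out = rng.standard_normal(d) / np.sqrt(d)

def recurrent_step(tokens):
    """One self-attention refinement step using the shared weights."""
    q, k, v = tokens @ Wq, tokens @ Wk, tokens @ Wv
    attn = softmax(q @ k.T / np.sqrt(d))      # spatial attention over tokens
    return tokens + attn @ v                  # residual update of the state

tokens = rng.standard_normal((n_tokens, d))   # stand-in for CNN features
prev_pred = None
for step in range(20):
    tokens = recurrent_step(tokens)
    # Binary same/different score read out from the pooled representation.
    pred = float(softmax(np.array([tokens.mean(0) @ w_out, 0.0]))[0])
    if prev_pred is not None and abs(pred - prev_pred) < 1e-4:
        break  # prediction has stabilized; stop iterating
    prev_pred = pred
```

The loop mirrors the abstract's description of feedback connections that "iteratively refine the internal representation until a stable prediction is obtained"; the convergence threshold here is arbitrary.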
Related papers
- Do Vision-Language Transformers Exhibit Visual Commonsense? An Empirical Study of VCR [51.72751335574947]
Visual Commonsense Reasoning (VCR) calls for explanatory reasoning behind question answering over visual scenes.
Progress on the benchmark dataset stems largely from the recent advancement of Vision-Language Transformers (VL Transformers).
This paper posits that the VL Transformers do not exhibit visual commonsense, which is the key to VCR.
arXiv Detail & Related papers (2024-05-27T08:26:58Z)
- Convolutional Initialization for Data-Efficient Vision Transformers [38.63299194992718]
Training vision transformer networks on small datasets poses challenges.
CNNs can achieve state-of-the-art performance by leveraging their architectural inductive bias.
Our approach is motivated by the finding that random impulse filters can achieve almost comparable performance to learned filters in CNNs.
arXiv Detail & Related papers (2024-01-23T06:03:16Z)
- Visual Prompting Upgrades Neural Network Sparsification: A Data-Model Perspective [64.04617968947697]
We introduce a novel data-model co-design perspective to promote superior weight sparsity.
Specifically, customized Visual Prompts are mounted to upgrade neural network sparsification in our proposed VPNs framework.
arXiv Detail & Related papers (2023-12-03T13:50:24Z)
- Point-aware Interaction and CNN-induced Refinement Network for RGB-D Salient Object Detection [95.84616822805664]
We introduce CNNs-assisted Transformer architecture and propose a novel RGB-D SOD network with Point-aware Interaction and CNN-induced Refinement.
To alleviate the block effect and detail destruction problems naturally brought by the Transformer, we design a CNN-induced refinement (CNNR) unit for content refinement and supplementation.
arXiv Detail & Related papers (2023-08-17T11:57:49Z)
- Data-driven emergence of convolutional structure in neural networks [83.4920717252233]
We show how fully-connected neural networks solving a discrimination task can learn a convolutional structure directly from their inputs.
By carefully designing data models, we show that the emergence of this pattern is triggered by the non-Gaussian, higher-order local structure of the inputs.
arXiv Detail & Related papers (2022-02-01T17:11:13Z)
- On the role of feedback in visual processing: a predictive coding perspective [0.6193838300896449]
We consider deep convolutional networks (CNNs) as models of feed-forward visual processing and implement Predictive Coding (PC) dynamics.
We find that the network increasingly relies on top-down predictions as the noise level increases.
In addition, the accuracy of the network implementing PC dynamics significantly increases over time-steps, compared to its equivalent forward network.
arXiv Detail & Related papers (2021-06-08T10:07:23Z)
- Transformers in Vision: A Survey [101.07348618962111]
Transformers enable modeling long dependencies between input sequence elements and support parallel processing of sequences.
Transformers require minimal inductive biases for their design and are naturally suited as set-functions.
This survey aims to provide a comprehensive overview of the Transformer models in the computer vision discipline.
arXiv Detail & Related papers (2021-01-04T18:57:24Z)
- Explain by Evidence: An Explainable Memory-based Neural Network for Question Answering [41.73026155036886]
This paper proposes an explainable, evidence-based memory network architecture.
It learns to summarize the dataset and extract supporting evidence to make its decision.
Our model achieves state-of-the-art performance on two popular question answering datasets.
arXiv Detail & Related papers (2020-11-05T21:18:21Z)
- A Principle of Least Action for the Training of Neural Networks [10.342408668490975]
We show the presence of a low kinetic energy displacement bias in the transport map of the network, and link this bias with generalization performance.
We propose a new learning algorithm, which automatically adapts to the complexity of the given task, and leads to networks with a high generalization ability even in low data regimes.
arXiv Detail & Related papers (2020-09-17T15:37:34Z)
- On Robustness and Transferability of Convolutional Neural Networks [147.71743081671508]
Modern deep convolutional networks (CNNs) are often criticized for not generalizing under distributional shifts.
We study the interplay between out-of-distribution and transfer performance of modern image classification CNNs for the first time.
We find that increasing both the training set and model sizes significantly improves distributional shift robustness.
arXiv Detail & Related papers (2020-07-16T18:39:04Z)