Recurrent Vision Transformer for Solving Visual Reasoning Problems
- URL: http://arxiv.org/abs/2111.14576v1
- Date: Mon, 29 Nov 2021 15:01:09 GMT
- Title: Recurrent Vision Transformer for Solving Visual Reasoning Problems
- Authors: Nicola Messina, Giuseppe Amato, Fabio Carrara, Claudio Gennaro,
Fabrizio Falchi
- Abstract summary: We introduce the Recurrent Vision Transformer (RViT) model, inspired by the recent success of the Transformer network in computer vision.
Thanks to the impact of recurrent connections and spatial attention in reasoning tasks, this network achieves competitive results on the same-different visual reasoning problems.
A comprehensive ablation study confirms the importance of a hybrid CNN + Transformer architecture.
- Score: 13.658244210412352
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Although convolutional neural networks (CNNs) showed remarkable results in
many vision tasks, they are still strained by simple yet challenging visual
reasoning problems. Inspired by the recent success of the Transformer network
in computer vision, in this paper, we introduce the Recurrent Vision
Transformer (RViT) model. Thanks to the impact of recurrent connections and
spatial attention in reasoning tasks, this network achieves competitive results
on the same-different visual reasoning problems from the SVRT dataset. The
weight-sharing both in spatial and depth dimensions regularizes the model,
allowing it to learn using far fewer free parameters, using only 28k training
samples. A comprehensive ablation study confirms the importance of a hybrid CNN
+ Transformer architecture and the role of the feedback connections, which
iteratively refine the internal representation until a stable prediction is
obtained. In the end, this study can lay the basis for a deeper understanding
of the role of attention and recurrent connections for solving visual abstract
reasoning tasks.
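The abstract's core mechanism can be illustrated with a minimal sketch: a single attention block whose weights are shared across depth is applied recurrently to CNN-style feature tokens, iterating until the prediction stabilizes. This is an illustrative assumption-laden toy, not the paper's implementation; all shapes, weight matrices, and the stopping threshold are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
d = 16          # token dimension (hypothetical)
n_tokens = 49   # e.g. a 7x7 CNN feature map flattened into tokens

# Hypothetical shared weights: the SAME matrices are reused at every
# step (weight sharing in depth), which keeps the free-parameter count low.
Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
w_out = rng.standard_normal(d) / np.sqrt(d)

def recurrent_step(tokens):
    """One self-attention refinement step using the shared weights."""
    q, k, v = tokens @ Wq, tokens @ Wk, tokens @ Wv
    attn = softmax(q @ k.T / np.sqrt(d))      # spatial attention over tokens
    return tokens + attn @ v                  # residual update of the state

tokens = rng.standard_normal((n_tokens, d))   # stand-in for CNN features
prev_pred = None
for step in range(20):
    tokens = recurrent_step(tokens)
    # Binary same/different score read out from the pooled representation.
    pred = float(softmax(np.array([tokens.mean(0) @ w_out, 0.0]))[0])
    if prev_pred is not None and abs(pred - prev_pred) < 1e-4:
        break  # prediction has stabilized; stop iterating
    prev_pred = pred
```

The loop mirrors the abstract's description of feedback connections that "iteratively refine the internal representation until a stable prediction is obtained"; the convergence threshold here is arbitrary.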
Related papers
- Do Vision-Language Transformers Exhibit Visual Commonsense? An Empirical Study of VCR [51.72751335574947]
Visual Commonsense Reasoning (VCR) calls for explanatory reasoning behind question answering over visual scenes.
Progress on the benchmark dataset stems largely from the recent advancement of Vision-Language Transformers (VL Transformers).
This paper posits that the VL Transformers do not exhibit visual commonsense, which is the key to VCR.
arXiv Detail & Related papers (2024-05-27T08:26:58Z)
- Convolutional Initialization for Data-Efficient Vision Transformers [38.63299194992718]
Training vision transformer networks on small datasets poses challenges.
CNNs can achieve state-of-the-art performance by leveraging their architectural inductive bias.
Our approach is motivated by the finding that random impulse filters can achieve almost comparable performance to learned filters in CNNs.
arXiv Detail & Related papers (2024-01-23T06:03:16Z)
- Visual Prompting Upgrades Neural Network Sparsification: A Data-Model Perspective [64.04617968947697]
We introduce a novel data-model co-design perspective to promote superior weight sparsity.
Specifically, customized Visual Prompts are mounted to upgrade neural network sparsification in our proposed VPNs framework.
arXiv Detail & Related papers (2023-12-03T13:50:24Z)
- Point-aware Interaction and CNN-induced Refinement Network for RGB-D Salient Object Detection [95.84616822805664]
We introduce CNNs-assisted Transformer architecture and propose a novel RGB-D SOD network with Point-aware Interaction and CNN-induced Refinement.
To alleviate the block effect and detail destruction problems naturally brought by the Transformer, we design a CNN-induced refinement (CNNR) unit for content refinement and supplementation.
arXiv Detail & Related papers (2023-08-17T11:57:49Z)
- Data-driven emergence of convolutional structure in neural networks [83.4920717252233]
We show how fully-connected neural networks solving a discrimination task can learn a convolutional structure directly from their inputs.
By carefully designing data models, we show that the emergence of this pattern is triggered by the non-Gaussian, higher-order local structure of the inputs.
arXiv Detail & Related papers (2022-02-01T17:11:13Z)
- On the role of feedback in visual processing: a predictive coding perspective [0.6193838300896449]
We consider deep convolutional networks (CNNs) as models of feed-forward visual processing and implement Predictive Coding (PC) dynamics.
We find that the network increasingly relies on top-down predictions as the noise level increases.
In addition, the accuracy of the network implementing PC dynamics significantly increases over time-steps, compared to its equivalent forward network.
arXiv Detail & Related papers (2021-06-08T10:07:23Z)
- Transformers in Vision: A Survey [101.07348618962111]
Transformers enable modeling long dependencies between input sequence elements and support parallel processing of sequences.
Transformers require minimal inductive biases for their design and are naturally suited as set-functions.
This survey aims to provide a comprehensive overview of the Transformer models in the computer vision discipline.
arXiv Detail & Related papers (2021-01-04T18:57:24Z)
- Explain by Evidence: An Explainable Memory-based Neural Network for Question Answering [41.73026155036886]
This paper proposes an explainable, evidence-based memory network architecture.
It learns to summarize the dataset and extract supporting evidence to make its decision.
Our model achieves state-of-the-art performance on two popular question answering datasets.
arXiv Detail & Related papers (2020-11-05T21:18:21Z)
- A Principle of Least Action for the Training of Neural Networks [10.342408668490975]
We show the presence of a low kinetic energy displacement bias in the transport map of the network, and link this bias with generalization performance.
We propose a new learning algorithm, which automatically adapts to the complexity of the given task, and leads to networks with a high generalization ability even in low data regimes.
arXiv Detail & Related papers (2020-09-17T15:37:34Z)
- On Robustness and Transferability of Convolutional Neural Networks [147.71743081671508]
Modern deep convolutional networks (CNNs) are often criticized for not generalizing under distributional shifts.
We study the interplay between out-of-distribution and transfer performance of modern image classification CNNs for the first time.
We find that increasing both the training set and model sizes significantly improves distributional shift robustness.
arXiv Detail & Related papers (2020-07-16T18:39:04Z)