Towards A Unified Neural Architecture for Visual Recognition and
Reasoning
- URL: http://arxiv.org/abs/2311.06386v1
- Date: Fri, 10 Nov 2023 20:27:43 GMT
- Title: Towards A Unified Neural Architecture for Visual Recognition and
Reasoning
- Authors: Calvin Luo, Boqing Gong, Ting Chen, Chen Sun
- Abstract summary: We propose a unified neural architecture for visual recognition and reasoning with a generic interface (e.g., tokens) for both.
Our framework enables the principled investigation of how different visual recognition tasks, datasets, and inductive biases can help enable spatiotemporal reasoning capabilities.
- Score: 40.938279131241764
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recognition and reasoning are two pillars of visual understanding. However,
these tasks have an imbalance in focus; whereas recent advances in neural
networks have shown strong empirical performance in visual recognition, there
has been comparably much less success in solving visual reasoning. Intuitively,
unifying these two tasks under a singular framework is desirable, as they are
mutually dependent and beneficial. Motivated by the recent success of
multi-task transformers for visual recognition and language understanding, we
propose a unified neural architecture for visual recognition and reasoning with
a generic interface (e.g., tokens) for both. Our framework enables the
principled investigation of how different visual recognition tasks, datasets,
and inductive biases can help enable spatiotemporal reasoning capabilities.
Noticeably, we find that object detection, which requires spatial localization
of individual objects, is the most beneficial recognition task for reasoning.
We further demonstrate via probing that implicit object-centric representations
emerge automatically inside our framework. Intriguingly, we discover that
certain architectural choices such as the backbone model of the visual encoder
have a significant impact on visual reasoning, but little on object detection.
Given the results of our experiments, we believe that visual reasoning should
be considered as a first-class citizen alongside visual recognition, as they
are strongly correlated but benefit from potentially different design choices.
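As a concrete illustration of the generic token interface described in the abstract, below is a minimal sketch (an assumption on our part, not the authors' released code): one transformer backbone produces visual tokens, learned query tokens for detection and for reasoning are decoded against them by a shared decoder, and lightweight task heads read the corresponding output tokens. All module names, dimensions, and the toy probe comment are hypothetical.

```python
# Minimal sketch (assumption, not the paper's code): one transformer backbone
# with a generic token interface serving both recognition and reasoning.
# Positional encodings are omitted for brevity.
import torch
import torch.nn as nn


class UnifiedRecognitionReasoner(nn.Module):
    def __init__(self, dim=256, patch=16, num_classes=80, vocab=1000, n_det=100, n_ans=1):
        super().__init__()
        # Visual encoder: patchify the image and run a transformer encoder.
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        enc_layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=6)

        # Generic token interface: learned queries for each task share one decoder.
        self.det_queries = nn.Parameter(torch.randn(n_det, dim))  # detection slots
        self.ans_queries = nn.Parameter(torch.randn(n_ans, dim))  # reasoning/answer slots
        dec_layer = nn.TransformerDecoderLayer(dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers=6)

        # Task heads read the corresponding output tokens.
        self.box_head = nn.Linear(dim, 4)            # (cx, cy, w, h) per detection token
        self.cls_head = nn.Linear(dim, num_classes)  # object class per detection token
        self.ans_head = nn.Linear(dim, vocab)        # answer distribution per answer token

    def forward(self, images):
        # images: (B, 3, H, W) -> visual tokens: (B, N, dim)
        x = self.patch_embed(images).flatten(2).transpose(1, 2)
        memory = self.encoder(x)

        # Concatenate task queries and decode them against the shared visual memory.
        B = images.shape[0]
        queries = torch.cat([self.det_queries, self.ans_queries], dim=0)
        queries = queries.unsqueeze(0).expand(B, -1, -1)
        tokens = self.decoder(queries, memory)

        n_det = self.det_queries.shape[0]
        det_tok, ans_tok = tokens[:, :n_det], tokens[:, n_det:]
        return {
            "boxes": self.box_head(det_tok).sigmoid(),
            "classes": self.cls_head(det_tok),
            "answer_logits": self.ans_head(ans_tok),
        }


# A linear probe trained on the decoder tokens (with the backbone frozen) is one
# simple way to test whether object-centric information emerges, as the abstract reports.
model = UnifiedRecognitionReasoner()
out = model(torch.randn(2, 3, 224, 224))
print(out["boxes"].shape, out["answer_logits"].shape)
```

Because both tasks read and write the same token interface, recognition supervision such as object detection can shape the representations that the reasoning tokens later attend to, which is consistent with the abstract's finding that detection is the most beneficial recognition task for reasoning.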
Related papers
- PhD Thesis: Exploring the role of (self-)attention in cognitive and computer vision architecture [0.0]
We analyze Transformer-based self-attention as a model and extend it with memory.
We propose GAMR, a cognitive architecture combining attention and memory, inspired by active vision theory.
arXiv Detail & Related papers (2023-06-26T12:40:12Z)
- Towards Unsupervised Visual Reasoning: Do Off-The-Shelf Features Know How to Reason? [30.16956370267339]
We introduce a protocol to evaluate visual representations for the task of Visual Question Answering.
In order to decouple visual feature extraction from reasoning, we design a specific attention-based reasoning module.
We compare two types of visual representations, densely extracted local features and object-centric ones, against the performance of a perfect image representation that uses ground truth.
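To make that decoupling concrete, here is a minimal, hypothetical sketch (not this paper's actual module): the visual features are pre-extracted and frozen, and only a small cross-attention reasoning module is trained for VQA, so dense local features and object-centric features can be compared simply by swapping the input. All names and sizes below are assumptions.

```python
# Hypothetical sketch of an attention-based reasoning module that operates on
# frozen, off-the-shelf visual features (all names and sizes are assumptions).
import torch
import torch.nn as nn


class AttentionReasoner(nn.Module):
    def __init__(self, vis_dim=2048, q_dim=300, dim=512, n_answers=3000, n_heads=8):
        super().__init__()
        self.vis_proj = nn.Linear(vis_dim, dim)   # project frozen visual features
        self.q_proj = nn.Linear(q_dim, dim)       # project the question embedding
        self.cross_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.classifier = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, n_answers))

    def forward(self, visual_feats, question_emb):
        # visual_feats: (B, N, vis_dim) dense patches OR object-centric region features.
        # question_emb: (B, q_dim) pooled question representation used as the query.
        kv = self.vis_proj(visual_feats)
        q = self.q_proj(question_emb).unsqueeze(1)   # (B, 1, dim)
        attended, _ = self.cross_attn(q, kv, kv)     # question attends to the image
        return self.classifier(attended.squeeze(1))  # answer logits


# Because the visual features are pre-extracted and frozen, swapping dense local
# features for object-centric ones changes only the `visual_feats` input.
reasoner = AttentionReasoner()
logits = reasoner(torch.randn(2, 36, 2048), torch.randn(2, 300))
print(logits.shape)  # torch.Size([2, 3000])
```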
arXiv Detail & Related papers (2022-12-20T14:36:45Z)
- Visual Superordinate Abstraction for Robust Concept Learning [80.15940996821541]
Concept learning constructs visual representations that are connected to linguistic semantics.
We ascribe the bottleneck to a failure to explore the intrinsic semantic hierarchy of visual concepts.
We propose a visual superordinate abstraction framework for explicitly modeling semantic-aware visual subspaces.
arXiv Detail & Related papers (2022-05-28T14:27:38Z)
- Bongard-HOI: Benchmarking Few-Shot Visual Reasoning for Human-Object Interactions [138.49522643425334]
Bongard-HOI is a new visual reasoning benchmark that focuses on compositional learning of human-object interactions from natural images.
It is inspired by two desirable characteristics from the classical Bongard problems (BPs): 1) few-shot concept learning, and 2) context-dependent reasoning.
Bongard-HOI presents a substantial challenge to today's visual recognition models.
arXiv Detail & Related papers (2022-05-27T07:36:29Z)
- Building a visual semantics aware object hierarchy [0.0]
We propose a novel unsupervised method to build a visual semantics aware object hierarchy.
Our intuition in this paper comes from real-world knowledge representation where concepts are hierarchically organized.
The evaluation consists of two parts: first we apply the constructed hierarchy to the object recognition task, and then we compare our visual hierarchy with existing lexical hierarchies to show the validity of our method.
arXiv Detail & Related papers (2022-02-26T00:10:21Z)
- Capturing the objects of vision with neural networks [0.0]
Human visual perception carves a scene at its physical joints, decomposing the world into objects.
Deep neural network (DNN) models of visual object recognition, by contrast, remain largely tethered to the sensory input.
We review related work in both fields and examine how these fields can help each other.
arXiv Detail & Related papers (2021-09-07T21:49:53Z)
- Understanding top-down attention using task-oriented ablation design [0.22940141855172028]
Top-down attention allows neural networks, both artificial and biological, to focus on the information most relevant for a given task.
We aim to answer this with a computational experiment based on a general framework called task-oriented ablation design.
We compare the performance of two neural networks, one with top-down attention and one without.
arXiv Detail & Related papers (2021-06-08T21:01:47Z)
- Affect Analysis in-the-wild: Valence-Arousal, Expressions, Action Units and a Unified Framework [83.21732533130846]
The paper focuses on large in-the-wild databases, i.e., Aff-Wild and Aff-Wild2.
It presents the design of two classes of deep neural networks trained with these databases.
A novel multi-task, holistic framework is presented that jointly learns, effectively generalizes, and performs affect recognition.
arXiv Detail & Related papers (2021-03-29T17:36:20Z)
- Visual Relationship Detection with Visual-Linguistic Knowledge from Multimodal Representations [103.00383924074585]
Visual relationship detection aims to reason over relationships among salient objects in images.
We propose a novel approach named Visual-Linguistic Representations from Transformers (RVL-BERT).
RVL-BERT performs spatial reasoning with both visual and language commonsense knowledge learned via self-supervised pre-training.
arXiv Detail & Related papers (2020-09-10T16:15:09Z)
- Self-supervised Learning from a Multi-view Perspective [121.63655399591681]
We show that self-supervised representations can extract task-relevant information and discard task-irrelevant information.
Our theoretical framework paves the way to a larger space of self-supervised learning objective design.
arXiv Detail & Related papers (2020-06-10T00:21:35Z)