Related papers: Towards A Unified Neural Architecture for Visual Recognition and Reasoning

Towards A Unified Neural Architecture for Visual Recognition and Reasoning

URL: http://arxiv.org/abs/2311.06386v1
Date: Fri, 10 Nov 2023 20:27:43 GMT
Title: Towards A Unified Neural Architecture for Visual Recognition and Reasoning
Authors: Calvin Luo, Boqing Gong, Ting Chen, Chen Sun
Abstract summary: We propose a unified neural architecture for visual recognition and reasoning with a generic interface (e.g., tokens) for both. Our framework enables the investigation of how different visual recognition tasks, datasets, and inductive biases can help enable principledtemporal reasoning capabilities.
Score: 40.938279131241764
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Recognition and reasoning are two pillars of visual understanding. However, these tasks have an imbalance in focus; whereas recent advances in neural networks have shown strong empirical performance in visual recognition, there has been comparably much less success in solving visual reasoning. Intuitively, unifying these two tasks under a singular framework is desirable, as they are mutually dependent and beneficial. Motivated by the recent success of multi-task transformers for visual recognition and language understanding, we propose a unified neural architecture for visual recognition and reasoning with a generic interface (e.g., tokens) for both. Our framework enables the principled investigation of how different visual recognition tasks, datasets, and inductive biases can help enable spatiotemporal reasoning capabilities. Noticeably, we find that object detection, which requires spatial localization of individual objects, is the most beneficial recognition task for reasoning. We further demonstrate via probing that implicit object-centric representations emerge automatically inside our framework. Intriguingly, we discover that certain architectural choices such as the backbone model of the visual encoder have a significant impact on visual reasoning, but little on object detection. Given the results of our experiments, we believe that visual reasoning should be considered as a first-class citizen alongside visual recognition, as they are strongly correlated but benefit from potentially different design choices.

Related papers

A Cognitive Paradigm Approach to Probe the Perception-Reasoning Interface in VLMs [3.2228025627337864]
This paper introduces a structured evaluation framework using Bongard Problems (BPs) to dissect the perception-reasoning interface in Vision-Language Models (VLMs) We propose three distinct evaluation paradigms, mirroring human problem-solving strategies. Our framework provides a valuable diagnostic tool, highlighting the need to enhance visual processing fidelity for achieving more robust and human-like visual intelligence in AI.
arXiv Detail & Related papers (2025-01-23T12:42:42Z)
PhD Thesis: Exploring the role of (self-)attention in cognitive and computer vision architecture [0.0]
We analyze Transformer-based self-attention as a model and extend it with memory. We propose GAMR, a cognitive architecture combining attention and memory, inspired by active vision theory.
arXiv Detail & Related papers (2023-06-26T12:40:12Z)
Towards Unsupervised Visual Reasoning: Do Off-The-Shelf Features Know How to Reason? [30.16956370267339]
We introduce a protocol to evaluate visual representations for the task of Visual Question Answering. In order to decouple visual feature extraction from reasoning, we design a specific attention-based reasoning module. We compare two types of visual representations, densely extracted local features and object-centric ones, against the performances of a perfect image representation using ground truth.
arXiv Detail & Related papers (2022-12-20T14:36:45Z)
Visual Superordinate Abstraction for Robust Concept Learning [80.15940996821541]
Concept learning constructs visual representations that are connected to linguistic semantics. We ascribe the bottleneck to a failure of exploring the intrinsic semantic hierarchy of visual concepts. We propose a visual superordinate abstraction framework for explicitly modeling semantic-aware visual subspaces.
arXiv Detail & Related papers (2022-05-28T14:27:38Z)
Bongard-HOI: Benchmarking Few-Shot Visual Reasoning for Human-Object Interactions [138.49522643425334]
Bongard-HOI is a new visual reasoning benchmark that focuses on compositional learning of human-object interactions from natural images. It is inspired by two desirable characteristics from the classical Bongard problems (BPs): 1) few-shot concept learning, and 2) context-dependent reasoning. Bongard-HOI presents a substantial challenge to today's visual recognition models.
arXiv Detail & Related papers (2022-05-27T07:36:29Z)
Building a visual semantics aware object hierarchy [0.0]
We propose a novel unsupervised method to build visual semantics aware object hierarchy. Our intuition in this paper comes from real-world knowledge representation where concepts are hierarchically organized. The evaluation consists of two parts, firstly we apply the constructed hierarchy on the object recognition task and then we compare our visual hierarchy and existing lexical hierarchies to show the validity of our method.
arXiv Detail & Related papers (2022-02-26T00:10:21Z)
Capturing the objects of vision with neural networks [0.0]
Human visual perception carves a scene at its physical joints, decomposing the world into objects. Deep neural network (DNN) models of visual object recognition, by contrast, remain largely tethered to the sensory input. We review related work in both fields and examine how these fields can help each other.
arXiv Detail & Related papers (2021-09-07T21:49:53Z)
Understanding top-down attention using task-oriented ablation design [0.22940141855172028]
Top-down attention allows neural networks, both artificial and biological, to focus on the information most relevant for a given task. We aim to answer this with a computational experiment based on a general framework called task-oriented ablation design. We compare the performance of two neural networks, one with top-down attention and one without.
arXiv Detail & Related papers (2021-06-08T21:01:47Z)
Affect Analysis in-the-wild: Valence-Arousal, Expressions, Action Units and a Unified Framework [83.21732533130846]
The paper focuses on large in-the-wild databases, i.e., Aff-Wild and Aff-Wild2. It presents the design of two classes of deep neural networks trained with these databases. A novel multi-task and holistic framework is presented which is able to jointly learn and effectively generalize and perform affect recognition.
arXiv Detail & Related papers (2021-03-29T17:36:20Z)
Visual Relationship Detection with Visual-Linguistic Knowledge from Multimodal Representations [103.00383924074585]
Visual relationship detection aims to reason over relationships among salient objects in images. We propose a novel approach named Visual-Linguistic Representations from Transformers (RVL-BERT) RVL-BERT performs spatial reasoning with both visual and language commonsense knowledge learned via self-supervised pre-training.
arXiv Detail & Related papers (2020-09-10T16:15:09Z)
Self-supervised Learning from a Multi-view Perspective [121.63655399591681]
We show that self-supervised representations can extract task-relevant information and discard task-irrelevant information. Our theoretical framework paves the way to a larger space of self-supervised learning objective design.
arXiv Detail & Related papers (2020-06-10T00:21:35Z)

This list is automatically generated from the titles and abstracts of the papers in this site.