Understanding the computational demands underlying visual reasoning
- URL: http://arxiv.org/abs/2108.03603v1
- Date: Sun, 8 Aug 2021 10:46:53 GMT
- Title: Understanding the computational demands underlying visual reasoning
- Authors: Mohit Vaishnav, Remi Cadene, Andrea Alamia, Drew Linsley, Rufin
VanRullen and Thomas Serre
- Abstract summary: We systematically assess the ability of modern deep convolutional neural networks to learn to solve visual reasoning problems.
Our analysis leads to a novel taxonomy of visual reasoning tasks, which can be primarily explained by the type of relations and the number of relations used to compose the underlying rules.
- Score: 10.308647202215708
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Visual understanding requires comprehending complex visual relations between
objects within a scene. Here, we seek to characterize the computational demands
for abstract visual reasoning. We do this by systematically assessing the
ability of modern deep convolutional neural networks (CNNs) to learn to solve
the Synthetic Visual Reasoning Test (SVRT) challenge, a collection of
twenty-three visual reasoning problems. Our analysis leads to a novel taxonomy
of visual reasoning tasks, which can be primarily explained by both the type of
relations (same-different vs. spatial-relation judgments) and the number of
relations used to compose the underlying rules. Prior cognitive neuroscience
work suggests that attention plays a key role in humans' visual reasoning
ability. To test this, we extended the CNNs with spatial and feature-based
attention mechanisms. In a second series of experiments, we evaluated the
ability of these attention networks to learn to solve the SVRT challenge and
found the resulting architectures to be much more efficient at solving the
hardest of these visual reasoning tasks. Most importantly, the corresponding
improvements on individual tasks partially explained the taxonomy. Overall,
this work advances our understanding of visual reasoning and yields testable
neuroscience predictions regarding the need for feature-based vs. spatial
attention in visual reasoning.
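A minimal sketch (PyTorch) of what such an attention-augmented CNN block can look like: feature-based (channel) attention re-weights feature maps globally, while spatial attention re-weights locations. This is an illustrative, CBAM-style module with assumed layer sizes, not the authors' exact architecture.
```python
import torch
import torch.nn as nn

class FeatureAttention(nn.Module):
    """Channel-wise (feature-based) attention via global pooling + MLP."""
    def __init__(self, channels, reduction=8):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, x):                       # x: (B, C, H, W)
        w = self.mlp(x.mean(dim=(2, 3)))        # (B, C) channel weights
        return x * w[:, :, None, None]

class SpatialAttention(nn.Module):
    """Spatial attention map computed from pooled channel statistics."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, x):
        stats = torch.cat([x.mean(1, keepdim=True),
                           x.amax(1, keepdim=True)], dim=1)  # (B, 2, H, W)
        return x * torch.sigmoid(self.conv(stats))

class AttentiveBlock(nn.Module):
    """Conv block followed by feature-based and spatial attention."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv = nn.Sequential(nn.Conv2d(in_ch, out_ch, 3, padding=1),
                                  nn.BatchNorm2d(out_ch), nn.ReLU())
        self.feat_att = FeatureAttention(out_ch)
        self.spat_att = SpatialAttention()

    def forward(self, x):
        return self.spat_att(self.feat_att(self.conv(x)))

if __name__ == "__main__":
    block = AttentiveBlock(3, 32)
    print(block(torch.randn(2, 3, 128, 128)).shape)  # torch.Size([2, 32, 128, 128])
```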
Related papers
- Evaluating and Analyzing Relationship Hallucinations in Large Vision-Language Models [69.79709804046325]
We introduce R-Bench, a novel benchmark for evaluating Vision Relationship Hallucination.
R-Bench features image-level questions that focus on the existence of relationships and instance-level questions that assess local visual comprehension.
We identify three types of relationship co-occurrences that lead to hallucinations: relationship-relationship, subject-relationship, and relationship-object.
arXiv Detail & Related papers (2024-06-24T08:42:42Z)
- Learning Differentiable Logic Programs for Abstract Visual Reasoning [18.82429807065658]
Differentiable forward reasoning has been developed to integrate reasoning with gradient-based machine learning paradigms.
NEUMANN is a graph-based differentiable forward reasoner, passing messages in a memory-efficient manner and handling structured programs with functors.
We demonstrate that NEUMANN solves visual reasoning tasks efficiently, outperforming neural, symbolic, and neuro-symbolic baselines.
arXiv Detail & Related papers (2023-07-03T11:02:40Z)
- The role of object-centric representations, guided attention, and external memory on generalizing visual relations [0.6091702876917281]
We evaluate a series of deep neural networks (DNNs) that integrate mechanisms such as slot attention, recurrently guided attention, and external memory.
We find that, although some models performed better than others in generalizing the same-different relation to specific types of images, no model was able to generalize this relation across the board.
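As a toy illustration of the same-different relation these models are asked to generalize, the sketch below generates images containing two random binary patches that are either identical or not; the patch shapes, sizes, and placement are illustrative assumptions, not the stimuli used in the paper.
```python
import numpy as np

rng = np.random.default_rng(0)

def random_patch(size=8):
    """A random binary 'shape' patch (toy stand-in for a real stimulus)."""
    return (rng.random((size, size)) > 0.5).astype(np.float32)

def make_example(same, canvas=64, size=8):
    """Place two patches on a blank canvas; label is 1 if they are identical."""
    img = np.zeros((canvas, canvas), dtype=np.float32)
    a = random_patch(size)
    b = a.copy() if same else random_patch(size)
    for patch in (a, b):
        y, x = rng.integers(0, canvas - size, size=2)
        img[y:y + size, x:x + size] = patch  # placements may overlap in this toy version
    return img, int(same)

# Build a small balanced batch: half "same", half "different" examples.
batch = [make_example(same=(i % 2 == 0)) for i in range(32)]
print(sum(label for _, label in batch), "positive examples out of", len(batch))
```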
arXiv Detail & Related papers (2023-04-14T12:22:52Z)
- BI AVAN: Brain inspired Adversarial Visual Attention Network [67.05560966998559]
We propose a brain-inspired adversarial visual attention network (BI-AVAN) to characterize human visual attention directly from functional brain activity.
Our model imitates the biased competition process between attention-related and neglected objects to identify and locate, in an unsupervised manner, the visual objects in a movie frame that the human brain focuses on.
arXiv Detail & Related papers (2022-10-27T22:20:36Z)
- EgoTaskQA: Understanding Human Tasks in Egocentric Videos [89.9573084127155]
The EgoTaskQA benchmark provides a home for crucial dimensions of task understanding through question-answering on real-world egocentric videos.
We meticulously design questions that target the understanding of (1) action dependencies and effects, (2) intents and goals, and (3) agents' beliefs about others.
We evaluate state-of-the-art video reasoning models on our benchmark and show significant gaps between them and humans in understanding complex goal-oriented egocentric videos.
arXiv Detail & Related papers (2022-10-08T05:49:05Z)
- GAMR: A Guided Attention Model for (visual) Reasoning [7.919213739992465]
Humans continue to outperform modern AI systems in their ability to flexibly parse and understand complex visual scenes.
We present a novel module for visual reasoning, the Guided Attention Model for (visual) Reasoning (GAMR).
GAMR posits that the brain solves complex visual reasoning problems dynamically via sequences of attention shifts to select and route task-relevant visual information into memory.
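A minimal sketch (PyTorch) of that idea: a recurrent controller issues a sequence of attention queries over image features and routes the attended vectors into a memory that a classifier reads out. The GRU controller, dot-product attention, and sizes are illustrative assumptions, not the published GAMR implementation.
```python
import torch
import torch.nn as nn

class SequentialAttentionReasoner(nn.Module):
    def __init__(self, feat_dim=64, steps=4, n_classes=2):
        super().__init__()
        self.steps = steps
        self.controller = nn.GRUCell(feat_dim, feat_dim)
        self.query = nn.Linear(feat_dim, feat_dim)
        self.classifier = nn.Linear(feat_dim * steps, n_classes)

    def forward(self, feats):                  # feats: (B, N, D) image feature tokens
        B, N, D = feats.shape
        h = feats.new_zeros(B, D)              # controller state
        memory = []
        for _ in range(self.steps):
            q = self.query(h)                                               # (B, D) query
            att = torch.softmax((feats @ q.unsqueeze(-1)).squeeze(-1) / D ** 0.5, dim=1)
            read = (att.unsqueeze(-1) * feats).sum(dim=1)                   # attended vector
            memory.append(read)                                             # route into memory
            h = self.controller(read, h)                                    # drives next attention shift
        return self.classifier(torch.cat(memory, dim=-1))

# e.g. an 8x8 grid of 64-dim CNN features flattened into (2, 64, 64) tokens
logits = SequentialAttentionReasoner()(torch.randn(2, 64, 64))
print(logits.shape)  # torch.Size([2, 2])
```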
arXiv Detail & Related papers (2022-06-10T07:52:06Z)
- Understanding top-down attention using task-oriented ablation design [0.22940141855172028]
Top-down attention allows neural networks, both artificial and biological, to focus on the information most relevant for a given task.
We aim to understand how it does so with a computational experiment based on a general framework called task-oriented ablation design.
We compare the performance of two neural networks, one with top-down attention and one without.
arXiv Detail & Related papers (2021-06-08T21:01:47Z)
- Compositional Explanations of Neurons [52.71742655312625]
We describe a procedure for explaining neurons in deep representations by identifying compositional logical concepts.
We use this procedure to answer several questions on interpretability in models for vision and natural language processing.
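A minimal sketch (NumPy) of the scoring step behind such a procedure: a neuron's binarized activation mask is compared, via intersection-over-union, against logical compositions (AND / OR / NOT) of concept masks. The toy version below only scores single concepts and pairwise compositions rather than searching over longer formulas, and the masks and concept names are made up for illustration.
```python
import numpy as np

def iou(a, b):
    """Intersection-over-union of two boolean masks."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return inter / union if union else 0.0

def explain_neuron(neuron_mask, concept_masks):
    """Return the single concept or pairwise composition with the highest IoU."""
    candidates = dict(concept_masks)
    names = list(concept_masks)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            candidates[f"({a} AND {b})"] = concept_masks[a] & concept_masks[b]
            candidates[f"({a} OR {b})"] = concept_masks[a] | concept_masks[b]
            candidates[f"({a} AND NOT {b})"] = concept_masks[a] & ~concept_masks[b]
    return max(candidates.items(), key=lambda kv: iou(neuron_mask, kv[1]))

# Toy example: boolean concept masks over a 32x32 "image".
rng = np.random.default_rng(0)
concepts = {c: rng.random((32, 32)) > 0.5 for c in ["water", "blue", "boat"]}
neuron = concepts["water"] & ~concepts["boat"]        # pretend binarized activation mask
best_name, best_mask = explain_neuron(neuron, concepts)
print(best_name, round(iou(neuron, best_mask), 3))    # e.g. "(water AND NOT boat)" 1.0
```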
arXiv Detail & Related papers (2020-06-24T20:37:05Z)
- Neuro-Symbolic Visual Reasoning: Disentangling "Visual" from "Reasoning" [49.76230210108583]
We propose a framework to isolate and evaluate the reasoning aspect of visual question answering (VQA) separately from its perception.
We also propose a novel top-down calibration technique that allows the model to answer reasoning questions even with imperfect perception.
On the challenging GQA dataset, this framework is used to perform in-depth, disentangled comparisons between well-known VQA models.
arXiv Detail & Related papers (2020-06-20T08:48:29Z)
- Machine Number Sense: A Dataset of Visual Arithmetic Problems for Abstract and Relational Reasoning [95.18337034090648]
We propose a dataset, Machine Number Sense (MNS), consisting of visual arithmetic problems automatically generated using a grammar model, the And-Or Graph (AOG).
These visual arithmetic problems are in the form of geometric figures.
We benchmark the MNS dataset using four predominant neural network models as baselines in this visual reasoning task.
arXiv Detail & Related papers (2020-04-25T17:14:58Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.