Take A Step Back: Rethinking the Two Stages in Visual Reasoning
- URL: http://arxiv.org/abs/2407.19666v1
- Date: Mon, 29 Jul 2024 02:56:19 GMT
- Title: Take A Step Back: Rethinking the Two Stages in Visual Reasoning
- Authors: Mingyu Zhang, Jiting Cai, Mingyu Liu, Yue Xu, Cewu Lu, Yong-Lu Li
- Abstract summary: This paper revisits visual reasoning with a two-stage perspective.
It is more efficient to implement symbolization via separated encoders for different data domains while using a shared reasoner.
The proposed two-stage framework achieves impressive generalization ability on various visual reasoning tasks.
- Score: 57.16394309170051
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Visual reasoning, as a prominent research area, plays a crucial role in AI by facilitating concept formation and interaction with the world. However, current works are usually carried out separately on small datasets and thus lack generalization ability. Through rigorous evaluation of diverse benchmarks, we demonstrate the shortcomings of existing ad-hoc methods in achieving cross-domain reasoning and their tendency to fit data biases. In this paper, we revisit visual reasoning with a two-stage perspective: (1) symbolization and (2) logical reasoning given symbols or their representations. We find that the reasoning stage generalizes better than the symbolization stage. Thus, it is more efficient to implement symbolization via separated encoders for different data domains while using a shared reasoner. Given our findings, we establish design principles for visual reasoning frameworks that follow separated symbolization and shared reasoning. The proposed two-stage framework achieves impressive generalization ability on various visual reasoning tasks, including puzzles, physical prediction, and visual question answering (VQA), encompassing both 2D and 3D modalities. We believe our insights will pave the way for generalizable visual reasoning.
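As a concrete illustration of the "separated symbolization, shared reasoning" principle described in the abstract, here is a minimal PyTorch-style sketch. The module names, domain names, and the transformer-based reasoner are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch: one symbolization encoder per data domain, one shared reasoner.
# All sizes and module choices are assumptions for illustration only.
import torch
import torch.nn as nn


class TwoStageReasoner(nn.Module):
    def __init__(self, symbol_dim: int = 256, num_outputs: int = 16):
        super().__init__()
        # Stage 1: separate symbolization encoders, one per data domain.
        self.encoders = nn.ModuleDict({
            "puzzle_2d": nn.Sequential(nn.Flatten(), nn.LazyLinear(symbol_dim)),
            "physics_3d": nn.Sequential(nn.Flatten(), nn.LazyLinear(symbol_dim)),
            "vqa_image": nn.Sequential(nn.Flatten(), nn.LazyLinear(symbol_dim)),
        })
        # Stage 2: a single reasoner shared across every domain.
        layer = nn.TransformerEncoderLayer(d_model=symbol_dim, nhead=4, batch_first=True)
        self.shared_reasoner = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(symbol_dim, num_outputs)

    def forward(self, x: torch.Tensor, domain: str) -> torch.Tensor:
        symbols = self.encoders[domain](x)        # (B, symbol_dim) domain-specific symbols
        symbols = symbols.unsqueeze(1)            # (B, 1, symbol_dim) token sequence
        reasoned = self.shared_reasoner(symbols)  # domain-agnostic reasoning
        return self.head(reasoned.mean(dim=1))    # task prediction


# The same reasoner serves inputs from two different domains.
model = TwoStageReasoner()
out_2d = model(torch.randn(4, 3, 32, 32), domain="puzzle_2d")
out_3d = model(torch.randn(4, 8, 64), domain="physics_3d")
```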
Related papers
- Symbol-LLM: Leverage Language Models for Symbolic System in Visual Human
Activity Reasoning [58.5857133154749]
We propose a new symbolic system with two ideal properties: broad-coverage symbols and rational rules.
We leverage the recent advancement of LLMs to approximate these two properties.
Our method shows superiority in extensive activity understanding tasks.
arXiv Detail & Related papers (2023-11-29T05:27:14Z) - Towards A Unified Neural Architecture for Visual Recognition and
Reasoning [40.938279131241764]
We propose a unified neural architecture for visual recognition and reasoning with a generic interface (e.g., tokens) for both.
Our framework enables the investigation of how different visual recognition tasks, datasets, and inductive biases can help enable principled temporal reasoning capabilities.
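A toy sketch of the generic token interface idea described in this entry: a recognition backbone emits a sequence of tokens, and a single reasoning module consumes those tokens regardless of which task produced them. The module names and sizes below are assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn


class PatchTokenizer(nn.Module):
    """Recognition side: turn an image into a sequence of tokens."""
    def __init__(self, patch: int = 8, channels: int = 3, dim: int = 128):
        super().__init__()
        self.proj = nn.Conv2d(channels, dim, kernel_size=patch, stride=patch)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        feats = self.proj(images)                # (B, dim, H/p, W/p)
        return feats.flatten(2).transpose(1, 2)  # (B, N, dim) tokens


class TokenReasoner(nn.Module):
    """Reasoning side: consumes tokens from any recognition task."""
    def __init__(self, dim: int = 128, num_answers: int = 10):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(dim, num_answers)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        return self.head(self.encoder(tokens).mean(dim=1))


tokens = PatchTokenizer()(torch.randn(2, 3, 32, 32))  # recognition output as tokens
logits = TokenReasoner()(tokens)                      # reasoning over the shared token interface
```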
arXiv Detail & Related papers (2023-11-10T20:27:43Z) - Learning Concept-Based Causal Transition and Symbolic Reasoning for Visual Planning [36.131648635051334]
Visual planning simulates how humans make decisions to achieve desired goals.
We propose an interpretable and generalizable visual planning framework.
We show that our framework can generalize to unseen task trajectories, unseen object categories, and real-world data.
arXiv Detail & Related papers (2023-10-05T05:41:21Z) - Minding Language Models' (Lack of) Theory of Mind: A Plug-and-Play
Multi-Character Belief Tracker [72.09076317574238]
ToM is a plug-and-play approach to investigate the belief states of characters in reading comprehension.
We show that ToM enhances off-the-shelf neural networks' theory of mind in a zero-shot setting while showing robust out-of-distribution performance compared to supervised baselines.
arXiv Detail & Related papers (2023-06-01T17:24:35Z) - Towards Unsupervised Visual Reasoning: Do Off-The-Shelf Features Know
How to Reason? [30.16956370267339]
We introduce a protocol to evaluate visual representations for the task of Visual Question Answering.
In order to decouple visual feature extraction from reasoning, we design a specific attention-based reasoning module.
We compare two types of visual representations, densely extracted local features and object-centric ones, against the performance of a perfect image representation based on ground truth.
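A hedged sketch of the evaluation idea in this entry: the visual features are frozen and the same question-conditioned attention reasoner runs on top of them, so accuracy differences reflect the features rather than the reasoning module. The attention module below is a generic stand-in, not the paper's exact design.

```python
import torch
import torch.nn as nn


class AttentionReasoner(nn.Module):
    def __init__(self, feat_dim: int = 256, q_dim: int = 256, num_answers: int = 100):
        super().__init__()
        self.feat_proj = nn.Linear(feat_dim, q_dim)
        self.attend = nn.MultiheadAttention(embed_dim=q_dim, num_heads=4, batch_first=True)
        self.classify = nn.Linear(q_dim, num_answers)

    def forward(self, visual_feats: torch.Tensor, question: torch.Tensor) -> torch.Tensor:
        # visual_feats: (B, N, feat_dim) -- either dense grid features or
        # object-centric slots, both judged by this identical reasoning module.
        keys = self.feat_proj(visual_feats)
        attended, _ = self.attend(question.unsqueeze(1), keys, keys)  # question attends to vision
        return self.classify(attended.squeeze(1))


reasoner = AttentionReasoner()
dense_feats = torch.randn(2, 49, 256)   # e.g. a flattened 7x7 feature grid
object_feats = torch.randn(2, 10, 256)  # e.g. 10 detected-object embeddings
question = torch.randn(2, 256)
logits_dense = reasoner(dense_feats, question)
logits_object = reasoner(object_feats, question)
```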
arXiv Detail & Related papers (2022-12-20T14:36:45Z) - Visual Superordinate Abstraction for Robust Concept Learning [80.15940996821541]
Concept learning constructs visual representations that are connected to linguistic semantics.
We ascribe the bottleneck to a failure to explore the intrinsic semantic hierarchy of visual concepts.
We propose a visual superordinate abstraction framework for explicitly modeling semantic-aware visual subspaces.
arXiv Detail & Related papers (2022-05-28T14:27:38Z) - PTR: A Benchmark for Part-based Conceptual, Relational, and Physical
Reasoning [135.2892665079159]
We introduce a new large-scale diagnostic visual reasoning dataset named PTR.
PTR contains around 70k RGBD synthetic images with ground truth object and part level annotations.
We examine several state-of-the-art visual reasoning models on this dataset and observe that they still make many surprising mistakes.
arXiv Detail & Related papers (2021-12-09T18:59:34Z) - Interpretable Neural Computation for Real-World Compositional Visual
Question Answering [4.3668650778541895]
We build an interpretable framework for real-world compositional VQA.
In our framework, images and questions are disentangled into scene graphs and programs, and a symbolic program runs on them with full transparency to select the attention regions.
Experiments conducted on the GQA benchmark demonstrate that our framework outperforms the compositional prior arts and achieves competitive accuracy among monolithic ones.
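A toy, fully transparent symbolic executor in the spirit of the framework above: the image is assumed to already be parsed into a scene graph and the question into a program. The operation names and the graph schema are invented here for illustration only.

```python
scene_graph = {
    "objects": [
        {"id": 0, "name": "mug", "color": "red", "on": 1},
        {"id": 1, "name": "table", "color": "brown", "on": None},
    ],
}

# Program for "What color is the object on the table?"
program = [
    ("filter_name", "table"),
    ("relate_inverse", "on"),  # objects whose `on` relation points at the current set
    ("query_attr", "color"),
]


def execute(graph, program):
    objs = graph["objects"]
    current = objs  # the attended objects, visible at every step
    for op, arg in program:
        if op == "filter_name":
            current = [o for o in current if o["name"] == arg]
        elif op == "relate_inverse":
            ids = {o["id"] for o in current}
            current = [o for o in objs if o.get(arg) in ids]
        elif op == "query_attr":
            return [o[arg] for o in current]
        print(op, "->", [o["name"] for o in current])  # every intermediate step is inspectable
    return current


print(execute(scene_graph, program))  # ['red']
```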
arXiv Detail & Related papers (2020-10-10T05:46:22Z) - Neuro-Symbolic Visual Reasoning: Disentangling "Visual" from "Reasoning" [49.76230210108583]
We propose a framework to isolate and evaluate the reasoning aspect of visual question answering (VQA) separately from its perception.
We also propose a novel top-down calibration technique that allows the model to answer reasoning questions even with imperfect perception.
On the challenging GQA dataset, this framework is used to perform in-depth, disentangled comparisons between well-known VQA models.
arXiv Detail & Related papers (2020-06-20T08:48:29Z)
This list is automatically generated from the titles and abstracts of the papers on this site.