Weakly Supervised Relative Spatial Reasoning for Visual Question
Answering
- URL: http://arxiv.org/abs/2109.01934v1
- Date: Sat, 4 Sep 2021 21:29:06 GMT
- Title: Weakly Supervised Relative Spatial Reasoning for Visual Question
Answering
- Authors: Pratyay Banerjee, Tejas Gokhale, Yezhou Yang, Chitta Baral
- Abstract summary: We evaluate the faithfulness of V&L models to geometric understanding of scenes.
We train V&L models with weak supervision from off-the-shelf depth estimators.
This leads to considerable improvements in accuracy on the "GQA" visual question answering challenge.
- Score: 38.05223339919346
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Vision-and-language (V&L) reasoning necessitates perception of visual
concepts such as objects and actions, understanding semantics and language
grounding, and reasoning about the interplay between the two modalities. One
crucial aspect of visual reasoning is spatial understanding, which involves
understanding relative locations of objects, i.e., implicitly learning the
geometry of the scene. In this work, we evaluate the faithfulness of V&L
models to such geometric understanding, by formulating the prediction of
pair-wise relative locations of objects as a classification as well as a
regression task. Our findings suggest that state-of-the-art transformer-based
V&L models lack sufficient abilities to excel at this task. Motivated by this,
we design two objectives as proxies for 3D spatial reasoning (SR) -- object
centroid estimation and relative position estimation, and train V&L models with
weak supervision from off-the-shelf depth estimators. This leads to considerable
improvements in accuracy on the "GQA" visual question answering challenge (in
fully supervised, few-shot, and O.O.D. settings) as well as improvements in
relative spatial reasoning. Code and data will be released at
https://github.com/pratyay-banerjee/weak_sup_vqa.
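The abstract describes deriving weak supervision for two proxy objectives, object centroid estimation and relative position estimation, from an off-the-shelf depth estimator. The following is a minimal sketch, not taken from the released repository, of how such pseudo-labels could be constructed from detector bounding boxes and a monocular depth map; the function names, the median-depth heuristic, and the depth_threshold parameter are illustrative assumptions.

```python
# Sketch (assumptions, not the authors' code): build weak spatial-reasoning
# pseudo-labels from object boxes plus a depth map produced by an
# off-the-shelf monocular depth estimator.
import numpy as np

def estimate_centroids(boxes, depth_map):
    """Approximate a 3D centroid (x, y, z) for each detected object.

    boxes: array of shape (N, 4) with (x1, y1, x2, y2) in pixels.
    depth_map: array of shape (H, W) with per-pixel depth values.
    """
    centroids = []
    for x1, y1, x2, y2 in boxes.astype(int):
        cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0            # 2D box centre
        region = depth_map[y1:y2, x1:x2]
        z = float(np.median(region)) if region.size else 0.0  # robust depth estimate
        centroids.append((cx, cy, z))
    return np.array(centroids)

def relative_position_labels(centroids, depth_threshold=0.1):
    """Pairwise targets: a regression offset and a coarse classification label."""
    offsets, classes = {}, {}
    for i in range(len(centroids)):
        for j in range(len(centroids)):
            if i == j:
                continue
            dx, dy, dz = centroids[j] - centroids[i]
            offsets[(i, j)] = (dx, dy, dz)                    # regression target
            horiz = "right" if dx > 0 else "left"
            vert = "below" if dy > 0 else "above"             # image y grows downward
            if dz > depth_threshold:
                depth_rel = "behind"
            elif dz < -depth_threshold:
                depth_rel = "in front of"
            else:
                depth_rel = "same depth"
            classes[(i, j)] = (horiz, vert, depth_rel)        # classification target
    return offsets, classes

# Toy usage: two detected objects and a random depth map
boxes = np.array([[10, 20, 60, 90], [100, 30, 160, 110]], dtype=float)
depth = np.random.rand(128, 192).astype(np.float32)
cents = estimate_centroids(boxes, depth)
offs, cls = relative_position_labels(cents)
print(cls[(0, 1)])
```

In the paper's setting, targets of this kind would supervise auxiliary heads on a transformer-based V&L model: the raw offsets roughly mirror the regression formulation of relative location prediction, while the coarse thresholded classes mirror the classification formulation.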
Related papers
- VSP: Assessing the dual challenges of perception and reasoning in spatial planning tasks for VLMs [102.36953558562436]
Vision language models (VLMs) are an exciting emerging class of language models (LMs).
One understudied capability in VLMs is visual spatial planning.
Our study introduces a benchmark that evaluates the general spatial planning capability of these models.
arXiv Detail & Related papers (2024-07-02T00:24:01Z) - GSR-BENCH: A Benchmark for Grounded Spatial Reasoning Evaluation via Multimodal LLMs [3.2688425993442696]
The ability to understand and reason about spatial relationships between objects in images is an important component of visual reasoning.
We extend the previously released What'sUp dataset and propose a novel comprehensive evaluation for spatial relationship understanding.
arXiv Detail & Related papers (2024-06-19T06:15:26Z) - Improving Vision-and-Language Reasoning via Spatial Relations Modeling [30.477235227733928]
Visual commonsense reasoning (VCR) is a challenging multi-modal task.
The proposed method can guide the representations to maintain more spatial context.
We achieve state-of-the-art results on VCR and two other vision-and-language reasoning tasks, VQA and NLVR.
arXiv Detail & Related papers (2023-11-09T11:54:55Z) - Evaluating Robustness of Visual Representations for Object Assembly Task
Requiring Spatio-Geometrical Reasoning [8.626019848533707]
This paper focuses on evaluating and benchmarking the robustness of visual representations in the context of object assembly tasks.
We employ a general framework in visuomotor policy learning that utilizes visual pretraining models as vision encoders.
Our study investigates the robustness of this framework when applied to a dual-arm manipulation setup, specifically to grasp variations.
arXiv Detail & Related papers (2023-10-15T20:41:07Z) - Knowing Earlier what Right Means to You: A Comprehensive VQA Dataset for
Grounding Relative Directions via Multi-Task Learning [16.538887534958555]
We introduce GRiD-A-3D, a novel diagnostic visual question-answering dataset based on abstract objects.
Our dataset allows for a fine-grained analysis of end-to-end VQA models' capabilities to ground relative directions.
We demonstrate that within a few epochs, the subtasks required to reason over relative directions are learned in the order in which relative directions are intuitively processed.
arXiv Detail & Related papers (2022-07-06T12:31:49Z) - RelViT: Concept-guided Vision Transformer for Visual Relational
Reasoning [139.0548263507796]
We use vision transformers (ViTs) as our base model for visual reasoning.
We make better use of concepts defined as object entities and their relations to improve the reasoning ability of ViTs.
We show the resulting model, Concept-guided Vision Transformer (or RelViT for short), significantly outperforms prior approaches on HICO and GQA benchmarks.
arXiv Detail & Related papers (2022-04-24T02:46:43Z) - PTR: A Benchmark for Part-based Conceptual, Relational, and Physical
Reasoning [135.2892665079159]
We introduce a new large-scale diagnostic visual reasoning dataset named PTR.
PTR contains around 70k RGBD synthetic images with ground truth object and part level annotations.
We examine several state-of-the-art visual reasoning models on this dataset and observe that they still make many surprising mistakes.
arXiv Detail & Related papers (2021-12-09T18:59:34Z) - Object-Centric Diagnosis of Visual Reasoning [118.36750454795428]
This paper presents a systematic object-centric diagnosis of visual reasoning on grounding and robustness.
We develop a diagnostic model, namely Graph Reasoning Machine.
Our model replaces the purely symbolic visual representation with a probabilistic scene graph and then applies teacher-forcing training for the visual reasoning module.
arXiv Detail & Related papers (2020-12-21T18:59:28Z) - Interpretable Visual Reasoning via Induced Symbolic Space [75.95241948390472]
We study the problem of concept induction in visual reasoning, i.e., identifying concepts and their hierarchical relationships from question-answer pairs associated with images.
We first design a new framework named object-centric compositional attention model (OCCAM) to perform the visual reasoning task with object-level visual features.
We then propose a method to induce concepts of objects and relations using clues from the attention patterns between objects' visual features and question words.
arXiv Detail & Related papers (2020-11-23T18:21:49Z)