Object-Centric Diagnosis of Visual Reasoning
- URL: http://arxiv.org/abs/2012.11587v1
- Date: Mon, 21 Dec 2020 18:59:28 GMT
- Title: Object-Centric Diagnosis of Visual Reasoning
- Authors: Jianwei Yang, Jiayuan Mao, Jiajun Wu, Devi Parikh, David D. Cox,
Joshua B. Tenenbaum, Chuang Gan
- Abstract summary: This paper presents a systematic object-centric diagnosis of visual reasoning in terms of grounding and robustness.
We develop a diagnostic model, the Graph Reasoning Machine.
Our model replaces the purely symbolic visual representation with a probabilistic scene graph and then applies teacher-forcing training to the visual reasoning module.
- Score: 118.36750454795428
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Answering questions about an image requires not only knowing what -- understanding the fine-grained contents (e.g., objects, relationships) of the image -- but also telling why -- reasoning over grounded visual cues to derive the answer to the question. Over the last few years, we have seen significant progress in visual question answering. Impressive as the accuracy gains are, it remains unclear whether these models perform grounded visual reasoning or merely exploit spurious correlations in the training data. Recently, a number of works have attempted to answer this question from perspectives such as grounding and robustness. However, most of them either focus on the language side or coarsely study pixel-level attention maps. In this paper, leveraging the step-wise object grounding annotations provided in the GQA dataset, we first present a systematic object-centric diagnosis of visual reasoning in terms of grounding and robustness, particularly on the vision side. From extensive comparisons across different models, we find that even models with high accuracy neither ground objects precisely nor stay robust to visual content perturbations. In contrast, symbolic and modular models achieve relatively better grounding and robustness, though at the cost of accuracy. To reconcile these aspects, we further develop a diagnostic model, the Graph Reasoning Machine. Our model replaces the purely symbolic visual representation with a probabilistic scene graph and applies teacher-forcing training to the visual reasoning module. The model improves on all three metrics (accuracy, grounding, and robustness) over the vanilla neural-symbolic model while inheriting its transparency. Further ablation studies suggest that the improvement stems mainly from more accurate image understanding and proper supervision of intermediate reasoning.
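To make the two ingredients named in the abstract concrete, here is a minimal, hedged sketch: the object names, probabilities, program format, and squared-error step loss below are all hypothetical, not taken from the paper. The idea it illustrates is that a probabilistic scene graph stores soft category scores per object instead of hard symbolic labels, and teacher forcing feeds the ground-truth grounding of each reasoning step to the next step during training.

```python
# Minimal, hypothetical sketch of the two ideas in the abstract:
# (1) probabilistic scene graph: objects keep soft category distributions
#     instead of hard symbolic labels;
# (2) teacher forcing: during training, each reasoning step consumes the
#     ground-truth grounding of the previous step, not the model's own guess.
from dataclasses import dataclass, field

@dataclass
class ObjectNode:
    box: tuple                                      # (x1, y1, x2, y2)
    category: dict = field(default_factory=dict)    # label -> probability

@dataclass
class SceneGraph:
    objects: list       # list[ObjectNode]
    relations: dict     # (subj_idx, obj_idx) -> {relation: probability}

def filter_step(graph, label):
    """One 'filter(label)' reasoning step: a soft attention over objects,
    i.e. the probability that each object carries the label."""
    return [node.category.get(label, 0.0) for node in graph.objects]

def train_step(graph, program, gt_groundings):
    """Teacher-forced execution: score each step against its ground-truth
    grounding, then feed the *ground-truth* attention to the next step."""
    loss, attention = 0.0, [1.0] * len(graph.objects)
    for step, gt in zip(program, gt_groundings):
        pred = [a * p for a, p in zip(attention, filter_step(graph, step))]
        loss += sum((p - g) ** 2 for p, g in zip(pred, gt))  # illustrative
        attention = gt      # teacher forcing: use ground truth, not pred
    return loss

# hypothetical example: "find the red cube" as two filter steps
g = SceneGraph(
    objects=[ObjectNode((0, 0, 10, 10), {"cube": 0.9, "red": 0.8}),
             ObjectNode((20, 0, 30, 10), {"sphere": 0.7, "red": 0.6})],
    relations={})
print(train_step(g, ["cube", "red"], [[1.0, 0.0], [1.0, 0.0]]))  # ~0.05
```

The step loss here is purely illustrative; the point is that supervising intermediate groundings, rather than only final answers, is what the teacher-forcing setup enables.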
Related papers
- Interpretable Visual Question Answering via Reasoning Supervision [4.76359068115052]
Transformer-based architectures have recently demonstrated remarkable performance in the Visual Question Answering (VQA) task.
We propose a novel architecture for visual question answering that leverages common sense reasoning as a supervisory signal.
We demonstrate both quantitatively and qualitatively that the proposed approach can boost the model's visual perception capability and lead to a performance increase.
arXiv Detail & Related papers (2023-09-07T14:12:31Z)
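The summary above does not spell out the training objective; as a hedged illustration of using reasoning as a supervisory signal (the function names, KL form, and weighting are my assumptions, not the paper's), the sketch below adds an auxiliary attention-matching term to the usual answer loss:

```python
import math

def kl_attention_loss(pred_attn, target_attn, eps=1e-8):
    """KL(target || pred) between attention distributions over regions."""
    return sum(t * math.log((t + eps) / (p + eps))
               for p, t in zip(pred_attn, target_attn))

def total_loss(answer_loss, pred_attn, target_attn, weight=0.5):
    # answer_loss: standard VQA cross-entropy; the auxiliary term supervises
    # *where* the model looks, which is the general idea behind reasoning
    # supervision (the exact objective in the paper may differ).
    return answer_loss + weight * kl_attention_loss(pred_attn, target_attn)

print(total_loss(1.2, [0.7, 0.2, 0.1], [0.9, 0.05, 0.05]))
```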
- Guiding Visual Question Answering with Attention Priors [76.21671164766073]
We propose to guide the attention mechanism using explicit linguistic-visual grounding.
This grounding is derived by connecting structured linguistic concepts in the query to their referents among the visual objects.
The resultant algorithm is capable of probing attention-based reasoning models, injecting relevant associative knowledge, and regulating the core reasoning process.
arXiv Detail & Related papers (2022-05-25T09:53:47Z)
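One common way to realize such guidance (a sketch under my own assumptions; the paper's exact mechanism may differ) is to add the log of a grounding-derived prior to the attention logits before the softmax:

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention_with_prior(logits, prior, alpha=1.0, eps=1e-8):
    """Bias attention logits with a grounding-derived prior.
    prior[i] is the (externally computed) probability that visual object i
    is a referent of the query; adding its log re-weights the softmax."""
    return softmax([l + alpha * math.log(p + eps)
                    for l, p in zip(logits, prior)])

# hypothetical: 3 objects; the grounding prior favors object 0
print(attention_with_prior([0.2, 0.1, 0.4], [0.8, 0.1, 0.1]))
```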
- PTR: A Benchmark for Part-based Conceptual, Relational, and Physical Reasoning [135.2892665079159]
We introduce a new large-scale diagnostic visual reasoning dataset named PTR.
PTR contains around 70k RGBD synthetic images with ground-truth object- and part-level annotations.
We examine several state-of-the-art visual reasoning models on this dataset and observe that they still make many surprising mistakes.
arXiv Detail & Related papers (2021-12-09T18:59:34Z)
- Weakly Supervised Relative Spatial Reasoning for Visual Question Answering [38.05223339919346]
We evaluate the faithfulness of vision-and-language (V&L) models to geometric understanding, i.e., reasoning about the relative spatial positions of objects.
We train V&L models with weak supervision from off-the-shelf depth estimators.
This leads to considerable improvements in accuracy on the GQA visual question answering challenge.
arXiv Detail & Related papers (2021-09-04T21:29:06Z)
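A hedged sketch of how such weak labels might be derived (the box format, margin, and label set are assumptions, not the paper's specification): average the off-the-shelf depth estimate inside each object box and compare the two means to label the pair.

```python
def mean_depth(depth, box):
    """Average predicted depth inside a bounding box (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = box
    vals = [depth[y][x] for y in range(y1, y2) for x in range(x1, x2)]
    return sum(vals) / len(vals)

def relative_depth_label(depth, box_a, box_b, margin=0.05):
    """Weak spatial label for an object pair: 'closer', 'farther', or 'same'."""
    da, db = mean_depth(depth, box_a), mean_depth(depth, box_b)
    if da < db - margin:
        return "closer"
    if da > db + margin:
        return "farther"
    return "same"

# hypothetical 4x4 depth map from an off-the-shelf monocular estimator
depth = [[0.2, 0.2, 0.9, 0.9],
         [0.2, 0.2, 0.9, 0.9],
         [0.3, 0.3, 0.8, 0.8],
         [0.3, 0.3, 0.8, 0.8]]
print(relative_depth_label(depth, (0, 0, 2, 2), (2, 0, 4, 2)))  # "closer"
```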
- A Better Loss for Visual-Textual Grounding [74.81353762517979]
Given a textual phrase and an image, the visual grounding problem is defined as the task of locating the image content referenced by the phrase.
It is a challenging task that has several real-world applications in human-computer interaction, image-text reference resolution, and video-text reference resolution.
We propose a model that is able to achieve a higher accuracy than state-of-the-art models thanks to the adoption of a more effective loss function.
arXiv Detail & Related papers (2021-08-11T16:26:54Z)
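The summary names only "a more effective loss function" without specifying it; as a generic illustration of what a box-level grounding objective looks like (explicitly not the paper's actual loss), the sketch below uses 1 - IoU between the predicted and ground-truth boxes:

```python
def box_area(r):
    return (r[2] - r[0]) * (r[3] - r[1])

def iou(a, b):
    """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = box_area(a) + box_area(b) - inter
    return inter / union if union > 0 else 0.0

def grounding_loss(pred_box, gt_box):
    # 1 - IoU is a common box-level grounding loss; differentiable variants
    # (GIoU, DIoU) are typical drop-in replacements in practice.
    return 1.0 - iou(pred_box, gt_box)

print(grounding_loss((0, 0, 10, 10), (5, 0, 15, 10)))  # 1 - 1/3 ~= 0.667
```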
- Visual Distant Supervision for Scene Graph Generation [66.10579690929623]
Scene graph models usually require supervised learning on large quantities of labeled data with intensive human annotation.
We propose visual distant supervision, a novel paradigm of visual relation learning, which can train scene graph models without any human-labeled data.
Comprehensive experimental results show that our distantly supervised model outperforms strong weakly supervised and semi-supervised baselines.
arXiv Detail & Related papers (2021-03-29T06:35:24Z)
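A minimal sketch of the distant-supervision idea (the knowledge-base triples and detector outputs below are hypothetical): relation labels for detected object pairs are looked up in a knowledge base instead of being human-annotated, yielding cheap but noisy training data.

```python
# Hypothetical knowledge base of plausible (subject, object) -> relation
# triples; in practice this would come from an external relational resource.
KNOWLEDGE_BASE = {
    ("person", "horse"): "riding",
    ("person", "bicycle"): "riding",
    ("cup", "table"): "on",
}

def distant_labels(detections):
    """Assign a noisy relation label to every detected object pair that
    matches a knowledge-base entry; no human annotation is involved."""
    labels = []
    for i, subj in enumerate(detections):
        for j, obj in enumerate(detections):
            if i != j and (subj, obj) in KNOWLEDGE_BASE:
                labels.append((subj, KNOWLEDGE_BASE[(subj, obj)], obj))
    return labels

# detected object classes in one image (from an off-the-shelf detector)
print(distant_labels(["person", "horse", "cup", "table"]))
# -> [('person', 'riding', 'horse'), ('cup', 'on', 'table')]
```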
- Neuro-Symbolic Visual Reasoning: Disentangling "Visual" from "Reasoning" [49.76230210108583]
We propose a framework to isolate and evaluate the reasoning aspect of visual question answering (VQA) separately from its perception.
We also propose a novel top-down calibration technique that allows the model to answer reasoning questions even with imperfect perception.
On the challenging GQA dataset, this framework is used to perform in-depth, disentangled comparisons between well-known VQA models.
arXiv Detail & Related papers (2020-06-20T08:48:29Z)
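A toy sketch of the disentangling idea (the data format and trivial "reasoning module" are hypothetical stand-ins): evaluating the same reasoning module once with predicted perception and once with oracle perception attributes the accuracy gap to perception errors rather than reasoning errors.

```python
# Toy setup: 'perception' maps an image id to a set of facts; the
# 'reasoning module' answers membership questions over those facts.
def accuracy(reason, perceive, questions):
    """Answer accuracy of a reasoning module over one perception source."""
    correct = sum(reason(perceive(q["image"]), q["question"]) == q["answer"]
                  for q in questions)
    return correct / len(questions)

def disentangled_report(reason, predicted_perception, oracle_perception, qs):
    # Running the same reasoning module under predicted vs. ground-truth
    # perception separates the two error sources: the gap between the two
    # accuracies is attributable to perception, not reasoning.
    return {
        "predicted_perception_acc": accuracy(reason, predicted_perception, qs),
        "oracle_perception_acc": accuracy(reason, oracle_perception, qs),
    }

questions = [{"image": 0, "question": "red cube", "answer": True},
             {"image": 1, "question": "blue ball", "answer": False}]
oracle = lambda img: [{"red cube"}, {"green ball"}][img]   # ground truth
noisy = lambda img: [{"red ball"}, {"green ball"}][img]    # perception error
reason = lambda facts, q: q in facts                       # trivial reasoner
print(disentangled_report(reason, noisy, oracle, questions))
# -> {'predicted_perception_acc': 0.5, 'oracle_perception_acc': 1.0}
```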