Transformation Driven Visual Reasoning
- URL: http://arxiv.org/abs/2011.13160v2
- Date: Fri, 2 Apr 2021 06:25:46 GMT
- Title: Transformation Driven Visual Reasoning
- Authors: Xin Hong, Yanyan Lan, Liang Pang, Jiafeng Guo and Xueqi Cheng
- Abstract summary: This paper defines a new visual reasoning paradigm by introducing an important factor, i.e., transformation.
We argue that this kind of state-driven visual reasoning approach has limitations in reflecting whether the machine has the ability to infer the dynamics between different states.
Experimental results show that the state-of-the-art visual reasoning models perform well on Basic, but are still far from human-level intelligence on Event and View.
- Score: 80.32402545546209
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper defines a new visual reasoning paradigm by introducing an
important factor, i.e., transformation. The motivation comes from the fact that
most existing visual reasoning tasks, such as CLEVR in VQA, are solely defined
to test how well the machine understands the concepts and relations within
static settings, like one image. We argue that this kind of state-driven
visual reasoning approach has limitations in reflecting whether the
machine has the ability to infer the dynamics between different states, which
has been shown as important as state-level reasoning for human cognition in
Piaget's theory. To tackle this problem, we propose a novel
transformation-driven visual reasoning task. Given both the initial
and final states, the target is to infer the corresponding single-step or
multi-step transformation, represented as a triplet (object, attribute, value)
or a sequence of triplets, respectively. Following this definition, a new
dataset, namely TRANCE, is constructed on the basis of CLEVR, including three
levels of settings, i.e., Basic (single-step transformation), Event (multi-step
transformation), and View (multi-step transformation with variant views).
Experimental results show that the state-of-the-art visual reasoning models
perform well on Basic, but are still far from human-level intelligence on Event
and View. We believe the proposed new paradigm will boost the development of
machine visual reasoning. More advanced methods and real data need to be
investigated in this direction. The TVR resources are available at
https://hongxin2019.github.io/TVR.
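As a rough illustration of the task definition above, the sketch below models a single-step transformation as an (object, attribute, value) triplet and a multi-step transformation as a sequence of triplets, then checks whether a predicted sequence maps the initial state to the final state. The state layout, attribute names, and checking function are illustrative assumptions and do not reflect the actual TRANCE data format or evaluation protocol.

```python
# Hypothetical sketch of the TVR transformation representation: a transformation
# is a triplet (object, attribute, value); a multi-step transformation is a
# sequence of such triplets. Field names are assumptions, not the TRANCE schema.
from typing import Dict, List, Tuple

# A scene state maps object ids to their attribute values,
# e.g. {"obj_1": {"color": "red", "size": "small"}}
State = Dict[str, Dict[str, object]]
Transformation = Tuple[str, str, object]  # (object, attribute, value)

def apply_transformations(initial: State, steps: List[Transformation]) -> State:
    """Apply a (possibly multi-step) transformation to an initial state."""
    final = {obj: dict(attrs) for obj, attrs in initial.items()}
    for obj, attr, value in steps:
        final[obj][attr] = value
    return final

def explains_change(initial: State, final: State, steps: List[Transformation]) -> bool:
    """Check whether a predicted transformation sequence maps initial to final."""
    return apply_transformations(initial, steps) == final

# Single-step (Basic-style) example: one object changes color.
initial = {"obj_1": {"color": "red", "size": "small"}}
final = {"obj_1": {"color": "blue", "size": "small"}}
assert explains_change(initial, final, [("obj_1", "color", "blue")])
```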
Related papers
- Do Vision-Language Transformers Exhibit Visual Commonsense? An Empirical Study of VCR [51.72751335574947]
Visual Commonsense Reasoning (VCR) calls for explanatory reasoning behind question answering over visual scenes.
Progress on the benchmark dataset stems largely from the recent advancement of Vision-Language Transformers (VL Transformers).
This paper posits that the VL Transformers do not exhibit visual commonsense, which is the key to VCR.
arXiv Detail & Related papers (2024-05-27T08:26:58Z) - Visual Transformation Telling [81.99825888461544]
We propose a new visual reasoning task, called Visual Transformation Telling (VTT).
Given a series of states (i.e. images), VTT requires describing the transformation occurring between every two adjacent states.
We collect a novel dataset to support the study of transformation reasoning from two existing instructional video datasets.
arXiv Detail & Related papers (2023-05-03T07:02:57Z) - Visual Reasoning: from State to Transformation [80.32402545546209]
Existing visual reasoning tasks ignore an important factor, i.e., transformation.
We propose a novel transformation-driven visual reasoning (TVR) task.
We show that state-of-the-art visual reasoning models perform well on Basic, but are far from human-level intelligence on Event, View, and TRANCO.
arXiv Detail & Related papers (2023-05-02T14:24:12Z) - See, Think, Confirm: Interactive Prompting Between Vision and Language Models for Knowledge-based Visual Reasoning [60.43585179885355]
We propose a novel framework named Interactive Prompting Visual Reasoner (IPVR) for few-shot knowledge-based visual reasoning.
IPVR contains three stages, see, think and confirm.
We conduct experiments on a range of knowledge-based visual reasoning datasets.
arXiv Detail & Related papers (2023-01-12T18:59:50Z) - RelViT: Concept-guided Vision Transformer for Visual Relational Reasoning [139.0548263507796]
We use vision transformers (ViTs) as our base model for visual reasoning.
We make better use of concepts defined as object entities and their relations to improve the reasoning ability of ViTs.
We show the resulting model, Concept-guided Vision Transformer (or RelViT for short), significantly outperforms prior approaches on HICO and GQA benchmarks.
arXiv Detail & Related papers (2022-04-24T02:46:43Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information and is not responsible for any consequences of its use.