Visual Reasoning: from State to Transformation
- URL: http://arxiv.org/abs/2305.01668v1
- Date: Tue, 2 May 2023 14:24:12 GMT
- Title: Visual Reasoning: from State to Transformation
- Authors: Xin Hong, Yanyan Lan, Liang Pang, Jiafeng Guo, Xueqi Cheng
- Abstract summary: Existing visual reasoning tasks ignore an important factor, i.e. transformation.
We propose a novel transformation driven visual reasoning (TVR) task.
We show that state-of-the-art visual reasoning models perform well on Basic, but are far from human-level intelligence on Event, View, and TRANCO.
- Score: 80.32402545546209
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Most existing visual reasoning tasks, such as CLEVR in VQA, ignore an
important factor, i.e.~transformation. They are solely defined to test how well
machines understand concepts and relations within static settings, like one
image. Such \textbf{state driven} visual reasoning has limitations in
reflecting the ability to infer the dynamics between different states, which
has been shown to be equally important for human cognition in Piaget's theory. To
tackle this problem, we propose a novel \textbf{transformation driven} visual
reasoning (TVR) task. Given both the initial and final states, the goal is to
infer the corresponding intermediate transformation. Following this
definition, a new synthetic dataset namely TRANCE is first constructed on the
basis of CLEVR, including three levels of settings, i.e.~Basic (single-step
transformation), Event (multi-step transformation), and View (multi-step
transformation with variant views). Next, we build another real dataset called
TRANCO based on COIN, to compensate for the limited transformation diversity of TRANCE.
Inspired by human reasoning, we propose a three-stage reasoning framework
called TranNet, including observing, analyzing, and concluding, to test how
recent advanced techniques perform on TVR. Experimental results show that the
state-of-the-art visual reasoning models perform well on Basic, but are still
far from human-level intelligence on Event, View, and TRANCO. We believe the
proposed new paradigm will boost the development of machine visual reasoning.
More advanced methods and new problems need to be investigated in this
direction. The resource of TVR is available at
\url{https://hongxin2019.github.io/TVR/}.
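To make the task definition concrete, the following is a minimal PyTorch sketch of an observe/analyze/conclude pipeline in the spirit of TranNet: encode the initial and final states, compare them, and decode a multi-step transformation prediction. The module choices, head sizes, and fixed step limit are illustrative assumptions, not the paper's actual TranNet architecture.

```python
# Minimal observe/analyze/conclude sketch inspired by the TVR setup.
# All layers, heads, and sizes are illustrative assumptions, not the paper's TranNet.
import torch
import torch.nn as nn

class TranNetSketch(nn.Module):
    def __init__(self, feat_dim=256, num_objects=10, num_transformations=20, max_steps=4):
        super().__init__()
        # Observing: encode the initial and final state images into feature vectors.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, feat_dim),
        )
        # Analyzing: compare the two state encodings into a single context vector.
        self.analyzer = nn.Sequential(nn.Linear(2 * feat_dim, feat_dim), nn.ReLU())
        # Concluding: decode a multi-step transformation; a GRU predicts an
        # (object, transformation) pair at each step.
        self.decoder = nn.GRU(feat_dim, feat_dim, batch_first=True)
        self.obj_head = nn.Linear(feat_dim, num_objects)
        self.trans_head = nn.Linear(feat_dim, num_transformations)
        self.max_steps = max_steps

    def forward(self, initial_img, final_img):
        h_init = self.encoder(initial_img)    # (B, feat_dim)
        h_final = self.encoder(final_img)     # (B, feat_dim)
        context = self.analyzer(torch.cat([h_init, h_final], dim=-1))
        # Feed the same context at every decoding step (a simple choice).
        steps = context.unsqueeze(1).repeat(1, self.max_steps, 1)
        out, _ = self.decoder(steps)          # (B, max_steps, feat_dim)
        return self.obj_head(out), self.trans_head(out)

# Usage: predict up to 4 transformation steps between a pair of 128x128 states.
model = TranNetSketch()
initial_state = torch.randn(2, 3, 128, 128)
final_state = torch.randn(2, 3, 128, 128)
obj_logits, trans_logits = model(initial_state, final_state)
print(obj_logits.shape, trans_logits.shape)  # torch.Size([2, 4, 10]) torch.Size([2, 4, 20])
```

The only point the sketch is meant to convey is the interface implied by TVR: the model consumes a pair of states and emits a sequence of transformation predictions, which covers both the single-step (Basic) and multi-step (Event, View) settings.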
Related papers
- Do Vision-Language Transformers Exhibit Visual Commonsense? An Empirical Study of VCR [51.72751335574947]
Visual Commonsense Reasoning (VCR) calls for explanatory reasoning behind question answering over visual scenes.
Progress on the benchmark dataset stems largely from the recent advancement of Vision-Language Transformers (VL Transformers).
This paper posits that the VL Transformers do not exhibit visual commonsense, which is the key to VCR.
arXiv Detail & Related papers (2024-05-27T08:26:58Z) - A Novel Bounding Box Regression Method for Single Object Tracking [0.0]
We introduce two novel bounding box regression networks: inception and deformable.
Experiments and ablation studies show that our inception module installed on the recent ODTrack outperforms the latter on three benchmarks.
arXiv Detail & Related papers (2024-05-16T21:09:45Z) - Visual Transformation Telling [81.99825888461544]
We propose a new visual reasoning task, called Visual Transformation Telling (VTT).
Given a series of states (i.e. images), VTT requires describing the transformation occurring between every two adjacent states.
We collect a novel dataset to support the study of transformation reasoning from two existing instructional video datasets.
arXiv Detail & Related papers (2023-05-03T07:02:57Z) - RelViT: Concept-guided Vision Transformer for Visual Relational Reasoning [139.0548263507796]
We use vision transformers (ViTs) as our base model for visual reasoning.
We make better use of concepts defined as object entities and their relations to improve the reasoning ability of ViTs.
We show the resulting model, Concept-guided Vision Transformer (or RelViT for short), significantly outperforms prior approaches on HICO and GQA benchmarks.
arXiv Detail & Related papers (2022-04-24T02:46:43Z) - RelTransformer: Balancing the Visual Relationship Detection from Local Context, Scene and Memory [24.085223165006212]
We propose a novel framework, dubbed as RelTransformer, which performs relationship prediction using rich semantic features from multiple image levels.
Our model significantly improves accuracy on GQA-LT by 27.4% over the best baselines for tail-relationship prediction.
arXiv Detail & Related papers (2021-04-24T12:04:04Z) - Transformation Driven Visual Reasoning [80.32402545546209]
This paper defines a new visual reasoning paradigm by introducing an important factor, i.e. transformation.
We argue that this kind of state driven visual reasoning approach has limitations in reflecting whether the machine has the ability to infer the dynamics between different states.
Experimental results show that the state-of-the-art visual reasoning models perform well on Basic, but are still far from human-level intelligence on Event and View.
arXiv Detail & Related papers (2020-11-26T07:11:31Z) - Dense Regression Network for Video Grounding [97.57178850020327]
We use the distances between each frame within the ground-truth segment and the starting (ending) frame as dense supervision to improve video grounding accuracy (a minimal sketch of these distance targets follows this list).
Specifically, we design a novel dense regression network (DRN) to regress the distances from each frame to the starting (ending) frame of the video segment.
We also propose a simple but effective IoU regression head module to explicitly consider the localization quality of the grounding results.
arXiv Detail & Related papers (2020-04-07T17:15:37Z)
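The dense regression idea in the video grounding entry above reduces to a simple target construction: every frame inside the ground-truth segment is supervised with its distances to the segment's start and end frames. The sketch below illustrates only that computation; the function name, frame indexing, and segment bounds are illustrative assumptions, not the DRN implementation.

```python
# Sketch of dense distance targets for video grounding (illustrative, not DRN's code).
import torch

def dense_distance_targets(num_frames: int, gt_start: int, gt_end: int):
    """Return per-frame distances to the ground-truth start/end frames and a mask
    selecting the frames inside the segment that receive dense supervision."""
    idx = torch.arange(num_frames, dtype=torch.float32)
    inside = (idx >= gt_start) & (idx <= gt_end)  # frames within the ground-truth segment
    start_dist = idx - gt_start                   # distance to the starting frame
    end_dist = gt_end - idx                       # distance to the ending frame
    return start_dist, end_dist, inside

start_d, end_d, mask = dense_distance_targets(num_frames=10, gt_start=3, gt_end=7)
print(start_d[mask])  # tensor([0., 1., 2., 3., 4.])
print(end_d[mask])    # tensor([4., 3., 2., 1., 0.])
```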