RelViT: Concept-guided Vision Transformer for Visual Relational
Reasoning
- URL: http://arxiv.org/abs/2204.11167v1
- Date: Sun, 24 Apr 2022 02:46:43 GMT
- Title: RelViT: Concept-guided Vision Transformer for Visual Relational
Reasoning
- Authors: Xiaojian Ma, Weili Nie, Zhiding Yu, Huaizu Jiang, Chaowei Xiao, Yuke
Zhu, Song-Chun Zhu, Anima Anandkumar
- Abstract summary: We use vision transformers (ViTs) as our base model for visual reasoning.
We make better use of concepts defined as object entities and their relations to improve the reasoning ability of ViTs.
We show the resulting model, Concept-guided Vision Transformer (or RelViT for short), significantly outperforms prior approaches on HICO and GQA benchmarks.
- Score: 139.0548263507796
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Reasoning about visual relationships is central to how humans interpret the
visual world. This task remains challenging for current deep learning
algorithms since it requires addressing three key technical problems jointly:
1) identifying object entities and their properties, 2) inferring semantic
relations between pairs of entities, and 3) generalizing to novel
object-relation combinations, i.e., systematic generalization. In this work, we
use vision transformers (ViTs) as our base model for visual reasoning and make
better use of concepts defined as object entities and their relations to
improve the reasoning ability of ViTs. Specifically, we introduce a novel
concept-feature dictionary to allow flexible image feature retrieval at
training time with concept keys. This dictionary enables two new concept-guided
auxiliary tasks: 1) a global task for promoting relational reasoning, and 2) a
local task for facilitating semantic object-centric correspondence learning. To
examine the systematic generalization of visual reasoning models, we introduce
systematic splits for the standard HICO and GQA benchmarks. We show the
resulting model, Concept-guided Vision Transformer (or RelViT for short),
significantly outperforms prior approaches on HICO and GQA by 16% and 13% in
the original split, and by 43% and 18% in the systematic split. Our ablation
analyses also reveal our model's compatibility with multiple ViT variants and
robustness to hyper-parameters.
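To make the mechanism above concrete, the following is a minimal, hypothetical PyTorch sketch of a concept-feature dictionary plus one contrastive instantiation of the global concept-guided auxiliary task. The class and function names, buffer size, and the specific loss form are illustrative assumptions, not the paper's released implementation; the local task would analogously operate on patch-level (object-centric) features retrieved under shared object concepts.

```python
import random
from collections import defaultdict, deque
from typing import Optional

import torch
import torch.nn.functional as F


class ConceptFeatureDictionary:
    """Maps a concept key (e.g. an HOI label such as 'ride-bicycle') to a
    small buffer of image features seen during training, so that for any
    concept we can retrieve another image feature that shares it."""

    def __init__(self, slots_per_concept: int = 64):
        self.buffer = defaultdict(lambda: deque(maxlen=slots_per_concept))

    def update(self, concept: str, feature: torch.Tensor) -> None:
        # Store a detached copy so the dictionary does not retain the graph.
        self.buffer[concept].append(feature.detach())

    def retrieve(self, concept: str) -> Optional[torch.Tensor]:
        slots = self.buffer.get(concept)
        if not slots:
            return None
        return random.choice(list(slots))


def global_concept_loss(feat: torch.Tensor, positive: torch.Tensor,
                        negatives: torch.Tensor,
                        temperature: float = 0.07) -> torch.Tensor:
    """One contrastive form of the global auxiliary task: pull together
    features of two images sharing a concept, push apart the rest."""
    pos = F.cosine_similarity(feat, positive, dim=-1) / temperature                 # scalar
    neg = F.cosine_similarity(feat.unsqueeze(0), negatives, dim=-1) / temperature   # (N,)
    logits = torch.cat([pos.unsqueeze(0), neg]).unsqueeze(0)                        # (1, N+1)
    return F.cross_entropy(logits, torch.zeros(1, dtype=torch.long))                # positive = class 0


# Usage sketch: after encoding an image with a ViT into a global feature,
# update the dictionary with the image's concept labels and retrieve a
# concept-sharing partner for the auxiliary loss.
dictionary = ConceptFeatureDictionary(slots_per_concept=8)
dim = 16
feat = torch.randn(dim)                          # ViT feature of the current image
dictionary.update("ride-bicycle", torch.randn(dim))
partner = dictionary.retrieve("ride-bicycle")
if partner is not None:
    negatives = torch.randn(4, dim)              # features of unrelated images
    loss = global_concept_loss(feat, partner, negatives)
```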
Related papers
- Towards Flexible Visual Relationship Segmentation [25.890273232954055]
Visual relationship understanding has been studied separately in human-object interaction detection, scene graph generation, and referring relationships tasks.
We propose FleVRS, a single model that seamlessly integrates the above three aspects in standard and promptable visual relationship segmentation.
Our framework outperforms existing models in standard, promptable, and open-vocabulary tasks.
arXiv Detail & Related papers (2024-08-15T17:57:38Z)
- Beyond the Doors of Perception: Vision Transformers Represent Relations Between Objects [30.09778169168547]
Vision transformers (ViTs) have achieved state-of-the-art performance in a variety of settings.
However, they exhibit surprising failures when performing tasks involving visual relations.
arXiv Detail & Related papers (2024-06-22T22:43:10Z)
- Interactive Visual Task Learning for Robots [4.114444605090135]
We present a framework for robots to learn novel visual concepts and tasks via in-situ linguistic interactions with human users.
We propose Hi-Viscont, which adds information about a novel concept to its parent nodes within a concept hierarchy.
We represent a visual task as a scene graph with language annotations, allowing us to create novel permutations of a demonstrated task zero-shot in-situ.
arXiv Detail & Related papers (2023-12-20T17:38:04Z)
- Visual Commonsense based Heterogeneous Graph Contrastive Learning [79.22206720896664]
We propose a heterogeneous graph contrastive learning method to improve visual reasoning.
Our method is designed to be plug-and-play, so that it can be quickly and easily combined with a wide range of representative methods.
arXiv Detail & Related papers (2023-11-11T12:01:18Z)
- Towards a Unified Transformer-based Framework for Scene Graph Generation
and Human-object Interaction Detection [116.21529970404653]
We introduce SG2HOI+, a unified one-step model based on the Transformer architecture.
Our approach employs two interactive hierarchical Transformers to seamlessly unify the tasks of SGG and HOI detection.
Our approach achieves competitive performance when compared to state-of-the-art HOI methods.
arXiv Detail & Related papers (2023-11-03T07:25:57Z)
- Top-Down Visual Attention from Analysis by Synthesis [87.47527557366593]
We consider top-down attention from a classic Analysis-by-Synthesis (AbS) perspective of vision.
We propose the Analysis-by-Synthesis Vision Transformer (AbSViT), a top-down modulated ViT model that variationally approximates AbS and achieves controllable top-down attention.
arXiv Detail & Related papers (2023-03-23T05:17:05Z)
- SrTR: Self-reasoning Transformer with Visual-linguistic Knowledge for
Scene Graph Generation [12.977857322594206]
One-stage scene graph generation approaches infer the effective relation between entity pairs using sparse proposal sets and a few queries.
A Self-reasoning Transformer with Visual-linguistic Knowledge (SrTR) is proposed to add flexible self-reasoning ability to the model.
Inspired by large-scale pre-trained image-text foundation models, visual-linguistic prior knowledge is introduced.
arXiv Detail & Related papers (2022-12-19T09:47:27Z)
- Visual Superordinate Abstraction for Robust Concept Learning [80.15940996821541]
Concept learning constructs visual representations that are connected to linguistic semantics.
We ascribe the bottleneck to a failure to explore the intrinsic semantic hierarchy of visual concepts.
We propose a visual superordinate abstraction framework for explicitly modeling semantic-aware visual subspaces.
arXiv Detail & Related papers (2022-05-28T14:27:38Z)
- A Minimalist Dataset for Systematic Generalization of Perception,
Syntax, and Semantics [131.93113552146195]
We present a new dataset, Handwritten arithmetic with INTegers (HINT), to examine machines' capability of learning generalizable concepts.
In HINT, machines are tasked with learning how concepts are perceived from raw signals such as images.
We undertake extensive experiments with various sequence-to-sequence models, including RNNs, Transformers, and GPT-3.
arXiv Detail & Related papers (2021-03-02T01:32:54Z)
- Attention Guided Semantic Relationship Parsing for Visual Question
Answering [36.84737596725629]
Humans explain inter-object relationships with semantic labels that demonstrate a high-level understanding required to perform vision-language tasks such as Visual Question Answering (VQA).
Existing VQA models represent relationships as a combination of object-level visual features, which constrains a model to expressing interactions between objects in a single domain while it is trying to solve a multi-modal task.
In this paper, we propose a general-purpose semantic relationship parser that generates a semantic feature vector for each subject-predicate-object triplet in an image, and a Mutual and Self Attention mechanism that learns to identify the relationship triplets that are important to answering the given question.
arXiv Detail & Related papers (2020-10-05T00:23:49Z)
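As a closing illustration of the triplet-scoring idea in the last entry above, here is a small, hypothetical PyTorch sketch that attends over subject-predicate-object triplet features with a question feature. The module name, projection sizes, and pooling are assumptions for illustration only, not the cited paper's Mutual and Self Attention implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TripletAttention(nn.Module):
    """Projects triplet semantic features and a question feature into a
    shared space, then attends over triplets to weight the ones most
    relevant to answering the question."""

    def __init__(self, triplet_dim: int, question_dim: int, hidden_dim: int = 256):
        super().__init__()
        self.proj_t = nn.Linear(triplet_dim, hidden_dim)
        self.proj_q = nn.Linear(question_dim, hidden_dim)

    def forward(self, triplet_feats: torch.Tensor, question_feat: torch.Tensor):
        # triplet_feats: (num_triplets, triplet_dim), question_feat: (question_dim,)
        t = self.proj_t(triplet_feats)              # (num_triplets, hidden_dim)
        q = self.proj_q(question_feat)              # (hidden_dim,)
        scores = t @ q / t.shape[-1] ** 0.5         # scaled dot-product scores
        weights = F.softmax(scores, dim=0)          # attention over triplets
        return weights @ t, weights                 # pooled feature, per-triplet weights


# Usage: 5 detected triplets with 300-d semantic vectors, a 512-d question feature.
attn = TripletAttention(triplet_dim=300, question_dim=512)
pooled, weights = attn(torch.randn(5, 300), torch.randn(512))
```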