Attention Guided Semantic Relationship Parsing for Visual Question
Answering
- URL: http://arxiv.org/abs/2010.01725v1
- Date: Mon, 5 Oct 2020 00:23:49 GMT
- Title: Attention Guided Semantic Relationship Parsing for Visual Question
Answering
- Authors: Moshiur Farazi, Salman Khan and Nick Barnes
- Abstract summary: Humans explain inter-object relationships with semantic labels that demonstrate a high-level understanding required to perform Vision-Language tasks such as Visual Question Answering (VQA).
Existing VQA models represent relationships as a combination of object-level visual features which constrain a model to express interactions between objects in a single domain, while the model is trying to solve a multi-modal task.
In this paper, we propose a general purpose semantic relationship parser which generates a semantic feature vector for each subject-predicate-object triplet in an image, and a Mutual and Self Attention (MSA) mechanism that learns to identify relationship triplets that are important to answer the given question.
- Score: 36.84737596725629
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Humans explain inter-object relationships with semantic labels that
demonstrate a high-level understanding required to perform complex
Vision-Language tasks such as Visual Question Answering (VQA). However,
existing VQA models represent relationships as a combination of object-level
visual features which constrain a model to express interactions between objects
in a single domain, while the model is trying to solve a multi-modal task. In
this paper, we propose a general purpose semantic relationship parser which
generates a semantic feature vector for each subject-predicate-object triplet
in an image, and a Mutual and Self Attention (MSA) mechanism that learns to
identify relationship triplets that are important to answer the given question.
To motivate the significance of semantic relationships, we show an oracle
setting with ground-truth relationship triplets, where our model achieves a
~25% accuracy gain over the closest state-of-the-art model on the challenging
GQA dataset. Further, with our semantic parser, we show that our model
outperforms other comparable approaches on VQA and GQA datasets.
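To make the idea concrete, below is a minimal sketch of question-guided attention over subject-predicate-object triplet features, loosely following the Mutual and Self Attention idea described in the abstract. The module name TripletAttention, the dimensions, and the exact attention formulation are illustrative assumptions, not the authors' released MSA implementation.

```python
# Hypothetical sketch: weight per-triplet semantic features by the question,
# combining self attention over triplets with question-guided (mutual) scoring.
# All names and dimensions are assumptions for illustration only.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TripletAttention(nn.Module):
    def __init__(self, triplet_dim=512, question_dim=512, hidden_dim=512):
        super().__init__()
        # Mutual attention: score each triplet against the question embedding.
        self.q_proj = nn.Linear(question_dim, hidden_dim)
        self.t_proj = nn.Linear(triplet_dim, hidden_dim)
        self.mutual_score = nn.Linear(hidden_dim, 1)
        # Self attention: let triplets attend to each other.
        self.self_attn = nn.MultiheadAttention(hidden_dim, num_heads=4,
                                               batch_first=True)

    def forward(self, triplets, question):
        # triplets: (B, N, triplet_dim), one semantic feature vector per
        #           subject-predicate-object triplet in the image.
        # question: (B, question_dim), pooled question embedding.
        t = self.t_proj(triplets)                       # (B, N, H)
        q = self.q_proj(question).unsqueeze(1)          # (B, 1, H)

        # Question-conditioned relevance of each triplet.
        scores = self.mutual_score(torch.tanh(t + q))   # (B, N, 1)
        weights = F.softmax(scores, dim=1)

        # Self attention over triplets to capture inter-triplet context.
        ctx, _ = self.self_attn(t, t, t)                # (B, N, H)

        # Question-weighted summary of the attended triplet features.
        fused = (weights * ctx).sum(dim=1)              # (B, H)
        return fused, weights.squeeze(-1)
```

For a batch of 2 images with 36 triplets and 512-dimensional features, fused would be a (2, 512) tensor that could then be combined with other multi-modal features before answer classification, while the returned weights indicate which triplets the model deemed relevant to the question.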
Related papers
- Multimodal Relational Triple Extraction with Query-based Entity Object Transformer [20.97497765985682]
Multimodal Relation Extraction is crucial for constructing flexible and realistic knowledge.
We propose Multimodal Entity-Object Triple Extraction, which aims to extract all triples (entity, relation, object region) from image-text pairs.
We also propose QEOT, a query-based model with a selective attention mechanism to dynamically explore the interaction and fusion of textual and visual information.
arXiv Detail & Related papers (2024-08-16T12:43:38Z) - Scene-Graph ViT: End-to-End Open-Vocabulary Visual Relationship Detection [14.22646492640906]
We propose a simple and highly efficient decoder-free architecture for open-vocabulary visual relationship detection.
Our model consists of a Transformer-based image encoder that represents objects as tokens and models their relationships implicitly.
Our approach achieves state-of-the-art relationship detection performance on Visual Genome and on the large-vocabulary GQA benchmark at real-time inference speeds.
arXiv Detail & Related papers (2024-03-21T10:15:57Z) - Unified Visual Relationship Detection with Vision and Language Models [89.77838890788638]
This work focuses on training a single visual relationship detector predicting over the union of label spaces from multiple datasets.
We propose UniVRD, a novel bottom-up method for Unified Visual Relationship Detection by leveraging vision and language models.
Empirical results on both human-object interaction detection and scene-graph generation demonstrate the competitive performance of our model.
arXiv Detail & Related papers (2023-03-16T00:06:28Z) - Prototype-based Embedding Network for Scene Graph Generation [105.97836135784794]
Current Scene Graph Generation (SGG) methods explore contextual information to predict relationships among entity pairs.
Due to the diverse visual appearance of numerous possible subject-object combinations, there is a large intra-class variation within each predicate category.
Prototype-based Embedding Network (PE-Net) models entities/predicates with prototype-aligned compact and distinctive representations.
Prototype-guided Learning (PL) is introduced to help PE-Net efficiently learn such entity-predicate matching, and Prototype Regularization (PR) is devised to relieve ambiguous entity-predicate matching.
arXiv Detail & Related papers (2023-03-13T13:30:59Z) - RelViT: Concept-guided Vision Transformer for Visual Relational
Reasoning [139.0548263507796]
We use vision transformers (ViTs) as our base model for visual reasoning.
We make better use of concepts defined as object entities and their relations to improve the reasoning ability of ViTs.
We show the resulting model, Concept-guided Vision Transformer (or RelViT for short), significantly outperforms prior approaches on HICO and GQA benchmarks.
arXiv Detail & Related papers (2022-04-24T02:46:43Z) - Unified Graph Structured Models for Video Understanding [93.72081456202672]
We propose a message passing graph neural network that explicitly models relational-temporal relations.
We show how our method is able to more effectively model relationships between relevant entities in the scene.
arXiv Detail & Related papers (2021-03-29T14:37:35Z) - Relationship-based Neural Baby Talk [10.342180619706724]
We study three main relationships: spatial relationships to explore geometric interactions, semantic relationships to extract semantic interactions, and implicit relationships to capture hidden information.
Our proposed R-NBT model outperforms state-of-the-art models trained on the COCO dataset in three image caption generation tasks.
arXiv Detail & Related papers (2021-03-08T15:51:24Z) - Modeling Global Semantics for Question Answering over Knowledge Bases [16.341353183347664]
We present a relational graph convolutional network (RGCN)-based model gRGCN for semantic parsing in KBQA.
Experiments on benchmark datasets show that our model outperforms off-the-shelf models.
arXiv Detail & Related papers (2021-01-05T13:51:14Z) - Cascaded Human-Object Interaction Recognition [175.60439054047043]
We introduce a cascade architecture for a multi-stage, coarse-to-fine HOI understanding.
At each stage, an instance localization network progressively refines HOI proposals and feeds them into an interaction recognition network.
With our carefully-designed human-centric relation features, these two modules work collaboratively towards effective interaction understanding.
arXiv Detail & Related papers (2020-03-09T17:05:04Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.