Relationship-based Neural Baby Talk
- URL: http://arxiv.org/abs/2103.04846v1
- Date: Mon, 8 Mar 2021 15:51:24 GMT
- Title: Relationship-based Neural Baby Talk
- Authors: Fan Fu, Tingting Xie, Ioannis Patras, Sepehr Jalali
- Abstract summary: We study three main relationships: spatial relationships to explore geometric interactions, semantic relationships to extract semantic interactions, and implicit relationships to capture hidden information.
Our proposed R-NBT model outperforms state-of-the-art models trained on the COCO dataset in three image caption generation tasks.
- Score: 10.342180619706724
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Understanding interactions between objects in an image is an important
element for generating captions. In this paper, we propose a relationship-based
neural baby talk (R-NBT) model to comprehensively investigate several types of
pairwise object interactions by encoding each image via three different
relationship-based graph attention networks (GATs). We study three main
relationships: spatial relationships to explore geometric interactions,
semantic relationships to extract semantic interactions, and implicit
relationships to capture hidden information that could
not be modelled explicitly as above. We construct three relationship graphs
with the objects in an image as nodes, and the mutual relationships of pairwise
objects as edges. By exploring features of neighbouring regions individually
via GATs, we integrate different types of relationships into visual features of
each node. Experiments on the COCO dataset show that our proposed R-NBT model
outperforms state-of-the-art models trained on the same dataset in three image
caption generation tasks.
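The following is a minimal sketch of how one of these relationship graphs could be encoded with graph attention, in the spirit of the description above: object region features act as nodes, a binary adjacency matrix marks which pairs of objects are related, and attention is computed only over connected neighbours. The class name, feature dimensions, single attention head, and fusion comments are illustrative assumptions, not the authors' implementation.

```python
# Sketch of graph-attention encoding over one relationship graph (spatial,
# semantic, or implicit). Hypothetical names and sizes; not the R-NBT code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RelationGATLayer(nn.Module):
    """One graph-attention layer over a relationship graph."""
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.proj = nn.Linear(in_dim, out_dim, bias=False)
        # Attention scores are computed from concatenated node-pair features.
        self.attn = nn.Linear(2 * out_dim, 1, bias=False)

    def forward(self, nodes: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # nodes: (N, in_dim) region features; adj: (N, N) 0/1 relationship mask
        h = self.proj(nodes)                                   # (N, out_dim)
        n = h.size(0)
        pairs = torch.cat(
            [h.unsqueeze(1).expand(n, n, -1),                  # node i
             h.unsqueeze(0).expand(n, n, -1)], dim=-1)         # node j
        e = F.leaky_relu(self.attn(pairs).squeeze(-1))         # (N, N) raw scores
        e = e.masked_fill(adj == 0, float('-inf'))             # keep related pairs only
        alpha = torch.softmax(e, dim=-1)                       # attention over neighbours
        alpha = torch.nan_to_num(alpha)                        # isolated nodes -> zero weights
        return F.elu(alpha @ h)                                # relationship-aware node features

# Usage: three such layers (spatial, semantic, implicit) would each receive
# their own adjacency matrix; their outputs would then be fused back into the
# per-object visual features before caption decoding.
regions = torch.randn(5, 2048)                   # e.g. detector region features
spatial_adj = (torch.rand(5, 5) > 0.5).float()   # toy spatial relationship graph
spatial_gat = RelationGATLayer(2048, 512)
spatial_feats = spatial_gat(regions, spatial_adj)  # (5, 512)
```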
Related papers
- Hire: Hybrid-modal Interaction with Multiple Relational Enhancements for Image-Text Matching [7.7559623054251]
Image-text matching (ITM) is a fundamental problem in computer vision.
We propose a Hybrid-modal feature Interaction with multiple Relational Enhancements (termed Hire) for image-text matching.
In particular, the explicit intra-modal spatial-semantic graph-based reasoning network is designed to improve the contextual representation of visual objects.
arXiv Detail & Related papers (2024-06-05T13:10:55Z) - Relation Rectification in Diffusion Model [64.84686527988809]
We introduce a novel task termed Relation Rectification, aiming to refine the model to accurately represent a given relationship it initially fails to generate.
We propose an innovative solution utilizing a Heterogeneous Graph Convolutional Network (HGCN).
The lightweight HGCN adjusts the text embeddings generated by the text encoder, ensuring the accurate reflection of the textual relation in the embedding space.
arXiv Detail & Related papers (2024-03-29T15:54:36Z) - Semantic Scene Graph Generation Based on an Edge Dual Scene Graph and
Message Passing Neural Network [3.9280441311534653]
Scene graph generation (SGG) captures the relationships between objects in an image and creates a structured graph-based representation.
Existing SGG methods have a limited ability to accurately predict detailed relationships.
A new approach to modelling multi-object relationships, called edge dual scene graph generation (EdgeSGG), is proposed herein.
arXiv Detail & Related papers (2023-11-02T12:36:52Z) - A Masked Image Reconstruction Network for Document-level Relation
Extraction [3.276435438007766]
Document-level relation extraction requires inference over multiple sentences to extract complex relational triples.
We propose a novel Document-level Relation Extraction model based on a Masked Image Reconstruction network (DRE-MIR).
We evaluate our model on three public document-level relation extraction datasets.
arXiv Detail & Related papers (2022-04-21T02:41:21Z) - Relationformer: A Unified Framework for Image-to-Graph Generation [18.832626244362075]
This work proposes a unified one-stage transformer-based framework, namely Relationformer, that jointly predicts objects and their relations.
We leverage direct set-based object prediction and incorporate the interaction among the objects to learn an object-relation representation jointly.
We achieve state-of-the-art performance on multiple, diverse and multi-domain datasets.
arXiv Detail & Related papers (2022-03-19T00:36:59Z) - Transformer-based Dual Relation Graph for Multi-label Image Recognition [56.12543717723385]
We propose a novel Transformer-based Dual Relation learning framework.
We explore two aspects of correlation, i.e., structural relation graph and semantic relation graph.
Our approach achieves new state-of-the-art on two popular multi-label recognition benchmarks.
arXiv Detail & Related papers (2021-10-10T07:14:52Z) - Unified Graph Structured Models for Video Understanding [93.72081456202672]
We propose a message passing graph neural network that explicitly models relational-temporal relations.
We show how our method is able to more effectively model relationships between relevant entities in the scene.
arXiv Detail & Related papers (2021-03-29T14:37:35Z) - Tensor Composition Net for Visual Relationship Prediction [115.14829858763399]
We present a novel Tensor Composition Network (TCN) to predict visual relationships in images.
The key idea of our TCN is to exploit the low rank property of the visual relationship tensor.
We show our TCN's image-level visual relationship prediction provides a simple and efficient mechanism for relation-based image retrieval.
arXiv Detail & Related papers (2020-12-10T06:27:20Z) - Attention Guided Semantic Relationship Parsing for Visual Question
Answering [36.84737596725629]
Humans explain inter-object relationships with semantic labels that demonstrate a high-level understanding required to perform Vision-Language tasks such as Visual Question Answering (VQA).
Existing VQA models represent relationships as a combination of object-level visual features which constrain a model to express interactions between objects in a single domain, while the model is trying to solve a multi-modal task.
In this paper, we propose a general purpose semantic relationship parser which generates a semantic feature vector for each subject-predicate-object triplet in an image, and a Mutual and Self Attention mechanism that learns to identify relationship triplets that are important to answering the given question.
arXiv Detail & Related papers (2020-10-05T00:23:49Z) - ConsNet: Learning Consistency Graph for Zero-Shot Human-Object
Interaction Detection [101.56529337489417]
We consider the problem of Human-Object Interaction (HOI) Detection, which aims to locate and recognize HOI instances in the form of <human, action, object> in images.
We argue that multi-level consistencies among objects, actions and interactions are strong cues for generating semantic representations of rare or previously unseen HOIs.
Our model takes visual features of candidate human-object pairs and word embeddings of HOI labels as inputs, maps them into visual-semantic joint embedding space and obtains detection results by measuring their similarities.
arXiv Detail & Related papers (2020-08-14T09:11:18Z) - Expressing Objects just like Words: Recurrent Visual Embedding for
Image-Text Matching [102.62343739435289]
Existing image-text matching approaches infer the similarity of an image-text pair by capturing and aggregating the affinities between the text and each independent object of the image.
We propose a Dual Path Recurrent Neural Network (DP-RNN) which processes images and sentences symmetrically by recurrent neural networks (RNNs).
Our model achieves the state-of-the-art performance on Flickr30K dataset and competitive performance on MS-COCO dataset.
arXiv Detail & Related papers (2020-02-20T00:51:01Z)