Grounding Scene Graphs on Natural Images via Visio-Lingual Message Passing
- URL: http://arxiv.org/abs/2211.01969v1
- Date: Thu, 3 Nov 2022 16:46:46 GMT
- Authors: Aditay Tripathi, Anand Mishra, Anirban Chakraborty
- Abstract summary: This paper presents a framework for jointly grounding objects that follow certain semantic relationship constraints in a scene graph.
A scene graph is an efficient and structured way to represent all the objects and their semantic relationships in the image.
- Score: 17.63475613154152
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper presents a framework for jointly grounding objects that follow
certain semantic relationship constraints given in a scene graph. A typical
natural scene contains several objects, often exhibiting visual relationships
of varied complexities between them. These inter-object relationships provide
strong contextual cues toward improving grounding performance compared to a
traditional object query-only-based localization task. A scene graph is an
efficient and structured way to represent all the objects and their semantic
relationships in the image. To bridge these two modalities of scene
representation and to use contextual information to improve object
localization, we rigorously study the problem of grounding scene graphs on
natural images. To this end, we propose a novel graph neural network-based
approach referred to as Visio-Lingual Message PAssing Graph Neural Network
(VL-MPAG Net). In VL-MPAG Net, we first construct a directed graph with object
proposals as nodes and an edge between a pair of nodes representing a plausible
relation between them. Then a three-step inter-graph and intra-graph message
passing is performed to learn the context-dependent representation of the
proposals and query objects. These object representations are used to score the
proposals to generate object localization. The proposed method significantly
outperforms the baselines on four public datasets.
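
The pipeline described in the abstract (a directed proposal graph, three message-passing steps, then proposal scoring) can be illustrated with a toy sketch. This is not the authors' VL-MPAG Net: the function names (`mean_neighbors`, `cross_attend`, `ground`), the mean and softmax aggregators, the order of the three steps, the assumption that every ordered pair of proposals is a plausible relation, and the dot-product scoring are all hypothetical choices made for illustration, with no learned parameters.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def add(a, b):
    return [x + y for x, y in zip(a, b)]

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def mean_neighbors(feats, edges):
    """Intra-graph step: each node adds the mean feature of its incoming neighbors."""
    out = []
    for i, f in enumerate(feats):
        incoming = [feats[src] for src, dst in edges if dst == i]
        if incoming:
            mean = [sum(col) / len(incoming) for col in zip(*incoming)]
            f = add(f, mean)
        out.append(f)
    return out

def cross_attend(targets, sources):
    """Inter-graph step: each target adds a similarity-weighted sum of source features."""
    out = []
    for t in targets:
        weights = softmax([dot(t, s) for s in sources])
        ctx = [0.0] * len(t)
        for w, s in zip(weights, sources):
            ctx = add(ctx, [w * x for x in s])
        out.append(add(t, ctx))
    return out

def ground(proposal_feats, query_feats, query_edges):
    """Return, for each query node, the index of the best-scoring proposal."""
    n = len(proposal_feats)
    # Directed proposal graph. Assumption: every ordered pair is a plausible relation.
    proposal_edges = [(i, j) for i in range(n) for j in range(n) if i != j]
    q = mean_neighbors(query_feats, query_edges)   # step 1: intra-graph (scene graph)
    p = cross_attend(proposal_feats, q)            # step 2: inter-graph (query -> proposals)
    p = mean_neighbors(p, proposal_edges)          # step 3: intra-graph (proposal graph)
    # Score every proposal against every query node with a dot product.
    scores = [[dot(qf, pf) for pf in p] for qf in q]
    return [max(range(n), key=row.__getitem__) for row in scores]

# Toy usage: three proposals, and a two-node query graph with one directed edge.
proposals = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
query_nodes = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
query_edges = [(0, 1)]
print(ground(proposals, query_nodes, query_edges))  # prints [0, 1]
```

In this toy setup each query node is matched to the proposal whose feature it started closest to, even after context from the other graph and from neighboring nodes has been mixed in. A real system would replace the hand-set vectors with learned visual and language embeddings and the fixed aggregators with trained layers.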
Related papers
- Composing Object Relations and Attributes for Image-Text Matching [70.47747937665987]
This work introduces a dual-encoder image-text matching model, leveraging a scene graph to represent captions with nodes for objects and attributes interconnected by relational edges.
Our model efficiently encodes object-attribute and object-object semantic relations, resulting in a robust and fast-performing system.
arXiv Detail & Related papers (2024-06-17T17:56:01Z)
- Coarse-to-Fine Contrastive Learning in Image-Text-Graph Space for Improved Vision-Language Compositionality [50.48859793121308]
Contrastively trained vision-language models have achieved remarkable progress in vision and language representation learning.
Recent research has highlighted severe limitations in their ability to perform compositional reasoning over objects, attributes, and relations.
arXiv Detail & Related papers (2023-05-23T08:28:38Z)
- Unbiased Heterogeneous Scene Graph Generation with Relation-aware Message Passing Neural Network [9.779600950401315]
We propose an unbiased heterogeneous scene graph generation (HetSGG) framework that captures relation-aware context.
We devise a novel message passing layer, called relation-aware message passing neural network (RMP), that aggregates the contextual information of an image.
arXiv Detail & Related papers (2022-12-01T11:25:36Z)
- Image Semantic Relation Generation [0.76146285961466]
Scene graphs can distil complex image information and correct the bias of visual models using semantic-level relations.
In this work, we introduce image semantic relation generation (ISRG), a simple but effective image-to-text model.
arXiv Detail & Related papers (2022-10-19T16:15:19Z)
- Relation Regularized Scene Graph Generation [206.76762860019065]
Scene graph generation (SGG) is built on top of detected objects to predict object pairwise visual relations.
We propose a relation regularized network (R2-Net) which can predict whether there is a relationship between two objects.
Our R2-Net can effectively refine object labels and generate scene graphs.
arXiv Detail & Related papers (2022-02-22T11:36:49Z)
- Scenes and Surroundings: Scene Graph Generation using Relation Transformer [13.146732454123326]
This work proposes a novel local-context aware architecture named relation transformer.
Our hierarchical multi-head attention-based approach efficiently captures contextual dependencies between objects and predicts their relationships.
In comparison to state-of-the-art approaches, we have achieved an overall mean improvement of 4.85%.
arXiv Detail & Related papers (2021-07-12T14:22:20Z)
- Segmentation-grounded Scene Graph Generation [47.34166260639392]
We propose a framework for pixel-level segmentation-grounded scene graph generation.
Our framework is agnostic to the underlying scene graph generation method.
It is learned in a multi-task manner with both target and auxiliary datasets.
arXiv Detail & Related papers (2021-04-29T08:54:08Z)
- Exploiting Relationship for Complex-scene Image Generation [43.022978211274065]
This work explores relationship-aware complex-scene image generation, where multiple objects are inter-related as a scene graph.
We propose three major updates in the generation framework. First, reasonable spatial layouts are inferred by jointly considering the semantics and relationships among objects.
Second, since the relations between objects significantly influence an object's appearance, we design a relation-guided generator to generate objects reflecting their relationships.
Third, a novel scene graph discriminator is proposed to guarantee the consistency between the generated image and the input scene graph.
arXiv Detail & Related papers (2021-04-01T09:21:39Z)
- ORD: Object Relationship Discovery for Visual Dialogue Generation [60.471670447176656]
We propose an object relationship discovery (ORD) framework to preserve the object interactions for visual dialogue generation.
A hierarchical graph convolutional network (HierGCN) is proposed to retain the object nodes and neighbour relationships locally, and then refine the object-object connections globally.
Experiments have proved that the proposed method can significantly improve the quality of dialogue by utilising the contextual information of visual relationships.
arXiv Detail & Related papers (2020-06-15T12:25:40Z)
- Expressing Objects just like Words: Recurrent Visual Embedding for Image-Text Matching [102.62343739435289]
Existing image-text matching approaches infer the similarity of an image-text pair by capturing and aggregating the affinities between the text and each independent object of the image.
We propose a Dual Path Recurrent Neural Network (DP-RNN) which processes images and sentences symmetrically using recurrent neural networks (RNNs).
Our model achieves state-of-the-art performance on the Flickr30K dataset and competitive performance on the MS-COCO dataset.
arXiv Detail & Related papers (2020-02-20T00:51:01Z)