Learning to Generate Scene Graph from Natural Language Supervision
- URL: http://arxiv.org/abs/2109.02227v1
- Date: Mon, 6 Sep 2021 03:38:52 GMT
- Title: Learning to Generate Scene Graph from Natural Language Supervision
- Authors: Yiwu Zhong, Jing Shi, Jianwei Yang, Chenliang Xu, Yin Li
- Abstract summary: We propose one of the first methods that learn from image-sentence pairs to extract a graphical representation of localized objects and their relationships within an image, known as a scene graph.
We leverage an off-the-shelf object detector to identify and localize object instances, match labels of detected regions to concepts parsed from captions, and thus create "pseudo" labels for learning scene graphs.
- Score: 52.18175340725455
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Learning from image-text data has demonstrated recent success for many
recognition tasks, yet is currently limited to visual features or individual
visual concepts such as objects. In this paper, we propose one of the first
methods that learn from image-sentence pairs to extract a graphical
representation of localized objects and their relationships within an image,
known as a scene graph. To bridge the gap between images and texts, we leverage
an off-the-shelf object detector to identify and localize object instances,
match labels of detected regions to concepts parsed from captions, and thus
create "pseudo" labels for learning scene graph. Further, we design a
Transformer-based model to predict these "pseudo" labels via a masked token
prediction task. Learning from only image-sentence pairs, our model achieves
30% relative gain over the latest method trained with human-annotated unlocalized
scene graphs. Our model also shows strong results for weakly and fully
supervised scene graph generation. In addition, we explore an open-vocabulary
setting for detecting scene graphs, and present the first result for open-set
scene graph generation. Our code is available at
https://github.com/YiwuZhong/SGG_from_NLS.
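To make the pseudo-label step concrete, here is a minimal sketch of the matching idea: detector regions whose labels line up with concepts parsed from the caption become grounded subject and object nodes, and the connecting phrase becomes the predicate. This is an illustration, not the released implementation (see the repository above); the `Region` type, the hard-coded `parse_triples` stub, and the exact-string matching are all simplifying assumptions.

```python
# A minimal, illustrative sketch of the pseudo-label idea (not the released
# implementation): match detector region labels to caption concepts to build
# localized (subject, predicate, object) training labels.
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Region:
    """One detection from an off-the-shelf object detector (hypothetical type)."""
    label: str                               # e.g. "man"
    box: Tuple[float, float, float, float]   # (x1, y1, x2, y2)
    score: float


def parse_triples(caption: str) -> List[Tuple[str, str, str]]:
    """Stand-in for a language parser extracting (subject, predicate, object)
    triples; real pipelines use a scene-graph or dependency parser."""
    return [("man", "riding", "horse")]  # hard-coded for the demo caption below


def create_pseudo_labels(regions: List[Region], caption: str):
    """Ground each parsed triple by matching its concepts to detected regions."""
    pseudo = []
    for subj, pred, obj in parse_triples(caption):
        subj_matches = [r for r in regions if r.label == subj]
        obj_matches = [r for r in regions if r.label == obj]
        if subj_matches and obj_matches:
            s = max(subj_matches, key=lambda r: r.score)  # best-scoring match
            o = max(obj_matches, key=lambda r: r.score)
            pseudo.append((s, pred, o))  # a grounded "pseudo" relation label
    return pseudo


regions = [
    Region("man", (10, 20, 110, 220), 0.92),
    Region("horse", (80, 60, 300, 260), 0.88),
    Region("tree", (250, 0, 400, 200), 0.75),
]
for s, pred, o in create_pseudo_labels(regions, "a man riding a horse"):
    print(f"({s.label} @ {s.box}) --{pred}--> ({o.label} @ {o.box})")
```

In the paper these pseudo labels then supervise a Transformer trained with a masked token prediction task; the sketch stops at label creation.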
Related papers
- Open-Vocabulary Object Detection via Scene Graph Discovery [53.27673119360868]
Open-vocabulary (OV) object detection has attracted increasing research attention.
We propose a novel Scene-Graph-Based Discovery Network (SGDN) that exploits scene graph cues for OV detection.
arXiv Detail & Related papers (2023-07-07T00:46:19Z) - FACTUAL: A Benchmark for Faithful and Consistent Textual Scene Graph
Parsing [66.70054075041487]
Existing scene graph parsers that convert image captions into scene graphs often suffer from two types of errors.
First, the generated scene graphs fail to capture the true semantics of the captions or the corresponding images, resulting in a lack of faithfulness.
Second, the generated scene graphs are highly inconsistent, with the same semantics represented by different annotations (both error types are illustrated in the sketch after this entry).
arXiv Detail & Related papers (2023-05-27T15:38:31Z)
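As a toy illustration of the two error types above (not FACTUAL's code): an unfaithful parse distorts the caption's meaning, and inconsistent parses encode the same meaning with different annotations. The tiny canonicalization table below is one naive mitigation for the latter, invented here for illustration.

```python
# Toy illustration (not FACTUAL's code) of the two error types in textual
# scene graph parsing: unfaithful parses distort the caption's meaning, and
# inconsistent parses encode the same meaning with different annotations.
caption = "a man riding a horse"

unfaithful_parse = [("man", "on", "bench")]      # object not in the caption
parse_a = [("man", "riding", "horse")]           # parser A
parse_b = [("person", "rides on", "horse")]      # parser B, same meaning

# A naive mitigation for inconsistency: map terms to a canonical vocabulary.
CANONICAL = {"person": "man", "rides on": "riding"}

def normalize(triple):
    return tuple(CANONICAL.get(term, term) for term in triple)

print("object appears in caption?", unfaithful_parse[0][2] in caption)  # False
print({normalize(t) for t in parse_a + parse_b})  # collapses to one annotation
```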
- Iterative Scene Graph Generation with Generative Transformers [6.243995448840211]
Scene graphs provide a rich, structured representation of a scene by encoding the entities (objects) and their spatial relationships in a graphical format.
Current methods follow a generation-by-classification approach, producing the scene graph by labeling every possible edge between objects in a scene (a toy version is sketched after this entry).
This work introduces a generative transformer-based approach to generating scene graphs beyond link prediction.
arXiv Detail & Related papers (2022-11-30T00:05:44Z)
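The generation-by-classification baseline that the entry above contrasts against can be sketched in a few lines: enumerate every ordered pair of detected objects and classify a predicate (including a "no relation" class) for each. The object list, predicate vocabulary, and random scorer are toy stand-ins for detector outputs and a learned classification head.

```python
# A toy sketch of the generation-by-classification baseline: classify a
# predicate (including "no relation") for every ordered pair of objects.
import itertools
import random

random.seed(0)
OBJECTS = ["man", "horse", "tree"]
PREDICATES = ["no_relation", "on", "riding", "near"]

def predicate_scores(subj: str, obj: str):
    """Stand-in for a learned scorer over features of the (subj, obj) pair."""
    return [random.random() for _ in PREDICATES]

edges = []
for subj, obj in itertools.permutations(OBJECTS, 2):  # all ordered pairs
    scores = predicate_scores(subj, obj)
    best = max(range(len(PREDICATES)), key=scores.__getitem__)
    if PREDICATES[best] != "no_relation":
        edges.append((subj, PREDICATES[best], obj))

print(edges)  # the predicted scene graph as labelled, directed edges
```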
- Diffusion-Based Scene Graph to Image Generation with Masked Contrastive Pre-Training [112.94542676251133]
We propose to learn scene graph embeddings by directly optimizing their alignment with images.
Specifically, we pre-train an encoder to extract both global and local information from scene graphs.
The resulting method, called SGDiff, allows for semantic manipulation of generated images by modifying scene graph nodes and connections (a toy version of the alignment loss is sketched after this entry).
arXiv Detail & Related papers (2022-11-21T01:11:19Z)
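A minimal, hypothetical version of aligning scene-graph embeddings with image embeddings is an InfoNCE-style contrastive loss, sketched below. The two-dimensional toy vectors stand in for the outputs of graph and image encoders, and the loss is a generic formulation rather than SGDiff's exact objective.

```python
# A generic InfoNCE-style contrastive loss aligning scene-graph embeddings
# with image embeddings; not SGDiff's exact objective.
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def info_nce(graph_embs, image_embs, temperature=0.1):
    """Each graph embedding should score highest against its paired image."""
    loss = 0.0
    for i, g in enumerate(graph_embs):
        logits = [dot(g, im) / temperature for im in image_embs]
        log_denom = math.log(sum(math.exp(l) for l in logits))
        loss += log_denom - logits[i]  # cross-entropy with target index i
    return loss / len(graph_embs)

graphs = [[1.0, 0.0], [0.0, 1.0]]  # toy scene-graph encoder outputs
images = [[0.9, 0.1], [0.2, 0.8]]  # paired image encoder outputs (by index)
print(round(info_nce(graphs, images), 4))
```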
- Scene Graph Generation for Better Image Captioning? [48.411957217304]
We propose a model that leverages detected objects and auto-generated visual relationships to describe images in natural language.
We generate a scene graph from raw image pixels by identifying individual objects and visual relationships between them.
This scene graph then serves as input to our graph-to-text model, which generates the final caption.
arXiv Detail & Related papers (2021-09-23T14:35:11Z)
- Unconditional Scene Graph Generation [72.53624470737712]
We develop a deep auto-regressive model called SceneGraphGen that learns the probability distribution over labelled, directed graphs (a toy autoregressive sampler is sketched after this entry).
We show that the scene graphs generated by SceneGraphGen are diverse and follow the semantic patterns of real-world scenes.
arXiv Detail & Related papers (2021-08-12T17:57:16Z)
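A toy version of unconditional, autoregressive graph generation: sample a node label, then sample a directed, labelled edge (or no edge) to each existing node, and repeat. The uniform random choices below stand in for SceneGraphGen's learned conditional distributions; labels and vocabulary are made up.

```python
# A toy autoregressive graph sampler: add one node at a time, then sample a
# directed, labelled edge (or no edge) to each existing node.
import random

random.seed(1)
NODE_LABELS = ["man", "horse", "tree", "sky"]
EDGE_LABELS = [None, "on", "riding", "near"]   # None means "no edge"

nodes, edges = [], []
for _ in range(3):                              # grow the graph node by node
    nodes.append(random.choice(NODE_LABELS))    # ~ p(node | graph so far)
    new = len(nodes) - 1
    for old in range(new):                      # ~ p(edge | new node, old node)
        rel = random.choice(EDGE_LABELS)
        if rel is not None:
            edges.append((nodes[new], rel, nodes[old]))

print("nodes:", nodes)
print("edges:", edges)
```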
- A Deep Local and Global Scene-Graph Matching for Image-Text Retrieval [4.159666152160874]
Scene graph representation is well suited to the image-text matching task.
We introduce the Local and Global Scene Graph Matching (LGSGM) model, which enhances the state-of-the-art method.
Combining the local and global levels improves the baseline, increasing recall by more than 10% on the Flickr30k dataset.
arXiv Detail & Related papers (2021-06-04T10:33:14Z)
- Segmentation-grounded Scene Graph Generation [47.34166260639392]
We propose a framework for pixel-level segmentation-grounded scene graph generation.
Our framework is agnostic to the underlying scene graph generation method.
It is learned in a multi-task manner with both target and auxiliary datasets.
arXiv Detail & Related papers (2021-04-29T08:54:08Z)