Scene Graph Generation for Better Image Captioning?
- URL: http://arxiv.org/abs/2109.11398v1
- Date: Thu, 23 Sep 2021 14:35:11 GMT
- Title: Scene Graph Generation for Better Image Captioning?
- Authors: Maximilian Mozes, Martin Schmitt, Vladimir Golkov, Hinrich Schütze,
Daniel Cremers
- Abstract summary: We propose a model that leverages detected objects and auto-generated visual relationships to describe images in natural language.
We generate a scene graph from raw image pixels by identifying individual objects and visual relationships between them.
This scene graph then serves as input to our graph-to-text model, which generates the final caption.
- Score: 48.411957217304
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We investigate the incorporation of visual relationships into the task of
supervised image caption generation by proposing a model that leverages
detected objects and auto-generated visual relationships to describe images in
natural language. To do so, we first generate a scene graph from raw image
pixels by identifying individual objects and visual relationships between them.
This scene graph then serves as input to our graph-to-text model, which
generates the final caption. In contrast to previous approaches, our model thus
explicitly models the detection of objects and visual relationships in the
image. For our experiments we construct a new dataset from the intersection of
Visual Genome and MS COCO, consisting of images with both a corresponding gold
scene graph and human-authored caption. Our results show that our methods
outperform existing state-of-the-art end-to-end models that generate image
descriptions directly from raw input pixels when compared in terms of the BLEU
and METEOR evaluation metrics.
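The abstract describes a two-stage pipeline: stage one extracts a scene graph from raw pixels, and stage two feeds that graph to a graph-to-text model that produces the caption. The sketch below illustrates only the interface between the stages; the triple format and the linearization scheme are assumptions for illustration, not the paper's actual model.

```python
# Illustrative sketch of the two-stage pipeline: stage 1 would emit a scene
# graph (here hard-coded as subject-predicate-object triples), and stage 2
# would consume a linearized form of that graph as model input.
# The <s>/<p>/<o> marker tokens are a hypothetical encoding, not the
# paper's actual input format.

def linearize_scene_graph(triples):
    """Flatten (subject, predicate, object) triples into a token sequence
    suitable as input to a graph-to-text model."""
    tokens = []
    for subj, pred, obj in triples:
        tokens.extend(["<s>", subj, "<p>", pred, "<o>", obj])
    return " ".join(tokens)

# A toy scene graph that stage 1 might produce for an image.
scene_graph = [
    ("man", "riding", "horse"),
    ("horse", "on", "beach"),
]

print(linearize_scene_graph(scene_graph))
# → <s> man <p> riding <o> horse <s> horse <p> on <o> beach
```

Because the scene graph is an explicit intermediate representation, the object and relationship detections are inspectable before captioning, which is the contrast with end-to-end pixel-to-text models that the abstract draws.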
Related papers
- Joint Generative Modeling of Scene Graphs and Images via Diffusion
Models [37.788957749123725]
We present a novel generative task: joint scene graph - image generation.
We introduce a novel diffusion model, DiffuseSG, that jointly models the adjacency matrix along with heterogeneous node and edge attributes.
With a graph transformer being the denoiser, DiffuseSG successively denoises the scene graph representation in a continuous space and discretizes the final representation to generate the clean scene graph.
arXiv Detail & Related papers (2024-01-02T10:10:29Z)
- FACTUAL: A Benchmark for Faithful and Consistent Textual Scene Graph
Parsing [66.70054075041487]
Existing scene graph parsers that convert image captions into scene graphs often suffer from two types of errors.
First, the generated scene graphs fail to capture the true semantics of the captions or the corresponding images, resulting in a lack of faithfulness.
Second, the generated scene graphs have high inconsistency, with the same semantics represented by different annotations.
arXiv Detail & Related papers (2023-05-27T15:38:31Z)
- DisPositioNet: Disentangled Pose and Identity in Semantic Image
Manipulation [83.51882381294357]
DisPositioNet is a model that learns a disentangled representation for each object for the task of image manipulation using scene graphs.
Our framework enables the disentanglement of the variational latent embeddings as well as the feature representation in the graph.
arXiv Detail & Related papers (2022-11-10T11:47:37Z)
- Learning to Generate Scene Graph from Natural Language Supervision [52.18175340725455]
We propose one of the first methods that learn from image-sentence pairs to extract a graphical representation of localized objects and their relationships within an image, known as a scene graph.
We leverage an off-the-shelf object detector to identify and localize object instances, match labels of detected regions to concepts parsed from captions, and thus create "pseudo" labels for learning scene graphs.
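The pseudo-labelling idea summarized above can be sketched as follows: detected region labels are matched against concepts parsed from the caption, and the matches become "pseudo" labels for training. The matching rule here (case-insensitive string equality) and the function name are assumptions for illustration, not the paper's actual procedure.

```python
# Hedged sketch of pseudo-label creation: keep only the detected regions
# whose detector label also appears among the concepts parsed from the
# image's caption. Case-insensitive exact matching is an assumed,
# simplified stand-in for the paper's matching step.

def pseudo_label_regions(detections, caption_concepts):
    """Return (region_id, label) pairs where a detector label matches
    a concept mentioned in the caption."""
    concepts = {c.lower() for c in caption_concepts}
    return [(rid, label) for rid, label in detections
            if label.lower() in concepts]

# Toy detector output: (region_id, predicted label).
detections = [(0, "dog"), (1, "frisbee"), (2, "tree")]
# Toy concepts parsed from the caption "A dog catches a frisbee in the park."
caption_concepts = ["Dog", "frisbee", "park"]

print(pseudo_label_regions(detections, caption_concepts))
# → [(0, 'dog'), (1, 'frisbee')]
```

The unmatched region ("tree") receives no pseudo label, so only detections grounded in the caption contribute supervision.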
arXiv Detail & Related papers (2021-09-06T03:38:52Z)
- Unconditional Scene Graph Generation [72.53624470737712]
We develop a deep auto-regressive model called SceneGraphGen which can learn the probability distribution over labelled and directed graphs.
We show that the scene graphs generated by SceneGraphGen are diverse and follow the semantic patterns of real-world scenes.
arXiv Detail & Related papers (2021-08-12T17:57:16Z)
- Image Scene Graph Generation (SGG) Benchmark [58.33119409657256]
There is a surge of interest in image scene graph generation (object and relationship detection).
Due to the lack of a good benchmark, the reported results of different scene graph generation models are not directly comparable.
We have developed a much-needed scene graph generation benchmark based on the maskrcnn-benchmark and several popular models.
arXiv Detail & Related papers (2021-07-27T05:10:09Z)
- A Deep Local and Global Scene-Graph Matching for Image-Text Retrieval [4.159666152160874]
Scene graph representation is a suitable method for the image-text matching challenge.
We introduce the Local and Global Scene Graph Matching (LGSGM) model that enhances the state-of-the-art method.
Our enhancement with the combination of levels can improve the performance of the baseline method by increasing the recall by more than 10% on the Flickr30k dataset.
arXiv Detail & Related papers (2021-06-04T10:33:14Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.