SG2Caps: Revisiting Scene Graphs for Image Captioning
- URL: http://arxiv.org/abs/2102.04990v1
- Date: Tue, 9 Feb 2021 18:00:53 GMT
- Title: SG2Caps: Revisiting Scene Graphs for Image Captioning
- Authors: Subarna Tripathi and Kien Nguyen and Tanaya Guha and Bang Du and Truong Q. Nguyen
- Abstract summary: We propose a framework, SG2Caps, that utilizes only the scene graph labels for competitive image captioning performance.
Our framework outperforms existing scene graph-only captioning models by a large margin (CIDEr score of 110 vs. 71), indicating that scene graphs are a promising representation for image captioning.
- Score: 37.58310822924814
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Mainstream image captioning models rely on Convolutional Neural Network (CNN) image features, with additional attention to salient regions and objects, to generate captions via recurrent models. Recently, scene graph representations of images have been used to augment captioning models so as to leverage their structural semantics, such as object entities, relationships, and attributes. Several studies have noted that naive use of scene graphs from a black-box scene graph generator harms image captioning performance, and that scene graph-based captioning models have to incur the overhead of explicitly using image features to generate decent captions. Addressing these challenges, we propose a framework, SG2Caps, that utilizes only the scene graph labels for competitive image captioning performance. The basic idea is to close the semantic gap between two scene graphs - one derived from the input image and the other from its caption. To achieve this, we leverage the spatial locations of objects and the Human-Object-Interaction (HOI) labels as an additional HOI graph. Our framework outperforms existing scene graph-only captioning models by a large margin (CIDEr score of 110 vs. 71), indicating that scene graphs are a promising representation for image captioning. Direct utilization of the scene graph labels avoids expensive graph convolutions over high-dimensional CNN features, resulting in 49% fewer trainable parameters.
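To make the label-only idea concrete, here is a minimal sketch of a scene-graph encoder that consumes only object class labels, normalized box geometry, and relation/HOI labels, with no CNN region features. This is not the authors' released code; the module names, dimensions, and toy vocabulary are illustrative assumptions.

```python
# Minimal sketch (not the SG2Caps implementation) of a label-only scene-graph
# encoder: node states are built from class-label and box embeddings, and
# relation/HOI labels drive one round of message passing. No high-dimensional
# CNN region features are used, which is what reduces trainable parameters.
import torch
import torch.nn as nn


class LabelOnlySceneGraphEncoder(nn.Module):
    def __init__(self, num_labels, num_relations, dim=128):
        super().__init__()
        self.obj_emb = nn.Embedding(num_labels, dim)     # object class labels
        self.rel_emb = nn.Embedding(num_relations, dim)  # relation / HOI labels
        self.box_proj = nn.Linear(4, dim)                # normalized (x, y, w, h)
        self.node_update = nn.GRUCell(dim, dim)          # message-passing step

    def forward(self, obj_labels, boxes, edges, rel_labels):
        # Node states start from label + spatial embeddings (no CNN features).
        h = self.obj_emb(obj_labels) + self.box_proj(boxes)
        # One round of message passing along (subject, relation, object) edges.
        msgs = torch.zeros_like(h)
        for (s, o), r in zip(edges, rel_labels):
            msgs[o] = msgs[o] + h[s] + self.rel_emb(torch.tensor(r))
        return self.node_update(msgs, h)                 # updated node features


if __name__ == "__main__":
    enc = LabelOnlySceneGraphEncoder(num_labels=10, num_relations=5)
    obj_labels = torch.tensor([1, 4, 7])                 # e.g. person, racket, ball
    boxes = torch.rand(3, 4)                             # normalized box geometry
    edges = [(0, 1), (0, 2)]                             # person->racket, person->ball
    rel_labels = [2, 3]                                  # e.g. "holding", "hitting"
    nodes = enc(obj_labels, boxes, edges, rel_labels)
    print(nodes.shape)                                   # torch.Size([3, 128])
    # A caption decoder (not shown) would attend over these node features.
```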
Related papers
- From Pixels to Graphs: Open-Vocabulary Scene Graph Generation with Vision-Language Models [81.92098140232638]
Scene graph generation (SGG) aims to parse a visual scene into an intermediate graph representation for downstream reasoning tasks.
Existing methods struggle to generate scene graphs with novel visual relation concepts.
We introduce a new open-vocabulary SGG framework based on sequence generation.
arXiv Detail & Related papers (2024-04-01T04:21:01Z)
- FACTUAL: A Benchmark for Faithful and Consistent Textual Scene Graph Parsing [66.70054075041487]
Existing scene graph parsers that convert image captions into scene graphs often suffer from two types of errors.
First, the generated scene graphs fail to capture the true semantics of the captions or the corresponding images, resulting in a lack of faithfulness.
Second, the generated scene graphs have high inconsistency, with the same semantics represented by different annotations.
arXiv Detail & Related papers (2023-05-27T15:38:31Z)
- Diffusion-Based Scene Graph to Image Generation with Masked Contrastive Pre-Training [112.94542676251133]
We propose to learn scene graph embeddings by directly optimizing their alignment with images.
Specifically, we pre-train an encoder to extract both global and local information from scene graphs.
The resulting method, called SGDiff, allows for the semantic manipulation of generated images by modifying scene graph nodes and connections.
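As a rough illustration of what directly optimizing graph-image alignment can look like, the sketch below uses a generic symmetric InfoNCE loss between pooled scene-graph embeddings and image embeddings; it is not SGDiff's exact masked-contrastive objective, and the shapes and temperature are assumptions.

```python
# Generic contrastive graph-image alignment sketch (not SGDiff's exact loss):
# each scene-graph embedding is pulled toward its paired image embedding and
# pushed away from the other images in the batch, and vice versa.
import torch
import torch.nn.functional as F


def graph_image_contrastive_loss(graph_emb, image_emb, temperature=0.07):
    g = F.normalize(graph_emb, dim=-1)      # (B, D) pooled scene-graph embeddings
    v = F.normalize(image_emb, dim=-1)      # (B, D) image embeddings
    logits = g @ v.t() / temperature        # (B, B) pairwise similarities
    targets = torch.arange(g.size(0))       # matching pairs lie on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))


if __name__ == "__main__":
    loss = graph_image_contrastive_loss(torch.randn(8, 256), torch.randn(8, 256))
    print(loss.item())
```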
arXiv Detail & Related papers (2022-11-21T01:11:19Z)
- Consensus Graph Representation Learning for Better Grounded Image Captioning [48.208119537050166]
We propose the Consensus Graph Representation Learning framework (CGRL) for grounded image captioning.
We validate the effectiveness of our model, with a significant decline in object hallucination (-9% CHAIRi) on the Flickr30k Entities dataset.
arXiv Detail & Related papers (2021-12-02T04:17:01Z)
- Learning to Generate Scene Graph from Natural Language Supervision [52.18175340725455]
We propose one of the first methods that learns from image-sentence pairs to extract a graphical representation of localized objects and their relationships within an image, known as a scene graph.
We leverage an off-the-shelf object detector to identify and localize object instances, match the labels of detected regions to concepts parsed from the captions, and thus create "pseudo" labels for learning scene graphs.
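A toy sketch of that pseudo-labeling step follows; it is not the paper's implementation, and the parser output format and synonym table are assumptions. Detector labels are matched to concepts parsed from the caption, and grounded (subject, predicate, object) triplets become pseudo scene-graph labels.

```python
# Illustrative pseudo-labeling sketch: match caption-parsed concepts to
# detected-region labels and keep only triplets whose subject and object
# are both grounded in a detection.
def build_pseudo_scene_graph(detections, caption_triplets, synonyms=None):
    """detections: list of (region_id, detector_label);
    caption_triplets: list of (subject, predicate, object) parsed from the caption."""
    synonyms = synonyms or {}

    def match(concept):
        # Return the first detected region whose label matches the concept.
        for region_id, label in detections:
            if label == concept or label in synonyms.get(concept, set()):
                return region_id
        return None

    pseudo_labels = []
    for subj, pred, obj in caption_triplets:
        s, o = match(subj), match(obj)
        if s is not None and o is not None:      # keep only grounded triplets
            pseudo_labels.append((s, pred, o))
    return pseudo_labels


if __name__ == "__main__":
    dets = [(0, "man"), (1, "surfboard"), (2, "wave")]
    trips = [("man", "riding", "surfboard"), ("man", "on", "wave")]
    print(build_pseudo_scene_graph(dets, trips))  # [(0, 'riding', 1), (0, 'on', 2)]
```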
arXiv Detail & Related papers (2021-09-06T03:38:52Z)
- ReFormer: The Relational Transformer for Image Captioning [12.184772369145014]
Image captioning has been shown to achieve better performance when scene graphs are used to represent the relations between objects in an image.
We propose a novel architecture ReFormer to generate features with relation information embedded.
Our model significantly outperforms state-of-the-art methods on image captioning and scene graph generation.
arXiv Detail & Related papers (2021-07-29T17:03:36Z) - MOC-GAN: Mixing Objects and Captions to Generate Realistic Images [21.240099965546637]
We introduce a more rational setting: generating a realistic image from objects and captions.
Under this setting, objects explicitly define the critical roles in the targeted images, while captions implicitly describe their rich attributes and connections.
A MOC-GAN is proposed to mix the inputs of two modalities to generate realistic images.
arXiv Detail & Related papers (2021-06-06T14:04:07Z) - Are scene graphs good enough to improve Image Captioning? [19.36188161855731]
We investigate the use of scene graphs in image captioning.
We find no significant difference between models that use scene graph features and models that only use object detection features.
Although the quality of predicted scene graphs is generally low, using high-quality scene graphs yields gains of up to 3.3 CIDEr.
arXiv Detail & Related papers (2020-09-25T16:09:08Z)