Are scene graphs good enough to improve Image Captioning?
- URL: http://arxiv.org/abs/2009.12313v2
- Date: Tue, 27 Oct 2020 17:55:55 GMT
- Title: Are scene graphs good enough to improve Image Captioning?
- Authors: Victor Milewski and Marie-Francine Moens and Iacer Calixto
- Abstract summary: We investigate the use of scene graphs in image captioning.
We find no significant difference between models that use scene graph features and models that only use object detection features.
Although the quality of predicted scene graphs is very low in general, when using high quality scene graphs we obtain gains of up to 3.3 CIDEr.
- Score: 19.36188161855731
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Many top-performing image captioning models rely solely on object features
computed with an object detection model to generate image descriptions.
However, recent studies propose to directly use scene graphs to introduce
information about object relations into captioning, hoping to better describe
interactions between objects. In this work, we thoroughly investigate the use
of scene graphs in image captioning. We empirically study whether using
additional scene graph encoders can lead to better image descriptions and
propose a conditional graph attention network (C-GAT), where the image
captioning decoder state is used to condition the graph updates. Finally, we
determine to what extent noise in the predicted scene graphs influences caption
quality. Overall, we find no significant difference between models that use
scene graph features and models that only use object detection features across
different captioning metrics, which suggests that existing scene graph
generation models are still too noisy to be useful in image captioning.
Moreover, although the quality of predicted scene graphs is very low in
general, when using high quality scene graphs we obtain gains of up to 3.3
CIDEr compared to a strong Bottom-Up Top-Down baseline. We open source code to
reproduce all our experiments in
https://github.com/iacercalixto/butd-image-captioning.
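The core idea of C-GAT, conditioning each graph attention update on the captioning decoder's state, can be sketched as follows. This is a minimal numpy illustration: the scoring function, parameter names (W, U, v), and shapes are assumptions for exposition, not the paper's exact formulation.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def conditional_graph_attention(node_feats, adj, decoder_state, W, U, v):
    """One conditioned attention update: each node aggregates its neighbors,
    with attention scores that also depend on the decoder state."""
    updated = node_feats.copy()
    for i in range(node_feats.shape[0]):
        neighbors = np.nonzero(adj[i])[0]
        if neighbors.size == 0:
            continue  # isolated node: keep its features unchanged
        # score each neighbor j from node i, node j, and the decoder state
        scores = np.array([
            v @ np.tanh(W @ node_feats[i] + W @ node_feats[j] + U @ decoder_state)
            for j in neighbors
        ])
        alpha = softmax(scores)                 # attention over neighbors
        updated[i] = alpha @ node_feats[neighbors]  # weighted aggregation
    return updated
```

The point of the conditioning term `U @ decoder_state` is that the same graph can be attended to differently at each decoding step, so relations relevant to the word being generated receive more weight.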
Related papers
- FACTUAL: A Benchmark for Faithful and Consistent Textual Scene Graph
Parsing [66.70054075041487]
Existing scene graph parsers that convert image captions into scene graphs often suffer from two types of errors.
First, the generated scene graphs fail to capture the true semantics of the captions or the corresponding images, resulting in a lack of faithfulness.
Second, the generated scene graphs have high inconsistency, with the same semantics represented by different annotations.
arXiv Detail & Related papers (2023-05-27T15:38:31Z)
- SPAN: Learning Similarity between Scene Graphs and Images with Transformers [29.582313604112336]
We propose a Scene graPh-imAge coNtrastive learning framework, SPAN, that can measure the similarity between scene graphs and images.
We introduce a novel graph serialization technique that transforms a scene graph into a sequence with structural encodings.
arXiv Detail & Related papers (2023-04-02T18:13:36Z)
- Diffusion-Based Scene Graph to Image Generation with Masked Contrastive Pre-Training [112.94542676251133]
We propose to learn scene graph embeddings by directly optimizing their alignment with images.
Specifically, we pre-train an encoder to extract both global and local information from scene graphs.
The resulting method, called SGDiff, allows for the semantic manipulation of generated images by modifying scene graph nodes and connections.
arXiv Detail & Related papers (2022-11-21T01:11:19Z)
- Consensus Graph Representation Learning for Better Grounded Image Captioning [48.208119537050166]
We propose the Consensus Graph Representation Learning framework (CGRL) for grounded image captioning.
We validate the effectiveness of our model, with a significant decline in object hallucination (-9% CHAIRi) on the Flickr30k Entities dataset.
arXiv Detail & Related papers (2021-12-02T04:17:01Z)
- Scene Graph Generation for Better Image Captioning? [48.411957217304]
We propose a model that leverages detected objects and auto-generated visual relationships to describe images in natural language.
We generate a scene graph from raw image pixels by identifying individual objects and visual relationships between them.
This scene graph then serves as input to our graph-to-text model, which generates the final caption.
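A scene graph of this kind is commonly represented as a set of (subject, predicate, object) triples, which can be flattened into a sequence before being fed to a graph-to-text model. The toy graph and the separator format below are illustrative assumptions, not the specific serialization used by this paper:

```python
# A toy scene graph as (subject, predicate, object) triples
scene_graph = [
    ("man", "riding", "horse"),
    ("horse", "on", "beach"),
]

def linearize(graph):
    """Flatten the triples into a token sequence a graph-to-text model could read."""
    return " ; ".join(f"{subj} {pred} {obj}" for subj, pred, obj in graph)

print(linearize(scene_graph))  # man riding horse ; horse on beach
```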
arXiv Detail & Related papers (2021-09-23T14:35:11Z)
- Learning to Generate Scene Graph from Natural Language Supervision [52.18175340725455]
We propose one of the first methods that learn from image-sentence pairs to extract a graphical representation of localized objects and their relationships within an image, known as scene graph.
We leverage an off-the-shelf object detector to identify and localize object instances, match labels of detected regions to concepts parsed from captions, and thus create "pseudo" labels for learning scene graph.
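The matching step described above can be sketched as a simple label intersection: keep detected regions whose label appears among the concepts parsed from the caption. The data layout below (label/box pairs, a concept set) is an assumption for illustration, not the paper's actual pipeline:

```python
def pseudo_label_regions(detections, caption_concepts):
    """Keep only detected regions whose label matches a concept parsed from
    the caption; the matched pairs serve as 'pseudo' labels for training."""
    return [(label, box) for label, box in detections if label in caption_concepts]

detections = [("dog", (10, 20, 80, 90)), ("tree", (0, 0, 40, 60))]
caption_concepts = {"dog", "frisbee"}  # e.g. parsed from "a dog catching a frisbee"
print(pseudo_label_regions(detections, caption_concepts))
# [('dog', (10, 20, 80, 90))]
```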
arXiv Detail & Related papers (2021-09-06T03:38:52Z)
- ReFormer: The Relational Transformer for Image Captioning [12.184772369145014]
Image captioning has been shown to achieve better performance by using scene graphs to represent the relations between objects in the image.
We propose a novel architecture ReFormer to generate features with relation information embedded.
Our model significantly outperforms state-of-the-art methods on image captioning and scene graph generation.
arXiv Detail & Related papers (2021-07-29T17:03:36Z)
- Image Scene Graph Generation (SGG) Benchmark [58.33119409657256]
There is a surge of interest in image scene graph generation (object and relationship detection).
Due to the lack of a good benchmark, the reported results of different scene graph generation models are not directly comparable.
We have developed a much-needed scene graph generation benchmark based on the maskrcnn-benchmark and several popular models.
arXiv Detail & Related papers (2021-07-27T05:10:09Z)
- SG2Caps: Revisiting Scene Graphs for Image Captioning [37.58310822924814]
We propose a framework, SG2Caps, that utilizes only the scene graph labels for competitive image captioning performance.
Our framework outperforms existing scene graph-only captioning models by a large margin (CIDEr score of 110 vs 71) indicating scene graphs as a promising representation for image captioning.
arXiv Detail & Related papers (2021-02-09T18:00:53Z)
This list is automatically generated from the titles and abstracts of the papers in this site.