FACTUAL: A Benchmark for Faithful and Consistent Textual Scene Graph
Parsing
- URL: http://arxiv.org/abs/2305.17497v2
- Date: Thu, 1 Jun 2023 04:56:26 GMT
- Title: FACTUAL: A Benchmark for Faithful and Consistent Textual Scene Graph
Parsing
- Authors: Zhuang Li, Yuyang Chai, Terry Yue Zhuo, Lizhen Qu, Gholamreza Haffari,
Fei Li, Donghong Ji, Quan Hung Tran
- Abstract summary: Existing scene graph parsers that convert image captions into scene graphs often suffer from two types of errors.
First, the generated scene graphs fail to capture the true semantics of the captions or the corresponding images, resulting in a lack of faithfulness.
Second, the generated scene graphs have high inconsistency, with the same semantics represented by different annotations.
- Score: 66.70054075041487
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Textual scene graph parsing has become increasingly important in various
vision-language applications, including image caption evaluation and image
retrieval. However, existing scene graph parsers that convert image captions
into scene graphs often suffer from two types of errors. First, the generated
scene graphs fail to capture the true semantics of the captions or the
corresponding images, resulting in a lack of faithfulness. Second, the
generated scene graphs have high inconsistency, with the same semantics
represented by different annotations.
To address these challenges, we propose a novel dataset, which involves
re-annotating the captions in Visual Genome (VG) using a new intermediate
representation called FACTUAL-MR. FACTUAL-MR can be directly converted into
faithful and consistent scene graph annotations. Our experimental results
clearly demonstrate that the parser trained on our dataset outperforms existing
approaches in terms of faithfulness and consistency. This improvement leads to
a significant performance boost in both image caption evaluation and zero-shot
image retrieval tasks. Furthermore, we introduce a novel metric for measuring
scene graph similarity, which, when combined with the improved scene graph
parser, achieves state-of-the-art (SOTA) results on multiple benchmark datasets
for the aforementioned tasks. The code and dataset are available at
https://github.com/zhuang-li/FACTUAL .
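To make the setting concrete, the minimal Python sketch below treats a parsed scene graph as a set of (subject, predicate, object) triples and scores the agreement of two graphs with a plain exact-match F1. The triple encoding, the example parse, and the F1 scoring are illustrative assumptions only; they stand in for, but are not, the paper's FACTUAL-MR representation or its proposed similarity metric.

```python
# A minimal, hypothetical sketch: a scene graph as a set of
# (subject, predicate, object) triples, compared with an exact-match F1.
# The triple format and the F1 are illustrative assumptions only;
# FACTUAL-MR and the paper's similarity metric are more refined than this.

Triple = tuple[str, str, str]  # (subject, predicate, object)


def graph_f1(predicted: set, reference: set) -> float:
    """Exact-match F1 between two scene graphs viewed as triple sets."""
    if not predicted or not reference:
        return 0.0
    overlap = len(predicted & reference)
    precision = overlap / len(predicted)
    recall = overlap / len(reference)
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# A faithful, consistent parse of "a man is riding a black horse":
predicted = {("man", "ride", "horse"), ("horse", "is", "black")}
reference = {("man", "ride", "horse"), ("horse", "is", "black")}
print(graph_f1(predicted, reference))  # 1.0 when the graphs agree exactly
```

Exact-match scoring like this is brittle when the same semantics are annotated differently, which is precisely the inconsistency problem the paper's canonical FACTUAL-MR annotations are designed to remove.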
Related papers
- SPAN: Learning Similarity between Scene Graphs and Images with Transformers [29.582313604112336]
We propose a Scene graPh-imAge coNtrastive learning framework, SPAN, that can measure the similarity between scene graphs and images.
We introduce a novel graph serialization technique that transforms a scene graph into a sequence with structural encodings (a minimal sketch of this idea appears after this list).
arXiv Detail & Related papers (2023-04-02T18:13:36Z)
- Diffusion-Based Scene Graph to Image Generation with Masked Contrastive Pre-Training [112.94542676251133]
We propose to learn scene graph embeddings by directly optimizing their alignment with images.
Specifically, we pre-train an encoder to extract both global and local information from scene graphs.
The resulting method, called SGDiff, allows for the semantic manipulation of generated images by modifying scene graph nodes and connections.
arXiv Detail & Related papers (2022-11-21T01:11:19Z)
- Consensus Graph Representation Learning for Better Grounded Image Captioning [48.208119537050166]
We propose the Consensus Graph Representation Learning framework (CGRL) for grounded image captioning.
We validate the effectiveness of our model, with a significant decline in object hallucination (-9% CHAIRi) on the Flickr30k Entities dataset.
arXiv Detail & Related papers (2021-12-02T04:17:01Z)
- Learning to Generate Scene Graph from Natural Language Supervision [52.18175340725455]
We propose one of the first methods that learn from image-sentence pairs to extract a graphical representation of localized objects and their relationships within an image, known as a scene graph.
We leverage an off-the-shelf object detector to identify and localize object instances, match labels of detected regions to concepts parsed from captions, and thus create "pseudo" labels for learning scene graphs.
arXiv Detail & Related papers (2021-09-06T03:38:52Z)
- A Deep Local and Global Scene-Graph Matching for Image-Text Retrieval [4.159666152160874]
Scene graph representation is a suitable approach for the image-text matching challenge.
We introduce the Local and Global Scene Graph Matching (LGSGM) model that enhances the state-of-the-art method.
Our enhancement, which combines the local and global levels, improves the baseline method by increasing recall by more than 10% on the Flickr30k dataset.
arXiv Detail & Related papers (2021-06-04T10:33:14Z)
- SG2Caps: Revisiting Scene Graphs for Image Captioning [37.58310822924814]
We propose a framework, SG2Caps, that utilizes only the scene graph labels for competitive image captioning performance.
Our framework outperforms existing scene-graph-only captioning models by a large margin (CIDEr score of 110 vs. 71), indicating that scene graphs are a promising representation for image captioning.
arXiv Detail & Related papers (2021-02-09T18:00:53Z)
- Intrinsic Image Captioning Evaluation [53.51379676690971]
We propose a learning-based metric for image captioning, which we call Intrinsic Image Captioning Evaluation (I2CE).
Experimental results show that our proposed method maintains robust performance and assigns more flexible scores to candidate captions that use semantically similar expressions or less-aligned semantics.
arXiv Detail & Related papers (2020-12-14T08:36:05Z)
- Improving Image Captioning with Better Use of Captions [65.39641077768488]
We present a novel image captioning architecture to better explore semantics available in captions and leverage that to enhance both image representation and caption generation.
Our models first construct caption-guided visual relationship graphs that introduce beneficial inductive bias using weakly supervised multi-instance learning.
During generation, the model further incorporates visual relationships using multi-task learning for jointly predicting word and object/predicate tag sequences.
arXiv Detail & Related papers (2020-06-21T14:10:47Z)
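As flagged in the SPAN summary above, the following minimal Python sketch illustrates one plausible graph-serialization scheme: each triple is flattened into a token sequence with role markers standing in for structural encodings. The marker vocabulary and format are assumptions for illustration, not SPAN's actual technique.

```python
# A hypothetical serialization in the spirit of SPAN's idea (not its actual
# encoding): flatten each (subject, predicate, object) triple into a token
# sequence with role markers, so a Transformer can consume the graph as text.

def serialize_scene_graph(graph: list) -> str:
    """Flatten a scene graph (a list of triples) into one marked-up string."""
    parts = []
    for subj, pred, obj in graph:
        # Role markers act as a simple stand-in for structural encodings.
        parts.append(f"[SUB] {subj} [PRED] {pred} [OBJ] {obj}")
    return " ".join(parts)


graph = [("man", "ride", "horse"), ("horse", "is", "black")]
print(serialize_scene_graph(graph))
# -> [SUB] man [PRED] ride [OBJ] horse [SUB] horse [PRED] is [OBJ] black
```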
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.