Towards Unique and Informative Captioning of Images
- URL: http://arxiv.org/abs/2009.03949v1
- Date: Tue, 8 Sep 2020 19:01:33 GMT
- Title: Towards Unique and Informative Captioning of Images
- Authors: Zeyu Wang, Berthy Feng, Karthik Narasimhan, Olga Russakovsky
- Abstract summary: We analyze both modern captioning systems and evaluation metrics.
We design a new metric (SPICE-U) by introducing a notion of uniqueness over the concepts generated in a caption.
We show that SPICE-U is better correlated with human judgements compared to SPICE, and effectively captures notions of diversity and descriptiveness.
- Score: 40.036350846970706
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Despite considerable progress, state-of-the-art image captioning models
produce generic captions, leaving out important image details. Furthermore,
these systems may even misrepresent the image in order to produce a simpler
caption consisting of common concepts. In this paper, we first analyze both
modern captioning systems and evaluation metrics through empirical experiments
to quantify these phenomena. We find that modern captioning systems return
higher likelihoods for incorrect distractor sentences compared to ground truth
captions, and that evaluation metrics like SPICE can be 'topped' using simple
captioning systems relying on object detectors. Inspired by these observations,
we design a new metric (SPICE-U) by introducing a notion of uniqueness over the
concepts generated in a caption. We show that SPICE-U is better correlated with
human judgements compared to SPICE, and effectively captures notions of
diversity and descriptiveness. Finally, we also demonstrate a general technique
to improve any existing captioning model -- by using mutual information as a
re-ranking objective during decoding. Empirically, this results in more unique
and informative captions, and improves three different state-of-the-art models
on SPICE-U as well as average score over existing metrics.
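As a rough illustration of the two ideas described in the abstract above, the following Python sketch shows (a) one plausible way to score how unique a concept is relative to a corpus and (b) pointwise-mutual-information re-ranking of candidate captions. This is a minimal sketch of the general ideas only, not the paper's actual SPICE-U formulation or decoding code: the uniqueness definition, the `log_p_caption_given_image` / `log_p_caption` scoring callables, and the weight `lam` are assumptions made for illustration.

```python
# Illustrative sketch, not the paper's implementation. The uniqueness
# definition and the PMI-style re-ranking weight `lam` are assumptions.

def concept_uniqueness(concept, corpus_concepts):
    """Score how unique a concept is: 1 minus the fraction of corpus images
    whose reference captions mention it (rarer concepts score closer to 1)."""
    doc_freq = sum(concept in image_concepts for image_concepts in corpus_concepts)
    return 1.0 - doc_freq / max(len(corpus_concepts), 1)


def mi_rerank(candidates, log_p_caption_given_image, log_p_caption, lam=1.0):
    """Pick the candidate caption maximizing an approximate pointwise mutual
    information score: log p(c | image) - lam * log p(c). Subtracting the
    image-independent prior penalizes generic, high-frequency captions."""
    return max(
        candidates,
        key=lambda c: log_p_caption_given_image(c) - lam * log_p_caption(c),
    )
```

In a real decoder, `candidates` would come from beam search, `log_p_caption_given_image` from the captioning model, and `log_p_caption` from an unconditioned language model (or the captioner run without the image input).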
Related papers
- What Makes for Good Image Captions? [50.48589893443939]
Our framework posits that good image captions should balance three key aspects: informationally sufficient, minimally redundant, and readily comprehensible by humans.
We introduce the Pyramid of Captions (PoCa) method, which generates enriched captions by integrating local and global visual information.
arXiv Detail & Related papers (2024-05-01T12:49:57Z) - Improving Image Captioning Descriptiveness by Ranking and LLM-based
Fusion [17.99150939602917]
State-of-the-Art (SoTA) image captioning models often rely on the Microsoft COCO (MS-COCO) dataset for training.
We present a novel approach to address previous challenges by showcasing how captions generated from different SoTA models can be effectively fused.
arXiv Detail & Related papers (2023-06-20T15:13:02Z) - Positive-Augmented Contrastive Learning for Image and Video Captioning
Evaluation [47.40949434032489]
We propose a new contrastive-based evaluation metric for image captioning, namely Positive-Augmented Contrastive learning Score (PAC-S).
PAC-S unifies the learning of a contrastive visual-semantic space with the addition of generated images and text on curated data.
Experiments spanning several datasets demonstrate that our new metric achieves the highest correlation with human judgments on both images and videos.
arXiv Detail & Related papers (2023-03-21T18:03:14Z) - Belief Revision based Caption Re-ranker with Visual Semantic Information [31.20692237930281]
We propose a novel re-ranking approach that leverages visual-semantic measures to identify the ideal caption.
Our experiments demonstrate the utility of our approach, where we observe that our re-ranker can enhance the performance of a typical image-captioning system.
arXiv Detail & Related papers (2022-09-16T20:36:41Z) - On Distinctive Image Captioning via Comparing and Reweighting [52.3731631461383]
In this paper, we aim to improve the distinctiveness of image captions via comparing and reweighting with a set of similar images.
Our metric reveals that the human annotations of each image in the MSCOCO dataset are not equally distinctive.
In contrast, previous works normally treat the human annotations equally during training, which could be a reason for generating less distinctive captions.
arXiv Detail & Related papers (2022-04-08T08:59:23Z) - Contrastive Semantic Similarity Learning for Image Captioning Evaluation
with Intrinsic Auto-encoder [52.42057181754076]
Motivated by the auto-encoder mechanism and contrastive representation learning advances, we propose a learning-based metric for image captioning.
We develop three progressive model structures to learn the sentence level representations.
Experiment results show that our proposed method can align well with the scores generated from other contemporary metrics.
arXiv Detail & Related papers (2021-06-29T12:27:05Z) - Intrinsic Image Captioning Evaluation [53.51379676690971]
We propose a learning-based metric for image captioning, which we call Intrinsic Image Captioning Evaluation (I2CE).
Experiment results show that our proposed method maintains robust performance and gives more flexible scores to candidate captions that contain semantically similar expressions or less-aligned semantics.
arXiv Detail & Related papers (2020-12-14T08:36:05Z) - Improving Image Captioning with Better Use of Captions [65.39641077768488]
We present a novel image captioning architecture to better explore semantics available in captions and leverage that to enhance both image representation and caption generation.
Our models first construct caption-guided visual relationship graphs that introduce beneficial inductive bias using weakly supervised multi-instance learning.
During generation, the model further incorporates visual relationships using multi-task learning for jointly predicting word and object/predicate tag sequences.
arXiv Detail & Related papers (2020-06-21T14:10:47Z)