Transparent Human Evaluation for Image Captioning
- URL: http://arxiv.org/abs/2111.08940v1
- Date: Wed, 17 Nov 2021 07:09:59 GMT
- Title: Transparent Human Evaluation for Image Captioning
- Authors: Jungo Kasai, Keisuke Sakaguchi, Lavinia Dunagan, Jacob Morrison, Ronan Le Bras, Yejin Choi, Noah A. Smith
- Abstract summary: We develop a rubric-based human evaluation protocol for image captioning models.
We find that human-generated captions are of substantially higher quality than machine-generated ones.
We hope that this work will promote a more transparent evaluation protocol for image captioning.
- Score: 70.03979566548823
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We establish a rubric-based human evaluation protocol for image captioning
models. Our scoring rubrics and their definitions are carefully developed based
on machine- and human-generated captions on the MSCOCO dataset. Each caption is
evaluated along two main dimensions in a tradeoff (precision and recall) as
well as other aspects that measure the text quality (fluency, conciseness, and
inclusive language). Our evaluations demonstrate several critical problems of
the current evaluation practice. Human-generated captions show substantially
higher quality than machine-generated ones, especially in coverage of salient
information (i.e., recall), while all automatic metrics say the opposite. Our
rubric-based results reveal that CLIPScore, a recent metric that uses image
features, better correlates with human judgments than conventional text-only
metrics because it is more sensitive to recall. We hope that this work will
promote a more transparent evaluation protocol for image captioning and its
automatic metrics.
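The comparison described in the abstract, between an automatic metric and human judgments along separate rubric dimensions, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the toy ratings, the metric scores, the 1-5 rating scale, the helper names `pearson` and `correlation_by_dimension`, and the choice of Pearson correlation are not taken from the paper, which may use a different correlation statistic and aggregation.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


def correlation_by_dimension(ratings, scores):
    """Correlate one automatic metric with each human rubric dimension."""
    dimensions = ratings[0].keys()
    return {d: pearson([r[d] for r in ratings], scores) for d in dimensions}


# Hypothetical toy data, NOT from the paper: one entry per caption.
# The 1-5 scale is an assumption made for this sketch.
human_ratings = [
    {"precision": 5, "recall": 2},
    {"precision": 4, "recall": 3},
    {"precision": 5, "recall": 4},
    {"precision": 3, "recall": 5},
    {"precision": 4, "recall": 1},
]
# Hypothetical scores from some automatic metric for the same five captions.
metric_scores = [0.55, 0.60, 0.72, 0.80, 0.41]

if __name__ == "__main__":
    for dim, r in correlation_by_dimension(human_ratings, metric_scores).items():
        print(f"{dim}: r = {r:+.3f}")
```

Reporting the correlation per dimension, rather than against a single overall score, is what makes it possible to say whether a metric tracks recall more closely than precision.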
Related papers
- A Novel Evaluation Framework for Image2Text Generation [15.10524860121122]
We propose an evaluation framework rooted in a modern large language model (LLM) capable of image generation.
A high similarity score suggests that the image captioning model has accurately generated textual descriptions.
A low similarity score indicates discrepancies, revealing potential shortcomings in the model's performance.
arXiv Detail & Related papers (2024-08-03T09:27:57Z)
- BRIDGE: Bridging Gaps in Image Captioning Evaluation with Stronger Visual Cues [47.213906345208315]
We propose BRIDGE, a new learnable and reference-free image captioning metric.
Our proposal achieves state-of-the-art results compared to existing reference-free evaluation scores.
arXiv Detail & Related papers (2024-07-29T18:00:17Z)
- Vision Language Model-based Caption Evaluation Method Leveraging Visual Context Extraction [27.00018283430169]
This paper presents VisCE$^2$, a vision language model-based caption evaluation method.
Our method focuses on visual context, which refers to the detailed content of images, including objects, attributes, and relationships.
arXiv Detail & Related papers (2024-02-28T01:29:36Z)
- Likelihood-Based Text-to-Image Evaluation with Patch-Level Perceptual and Semantic Credit Assignment [48.835298314274254]
We propose to evaluate text-to-image generation performance by directly estimating the likelihood of the generated images.
A higher likelihood indicates better perceptual quality and better text-image alignment.
It can successfully assess the generation ability of these models with as few as a hundred samples.
arXiv Detail & Related papers (2023-08-16T17:26:47Z)
- InfoMetIC: An Informative Metric for Reference-free Image Caption Evaluation [69.1642316502563]
We propose an Informative Metric for Reference-free Image Caption evaluation (InfoMetIC).
Given an image and a caption, InfoMetIC is able to report incorrect words and unmentioned image regions at a fine-grained level.
We also construct a token-level evaluation dataset and demonstrate the effectiveness of InfoMetIC in fine-grained evaluation.
arXiv Detail & Related papers (2023-05-10T09:22:44Z)
- COSMic: A Coherence-Aware Generation Metric for Image Descriptions [27.41088864449921]
Image captioning metrics have struggled to give accurate learned estimates of the semantic and pragmatic success of output text.
We present the first learned generation metric for evaluating output captions.
We demonstrate that our proposed metric correlates more highly with human judgments on the results of a number of state-of-the-art caption models than several other metrics, such as BLEURT and BERTScore.
arXiv Detail & Related papers (2021-09-11T13:43:36Z)
- Contrastive Semantic Similarity Learning for Image Captioning Evaluation with Intrinsic Auto-encoder [52.42057181754076]
Motivated by the auto-encoder mechanism and contrastive representation learning advances, we propose a learning-based metric for image captioning.
We develop three progressive model structures to learn sentence-level representations.
Experimental results show that our proposed method aligns well with the scores generated by other contemporary metrics.
arXiv Detail & Related papers (2021-06-29T12:27:05Z)
- Intrinsic Image Captioning Evaluation [53.51379676690971]
We propose a learning-based metric for image captioning, which we call Intrinsic Image Captioning Evaluation (I2CE).
Experimental results show that our proposed method maintains robust performance and gives more flexible scores to candidate captions when it encounters semantically similar expressions or less aligned semantics.
arXiv Detail & Related papers (2020-12-14T08:36:05Z)
This list is automatically generated from the titles and abstracts of the papers in this site.