VCRScore: Image captioning metric based on V&L Transformers, CLIP, and precision-recall
- URL: http://arxiv.org/abs/2501.09155v2
- Date: Mon, 27 Jan 2025 16:05:59 GMT
- Title: VCRScore: Image captioning metric based on V&L Transformers, CLIP, and precision-recall
- Authors: Guillermo Ruiz, Tania Ramírez, Daniela Moctezuma
- Abstract summary: This work proposes a new evaluation metric for the image captioning problem.
A human-labeled dataset was generated to assess the degree to which captions correlate with the image's content.
The proposed metric also outperformed existing ones, and interesting insights are presented and discussed.
- Score: 0.0
- License:
- Abstract: Image captioning has become an essential Vision & Language research task. It is about predicting the most accurate caption for a given image or video. The research community has achieved impressive results by continuously proposing new models and approaches to improve overall performance. Nevertheless, despite the growing number of proposals, the performance metrics used to measure these advances have remained practically untouched through the years. As proof of that, metrics like BLEU, METEOR, CIDEr, and ROUGE are still widely used today, alongside more sophisticated metrics such as BERTScore and CLIPScore. Hence, it is essential to adjust how we measure the advances, limitations, and scope of new image captioning proposals, and to adapt metrics to these more advanced approaches. This work proposes a new evaluation metric for the image captioning problem. To do that, first, a human-labeled dataset was generated to assess the degree to which captions correlate with the image's content. Taking these human scores as ground truth, we propose a new metric and compare it with several well-known metrics, from classical to newer ones. The proposed metric outperformed them, and interesting insights are presented and discussed.
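The title names the ingredients (CLIP and precision-recall), but the abstract does not spell out a formula. Below is a minimal, hedged sketch of one way a CLIP-based precision/recall-style caption score could look, assuming "precision" measures caption-image agreement, "recall" measures caption-reference agreement, and the two combine as an F-measure; the checkpoint name and the harmonic-mean combination are assumptions, not the paper's definition.

```python
# Hedged sketch of a CLIP-based precision/recall-style caption score.
# Assumptions (not the paper's exact formulation): "precision" = similarity
# of the candidate caption to the image, "recall" = similarity of the
# candidate to its closest human reference, combined by a harmonic mean.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

MODEL_NAME = "openai/clip-vit-base-patch32"  # assumed checkpoint
model = CLIPModel.from_pretrained(MODEL_NAME).eval()
processor = CLIPProcessor.from_pretrained(MODEL_NAME)

def pr_caption_score(image: Image.Image, candidate: str,
                     references: list[str]) -> float:
    inputs = processor(text=[candidate] + references, images=image,
                       return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        out = model(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    cand, refs = txt[0], txt[1:]
    precision = max(float(img[0] @ cand), 0.0)     # caption vs. image
    recall = max(float((refs @ cand).max()), 0.0)  # caption vs. best reference
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

To validate such a metric the way the paper does, one would correlate its scores with the human labels, e.g. with scipy.stats.kendalltau(human_scores, metric_scores).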
Related papers
- Surveying the Landscape of Image Captioning Evaluation: A Comprehensive Taxonomy and Novel Ensemble Method [35.71703501731081]
We present the first survey and taxonomy of over 70 different image captioning metrics.
We find that despite the diversity of proposed metrics, the vast majority of studies rely on only five popular metrics.
We propose EnsembEval -- an ensemble of evaluation methods achieving the highest reported correlation with human judgements.
arXiv Detail & Related papers (2024-08-09T07:31:06Z)
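The EnsembEval teaser says only that an ensemble of evaluation methods correlates best with human judgments, not how the ensemble is built. A minimal sketch of one obvious recipe follows; z-normalizing each metric and averaging with equal weights is an assumption, not the paper's method.

```python
# Hedged sketch of a score ensemble (EnsembEval's actual recipe is not given
# in the summary above): z-normalize each metric over the dataset, then
# average the normalized scores per caption. Equal weights are an assumption.
import statistics

def ensemble_scores(scores_by_metric: dict[str, list[float]]) -> list[float]:
    normalized = []
    for scores in scores_by_metric.values():
        mu = statistics.mean(scores)
        sigma = statistics.pstdev(scores) or 1.0  # guard against zero spread
        normalized.append([(s - mu) / sigma for s in scores])
    # average across metrics, caption by caption
    return [sum(col) / len(col) for col in zip(*normalized)]

# e.g. ensemble_scores({"CIDEr": cider, "CLIPScore": clips, "BERTScore": berts})
```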
- InfoMetIC: An Informative Metric for Reference-free Image Caption Evaluation [69.1642316502563]
We propose an Informative Metric for Reference-free Image Caption evaluation (InfoMetIC).
Given an image and a caption, InfoMetIC is able to report incorrect words and unmentioned image regions at a fine-grained level.
We also construct a token-level evaluation dataset and demonstrate the effectiveness of InfoMetIC in fine-grained evaluation.
arXiv Detail & Related papers (2023-05-10T09:22:44Z)
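InfoMetIC itself is a trained vision-and-language model; as a rough, assumed stand-in for the word-level half of such fine-grained evaluation, the sketch below ablates one word at a time and flags words whose removal increases CLIP's image-text similarity. This illustrates the idea only; it is not the InfoMetIC method.

```python
# Illustrative only (NOT the InfoMetIC model): a crude word-level probe.
# Drop each word of the caption; if the image-text similarity improves
# without it, the word is flagged as possibly unsupported by the image.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

MODEL_NAME = "openai/clip-vit-base-patch32"  # assumed checkpoint
model = CLIPModel.from_pretrained(MODEL_NAME).eval()
processor = CLIPProcessor.from_pretrained(MODEL_NAME)

def suspect_words(image: Image.Image, caption: str) -> list[str]:
    words = caption.split()
    variants = [caption] + [" ".join(words[:i] + words[i + 1:])
                            for i in range(len(words))]
    inputs = processor(text=variants, images=image, return_tensors="pt",
                       padding=True, truncation=True)
    with torch.no_grad():
        out = model(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    sims = (txt @ img[0]).tolist()   # similarity of each text variant
    full_score = sims[0]
    return [w for w, s in zip(words, sims[1:]) if s > full_score]
```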
- Positive-Augmented Contrastive Learning for Image and Video Captioning Evaluation [47.40949434032489]
We propose a new contrastive-based evaluation metric for image captioning, namely the Positive-Augmented Contrastive learning Score (PAC-S).
PAC-S unifies the learning of a contrastive visual-semantic space with the addition of generated images and text on curated data.
Experiments spanning several datasets demonstrate that our new metric achieves the highest correlation with human judgments on both images and videos.
arXiv Detail & Related papers (2023-03-21T18:03:14Z)
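PAC-S adds generated images and text as extra positives inside a contrastive visual-semantic space. Below is a minimal PyTorch sketch of that idea, a symmetric InfoNCE loss with extra cross terms for the generated positives; the authors' exact loss, weighting, and data pipeline are not reproduced here.

```python
# Hedged sketch of contrastive training with generated positives, in the
# spirit of PAC-S (not the authors' exact loss). img/txt are matched batches
# of embeddings; gen_img/gen_txt are embeddings of generated counterparts
# treated as extra positives for the same pairs.
import torch
import torch.nn.functional as F

def pac_style_loss(img: torch.Tensor, txt: torch.Tensor,
                   gen_img: torch.Tensor, gen_txt: torch.Tensor,
                   tau: float = 0.07) -> torch.Tensor:
    img, txt = F.normalize(img, dim=-1), F.normalize(txt, dim=-1)
    gen_img, gen_txt = F.normalize(gen_img, dim=-1), F.normalize(gen_txt, dim=-1)
    labels = torch.arange(img.size(0), device=img.device)

    def info_nce(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
        logits = a @ b.t() / tau   # in-batch similarity matrix
        return (F.cross_entropy(logits, labels)
                + F.cross_entropy(logits.t(), labels)) / 2

    # real pairs, plus cross terms pairing real data with generated positives
    return info_nce(img, txt) + info_nce(img, gen_txt) + info_nce(gen_img, txt)
```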
- Are metrics measuring what they should? An evaluation of image captioning task metrics [0.21301560294088315]
Image captioning is a research task that aims to describe the image content using the objects and their relationships in the scene.
To tackle this task, two important research areas are combined: computer vision and natural language processing.
We present an evaluation of several kinds of Image Captioning metrics and a comparison between them using the well-known MS COCO dataset.
arXiv Detail & Related papers (2022-07-04T21:51:47Z)
- On Distinctive Image Captioning via Comparing and Reweighting [52.3731631461383]
In this paper, we aim to improve the distinctiveness of image captions via comparing and reweighting with a set of similar images.
Our metric reveals that the human annotations of each image in the MSCOCO dataset are not equivalent based on distinctiveness.
In contrast, previous works normally treat the human annotations equally during training, which could be a reason for generating less distinctive captions.
arXiv Detail & Related papers (2022-04-08T08:59:23Z)
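The comparing-and-reweighting idea above can be made concrete with a simple margin. This is a hedged sketch, not the paper's metric or training scheme; it assumes precomputed, L2-normalized embeddings (e.g. from CLIP) for the caption, its own image, and a retrieved set of visually similar images.

```python
# Hedged sketch of a distinctiveness margin (not the paper's exact metric):
# a caption is distinctive when it matches its own image clearly better than
# the most similar rival image.
import numpy as np

def distinctiveness(cap_emb: np.ndarray, own_img: np.ndarray,
                    similar_imgs: np.ndarray) -> float:
    """cap_emb, own_img: (d,); similar_imgs: (k, d) visually similar images."""
    own_sim = float(cap_emb @ own_img)
    rival_sim = float((similar_imgs @ cap_emb).max())
    return own_sim - rival_sim  # > 0 means the caption pins down its image
```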
- Transparent Human Evaluation for Image Captioning [70.03979566548823]
We develop a rubric-based human evaluation protocol for image captioning models.
We show that human-generated captions are of substantially higher quality than machine-generated ones.
We hope that this work will promote a more transparent evaluation protocol for image captioning.
arXiv Detail & Related papers (2021-11-17T07:09:59Z)
- Contrastive Semantic Similarity Learning for Image Captioning Evaluation with Intrinsic Auto-encoder [52.42057181754076]
Motivated by the auto-encoder mechanism and contrastive representation learning advances, we propose a learning-based metric for image captioning.
We develop three progressive model structures to learn the sentence level representations.
Experiment results show that our proposed method can align well with the scores generated from other contemporary metrics.
arXiv Detail & Related papers (2021-06-29T12:27:05Z)
- UMIC: An Unreferenced Metric for Image Captioning via Contrastive Learning [39.40274917797253]
In this paper, we introduce a new metric, UMIC, an Unreferenced Metric for Image Captioning.
Based on Vision-and-Language BERT, we train UMIC to discriminate negative captions via contrastive learning.
Also, we observe critical problems in the previous benchmark dataset for image captioning metrics, and introduce a new collection of human annotations on the generated captions.
arXiv Detail & Related papers (2021-06-26T13:27:14Z)
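UMIC is trained to push negative captions away from the image, but the summary does not say how negatives are built. A hedged sketch of one cheap construction (the paper's actual negatives may differ): shuffle the word order of a reference caption, or swap in a word from another caption.

```python
# Hedged sketch of negative-caption construction for contrastive metric
# training, in the spirit of UMIC (the paper's actual negatives may differ).
import random

def make_negatives(caption: str, other_captions: list[str],
                   seed: int = 0) -> list[str]:
    rng = random.Random(seed)
    words = caption.split()
    shuffled = words[:]
    rng.shuffle(shuffled)                      # word-order negative
    donor = rng.choice(other_captions).split()
    swapped = words[:]
    swapped[rng.randrange(len(swapped))] = rng.choice(donor)  # word-swap negative
    return [" ".join(shuffled), " ".join(swapped)]
```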
- Intrinsic Image Captioning Evaluation [53.51379676690971]
We propose a learning-based metric for image captioning, which we call Intrinsic Image Captioning Evaluation (I2CE).
Experiment results show that our proposed method keeps robust performance and gives more flexible scores to candidate captions when encountering semantically similar expressions or less-aligned semantics.
arXiv Detail & Related papers (2020-12-14T08:36:05Z)
This list is automatically generated from the titles and abstracts of the papers on this site.