VCRScore: Image captioning metric based on V&L Transformers, CLIP, and precision-recall
- URL: http://arxiv.org/abs/2501.09155v2
- Date: Mon, 27 Jan 2025 16:05:59 GMT
- Title: VCRScore: Image captioning metric based on V&L Transformers, CLIP, and precision-recall
- Authors: Guillermo Ruiz, Tania Ramírez, Daniela Moctezuma
- Abstract summary: This work proposes a new evaluation metric for the image captioning problem.
A human-labeled dataset was generated to assess the degree to which captions correlate with the image's content.
The proposed metric also outperformed existing ones, and interesting insights are presented and discussed.
- Score: 0.0
- License:
- Abstract: Image captioning has become an essential Vision & Language research task. It is about predicting the most accurate caption for a given image or video. The research community has achieved impressive results by continuously proposing new models and approaches to improve overall performance. Nevertheless, despite the growing number of proposals, the performance metrics used to measure these advances have remained practically untouched through the years. As proof of that, metrics like BLEU, METEOR, CIDEr, and ROUGE are still widely used today, alongside more sophisticated metrics such as BERTScore and CLIPScore. Hence, it is essential to adjust how we measure the advances, limitations, and scope of new image captioning proposals, and to adapt metrics to these more advanced approaches. This work proposes a new evaluation metric for the image captioning problem. To do that, first, a human-labeled dataset was generated to assess the degree to which captions correlate with the image's content. Taking these human scores as ground truth, we propose a new metric and compare it with several well-known metrics, from classical to newer ones. The proposed metric outperformed them, and interesting insights are presented and discussed.
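The title names the ingredients (CLIP and precision-recall), but the abstract does not spell out a formula. Below is a minimal, hedged sketch of one way a CLIP-based precision/recall-style caption score could look, assuming "precision" measures caption-image agreement, "recall" measures caption-reference agreement, and the two combine as an F-measure; the checkpoint name and the harmonic-mean combination are assumptions, not the paper's definition.

```python
# Hedged sketch of a CLIP-based precision/recall-style caption score.
# Assumptions (not the paper's exact formulation): "precision" = similarity
# of the candidate caption to the image, "recall" = similarity of the
# candidate to its closest human reference, combined by a harmonic mean.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

MODEL_NAME = "openai/clip-vit-base-patch32"  # assumed checkpoint
model = CLIPModel.from_pretrained(MODEL_NAME).eval()
processor = CLIPProcessor.from_pretrained(MODEL_NAME)

def pr_caption_score(image: Image.Image, candidate: str,
                     references: list[str]) -> float:
    inputs = processor(text=[candidate] + references, images=image,
                       return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        out = model(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    cand, refs = txt[0], txt[1:]
    precision = max(float(img[0] @ cand), 0.0)     # caption vs. image
    recall = max(float((refs @ cand).max()), 0.0)  # caption vs. best reference
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

To validate such a metric the way the paper does, one would correlate its scores with the human labels, e.g. with scipy.stats.kendalltau(human_scores, metric_scores).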
Related papers
- Surveying the Landscape of Image Captioning Evaluation: A Comprehensive Taxonomy and Novel Ensemble Method [35.71703501731081]
We present the first survey and taxonomy of over 70 different image captioning metrics.
We find that despite the diversity of proposed metrics, the vast majority of studies rely on only five popular metrics.
We propose EnsembEval -- an ensemble of evaluation methods achieving the highest reported correlation with human judgements.
arXiv Detail & Related papers (2024-08-09T07:31:06Z)
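The EnsembEval teaser says only that an ensemble of evaluation methods correlates best with human judgments, not how the ensemble is built. A minimal sketch of one obvious recipe follows; z-normalizing each metric and averaging with equal weights is an assumption, not the paper's method.

```python
# Hedged sketch of a score ensemble (EnsembEval's actual recipe is not given
# in the summary above): z-normalize each metric over the dataset, then
# average the normalized scores per caption. Equal weights are an assumption.
import statistics

def ensemble_scores(scores_by_metric: dict[str, list[float]]) -> list[float]:
    normalized = []
    for scores in scores_by_metric.values():
        mu = statistics.mean(scores)
        sigma = statistics.pstdev(scores) or 1.0  # guard against zero spread
        normalized.append([(s - mu) / sigma for s in scores])
    # average across metrics, caption by caption
    return [sum(col) / len(col) for col in zip(*normalized)]

# e.g. ensemble_scores({"CIDEr": cider, "CLIPScore": clips, "BERTScore": berts})
```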
- InfoMetIC: An Informative Metric for Reference-free Image Caption Evaluation [69.1642316502563]
We propose an Informative Metric for Reference-free Image Caption evaluation (InfoMetIC).
Given an image and a caption, InfoMetIC is able to report incorrect words and unmentioned image regions at a fine-grained level.
We also construct a token-level evaluation dataset and demonstrate the effectiveness of InfoMetIC in fine-grained evaluation.
arXiv Detail & Related papers (2023-05-10T09:22:44Z)
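InfoMetIC itself is a trained vision-and-language model; as a rough, assumed stand-in for the word-level half of such fine-grained evaluation, the sketch below ablates one word at a time and flags words whose removal increases CLIP's image-text similarity. This illustrates the idea only; it is not the InfoMetIC method.

```python
# Illustrative only (NOT the InfoMetIC model): a crude word-level probe.
# Drop each word of the caption; if the image-text similarity improves
# without it, the word is flagged as possibly unsupported by the image.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

MODEL_NAME = "openai/clip-vit-base-patch32"  # assumed checkpoint
model = CLIPModel.from_pretrained(MODEL_NAME).eval()
processor = CLIPProcessor.from_pretrained(MODEL_NAME)

def suspect_words(image: Image.Image, caption: str) -> list[str]:
    words = caption.split()
    variants = [caption] + [" ".join(words[:i] + words[i + 1:])
                            for i in range(len(words))]
    inputs = processor(text=variants, images=image, return_tensors="pt",
                       padding=True, truncation=True)
    with torch.no_grad():
        out = model(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    sims = (txt @ img[0]).tolist()   # similarity of each text variant
    full_score = sims[0]
    return [w for w, s in zip(words, sims[1:]) if s > full_score]
```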
- Positive-Augmented Contrastive Learning for Image and Video Captioning Evaluation [47.40949434032489]
We propose a new contrastive-based evaluation metric for image captioning, namely the Positive-Augmented Contrastive learning Score (PAC-S).
PAC-S unifies the learning of a contrastive visual-semantic space with the addition of generated images and text on curated data.
Experiments spanning several datasets demonstrate that our new metric achieves the highest correlation with human judgments on both images and videos.
arXiv Detail & Related papers (2023-03-21T18:03:14Z)
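PAC-S adds generated images and text as extra positives inside a contrastive visual-semantic space. Below is a minimal PyTorch sketch of that idea, a symmetric InfoNCE loss with extra cross terms for the generated positives; the authors' exact loss, weighting, and data pipeline are not reproduced here.

```python
# Hedged sketch of contrastive training with generated positives, in the
# spirit of PAC-S (not the authors' exact loss). img/txt are matched batches
# of embeddings; gen_img/gen_txt are embeddings of generated counterparts
# treated as extra positives for the same pairs.
import torch
import torch.nn.functional as F

def pac_style_loss(img: torch.Tensor, txt: torch.Tensor,
                   gen_img: torch.Tensor, gen_txt: torch.Tensor,
                   tau: float = 0.07) -> torch.Tensor:
    img, txt = F.normalize(img, dim=-1), F.normalize(txt, dim=-1)
    gen_img, gen_txt = F.normalize(gen_img, dim=-1), F.normalize(gen_txt, dim=-1)
    labels = torch.arange(img.size(0), device=img.device)

    def info_nce(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
        logits = a @ b.t() / tau   # in-batch similarity matrix
        return (F.cross_entropy(logits, labels)
                + F.cross_entropy(logits.t(), labels)) / 2

    # real pairs, plus cross terms pairing real data with generated positives
    return info_nce(img, txt) + info_nce(img, gen_txt) + info_nce(gen_img, txt)
```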
- Are metrics measuring what they should? An evaluation of image captioning task metrics [0.21301560294088315]
Image captioning is a research task that aims to describe the image content using the objects and their relationships in the scene.
To tackle this task, two important research areas are combined: computer vision and natural language processing.
We present an evaluation of several kinds of Image Captioning metrics and a comparison between them using the well-known MS COCO dataset.
arXiv Detail & Related papers (2022-07-04T21:51:47Z)
- On Distinctive Image Captioning via Comparing and Reweighting [52.3731631461383]
In this paper, we aim to improve the distinctiveness of image captions via comparing and reweighting with a set of similar images.
Our metric reveals that the human annotations of each image in the MSCOCO dataset are not equivalent based on distinctiveness.
In contrast, previous works normally treat the human annotations equally during training, which could be a reason for generating less distinctive captions.
arXiv Detail & Related papers (2022-04-08T08:59:23Z)
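The comparing-and-reweighting idea above can be made concrete with a simple margin. This is a hedged sketch, not the paper's metric or training scheme; it assumes precomputed, L2-normalized embeddings (e.g. from CLIP) for the caption, its own image, and a retrieved set of visually similar images.

```python
# Hedged sketch of a distinctiveness margin (not the paper's exact metric):
# a caption is distinctive when it matches its own image clearly better than
# the most similar rival image.
import numpy as np

def distinctiveness(cap_emb: np.ndarray, own_img: np.ndarray,
                    similar_imgs: np.ndarray) -> float:
    """cap_emb, own_img: (d,); similar_imgs: (k, d) visually similar images."""
    own_sim = float(cap_emb @ own_img)
    rival_sim = float((similar_imgs @ cap_emb).max())
    return own_sim - rival_sim  # > 0 means the caption pins down its image
```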
- Transparent Human Evaluation for Image Captioning [70.03979566548823]
We develop a rubric-based human evaluation protocol for image captioning models.
We show that human-generated captions are of substantially higher quality than machine-generated ones.
We hope that this work will promote a more transparent evaluation protocol for image captioning.
arXiv Detail & Related papers (2021-11-17T07:09:59Z)
- Contrastive Semantic Similarity Learning for Image Captioning Evaluation with Intrinsic Auto-encoder [52.42057181754076]
Motivated by the auto-encoder mechanism and contrastive representation learning advances, we propose a learning-based metric for image captioning.
We develop three progressive model structures to learn the sentence level representations.
Experiment results show that our proposed method can align well with the scores generated from other contemporary metrics.
arXiv Detail & Related papers (2021-06-29T12:27:05Z)
- UMIC: An Unreferenced Metric for Image Captioning via Contrastive Learning [39.40274917797253]
In this paper, we introduce a new metric, UMIC, an Unreferenced Metric for Image Captioning.
Based on Vision-and-Language BERT, we train UMIC to discriminate negative captions via contrastive learning.
Also, we observe critical problems in the previous benchmark dataset for image captioning metrics, and introduce a new collection of human annotations on the generated captions.
arXiv Detail & Related papers (2021-06-26T13:27:14Z)
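UMIC is trained to push negative captions away from the image, but the summary does not say how negatives are built. A hedged sketch of one cheap construction (the paper's actual negatives may differ): shuffle the word order of a reference caption, or swap in a word from another caption.

```python
# Hedged sketch of negative-caption construction for contrastive metric
# training, in the spirit of UMIC (the paper's actual negatives may differ).
import random

def make_negatives(caption: str, other_captions: list[str],
                   seed: int = 0) -> list[str]:
    rng = random.Random(seed)
    words = caption.split()
    shuffled = words[:]
    rng.shuffle(shuffled)                      # word-order negative
    donor = rng.choice(other_captions).split()
    swapped = words[:]
    swapped[rng.randrange(len(swapped))] = rng.choice(donor)  # word-swap negative
    return [" ".join(shuffled), " ".join(swapped)]
```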
- Intrinsic Image Captioning Evaluation [53.51379676690971]
We propose a learning-based metric for image captioning, which we call Intrinsic Image Captioning Evaluation (I2CE).
Experiment results show that our proposed method keeps robust performance and gives more flexible scores to candidate captions when encountering semantically similar expressions or less-aligned semantics.
arXiv Detail & Related papers (2020-12-14T08:36:05Z)
This list is automatically generated from the titles and abstracts of the papers on this site.