CLIPScore: A Reference-free Evaluation Metric for Image Captioning
- URL: http://arxiv.org/abs/2104.08718v1
- Date: Sun, 18 Apr 2021 05:00:29 GMT
- Title: CLIPScore: A Reference-free Evaluation Metric for Image Captioning
- Authors: Jack Hessel, Ari Holtzman, Maxwell Forbes, Ronan Le Bras, Yejin Choi
- Abstract summary: We show that CLIP, a cross-modal model pretrained on 400M image+caption pairs from the web, can be used for robust automatic evaluation of image captioning without the need for references.
Experiments spanning several corpora demonstrate that our new reference-free metric, CLIPScore, achieves the highest correlation with human judgements.
We also present a reference-augmented version, RefCLIPScore, which achieves even higher correlation.
- Score: 44.14502257230038
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Image captioning has conventionally relied on reference-based automatic evaluations, where machine captions are compared against captions written by humans. This is in stark contrast to the reference-free manner in which humans assess caption quality.
In this paper, we report the surprising empirical finding that CLIP (Radford et al., 2021), a cross-modal model pretrained on 400M image+caption pairs from the web, can be used for robust automatic evaluation of image captioning without the need for references. Experiments spanning several corpora demonstrate that our new reference-free metric, CLIPScore, achieves the highest correlation with human judgements, outperforming existing reference-based metrics like CIDEr and SPICE. Information gain experiments demonstrate that CLIPScore, with its tight focus on image-text compatibility, is complementary to existing reference-based metrics that emphasize text-text similarities. Thus, we also present a reference-augmented version, RefCLIPScore, which achieves even higher correlation. Beyond literal description tasks, several case studies reveal domains where CLIPScore performs well (clip-art images, alt-text rating), but also where it is relatively weaker versus reference-based metrics, e.g., news captions that require richer contextual knowledge.
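At its core, CLIPScore is a rescaled cosine similarity between CLIP's image embedding and the candidate caption's text embedding, and RefCLIPScore combines that score with the best caption-reference similarity via a harmonic mean. Below is a minimal sketch of both, assuming the openai/clip-vit-base-patch32 checkpoint and the Hugging Face transformers API; the rescaling weight w = 2.5 and the harmonic-mean combination follow the paper, while the function names and preprocessing details here are illustrative rather than the authors' released code.

```python
# Minimal sketch of CLIPScore / RefCLIPScore (not the authors' released code).
# Assumptions: Hugging Face transformers CLIP, openai/clip-vit-base-patch32 checkpoint.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_score(image: Image.Image, caption: str, w: float = 2.5) -> float:
    """CLIP-S(c, v) = w * max(cos(c, v), 0): reference-free image-text compatibility.
    (The paper also prepends a short prompt prefix to the caption; omitted here.)"""
    inputs = processor(text=[caption], images=image, return_tensors="pt",
                       padding=True, truncation=True)
    with torch.no_grad():
        img = model.get_image_features(pixel_values=inputs["pixel_values"])
        txt = model.get_text_features(input_ids=inputs["input_ids"],
                                      attention_mask=inputs["attention_mask"])
    img = img / img.norm(dim=-1, keepdim=True)   # unit-normalize embeddings
    txt = txt / txt.norm(dim=-1, keepdim=True)
    return w * max((img @ txt.T).item(), 0.0)

def ref_clip_score(image: Image.Image, caption: str, references: list[str]) -> float:
    """RefCLIPScore: harmonic mean of CLIP-S and the best caption-reference cosine."""
    cs = clip_score(image, caption)
    inputs = processor(text=[caption] + references, return_tensors="pt",
                       padding=True, truncation=True)
    with torch.no_grad():
        txt = model.get_text_features(input_ids=inputs["input_ids"],
                                      attention_mask=inputs["attention_mask"])
    txt = txt / txt.norm(dim=-1, keepdim=True)
    cand, refs = txt[:1], txt[1:]
    ref_sim = max((cand @ refs.T).max().item(), 0.0)   # best matching reference
    return 2 * cs * ref_sim / (cs + ref_sim) if (cs + ref_sim) > 0 else 0.0
```

Since w only rescales the cosine similarity, it does not change caption rankings; it simply stretches the raw similarities into a more readable range.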
Related papers
- A Novel Evaluation Framework for Image2Text Generation [15.10524860121122]
We propose an evaluation framework rooted in a modern large language model (LLM) capable of image generation.
A high similarity score suggests that the image captioning model has accurately generated textual descriptions.
A low similarity score indicates discrepancies, revealing potential shortcomings in the model's performance.
arXiv Detail & Related papers (2024-08-03T09:27:57Z)
- HICEScore: A Hierarchical Metric for Image Captioning Evaluation [10.88292081473071]
We propose a novel reference-free metric for image captioning evaluation, dubbed Hierarchical Image Captioning Evaluation Score (HICE-S).
By detecting local visual regions and textual phrases, HICE-S builds an interpretable hierarchical scoring mechanism.
Our proposed metric achieves state-of-the-art performance on several benchmarks, outperforming existing reference-free metrics.
arXiv Detail & Related papers (2024-07-26T08:24:30Z)
- FLEUR: An Explainable Reference-Free Evaluation Metric for Image Captioning Using a Large Multimodal Model [5.330266804358638]
We propose FLEUR, a reference-free metric that introduces explainability into image captioning evaluation.
By leveraging a large multimodal model, FLEUR can evaluate the caption against the image without the need for reference captions.
FLEUR achieves high correlations with human judgment across various image captioning evaluation benchmarks.
arXiv Detail & Related papers (2024-06-10T03:57:39Z)
- InfoMetIC: An Informative Metric for Reference-free Image Caption Evaluation [69.1642316502563]
We propose an Informative Metric for Reference-free Image Caption evaluation (InfoMetIC).
Given an image and a caption, InfoMetIC is able to report incorrect words and unmentioned image regions at a fine-grained level.
We also construct a token-level evaluation dataset and demonstrate the effectiveness of InfoMetIC in fine-grained evaluation.
arXiv Detail & Related papers (2023-05-10T09:22:44Z)
- Positive-Augmented Contrastive Learning for Image and Video Captioning Evaluation [47.40949434032489]
We propose a new contrastive evaluation metric for image captioning, namely the Positive-Augmented Contrastive learning Score (PAC-S).
PAC-S unifies the learning of a contrastive visual-semantic space with the addition of generated images and text on curated data.
Experiments spanning several datasets demonstrate that our new metric achieves the highest correlation with human judgments on both images and videos.
arXiv Detail & Related papers (2023-03-21T18:03:14Z)
- Fine-grained Image Captioning with CLIP Reward [104.71533106301598]
We propose using CLIP, a multimodal encoder trained on a huge set of image-text pairs from the web, to calculate multimodal similarity and use it as a reward function (a minimal sketch of this reward setup appears after the list below).
We also propose a simple finetuning strategy for the CLIP text encoder that improves grammar without requiring extra text annotation.
In experiments on text-to-image retrieval and FineCapEval, the proposed CLIP-guided model generates more distinctive captions than the CIDEr-optimized model.
arXiv Detail & Related papers (2022-05-26T02:46:09Z)
- Transparent Human Evaluation for Image Captioning [70.03979566548823]
We develop a rubric-based human evaluation protocol for image captioning models.
We show that human-generated captions show substantially higher quality than machine-generated ones.
We hope that this work will promote a more transparent evaluation protocol for image captioning.
arXiv Detail & Related papers (2021-11-17T07:09:59Z)
- Contrastive Semantic Similarity Learning for Image Captioning Evaluation with Intrinsic Auto-encoder [52.42057181754076]
Motivated by the auto-encoder mechanism and contrastive representation learning advances, we propose a learning-based metric for image captioning.
We develop three progressive model structures to learn sentence-level representations.
Experiment results show that our proposed method can align well with the scores generated from other contemporary metrics.
arXiv Detail & Related papers (2021-06-29T12:27:05Z)
- Intrinsic Image Captioning Evaluation [53.51379676690971]
We propose a learning-based metric for image captioning, which we call Intrinsic Image Captioning Evaluation (I2CE).
Experiment results show that our proposed method maintains robust performance and gives more flexible scores to candidate captions when faced with semantically similar expressions or less aligned semantics.
arXiv Detail & Related papers (2020-12-14T08:36:05Z)
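Related to the "Fine-grained Image Captioning with CLIP Reward" entry above, the sketch below illustrates, under stated assumptions, how a CLIP image-text similarity such as clip_score() can serve as a sequence-level reward in a self-critical, REINFORCE-style update. The captioner.sample and captioner.greedy_decode calls are hypothetical stand-ins for a captioning model's decoding API, and the greedy baseline is an illustrative choice rather than that paper's exact training recipe.

```python
# Hedged sketch: CLIP similarity as a caption-level reward in a REINFORCE-style
# update with a greedy-decoding baseline (self-critical training).
# `captioner.sample` / `captioner.greedy_decode` are hypothetical APIs; any
# CLIP-based scorer (e.g. the clip_score() sketch above) can be plugged in.
import torch

def clip_reward_loss(captioner, images, clip_score_fn):
    """Policy-gradient loss whose reward is CLIP image-text similarity."""
    # Sample one caption per image and keep its summed log-probability.
    sampled_caps, sampled_logprobs = captioner.sample(images)   # hypothetical API
    # Greedy captions act as the reward baseline (self-critical training).
    greedy_caps, _ = captioner.greedy_decode(images)            # hypothetical API

    rewards = torch.tensor([clip_score_fn(img, cap)
                            for img, cap in zip(images, sampled_caps)])
    baselines = torch.tensor([clip_score_fn(img, cap)
                              for img, cap in zip(images, greedy_caps)])
    advantage = rewards - baselines

    # Maximize expected reward: minimize -E[(r - b) * log p(caption | image)].
    return -(advantage.detach() * sampled_logprobs).mean()
```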
This list is automatically generated from the titles and abstracts of the papers on this site.