CLAIR: Evaluating Image Captions with Large Language Models
- URL: http://arxiv.org/abs/2310.12971v1
- Date: Thu, 19 Oct 2023 17:59:01 GMT
- Title: CLAIR: Evaluating Image Captions with Large Language Models
- Authors: David Chan, Suzanne Petryk, Joseph E. Gonzalez, Trevor Darrell, John Canny
- Abstract summary: We propose CLAIR, a novel method to evaluate machine-generated image captions.
In our evaluations, CLAIR demonstrates a stronger correlation with human judgments of caption quality compared to existing measures.
CLAIR provides noisily interpretable results by allowing the language model to identify the underlying reasoning behind its assigned score.
- Score: 69.46906537973518
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The evaluation of machine-generated image captions poses an interesting yet
persistent challenge. Effective evaluation measures must consider numerous
dimensions of similarity, including semantic relevance, visual structure,
object interactions, caption diversity, and specificity. Existing
highly-engineered measures attempt to capture specific aspects, but fall short
in providing a holistic score that aligns closely with human judgments. Here,
we propose CLAIR, a novel method that leverages the zero-shot language modeling
capabilities of large language models (LLMs) to evaluate candidate captions. In
our evaluations, CLAIR demonstrates a stronger correlation with human judgments
of caption quality compared to existing measures. Notably, on Flickr8K-Expert,
CLAIR achieves relative correlation improvements over SPICE of 39.6% and over
image-augmented methods such as RefCLIP-S of 18.3%. Moreover, CLAIR provides
noisily interpretable results by allowing the language model to identify the
underlying reasoning behind its assigned score. Code is available at
https://davidmchan.github.io/clair/
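
At its core, the method reduces to a single LLM call: the model is shown the candidate caption and the reference captions, asked how likely it is that they describe the same image, and told to reply with a numeric score plus a justification. Below is a minimal sketch of that loop, assuming an OpenAI-compatible chat client and a JSON-formatted reply; the exact prompt wording, model, and parsing in the released code may differ.

```python
# Minimal CLAIR-style caption scoring sketch (illustrative, not the
# authors' exact implementation; see https://davidmchan.github.io/clair/).
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

_PROMPT = """\
You are trying to tell if a candidate caption describes the same image
as a set of reference captions.

Candidate caption:
{candidate}

Reference captions:
{references}

On a scale of 0 to 100, how likely is it that the candidate caption
describes the same image as the reference captions? Answer only with
JSON of the form {{"score": <0-100>, "reason": "<one sentence>"}}.
"""

def clair_score(candidate: str, references: list[str]) -> dict:
    """Ask the LLM for a 0-100 similarity score plus its reasoning."""
    prompt = _PROMPT.format(
        candidate=f"- {candidate}",
        references="\n".join(f"- {r}" for r in references),
    )
    reply = client.chat.completions.create(
        model="gpt-3.5-turbo",  # any capable zero-shot LLM works here
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,
    )
    # Production code should guard against malformed (non-JSON) replies.
    return json.loads(reply.choices[0].message.content)

result = clair_score(
    "a dog catches a frisbee in the park",
    ["a brown dog leaps for a frisbee on the grass",
     "a dog jumping to catch a flying disc outdoors"],
)
print(result["score"], "-", result["reason"])
```

The returned "reason" field is what gives the measure its (noisy) interpretability: the score comes packaged with the model's stated rationale rather than as a bare number.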
Related papers
- CLAIR-A: Leveraging Large Language Models to Judge Audio Captions [73.51087998971418]
Evaluating machine-generated audio captions is a complex task that requires considering diverse factors.
We propose CLAIR-A, a simple and flexible method that leverages the zero-shot capabilities of large language models.
In our evaluations, CLAIR-A better predicts human judgments of quality compared to traditional metrics.
arXiv Detail & Related papers (2024-09-19T17:59:52Z)
- A Novel Evaluation Framework for Image2Text Generation [15.10524860121122]
We propose an evaluation framework rooted in a modern large language model (LLM) capable of image generation.
A high similarity score suggests that the image captioning model has accurately generated textual descriptions.
A low similarity score indicates discrepancies, revealing potential shortcomings in the model's performance.
arXiv Detail & Related papers (2024-08-03T09:27:57Z)
- BRIDGE: Bridging Gaps in Image Captioning Evaluation with Stronger Visual Cues [47.213906345208315]
We propose BRIDGE, a new learnable and reference-free image captioning metric.
Our proposal achieves state-of-the-art results compared to existing reference-free evaluation scores.
arXiv Detail & Related papers (2024-07-29T18:00:17Z)
- HICEScore: A Hierarchical Metric for Image Captioning Evaluation [10.88292081473071]
We propose a novel reference-free metric for image captioning evaluation, dubbed Hierarchical Image Captioning Evaluation Score (HICE-S).
By detecting local visual regions and textual phrases, HICE-S builds an interpretable hierarchical scoring mechanism.
Our proposed metric achieves SOTA performance on several benchmarks, outperforming existing reference-free metrics.
arXiv Detail & Related papers (2024-07-26T08:24:30Z)
- Cobra Effect in Reference-Free Image Captioning Metrics [58.438648377314436]
A proliferation of reference-free methods, leveraging visual-language pre-trained models (VLMs), has emerged.
In this paper, we study if there are any deficiencies in reference-free metrics.
We employ GPT-4V as an evaluative tool to assess generated sentences, and the results reveal that our approach achieves state-of-the-art (SOTA) performance.
arXiv Detail & Related papers (2024-02-18T12:36:23Z)
- SILC: Improving Vision Language Pretraining with Self-Distillation [113.50400246862056]
We introduce SILC, a novel framework for vision language pretraining.
SILC improves image-text contrastive learning with the simple addition of local-to-global correspondence learning by self-distillation.
We show that distilling local image features from an exponential moving average (EMA) teacher model significantly improves model performance on dense prediction tasks like detection and segmentation.
arXiv Detail & Related papers (2023-10-20T08:44:47Z)
- Positive-Augmented Contrastive Learning for Image and Video Captioning Evaluation [47.40949434032489]
We propose a new contrastive-based evaluation metric for image captioning, namely Positive-Augmented Contrastive learning Score (PAC-S).
PAC-S unifies the learning of a contrastive visual-semantic space with the addition of generated images and text on curated data.
Experiments spanning several datasets demonstrate that our new metric achieves the highest correlation with human judgments on both images and videos.
arXiv Detail & Related papers (2023-03-21T18:03:14Z)
- Exploring CLIP for Assessing the Look and Feel of Images [87.97623543523858]
We introduce Contrastive Language-Image Pre-training (CLIP) models for assessing both the quality perception (look) and abstract perception (feel) of images in a zero-shot manner.
Our results show that CLIP captures meaningful priors that generalize well to different perceptual assessments (a minimal sketch of this zero-shot scoring idea appears after this list).
arXiv Detail & Related papers (2022-07-25T17:58:16Z)
- Intrinsic Image Captioning Evaluation [53.51379676690971]
We propose a learning-based metric for image captioning, which we call Intrinsic Image Captioning Evaluation (I2CE).
Experimental results show that our proposed method maintains robust performance and gives more flexible scores to candidate captions when encountering semantically similar expressions or less-aligned semantics.
arXiv Detail & Related papers (2020-12-14T08:36:05Z)
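
The zero-shot scheme in "Exploring CLIP for Assessing the Look and Feel of Images" can be illustrated by scoring an image against an antonym prompt pair and reading off the softmax probability of the positive prompt. Below is a minimal sketch using the Hugging Face transformers CLIP implementation; the checkpoint and prompt pairs are illustrative assumptions, not the authors' exact setup.

```python
# Sketch of zero-shot "look"/"feel" assessment with CLIP via antonym
# prompt pairs (illustrative; checkpoint and prompts are assumptions).
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def attribute_score(image: Image.Image, positive: str, negative: str) -> float:
    """Probability that the image matches the positive prompt of the pair."""
    inputs = processor(text=[positive, negative], images=image,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(**inputs).logits_per_image  # (1, 2) image-text similarity
    return logits.softmax(dim=-1)[0, 0].item()

img = Image.open("photo.jpg")
look = attribute_score(img, "Good photo.", "Bad photo.")    # quality perception
feel = attribute_score(img, "Happy photo.", "Sad photo.")   # abstract perception
print(f"look: {look:.3f}, feel: {feel:.3f}")
```

Pairing each prompt with its antonym, rather than scoring a single prompt, normalizes away CLIP's prompt-sensitivity: only the relative similarity within the pair matters.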