Positive-Augmented Contrastive Learning for Image and Video Captioning Evaluation
- URL: http://arxiv.org/abs/2303.12112v3
- Date: Thu, 20 Jul 2023 08:16:09 GMT
- Title: Positive-Augmented Contrastive Learning for Image and Video Captioning Evaluation
- Authors: Sara Sarto, Manuele Barraco, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara
- Abstract summary: We propose a new contrastive-based evaluation metric for image captioning, namely Positive-Augmented Contrastive learning Score (PAC-S).
PAC-S unifies the learning of a contrastive visual-semantic space with the addition of generated images and text on curated data.
Experiments spanning several datasets demonstrate that our new metric achieves the highest correlation with human judgments on both images and videos.
- Score: 47.40949434032489
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The CLIP model has been recently proven to be very effective for a variety of
cross-modal tasks, including the evaluation of captions generated from
vision-and-language architectures. In this paper, we propose a new recipe for a
contrastive-based evaluation metric for image captioning, namely
Positive-Augmented Contrastive learning Score (PAC-S), that in a novel way
unifies the learning of a contrastive visual-semantic space with the addition
of generated images and text on curated data. Experiments spanning several
datasets demonstrate that our new metric achieves the highest correlation with
human judgments on both images and videos, outperforming existing
reference-based metrics like CIDEr and SPICE and reference-free metrics like
CLIP-Score. Finally, we test the system-level correlation of the proposed
metric when considering popular image captioning approaches, and assess the
impact of employing different cross-modal features. Our source code and trained
models are publicly available at: https://github.com/aimagelab/pacscore.
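For intuition, below is a minimal sketch of the reference-free scoring recipe that PAC-S builds on: a caption is scored by the cosine similarity between CLIP image and text embeddings, clipped at zero and rescaled (the CLIPScore convention, with weight w = 2.5). The ViT-B/32 backbone, the weight value, and the helper name caption_score are illustrative assumptions, not the authors' implementation; PAC-S itself uses a backbone fine-tuned with positive-augmented contrastive learning, for which the repository above provides the actual code.
```python
# Sketch of a CLIPScore-style, reference-free caption score (assumption: PAC-S
# shares this overall shape but swaps in a positive-augmented fine-tuned backbone;
# see https://github.com/aimagelab/pacscore for the official implementation).
import torch
import clip  # OpenAI CLIP package
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def caption_score(image_path: str, caption: str, w: float = 2.5) -> float:
    """Cosine similarity between CLIP image and text embeddings,
    clipped at zero and rescaled by w (the CLIPScore convention)."""
    image = preprocess(Image.open(image_path)).unsqueeze(0).to(device)
    tokens = clip.tokenize([caption]).to(device)
    with torch.no_grad():
        img_feat = model.encode_image(image).float()
        txt_feat = model.encode_text(tokens).float()
    # Normalize so the dot product equals the cosine similarity.
    img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
    txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)
    cos = (img_feat @ txt_feat.T).item()
    return w * max(cos, 0.0)

# Example usage (hypothetical file and caption):
# print(caption_score("example.jpg", "a dog playing with a ball in the park"))
```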
Related papers
- BRIDGE: Bridging Gaps in Image Captioning Evaluation with Stronger Visual Cues [47.213906345208315]
We propose BRIDGE, a new learnable and reference-free image captioning metric.
Our proposal achieves state-of-the-art results compared to existing reference-free evaluation scores.
arXiv Detail & Related papers (2024-07-29T18:00:17Z)
- HICEScore: A Hierarchical Metric for Image Captioning Evaluation [10.88292081473071]
We propose a novel reference-free metric for image captioning evaluation, dubbed Hierarchical Image Captioning Evaluation Score (HICE-S).
By detecting local visual regions and textual phrases, HICE-S builds an interpretable hierarchical scoring mechanism.
Our proposed metric achieves state-of-the-art performance on several benchmarks, outperforming existing reference-free metrics.
arXiv Detail & Related papers (2024-07-26T08:24:30Z)
- CrossScore: Towards Multi-View Image Evaluation and Scoring [24.853612457257697]
Our cross-reference image quality assessment method fills a gap in the image assessment landscape.
Our method enables accurate image quality assessment without requiring ground truth references.
arXiv Detail & Related papers (2024-04-22T17:59:36Z)
- Exploring CLIP for Assessing the Look and Feel of Images [87.97623543523858]
We introduce Contrastive Language-Image Pre-training (CLIP) models for assessing both the quality perception (look) and abstract perception (feel) of images in a zero-shot manner.
Our results show that CLIP captures meaningful priors that generalize well to different perceptual assessments.
arXiv Detail & Related papers (2022-07-25T17:58:16Z)
- Prompt-based Learning for Unpaired Image Captioning [86.44188293709307]
Unpaired Image Captioning (UIC) has been developed to learn image descriptions from unaligned vision-language sample pairs.
Recent successes of Vision-Language Pre-Trained Models (VL-PTMs) have triggered the development of prompt-based learning.
In this paper, we present a novel prompt-based scheme to train the UIC model, making the best use of this powerful generalization ability.
arXiv Detail & Related papers (2022-05-26T03:13:43Z)
- Intrinsic Image Captioning Evaluation [53.51379676690971]
We propose a learning-based metric for image captioning, which we call Intrinsic Image Captioning Evaluation (I2CE).
Experimental results show that our proposed method maintains robust performance and assigns more flexible scores to candidate captions that contain semantically similar expressions or less aligned semantics.
arXiv Detail & Related papers (2020-12-14T08:36:05Z)
- Improving Image Captioning with Better Use of Captions [65.39641077768488]
We present a novel image captioning architecture to better explore semantics available in captions and leverage that to enhance both image representation and caption generation.
Our models first construct caption-guided visual relationship graphs that introduce beneficial inductive bias using weakly supervised multi-instance learning.
During generation, the model further incorporates visual relationships using multi-task learning for jointly predicting word and object/predicate tag sequences.
arXiv Detail & Related papers (2020-06-21T14:10:47Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.