Surveying the Landscape of Image Captioning Evaluation: A Comprehensive Taxonomy and Novel Ensemble Method
- URL: http://arxiv.org/abs/2408.04909v1
- Date: Fri, 9 Aug 2024 07:31:06 GMT
- Title: Surveying the Landscape of Image Captioning Evaluation: A Comprehensive Taxonomy and Novel Ensemble Method
- Authors: Uri Berger, Gabriel Stanovsky, Omri Abend, Lea Frermann
- Abstract summary: We present the first survey and taxonomy of over 70 different image captioning metrics.
We find that despite the diversity of proposed metrics, the vast majority of studies rely on only five popular metrics.
We propose EnsembEval -- an ensemble of evaluation methods achieving the highest reported correlation with human judgements.
- Score: 35.71703501731081
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The task of image captioning has recently been gaining popularity, and with it the complex task of evaluating the quality of image captioning models. In this work, we present the first survey and taxonomy of over 70 different image captioning metrics and their usage in hundreds of papers. We find that despite the diversity of proposed metrics, the vast majority of studies rely on only five popular metrics, which we show to be weakly correlated with human judgements. Instead, we propose EnsembEval -- an ensemble of evaluation methods achieving the highest reported correlation with human judgements across 5 image captioning datasets, showing there is a lot of room for improvement by leveraging a diverse set of metrics.
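The abstract does not specify how EnsembEval combines its component metrics. As a rough, hypothetical sketch of the general ensemble idea, the snippet below fits per-metric weights against human ratings with ridge regression and scores new captions with the learned combination; the toy data, the regression choice, and all names here are illustrative assumptions, not the paper's actual method.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)

# Toy stand-ins: 200 captions scored by 5 automatic metrics, plus human ratings.
n_captions, n_metrics = 200, 5
metric_scores = rng.random((n_captions, n_metrics))       # column j = metric j
human_ratings = metric_scores @ rng.random(n_metrics)     # pretend humans ~ some mix
human_ratings += 0.1 * rng.standard_normal(n_captions)    # plus rating noise

# Z-normalize each metric so the learned weights are comparable across score ranges.
mu, sigma = metric_scores.mean(axis=0), metric_scores.std(axis=0)
ensemble = Ridge(alpha=1.0).fit((metric_scores - mu) / sigma, human_ratings)

def ensemble_score(scores: np.ndarray) -> np.ndarray:
    """Score new captions given their per-metric scores (same metric order)."""
    return ensemble.predict((scores - mu) / sigma)

print("learned metric weights:", ensemble.coef_)
```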
Related papers
- Evaluating authenticity and quality of image captions via sentiment and semantic analyses [0.0]
Deep learning relies heavily on huge amounts of labelled data for tasks such as natural language processing and computer vision.
In image-to-text or image-to-image pipelines, opinion (sentiment) may be inadvertently learned by a model from human-generated image captions.
This study proposes an evaluation method focused on sentiment and semantic richness.
arXiv Detail & Related papers (2024-09-14T23:50:23Z)
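The entry above leaves the implementation open; as a minimal sketch of one ingredient, the snippet below compares a candidate caption's sentiment against its references using NLTK's VADER analyzer. The gap-based comparison rule is an assumption for illustration, not the authors' method.

```python
from nltk.sentiment import SentimentIntensityAnalyzer  # needs nltk.download("vader_lexicon")

sia = SentimentIntensityAnalyzer()

def sentiment_gap(candidate: str, references: list[str]) -> float:
    """Absolute gap between the candidate's VADER compound sentiment (in [-1, 1])
    and the mean sentiment of the human reference captions."""
    cand = sia.polarity_scores(candidate)["compound"]
    refs = [sia.polarity_scores(r)["compound"] for r in references]
    return abs(cand - sum(refs) / len(refs))

# A large gap suggests the caption injects opinion absent from the references.
print(sentiment_gap("a gorgeous, breathtaking sunset", ["a sunset over the ocean"]))
```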
- Borrowing Human Senses: Comment-Aware Self-Training for Social Media Multimodal Classification [5.960550152906609]
We capture hinting features from user comments, which are retrieved by jointly leveraging visual and linguistic similarity.
The classification tasks are explored via self-training in a teacher-student framework, motivated by the typically limited scale of labeled data.
The results show that our method further advances the performance of previous state-of-the-art models.
arXiv Detail & Related papers (2023-03-27T08:59:55Z)
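The comment retrieval and multimodal encoders in the entry above are beyond a short snippet, but the teacher-student self-training loop it builds on is easy to sketch; the classifier, confidence threshold, and round count below are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def self_train(X_lab, y_lab, X_unlab, threshold=0.9, rounds=3):
    """Teacher-student self-training skeleton: the current model (teacher)
    pseudo-labels unlabeled examples it is confident about, and a new model
    (student) is retrained on the union of gold and pseudo-labels."""
    model = LogisticRegression(max_iter=1000).fit(X_lab, y_lab)
    for _ in range(rounds):
        proba = model.predict_proba(X_unlab)
        confident = proba.max(axis=1) >= threshold
        if not confident.any():
            break
        X_aug = np.vstack([X_lab, X_unlab[confident]])
        y_aug = np.concatenate([y_lab, proba[confident].argmax(axis=1)])
        model = LogisticRegression(max_iter=1000).fit(X_aug, y_aug)
    return model
```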
- Positive-Augmented Contrastive Learning for Image and Video Captioning Evaluation [47.40949434032489]
We propose a new contrastive evaluation metric for image captioning, namely the Positive-Augmented Contrastive learning Score (PAC-S).
PAC-S unifies the learning of a contrastive visual-semantic space with the addition of generated images and text on curated data.
Experiments spanning several datasets demonstrate that our new metric achieves the highest correlation with human judgments on both images and videos.
arXiv Detail & Related papers (2023-03-21T18:03:14Z)
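PAC-S builds on a CLIP-style dual encoder fine-tuned with generated positive pairs on curated data; as a loose stand-in only, the sketch below computes a plain image-text cosine similarity with an off-the-shelf CLIP from Hugging Face. The fine-tuning and positive augmentation that define PAC-S are omitted, so this is not the metric itself.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_style_score(image: Image.Image, caption: str) -> float:
    """Cosine similarity between CLIP image and caption embeddings."""
    inputs = processor(text=[caption], images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return float((img * txt).sum())
```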
- Are metrics measuring what they should? An evaluation of image captioning task metrics [0.21301560294088315]
Image captioning is an active research task: describing the content of an image in terms of the objects in the scene and their relationships.
The task draws on two major research areas, computer vision and natural language processing.
We present an evaluation of several kinds of Image Captioning metrics and a comparison between them using the well-known MS COCO dataset.
arXiv Detail & Related papers (2022-07-04T21:51:47Z)
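For reproducing this kind of metric comparison, the reference implementations of the classic captioning metrics are available in the pycocoevalcap package; a toy example follows. The captions below are made up, and real use on MS COCO would first run the PTB tokenizer over the annotations.

```python
# pip install pycocoevalcap
from pycocoevalcap.bleu.bleu import Bleu
from pycocoevalcap.cider.cider import Cider

# Both APIs expect {image_id: [caption, ...]} dicts of tokenized strings.
refs = {"img1": ["a dog runs on the beach", "a dog running along the shore"]}
hyps = {"img1": ["a dog is running on the beach"]}

bleu_scores, _ = Bleu(4).compute_score(refs, hyps)   # BLEU-1..BLEU-4
cider_score, _ = Cider().compute_score(refs, hyps)
print("BLEU:", bleu_scores, "CIDEr:", cider_score)
```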
- On Distinctive Image Captioning via Comparing and Reweighting [52.3731631461383]
In this paper, we aim to improve the distinctiveness of image captions via comparing and reweighting with a set of similar images.
Our metric reveals that the human annotations of each image in the MSCOCO dataset are not equivalent in distinctiveness.
In contrast, previous works typically treat all human annotations equally during training, which could be one reason for generating less distinctive captions.
arXiv Detail & Related papers (2022-04-08T08:59:23Z)
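The summary above does not give the paper's exact distinctiveness formulation. As a crude stand-in for the contrast idea, the sketch below scores a caption by its TF-IDF similarity to its own image's references minus its similarity to references of similar images; both the TF-IDF representation and the subtraction rule are assumptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def distinctiveness(caption: str, own_refs: list[str], similar_refs: list[str]) -> float:
    """Similarity to the image's own references minus similarity to the
    references of visually similar images: higher = more distinctive."""
    vecs = TfidfVectorizer().fit_transform([caption] + own_refs + similar_refs)
    sims = cosine_similarity(vecs[0], vecs[1:]).ravel()
    return sims[: len(own_refs)].mean() - sims[len(own_refs):].mean()
```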
- Transparent Human Evaluation for Image Captioning [70.03979566548823]
We develop a rubric-based human evaluation protocol for image captioning models.
We show that human-generated captions are of substantially higher quality than machine-generated ones.
We hope that this work will promote a more transparent evaluation protocol for image captioning.
arXiv Detail & Related papers (2021-11-17T07:09:59Z)
- Intrinsic Image Captioning Evaluation [53.51379676690971]
We propose a learned metric for image captioning, which we call Intrinsic Image Captioning Evaluation (I2CE).
Experimental results show that our method remains robust and assigns more flexible scores to candidate captions when faced with semantically similar expressions or less aligned semantics.
arXiv Detail & Related papers (2020-12-14T08:36:05Z)
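I2CE itself is a learned metric whose weights are not reproduced here; as a loose illustration of scoring semantically similar expressions that n-gram metrics would penalize, the sketch below uses off-the-shelf sentence embeddings. The model choice and the max-over-references rule are assumptions, not the I2CE design.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

def semantic_score(candidate: str, references: list[str]) -> float:
    """Max cosine similarity between the candidate and any reference embedding."""
    cand = model.encode(candidate, convert_to_tensor=True)
    refs = model.encode(references, convert_to_tensor=True)
    return float(util.cos_sim(cand, refs).max())

# Rewards a paraphrase that BLEU-style n-gram overlap would score near zero.
print(semantic_score("a man rides a bicycle", ["a person cycling down the street"]))
```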
- Improving Image Captioning with Better Use of Captions [65.39641077768488]
We present a novel image captioning architecture to better explore semantics available in captions and leverage that to enhance both image representation and caption generation.
Our models first construct caption-guided visual relationship graphs that introduce beneficial inductive bias using weakly supervised multi-instance learning.
During generation, the model further incorporates visual relationships using multi-task learning for jointly predicting word and object/predicate tag sequences.
arXiv Detail & Related papers (2020-06-21T14:10:47Z)
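The full architecture above is too large to sketch, but its multi-task objective, jointly predicting word and object/predicate tag sequences, reduces to a weighted sum of two sequence cross-entropies. A minimal PyTorch sketch, with the weighting factor `lam` as an assumption:

```python
import torch.nn.functional as F
from torch import Tensor

def joint_loss(word_logits: Tensor, words: Tensor,
               tag_logits: Tensor, tags: Tensor, lam: float = 0.5) -> Tensor:
    """Weighted sum of word-sequence and tag-sequence cross-entropies.
    Logits are (batch, seq, vocab); targets are (batch, seq) class indices."""
    word_loss = F.cross_entropy(word_logits.flatten(0, 1), words.flatten())
    tag_loss = F.cross_entropy(tag_logits.flatten(0, 1), tags.flatten())
    return word_loss + lam * tag_loss
```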
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences arising from its use.