Can Audio Captions Be Evaluated with Image Caption Metrics?
- URL: http://arxiv.org/abs/2110.04684v1
- Date: Sun, 10 Oct 2021 02:34:40 GMT
- Title: Can Audio Captions Be Evaluated with Image Caption Metrics?
- Authors: Zelin Zhou, Zhiling Zhang, Xuenan Xu, Zeyu Xie, Mengyue Wu, Kenny Q. Zhu
- Abstract summary: We propose a metric named FENSE, which combines the strength of Sentence-BERT in capturing similarity with a novel Error Detector that penalizes erroneous sentences for robustness.
On the newly established benchmarks, FENSE outperforms current metrics by 14-25% accuracy.
- Score: 11.45508807551818
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Automated audio captioning aims at generating textual descriptions for an
audio clip. To evaluate the quality of generated audio captions, previous works
directly adopt image captioning metrics like SPICE and CIDEr, without
justifying their suitability in this new domain, which may mislead the
development of advanced models. This problem is still unstudied due to the lack
of human judgment datasets on caption quality. Therefore, we first construct
two evaluation benchmarks, AudioCaps-Eval and Clotho-Eval. They are established
with pairwise comparison instead of absolute rating to achieve better
inter-annotator agreement. Current metrics are found to correlate poorly with
human annotations on these datasets. To overcome their limitations, we propose
a metric named FENSE, which combines the strength of Sentence-BERT in
capturing similarity with a novel Error Detector that penalizes erroneous
sentences for robustness. On the newly established benchmarks, FENSE
outperforms current metrics by 14-25% accuracy. Code, data, and a web demo are
available at: https://github.com/blmoistawinde/fense
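For intuition, the scoring recipe described in the abstract can be sketched in a few lines. The sketch below is hypothetical and is not the official implementation in the linked repository: it assumes the sentence-transformers package for Sentence-BERT similarity, averages similarity over the references, and uses a trivial placeholder where FENSE uses its trained Error Detector; the function names and the 0.9 penalty value are illustrative assumptions.

    # Minimal FENSE-style scoring sketch (hypothetical; see
    # https://github.com/blmoistawinde/fense for the official implementation).
    from sentence_transformers import SentenceTransformer, util

    sbert = SentenceTransformer("paraphrase-MiniLM-L6-v2")  # any Sentence-BERT model

    def has_fluency_error(candidate: str) -> bool:
        # Placeholder for FENSE's Error Detector, a classifier that flags
        # erroneous sentences (e.g. incomplete or repetitive ones).
        return False

    def fense_like_score(candidate: str, references: list, penalty: float = 0.9) -> float:
        cand_emb = sbert.encode(candidate, convert_to_tensor=True)
        ref_embs = sbert.encode(references, convert_to_tensor=True)
        # Sentence-BERT cosine similarity to the references (averaged here;
        # the exact aggregation is an assumption of this sketch).
        sim = util.cos_sim(cand_emb, ref_embs).mean().item()
        # Penalize candidates flagged as erroneous, as FENSE does.
        return sim * (1 - penalty) if has_fluency_error(candidate) else sim

    print(fense_like_score("a dog is barking loudly",
                           ["a dog barks", "loud barking of a dog"]))

Because AudioCaps-Eval and Clotho-Eval are built from pairwise comparisons, the reported accuracy measures how often a metric prefers the caption that human annotators prefer.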
Related papers
- BRIDGE: Bridging Gaps in Image Captioning Evaluation with Stronger Visual Cues [47.213906345208315]
We propose BRIDGE, a new learnable and reference-free image captioning metric.
Our proposal achieves state-of-the-art results compared to existing reference-free evaluation scores.
arXiv Detail & Related papers (2024-07-29T18:00:17Z)
- Cobra Effect in Reference-Free Image Captioning Metrics [58.438648377314436]
A proliferation of reference-free methods, leveraging visual-language pre-trained models (VLMs), has emerged.
In this paper, we study if there are any deficiencies in reference-free metrics.
We employ GPT-4V as an evaluative tool to assess generated sentences, and the results reveal that our approach achieves state-of-the-art (SOTA) performance.
arXiv Detail & Related papers (2024-02-18T12:36:23Z)
- InfoMetIC: An Informative Metric for Reference-free Image Caption Evaluation [69.1642316502563]
We propose an Informative Metric for Reference-free Image Caption evaluation (InfoMetIC).
Given an image and a caption, InfoMetIC is able to report incorrect words and unmentioned image regions at a fine-grained level.
We also construct a token-level evaluation dataset and demonstrate the effectiveness of InfoMetIC in fine-grained evaluation.
arXiv Detail & Related papers (2023-05-10T09:22:44Z)
- Models See Hallucinations: Evaluating the Factuality in Video Captioning [57.85548187177109]
We conduct a human evaluation of the factuality in video captioning and collect two annotated factuality datasets.
We find that 57.0% of the model-generated sentences have factual errors, indicating it is a severe problem in this field.
We propose a weakly-supervised, model-based factuality metric FactVC, which outperforms previous metrics on factuality evaluation of video captioning.
arXiv Detail & Related papers (2023-03-06T08:32:50Z)
- Transparent Human Evaluation for Image Captioning [70.03979566548823]
We develop a rubric-based human evaluation protocol for image captioning models.
We show that human-generated captions are of substantially higher quality than machine-generated ones.
We hope that this work will promote a more transparent evaluation protocol for image captioning.
arXiv Detail & Related papers (2021-11-17T07:09:59Z)
- EMScore: Evaluating Video Captioning via Coarse-Grained and Fine-Grained Embedding Matching [90.98122161162644]
Current metrics for video captioning are mostly based on the text-level comparison between reference and candidate captions.
We propose EMScore (Embedding Matching-based score), a novel reference-free metric for video captioning.
We exploit a well pre-trained vision-language model to extract visual and linguistic embeddings for computing EMScore (a rough sketch of this embedding-matching idea appears after this list).
arXiv Detail & Related papers (2021-11-17T06:02:43Z)
- COSMic: A Coherence-Aware Generation Metric for Image Descriptions [27.41088864449921]
Image captioning metrics have struggled to give accurate learned estimates of the semantic and pragmatic success of generated captions.
We present the first learned generation metric for evaluating output captions.
We demonstrate better agreement with human judgments for our proposed metric on the outputs of a number of state-of-the-art caption models, compared to several other metrics such as BLEURT and BERTScore.
arXiv Detail & Related papers (2021-09-11T13:43:36Z)
- Contrastive Semantic Similarity Learning for Image Captioning Evaluation with Intrinsic Auto-encoder [52.42057181754076]
Motivated by the auto-encoder mechanism and contrastive representation learning advances, we propose a learning-based metric for image captioning.
We develop three progressive model structures to learn the sentence-level representations.
Experiment results show that our proposed method can align well with the scores generated from other contemporary metrics.
arXiv Detail & Related papers (2021-06-29T12:27:05Z)
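As a side note on the embedding-matching idea behind EMScore above, a rough reference-free sketch follows. It is not the authors' implementation: it assumes CLIP via the transformers library, mean-pools frame embeddings for the coarse-grained score, and uses per-frame-to-caption similarity as a crude stand-in for the paper's fine-grained frame-word matching; the function name and the equal weighting are illustrative assumptions.

    # Rough EMScore-style sketch (hypothetical): reference-free matching of
    # sampled video frames against a candidate caption using CLIP embeddings.
    import torch
    from transformers import CLIPModel, CLIPProcessor

    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    def emscore_like(frames, caption: str) -> float:
        # frames: list of PIL.Image frames sampled from the video clip
        inputs = processor(text=[caption], images=frames,
                           return_tensors="pt", padding=True)
        with torch.no_grad():
            frame_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
            text_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                               attention_mask=inputs["attention_mask"])
        frame_emb = torch.nn.functional.normalize(frame_emb, dim=-1)
        text_emb = torch.nn.functional.normalize(text_emb, dim=-1)
        # Coarse-grained: mean-pooled video embedding vs. the whole caption.
        video_emb = torch.nn.functional.normalize(frame_emb.mean(dim=0, keepdim=True), dim=-1)
        coarse = (video_emb @ text_emb.T).item()
        # "Fine-grained" stand-in: average per-frame similarity to the caption
        # (the paper matches frames against individual words).
        fine = (frame_emb @ text_emb.T).mean().item()
        return 0.5 * (coarse + fine)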