EMScore: Evaluating Video Captioning via Coarse-Grained and Fine-Grained
Embedding Matching
- URL: http://arxiv.org/abs/2111.08919v1
- Date: Wed, 17 Nov 2021 06:02:43 GMT
- Title: EMScore: Evaluating Video Captioning via Coarse-Grained and Fine-Grained
Embedding Matching
- Authors: Yaya Shi, Xu Yang, Haiyang Xu, Chunfeng Yuan, Bing Li, Weiming Hu,
Zheng-Jun Zha
- Abstract summary: Current metrics for video captioning are mostly based on the text-level comparison between reference and candidate captions.
We propose EMScore (Embedding Matching-based score), a novel reference-free metric for video captioning.
We exploit a well pre-trained vision-language model to extract visual and linguistic embeddings for computing EMScore.
- Score: 90.98122161162644
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Current metrics for video captioning are mostly based on the text-level
comparison between reference and candidate captions. However, they have some
inherent drawbacks: they cannot handle videos without references, and they may
produce biased evaluations due to the one-to-many nature of video-to-text and
their neglect of visual relevance. From a human evaluator's viewpoint, a
high-quality caption should be consistent with the provided video, but need not
be similar to the references in wording or semantics.
Inspired by human evaluation, we propose EMScore (Embedding Matching-based
score), a novel reference-free metric for video captioning, which directly
measures the similarity between the video and candidate captions. Benefiting
from recent developments in large-scale pre-training, we exploit a well
pre-trained vision-language model to extract visual and linguistic embeddings
for computing EMScore. Specifically, EMScore combines matching scores of both
coarse-grained (video and caption) and fine-grained (frames and words) levels,
which takes the overall understanding and detailed characteristics of the video
into account. Furthermore, considering the potential information gain, EMScore
can be flexibly extended to the conditions where human-labeled references are
available. Finally, we collect the VATEX-EVAL and ActivityNet-FOIL
datasets to systematically evaluate the existing metrics. VATEX-EVAL
experiments demonstrate that EMScore has higher human correlation and lower
reference dependency. ActivityNet-FOIL experiment verifies that EMScore can
effectively identify "hallucinating" captions. The datasets will be released to
facilitate the development of video captioning metrics. The code is available
at: https://github.com/ShiYaya/emscore.
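To make the matching scheme concrete, here is a minimal sketch of coarse- plus
fine-grained embedding matching in the spirit of EMScore. It assumes CLIP
ViT-B/32 via the Hugging Face transformers library, mean-pooled frame features
as the video-level embedding, and a greedy frame-word matching (BERTScore-style
F1) for the fine-grained term; it is an illustration only, not the authors'
released implementation (see the repository linked above).

```python
# Sketch of EMScore-style coarse- and fine-grained embedding matching.
# Assumptions (not the official code): CLIP ViT-B/32 through Hugging Face
# transformers, mean-pooled frames as the video embedding, greedy frame-word
# matching, and a simple average of the two granularities.
import torch
import torch.nn.functional as F
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def emscore_sketch(frames, caption):
    """frames: list of PIL images sampled from the video; caption: candidate string."""
    # Frame-level visual embeddings, projected into the joint space and normalized.
    image_inputs = processor(images=frames, return_tensors="pt")
    frame_embs = F.normalize(model.get_image_features(**image_inputs), dim=-1)   # (F, D)

    # Sentence-level and word-level textual embeddings.
    text_inputs = processor(text=[caption], return_tensors="pt", padding=True)
    sent_emb = F.normalize(model.get_text_features(**text_inputs), dim=-1)       # (1, D)
    token_states = model.text_model(**text_inputs).last_hidden_state
    word_embs = F.normalize(model.text_projection(token_states)[0], dim=-1)      # (T, D)
    # Note: special tokens are kept in the word embeddings for simplicity.

    # Coarse-grained: video-level embedding (mean of frames) vs. sentence embedding.
    video_emb = F.normalize(frame_embs.mean(dim=0, keepdim=True), dim=-1)
    coarse = (video_emb @ sent_emb.T).item()

    # Fine-grained: greedy matching between frames and words (BERTScore-style F1).
    sim = frame_embs @ word_embs.T                 # (F, T) cosine similarities
    precision = sim.max(dim=0).values.mean()       # each word to its best frame
    recall = sim.max(dim=1).values.mean()          # each frame to its best word
    fine = (2 * precision * recall / (precision + recall)).item()

    # Combine the two granularities; a plain average is used in this sketch.
    return 0.5 * (coarse + fine)
```

Combining both granularities mirrors the paper's point that a good caption must
match the overall video content as well as its frame-level details.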
Related papers
- BRIDGE: Bridging Gaps in Image Captioning Evaluation with Stronger Visual Cues [47.213906345208315]
We propose BRIDGE, a new learnable and reference-free image captioning metric.
Our proposal achieves state-of-the-art results compared to existing reference-free evaluation scores.
arXiv Detail & Related papers (2024-07-29T18:00:17Z)
- HICEScore: A Hierarchical Metric for Image Captioning Evaluation [10.88292081473071]
We propose a novel reference-free metric for image captioning evaluation, dubbed Hierarchical Image Captioning Evaluation Score (HICE-S).
By detecting local visual regions and textual phrases, HICE-S builds an interpretable hierarchical scoring mechanism.
Our proposed metric achieves the SOTA performance on several benchmarks, outperforming existing reference-free metrics.
arXiv Detail & Related papers (2024-07-26T08:24:30Z)
- DeVAn: Dense Video Annotation for Video-Language Models [68.70692422636313]
We present a novel human-annotated dataset for evaluating the ability of vision-language models to generate descriptions of real-world video clips.
The dataset contains 8.5K YouTube video clips of 20-60 seconds in duration and covers a wide range of topics and interests.
arXiv Detail & Related papers (2023-10-08T08:02:43Z)
- InfoMetIC: An Informative Metric for Reference-free Image Caption Evaluation [69.1642316502563]
We propose an Informative Metric for Reference-free Image Caption evaluation (InfoMetIC).
Given an image and a caption, InfoMetIC is able to report incorrect words and unmentioned image regions at fine-grained level.
We also construct a token-level evaluation dataset and demonstrate the effectiveness of InfoMetIC in fine-grained evaluation.
arXiv Detail & Related papers (2023-05-10T09:22:44Z)
- Models See Hallucinations: Evaluating the Factuality in Video Captioning [57.85548187177109]
We conduct a human evaluation of the factuality in video captioning and collect two annotated factuality datasets.
We find that 57.0% of model-generated sentences contain factual errors, indicating that this is a severe problem in the field.
We propose a weakly-supervised, model-based factuality metric FactVC, which outperforms previous metrics on factuality evaluation of video captioning.
arXiv Detail & Related papers (2023-03-06T08:32:50Z)
- Fighting FIRe with FIRE: Assessing the Validity of Text-to-Video Retrieval Benchmarks [6.540440003084223]
Video captioning datasets have been re-purposed as text-to-video retrieval benchmarks.
Many alternate videos also match the caption, which introduces false-negative caption-video pairs.
We show that when these false negatives are corrected, a recent state-of-the-art model gains 25% recall points.
arXiv Detail & Related papers (2022-10-10T22:45:06Z)
- Contrastive Semantic Similarity Learning for Image Captioning Evaluation with Intrinsic Auto-encoder [52.42057181754076]
Motivated by the auto-encoder mechanism and advances in contrastive representation learning, we propose a learning-based metric for image captioning.
We develop three progressive model structures to learn sentence-level representations.
Experiment results show that our proposed method can align well with the scores generated from other contemporary metrics.
arXiv Detail & Related papers (2021-06-29T12:27:05Z)
- CLIPScore: A Reference-free Evaluation Metric for Image Captioning [44.14502257230038]
We show that CLIP, a cross-modal model pretrained on 400M image+caption pairs from the web, can be used for robust automatic evaluation of image captioning without the need for references.
Experiments spanning several corpora demonstrate that our new reference-free metric, CLIPScore, achieves the highest correlation with human judgements.
We also present a reference-augmented version, RefCLIPScore, which achieves even higher correlation (a sketch of both scores appears after this list).
arXiv Detail & Related papers (2021-04-18T05:00:29Z)
- Intrinsic Image Captioning Evaluation [53.51379676690971]
We propose a learning-based metric for image captioning, which we call Intrinsic Image Captioning Evaluation (I2CE).
Experimental results show that our proposed method maintains robust performance and gives more flexible scores to candidate captions when encountering semantically similar expressions or less-aligned semantics.
arXiv Detail & Related papers (2020-12-14T08:36:05Z)
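As a companion to the CLIPScore entry above, the following is a minimal sketch
of a CLIPScore-style reference-free metric and a RefCLIPScore-style
reference-augmented variant. The scaling constant w = 2.5 follows the published
formulation; the checkpoint choice and function names are illustrative
assumptions, not the official implementation.

```python
# Sketch of a CLIPScore-style metric (reference-free) and a RefCLIPScore-style
# reference-augmented variant. Checkpoint and helper names are illustrative.
import torch
import torch.nn.functional as F
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def clip_score(image, candidate, w=2.5):
    # w * max(cos(image, caption), 0), following the CLIPScore formulation.
    inputs = processor(text=[candidate], images=image, return_tensors="pt", padding=True)
    img = F.normalize(model.get_image_features(pixel_values=inputs["pixel_values"]), dim=-1)
    txt = F.normalize(model.get_text_features(input_ids=inputs["input_ids"],
                                              attention_mask=inputs["attention_mask"]), dim=-1)
    return w * torch.clamp(img @ txt.T, min=0).item()

@torch.no_grad()
def ref_clip_score(image, candidate, references, w=2.5):
    # Harmonic mean of the reference-free score and the best candidate-reference
    # cosine similarity (clamped at zero).
    cs = clip_score(image, candidate, w)
    texts = processor(text=[candidate] + references, return_tensors="pt", padding=True)
    embs = F.normalize(model.get_text_features(**texts), dim=-1)
    ref_sim = torch.clamp(embs[0:1] @ embs[1:].T, min=0).max().item()
    return 2 * cs * ref_sim / (cs + ref_sim) if (cs + ref_sim) > 0 else 0.0
```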