Fighting FIRe with FIRE: Assessing the Validity of Text-to-Video
Retrieval Benchmarks
- URL: http://arxiv.org/abs/2210.05038v2
- Date: Wed, 19 Apr 2023 03:50:48 GMT
- Title: Fighting FIRe with FIRE: Assessing the Validity of Text-to-Video
Retrieval Benchmarks
- Authors: Pedro Rodriguez, Mahmoud Azab, Becka Silvert, Renato Sanchez, Linzy
Labson, Hardik Shah and Seungwhan Moon
- Abstract summary: Video captioning datasets have been re-purposed to evaluate text-to-video retrieval models.
Many alternate videos also match the caption, which introduces false-negative caption-video pairs.
We show that when these false negatives are corrected, a recent state-of-the-art model gains 25% recall points.
- Score: 6.540440003084223
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Searching troves of videos with textual descriptions is a core multimodal
retrieval task. Owing to the lack of a purpose-built dataset for text-to-video
retrieval, video captioning datasets have been re-purposed to evaluate models
by (1) treating captions as positive matches to their respective videos and (2)
assuming all other videos to be negatives. However, this methodology leads to a
fundamental flaw during evaluation: since captions are marked as relevant only
to their original video, many alternate videos also match the caption, which
introduces false-negative caption-video pairs. We show that when these false
negatives are corrected, a recent state-of-the-art model gains 25% recall
points -- a difference that threatens the validity of the benchmark itself. To
diagnose and mitigate this issue, we annotate and release 683K additional
caption-video pairs. Using these, we recompute effectiveness scores for three
models on two standard benchmarks (MSR-VTT and MSVD). We find that (1) the
recomputed metrics are up to 25% recall points higher for the best models, (2)
these benchmarks are nearing saturation for Recall@10, (3) caption length
(generality) is related to the number of positives, and (4) annotation costs
can be mitigated through sampling. We recommend retiring these benchmarks in
their current form, and we make recommendations for future text-to-video
retrieval benchmarks.
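The gap the abstract describes comes entirely from how relevance judgments are defined at evaluation time. The following sketch is illustrative only (the video ids, rankings, and relevance sets are invented, not the paper's released annotations); it shows how the same ranked lists yield higher Recall@k once alternate videos that also match a caption are counted as relevant.

# Minimal sketch (not the paper's code): Recall@k under single-positive
# vs. corrected multi-positive relevance judgments. All data below is
# invented for illustration, not MSR-VTT or MSVD annotations.

def recall_at_k(rankings, relevant, k):
    """Fraction of caption queries with at least one relevant video in the top k."""
    hits = sum(
        1
        for query, ranked_videos in rankings.items()
        if relevant[query] & set(ranked_videos[:k])
    )
    return hits / len(rankings)

# Each caption query maps to the model's ranked list of video ids.
rankings = {
    "a man cooks pasta":   ["v7", "v2", "v9", "v1", "v4"],
    "a dog runs on grass": ["v3", "v8", "v5", "v6", "v2"],
}

# Original benchmark: only the originally paired video counts as relevant.
single_positive = {"a man cooks pasta": {"v1"}, "a dog runs on grass": {"v5"}}

# Corrected judgments: alternate videos that also match the caption are added.
corrected = {"a man cooks pasta": {"v1", "v7"}, "a dog runs on grass": {"v5", "v8"}}

for name, qrels in [("single-positive", single_positive), ("corrected", corrected)]:
    print(f"{name:15s} Recall@1 = {recall_at_k(rankings, qrels, k=1):.2f}")

With these toy inputs, Recall@1 rises from 0.00 to 0.50 without any change to the model's rankings, which is the kind of shift in measured effectiveness the paper reports once false negatives are annotated.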
Related papers
- Retrieval Enhanced Zero-Shot Video Captioning [69.96136689829778]
We bridge video and text using three key models: a general video understanding model XCLIP, a general image understanding model CLIP, and a text generation model GPT-2.
To connect these frozen models, we propose using learnable tokens as a communication medium between frozen GPT-2 and frozen XCLIP.
Experiments show 4% to 20% improvements in terms of the main metric CIDEr compared to the existing state-of-the-art methods.
arXiv Detail & Related papers (2024-05-11T16:22:00Z)
- DeVAn: Dense Video Annotation for Video-Language Models [68.70692422636313]
We present a novel human-annotated dataset for evaluating the ability of visual-language models to generate descriptions for real-world video clips.
The dataset contains 8.5K YouTube video clips of 20-60 seconds in duration and covers a wide range of topics and interests.
arXiv Detail & Related papers (2023-10-08T08:02:43Z)
- Dual-Modal Attention-Enhanced Text-Video Retrieval with Triplet Partial Margin Contrastive Learning [35.404100473539195]
Text-video retrieval aims to rank relevant text/video higher than irrelevant ones.
Recent contrastive learning methods have shown promising results for text-video retrieval.
This paper improves contrastive learning using two novel techniques.
arXiv Detail & Related papers (2023-09-20T06:08:11Z)
- Models See Hallucinations: Evaluating the Factuality in Video Captioning [57.85548187177109]
We conduct a human evaluation of the factuality in video captioning and collect two annotated factuality datasets.
We find that 57.0% of the model-generated sentences contain factual errors, indicating that factual inaccuracy is a severe problem in this field.
We propose a weakly-supervised, model-based factuality metric FactVC, which outperforms previous metrics on factuality evaluation of video captioning.
arXiv Detail & Related papers (2023-03-06T08:32:50Z)
- Learning video retrieval models with relevance-aware online mining [16.548016892117083]
A typical approach consists in learning a joint text-video embedding space, where the similarity of a video and its associated caption is maximized.
This approach assumes that only the paired videos and captions in the dataset are valid matches, but other captions (positives) may also describe a video's visual content, so some of them are wrongly penalized.
We propose Relevance-Aware Negatives and Positives mining (RANP), which, based on the semantics of the negatives, improves their selection while also increasing the similarity of other valid positives (a generic sketch of such a contrastive objective, with a false-negative mask, follows the related-papers list).
arXiv Detail & Related papers (2022-03-16T15:23:55Z)
- EMScore: Evaluating Video Captioning via Coarse-Grained and Fine-Grained Embedding Matching [90.98122161162644]
Current metrics for video captioning are mostly based on the text-level comparison between reference and candidate captions.
We propose EMScore (Embedding Matching-based score), a novel reference-free metric for video captioning.
We exploit a well pre-trained vision-language model to extract visual and linguistic embeddings for computing EMScore.
arXiv Detail & Related papers (2021-11-17T06:02:43Z)
- Group-aware Contrastive Regression for Action Quality Assessment [85.43203180953076]
We show that the relations among videos can provide important clues for more accurate action quality assessment.
Our approach outperforms previous methods by a large margin and establishes new state-of-the-art on all three benchmarks.
arXiv Detail & Related papers (2021-08-17T17:59:39Z)
- On Semantic Similarity in Video Retrieval [31.61611168620582]
We propose a move to semantic similarity video retrieval, where multiple videos/captions can be deemed equally relevant.
Our analysis is performed on three commonly used video retrieval datasets (MSR-VTT, YouCook2 and EPIC-KITCHENS).
arXiv Detail & Related papers (2021-03-18T09:12:40Z)
- Generalized Few-Shot Video Classification with Video Retrieval and Feature Generation [132.82884193921535]
We argue that previous methods underestimate the importance of video feature learning and propose a two-stage approach.
We show that this simple baseline approach outperforms prior few-shot video classification methods by over 20 points on existing benchmarks.
We present two novel approaches that yield further improvement.
arXiv Detail & Related papers (2020-07-09T13:05:32Z)
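Several entries above (the triplet partial margin contrastive retrieval paper and the relevance-aware online mining paper) train a joint text-video embedding space with a contrastive objective that pulls each paired caption and video together and pushes all other in-batch pairs apart. The sketch below is a generic, minimal version of such an objective, not the exact loss of any paper listed here; the optional false-negative mask is one illustrative way to stop penalizing caption-video pairs that are known to match, in the spirit of the corrected annotations discussed in the main abstract.

# Generic sketch of a symmetric text-video contrastive (InfoNCE-style) loss.
# Not the method of any paper above; shapes and masking strategy are
# illustrative assumptions.
import torch
import torch.nn.functional as F

def contrastive_loss(text_emb, video_emb, false_negative_mask=None, temperature=0.07):
    """text_emb, video_emb: (batch, dim); row i of each is a paired caption/video.
    false_negative_mask: optional (batch, batch) bool tensor marking off-diagonal
    pairs that are actually valid matches and should not be treated as negatives."""
    text_emb = F.normalize(text_emb, dim=-1)
    video_emb = F.normalize(video_emb, dim=-1)
    logits = text_emb @ video_emb.t() / temperature  # (batch, batch) cosine similarities

    if false_negative_mask is not None:
        # Remove known false negatives from the softmax denominator.
        logits = logits.masked_fill(false_negative_mask, float("-inf"))

    targets = torch.arange(text_emb.size(0), device=text_emb.device)
    # Symmetric loss: text-to-video and video-to-text retrieval directions.
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

# Toy usage with random embeddings standing in for encoder outputs.
t, v = torch.randn(4, 256), torch.randn(4, 256)
print(contrastive_loss(t, v).item())

Masking known matches out of the denominator is only one design choice; RANP instead promotes them to additional positives, which this sketch does not attempt to reproduce.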