On Semantic Similarity in Video Retrieval
- URL: http://arxiv.org/abs/2103.10095v1
- Date: Thu, 18 Mar 2021 09:12:40 GMT
- Title: On Semantic Similarity in Video Retrieval
- Authors: Michael Wray, Hazel Doughty, Dima Damen
- Abstract summary: We propose a move to semantic similarity video retrieval, where multiple videos/captions can be deemed equally relevant.
Our analysis is performed on three commonly used video retrieval datasets (MSR-VTT, YouCook2 and EPIC-KITCHENS).
- Score: 31.61611168620582
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Current video retrieval efforts all found their evaluation on an
instance-based assumption, that only a single caption is relevant to a query
video and vice versa. We demonstrate that this assumption results in
performance comparisons often not indicative of models' retrieval capabilities.
We propose a move to semantic similarity video retrieval, where (i) multiple
videos/captions can be deemed equally relevant, and their relative ranking does
not affect a method's reported performance and (ii) retrieved videos/captions
are ranked by their similarity to a query. We propose several proxies to
estimate semantic similarities in large-scale retrieval datasets, without
additional annotations. Our analysis is performed on three commonly used video
retrieval datasets (MSR-VTT, YouCook2 and EPIC-KITCHENS).
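The proposed evaluation can be made concrete with a graded-relevance metric. Below is a minimal sketch, assuming nDCG as the metric and bag-of-words overlap between captions as the semantic-similarity proxy (one of several proxies the paper considers); the captions and values are illustrative, not the paper's code.

```python
import numpy as np

def bow_similarity(cap_a: str, cap_b: str) -> float:
    """Bag-of-words IoU between two captions -- a simple proxy for
    semantic relevance that requires no additional annotation."""
    a, b = set(cap_a.lower().split()), set(cap_b.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0

def ndcg(relevance_ranked: np.ndarray) -> float:
    """nDCG over graded relevance: equally relevant items can be
    swapped without changing the score, unlike instance-based Recall@K."""
    discounts = 1.0 / np.log2(np.arange(2, len(relevance_ranked) + 2))
    dcg = float(np.sum(relevance_ranked * discounts))
    ideal = float(np.sum(np.sort(relevance_ranked)[::-1] * discounts))
    return dcg / ideal if ideal > 0 else 0.0

# Toy example: a query caption vs. a gallery ranked by some model.
query = "cut the onion on the board"
gallery_ranked = ["slice the onion", "cut an onion on a chopping board",
                  "whisk the eggs"]
rel = np.array([bow_similarity(query, c) for c in gallery_ranked])
print(f"nDCG = {ndcg(rel):.3f}")
```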
Related papers
- Not All Pairs are Equal: Hierarchical Learning for Average-Precision-Oriented Video Retrieval [80.09819072780193]
Average Precision (AP) assesses the overall ranking of relevant videos at the top of the list.
Recent video retrieval methods utilize pair-wise losses that treat all sample pairs equally.
arXiv Detail & Related papers (2024-07-22T11:52:04Z)
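For reference, a minimal sketch of Average Precision over a binary-relevance ranking, which shows why a relevant item's rank matters; the example lists are illustrative, not from the paper:

```python
def average_precision(relevant: list[bool]) -> float:
    """AP of a ranked list with binary relevance: the mean of
    precision@k at each rank k that holds a relevant item."""
    hits, precisions = 0, []
    for k, is_rel in enumerate(relevant, start=1):
        if is_rel:
            hits += 1
            precisions.append(hits / k)
    return sum(precisions) / len(precisions) if precisions else 0.0

# A relevant video at rank 1 contributes far more than one at rank 5,
# which is why treating all sample pairs equally in the loss is suboptimal.
print(average_precision([True, False, True, False, False]))  # (1/1 + 2/3) / 2 ≈ 0.833
print(average_precision([False, False, True, False, True]))  # (1/3 + 2/5) / 2 ≈ 0.367
```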
- Adapting Dual-encoder Vision-language Models for Paraphrased Retrieval [55.90407811819347]
We consider the task of paraphrased text-to-image retrieval where a model aims to return similar results given a pair of paraphrased queries.
We train a dual-encoder model starting from a language model pretrained on a large text corpus.
Compared to public dual-encoder models such as CLIP and OpenCLIP, the model trained with our best adaptation strategy achieves a significantly higher ranking similarity for paraphrased queries.
arXiv Detail & Related papers (2024-05-06T06:30:17Z)
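One way to quantify "ranking similarity" between paraphrased queries is rank correlation of the gallery orderings they induce. Kendall's tau below is an assumption for illustration, not necessarily the paper's exact measure:

```python
from scipy.stats import kendalltau

def ranking_similarity(scores_q1, scores_q2) -> float:
    """Kendall's tau between the gallery orderings induced by two
    paraphrased queries (1.0 = identical rankings). One plausible
    measure; the paper's exact metric may differ."""
    tau, _ = kendalltau(scores_q1, scores_q2)
    return tau

# Similarity scores of four gallery images under two paraphrases of the
# same query, e.g. from a dual encoder such as CLIP.
print(ranking_similarity([0.9, 0.2, 0.5, 0.1], [0.8, 0.3, 0.6, 0.2]))  # 1.0
```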
- Improving Video Corpus Moment Retrieval with Partial Relevance Enhancement [72.7576395034068]
Video Corpus Moment Retrieval (VCMR) is a new video retrieval task aimed at retrieving a relevant moment from a large corpus of untrimmed videos using a text query.
We argue that effectively capturing the partial relevance between the query and video is essential for the VCMR task.
For video retrieval, we introduce a multi-modal collaborative video retriever, generating different query representations for the two modalities.
For moment localization, we propose the focus-then-fuse moment localizer, utilizing modality-specific gates to capture essential content.
arXiv Detail & Related papers (2024-02-21T07:16:06Z)
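A minimal sketch of modality-specific gating in the spirit of the focus-then-fuse localizer, assuming sigmoid gates conditioned on the text query; the module name, shapes, and fusion rule are hypothetical, not the paper's architecture:

```python
import torch
import torch.nn as nn

class ModalityGatedFusion(nn.Module):
    """Each modality's frame features are scaled by a learned,
    query-conditioned gate before fusion. Illustrative only."""
    def __init__(self, dim: int):
        super().__init__()
        self.gate_vis = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())
        self.gate_sub = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, vis, sub, query):
        # vis, sub: (frames, dim) features; query: (dim,) text embedding.
        q = query.expand_as(vis)
        g_v = self.gate_vis(torch.cat([vis, q], dim=-1))  # focus on query-relevant content
        g_s = self.gate_sub(torch.cat([sub, q], dim=-1))
        return g_v * vis + g_s * sub                      # fuse the gated streams

fusion = ModalityGatedFusion(dim=256)
out = fusion(torch.randn(64, 256), torch.randn(64, 256), torch.randn(256))
print(out.shape)  # torch.Size([64, 256])
```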
- Video Referring Expression Comprehension via Transformer with Content-conditioned Query [68.06199031102526]
Video Referring Expression Comprehension (REC) aims to localize a target object in videos based on the queried natural language.
Recent improvements in video REC have been made using Transformer-based methods with learnable queries.
arXiv Detail & Related papers (2023-10-25T06:38:42Z)
- Zero-shot Audio Topic Reranking using Large Language Models [42.774019015099704]
Multimodal Video Search by Examples (MVSE) investigates using video clips as the query term for information retrieval.
This work aims to compensate for any performance loss from rapid archive search by examining reranking approaches.
Performance is evaluated for topic-based retrieval on a publicly available video archive, the BBC Rewind corpus.
arXiv Detail & Related papers (2023-09-14T11:13:36Z)
- Towards Video Anomaly Retrieval from Video Anomaly Detection: New Benchmarks and Model [70.97446870672069]
Video anomaly detection (VAD) has received increasing attention due to its potential applications.
Video Anomaly Retrieval (VAR) aims to pragmatically retrieve relevant anomalous videos across modalities.
We present two benchmarks, UCFCrime-AR and XD-Violence, constructed on top of prevalent anomaly datasets.
arXiv Detail & Related papers (2023-07-24T06:22:37Z)
- Fighting FIRe with FIRE: Assessing the Validity of Text-to-Video Retrieval Benchmarks [6.540440003084223]
Video captioning datasets have been re-purposed to evaluate text-to-video retrieval models.
Many alternate videos also match the caption, which introduces false-negative caption-video pairs.
We show that when these false negatives are corrected, a recent state-of-the-art model gains 25% recall points.
arXiv Detail & Related papers (2022-10-10T22:45:06Z)
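The effect of correcting false negatives shows up directly in Recall@K once multiple videos per caption are admitted as positives; the toy gallery below is illustrative, not the paper's data:

```python
def recall_at_k(ranked_ids, positive_ids, k: int) -> float:
    """Fraction of queries whose top-k contains at least one positive."""
    hits = sum(1 for ranked, pos in zip(ranked_ids, positive_ids)
               if set(ranked[:k]) & set(pos))
    return hits / len(ranked_ids)

# Two queries over a five-video gallery. Under the instance-based
# assumption each query has exactly one positive; after annotating
# false negatives, query 0 has two acceptable videos.
ranked = [[3, 1, 4, 0, 2], [2, 0, 1, 3, 4]]
single_pos = [[4], [1]]          # original single ground truth
corrected  = [[4, 3], [1]]       # false negative 3 marked relevant
print(recall_at_k(ranked, single_pos, k=1))  # 0.0
print(recall_at_k(ranked, corrected, k=1))   # 0.5 -- recall jumps
```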
- Temporal Alignment Prediction for Few-Shot Video Classification [17.18278071760926]
We propose Temporal Alignment Prediction (TAP) based on sequence similarity learning for few-shot video classification.
In order to obtain the similarity of a pair of videos, we predict the alignment scores between all pairs of temporal positions in the two videos.
We evaluate TAP on two video classification benchmarks: Kinetics and Something-Something V2.
arXiv Detail & Related papers (2021-07-26T05:12:27Z)
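A minimal sketch of sequence similarity from pairwise temporal alignment scores: build a (T_a, T_b) score matrix over all pairs of temporal positions and pool it into one video-pair similarity. TAP learns the alignment predictor; plain cosine similarity and best-match pooling stand in here:

```python
import numpy as np

def pairwise_alignment_similarity(feats_a, feats_b) -> float:
    """Frame-to-frame cosine similarities over all temporal position
    pairs, with each frame of A aligned to its best match in B."""
    a = feats_a / np.linalg.norm(feats_a, axis=1, keepdims=True)
    b = feats_b / np.linalg.norm(feats_b, axis=1, keepdims=True)
    scores = a @ b.T                          # (T_a, T_b) alignment scores
    return float(scores.max(axis=1).mean())   # best-match pooling

rng = np.random.default_rng(0)
video_a, video_b = rng.normal(size=(8, 128)), rng.normal(size=(12, 128))
print(f"similarity = {pairwise_alignment_similarity(video_a, video_b):.3f}")
```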
- Support-set bottlenecks for video-text representation learning [131.4161071785107]
The dominant paradigm for learning video-text representations -- noise contrastive learning -- is too strict.
We propose a novel method that alleviates this by leveraging a generative model to naturally push these related samples together.
Our proposed method outperforms others by a large margin on MSR-VTT, VATEX, ActivityNet, and MSVD for video-to-text and text-to-video retrieval.
arXiv Detail & Related papers (2020-10-06T15:38:54Z)
- Summarizing the performances of a background subtraction algorithm measured on several videos [9.440689053774898]
We present a theoretical approach to summarizing the performance of a background subtraction algorithm across multiple videos.
We also give formulas and an algorithm to compute the summarized performance.
We showcase our observations on CDnet 2014.
arXiv Detail & Related papers (2020-02-13T17:35:34Z)
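A classic subtlety such summarization must address is that averaging per-video F-scores and computing the F-score of pooled counts disagree. A minimal illustration with made-up confusion counts (not the paper's formulas):

```python
def f1(tp: int, fp: int, fn: int) -> float:
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

# Per-video (tp, fp, fn) counts for a background-subtraction method
# on three videos of very different sizes.
videos = [(900, 50, 50), (40, 30, 30), (5, 20, 20)]

mean_of_f1 = sum(f1(*v) for v in videos) / len(videos)
pooled = tuple(sum(c) for c in zip(*videos))   # aggregate counts first
f1_of_pool = f1(*pooled)

# The two summaries disagree: small videos dominate the mean of F1,
# large videos dominate the pooled F1 -- hence the need for a
# principled way to summarize performance over several videos.
print(f"mean of per-video F1: {mean_of_f1:.3f}")  # ≈ 0.573
print(f"F1 of pooled counts:  {f1_of_pool:.3f}")  # ≈ 0.904
```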