Partially Relevant Video Retrieval
- URL: http://arxiv.org/abs/2208.12510v1
- Date: Fri, 26 Aug 2022 09:07:16 GMT
- Title: Partially Relevant Video Retrieval
- Authors: Jianfeng Dong, Xianke Chen, Minsong Zhang, Xun Yang, Shujie Chen,
Xirong Li, Xun Wang
- Abstract summary: We propose a novel T2VR subtask termed Partially Relevant Video Retrieval (PRVR)
PRVR aims to retrieve partially relevant videos from a large collection of untrimmed videos.
We formulate PRVR as a multiple instance learning (MIL) problem, where a video is simultaneously viewed as a bag of video clips and a bag of video frames.
- Score: 39.747235541498135
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Current methods for text-to-video retrieval (T2VR) are trained and tested on
video-captioning oriented datasets such as MSVD, MSR-VTT and VATEX. A key
property of these datasets is that videos are assumed to be temporally
pre-trimmed with short duration, whilst the provided captions well describe the
gist of the video content. Consequently, for a given paired video and caption,
the video is supposed to be fully relevant to the caption. In reality, however,
as queries are not known a priori, pre-trimmed video clips may not contain
sufficient content to fully meet the query. This suggests a gap between the
literature and the real world. To fill the gap, we propose in this paper a
novel T2VR subtask termed Partially Relevant Video Retrieval (PRVR). An
untrimmed video is considered to be partially relevant w.r.t. a given textual
query if it contains a moment relevant to the query. PRVR aims to retrieve such
partially relevant videos from a large collection of untrimmed videos. PRVR
differs from single video moment retrieval and video corpus moment retrieval,
as the latter two are to retrieve moments rather than untrimmed videos. We
formulate PRVR as a multiple instance learning (MIL) problem, where a video is
simultaneously viewed as a bag of video clips and a bag of video frames. Clips
and frames represent video content at different time scales. We propose a
Multi-Scale Similarity Learning (MS-SL) network that jointly learns clip-scale
and frame-scale similarities for PRVR. Extensive experiments on three datasets
(TVR, ActivityNet Captions, and Charades-STA) demonstrate the viability of the
proposed method. We also show that our method can be used for improving video
corpus moment retrieval.
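To make the MIL formulation and the multi-scale similarity concrete, here is a minimal sketch in PyTorch. The embedding dimension, the fusion weight alpha, the max-pooling over clips, and the query-conditioned attention over frames are illustrative assumptions, not the exact MS-SL architecture described in the paper.
```python
import torch
import torch.nn.functional as F

def clip_scale_similarity(query, clip_feats):
    # query: (d,) sentence embedding; clip_feats: (num_clips, d).
    # MIL view: the video is a bag of clips, so score every clip against
    # the query and keep the best-matching one (max over the bag).
    q = F.normalize(query, dim=-1)
    c = F.normalize(clip_feats, dim=-1)
    return (c @ q).max()

def frame_scale_similarity(query, frame_feats):
    # frame_feats: (num_frames, d). Frames give a finer time scale; softly
    # attend to frames with the query, then score the pooled feature.
    q = F.normalize(query, dim=-1)
    f = F.normalize(frame_feats, dim=-1)
    attn = torch.softmax(f @ q, dim=0)              # query-conditioned weights
    pooled = F.normalize(attn @ frame_feats, dim=-1)
    return pooled @ q

def video_query_score(query, clip_feats, frame_feats, alpha=0.7):
    # Fuse the two time scales; alpha is a hypothetical balancing weight.
    return alpha * clip_scale_similarity(query, clip_feats) + \
           (1 - alpha) * frame_scale_similarity(query, frame_feats)

# Rank untrimmed videos by their fused score for a text query.
query = torch.randn(512)
videos = [(torch.randn(8, 512), torch.randn(64, 512)) for _ in range(3)]
scores = [float(video_query_score(query, c, f)) for c, f in videos]
ranking = sorted(range(len(videos)), key=lambda i: scores[i], reverse=True)
```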
Related papers
- EgoCVR: An Egocentric Benchmark for Fine-Grained Composed Video Retrieval [52.375143786641196]
EgoCVR is an evaluation benchmark for fine-grained Composed Video Retrieval.
EgoCVR consists of 2,295 queries that specifically focus on high-quality temporal video understanding.
arXiv Detail & Related papers (2024-07-23T17:19:23Z)
- Improving Video Corpus Moment Retrieval with Partial Relevance Enhancement [72.7576395034068]
Video Corpus Moment Retrieval (VCMR) is a new video retrieval task aimed at retrieving a relevant moment from a large corpus of untrimmed videos using a text query.
We argue that effectively capturing the partial relevance between the query and video is essential for the VCMR task.
For video retrieval, we introduce a multi-modal collaborative video retriever, generating different query representations for the two modalities.
For moment localization, we propose the focus-then-fuse moment localizer, utilizing modality-specific gates to capture essential content.
arXiv Detail & Related papers (2024-02-21T07:16:06Z)
- GMMFormer: Gaussian-Mixture-Model Based Transformer for Efficient Partially Relevant Video Retrieval [59.47258928867802]
Given a text query, partially relevant video retrieval (PRVR) seeks to find videos containing pertinent moments in a database.
This paper proposes GMMFormer, a Gaussian-Mixture-Model based Transformer which models clip representations implicitly.
Experiments on three large-scale video datasets demonstrate the superiority and efficiency of GMMFormer.
arXiv Detail & Related papers (2023-10-08T15:04:50Z)
- ICSVR: Investigating Compositional and Syntactic Understanding in Video Retrieval Models [6.073813559982129]
Video retrieval involves retrieving the ground-truth video from a video database given a text caption, or vice versa.
We evaluate the compositional and syntactic understanding of video retrieval models on standard benchmarks such as MSRVTT, MSVD and DIDEMO.
Our experiments reveal that actions and syntax play a minor role compared to objects & attributes in video understanding.
arXiv Detail & Related papers (2023-06-28T20:06:36Z)
- Hierarchical Video-Moment Retrieval and Step-Captioning [68.4859260853096]
HiREST consists of 3.4K text-video pairs from an instructional video dataset.
Our hierarchical benchmark consists of video retrieval, moment retrieval, and two novel moment segmentation and step captioning tasks.
arXiv Detail & Related papers (2023-03-29T02:33:54Z)
- Semantic Video Moments Retrieval at Scale: A New Task and a Baseline [6.997674465889922]
Semantic Video Moments Retrieval at scale (SVMR) aims at finding relevant videos and re-localizing the video clips in them.
To address these challenges, we propose a two-stage baseline solution: candidate video retrieval followed by a novel attention-based query-reference semantic alignment framework.
arXiv Detail & Related papers (2022-10-15T22:46:22Z)
- Video Corpus Moment Retrieval with Contrastive Learning [56.249924768243375]
Video corpus moment retrieval (VCMR) is to retrieve a temporal moment that semantically corresponds to a given text query.
We propose a Retrieval and Localization Network with Contrastive Learning (ReLoCLNet) for VCMR.
Experimental results show that, although ReLoCLNet encodes text and video separately for efficiency, its retrieval accuracy is comparable with baselines adopting cross-modal interaction learning.
arXiv Detail & Related papers (2021-05-13T12:54:39Z)