Modal-specific Pseudo Query Generation for Video Corpus Moment Retrieval
- URL: http://arxiv.org/abs/2210.12617v1
- Date: Sun, 23 Oct 2022 05:05:18 GMT
- Title: Modal-specific Pseudo Query Generation for Video Corpus Moment Retrieval
- Authors: Minjoon Jung, Seongho Choi, Joochan Kim, Jin-Hwa Kim, Byoung-Tak Zhang
- Abstract summary: Video corpus moment retrieval (VCMR) is the task of retrieving the most relevant video moment from a large video corpus using a natural language query.
We propose a self-supervised learning framework: Modal-specific Pseudo Query Generation Network (MPGN)
MPGN generates pseudo queries exploiting both visual and textual information from selected temporal moments.
We show that MPGN successfully learns to localize the video corpus moment without any explicit annotation.
- Score: 20.493241098064665
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Video corpus moment retrieval (VCMR) is the task of retrieving the most
relevant video moment from a large video corpus using a natural language query.
For narrative videos, e.g., dramas or movies, the holistic understanding of
temporal dynamics and multimodal reasoning is crucial. Previous works have
shown promising results; however, they relied on the expensive query
annotations for VCMR, i.e., the corresponding moment intervals. To overcome
this problem, we propose a self-supervised learning framework: Modal-specific
Pseudo Query Generation Network (MPGN). First, MPGN selects candidate temporal
moments via subtitle-based moment sampling. Then, it generates pseudo queries
exploiting both visual and textual information from the selected temporal
moments. Through the multimodal information in the pseudo queries, we show that
MPGN successfully learns to localize the video corpus moment without any
explicit annotation. We validate the effectiveness of MPGN on the TVR dataset,
showing competitive results compared with both supervised and unsupervised models.
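As a rough illustration of the pipeline the abstract describes, below is a minimal sketch in Python of subtitle-based moment sampling followed by modal-specific pseudo query generation. The data structures, the sliding-window sampling heuristic, and the visual_tagger callable are illustrative assumptions, not the paper's actual implementation.

```python
# A minimal sketch of the self-supervised pipeline described above.
# The data structures, sampling heuristic, and visual tagger are illustrative
# assumptions; they are not the paper's exact implementation.
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Subtitle:
    start: float  # seconds
    end: float
    text: str


@dataclass
class Moment:
    video_id: str
    start: float
    end: float


def sample_moments_from_subtitles(video_id: str,
                                  subtitles: List[Subtitle],
                                  window: int = 2) -> List[Moment]:
    """Subtitle-based moment sampling: each candidate moment spans `window`
    consecutive subtitles, so every candidate is grounded in dialogue."""
    return [Moment(video_id,
                   subtitles[i].start,
                   subtitles[i + window - 1].end)
            for i in range(max(len(subtitles) - window + 1, 0))]


def generate_pseudo_query(moment: Moment,
                          subtitles: List[Subtitle],
                          visual_tagger: Callable[[Moment], List[str]]) -> str:
    """Modal-specific pseudo query: textual cues come from the subtitles inside
    the moment; visual cues come from a frame-level tagger or captioner
    (a user-supplied callable standing in for the visual branch)."""
    textual = " ".join(s.text for s in subtitles
                       if s.start >= moment.start and s.end <= moment.end)
    visual = " ".join(visual_tagger(moment))
    return f"{visual} {textual}".strip()


def build_pseudo_training_pairs(video_id: str,
                                subtitles: List[Subtitle],
                                visual_tagger: Callable[[Moment], List[str]]
                                ) -> List[Tuple[str, Moment]]:
    """Pseudo (query, moment) pairs replace human annotations when training a
    VCMR model to retrieve the right video and localize the moment in it."""
    return [(generate_pseudo_query(m, subtitles, visual_tagger), m)
            for m in sample_moments_from_subtitles(video_id, subtitles)]
```

In this sketch the visual branch is abstracted into the user-supplied visual_tagger; the resulting pseudo (query, moment) pairs stand in for human annotations when training a VCMR model.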
Related papers
- MLLM as Video Narrator: Mitigating Modality Imbalance in Video Moment Retrieval [53.417646562344906]
Video Moment Retrieval (VMR) aims to localize a specific temporal segment within an untrimmed long video given a natural language query.
Existing methods often suffer from inadequate training annotations, i.e., the sentence typically matches only a fraction of the prominent foreground video content and has limited wording diversity.
This intrinsic modality imbalance leaves a considerable portion of the visual information unaligned with text.
In this work, we take an MLLM as a video narrator to generate plausible textual descriptions of the video, thereby mitigating the modality imbalance and boosting the temporal localization.
arXiv Detail & Related papers (2024-06-25T18:39:43Z)
- Improving Video Corpus Moment Retrieval with Partial Relevance Enhancement [72.7576395034068]
Video Corpus Moment Retrieval (VCMR) is a new video retrieval task aimed at retrieving a relevant moment from a large corpus of untrimmed videos using a text query.
We argue that effectively capturing the partial relevance between the query and video is essential for the VCMR task.
For video retrieval, we introduce a multi-modal collaborative video retriever, generating different query representations for the two modalities.
For moment localization, we propose the focus-then-fuse moment localizer, utilizing modality-specific gates to capture essential content.
arXiv Detail & Related papers (2024-02-21T07:16:06Z)
- Multi-video Moment Ranking with Multimodal Clue [69.81533127815884]
State-of-the-art work for VCMR is based on a two-stage method.
The proposed MINUTE model outperforms the baselines on the TVR and DiDeMo datasets.
arXiv Detail & Related papers (2023-01-29T18:38:13Z)
- Relation-aware Video Reading Comprehension for Temporal Language Grounding [67.5613853693704]
Temporal language grounding in videos aims to localize the temporal span relevant to the given query sentence.
This paper formulates temporal language grounding as video reading comprehension and proposes a Relation-aware Network (RaNet) to address it.
arXiv Detail & Related papers (2021-10-12T03:10:21Z)
- QVHighlights: Detecting Moments and Highlights in Videos via Natural Language Queries [89.24431389933703]
We present the Query-based Video Highlights (QVHighlights) dataset.
It consists of over 10,000 YouTube videos, covering a wide range of topics.
Each video in the dataset is annotated with: (1) a human-written free-form NL query, (2) relevant moments in the video w.r.t. the query, and (3) five-point scale saliency scores for all query-relevant clips (a rough sketch of such an annotation record appears after this list).
arXiv Detail & Related papers (2021-07-20T16:42:58Z)
- Video Corpus Moment Retrieval with Contrastive Learning [56.249924768243375]
Video corpus moment retrieval (VCMR) aims to retrieve a temporal moment that semantically corresponds to a given text query.
We propose a Retrieval and Localization Network with Contrastive Learning (ReLoCLNet) for VCMR.
ReLoCLNet encodes text and video separately for efficiency; experimental results show that its retrieval accuracy is comparable with baselines that adopt cross-modal interaction learning.
arXiv Detail & Related papers (2021-05-13T12:54:39Z)
- A Hierarchical Multi-Modal Encoder for Moment Localization in Video Corpus [31.387948069111893]
We show how to identify a short segment in a long video that semantically matches a text query.
To tackle this problem, we propose the HierArchical Multi-Modal EncodeR (HAMMER) that encodes a video at both the coarse-grained clip level and the fine-grained frame level.
We conduct extensive experiments to evaluate our model on moment localization in a video corpus on the ActivityNet Captions and TVR datasets.
arXiv Detail & Related papers (2020-11-18T02:42:36Z)
- Frame-wise Cross-modal Matching for Video Moment Retrieval [32.68921139236391]
Video moment retrieval aims to retrieve a moment in a video for a given language query.
The challenges of this task include 1) the requirement of localizing the relevant moment in an untrimmed video, and 2) bridging the semantic gap between textual query and video contents.
We propose an Attentive Cross-modal Relevance Matching model which predicts the temporal boundaries based on interaction modeling.
arXiv Detail & Related papers (2020-09-22T10:25:41Z)
- Text-based Localization of Moments in a Video Corpus [38.393877654679414]
We address the task of temporal localization of moments in a corpus of videos for a given sentence query.
We propose Hierarchical Moment Alignment Network (HMAN) which learns an effective joint embedding space for moments and sentences.
In addition to learning subtle differences between intra-video moments, HMAN focuses on distinguishing inter-video global semantic concepts based on sentence queries.
arXiv Detail & Related papers (2020-08-20T00:05:45Z)
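Following up on the QVHighlights entry above, the sketch below shows what a single annotation record with a free-form query, relevant moments, and clip-level saliency scores might look like. Field names, clip granularity, and example values are assumptions for illustration, not the dataset's actual schema.

```python
# A minimal sketch of one QVHighlights-style annotation record, based only on the
# description in the entry above; field names, clip granularity, and example values
# are illustrative assumptions, not the dataset's actual schema.
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class QVHighlightsExample:
    video_id: str                                # YouTube video identifier
    query: str                                   # human-written free-form NL query
    relevant_moments: List[Tuple[float, float]]  # (start, end) spans in seconds
    saliency_scores: List[int]                   # five-point scores, one per query-relevant clip


example = QVHighlightsExample(
    video_id="yt_abc123",                        # hypothetical id
    query="A chef plates a dessert in a busy kitchen.",
    relevant_moments=[(12.0, 28.0)],
    saliency_scores=[3, 4, 5, 4],
)
```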