CONQUER: Contextual Query-aware Ranking for Video Corpus Moment
Retrieval
- URL: http://arxiv.org/abs/2109.10016v1
- Date: Tue, 21 Sep 2021 08:07:27 GMT
- Title: CONQUER: Contextual Query-aware Ranking for Video Corpus Moment
Retrieval
- Authors: Zhijian Hou, Chong-Wah Ngo, Wing Kwong Chan
- Abstract summary: Video retrieval applications should enable users to retrieve a precise moment from a large video corpus.
We propose a novel model for effective moment localization and ranking.
We conduct studies on two datasets, TVR for closed-world TV episodes and DiDeMo for open-world user-generated videos.
- Score: 24.649068267308913
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: This paper tackles a recently proposed Video Corpus Moment Retrieval task.
This task is essential because advanced video retrieval applications should
enable users to retrieve a precise moment from a large video corpus. We propose
a novel CONtextual QUery-awarE Ranking (CONQUER) model for effective moment
localization and ranking. CONQUER explores query context for multi-modal fusion
and representation learning in two different steps. The first step derives
fusion weights for the adaptive combination of multi-modal video content. The
second step performs bi-directional attention to tightly couple video and query
as a single joint representation for moment localization. As query context is
fully engaged in video representation learning, from feature fusion to
transformation, the resulting feature is user-centered and has a larger
capacity for capturing multi-modal signals specific to the query. We conduct studies
on two datasets, TVR for closed-world TV episodes and DiDeMo for open-world
user-generated videos, to investigate the potential advantages of fusing video
and query online as a joint representation for moment retrieval.
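The two steps described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the dot-product scoring, the BiDAF-style attention layout, and the toy feature values are all assumptions made for illustration, and the abstract does not specify the actual layers used by CONQUER.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def fusion_weights(query_vec, modality_keys):
    """Step 1 (assumed form): one fusion weight per modality, derived from the query."""
    return softmax([dot(query_vec, key) for key in modality_keys])


def fuse_clips(visual, subtitle, weights):
    """Combine per-clip visual and subtitle features with the query-dependent weights."""
    w_v, w_s = weights
    return [[w_v * v + w_s * s for v, s in zip(vc, sc)]
            for vc, sc in zip(visual, subtitle)]


def bidirectional_attention(clips, words):
    """Step 2 (assumed BiDAF-style form): couple clips and query words.

    Assumes clip and word features share the same dimensionality.
    """
    sim = [[dot(c, w) for w in words] for c in clips]
    dim = len(words[0])
    # Clip-to-word: each clip attends over the query words.
    c2w = []
    for row in sim:
        a = softmax(row)
        c2w.append([sum(a[j] * words[j][d] for j in range(len(words)))
                    for d in range(dim)])
    # Word-to-clip: attend over clips using each clip's best-matching word score.
    b = softmax([max(row) for row in sim])
    w2c = [sum(b[i] * clips[i][d] for i in range(len(clips))) for d in range(dim)]
    # Joint per-clip feature: [clip; query-aware; clip*query-aware; clip*video-aware].
    return [c + q + [x * y for x, y in zip(c, q)] + [x * y for x, y in zip(c, w2c)]
            for c, q in zip(clips, c2w)]


# Toy inputs (hypothetical values): 3 clips, 2 query words, feature size 2.
query_vec = [1.0, 0.0]
modality_keys = [[2.0, 0.0], [0.0, 2.0]]  # hypothetical keys for visual, subtitle
weights = fusion_weights(query_vec, modality_keys)
visual = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
subtitle = [[0.0, 1.0], [1.0, 0.0], [0.5, 0.5]]
clips = fuse_clips(visual, subtitle, weights)
words = [[1.0, 0.0], [0.0, 1.0]]
joint = bidirectional_attention(clips, words)
```

In this toy setup the query vector aligns with the visual key, so the visual modality receives the larger fusion weight, and each clip ends up with a joint feature four times the input dimension that a moment localizer could then score.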
Related papers
- Improving Video Corpus Moment Retrieval with Partial Relevance Enhancement [72.7576395034068]
Video Corpus Moment Retrieval (VCMR) is a new video retrieval task aimed at retrieving a relevant moment from a large corpus of untrimmed videos using a text query.
We argue that effectively capturing the partial relevance between the query and video is essential for the VCMR task.
For video retrieval, we introduce a multi-modal collaborative video retriever, generating different query representations for the two modalities.
For moment localization, we propose the focus-then-fuse moment localizer, utilizing modality-specific gates to capture essential content.
arXiv Detail & Related papers (2024-02-21T07:16:06Z)
- Hierarchical Video-Moment Retrieval and Step-Captioning [68.4859260853096]
HiREST consists of 3.4K text-video pairs from an instructional video dataset.
Our hierarchical benchmark consists of video retrieval, moment retrieval, and two novel moment segmentation and step captioning tasks.
arXiv Detail & Related papers (2023-03-29T02:33:54Z)
- Multi-video Moment Ranking with Multimodal Clue [69.81533127815884]
State-of-the-art work for VCMR is based on a two-stage method.
MINUTE outperforms the baselines on TVR and DiDeMo datasets.
arXiv Detail & Related papers (2023-01-29T18:38:13Z)
- DeepQAMVS: Query-Aware Hierarchical Pointer Networks for Multi-Video Summarization [127.16984421969529]
We introduce a novel Query-Aware Hierarchical Pointer Network for Multi-Video Summarization, termed DeepQAMVS.
DeepQAMVS is trained with reinforcement learning, incorporating rewards that capture representativeness, diversity, query-adaptability and temporal coherence.
We achieve state-of-the-art results on the MVS1K dataset, with inference time scaling linearly with the number of input video frames.
arXiv Detail & Related papers (2021-05-13T17:33:26Z)
- Video Corpus Moment Retrieval with Contrastive Learning [56.249924768243375]
Video corpus moment retrieval (VCMR) aims to retrieve a temporal moment that semantically corresponds to a given text query.
We propose a Retrieval and Localization Network with Contrastive Learning (ReLoCLNet) for VCMR.
Experimental results show that although ReLoCLNet encodes text and video separately for efficiency, its retrieval accuracy is comparable with baselines adopting cross-modal interaction learning.
arXiv Detail & Related papers (2021-05-13T12:54:39Z)
- Frame-wise Cross-modal Matching for Video Moment Retrieval [32.68921139236391]
Video moment retrieval aims to retrieve a moment in a video for a given language query.
The challenges of this task include 1) the requirement of localizing the relevant moment in an untrimmed video, and 2) bridging the semantic gap between textual query and video contents.
We propose an Attentive Cross-modal Relevance Matching model which predicts the temporal boundaries based on interaction modeling.
arXiv Detail & Related papers (2020-09-22T10:25:41Z)
- Fine-grained Iterative Attention Network for Temporal Language Localization in Videos [63.94898634140878]
Temporal language localization in videos aims to ground one video segment in an untrimmed video based on a given sentence query.
We propose a Fine-grained Iterative Attention Network (FIAN) that consists of an iterative attention module for bilateral query-video information extraction.
We evaluate the proposed method on three challenging public benchmarks: ActivityNet Captions, TACoS, and Charades-STA.
arXiv Detail & Related papers (2020-08-06T04:09:03Z)