Frame-wise Cross-modal Matching for Video Moment Retrieval
- URL: http://arxiv.org/abs/2009.10434v2
- Date: Thu, 22 Jul 2021 07:32:20 GMT
- Title: Frame-wise Cross-modal Matching for Video Moment Retrieval
- Authors: Haoyu Tang, Jihua Zhu, Meng Liu, Zan Gao, and Zhiyong Cheng
- Abstract summary: Video moment retrieval aims to retrieve a moment in a video for a given language query.
The challenges of this task include 1) the requirement of localizing the relevant moment in an untrimmed video, and 2) bridging the semantic gap between textual query and video contents.
We propose an Attentive Cross-modal Relevance Matching model which predicts the temporal boundaries based on cross-modal interaction modeling.
- Score: 32.68921139236391
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video moment retrieval aims to retrieve a moment in a video for a given
language query. The challenges of this task include 1) the requirement of
localizing the relevant moment in an untrimmed video, and 2) bridging the
semantic gap between the textual query and the video content. To tackle these
problems, early approaches first adopt sliding windows or uniform sampling to
collect candidate video clips and then match each clip with the query. These
strategies are time-consuming and often yield unsatisfactory localization
accuracy because the length of the target moment is unpredictable. To avoid
these limitations, researchers have recently attempted to predict the relevant
moment boundaries directly, without generating video clips first. One
mainstream approach is to build a multimodal feature vector from the target
query and the video frames (e.g., by concatenation) and then apply a regression
model to this vector for boundary detection. Although this approach has made
some progress, we argue that such methods do not adequately capture the
cross-modal interactions between the query and the video frames.
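To make the regression-style baseline concrete, here is a minimal sketch (not the authors' implementation; all dimensions, module names, and the pooling choices are assumptions): a pooled query vector is concatenated with each frame feature, and a regressor predicts normalized start/end boundaries.

```python
# Minimal sketch of the concatenation-plus-regression baseline described above.
# Not the paper's code; dimensions and module names are illustrative assumptions.
import torch
import torch.nn as nn

class ConcatRegressionBaseline(nn.Module):
    def __init__(self, frame_dim=500, query_dim=300, hidden=256):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Linear(frame_dim + query_dim, hidden), nn.ReLU())
        self.regressor = nn.Linear(hidden, 2)  # (start, end), normalized to [0, 1]

    def forward(self, frames, query_words):
        # frames: (B, T, frame_dim); query_words: (B, L, query_dim)
        query = query_words.mean(dim=1)                        # crude sentence pooling
        query = query.unsqueeze(1).expand(-1, frames.size(1), -1)
        fused = self.fuse(torch.cat([frames, query], dim=-1))  # (B, T, hidden)
        pooled = fused.mean(dim=1)                             # video-level vector
        return self.regressor(pooled).sigmoid()                # (B, 2) boundaries

# Example call: 16 sampled frames, a 12-word query.
frames = torch.randn(1, 16, 500)
query = torch.randn(1, 12, 300)
print(ConcatRegressionBaseline()(frames, query).shape)  # torch.Size([1, 2])
```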
In this paper, we propose an Attentive Cross-modal Relevance Matching (ACRM)
model which predicts the temporal boundaries based on interaction modeling
between the query and the video frames. In addition, an attention module is
introduced to assign higher weights to query words with richer semantic cues,
which are considered more important for finding the relevant video content.
Another contribution is an additional predictor that utilizes the internal
frames during model training to improve localization accuracy. Extensive
experiments on two datasets, TACoS and Charades-STA, demonstrate the
superiority of our method over several state-of-the-art methods. Ablation
studies have also been conducted to examine the effectiveness of the different
modules in our ACRM model.
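The following is a rough illustration, under assumed feature dimensions, of two ingredients highlighted in the abstract: attention weights over query words and a frame-wise relevance score derived from query-frame interaction. It sketches the idea only and is not the released ACRM implementation.

```python
# Rough sketch (not the released ACRM code) of word-level attention over the
# query plus a frame-wise cross-modal relevance score. Dimensions are assumptions.
import torch
import torch.nn as nn

class FrameWiseRelevanceMatcher(nn.Module):
    def __init__(self, frame_dim=500, word_dim=300, hidden=256):
        super().__init__()
        self.frame_proj = nn.Linear(frame_dim, hidden)
        self.word_proj = nn.Linear(word_dim, hidden)
        self.word_attn = nn.Linear(hidden, 1)   # scores each query word

    def forward(self, frames, words):
        # frames: (B, T, frame_dim); words: (B, L, word_dim)
        f = self.frame_proj(frames)                          # (B, T, H)
        w = self.word_proj(words)                            # (B, L, H)
        attn = self.word_attn(torch.tanh(w)).softmax(dim=1)  # (B, L, 1) word weights
        query = (attn * w).sum(dim=1, keepdim=True)          # (B, 1, H) attended query
        relevance = torch.cosine_similarity(f, query, dim=-1)  # (B, T) per-frame score
        return relevance

frames = torch.randn(2, 16, 500)
words = torch.randn(2, 10, 300)
print(FrameWiseRelevanceMatcher()(frames, words).shape)  # torch.Size([2, 16])
```

A boundary predictor can then operate on these per-frame relevance scores instead of on a single pooled multimodal vector, which is the kind of interaction modeling the abstract argues for.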
Related papers
- Improving Video Corpus Moment Retrieval with Partial Relevance Enhancement [72.7576395034068]
Video Corpus Moment Retrieval (VCMR) is a new video retrieval task aimed at retrieving a relevant moment from a large corpus of untrimmed videos using a text query.
We argue that effectively capturing the partial relevance between the query and video is essential for the VCMR task.
For video retrieval, we introduce a multi-modal collaborative video retriever, generating different query representations for the two modalities.
For moment localization, we propose the focus-then-fuse moment localizer, utilizing modality-specific gates to capture essential content.
arXiv Detail & Related papers (2024-02-21T07:16:06Z)
- Towards Video Anomaly Retrieval from Video Anomaly Detection: New Benchmarks and Model [70.97446870672069]
Video anomaly detection (VAD) has received increasing attention due to its potential applications.
Video Anomaly Retrieval (VAR) aims to pragmatically retrieve relevant anomalous videos across modalities.
We present two benchmarks, UCFCrime-AR and XD-Violence, constructed on top of prevalent anomaly datasets.
arXiv Detail & Related papers (2023-07-24T06:22:37Z)
- Transform-Equivariant Consistency Learning for Temporal Sentence Grounding [66.10949751429781]
We introduce a novel Equivariant Consistency Regulation Learning framework to learn more discriminative representations for each video.
Our motivation is that the temporal boundary of the query-guided activity should be predicted consistently.
In particular, we devise a self-supervised consistency loss module to enhance the completeness and smoothness of the augmented video.
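As a generic illustration of this consistency idea (not the paper's exact loss; the cropping scheme and the `predictor` interface are assumptions), one can run the same predictor on a video and on a temporally cropped copy, map the cropped prediction back to the original timeline, and penalize any disagreement:

```python
# Generic illustration of a boundary-consistency loss; not the paper's formulation.
import torch

def consistency_loss(predictor, frames, query, crop_start=0.1, crop_end=0.9):
    # frames: (B, T, D); query: (B, L, Dq); predictor returns (B, 2) boundaries in [0, 1]
    T = frames.size(1)
    lo, hi = int(crop_start * T), int(crop_end * T)
    pred_full = predictor(frames, query)            # boundaries on the full video
    pred_crop = predictor(frames[:, lo:hi], query)  # boundaries on the cropped copy
    # Map the cropped prediction back onto the full-video timeline.
    pred_mapped = pred_crop * (hi - lo) / T + lo / T
    return torch.abs(pred_full - pred_mapped).mean()  # L1 disagreement

# Stand-in predictor (ignores its inputs) just to demonstrate the call signature.
toy_predictor = lambda f, q: torch.sigmoid(torch.randn(f.size(0), 2))
print(consistency_loss(toy_predictor, torch.randn(2, 32, 512), torch.randn(2, 8, 300)))
```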
arXiv Detail & Related papers (2023-05-06T19:29:28Z)
- MIST: Multi-modal Iterative Spatial-Temporal Transformer for Long-form Video Question Answering [73.61182342844639]
We introduce a new model named Multi-modal Iterative Spatial-temporal Transformer (MIST) to better adapt pre-trained models for long-form VideoQA.
MIST decomposes traditional dense spatial-temporal self-attention into cascaded segment and region selection modules.
Visual concepts at different granularities are then processed efficiently through an attention module.
arXiv Detail & Related papers (2022-12-19T15:05:40Z)
- CONQUER: Contextual Query-aware Ranking for Video Corpus Moment Retrieval [24.649068267308913]
Video retrieval applications should enable users to retrieve a precise moment from a large video corpus.
We propose a novel model for effective moment localization and ranking.
We conduct studies on two datasets, TVR for closed-world TV episodes and DiDeMo for open-world user-generated videos.
arXiv Detail & Related papers (2021-09-21T08:07:27Z)
- Deconfounded Video Moment Retrieval with Causal Intervention [80.90604360072831]
We tackle the task of video moment retrieval (VMR), which aims to localize a specific moment in a video according to a textual query.
Existing methods primarily model the matching relationship between query and moment by complex cross-modal interactions.
We propose a causality-inspired VMR framework that builds a structural causal model to capture the true effect of query and video content on the prediction.
arXiv Detail & Related papers (2021-06-03T01:33:26Z)
- A Simple Yet Effective Method for Video Temporal Grounding with Cross-Modality Attention [31.218804432716702]
The task of language-guided video temporal grounding is to localize the particular video clip corresponding to a query sentence in an untrimmed video.
We propose a simple two-branch Cross-Modality Attention (CMA) module with an intuitive structural design.
In addition, we introduce a new task-specific regression loss function, which improves the temporal grounding accuracy by alleviating the impact of annotation bias.
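A minimal sketch of what a two-branch cross-modality attention block can look like, using standard multi-head attention in both directions; this is a generic design and is not claimed to match the CMA module's exact architecture:

```python
# Generic two-branch cross-modality attention sketch; not the CMA paper's code.
# One branch lets video frames attend to query words, the other lets words attend to frames.
import torch
import torch.nn as nn

class TwoBranchCrossAttention(nn.Module):
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.v2q = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.q2v = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, video, query):
        # video: (B, T, dim); query: (B, L, dim), both already projected to dim
        video_ctx, _ = self.v2q(video, query, query)   # frames attend to words
        query_ctx, _ = self.q2v(query, video, video)   # words attend to frames
        return video + video_ctx, query + query_ctx    # residual connections

block = TwoBranchCrossAttention()
v_out, q_out = block(torch.randn(2, 16, 256), torch.randn(2, 10, 256))
print(v_out.shape, q_out.shape)  # torch.Size([2, 16, 256]) torch.Size([2, 10, 256])
```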
arXiv Detail & Related papers (2020-09-23T16:03:00Z)
- Dense-Caption Matching and Frame-Selection Gating for Temporal Localization in VideoQA [96.10612095576333]
We propose a video question answering model which effectively integrates multi-modal input sources and finds the temporally relevant information to answer questions.
Our model also comprises dual-level attention (word/object and frame level), multi-head self/cross-integration for different sources (video and dense captions), and gates that pass on the more relevant information.
We evaluate our model on the challenging TVQA dataset, where each of our model components provides significant gains, and our overall model outperforms the state-of-the-art by a large margin.
arXiv Detail & Related papers (2020-05-13T16:35:27Z)