Overcoming Weak Visual-Textual Alignment for Video Moment Retrieval
- URL: http://arxiv.org/abs/2306.02728v2
- Date: Mon, 20 Nov 2023 02:22:58 GMT
- Title: Overcoming Weak Visual-Textual Alignment for Video Moment Retrieval
- Authors: Minjoon Jung, Youwon Jang, Seongho Choi, Joochan Kim, Jin-Hwa Kim,
Byoung-Tak Zhang
- Abstract summary: Video moment retrieval (VMR) identifies a specific moment in an untrimmed video for a given natural language query.
This task is prone to suffer from the weak visual-textual alignment problem inherent in video datasets.
We propose a background-aware moment detection transformer (BM-DETR)
Our model learns to predict the target moment from the joint probability of each frame given the positive query and the complement of negative queries.
- Score: 20.254815143604777
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video moment retrieval (VMR) identifies a specific moment in an untrimmed
video for a given natural language query. This task is prone to suffer from the
weak visual-textual alignment problem inherent in video datasets: because of this
ambiguity, a query may not fully cover the relevant details of the corresponding
moment, or the moment may contain misaligned and irrelevant frames, potentially
limiting further performance gains. To tackle this problem,
we propose a background-aware moment detection transformer (BM-DETR). Our model
adopts a contrastive approach, carefully utilizing the negative queries matched
to other moments in the video. Specifically, our model learns to predict the
target moment from the joint probability of each frame given the positive query
and the complement of negative queries. This leads to effective use of the
surrounding background, improving moment sensitivity and enhancing overall
alignments in videos. Extensive experiments on four benchmarks demonstrate the
effectiveness of our approach.
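To make the frame-level objective above concrete, here is a minimal sketch of how the joint probability could be computed. It is an illustration under assumed conventions (sigmoid-parameterized frame-query relevance, one positive query and N negative queries per video), not the authors' released BM-DETR implementation:

```python
import torch

def frame_joint_probability(pos_logits: torch.Tensor,
                            neg_logits: torch.Tensor) -> torch.Tensor:
    """Per-frame joint probability given the positive query and the
    complement of the negative queries (hypothetical parameterization).

    pos_logits: (T,)    frame-level relevance logits for the positive query
    neg_logits: (N, T)  logits for N negative queries matched to other
                        moments in the same video
    """
    p_pos = torch.sigmoid(pos_logits)        # P(frame relevant | positive query)
    p_neg = torch.sigmoid(neg_logits)        # P(frame relevant | each negative)
    complement = (1.0 - p_neg).prod(dim=0)   # frame unlikely under all negatives
    return p_pos * complement                # (T,) joint probability per frame

# Toy usage: 8 frames, 2 negative queries sampled from the same video.
scores = frame_joint_probability(torch.randn(8), torch.randn(2, 8))
print(scores)  # frames inside the target moment should score highest
```

Multiplying by the complement of the negatives is what lets the surrounding background contribute: frames that also match some other moment's query are suppressed, sharpening moment sensitivity.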
Related papers
- Zero-Shot Video Moment Retrieval from Frozen Vision-Language Models [58.17315970207874]
We propose a zero-shot method for adapting generalisable visual-textual priors from arbitrary VLMs to facilitate moment-text alignment (see the sketch after this list).
Experiments conducted on three VMR benchmark datasets demonstrate the notable performance advantages of our zero-shot algorithm.
arXiv Detail & Related papers (2023-09-01T13:06:50Z) - Is it Really Negative? Evaluating Natural Language Video Localization Performance on Multiple Reliable Videos Pool [24.858928681280634]
Video Corpus Moment Retrieval (VCMR) aims to detect a video moment that matches a given natural language query from multiple videos.
Existing VCMR studies have regarded all videos not paired with a specific query as negatives.
We propose an MVMR task that aims to localize video frames within a massive video set.
arXiv Detail & Related papers (2023-08-15T17:38:55Z) - Towards Video Anomaly Retrieval from Video Anomaly Detection: New Benchmarks and Model [70.97446870672069]
Video anomaly detection (VAD) has been paid increasing attention due to its potential applications.
Video Anomaly Retrieval (VAR) aims to pragmatically retrieve relevant anomalous videos via cross-modal queries.
We present two benchmarks, UCFCrime-AR and XDViolence-AR, constructed on top of prevalent anomaly datasets.
arXiv Detail & Related papers (2023-07-24T06:22:37Z) - Query-Dependent Video Representation for Moment Retrieval and Highlight Detection [8.74967598360817]
The key objective of MR/HD is to localize the moment and estimate the clip-wise accordance level, i.e., the saliency score, with respect to a given text query.
Recent transformer-based models do not fully exploit the information of a given query.
We introduce Query-Dependent DETR (QD-DETR), a detection transformer tailored for MR/HD.
arXiv Detail & Related papers (2023-03-24T09:32:50Z) - Video Moment Retrieval from Text Queries via Single Frame Annotation [65.92224946075693]
Video moment retrieval aims at finding the start and end timestamps of a moment described by a given natural language query.
Fully supervised methods need complete temporal boundary annotations to achieve promising results.
We propose a new paradigm called "glance annotation", which requires only a single annotated frame within each moment.
arXiv Detail & Related papers (2022-04-20T11:59:17Z) - QVHighlights: Detecting Moments and Highlights in Videos via Natural Language Queries [89.24431389933703]
We present the Query-based Video Highlights (QVHighlights) dataset.
It consists of over 10,000 YouTube videos, covering a wide range of topics.
Each video in the dataset is annotated with: (1) a human-written free-form NL query, (2) relevant moments in the video w.r.t. the query, and (3) five-point scale saliency scores for all query-relevant clips.
arXiv Detail & Related papers (2021-07-20T16:42:58Z) - Deconfounded Video Moment Retrieval with Causal Intervention [80.90604360072831]
We tackle the task of video moment retrieval (VMR), which aims to localize a specific moment in a video according to a textual query.
Existing methods primarily model the matching relationship between query and moment by complex cross-modal interactions.
We propose a causality-inspired VMR framework that builds a structural causal model to capture the true effect of query and video content on the prediction.
arXiv Detail & Related papers (2021-06-03T01:33:26Z) - Composable Augmentation Encoding for Video Representation Learning [94.2358972764708]
We focus on contrastive methods for self-supervised video representation learning.
A common paradigm in contrastive learning is to construct positive pairs by sampling different data views for the same instance, with different data instances as negatives.
We propose an 'augmentation aware' contrastive learning framework, where we explicitly provide a sequence of augmentation parameterisations.
We show that our method encodes valuable information about the specified spatial or temporal augmentations and, in doing so, also achieves state-of-the-art performance on a number of video benchmarks.
arXiv Detail & Related papers (2021-04-01T16:48:53Z) - Uncovering Hidden Challenges in Query-Based Video Moment Retrieval [29.90001703587512]
We present a series of experiments assessing how well the benchmark results reflect the true progress in solving the moment retrieval task.
Our results indicate substantial biases in the popular datasets and unexpected behaviour of the state-of-the-art models.
We suggest possible future directions for improving temporal sentence grounding.
arXiv Detail & Related papers (2020-09-01T10:07:23Z) - Text-based Localization of Moments in a Video Corpus [38.393877654679414]
We address the task of temporal localization of moments in a corpus of videos for a given sentence query.
We propose Hierarchical Moment Alignment Network (HMAN) which learns an effective joint embedding space for moments and sentences.
In addition to learning subtle differences between intra-video moments, HMAN focuses on distinguishing inter-video global semantic concepts based on sentence queries.
arXiv Detail & Related papers (2020-08-20T00:05:45Z)
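As referenced in the zero-shot entry above, here is a loose illustration of reusing frozen visual-textual priors for moment-text alignment. It is not the cited paper's method: it simply scores sampled frames against the query with an off-the-shelf CLIP checkpoint (via Hugging Face transformers, an assumed dependency) and slides fixed-size windows over the similarities; the checkpoint name and window sizes are arbitrary choices:

```python
import torch
from transformers import CLIPModel, CLIPProcessor  # assumed dependency

def retrieve_moment(frames, query, window_sizes=(2, 4, 8)):
    """Zero-shot VMR baseline: score frames with a frozen VLM, then return
    the contiguous window with the highest mean frame-query similarity.
    `frames` is a list of PIL images uniformly sampled from the video."""
    name = "openai/clip-vit-base-patch32"   # arbitrary frozen checkpoint
    model = CLIPModel.from_pretrained(name)
    proc = CLIPProcessor.from_pretrained(name)
    inputs = proc(text=[query], images=frames, return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    sims = out.logits_per_image.squeeze(1)  # (T,) frame-query similarity logits
    best = None
    for w in window_sizes:
        for start in range(max(len(frames) - w + 1, 0)):
            score = sims[start:start + w].mean().item()
            if best is None or score > best[0]:
                best = (score, start, start + w)
    return best  # (score, start_frame_idx, end_frame_idx)
```

Mapping the winning frame window back to timestamps then only requires the frame sampling rate.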