Uncovering Hidden Challenges in Query-Based Video Moment Retrieval
- URL: http://arxiv.org/abs/2009.00325v2
- Date: Wed, 7 Oct 2020 10:15:13 GMT
- Title: Uncovering Hidden Challenges in Query-Based Video Moment Retrieval
- Authors: Mayu Otani, Yuta Nakashima, Esa Rahtu, Janne Heikkilä
- Abstract summary: We present a series of experiments assessing how well the benchmark results reflect the true progress in solving the moment retrieval task.
Our results indicate substantial biases in the popular datasets and unexpected behaviour of the state-of-the-art models.
We suggest possible directions to improve the temporal sentence grounding in the future.
- Score: 29.90001703587512
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Query-based moment retrieval is the problem of localising a specific clip
from an untrimmed video according to a query sentence. This is a challenging task
that requires interpretation of both the natural language query and the video
content. As in many other areas of computer vision and machine learning,
progress in query-based moment retrieval is heavily driven by the benchmark
datasets and, therefore, their quality has a significant impact on the field. In
this paper, we present a series of experiments assessing how well the benchmark
results reflect the true progress in solving the moment retrieval task. Our
results indicate substantial biases in the popular datasets and unexpected
behaviour of the state-of-the-art models. Moreover, we present new sanity check
experiments and approaches for visualising the results. Finally, we suggest
possible directions to improve the temporal sentence grounding in the future.
Our code for this paper is available at
https://mayu-ot.github.io/hidden-challenges-MR .
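The benchmark issues the paper probes are easiest to see against the standard evaluation protocol. Moment retrieval is usually scored with recall at rank n under a temporal IoU threshold (R@n, IoU≥m), and one common sanity check is a query-blind baseline that ignores the sentence entirely: if such a baseline scores well, the benchmark is rewarding temporal annotation bias rather than grounding. The sketch below is illustrative only and is not the authors' released code; the 0.5 IoU threshold and the fixed relative segment in prior_only_baseline are assumptions chosen for the example.

```python
# Illustrative sketch (not the authors' code): standard R@1, IoU>=m scoring for
# moment retrieval, plus a query-blind "prior only" baseline. If the blind
# baseline scores close to learned models, the benchmark likely rewards
# temporal annotation bias rather than language grounding.

from typing import List, Tuple

Moment = Tuple[float, float]  # (start_sec, end_sec)


def temporal_iou(pred: Moment, gt: Moment) -> float:
    """Intersection-over-union of two temporal segments."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0


def recall_at_1(preds: List[Moment], gts: List[Moment], iou_thresh: float = 0.5) -> float:
    """Fraction of queries whose top-1 prediction reaches the IoU threshold."""
    hits = sum(temporal_iou(p, g) >= iou_thresh for p, g in zip(preds, gts))
    return hits / len(gts)


def prior_only_baseline(durations: List[float],
                        rel_start: float = 0.0, rel_end: float = 0.3) -> List[Moment]:
    """Query-blind baseline: predict the same relative segment for every video.

    rel_start/rel_end are illustrative; a real bias probe would fit them to the
    training-set moment distribution.
    """
    return [(d * rel_start, d * rel_end) for d in durations]
```

With this protocol in hand, the paper's claim that strong benchmark numbers can coexist with weak grounding becomes directly checkable: run a blind baseline of this kind on the same splits and compare its recall against the reported models.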
Related papers
- Background-aware Moment Detection for Video Moment Retrieval [19.11524416308641]
Video moment retrieval (VMR) identifies a specific moment in an untrimmed video for a given natural language query.
Because natural language queries are ambiguous, a query rarely covers all the relevant details of the corresponding moment.
We propose a background-aware moment detection transformer (BM-DETR).
Our model learns to predict the target moment from the joint probability of each frame given the positive query and the complement of negative queries.
arXiv Detail & Related papers (2023-06-05T09:26:33Z)
- Deep Learning for Video-Text Retrieval: a Review [13.341694455581363]
Video-Text Retrieval (VTR) aims to search for the most relevant video related to the semantics in a given sentence.
In this survey, we review and summarize over 100 research papers related to VTR.
arXiv Detail & Related papers (2023-02-24T10:14:35Z)
- Multi-video Moment Ranking with Multimodal Clue [69.81533127815884]
State-of-the-art work for video corpus moment retrieval (VCMR) is based on a two-stage method.
MINUTE outperforms the baselines on the TVR and DiDeMo datasets.
arXiv Detail & Related papers (2023-01-29T18:38:13Z)
- Selective Query-guided Debiasing Network for Video Corpus Moment Retrieval [19.51766089306712]
Video moment retrieval aims to localize target moments in untrimmed videos pertinent to a given textual query.
Existing retrieval systems tend to rely on retrieval bias as a shortcut.
We propose a Selective Query-guided Debiasing network (SQuiDNet).
arXiv Detail & Related papers (2022-10-17T03:10:21Z)
- A Closer Look at Debiased Temporal Sentence Grounding in Videos: Dataset, Metric, and Approach [53.727460222955266]
Temporal Sentence Grounding in Videos (TSGV) aims to ground a natural language sentence in an untrimmed video.
Recent studies have found that current benchmark datasets may have obvious moment annotation biases.
We introduce a new evaluation metric, "dR@n,IoU@m", that discounts the basic recall scores to alleviate the inflated evaluation caused by biased datasets (a sketch of this kind of discounted recall appears after this list).
arXiv Detail & Related papers (2022-03-10T08:58:18Z)
- QVHighlights: Detecting Moments and Highlights in Videos via Natural Language Queries [89.24431389933703]
We present the Query-based Video Highlights (QVHighlights) dataset.
It consists of over 10,000 YouTube videos, covering a wide range of topics.
Each video in the dataset is annotated with: (1) a human-written free-form NL query, (2) relevant moments in the video w.r.t. the query, and (3) five-point scale saliency scores for all query-relevant clips.
arXiv Detail & Related papers (2021-07-20T16:42:58Z)
- Deconfounded Video Moment Retrieval with Causal Intervention [80.90604360072831]
We tackle the task of video moment retrieval (VMR), which aims to localize a specific moment in a video according to a textual query.
Existing methods primarily model the matching relationship between query and moment by complex cross-modal interactions.
We propose a causality-inspired VMR framework that builds a structural causal model to capture the true effect of query and video content on the prediction.
arXiv Detail & Related papers (2021-06-03T01:33:26Z)
- DORi: Discovering Object Relationship for Moment Localization of a Natural-Language Query in Video [98.54696229182335]
We study the task of temporal moment localization in a long untrimmed video using a natural language query.
Our key innovation is to learn a video feature embedding through a language-conditioned message-passing algorithm.
A temporal sub-graph captures the activities within the video through time.
arXiv Detail & Related papers (2020-10-13T09:50:29Z)
- Query Resolution for Conversational Search with Limited Supervision [63.131221660019776]
We propose QuReTeC (Query Resolution by Term Classification), a neural query resolution model based on bidirectional transformers.
We show that QuReTeC outperforms state-of-the-art models, and furthermore, that our distant supervision method can be used to substantially reduce the amount of human-curated data required to train QuReTeC.
arXiv Detail & Related papers (2020-05-24T11:37:22Z)
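The discounted recall mentioned in the "A Closer Look at Debiased Temporal Sentence Grounding" entry above targets the same score inflation that this paper documents. The sketch below is a hedged reading of that idea, assuming each hit is down-weighted by how far the predicted start and end drift from the ground truth, normalised by video duration; the exact discount used in dR@n,IoU@m should be taken from the cited paper.

```python
# Hedged sketch of a discounted recall in the spirit of "dR@n,IoU@m".
# Assumption (check against the cited paper): a top-1 hit with IoU >= m is
# multiplied by (1 - |start error|) * (1 - |end error|), with timestamps
# normalised by video duration, so lucky-but-loose predictions count for less.

from typing import List, Tuple

Moment = Tuple[float, float]  # (start_sec, end_sec)


def discounted_recall_at_1(preds: List[Moment], gts: List[Moment],
                           durations: List[float], iou_thresh: float = 0.5) -> float:
    total = 0.0
    for (ps, pe), (gs, ge), dur in zip(preds, gts, durations):
        inter = max(0.0, min(pe, ge) - max(ps, gs))
        union = max(pe, ge) - min(ps, gs)
        iou = inter / union if union > 0 else 0.0
        if iou >= iou_thresh:
            # Boundary discounts in [0, 1]: 1.0 for an exact boundary, smaller
            # as the predicted boundary drifts from the ground truth.
            alpha_s = max(0.0, 1.0 - abs(ps - gs) / dur)
            alpha_e = max(0.0, 1.0 - abs(pe - ge) / dur)
            total += alpha_s * alpha_e
    return total / len(gts)
```

When annotation bias is strong, the plain recall and this discounted variant diverge, which is the symptom such a metric is designed to expose.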
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.