A Survey on Video Moment Localization
- URL: http://arxiv.org/abs/2306.07515v1
- Date: Tue, 13 Jun 2023 02:57:32 GMT
- Title: A Survey on Video Moment Localization
- Authors: Meng Liu, Liqiang Nie, Yunxiao Wang, Meng Wang, Yong Rui
- Abstract summary: Video moment localization aims to search for a target segment within a video described by a given natural language query.
We present a review of existing video moment localization techniques, including supervised, weakly supervised, and unsupervised ones.
We discuss promising future directions for this field, in particular large-scale datasets and interpretable video moment localization models.
- Score: 61.5323647499912
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Video moment localization, also known as video moment retrieval, aims to
search for a target segment within a video described by a given natural language
query. Unlike temporal action localization, where the target actions are
pre-defined, video moment retrieval can handle arbitrarily complex
activities. In this survey paper, we aim to present a comprehensive review of
existing video moment localization techniques, including supervised, weakly
supervised, and unsupervised ones. We also review the datasets available for
video moment localization and group the results of related work. In addition, we
discuss promising future directions for this field, in particular large-scale
datasets and interpretable video moment localization models.
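To make the task definition concrete, the sketch below shows a generic proposal-based formulation shared by many supervised methods: enumerate candidate segments, pool their visual features, and rank them against a sentence embedding. This is an illustration only, not an algorithm from the survey; it assumes precomputed per-clip features and a query embedding, and `localize_moment` and its parameters are placeholder names.

```python
# Illustrative sketch of proposal-based video moment localization
# (not the survey's algorithm): score sliding-window candidates
# against a query embedding and return the best-matching span.
import numpy as np

def cosine(a, b):
    # Cosine similarity with a small epsilon to avoid division by zero.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def localize_moment(clip_feats, query_emb, window_sizes=(2, 4, 8), clip_len_sec=2.0):
    """Rank sliding-window candidates against the query embedding.

    clip_feats: (num_clips, d) precomputed visual features, one row per clip.
    query_emb:  (d,) precomputed sentence embedding.
    Returns (start_sec, end_sec, score) of the best-scoring candidate.
    """
    num_clips = clip_feats.shape[0]
    best = (0.0, 0.0, -1.0)
    for w in window_sizes:
        for s in range(0, num_clips - w + 1):
            seg = clip_feats[s:s + w].mean(axis=0)  # mean-pool the candidate span
            sim = cosine(seg, query_emb)
            if sim > best[2]:
                best = (s * clip_len_sec, (s + w) * clip_len_sec, sim)
    return best

# Usage with random stand-ins for real video/text encoder outputs:
rng = np.random.default_rng(0)
print(localize_moment(rng.normal(size=(32, 512)), rng.normal(size=512)))
```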
Related papers
- Prompting Large Language Models to Reformulate Queries for Moment Localization [79.57593838400618]
The task of moment localization is to localize a temporal moment in an untrimmed video for a given natural language query.
We make early attempts at reformulating moment queries into sets of instructions using large language models, making them more friendly to the localization models.
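The paper's actual prompts are not reproduced here; the snippet below is only a hypothetical example of how a moment query might be turned into an instruction-style prompt before being passed to an LLM. The template text and `build_prompt` are invented for illustration.

```python
# Hypothetical prompt template (the paper's real prompts are not shown here).
# The reformulated text would replace the raw query fed to the localizer.
PROMPT = (
    "Rewrite the following video search query as short, explicit instructions "
    "describing what should be visible on screen.\n"
    "Query: {query}\n"
    "Instructions:"
)

def build_prompt(query: str) -> str:
    """Fill the template with a raw moment query before sending it to an LLM."""
    return PROMPT.format(query=query)

print(build_prompt("the man flips the pancake twice"))
```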
arXiv Detail & Related papers (2023-06-06T05:48:09Z) - What, when, and where? -- Self-Supervised Spatio-Temporal Grounding in Untrimmed Multi-Action Videos from Narrated Instructions [55.574102714832456]
Spatio-temporal grounding describes the task of localizing events in space and time.
Models for this task are usually trained with human-annotated sentences and bounding box supervision.
We combine local representation learning, which focuses on fine-grained spatial information, with a global representation that captures higher-level representations.
arXiv Detail & Related papers (2023-03-29T19:38:23Z) - Hierarchical Video-Moment Retrieval and Step-Captioning [68.4859260853096]
HiREST consists of 3.4K text-video pairs from an instructional video dataset.
Our hierarchical benchmark consists of video retrieval, moment retrieval, and two novel tasks: moment segmentation and step captioning.
arXiv Detail & Related papers (2023-03-29T02:33:54Z) - Modal-specific Pseudo Query Generation for Video Corpus Moment Retrieval [20.493241098064665]
Video corpus moment retrieval (VCMR) is the task to retrieve the most relevant video moment from a large video corpus using a natural language query.
We propose a self-supervised learning framework: the Modal-specific Pseudo Query Generation Network (MPGN).
MPGN generates pseudo queries by exploiting both visual and textual information from selected temporal moments.
We show that MPGN successfully learns to localize the video corpus moment without any explicit annotation.
arXiv Detail & Related papers (2022-10-23T05:05:18Z) - Progressive Localization Networks for Language-based Moment Localization [56.54450664871467]
This paper focuses on the task of language-based moment localization.
Most existing methods first sample a sufficient number of candidate moments of various temporal lengths, and then match them with the given query to determine the target moment.
We propose a novel multi-stage Progressive Localization Network (PLN) which progressively localizes the target moment in a coarse-to-fine manner.
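As a rough illustration of the coarse-to-fine idea mentioned above (not PLN's actual multi-stage architecture), the toy sketch below first scores long, strided candidate windows and then refines the boundary inside the best coarse window; all inputs are assumed to be precomputed clip and query embeddings.

```python
# Toy coarse-to-fine localization (illustrative only, not the PLN model).
import numpy as np

def _sim(span_feats, query_emb):
    # Cosine similarity between a mean-pooled span feature and the query.
    seg = span_feats.mean(axis=0)
    return float(seg @ query_emb / (np.linalg.norm(seg) * np.linalg.norm(query_emb) + 1e-8))

def coarse_to_fine(clip_feats, query_emb, coarse_w=8, coarse_stride=4, fine_w=3):
    n = clip_feats.shape[0]
    # Stage 1: coarse scan with long, strided windows over the whole video.
    coarse = [(s, _sim(clip_feats[s:s + coarse_w], query_emb))
              for s in range(0, n - coarse_w + 1, coarse_stride)]
    cs = max(coarse, key=lambda x: x[1])[0]
    # Stage 2: fine scan with short windows, restricted to the winning coarse window.
    fine = [(s, _sim(clip_feats[s:s + fine_w], query_emb))
            for s in range(cs, cs + coarse_w - fine_w + 1)]
    fs = max(fine, key=lambda x: x[1])[0]
    return fs, fs + fine_w  # refined boundaries, in clip indices

rng = np.random.default_rng(1)
print(coarse_to_fine(rng.normal(size=(40, 256)), rng.normal(size=256)))
```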
arXiv Detail & Related papers (2021-02-02T03:45:59Z) - A Hierarchical Multi-Modal Encoder for Moment Localization in Video Corpus [31.387948069111893]
We show how to identify a short segment in a long video that semantically matches a text query.
To tackle this problem, we propose the HierArchical Multi-Modal EncodeR (HAMMER) that encodes a video at both the coarse-grained clip level and the fine-grained frame level.
We conduct extensive experiments to evaluate our model on moment localization in a video corpus, using the ActivityNet Captions and TVR datasets.
arXiv Detail & Related papers (2020-11-18T02:42:36Z) - DORi: Discovering Object Relationship for Moment Localization of a Natural-Language Query in Video [98.54696229182335]
We study the task of temporal moment localization in a long untrimmed video using a natural language query.
Our key innovation is to learn a video feature embedding through a language-conditioned message-passing algorithm.
A temporal sub-graph captures the activities within the video through time.
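The message-passing idea can be illustrated with a very simplified, language-gated update over a chain of temporal nodes; this is a generic sketch under assumed inputs, not DORi's actual formulation or graph structure.

```python
# Simplified language-conditioned message passing over a temporal graph
# (illustrative; not DORi's update equations). Each clip is a node connected
# to its temporal neighbours; messages are gated by relevance to the query.
import numpy as np

def message_passing(node_feats, query_emb, steps=2):
    """node_feats: (n, d) per-clip features; query_emb: (d,). Returns updated (n, d)."""
    n, d = node_feats.shape
    h = node_feats.copy()
    for _ in range(steps):
        # Language gate: sigmoid of the (scaled) node-query dot product, one scalar per node.
        gate = 1.0 / (1.0 + np.exp(-(h @ query_emb) / np.sqrt(d)))
        msgs = np.zeros_like(h)
        for i in range(n):
            nbrs = [j for j in (i - 1, i + 1) if 0 <= j < n]  # temporal neighbours only
            msgs[i] = np.mean([gate[j] * h[j] for j in nbrs], axis=0)
        h = 0.5 * h + 0.5 * msgs  # residual-style blend of old features and messages
    return h

rng = np.random.default_rng(2)
print(message_passing(rng.normal(size=(10, 64)), rng.normal(size=64)).shape)  # (10, 64)
```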
arXiv Detail & Related papers (2020-10-13T09:50:29Z) - Text-based Localization of Moments in a Video Corpus [38.393877654679414]
We address the task of temporal localization of moments in a corpus of videos for a given sentence query.
We propose the Hierarchical Moment Alignment Network (HMAN), which learns an effective joint embedding space for moments and sentences.
In addition to learning subtle differences between intra-video moments, HMAN focuses on distinguishing inter-video global semantic concepts based on sentence queries.
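A common way to learn such a joint space is a hinge-based ranking objective that pushes a query toward its ground-truth moment and away from both intra-video and inter-video negatives; the sketch below is a generic example of that idea, not HMAN's exact loss.

```python
# Toy triplet-style ranking objective for a joint moment-sentence embedding space
# (illustrative; not HMAN's exact loss).
import numpy as np

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def ranking_loss(query, pos_moment, intra_negs, inter_negs, margin=0.2):
    """Hinge loss: the query must beat every negative moment by at least `margin`."""
    pos = cos(query, pos_moment)
    loss = 0.0
    for neg in list(intra_negs) + list(inter_negs):
        loss += max(0.0, margin - pos + cos(query, neg))
    return loss

rng = np.random.default_rng(3)
q, p = rng.normal(size=128), rng.normal(size=128)
intra = rng.normal(size=(2, 128))   # other moments from the same video
inter = rng.normal(size=(3, 128))   # moments sampled from different videos
print(ranking_loss(q, p, intra, inter))
```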
arXiv Detail & Related papers (2020-08-20T00:05:45Z)
This list is automatically generated from the titles and abstracts of the papers on this site.