Progressive Localization Networks for Language-based Moment Localization
- URL: http://arxiv.org/abs/2102.01282v1
- Date: Tue, 2 Feb 2021 03:45:59 GMT
- Title: Progressive Localization Networks for Language-based Moment Localization
- Authors: Qi Zheng, Jianfeng Dong, Xiaoye Qu, Xun Yang, Shouling Ji, Xun Wang
- Abstract summary: This paper focuses on the task of language-based moment localization.
Most existing methods prefer to first sample sufficient candidate moments with various temporal lengths, and then match them with the given query to determine the target moment.
We propose a novel multi-stage Progressive Localization Network (PLN) which progressively localizes the target moment in a coarse-to-fine manner.
- Score: 56.54450664871467
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper targets the task of language-based moment localization. The
language-based setting of this task allows for an open set of target
activities, resulting in a large variation of the temporal lengths of video
moments. Most existing methods prefer to first sample sufficient candidate
moments with various temporal lengths, and then match them with the given query
to determine the target moment. However, candidate moments generated with a
fixed temporal granularity may be suboptimal to handle the large variation in
moment lengths. To this end, we propose a novel multi-stage Progressive
Localization Network (PLN) which progressively localizes the target moment in a
coarse-to-fine manner. Specifically, each stage of PLN has a localization
branch, and focuses on candidate moments that are generated with a specific
temporal granularity. The temporal granularities of candidate moments are
different across the stages. Moreover, we devise a conditional feature
manipulation module and an upsampling connection to bridge the multiple
localization branches. In this fashion, the later stages are able to absorb the
previously learned information, thus facilitating the more fine-grained
localization. Extensive experiments on three public datasets demonstrate the
effectiveness of our proposed PLN for language-based moment localization and
its potential for localizing short moments in long videos.
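Since the abstract gives no code, the following is a minimal PyTorch sketch of the coarse-to-fine idea it describes: each stage scores candidates at its own temporal granularity, and an upsampling connection feeds coarser-stage features into the finer stage. Module names, shapes, the per-clip scoring simplification, and the concat-based fusion (standing in for the conditional feature manipulation module) are all illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LocalizationStage(nn.Module):
    """One stage: scores candidates at a fixed temporal granularity."""
    def __init__(self, dim):
        super().__init__()
        self.fuse = nn.Conv1d(dim * 2, dim, kernel_size=3, padding=1)
        self.score = nn.Conv1d(dim, 1, kernel_size=1)

    def forward(self, video_feat, query_feat, prev_feat=None):
        # video_feat: (B, D, T) clip features at this stage's granularity
        # query_feat: (B, D) sentence embedding
        q = query_feat.unsqueeze(-1).expand_as(video_feat)
        x = self.fuse(torch.cat([video_feat, q], dim=1))
        if prev_feat is not None:
            # upsampling connection: lift the coarser stage's features to this
            # stage's finer resolution, so it absorbs what was learned earlier
            x = x + F.interpolate(prev_feat, size=x.size(-1), mode="linear",
                                  align_corners=False)
        return x, self.score(x).squeeze(1)

class PLNSketch(nn.Module):
    def __init__(self, dim=256, granularities=(16, 32, 64)):
        super().__init__()
        self.granularities = granularities
        self.stages = nn.ModuleList(LocalizationStage(dim) for _ in granularities)

    def forward(self, video_feat, query_feat):
        # video_feat: (B, D, T0) frame-level features; query_feat: (B, D)
        prev, all_scores = None, []
        for T, stage in zip(self.granularities, self.stages):
            clips = F.adaptive_avg_pool1d(video_feat, T)  # coarse-to-fine granularity
            prev, scores = stage(clips, query_feat, prev)
            all_scores.append(scores)  # (B, T): per-clip localization scores
        return all_scores
```

In this simplified form, each later stage operates on more clips (finer granularity) while reusing the previous stage's features, which mirrors the progressive coarse-to-fine localization the abstract describes.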
Related papers
- LITA: Language Instructed Temporal-Localization Assistant [71.68815100776278]
We introduce time tokens that encode timestamps relative to the video length to better represent time in videos.
We also introduce SlowFast tokens in the architecture to capture temporal information at fine temporal resolution.
We show that our emphasis on temporal localization also substantially improves video-based text generation compared to existing Video LLMs.
arXiv Detail & Related papers (2024-03-27T22:50:48Z)
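To make LITA's time-token idea concrete, here is a hedged sketch of relative-timestamp quantization; the token count and vocabulary layout are assumptions for illustration, not LITA's actual design.

```python
def time_to_token(t_sec: float, duration_sec: float, num_time_tokens: int = 100) -> int:
    """Quantize a timestamp relative to video length into a discrete time token."""
    rel = min(max(t_sec / duration_sec, 0.0), 1.0)  # position as a fraction of the video
    return round(rel * (num_time_tokens - 1))

def token_to_time(idx: int, duration_sec: float, num_time_tokens: int = 100) -> float:
    """Map a time token back to an approximate absolute timestamp."""
    return idx / (num_time_tokens - 1) * duration_sec

# Example: second 30 of a 120-second video falls a quarter of the way in,
# so it maps to token 25 regardless of the video's absolute length.
print(time_to_token(30.0, 120.0))  # 25
```

Encoding time relative to video length keeps the token vocabulary fixed across videos of very different durations, which is the point of the design as summarized above.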
- Jamp: Controlled Japanese Temporal Inference Dataset for Evaluating Generalization Capacity of Language Models [18.874880342410876]
We present Jamp, a Japanese benchmark focused on temporal inference.
Our dataset includes a range of temporal inference patterns, which enables us to conduct fine-grained analysis.
We evaluate the generalization capacities of monolingual/multilingual LMs by splitting our dataset based on tense fragments.
arXiv Detail & Related papers (2023-06-19T07:00:14Z)
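The tense-fragment split Jamp describes can be sketched as follows; the field names and fragment labels are hypothetical placeholders, not the benchmark's actual schema.

```python
from collections import defaultdict

def split_by_tense_fragment(examples, held_out_fragments):
    """Hold out whole tense fragments so test-time patterns are unseen in training."""
    groups = defaultdict(list)
    for ex in examples:
        groups[ex["tense_fragment"]].append(ex)
    train = [ex for frag, grp in groups.items() if frag not in held_out_fragments for ex in grp]
    test = [ex for frag, grp in groups.items() if frag in held_out_fragments for ex in grp]
    return train, test

# Example with hypothetical fragment labels:
data = [
    {"premise": "...", "hypothesis": "...", "tense_fragment": "before_past"},
    {"premise": "...", "hypothesis": "...", "tense_fragment": "after_present"},
]
train, test = split_by_tense_fragment(data, held_out_fragments={"after_present"})
```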
- A Survey on Video Moment Localization [61.5323647499912]
Video moment localization aims to search for a target segment within a video described by a given natural language query.
We present a review of existing video moment localization techniques, including supervised, weakly supervised, and unsupervised ones.
We discuss promising future directions for this field, in particular large-scale datasets and interpretable video moment localization models.
arXiv Detail & Related papers (2023-06-13T02:57:32Z)
- MS-DETR: Natural Language Video Localization with Sampling Moment-Moment Interaction [28.21563211881665]
Given a query, the task of Natural Language Video Localization (NLVL) is to localize a temporal moment in an untrimmed video that semantically matches the query.
In this paper, we adopt a proposal-based solution that generates proposals (i.e., candidate moments) and then selects the best-matching proposal.
On top of modeling the cross-modal interaction between candidate moments and the query, our proposed Moment Sampling DETR (MS-DETR) enables efficient moment-moment relation modeling.
arXiv Detail & Related papers (2023-05-30T12:06:35Z)
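A rough PyTorch sketch of the sampled moment-moment interaction idea in MS-DETR: self-attention over a sampled subset of candidates keeps relation modeling cheaper than full pairwise attention. The top-k sampling rule and all hyperparameters here are assumptions, not the paper's design.

```python
import torch
import torch.nn as nn

class SampledMomentInteraction(nn.Module):
    def __init__(self, dim, num_heads=8, num_samples=32):
        super().__init__()
        self.num_samples = num_samples
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, moment_feats, match_scores):
        # moment_feats: (B, N, D) candidate moment features
        # match_scores: (B, N) cross-modal matching scores against the query
        k = min(self.num_samples, moment_feats.size(1))
        idx = match_scores.topk(k, dim=1).indices  # keep k promising moments
        batch = torch.arange(moment_feats.size(0),
                             device=moment_feats.device).unsqueeze(-1)
        sampled = moment_feats[batch, idx]         # (B, k, D)
        # every candidate attends only to the sampled subset:
        # O(N * k) interactions instead of O(N^2)
        out, _ = self.attn(moment_feats, sampled, sampled)
        return moment_feats + out                  # residual update
```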
- Context-aware Biaffine Localizing Network for Temporal Sentence Grounding [61.18824806906945]
This paper addresses the problem of temporal sentence grounding (TSG).
TSG aims to identify the temporal boundary of a specific segment from an untrimmed video by a sentence query.
We propose a novel localization framework that scores all pairs of start and end indices within the video simultaneously with a biaffine mechanism.
arXiv Detail & Related papers (2021-03-22T03:13:05Z)
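The biaffine mechanism summarized above admits a compact sketch: every (start, end) index pair is scored by a bilinear map between start and end representations. This is a minimal illustration (bias terms omitted), not the paper's exact formulation.

```python
import torch
import torch.nn as nn

class BiaffineScorer(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.start_proj = nn.Linear(dim, dim)
        self.end_proj = nn.Linear(dim, dim)
        self.W = nn.Parameter(torch.empty(dim, dim))
        nn.init.xavier_uniform_(self.W)

    def forward(self, frame_feats):
        # frame_feats: (B, T, D) query-aware video features
        hs = self.start_proj(frame_feats)  # start representations
        he = self.end_proj(frame_feats)    # end representations
        # scores[b, i, j] = hs[b, i]^T W he[b, j], for all start/end pairs at once
        scores = torch.einsum("bid,de,bje->bij", hs, self.W, he)
        # only j >= i is a valid moment; mask out the lower triangle
        mask = torch.triu(torch.ones_like(scores, dtype=torch.bool))
        return scores.masked_fill(~mask, float("-inf"))
```

Scoring all pairs simultaneously is what distinguishes this from proposal-sampling approaches: the 2D score map covers every possible moment in one pass.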
- DORi: Discovering Object Relationship for Moment Localization of a Natural-Language Query in Video [98.54696229182335]
We study the task of temporal moment localization in a long untrimmed video using a natural language query.
Our key innovation is to learn a video feature embedding through a language-conditioned message-passing algorithm.
A temporal sub-graph captures the activities within the video through time.
arXiv Detail & Related papers (2020-10-13T09:50:29Z)
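A hedged sketch of language-conditioned message passing in the spirit of the DORi summary, using a simple temporal chain as the sub-graph; the graph structure, the wrap-around at clip boundaries, and the gating form are illustrative assumptions.

```python
import torch
import torch.nn as nn

class LanguageConditionedMessagePassing(nn.Module):
    def __init__(self, dim, steps=2):
        super().__init__()
        self.steps = steps
        self.msg = nn.Linear(dim * 2, dim)
        self.gate = nn.Linear(dim, dim)

    def forward(self, node_feats, query_emb):
        # node_feats: (B, T, D) per-segment nodes of a temporal sub-graph
        # query_emb:  (B, D) sentence embedding conditioning the messages
        g = torch.sigmoid(self.gate(query_emb)).unsqueeze(1)  # (B, 1, D) language gate
        h = node_feats
        for _ in range(self.steps):
            left = torch.roll(h, shifts=1, dims=1)    # message from previous segment
            right = torch.roll(h, shifts=-1, dims=1)  # message from next segment
            m = self.msg(torch.cat([left, right], dim=-1))
            h = h + g * torch.tanh(m)                 # language-gated update
        return h
```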
- Language Guided Networks for Cross-modal Moment Retrieval [66.49445903955777]
Cross-modal moment retrieval aims to localize a temporal segment from an untrimmed video described by a natural language query.
Existing methods independently extract the features of videos and sentences.
We present Language Guided Networks (LGN), a new framework that leverages the sentence embedding to guide the whole process of moment retrieval.
arXiv Detail & Related papers (2020-06-18T12:08:40Z)
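As an illustration of a sentence embedding guiding video feature extraction, here is a FiLM-style channel-modulation sketch; LGN's actual guidance mechanism may differ, and all names and shapes are assumptions.

```python
import torch
import torch.nn as nn

class LanguageGuidedBlock(nn.Module):
    def __init__(self, video_dim, query_dim):
        super().__init__()
        self.conv = nn.Conv1d(video_dim, video_dim, kernel_size=3, padding=1)
        self.to_scale = nn.Linear(query_dim, video_dim)
        self.to_shift = nn.Linear(query_dim, video_dim)

    def forward(self, video_feat, query_emb):
        # video_feat: (B, D, T) video features; query_emb: (B, Dq) sentence embedding
        x = self.conv(video_feat)
        gamma = self.to_scale(query_emb).unsqueeze(-1)  # (B, D, 1)
        beta = self.to_shift(query_emb).unsqueeze(-1)
        # the sentence embedding modulates every channel of the video features,
        # so language guides the extraction process rather than only the matching
        return torch.relu(gamma * x + beta)
```

Stacking such blocks makes the video representation query-dependent from the start, in contrast to the independent feature extraction the summary criticizes.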