Natural Language Video Localization with Learnable Moment Proposals
- URL: http://arxiv.org/abs/2109.10678v1
- Date: Wed, 22 Sep 2021 12:18:58 GMT
- Title: Natural Language Video Localization with Learnable Moment Proposals
- Authors: Shaoning Xiao, Long Chen, Jian Shao, Yueting Zhuang, Jun Xiao
- Abstract summary: We propose a novel model termed LPNet (Learnable Proposal Network for NLVL) with a fixed set of learnable moment proposals.
In this paper, we demonstrate the effectiveness of LPNet over existing state-of-the-art methods.
- Score: 40.91060659795612
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Given an untrimmed video and a natural language query, Natural Language Video
Localization (NLVL) aims to identify the video moment described by the query.
To address this task, existing methods can be roughly grouped into two
categories: 1) propose-and-rank models first define a set of hand-designed
moment candidates and then rank them to find the best match; 2) proposal-free
models directly predict the two temporal boundaries of the referential moment
from frames. Currently, almost all propose-and-rank methods perform worse than
their proposal-free counterparts. In this paper, we argue that the
propose-and-rank approach is underestimated because of its predefined proposal
design: 1) hand-designed rules can hardly guarantee complete coverage of the
target segments; 2) densely sampled candidate moments cause redundant
computation and degrade the performance of the ranking process. To this end,
we propose a novel model termed LPNet (Learnable Proposal Network for NLVL)
with a fixed set of learnable moment proposals. The position and length of
these proposals are dynamically adjusted during training. Moreover, a
boundary-aware loss is proposed to leverage frame-level information and
further improve performance. Extensive ablations on two challenging NLVL
benchmarks demonstrate the effectiveness of LPNet over existing
state-of-the-art methods.
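The abstract only sketches how the learnable proposals work. A minimal PyTorch illustration of the core idea, a fixed set of (center, length) parameters that gradient descent can shift and stretch during training, might look as follows; the class and parameter names are hypothetical, not from the paper's code.

```python
import torch
import torch.nn as nn

class LearnableProposals(nn.Module):
    """A fixed set of moment proposals whose normalized (center, length)
    are free parameters updated by gradient descent, instead of being
    hand-designed. Illustrative sketch, not the authors' implementation."""

    def __init__(self, num_proposals: int = 16):
        super().__init__()
        # Spread initial centers over the video; start with moderate lengths.
        centers = torch.linspace(0.1, 0.9, num_proposals)
        lengths = torch.full((num_proposals,), 0.3)
        # Store as logits so a sigmoid keeps values in (0, 1) during training.
        init = torch.stack([centers, lengths], dim=1)
        self.params = nn.Parameter(torch.logit(init))

    def forward(self) -> torch.Tensor:
        cl = torch.sigmoid(self.params)               # (N, 2): center, length
        start = (cl[:, 0] - cl[:, 1] / 2).clamp(0.0, 1.0)
        end = (cl[:, 0] + cl[:, 1] / 2).clamp(0.0, 1.0)
        return torch.stack([start, end], dim=1)       # (N, 2): start, end

proposals = LearnableProposals()
print(proposals())  # (start, end) pairs; these drift as training proceeds
```

The boundary-aware loss is described only as leveraging frame-level information; one common realization of such a term is an auxiliary per-frame start/end classification loss added to the proposal-ranking objective, though the paper's exact formulation may differ.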
Related papers
- Adaptive Proposal Generation Network for Temporal Sentence Localization
in Videos [58.83440885457272]
We address the problem of temporal sentence localization in videos (TSLV).
Traditional methods follow a top-down framework which localizes the target segment with pre-defined segment proposals.
We propose an Adaptive Proposal Generation Network (APGN) to maintain the segment-level interaction while improving efficiency.
arXiv Detail & Related papers (2021-09-14T02:02:36Z) - VL-NMS: Breaking Proposal Bottlenecks in Two-Stage Visual-Language
Matching [75.71523183166799]
The prevailing framework for matching multimodal inputs is based on a two-stage process.
We argue that these methods overlook an obvious mismatch between the roles of proposals in the two stages.
We propose VL-NMS, which is the first method to yield query-aware proposals at the first stage.
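The summary leaves the mechanism implicit. A plausible minimal reading, fusing each proposal's detector confidence with a predicted query-relevance score before standard NMS so that query-relevant boxes survive suppression, is sketched below; the relevance scores and function name are hypothetical stand-ins for the paper's learned components.

```python
import torch
from torchvision.ops import nms

def query_aware_nms(boxes, det_scores, relevance, iou_thresh=0.5):
    """Rank proposals by detector confidence * query relevance before NMS,
    so boxes that match the language query are kept (illustrative sketch,
    not VL-NMS's actual scoring network)."""
    fused = det_scores * relevance
    return nms(boxes, fused, iou_thresh)  # indices of kept proposals

# Toy example: the second box overlaps the first but better matches the query.
boxes = torch.tensor([[0., 0., 50., 50.],
                      [5., 5., 55., 55.],
                      [100., 100., 150., 150.]])
det_scores = torch.tensor([0.9, 0.8, 0.6])
relevance = torch.tensor([0.2, 0.9, 0.7])  # hypothetical language-conditioned scores
print(query_aware_nms(boxes, det_scores, relevance))  # keeps box 1, drops box 0
```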
arXiv Detail & Related papers (2021-05-12T13:05:25Z) - Boundary Proposal Network for Two-Stage Natural Language Video
Localization [23.817486773852142]
Boundary Proposal Network (BPNet) is a universal two-stage framework that gets rid of the issues mentioned above.
In the first stage, BPNet utilizes an anchor-free model to generate a group of quality candidate video segments with their boundaries.
In the second stage, a visual-language fusion layer is proposed to jointly model the multi-modal interaction between the candidate and the language query.
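The anchor-free first stage presumably turns per-frame boundary probabilities into candidate segments; one standard way to do this, pairing high-scoring start and end frames and keeping the top combinations, is sketched below with hypothetical names (BPNet's exact generation rule may differ).

```python
import torch

def pair_boundaries(start_prob, end_prob, top_k=5):
    """Pair frame-level start/end probabilities into candidate segments,
    scoring each valid (s, e) with start_prob[s] * end_prob[e] and keeping
    the top_k. Illustrative of anchor-free candidate generation, not
    BPNet's exact rule."""
    T = start_prob.numel()
    score = start_prob[:, None] * end_prob[None, :]                     # (T, T)
    valid = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)  # s < e
    score = torch.where(valid, score, torch.zeros_like(score))
    top = score.flatten().topk(top_k)
    starts = torch.div(top.indices, T, rounding_mode="floor")
    ends = top.indices % T
    return torch.stack([starts, ends], dim=1), top.values

# Toy boundary probabilities over a 10-frame video
start_p = torch.rand(10).softmax(dim=0)
end_p = torch.rand(10).softmax(dim=0)
segments, scores = pair_boundaries(start_p, end_p)
print(segments)  # top-5 (start_frame, end_frame) candidates
```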
arXiv Detail & Related papers (2021-03-15T03:06:18Z) - BSN++: Complementary Boundary Regressor with Scale-Balanced Relation
Modeling for Temporal Action Proposal Generation [85.13713217986738]
We present BSN++, a new framework which exploits complementary boundary regressor and relation modeling for temporal proposal generation.
Not surprisingly, the proposed BSN++ ranked first on the CVPR19 ActivityNet challenge leaderboard for the temporal action localization task.
arXiv Detail & Related papers (2020-09-15T07:08:59Z) - VLANet: Video-Language Alignment Network for Weakly-Supervised Video
Moment Retrieval [21.189093631175425]
Video Moment Retrieval (VMR) is a task to localize the temporal moment in untrimmed video specified by natural language query.
This paper explores methods for performing VMR in a weakly-supervised manner (wVMR).
The experiments show that the method achieves state-of-the-art performance on Charades-STA and DiDeMo datasets.
arXiv Detail & Related papers (2020-08-24T07:54:59Z) - Regularized Two-Branch Proposal Networks for Weakly-Supervised Moment
Retrieval in Videos [108.55320735031721]
Video moment retrieval aims to localize the target moment in a video according to the given sentence.
Most existing weakly-supervised methods apply a MIL-based framework to develop inter-sample confrontment.
We propose a novel Regularized Two-Branch Proposal Network to simultaneously consider the inter-sample and intra-sample confrontments.
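For context, the MIL-based inter-sample confrontment mentioned here usually means scoring the matched video-sentence pair above mismatched pairs by a margin; a minimal version of that ranking loss is sketched below (the paper's regularized two-branch formulation is more involved than this).

```python
import torch
import torch.nn.functional as F

def mil_ranking_loss(pos_score, neg_scores, margin=0.2):
    """Inter-sample confrontment: push the matched video-sentence score
    above every mismatched score by a margin (a standard MIL ranking
    loss; simplified relative to the paper's two-branch objective)."""
    return F.relu(margin + neg_scores - pos_score).mean()

pos = torch.tensor(0.8)                 # score of the matched pair
negs = torch.tensor([0.5, 0.75, 0.9])   # scores with mismatched sentences
print(mil_ranking_loss(pos, negs))      # only negatives within the margin of pos contribute
```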
arXiv Detail & Related papers (2020-08-19T04:42:46Z) - Complementary Boundary Generator with Scale-Invariant Relation Modeling
for Temporal Action Localization: Submission to ActivityNet Challenge 2020 [66.4527310659592]
This report presents an overview of our solution used in the submission to ActivityNet Challenge 2020 Task 1.
We decouple the temporal action localization task into two stages (i.e. proposal generation and classification) and enrich the proposal diversity.
Our proposed scheme achieves state-of-the-art performance on the temporal action localization task with 42.26 average mAP on the challenge testing set.
arXiv Detail & Related papers (2020-07-20T04:35:40Z)