Boundary Proposal Network for Two-Stage Natural Language Video
Localization
- URL: http://arxiv.org/abs/2103.08109v1
- Date: Mon, 15 Mar 2021 03:06:18 GMT
- Title: Boundary Proposal Network for Two-Stage Natural Language Video
Localization
- Authors: Shaoning Xiao, Long Chen, Songyang Zhang, Wei Ji, Jian Shao, Lu Ye,
Jun Xiao
- Abstract summary: Boundary Proposal Network (BPNet) is a universal two-stage framework that avoids the inherent drawbacks of existing one-stage approaches.
In the first stage, BPNet utilizes an anchor-free model to generate a group of high-quality candidate video segments with their boundaries.
In the second stage, a visual-language fusion layer is proposed to jointly model the multi-modal interaction between each candidate and the language query.
- Score: 23.817486773852142
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We aim to address the problem of Natural Language Video Localization
(NLVL): localizing the video segment that corresponds to a natural language
description in a long, untrimmed video. State-of-the-art NLVL methods are
almost exclusively one-stage and typically fall into two categories: 1)
anchor-based approaches, which first pre-define a series of video segment
candidates (e.g., by sliding window) and then classify each candidate; 2)
anchor-free approaches, which directly predict, for each video frame, the
probability that it is a boundary or an intermediate frame of the target
segment. However, both kinds of one-stage approaches have inherent drawbacks:
the anchor-based approach is sensitive to heuristic rules, which limits its
ability to handle videos of varying length, while the anchor-free approach
fails to exploit segment-level interactions and thus achieves inferior
results. In this paper, we propose a novel Boundary Proposal Network (BPNet),
a universal two-stage framework that avoids both issues. Specifically, in the
first stage, BPNet utilizes an
anchor-free model to generate a group of high-quality candidate video segments
with their boundaries. In the second stage, a visual-language fusion layer is
proposed to jointly model the multi-modal interaction between each candidate and
the language query, followed by a matching score rating layer that outputs the
alignment score for each candidate. We evaluate our BPNet on three challenging
NLVL benchmarks (i.e., Charades-STA, TACoS and ActivityNet-Captions). Extensive
experiments and ablation studies on these datasets demonstrate that BPNet
outperforms state-of-the-art methods.
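To make the two-stage design concrete, the sketch below mirrors the pipeline the abstract describes: an anchor-free first stage scores every frame as a potential start or end boundary and pairs the top-scoring frames into candidate segments, and a second stage fuses each candidate with the query and rates its alignment. This is a minimal PyTorch-style illustration, not the authors' released code; the module names, feature dimensions, mean-pooling fusion, and top-k pairing rule are all assumptions.

```python
import torch
import torch.nn as nn

class BoundaryProposalStage(nn.Module):
    """Stage 1 (sketch): an anchor-free head scores each frame as a potential
    start/end boundary, then pairs top-scoring boundaries into candidates."""

    def __init__(self, dim=256, top_k=8):
        super().__init__()
        self.start_head = nn.Linear(dim, 1)  # per-frame start-boundary score
        self.end_head = nn.Linear(dim, 1)    # per-frame end-boundary score
        self.top_k = top_k

    def forward(self, frame_feats):  # frame_feats: (T, dim) video features
        p_start = self.start_head(frame_feats).squeeze(-1).sigmoid()  # (T,)
        p_end = self.end_head(frame_feats).squeeze(-1).sigmoid()      # (T,)
        starts = p_start.topk(self.top_k).indices
        ends = p_end.topk(self.top_k).indices
        # Keep only temporally valid (start <= end) pairs as candidate
        # segments; a real implementation would also guard the empty case.
        return [(s.item(), e.item()) for s in starts for e in ends if s <= e]

class MatchingStage(nn.Module):
    """Stage 2 (sketch): pool each candidate's frames into a segment-level
    feature, fuse it with the query, and output one alignment score."""

    def __init__(self, dim=256):
        super().__init__()
        self.fusion = nn.Linear(2 * dim, dim)  # visual-language fusion
        self.score = nn.Linear(dim, 1)         # matching score rating

    def forward(self, frame_feats, query_feat, candidates):
        scores = []
        for s, e in candidates:
            seg = frame_feats[s:e + 1].mean(dim=0)  # segment-level pooling
            fused = torch.tanh(self.fusion(torch.cat([seg, query_feat])))
            scores.append(self.score(fused))
        return torch.stack(scores).squeeze(-1)      # (num_candidates,)

# Toy usage: 64 frames with 256-d features; pick the best-scoring segment.
frames, query = torch.randn(64, 256), torch.randn(256)
candidates = BoundaryProposalStage()(frames)
scores = MatchingStage()(frames, query, candidates)
best_segment = candidates[scores.argmax().item()]
```

Note that the second stage scores a pooled segment-level feature, which is precisely the segment-level interaction the abstract argues one-stage anchor-free methods fail to exploit.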
Related papers
- Generation-Guided Multi-Level Unified Network for Video Grounding [18.402093379973085]
Video grounding aims to locate the timestamps best matching the query description within an untrimmed video.
Moment-level approaches directly predict, from a global perspective, the probability that each transient moment is a boundary.
Clip-level approaches aggregate moments within different time windows into proposals and then select the most similar one, which gives them an advantage in fine-grained grounding.
arXiv Detail & Related papers (2023-03-14T09:48:59Z) - Hierarchical Local-Global Transformer for Temporal Sentence Grounding [58.247592985849124]
This paper studies the multimedia problem of temporal sentence grounding.
It aims to accurately determine the specific video segment in an untrimmed video according to a given sentence query.
arXiv Detail & Related papers (2022-08-31T14:16:56Z) - PPMN: Pixel-Phrase Matching Network for One-Stage Panoptic Narrative
Grounding [24.787497472368244]
We propose a one-stage end-to-end Pixel-Phrase Matching Network (PPMN), which directly matches each phrase to its corresponding pixels instead of region proposals.
Our method achieves new state-of-the-art performance on the PNG benchmark, with an absolute gain of 4.0 points in Average Recall.
arXiv Detail & Related papers (2022-08-11T05:42:12Z) - Skimming, Locating, then Perusing: A Human-Like Framework for Natural
Language Video Localization [19.46938403691984]
We propose a two-step human-like framework called Skimming-Locating-Perusing.
SLP consists of a Skimming-and-Locating (SL) module and a Bi-directional Perusing (BP) module.
Our SLP outperforms state-of-the-art methods and localizes segment boundaries more precisely.
arXiv Detail & Related papers (2022-07-27T10:59:33Z) - Part-level Action Parsing via a Pose-guided Coarse-to-Fine Framework [108.70949305791201]
Part-level Action Parsing (PAP) aims to not only predict the video-level action but also recognize the frame-level fine-grained actions or interactions of body parts for each person in the video.
In particular, our framework first predicts the video-level class of the input video, then localizes the body parts and predicts the part-level action.
Our framework achieves state-of-the-art performance, outperforming existing methods with a ROC score of 31.10%.
arXiv Detail & Related papers (2022-03-09T01:30:57Z) - Natural Language Video Localization with Learnable Moment Proposals [40.91060659795612]
We propose a novel model termed LPNet (Learnable Proposal Network for NLVL) with a fixed set of learnable moment proposals.
In this paper, we demonstrate the effectiveness of LPNet over existing state-of-the-art methods.
arXiv Detail & Related papers (2021-09-22T12:18:58Z) - Rethinking Cross-modal Interaction from a Top-down Perspective for
Referring Video Object Segmentation [140.4291169276062]
Referring video object segmentation (RVOS) aims to segment video objects with the guidance of natural language reference.
Previous methods typically tackle RVOS by directly grounding the linguistic reference over the image lattice.
In this work, we put forward a two-stage, top-down RVOS solution. First, an exhaustive set of object tracklets is constructed by propagating object masks detected from several sampled frames to the entire video.
Second, a Transformer-based tracklet-language grounding module is proposed, which models instance-level visual relations and cross-modal interactions simultaneously and efficiently.
arXiv Detail & Related papers (2021-06-02T10:26:13Z) - Video Instance Segmentation with a Propose-Reduce Paradigm [68.59137660342326]
Video instance segmentation (VIS) aims to segment and associate all instances of predefined classes for each frame in videos.
Prior methods usually obtain segmentation for a frame or clip first, and then merge the incomplete results by tracking or matching.
We propose a new paradigm, Propose-Reduce, which generates complete sequences for input videos in a single step.
arXiv Detail & Related papers (2021-03-25T10:58:36Z) - BSN++: Complementary Boundary Regressor with Scale-Balanced Relation
Modeling for Temporal Action Proposal Generation [85.13713217986738]
We present BSN++, a new framework which exploits complementary boundary regressor and relation modeling for temporal proposal generation.
Not surprisingly, the proposed BSN++ ranked 1st on the CVPR19 ActivityNet challenge leaderboard for the temporal action localization task.
arXiv Detail & Related papers (2020-09-15T07:08:59Z) - Fine-grained Iterative Attention Network for Temporal Language
Localization in Videos [63.94898634140878]
Temporal language localization in videos aims to ground one video segment in an untrimmed video based on a given sentence query.
We propose a Fine-grained Iterative Attention Network (FIAN) that consists of an iterative attention module for bilateral query-video information extraction.
We evaluate the proposed method on three challenging public benchmarks: ActivityNet Captions, TACoS, and Charades-STA.
arXiv Detail & Related papers (2020-08-06T04:09:03Z) - CBR-Net: Cascade Boundary Refinement Network for Action Detection:
Submission to ActivityNet Challenge 2020 (Task 1) [42.77192990307131]
We present our solution for the task of temporal action localization (detection) (task 1) in ActivityNet Challenge 2020.
The purpose of this task is to temporally localize intervals where actions of interest occur and predict the action categories in a long untrimmed video.
We then combine the video-level classification results obtained by the fine-tuned networks to predict the category of each proposal.
arXiv Detail & Related papers (2020-06-13T01:05:51Z)