Regularized Two-Branch Proposal Networks for Weakly-Supervised Moment
Retrieval in Videos
- URL: http://arxiv.org/abs/2008.08257v1
- Date: Wed, 19 Aug 2020 04:42:46 GMT
- Title: Regularized Two-Branch Proposal Networks for Weakly-Supervised Moment
Retrieval in Videos
- Authors: Zhu Zhang, Zhijie Lin, Zhou Zhao, Jieming Zhu and Xiuqiang He
- Abstract summary: Video moment retrieval aims to localize the target moment in a video according to the given sentence.
Most existing weakly-supervised methods apply a MIL-based framework to develop inter-sample confrontment.
We propose a novel Regularized Two-Branch Proposal Network to simultaneously consider the inter-sample and intra-sample confrontments.
- Score: 108.55320735031721
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video moment retrieval aims to localize the target moment in a video
according to the given sentence. The weakly-supervised setting only provides
video-level sentence annotations during training. Most existing weakly-supervised
methods apply a MIL-based framework to develop inter-sample confrontment, but
ignore the intra-sample confrontment between moments with semantically similar
contents. Thus, these methods fail to distinguish the target moment from
plausible negative moments. In this paper, we propose a novel Regularized
Two-Branch Proposal Network to simultaneously consider the inter-sample and
intra-sample confrontments. Concretely, we first devise a language-aware filter
to generate an enhanced video stream and a suppressed video stream. We then
design the sharable two-branch proposal module to generate positive proposals
from the enhanced stream and plausible negative proposals from the suppressed
one for sufficient confrontment. Further, we apply the proposal regularization
to stabilize the training process and improve model performance. The extensive
experiments show the effectiveness of our method. Our code is released here.
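The abstract's language-aware filter splits a video into an enhanced stream (language-relevant content amplified) and a suppressed stream (its complement). A minimal numpy sketch of that idea, assuming a simple query-conditioned sigmoid gate; all names here are hypothetical and this is not the authors' released implementation:

```python
import numpy as np

def language_aware_filter(video_feats, query_feat):
    """Illustrative sketch: score each frame against the sentence
    embedding, then gate the video into an 'enhanced' stream and a
    complementary 'suppressed' stream. Not the authors' code."""
    # Frame-query relevance via scaled dot product, squashed to (0, 1).
    scores = video_feats @ query_feat / np.sqrt(query_feat.shape[0])
    gate = 1.0 / (1.0 + np.exp(-scores))              # shape: (num_frames,)
    enhanced = video_feats * gate[:, None]            # relevant frames kept
    suppressed = video_feats * (1.0 - gate)[:, None]  # complement stream
    return enhanced, suppressed

rng = np.random.default_rng(0)
video = rng.standard_normal((128, 512))   # 128 frames, 512-d features
query = rng.standard_normal(512)          # sentence embedding
enh, sup = language_aware_filter(video, query)
```

Because the two gates sum to one per frame, the two streams partition the original features; positive proposals would then be drawn from `enh` and plausible negatives from `sup`.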
Related papers
- DiffusionVMR: Diffusion Model for Joint Video Moment Retrieval and
Highlight Detection [38.12212015133935]
A novel framework, DiffusionVMR, is proposed to redefine the two tasks as a unified conditional denoising generation process.
Experiments conducted on five widely-used benchmarks demonstrate the effectiveness and flexibility of the proposed DiffusionVMR.
arXiv Detail & Related papers (2023-08-29T08:20:23Z)
- Counterfactual Cross-modality Reasoning for Weakly Supervised Video
Moment Localization [67.88493779080882]
Video moment localization aims to retrieve the target segment of an untrimmed video according to the natural language query.
Recent works contrast the cross-modality similarities driven by reconstructing masked queries.
We propose a novel counterfactual cross-modality reasoning method.
arXiv Detail & Related papers (2023-08-10T15:45:45Z)
- DiffTAD: Temporal Action Detection with Proposal Denoising Diffusion [137.8749239614528]
We propose a new formulation of temporal action detection (TAD) with denoising diffusion, DiffTAD.
Taking as input random temporal proposals, it can yield action proposals accurately given an untrimmed long video.
arXiv Detail & Related papers (2023-03-27T00:40:52Z)
- End-to-End Dense Video Grounding via Parallel Regression [30.984657885692553]
Video grounding aims to localize the corresponding video moment in an untrimmed video given a language query.
We present an end-to-end parallel decoding paradigm by re-purposing a Transformer-like architecture (PRVG).
Thanks to its simple design, our PRVG framework can be applied in different testing schemes.
arXiv Detail & Related papers (2021-09-23T10:03:32Z)
- Natural Language Video Localization with Learnable Moment Proposals [40.91060659795612]
We propose a novel model termed LPNet (Learnable Proposal Network for NLVL) with a fixed set of learnable moment proposals.
In this paper, we demonstrate the effectiveness of LPNet over existing state-of-the-art methods.
arXiv Detail & Related papers (2021-09-22T12:18:58Z)
- Two-Stream Consensus Network for Weakly-Supervised Temporal Action
Localization [94.37084866660238]
We present a Two-Stream Consensus Network (TSCN) to simultaneously address these challenges.
The proposed TSCN features an iterative refinement training method, where a frame-level pseudo ground truth is iteratively updated.
We propose a new attention normalization loss to encourage the predicted attention to act like a binary selection, and promote the precise localization of action instance boundaries.
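The attention normalization loss described above pushes predicted attention toward a binary selection. One common way to realize that goal, sketched here with numpy under assumed details (the fraction of frames treated as extremes, and the exact loss form, are illustrative, not the TSCN authors' code):

```python
import numpy as np

def attention_normalization_loss(attention, top_frac=0.1):
    """Illustrative sketch: drive the mean of the highest-scoring
    frames toward 1 and the lowest toward 0, so attention behaves
    like a binary selection over frames."""
    a = np.sort(attention)               # ascending frame attentions
    k = max(1, int(len(a) * top_frac))   # frames counted in each extreme
    top_mean = a[-k:].mean()             # should approach 1
    bottom_mean = a[:k].mean()           # should approach 0
    # Minimized when the gap between the extremes is maximal.
    return -(top_mean - bottom_mean)

# A near-binary attention pattern yields a lower loss than a flat one.
binary_like = np.array([0.95, 0.9, 0.05, 0.1, 0.92, 0.08])
flat = np.full(6, 0.5)
loss_binary = attention_normalization_loss(binary_like)
loss_flat = attention_normalization_loss(flat)
```

Minimizing this term sharpens the attention distribution, which in turn tightens the localized action boundaries.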
arXiv Detail & Related papers (2020-10-22T10:53:32Z)
- Weakly-Supervised Multi-Level Attentional Reconstruction Network for
Grounding Textual Queries in Videos [73.4504252917816]
The task of temporally grounding textual queries in videos is to localize one video segment that semantically corresponds to the given query.
Most of the existing approaches rely on segment-sentence pairs (temporal annotations) for training, which are usually unavailable in real-world scenarios.
We present an effective weakly-supervised model, named Multi-Level Attentional Reconstruction Network (MARN), which relies only on video-sentence pairs during the training stage.
arXiv Detail & Related papers (2020-03-16T07:01:01Z)
- Look Closer to Ground Better: Weakly-Supervised Temporal Grounding of
Sentence in Video [53.69956349097428]
Given an untrimmed video and a query sentence, our goal is to localize a temporal segment in the video that semantically corresponds to the query sentence.
We propose a two-stage model to tackle this problem in a coarse-to-fine manner.
arXiv Detail & Related papers (2020-01-25T13:07:43Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.