CBR-Net: Cascade Boundary Refinement Network for Action Detection:
Submission to ActivityNet Challenge 2020 (Task 1)
- URL: http://arxiv.org/abs/2006.07526v2
- Date: Wed, 24 Jun 2020 04:22:52 GMT
- Title: CBR-Net: Cascade Boundary Refinement Network for Action Detection:
Submission to ActivityNet Challenge 2020 (Task 1)
- Authors: Xiang Wang, Baiteng Ma, Zhiwu Qing, Yongpeng Sang, Changxin Gao,
Shiwei Zhang, Nong Sang
- Abstract summary: We present our solution for the task of temporal action localization (detection) (task 1) in ActivityNet Challenge 2020.
The purpose of this task is to temporally localize intervals where actions of interest occur and predict the action categories in a long untrimmed video.
In the action localization stage, we combine the video-level classification results obtained by the fine-tuned networks to predict the category of each proposal.
- Score: 42.77192990307131
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this report, we present our solution for the task of temporal action
localization (detection) (Task 1) in ActivityNet Challenge 2020. The purpose of
this task is to temporally localize intervals where actions of interest occur
and predict the action categories in a long untrimmed video. Our solution
mainly includes three components: 1) feature encoding: we apply three kinds of
backbones, including TSN [7], SlowFast [3] and I3D [1], all pretrained on the
Kinetics dataset [2]. Applying these models, we can extract snippet-level video
representations; 2) proposal generation: we choose BMN [5] as our baseline,
based on which we design a Cascade Boundary Refinement Network (CBR-Net) to
conduct proposal detection. The CBR-Net mainly contains two modules: a temporal
feature encoding module, which applies a BiLSTM to encode long-term temporal
information, and a CBR module, which aims to refine proposal precision under
different parameter settings; 3) action localization: in this stage, we combine
the video-level classification results obtained by the fine-tuned networks to
predict the category of each proposal. Moreover, we apply different ensemble
strategies to improve the performance of the designed solution, by which we
achieve 42.788% on the testing set of the ActivityNet v1.3 dataset in terms of
the mean Average Precision (mAP) metric.
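The cascade refinement idea in component 2) can be made concrete with a short sketch. The following PyTorch code is a hypothetical re-creation, not the authors' released implementation: a BiLSTM encodes long-term temporal context over the snippet-level features, and a cascade of refinement heads repeatedly adjusts each proposal's boundaries and re-scores it. Module names, feature dimensions, the number of stages, and the offset update rule are all illustrative assumptions.

```python
# Hypothetical sketch of a cascade boundary refinement module (PyTorch).
# Dimensions, stage count, and the update rule are assumptions for
# illustration; this is not the authors' released CBR-Net code.
import torch
import torch.nn as nn

class CascadeBoundaryRefiner(nn.Module):
    def __init__(self, feat_dim=400, hidden_dim=256, num_stages=3):
        super().__init__()
        # Temporal feature encoding: a BiLSTM over the snippet sequence.
        self.bilstm = nn.LSTM(feat_dim, hidden_dim,
                              batch_first=True, bidirectional=True)
        # One head per cascade stage; each predicts
        # (start offset, end offset, confidence) for every proposal.
        self.stages = nn.ModuleList([
            nn.Sequential(nn.Linear(4 * hidden_dim + 2, hidden_dim),
                          nn.ReLU(),
                          nn.Linear(hidden_dim, 3))
            for _ in range(num_stages)
        ])

    def forward(self, snippet_feats, proposals):
        # snippet_feats: (T, feat_dim); proposals: (N, 2) normalized [start, end].
        ctx, _ = self.bilstm(snippet_feats.unsqueeze(0))
        ctx = ctx.squeeze(0)                      # (T, 2 * hidden_dim)
        T, scores = ctx.size(0), None
        for head in self.stages:
            # Gather temporal context at the current start/end positions.
            s = (proposals[:, 0].clamp(0, 1) * (T - 1)).long()
            e = (proposals[:, 1].clamp(0, 1) * (T - 1)).long()
            x = torch.cat([ctx[s], ctx[e], proposals], dim=1)
            out = head(x)
            # Each stage refines the previous stage's boundaries.
            proposals = proposals + 0.1 * torch.tanh(out[:, :2])
            scores = torch.sigmoid(out[:, 2])
        return proposals, scores
```

For example, `CascadeBoundaryRefiner()(torch.randn(100, 400), torch.rand(32, 2).sort(dim=1).values)` would refine 32 coarse BMN-style proposals over a 100-snippet video; per the abstract, the cascade is run under different parameter settings and the outputs are ensembled.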
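For component 3), the abstract states only that video-level classification results are combined with the proposals. A common fusion rule for ActivityNet-style detection (an assumption here, not a detail confirmed by the report) assigns each proposal the top video-level classes and multiplies the class probability by the proposal confidence:

```python
# Hypothetical proposal-classification fusion; the exact rule used by the
# authors is not specified in the abstract.
import numpy as np

def assign_categories(proposals, video_cls_probs, top_k=2):
    """proposals: iterable of (start, end, confidence) from the proposal stage.
    video_cls_probs: (num_classes,) softmax scores from fine-tuned classifiers."""
    top_classes = np.argsort(video_cls_probs)[::-1][:top_k]
    detections = []
    for start, end, conf in proposals:
        for c in top_classes:
            # Final detection score = proposal confidence * class probability.
            detections.append((start, end, int(c), conf * video_cls_probs[c]))
    return detections
```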
Related papers
- Unified Static and Dynamic Network: Efficient Temporal Filtering for Video Grounding [56.315932539150324]
We design a Unified Static and Dynamic Network (UniSDNet) to learn the semantic association between the video and text/audio queries.
Our UniSDNet is applicable to both Natural Language Video Grounding (NLVG) and Spoken Language Video Grounding (SLVG) tasks.
arXiv Detail & Related papers (2024-03-21T06:53:40Z) - HTNet: Anchor-free Temporal Action Localization with Hierarchical
Transformers [19.48000379201692]
Temporal action localization (TAL) is a task of identifying a set of actions in a video.
We present a novel anchor-free framework, known as HTNet, which predicts a set of <start time, end time, class> triplets from a video.
We demonstrate how our method localizes accurate action instances and achieves state-of-the-art performance on two TAL benchmark datasets.
arXiv Detail & Related papers (2022-07-20T05:40:03Z) - Context-aware Proposal Network for Temporal Action Detection [47.72048484299649]
This report presents our first-place winning solution for the temporal action detection task in the CVPR-2022 ActivityNet Challenge.
The task aims to localize temporal boundaries of action instances with specific classes in long untrimmed videos.
We argue that the generated proposals contain rich contextual information, which may benefit detection confidence prediction.
arXiv Detail & Related papers (2022-06-18T01:43:43Z) - Proposal Relation Network for Temporal Action Detection [41.23726979184197]
The purpose of this task is to locate and identify actions of interest in long untrimmed videos.
Our solution builds on BMN, and mainly contains three steps: 1) action classification and feature encoding by Slowfast, CSN and ViViT; 2) proposal generation.
We ensemble the results under different settings and achieve 44.7% on the test set, which improves the champion result in ActivityNet 2020 by 1.9% in terms of average mAP.
arXiv Detail & Related papers (2021-06-20T02:51:34Z) - Target-Aware Object Discovery and Association for Unsupervised Video
Multi-Object Segmentation [79.6596425920849]
This paper addresses the task of unsupervised video multi-object segmentation.
We introduce a novel approach for more accurate and efficient spatio-temporal segmentation.
We evaluate the proposed approach on DAVIS-17 and YouTube-VIS, and the results demonstrate that it outperforms state-of-the-art methods both in segmentation accuracy and inference speed.
arXiv Detail & Related papers (2021-04-10T14:39:44Z) - Temporal Context Aggregation Network for Temporal Action Proposal
Refinement [93.03730692520999]
Temporal action proposal generation is a challenging yet important task in the video understanding field.
Current methods still suffer from inaccurate temporal boundaries and inferior confidence used for retrieval.
We propose TCANet to generate high-quality action proposals through "local and global" temporal context aggregation.
arXiv Detail & Related papers (2021-03-24T12:34:49Z) - Complementary Boundary Generator with Scale-Invariant Relation Modeling
for Temporal Action Localization: Submission to ActivityNet Challenge 2020 [66.4527310659592]
This report presents an overview of our solution used in the submission to ActivityNet Challenge 2020 Task 1.
We decouple the temporal action localization task into two stages (i.e. proposal generation and classification) and enrich the proposal diversity.
Our proposed scheme achieves state-of-the-art performance on the temporal action localization task with 42.26 average mAP on the challenge testing set.
arXiv Detail & Related papers (2020-07-20T04:35:40Z) - Temporal Fusion Network for Temporal Action Localization: Submission to
ActivityNet Challenge 2020 (Task E) [45.3218136336925]
This report analyzes the temporal action localization method we used in the HACS competition hosted in ActivityNet Challenge 2020.
The goal of the task is to locate the start time and end time of each action in an untrimmed video and to predict the action category.
By fusing the results of multiple models, our method obtains 40.55% on the validation set and 40.53% on the test set in terms of mAP, and achieves Rank 1 in this challenge.
arXiv Detail & Related papers (2020-06-13T00:33:00Z)