Temporal Fusion Network for Temporal Action Localization: Submission to
ActivityNet Challenge 2020 (Task E)
- URL: http://arxiv.org/abs/2006.07520v1
- Date: Sat, 13 Jun 2020 00:33:00 GMT
- Title: Temporal Fusion Network for Temporal Action Localization: Submission to
ActivityNet Challenge 2020 (Task E)
- Authors: Zhiwu Qing, Xiang Wang, Yongpeng Sang, Changxin Gao, Shiwei Zhang,
Nong Sang
- Abstract summary: This report analyzes the temporal action localization method we used in the HACS competition hosted in the ActivityNet Challenge 2020.
The goal of the task is to locate the start and end times of actions in untrimmed videos and to predict the action category.
By fusing the results of multiple models, our method obtains 40.55% mAP on the validation set and 40.53% mAP on the test set, and achieves Rank 1 in this challenge.
- Score: 45.3218136336925
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This technical report analyzes the temporal action localization method we
used in the HACS competition hosted in the ActivityNet Challenge 2020. The goal of
the task is to locate the start and end times of actions in untrimmed videos and to
predict the action categories. First, we use video-level feature information to train
multiple video-level action classification models, which gives us the categories of
the actions present in each video. Second, we focus on generating high-quality
temporal proposals. For this purpose, we apply BMN to generate a large number of
proposals and obtain a high recall rate. We then refine these proposals with a
cascade-structured network called Refine Network, which predicts a position offset
and a new IoU under the supervision of the ground truth. To make the proposals more
accurate, we use bidirectional LSTM, Non-local, and Transformer modules to capture
temporal relationships between the local features of each proposal and the global
features of the video. Finally, by fusing the results of multiple models, our method
obtains 40.55% mAP on the validation set and 40.53% mAP on the test set, and achieves
Rank 1 in this challenge.
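As a concrete illustration of the refinement step described in the abstract, the sketch below shows how a proposal-level head could run a bidirectional LSTM over the snippet features inside a candidate proposal, fuse the result with a global video feature, and regress a boundary offset and a new IoU score. This is a minimal sketch under assumed feature dimensions and module names (ProposalRefineHead, feat_dim, hidden_dim are illustrative); it is not the authors' released implementation of Refine Network.

```python
# Minimal sketch (not the authors' code): bidirectional LSTM over proposal-local
# snippet features, fused with a global video feature, predicting a boundary
# offset and an IoU score. Dimensions and names are illustrative assumptions.
import torch
import torch.nn as nn


class ProposalRefineHead(nn.Module):
    def __init__(self, feat_dim=400, hidden_dim=256):
        super().__init__()
        # Bidirectional LSTM over the snippet features cropped from one proposal.
        self.lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True,
                            bidirectional=True)
        # The pooled proposal feature is concatenated with the global video feature.
        fused_dim = 2 * hidden_dim + feat_dim
        self.offset_head = nn.Linear(fused_dim, 2)                    # (start, end) offsets
        self.iou_head = nn.Sequential(nn.Linear(fused_dim, 1), nn.Sigmoid())

    def forward(self, proposal_feats, global_feat):
        # proposal_feats: (B, T, feat_dim) snippet features inside the proposal
        # global_feat:    (B, feat_dim) pooled feature of the whole video
        out, _ = self.lstm(proposal_feats)        # (B, T, 2*hidden_dim)
        local = out.mean(dim=1)                   # temporal average pooling
        fused = torch.cat([local, global_feat], dim=-1)
        return self.offset_head(fused), self.iou_head(fused)


# Usage with random tensors standing in for real features.
head = ProposalRefineHead()
offsets, ious = head(torch.randn(4, 32, 400), torch.randn(4, 400))
print(offsets.shape, ious.shape)  # torch.Size([4, 2]) torch.Size([4, 1])
```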
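Likewise, the final fusion of multiple models' results can be pictured as a score-level ensemble over aligned proposals. The weighted-average rule and data layout below are assumptions for illustration only; the paper does not specify its exact fusion scheme.

```python
# Minimal sketch (an assumption, not the paper's method): weighted average of
# per-proposal confidence scores from several models, followed by re-ranking.
import numpy as np


def fuse_proposal_scores(score_lists, weights=None):
    """score_lists: list of (N,) arrays, one per model, aligned per proposal."""
    scores = np.stack(score_lists, axis=0)              # (num_models, N)
    if weights is None:
        weights = np.ones(len(score_lists)) / len(score_lists)
    fused = np.average(scores, axis=0, weights=weights)  # weighted score fusion
    return np.argsort(-fused), fused                      # ranking + fused scores


order, fused = fuse_proposal_scores([np.random.rand(100) for _ in range(3)])
```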
Related papers
- Technical Report for ActivityNet Challenge 2022 -- Temporal Action Localization [20.268572246761895]
We propose to locate the temporal boundaries of each action and predict the action class in untrimmed videos.
Faster-TAD simplifies the pipeline of TAD and achieves remarkable performance.
arXiv Detail & Related papers (2024-10-31T14:16:56Z)
- Context-aware Proposal Network for Temporal Action Detection [47.72048484299649]
This report presents our first-place winning solution for the temporal action detection task in the CVPR-2022 ActivityNet Challenge.
The task aims to localize temporal boundaries of action instances with specific classes in long untrimmed videos.
We argue that the generated proposals contain rich contextual information, which may benefit detection confidence prediction.
arXiv Detail & Related papers (2022-06-18T01:43:43Z)
- Transferable Knowledge-Based Multi-Granularity Aggregation Network for Temporal Action Localization: Submission to ActivityNet Challenge 2021 [33.840281113206444]
This report presents an overview of our solution used in the submission to the 2021 HACS Temporal Action Localization Challenge.
We use Temporal Context Aggregation Network (TCANet) to generate high-quality action proposals.
We also adopt an additional module to transfer the knowledge from trimmed videos to untrimmed videos.
Our proposed scheme achieves 39.91 and 29.78 average mAP on the challenge testing set of the supervised and weakly-supervised temporal action localization tracks, respectively.
arXiv Detail & Related papers (2021-07-27T06:18:21Z)
- Temporal Context Aggregation Network for Temporal Action Proposal Refinement [93.03730692520999]
Temporal action proposal generation is a challenging yet important task in the video understanding field.
Current methods still suffer from inaccurate temporal boundaries and inferior confidence used for retrieval.
We propose TCANet to generate high-quality action proposals through "local and global" temporal context aggregation.
arXiv Detail & Related papers (2021-03-24T12:34:49Z)
- Complementary Boundary Generator with Scale-Invariant Relation Modeling for Temporal Action Localization: Submission to ActivityNet Challenge 2020 [66.4527310659592]
This report presents an overview of our solution used in the submission to ActivityNet Challenge 2020 Task 1.
We decouple the temporal action localization task into two stages (i.e. proposal generation and classification) and enrich the proposal diversity.
Our proposed scheme achieves state-of-the-art performance on the temporal action localization task with 42.26 average mAP on the challenge testing set.
arXiv Detail & Related papers (2020-07-20T04:35:40Z)
- Team RUC_AIM3 Technical Report at ActivityNet 2020 Task 2: Exploring Sequential Events Detection for Dense Video Captioning [63.91369308085091]
We propose a novel and simple model for event sequence generation and explore temporal relationships of the event sequence in the video.
The proposed model omits inefficient two-stage proposal generation and directly generates event boundaries conditioned on bi-directional temporal dependency in one pass.
The overall system achieves state-of-the-art performance on the dense-captioning events in video task with 9.894 METEOR score on the challenge testing set.
arXiv Detail & Related papers (2020-06-14T13:21:37Z)
- CBR-Net: Cascade Boundary Refinement Network for Action Detection: Submission to ActivityNet Challenge 2020 (Task 1) [42.77192990307131]
We present our solution for the task of temporal action localization (detection) (task 1) in ActivityNet Challenge 2020.
The purpose of this task is to temporally localize intervals where actions of interest occur and predict the action categories in a long untrimmed video.
In this stage, we combine the video-level classification results obtained by the fine-tuned networks to predict the category of each proposal.
arXiv Detail & Related papers (2020-06-13T01:05:51Z)
- Weakly-Supervised Multi-Level Attentional Reconstruction Network for Grounding Textual Queries in Videos [73.4504252917816]
The task of temporally grounding textual queries in videos is to localize one video segment that semantically corresponds to the given query.
Most of the existing approaches rely on segment-sentence pairs (temporal annotations) for training, which are usually unavailable in real-world scenarios.
We present an effective weakly-supervised model, named Multi-Level Attentional Reconstruction Network (MARN), which only relies on video-sentence pairs during the training stage.
arXiv Detail & Related papers (2020-03-16T07:01:01Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.