Weakly-Supervised Temporal Action Localization Through Local-Global
Background Modeling
- URL: http://arxiv.org/abs/2106.11811v1
- Date: Sun, 20 Jun 2021 02:58:45 GMT
- Title: Weakly-Supervised Temporal Action Localization Through Local-Global
Background Modeling
- Authors: Xiang Wang, Zhiwu Qing, Ziyuan Huang, Yutong Feng, Shiwei Zhang,
Jianwen Jiang, Mingqian Tang, Yuanjie Shao, Nong Sang
- Abstract summary: We present our 2021 HACS Challenge - Weakly-supervised Learning Track solution, which is based on BaSNet, to address the above problem.
Specifically, we first adopt pre-trained CSN, SlowFast, TDN, and ViViT as feature extractors to obtain feature sequences.
Then our proposed Local-Global Background Modeling Network (LGBM-Net) is trained to localize instances using only video-level labels.
- Score: 30.104982661371164
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The Weakly-Supervised Temporal Action Localization (WS-TAL) task aims to
recognize and localize the temporal starts and ends of action instances in an
untrimmed video with only video-level label supervision. Due to the lack of
negative samples of the background category, it is difficult for the network to
separate foreground from background, resulting in poor detection performance. In
this report, we present our 2021 HACS Challenge - Weakly-supervised Learning
Track solution, which is based on BaSNet, to address the above problem.
Specifically, we first adopt pre-trained CSN, SlowFast, TDN, and ViViT as
feature extractors to obtain feature sequences. Then our proposed Local-Global
Background Modeling Network (LGBM-Net) is trained to localize instances using
only video-level labels based on Multi-Instance Learning (MIL). Finally, we
ensemble multiple models to obtain the final detection results and reach 22.45%
mAP on the test set.
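To make the MIL setup concrete, here is a minimal, hypothetical PyTorch sketch of a BaSNet-style head: a shared snippet-level classifier produces a class activation sequence (CAS), a filtering module suppresses background, and top-k pooling turns snippet scores into video-level scores trained with only video-level labels. Module names and hyperparameters are illustrative assumptions, not the actual LGBM-Net.

```python
# A minimal, hypothetical sketch of the BaSNet-style MIL idea (PyTorch).
# LGBM-Net builds local-global background modeling on top of this; the
# module names and hyperparameters below are illustrative assumptions.
import torch
import torch.nn as nn

class BaSNetStyleHead(nn.Module):
    def __init__(self, feat_dim=2048, num_classes=200, topk_ratio=8):
        super().__init__()
        self.topk_ratio = topk_ratio
        # Filtering module: per-snippet foreground attention in [0, 1].
        self.filter = nn.Sequential(
            nn.Conv1d(feat_dim, 256, 3, padding=1), nn.ReLU(),
            nn.Conv1d(256, 1, 1), nn.Sigmoid(),
        )
        # Shared snippet-level classifier producing a class activation
        # sequence (CAS); the extra channel is the background class.
        self.classifier = nn.Sequential(
            nn.Conv1d(feat_dim, 512, 3, padding=1), nn.ReLU(),
            nn.Conv1d(512, num_classes + 1, 1),
        )

    def video_scores(self, cas):
        # MIL pooling: average the top-k snippet logits per class.
        k = max(1, cas.shape[2] // self.topk_ratio)
        return cas.topk(k, dim=2).values.mean(dim=2)  # (B, C+1)

    def forward(self, feats):  # feats: (B, feat_dim, T) snippet features
        attn = self.filter(feats)                 # (B, 1, T)
        cas_base = self.classifier(feats)         # background kept
        cas_supp = self.classifier(feats * attn)  # background suppressed
        return self.video_scores(cas_base), self.video_scores(cas_supp), attn

def mil_loss(scores_base, scores_supp, labels):
    # labels: (B, C) multi-hot video-level action labels.
    bce = nn.BCEWithLogitsLoss()
    bg = torch.ones(labels.shape[0], 1, device=labels.device)
    # The base branch must also fire on background; the suppressed
    # branch must not, which forces the filter to separate them.
    return (bce(scores_base, torch.cat([labels, bg], dim=1)) +
            bce(scores_supp, torch.cat([labels, 0 * bg], dim=1)))
```

At inference, action instances are typically extracted by thresholding the suppressed CAS and grouping consecutive above-threshold snippets into segments.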
Related papers
- Weakly Supervised Video Anomaly Detection and Localization with Spatio-Temporal Prompts [57.01985221057047]
This paper introduces a novel method that learns temporal prompt embeddings for weakly supervised video anomaly detection and localization (WSVADL) based on pre-trained vision-language models (VLMs).
Our method achieves state-of-the-art performance on three public benchmarks for the WSVADL task.
arXiv Detail & Related papers (2024-08-12T03:31:29Z)
- Open-Vocabulary Spatio-Temporal Action Detection [59.91046192096296]
Open-vocabulary spatio-temporal action detection (OV-STAD) is an important fine-grained video understanding task.
OV-STAD requires training a model on a limited set of base classes with box and label supervision.
To better adapt the holistic VLM for the fine-grained action detection task, we carefully fine-tune it on the localized video region-text pairs.
arXiv Detail & Related papers (2024-05-17T14:52:47Z)
- Dense Video Object Captioning from Disjoint Supervision [77.47084982558101]
We propose a new task and model for dense video object captioning.
This task unifies spatial and temporal localization in video.
We show how our model improves upon a number of strong baselines for this new task.
arXiv Detail & Related papers (2023-06-20T17:57:23Z)
- HTNet: Anchor-free Temporal Action Localization with Hierarchical Transformers [19.48000379201692]
Temporal action localization (TAL) is the task of identifying a set of actions in a video.
We present a novel anchor-free framework, known as HTNet, which predicts a set of <start time, end time, class> triplets from a video.
We demonstrate that our method localizes action instances accurately and achieves state-of-the-art performance on two TAL benchmark datasets.
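As a rough illustration of this anchor-free formulation (a generic sketch, not HTNet's actual hierarchical-Transformer architecture), each temporal position can classify the action and regress distances to the action's start and end:

```python
# Generic anchor-free TAL head: every temporal position predicts class
# scores plus non-negative offsets to the action start/end, which decode
# into <start time, end time, class> triplets. Illustrative only.
import torch
import torch.nn as nn

class AnchorFreeHead(nn.Module):
    def __init__(self, feat_dim=512, num_classes=200):
        super().__init__()
        self.cls = nn.Conv1d(feat_dim, num_classes, 3, padding=1)
        self.reg = nn.Conv1d(feat_dim, 2, 3, padding=1)  # (to-start, to-end)

    def forward(self, feats, stride=1.0, score_thresh=0.5):
        # feats: (B, feat_dim, T) -> per-video lists of triplets (pre-NMS).
        cls_scores = self.cls(feats).sigmoid()   # (B, C, T)
        offsets = self.reg(feats).exp()          # (B, 2, T), strictly > 0
        t = torch.arange(feats.shape[2], dtype=torch.float32,
                         device=feats.device) * stride
        starts = t - offsets[:, 0] * stride      # (B, T)
        ends = t + offsets[:, 1] * stride
        triplets = []
        for b in range(feats.shape[0]):
            score, cls_id = cls_scores[b].max(dim=0)  # best class per position
            keep = score > score_thresh
            triplets.append(list(zip(starts[b][keep].tolist(),
                                     ends[b][keep].tolist(),
                                     cls_id[keep].tolist())))
        return triplets
```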
arXiv Detail & Related papers (2022-07-20T05:40:03Z)
- Unsupervised Pre-training for Temporal Action Localization Tasks [76.01985780118422]
We propose a self-supervised pretext task, coined Pseudo Action Localization (PAL), to Unsupervisedly Pre-train feature encoders for Temporal Action Localization tasks (UP-TAL).
Specifically, we first randomly select temporal regions, each of which contains multiple clips, from one video as pseudo actions and then paste them onto different temporal positions of the other two videos.
The pretext task is to align the features of pasted pseudo action regions from two synthetic videos and maximize the agreement between them.
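A bare-bones sketch of that paste-and-align idea follows (hypothetical names; the paper's actual formulation, e.g. its alignment loss, may differ):

```python
# Paste the same pseudo-action features into two different background
# videos at different positions, then maximize agreement between the
# encoder's features at the pasted locations. Illustrative assumption:
# the encoder maps (1, d_in, T) to (1, d, T), preserving length T.
import torch
import torch.nn.functional as F

def paste(background, action, pos):
    # background: (d_in, T_bg), action: (d_in, T_act);
    # requires pos + T_act <= T_bg.
    out = background.clone()
    out[:, pos:pos + action.shape[1]] = action
    return out

def pal_agreement_loss(encoder, video_a, video_b, action, pos_a, pos_b):
    # Build two synthetic videos containing the same pseudo action.
    synth_a = paste(video_a, action, pos_a)
    synth_b = paste(video_b, action, pos_b)
    feats_a = encoder(synth_a.unsqueeze(0)).squeeze(0)  # (d, T)
    feats_b = encoder(synth_b.unsqueeze(0)).squeeze(0)
    # Pool the features of the pasted region in each synthetic video.
    T = action.shape[1]
    z_a = feats_a[:, pos_a:pos_a + T].mean(dim=1)
    z_b = feats_b[:, pos_b:pos_b + T].mean(dim=1)
    # Maximize cosine agreement between the two views of the pseudo action.
    return 1 - F.cosine_similarity(z_a, z_b, dim=0)
```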
arXiv Detail & Related papers (2022-03-25T12:13:43Z)
- Unsupervised Temporal Video Grounding with Deep Semantic Clustering [58.95918952149763]
Temporal video grounding aims to localize a target segment in a video according to a given sentence query.
In this paper, we explore whether a video grounding model can be learned without any paired annotations.
Considering there is no paired supervision, we propose a novel Deep Semantic Clustering Network (DSCNet) to leverage all semantic information from the whole query set.
arXiv Detail & Related papers (2022-01-14T05:16:33Z)
- Transferable Knowledge-Based Multi-Granularity Aggregation Network for Temporal Action Localization: Submission to ActivityNet Challenge 2021 [33.840281113206444]
This report presents an overview of our solution used in the submission to the 2021 HACS Temporal Action Localization Challenge.
We use Temporal Context Aggregation Network (TCANet) to generate high-quality action proposals.
We also adopt an additional module to transfer the knowledge from trimmed videos to untrimmed videos.
Our proposed scheme achieves 39.91 and 29.78 average mAP on the challenge testing set for the supervised and weakly-supervised temporal action localization tracks, respectively.
arXiv Detail & Related papers (2021-07-27T06:18:21Z)
- ACSNet: Action-Context Separation Network for Weakly Supervised Temporal Action Localization [148.55210919689986]
We introduce an Action-Context Separation Network (ACSNet) that takes into account context for accurate action localization.
ACSNet outperforms existing state-of-the-art WS-TAL methods by a large margin.
arXiv Detail & Related papers (2021-03-28T09:20:54Z)
- Temporal Fusion Network for Temporal Action Localization: Submission to ActivityNet Challenge 2020 (Task E) [45.3218136336925]
This report analyzes a temporal action localization method we used in the HACS competition, which is hosted in the ActivityNet Challenge 2020.
The goal of our task is to locate the start time and end time of each action in the untrimmed video and predict the action category.
By fusing the results of multiple models, our method obtains 40.55% on the validation set and 40.53% on the test set in terms of mAP, and achieves Rank 1 in this challenge.
arXiv Detail & Related papers (2020-06-13T00:33:00Z)
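Both this entry and the report at the top of this page fuse the detections of several models. A minimal, generic sketch of such result-level fusion (pool all detections, then class-wise temporal NMS; the exact schemes used in these reports may differ):

```python
# Merge detections (start, end, class_id, score) from several models and
# keep the highest-scoring, non-overlapping ones per class. Generic sketch.
def temporal_iou(a, b):
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def fuse_detections(per_model_dets, iou_thresh=0.5):
    # per_model_dets: list (one per model) of lists of detections.
    dets = sorted((d for m in per_model_dets for d in m),
                  key=lambda d: d[3], reverse=True)
    kept = []
    for d in dets:  # greedy class-wise temporal NMS
        if all(k[2] != d[2] or temporal_iou(d, k) < iou_thresh for k in kept):
            kept.append(d)
    return kept
```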