Weakly Supervised Temporal Action Localization Through Learning Explicit
Subspaces for Action and Context
- URL: http://arxiv.org/abs/2103.16155v1
- Date: Tue, 30 Mar 2021 08:26:53 GMT
- Title: Weakly Supervised Temporal Action Localization Through Learning Explicit
Subspaces for Action and Context
- Authors: Ziyi Liu, Le Wang, Wei Tang, Junsong Yuan, Nanning Zheng, Gang Hua
- Abstract summary: Methods learn to localize temporal starts and ends of action instances in a video under only video-level supervision.
We introduce a framework that learns two feature subspaces respectively for actions and their context.
The proposed approach outperforms state-of-the-art WS-TAL methods on three benchmarks.
- Score: 151.23835595907596
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Weakly-supervised Temporal Action Localization (WS-TAL) methods learn to
localize temporal starts and ends of action instances in a video under only
video-level supervision. Existing WS-TAL methods rely on deep features learned
for action recognition. However, due to the mismatch between classification and
localization, these features cannot distinguish the frequently co-occurring
contextual background, i.e., the context, from the actual action instances. We
term this challenge action-context confusion; it adversely affects action
localization accuracy. To address it, we introduce a
framework that learns two feature subspaces, one for actions and one for their
context. By explicitly accounting for action visual elements, action
instances can be localized more precisely, without distraction from the
context. To facilitate the learning of these two feature subspaces with only
video-level categorical labels, we leverage predictions from both the spatial
and temporal streams for snippet grouping. In addition, an unsupervised
learning task is introduced to make the proposed module focus on mining
temporal information. The proposed approach outperforms state-of-the-art WS-TAL
methods on three benchmarks: the THUMOS14, ActivityNet v1.2, and ActivityNet
v1.3 datasets.
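To make the idea concrete, here is a minimal PyTorch sketch of the described design: two learned projections of snippet features defining explicit action and context subspaces, plus a toy snippet-grouping step that fuses spatial (RGB) and temporal (flow) stream predictions. This is an illustration, not the authors' implementation; the feature dimensions, the added background class, and the grouping threshold are all assumptions.

```python
# A minimal sketch (not the authors' code) of two explicit feature
# subspaces for action and context, with two-stream snippet grouping.
import torch
import torch.nn as nn

class ActionContextSubspaces(nn.Module):
    def __init__(self, feat_dim=2048, sub_dim=512, num_classes=20):
        super().__init__()
        # Two learned projections define the explicit subspaces.
        self.action_proj = nn.Linear(feat_dim, sub_dim)
        self.context_proj = nn.Linear(feat_dim, sub_dim)
        # Shared classifier applied to each subspace representation
        # (+1 for an assumed background class).
        self.classifier = nn.Linear(sub_dim, num_classes + 1)

    def forward(self, snippets):                      # (B, T, feat_dim)
        act = torch.relu(self.action_proj(snippets))  # action subspace
        ctx = torch.relu(self.context_proj(snippets)) # context subspace
        # Temporal class activation sequences from each subspace.
        return self.classifier(act), self.classifier(ctx)

def group_snippets(cas_rgb, cas_flow, thresh=0.5):
    """Group snippets into action vs. context by fusing the two streams'
    predictions (an assumed stand-in for the paper's grouping step)."""
    fused = 0.5 * (cas_rgb.softmax(-1) + cas_flow.softmax(-1))
    fg_score = 1.0 - fused[..., -1]   # 1 - background probability
    return fg_score > thresh          # (B, T) boolean action mask
```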
Related papers
- Open-Vocabulary Spatio-Temporal Action Detection [59.91046192096296]
Open-vocabulary spatio-temporal action detection (OV-STAD) is an important fine-grained video understanding task.
OV-STAD requires training a model on a limited set of base classes with box and label supervision.
To better adapt the holistic VLM for the fine-grained action detection task, we carefully fine-tune it on the localized video region-text pairs.
arXiv Detail & Related papers (2024-05-17T14:52:47Z)
- Weakly-Supervised Temporal Action Localization with Bidirectional Semantic Consistency Constraint [83.36913240873236]
Weakly Supervised Temporal Action Localization (WTAL) aims to classify and localize the temporal boundaries of actions in a video.
We propose a simple yet efficient method, named bidirectional semantic consistency constraint (Bi-SCC), to discriminate positive actions from co-scene actions.
Experimental results show that our approach outperforms the state-of-the-art methods on THUMOS14 and ActivityNet.
arXiv Detail & Related papers (2023-04-25T07:20:33Z)
- Structured Video-Language Modeling with Temporal Grouping and Spatial Grounding [112.3913646778859]
We propose a simple yet effective video-language modeling framework, S-ViLM.
It includes two novel designs, inter-clip spatial grounding and intra-clip temporal grouping, to promote learning region-object alignment and temporal-aware features.
S-ViLM surpasses the state-of-the-art methods substantially on four representative downstream tasks.
arXiv Detail & Related papers (2023-03-28T22:45:07Z)
- ASM-Loc: Action-aware Segment Modeling for Weakly-Supervised Temporal Action Localization [36.90693762365237]
Weakly-supervised temporal action localization aims to recognize and localize action segments in untrimmed videos given only video-level action labels for training.
We propose ASM-Loc, a novel WTAL framework that enables explicit, action-aware segment modeling beyond standard MIL-based methods.
Our framework entails three segment-centric components: (i) dynamic segment sampling to compensate for the contribution of short actions; (ii) intra- and inter-segment attention for modeling action dynamics and capturing temporal dependencies; (iii) pseudo instance-level supervision for improving action boundary prediction (see the first sketch after this list).
arXiv Detail & Related papers (2022-03-29T01:59:26Z)
- ACM-Net: Action Context Modeling Network for Weakly-Supervised Temporal Action Localization [18.56421375743287]
We propose an action-context modeling network termed ACM-Net.
It integrates a three-branch attention module that simultaneously measures the likelihood of each temporal point being an action instance, context, or non-action background (see the second sketch after this list).
Our method outperforms current state-of-the-art methods, and even achieves performance comparable to fully-supervised methods.
arXiv Detail & Related papers (2021-04-07T07:39:57Z)
- ACSNet: Action-Context Separation Network for Weakly Supervised Temporal Action Localization [148.55210919689986]
We introduce an Action-Context Separation Network (ACSNet) that takes into account context for accurate action localization.
ACSNet outperforms existing state-of-the-art WS-TAL methods by a large margin.
arXiv Detail & Related papers (2021-03-28T09:20:54Z)
- PcmNet: Position-Sensitive Context Modeling Network for Temporal Action Localization [11.685362686431446]
We propose a temporal-position-sensitive context modeling approach to incorporate both positional and semantic information for more precise action localization.
We achieve state-of-the-art performance on two challenging datasets, THUMOS-14 and ActivityNet-1.3.
arXiv Detail & Related papers (2021-03-09T07:34:01Z)
- Intra- and Inter-Action Understanding via Temporal Action Parsing [118.32912239230272]
We construct a new dataset of sports videos with manual annotations of sub-actions, and conduct a study of temporal action parsing on top of it.
Our study shows that a sports activity usually consists of multiple sub-actions and that awareness of such temporal structure is beneficial to action recognition.
We also investigate a number of temporal parsing methods, and thereon devise an improved method that can mine sub-actions from training data without knowing their labels.
arXiv Detail & Related papers (2020-05-20T17:45:18Z)
- Weakly Supervised Temporal Action Localization Using Deep Metric Learning [12.49814373580862]
We propose a weakly supervised temporal action localization method that only requires video-level action instances as supervision during training.
We jointly optimize a balanced binary cross-entropy loss and a metric loss using standard backpropagation (a sketch follows this list).
Our approach improves the current state-of-the-art result for THUMOS14 by 6.5% mAP at IoU threshold 0.5, and achieves competitive performance for ActivityNet1.2.
arXiv Detail & Related papers (2020-01-21T22:01:17Z)
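As referenced in the ASM-Loc entry above, here is a hedged sketch of one of its three components, intra-segment attention: self-attention applied only within each sampled action-proposal segment. The segment interface and head count are illustrative assumptions, not the released implementation.

```python
# A hedged sketch (not ASM-Loc's code) of intra-segment attention:
# refine snippet features by attending only within each proposal.
import torch
import torch.nn as nn

class IntraSegmentAttention(nn.Module):
    def __init__(self, feat_dim=2048, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(feat_dim, num_heads, batch_first=True)

    def forward(self, feats, segments):
        # feats: (T, feat_dim) snippet features for one video
        # segments: list of (start, end) snippet indices of action proposals
        out = feats.clone()
        for s, e in segments:
            seg = feats[s:e].unsqueeze(0)          # (1, L, feat_dim)
            refined, _ = self.attn(seg, seg, seg)  # attend within the segment
            out[s:e] = refined.squeeze(0)
        return out
```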
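For the ACM-Net entry, a minimal sketch of a three-branch attention that softmax-normalizes per-point scores into action, context, and background weights; the single 1x1 convolution head is an illustrative choice, not the paper's exact architecture.

```python
# A minimal sketch (assumptions, not ACM-Net's released code) of
# three-branch attention over temporal points.
import torch
import torch.nn as nn

class ThreeBranchAttention(nn.Module):
    def __init__(self, feat_dim=2048):
        super().__init__()
        # One logit per branch at every temporal point.
        self.attn = nn.Conv1d(feat_dim, 3, kernel_size=1)

    def forward(self, feats):                # feats: (B, feat_dim, T)
        logits = self.attn(feats)            # (B, 3, T)
        weights = logits.softmax(dim=1)      # normalize across the 3 branches
        w_action, w_context, w_background = weights.unbind(dim=1)
        return w_action, w_context, w_background  # each (B, T)
```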
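For the deep-metric-learning entry, a hedged sketch of jointly optimizing a balanced binary cross-entropy loss with a triplet-style metric loss; the balancing scheme, the triplet form, and the weighting factor lam are assumptions about the general recipe, not the paper's exact losses.

```python
# A hedged sketch of a joint balanced-BCE + metric loss (assumed forms).
import torch
import torch.nn.functional as F

def balanced_bce(logits, labels):
    """BCE with positive/negative re-weighting; labels are float {0, 1}."""
    pos = labels.sum().clamp(min=1.0)
    neg = (1 - labels).sum().clamp(min=1.0)
    weights = torch.where(labels > 0, labels.numel() / (2 * pos),
                          labels.numel() / (2 * neg))
    return F.binary_cross_entropy_with_logits(logits, labels, weight=weights)

def joint_loss(logits, labels, anchor, positive, negative,
               margin=0.5, lam=1.0):
    cls_loss = balanced_bce(logits, labels)
    metric_loss = F.triplet_margin_loss(anchor, positive, negative,
                                        margin=margin)
    return cls_loss + lam * metric_loss  # optimized with standard backprop
```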
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.