Exploring the Temporal Consistency for Point-Level Weakly-Supervised Temporal Action Localization
- URL: http://arxiv.org/abs/2602.05718v1
- Date: Thu, 05 Feb 2026 14:46:21 GMT
- Title: Exploring the Temporal Consistency for Point-Level Weakly-Supervised Temporal Action Localization
- Authors: Yunchuan Ma, Laiyun Qing, Guorong Li, Yuqing Liu, Yuankai Qi, Qingming Huang
- Abstract summary: Point-supervised Temporal Action Localization (PTAL) adopts a lightly frame-annotated paradigm (i.e., labeling only a single frame per action instance) to train a model to locate action instances within untrimmed videos. Most existing approaches design the task head of models with only point-supervised snippet-level classification, without explicitly modeling the temporal relationships among the frames of an action. We propose a multi-task learning framework that fully utilizes point supervision to boost the model's temporal understanding capability for action localization.
- Score: 66.80402022104074
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Point-supervised Temporal Action Localization (PTAL) adopts a lightly frame-annotated paradigm (i.e., labeling only a single frame per action instance) to train a model to effectively locate action instances within untrimmed videos. Most existing approaches design the task head of models with only point-supervised snippet-level classification, without explicitly modeling the temporal relationships among the frames of an action. However, understanding the temporal relationships of frames is crucial because it can help a model understand how an action is defined and therefore benefits localizing the full extent of an action. To this end, in this paper, we design a multi-task learning framework that fully utilizes point supervision to boost the model's temporal understanding capability for action localization. Specifically, we design three self-supervised temporal understanding tasks: (i) Action Completion, (ii) Action Order Understanding, and (iii) Action Regularity Understanding. These tasks help a model understand the temporal consistency of actions across videos. To the best of our knowledge, this is the first attempt to explicitly explore temporal consistency for point-supervised action localization. Extensive experimental results on four benchmark datasets demonstrate the effectiveness of the proposed method compared to several state-of-the-art approaches.
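The abstract describes a multi-task objective that augments the point-supervised classification loss with three self-supervised temporal-understanding losses. A minimal sketch of such a weighted combination is below; the component losses, function name, and weights are illustrative assumptions, not the authors' exact formulation.

```python
def multitask_ptal_loss(cls_loss, completion_loss, order_loss, regularity_loss,
                        w_complete=1.0, w_order=1.0, w_regular=1.0):
    """Hypothetical combined objective for point-supervised TAL.

    cls_loss        -- point-supervised snippet-level classification loss
    completion_loss -- Action Completion auxiliary loss
    order_loss      -- Action Order Understanding auxiliary loss
    regularity_loss -- Action Regularity Understanding auxiliary loss

    In practice each argument would be a framework tensor (e.g. a PyTorch
    scalar); plain floats are used here to keep the sketch self-contained.
    """
    return (cls_loss
            + w_complete * completion_loss
            + w_order * order_loss
            + w_regular * regularity_loss)
```

The auxiliary weights would typically be tuned on a validation split, since overly strong self-supervised terms can dominate the sparse point-level classification signal.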
Related papers
- POTLoc: Pseudo-Label Oriented Transformer for Point-Supervised Temporal Action Localization [26.506893363676678]
This paper proposes POTLoc, a Pseudo-label Oriented Transformer for weakly-supervised action localization.
POTLoc is designed to identify and track continuous action structures via a self-training strategy.
It outperforms the state-of-the-art point-supervised methods on THUMOS'14 and ActivityNet-v1.2 datasets.
arXiv Detail & Related papers (2023-10-20T15:28:06Z) - Proposal-based Temporal Action Localization with Point-level Supervision [29.98225940694062]
Point-level supervised temporal action localization (PTAL) aims at recognizing and localizing actions in untrimmed videos.
We propose a novel method that localizes actions by generating and evaluating action proposals of flexible duration.
Experiments show that our proposed method achieves competitive or superior performance to the state-of-the-art methods.
arXiv Detail & Related papers (2023-10-09T08:27:05Z) - Structured Attention Composition for Temporal Action Localization [99.66510088698051]
We study temporal action localization from the perspective of multi-modality feature learning.
Unlike conventional attention, the proposed module would not infer frame attention and modality attention independently.
The proposed structured attention composition module can be deployed as a plug-and-play module into existing action localization frameworks.
arXiv Detail & Related papers (2022-05-20T04:32:09Z) - ASM-Loc: Action-aware Segment Modeling for Weakly-Supervised Temporal Action Localization [36.90693762365237]
Weakly-supervised temporal action localization aims to recognize and localize action segments in untrimmed videos given only video-level action labels for training.
We propose ASM-Loc, a novel WTAL framework that enables explicit, action-aware segment modeling beyond standard MIL-based methods.
Our framework entails three segment-centric components: (i) dynamic segment sampling for compensating the contribution of short actions; (ii) intra- and inter-segment attention for modeling action dynamics and capturing temporal dependencies; (iii) pseudo instance-level supervision for improving action boundary prediction.
arXiv Detail & Related papers (2022-03-29T01:59:26Z) - Background-Click Supervision for Temporal Action Localization [82.4203995101082]
Weakly supervised temporal action localization aims at learning the instance-level action pattern from the video-level labels, where a significant challenge is action-context confusion.
One recent work builds an action-click supervision framework.
It requires similar annotation costs but can steadily improve the localization performance when compared to the conventional weakly supervised methods.
In this paper, by revealing that the performance bottleneck of the existing approaches mainly comes from the background errors, we find that a stronger action localizer can be trained with labels on the background video frames rather than those on the action frames.
arXiv Detail & Related papers (2021-11-24T12:02:52Z) - Weakly Supervised Temporal Action Localization Through Learning Explicit Subspaces for Action and Context [151.23835595907596]
WS-TAL methods learn to localize the temporal starts and ends of action instances in a video under only video-level supervision.
We introduce a framework that learns two feature subspaces respectively for actions and their context.
The proposed approach outperforms state-of-the-art WS-TAL methods on three benchmarks.
arXiv Detail & Related papers (2021-03-30T08:26:53Z) - Modeling Multi-Label Action Dependencies for Temporal Action Localization [53.53490517832068]
Real-world videos contain many complex actions with inherent relationships between action classes.
We propose an attention-based architecture that models these action relationships for the task of temporal action localization in untrimmed videos.
We show improved performance over state-of-the-art methods on multi-label action localization benchmarks.
arXiv Detail & Related papers (2021-03-04T13:37:28Z) - Point-Level Temporal Action Localization: Bridging Fully-supervised Proposals to Weakly-supervised Losses [84.2964408497058]
Point-level temporal action localization (PTAL) aims to localize actions in untrimmed videos with only one timestamp annotation for each action instance.
Existing methods adopt the frame-level prediction paradigm to learn from the sparse single-frame labels.
This paper attempts to explore the proposal-based prediction paradigm for point-level annotations.
arXiv Detail & Related papers (2020-12-15T12:11:48Z)