Bridge the Gap: From Weak to Full Supervision for Temporal Action Localization with PseudoFormer
- URL: http://arxiv.org/abs/2504.14860v1
- Date: Mon, 21 Apr 2025 05:00:07 GMT
- Title: Bridge the Gap: From Weak to Full Supervision for Temporal Action Localization with PseudoFormer
- Authors: Ziyi Liu, Yangcen Liu
- Abstract summary: We propose PseudoFormer, a novel framework that bridges the gap between weakly and fully-supervised TAL. RickerFusion maps all predicted action proposals to a global shared space to generate pseudo labels with better quality. We leverage both snippet-level and proposal-level labels with different priors from the weak branch to train the regression-based model in the full branch. PseudoFormer achieves state-of-the-art WTAL results on the two commonly used benchmarks, THUMOS14 and ActivityNet1.3.
- Score: 13.153366072673915
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Weakly-supervised Temporal Action Localization (WTAL) has achieved notable success but still suffers from a lack of temporal annotations, leading to a performance and framework gap compared with fully-supervised methods. While recent approaches employ pseudo labels for training, three key challenges remain unresolved: generating high-quality pseudo labels, making full use of different priors, and optimizing training methods with noisy labels. To address these challenges, we propose PseudoFormer, a novel two-branch framework that bridges the gap between weakly and fully-supervised Temporal Action Localization (TAL). We first introduce RickerFusion, which maps all predicted action proposals to a global shared space to generate pseudo labels with better quality. Subsequently, we leverage both snippet-level and proposal-level labels with different priors from the weak branch to train the regression-based model in the full branch. Finally, an uncertainty mask and an iterative refinement mechanism are applied for training with noisy pseudo labels. PseudoFormer achieves state-of-the-art WTAL results on the two commonly used benchmarks, THUMOS14 and ActivityNet1.3. In addition, extensive ablation studies demonstrate the contribution of each component of our method.
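The abstract does not spell out how RickerFusion works, so the following is only an illustrative sketch of the general idea of mapping proposals onto one shared timeline. It assumes that each proposal is a hypothetical `(start, end, score)` triple in snippet units, that each proposal contributes a score-weighted Ricker (Mexican hat) wavelet centered on its midpoint, and that pseudo-label segments are read off by a fixed threshold. The function names, the width choice, and the threshold are assumptions made for illustration, not the paper's formulation.

```python
import math


def ricker(t, sigma):
    """Unnormalized Ricker (Mexican hat) wavelet; zero crossings at t = +/- sigma."""
    x = (t / sigma) ** 2
    return (1.0 - x) * math.exp(-x / 2.0)


def fuse_proposals(proposals, num_snippets, threshold=0.5):
    """Sum score-weighted wavelets on a shared timeline, then threshold.

    proposals: list of (start, end, score) in snippet units (hypothetical format).
    Returns the fused curve and a list of inclusive (start, end) segments.
    """
    curve = [0.0] * num_snippets
    for start, end, score in proposals:
        center = (start + end) / 2.0
        # Assumption: half the proposal length is used as the wavelet width,
        # so the response is positive inside the proposal and negative outside.
        sigma = max((end - start) / 2.0, 1e-6)
        for i in range(num_snippets):
            curve[i] += score * ricker(i - center, sigma)

    # Collect maximal runs of snippets whose fused response exceeds the threshold.
    segments = []
    seg_start = None
    for i, value in enumerate(curve):
        if value > threshold and seg_start is None:
            seg_start = i
        elif value <= threshold and seg_start is not None:
            segments.append((seg_start, i - 1))
            seg_start = None
    if seg_start is not None:
        segments.append((seg_start, num_snippets - 1))
    return curve, segments


if __name__ == "__main__":
    # Two overlapping proposals for the same action instance.
    curve, segments = fuse_proposals([(10, 20, 0.9), (12, 22, 0.8)], num_snippets=40)
    print(segments)
```

In this sketch the negative side lobes of the wavelet let overlapping proposals suppress each other's boundaries, so the fused segment comes out tighter than the union of the inputs. Whether the paper exploits the wavelet in the same way is not stated in the abstract.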
Related papers
- Rethinking Pseudo-Label Guided Learning for Weakly Supervised Temporal Action Localization from the Perspective of Noise Correction [33.89781814072881]
We argue that the noise in pseudo-labels would interfere with the learning of the fully-supervised detection head. We introduce a two-stage noisy label learning strategy to harness every potentially useful signal in noisy labels. Our model outperforms the previous state-of-the-art method in detection accuracy and inference speed.
arXiv Detail & Related papers (2025-01-19T17:31:40Z)
- Full-Stage Pseudo Label Quality Enhancement for Weakly-supervised Temporal Action Localization [11.010846827099936]
We propose a set of simple yet efficient pseudo label quality enhancement mechanisms to build our FuSTAL framework.
FuSTAL achieves an average mAP of 50.8% on THUMOS'14, outperforming the previous best method by 1.2%, and becomes the first method to reach the milestone of 50%.
arXiv Detail & Related papers (2024-07-12T03:53:55Z)
- Distilling Vision-Language Pre-training to Collaborate with Weakly-Supervised Temporal Action Localization [77.19173283023012]
Weakly-supervised temporal action localization learns to detect and classify action instances with only category labels.
Most methods widely adopt the off-the-shelf Classification-Based Pre-training (CBP) to generate video features for action localization.
arXiv Detail & Related papers (2022-12-19T10:02:50Z)
- Exploiting Completeness and Uncertainty of Pseudo Labels for Weakly Supervised Video Anomaly Detection [149.23913018423022]
Weakly supervised video anomaly detection aims to identify abnormal events in videos using only video-level labels.
Two-stage self-training methods have achieved significant improvements by self-generating pseudo labels.
We propose an enhancement framework by exploiting completeness and uncertainty properties for effective self-training.
arXiv Detail & Related papers (2022-12-08T05:53:53Z)
- Collaborative Propagation on Multiple Instance Graphs for 3D Instance Segmentation with Single-point Supervision [63.429704654271475]
We propose a novel weakly supervised method RWSeg that only requires labeling one object with one point.
With these sparse weak labels, we introduce a unified framework with two branches to propagate semantic and instance information.
Specifically, we propose a Cross-graph Competing Random Walks (CRW) algorithm that encourages competition among different instance graphs.
arXiv Detail & Related papers (2022-08-10T02:14:39Z)
- Learning Action Completeness from Points for Weakly-supervised Temporal Action Localization [15.603643098270409]
We tackle the problem of localizing temporal intervals of actions with only a single frame label for each action instance for training.
In this paper, we propose a novel framework, where dense pseudo-labels are generated to provide completeness guidance for the model.
arXiv Detail & Related papers (2021-08-11T04:54:39Z)
- Refining Pseudo Labels with Clustering Consensus over Generations for Unsupervised Object Re-identification [84.72303377833732]
Unsupervised object re-identification aims to learn discriminative representations for object retrieval without any annotations.
We propose to estimate pseudo label similarities between consecutive training generations with clustering consensus and refine pseudo labels with temporally propagated and ensembled pseudo labels.
The proposed pseudo label refinery strategy is simple yet effective and can be seamlessly integrated into existing clustering-based unsupervised re-identification methods.
arXiv Detail & Related papers (2021-06-11T02:42:42Z)
- Learning Salient Boundary Feature for Anchor-free Temporal Action Localization [81.55295042558409]
Temporal action localization is an important yet challenging task in video understanding.
We propose the first purely anchor-free temporal localization method.
Our model includes (i) an end-to-end trainable basic predictor, (ii) a saliency-based refinement module, and (iii) several consistency constraints.
arXiv Detail & Related papers (2021-03-24T12:28:32Z)
- Point-Level Temporal Action Localization: Bridging Fully-supervised Proposals to Weakly-supervised Losses [84.2964408497058]
Point-level temporal action localization (PTAL) aims to localize actions in untrimmed videos with only one timestamp annotation for each action instance.
Existing methods adopt the frame-level prediction paradigm to learn from the sparse single-frame labels.
This paper attempts to explore the proposal-based prediction paradigm for point-level annotations.
arXiv Detail & Related papers (2020-12-15T12:11:48Z)
- Two-phase Pseudo Label Densification for Self-training based Domain Adaptation [93.03265290594278]
We propose a novel Two-phase Pseudo Label Densification framework, referred to as TPLD.
In the first phase, we use sliding window voting to propagate the confident predictions, utilizing intrinsic spatial-correlations in the images.
In the second phase, we perform a confidence-based easy-hard classification.
To ease the training process and avoid noisy predictions, we introduce the bootstrapping mechanism to the original self-training loss.
arXiv Detail & Related papers (2020-12-09T02:35:25Z)