Related papers: Improving Weakly Supervised Temporal Action Localization by Exploiting Multi-resolution Information in Temporal Domain

URL: http://arxiv.org/abs/2506.18261v1
Date: Mon, 23 Jun 2025 03:20:18 GMT
Title: Improving Weakly Supervised Temporal Action Localization by Exploiting Multi-resolution Information in Temporal Domain
Authors: Rui Su, Dong Xu, Luping Zhou, Wanli Ouyang
Abstract summary: We propose a two-stage approach to fully exploit multi-resolution information in the temporal domain. In the first stage, we generate reliable initial frame-level pseudo labels based on both appearance and motion streams. In the second stage, we iteratively refine the pseudo labels and use a set of selected frames with highly confident pseudo labels to train neural networks.
Score: 84.73693644211596
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Weakly supervised temporal action localization is a challenging task, as only video-level annotation is available during training. To address this problem, we propose a two-stage approach that fully exploits multi-resolution information in the temporal domain and generates high-quality frame-level pseudo labels based on both appearance and motion streams. Specifically, in the first stage, we generate reliable initial frame-level pseudo labels, and in the second stage, we iteratively refine the pseudo labels and use a set of selected frames with highly confident pseudo labels to train neural networks and better predict action class scores at each frame. We fully exploit temporal information at multiple scales to improve temporal action localization performance. To obtain reliable initial frame-level pseudo labels in the first stage, we propose an Initial Label Generation (ILG) module, which leverages temporal multi-resolution consistency to generate high-quality class activation sequences (CASs). The CASs consist of a number of sequences, each measuring how likely each video frame is to belong to one specific action class. In the second stage, we propose a Progressive Temporal Label Refinement (PTLR) framework. In our PTLR framework, two networks called Network-OTS and Network-RTS, which generate CASs for the original temporal scale and the reduced temporal scales respectively, are used as two streams (i.e., the OTS stream and the RTS stream) to refine the pseudo labels in turn. In this way, the multi-resolution information in the temporal domain is exchanged at the pseudo-label level, and our work can help improve each stream (i.e., the OTS/RTS stream) by exploiting the refined pseudo labels from the other stream (i.e., the RTS/OTS stream).
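
To make the ideas of reduced temporal scales and confident-frame selection concrete, here is a minimal sketch in plain Python. It is not the paper's ILG module or PTLR framework: it assumes a single-class score sequence with values in [0, 1], uses average pooling as a stand-in for a reduced temporal scale, and picks confident frames with two fixed thresholds. All function names, pooling factors, and threshold values are hypothetical and chosen only for illustration.

```python
from typing import List, Optional, Sequence


def downsample(scores: Sequence[float], factor: int) -> List[float]:
    """Average-pool per-frame scores by `factor` (an assumed reduced temporal scale)."""
    pooled = []
    for start in range(0, len(scores), factor):
        window = scores[start:start + factor]
        pooled.append(sum(window) / len(window))
    return pooled


def upsample(scores: Sequence[float], factor: int, length: int) -> List[float]:
    """Repeat each pooled score so the sequence is back at the original length."""
    return [scores[t // factor] for t in range(length)]


def fuse_scales(scores: Sequence[float], factors: Sequence[int] = (2, 4)) -> List[float]:
    """Average the original sequence with its pooled-and-restored versions."""
    length = len(scores)
    views = [list(scores)]
    for factor in factors:
        views.append(upsample(downsample(scores, factor), factor, length))
    return [sum(view[t] for view in views) / len(views) for t in range(length)]


def select_confident(
    scores: Sequence[float], low: float = 0.2, high: float = 0.8
) -> List[Optional[int]]:
    """Pseudo label per frame: 1 = action, 0 = background, None = not confident.

    The thresholds are illustrative assumptions, not values from the paper.
    """
    return [1 if s >= high else 0 if s <= low else None for s in scores]


if __name__ == "__main__":
    cas = [0.1, 0.0, 0.9, 1.0, 0.5, 0.4]  # made-up scores for one action class
    fused = fuse_scales(cas, factors=(2,))
    print(select_confident(fused))
```

Frames labeled `None` would simply be left out of the training set in a scheme like this, so only frames whose fused score is clearly high or clearly low contribute a pseudo label.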

Related papers

Frame Order Matters: A Temporal Sequence-Aware Model for Few-Shot Action Recognition [14.97527336050901]
We propose a novel Temporal Sequence-Aware Model (TSAM) for few-shot action recognition (FSAR). It incorporates a sequential perceiver adapter into the pre-training framework to integrate both the spatial information and the sequential temporal dynamics into the feature embeddings. Experimental results on five FSAR datasets demonstrate that our method sets a new benchmark, beating the second-best competitors by large margins.
arXiv Detail & Related papers (2024-08-22T15:13:27Z)
FormerTime: Hierarchical Multi-Scale Representations for Multivariate Time Series Classification [53.55504611255664]
FormerTime is a hierarchical representation model for improving classification capacity in the multivariate time series classification task. It has three merits: (1) learning hierarchical multi-scale representations from time series data, (2) inheriting the strengths of both transformers and convolutional networks, and (3) tackling the efficiency challenges incurred by the self-attention mechanism.
arXiv Detail & Related papers (2023-02-20T07:46:14Z)
Timestamp-Supervised Action Segmentation from the Perspective of Clustering [12.661218632080207]
Most existing methods generate pseudo-labels for all frames in each video to train the segmentation model. We propose a novel framework from the perspective of clustering, which includes the following two parts. Iterative clustering propagates the pseudo-labels to the ambiguous intervals by clustering, and thus updates the pseudo-label sequences used to train the model.
arXiv Detail & Related papers (2022-12-22T13:35:00Z)
HTNet: Anchor-free Temporal Action Localization with Hierarchical Transformers [19.48000379201692]
Temporal action localization (TAL) is the task of identifying a set of actions in a video. We present a novel anchor-free framework, known as HTNet, which predicts a set of <start time, end time, class> triplets from a video. We demonstrate that our method accurately localizes action instances and achieves state-of-the-art performance on two TAL benchmark datasets.
arXiv Detail & Related papers (2022-07-20T05:40:03Z)
Fine-grained Temporal Contrastive Learning for Weakly-supervised Temporal Action Localization [87.47977407022492]
This paper argues that learning by contextually comparing sequence-to-sequence distinctions offers an essential inductive bias in weakly-supervised action localization. Under a differentiable dynamic programming formulation, two complementary contrastive objectives are designed, including Fine-grained Sequence Distance (FSD) contrasting and Longest Common Subsequence (LCS) contrasting. Our method achieves state-of-the-art performance on two popular benchmarks.
arXiv Detail & Related papers (2022-03-31T05:13:50Z)
Weakly Supervised Temporal Action Localization via Representative Snippet Knowledge Propagation [36.86505596138256]
Weakly supervised temporal action localization aims to localize temporal boundaries of actions and simultaneously identify their categories with only video-level category labels. Many existing methods seek to generate pseudo labels for bridging the discrepancy between classification and localization, but usually only make use of limited contextual information for pseudo label generation. Our method seeks to mine the representative snippets in each video for propagating information between video snippets to generate better pseudo labels.
arXiv Detail & Related papers (2022-03-06T09:53:55Z)
Transferable Knowledge-Based Multi-Granularity Aggregation Network for Temporal Action Localization: Submission to ActivityNet Challenge 2021 [33.840281113206444]
This report presents an overview of our solution used in the submission to the 2021 HACS Temporal Action Localization Challenge. We use the Temporal Context Aggregation Network (TCANet) to generate high-quality action proposals. We also adopt an additional module to transfer knowledge from trimmed videos to untrimmed videos. Our proposed scheme achieves 39.91 and 29.78 average mAP on the challenge testing set of the supervised and weakly-supervised temporal action localization tracks, respectively.
arXiv Detail & Related papers (2021-07-27T06:18:21Z)
TimeLens: Event-based Video Frame Interpolation [54.28139783383213]
We introduce Time Lens, a novel method that leverages the advantages of both synthesis-based and flow-based approaches. We show an improvement of up to 5.21 dB in PSNR over state-of-the-art frame-based and event-based methods.
arXiv Detail & Related papers (2021-06-14T10:33:47Z)
Learnable Dynamic Temporal Pooling for Time Series Classification [22.931314501371805]
We present a dynamic temporal pooling (DTP) technique that reduces the temporal size of hidden representations by aggregating the features at the segment-level. For the partition of a whole series into multiple segments, we utilize dynamic time warping (DTW) to align each time point in a temporal order with the prototypical features of the segments. The DTP layer combined with a fully-connected layer helps to extract further discriminative features considering their temporal position within an input time series.
arXiv Detail & Related papers (2021-04-02T08:58:44Z)
Dual-Refinement: Joint Label and Feature Refinement for Unsupervised Domain Adaptive Person Re-Identification [51.98150752331922]
Unsupervised domain adaptive (UDA) person re-identification (re-ID) is a challenging task due to the lack of labels for the target domain data. We propose a novel approach, called Dual-Refinement, that jointly refines pseudo labels at the off-line clustering phase and features at the on-line training phase. Our method outperforms the state-of-the-art methods by a large margin.
arXiv Detail & Related papers (2020-12-26T07:35:35Z)
SF-Net: Single-Frame Supervision for Temporal Action Localization [60.202516362976645]
Single-frame supervision introduces extra temporal action signals while maintaining low annotation overhead. We propose a unified system called SF-Net to make use of such single-frame supervision. SF-Net significantly improves upon state-of-the-art weakly-supervised methods in terms of both segment localization and single-frame localization.
arXiv Detail & Related papers (2020-03-15T15:06:01Z)

This list is automatically generated from the titles and abstracts of the papers on this site.