Related papers: UTAL-GNN: Unsupervised Temporal Action Localization using Graph Neural Networks

UTAL-GNN: Unsupervised Temporal Action Localization using Graph Neural Networks

URL: http://arxiv.org/abs/2508.19647v1
Date: Wed, 27 Aug 2025 07:51:02 GMT
Title: UTAL-GNN: Unsupervised Temporal Action Localization using Graph Neural Networks
Authors: Bikash Kumar Badatya, Vipul Baghel, Ravi Hegde,
Abstract summary: Fine-grained action localization in untrimmed sports videos presents a significant challenge due to rapid and subtle motion transitions.<n>Existing supervised and weakly supervised solutions often rely on extensive datasets and high-capacity models, making them computationally intensive and less adaptable to real-world scenarios.<n>Our approach pre-trains an Attention-based Spatio-Temporal Graph Convolutional Network (ASTGCN) on a pose-sequence denoising annotated with blockwise partitions.<n>Our method achieves a mean Average Precision (mAP) of 82.66% and average latency localization of 29.09 ms on the DSV Diving dataset
Score: 0.0
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Fine-grained action localization in untrimmed sports videos presents a significant challenge due to rapid and subtle motion transitions over short durations. Existing supervised and weakly supervised solutions often rely on extensive annotated datasets and high-capacity models, making them computationally intensive and less adaptable to real-world scenarios. In this work, we introduce a lightweight and unsupervised skeleton-based action localization pipeline that leverages spatio-temporal graph neural representations. Our approach pre-trains an Attention-based Spatio-Temporal Graph Convolutional Network (ASTGCN) on a pose-sequence denoising task with blockwise partitions, enabling it to learn intrinsic motion dynamics without any manual labeling. At inference, we define a novel Action Dynamics Metric (ADM), computed directly from low-dimensional ASTGCN embeddings, which detects motion boundaries by identifying inflection points in its curvature profile. Our method achieves a mean Average Precision (mAP) of 82.66% and average localization latency of 29.09 ms on the DSV Diving dataset, matching state-of-the-art supervised performance while maintaining computational efficiency. Furthermore, it generalizes robustly to unseen, in-the-wild diving footage without retraining, demonstrating its practical applicability for lightweight, real-time action analysis systems in embedded or dynamic environments.

Related papers

Event-based Visual Deformation Measurement [76.25283405575108]
Visual Deformation Measurement aims to recover dense deformation fields by tracking surface motion from camera observations.<n>Traditional image-based methods rely on minimal inter-frame motion to constrain the correspondence search space.<n>We propose an event-frame fusion framework that exploits events for temporally dense motion cues and frames for spatially dense precise estimation.
arXiv Detail & Related papers (2026-02-16T01:04:48Z)
Zero-shot 3D-Aware Trajectory-Guided image-to-video generation via Test-Time Training [27.251232052868033]
Trajectory-Guided image-to-video (I2V) generation aims to synthesize videos that adhere to user-specified motion instructions.<n>Zo3T significantly enhances 3D realism and motion accuracy in trajectory-controlled I2V generation.
arXiv Detail & Related papers (2025-09-08T14:21:45Z)
Temporal Point-Supervised Signal Reconstruction: A Human-Annotation-Free Framework for Weak Moving Target Detection [1.187456026346823]
We propose a novel Temporal Point-Supervised (TPS) framework that enables high-performance detection of weak targets without any manual annotations.<n>A Temporal Signal Reconstruction Network (TSRNet) is developed under the TPS paradigm to reconstruct these transient signals.<n>Extensive experiments on a purpose-built low-SNR dataset demonstrate that our framework outperforms state-of-the-art methods while requiring no human annotations.
arXiv Detail & Related papers (2025-07-23T09:02:09Z)
FDDet: Frequency-Decoupling for Boundary Refinement in Temporal Action Detection [4.015022008487465]
Large-scale pre-trained video encoders tend to introduce background clutter and irrelevant semantics, leading to context confusion and boundaries.<n>We propose a frequency-aware decoupling network that improves action discriminability by filtering out noisy semantics captured by pre-trained models.<n>Our method achieves state-of-the-art performance on temporal action detection benchmarks.
arXiv Detail & Related papers (2025-04-01T10:57:37Z)
MonST3R: A Simple Approach for Estimating Geometry in the Presence of Motion [118.74385965694694]
We present Motion DUSt3R (MonST3R), a novel geometry-first approach that directly estimates per-timestep geometry from dynamic scenes.<n>By simply estimating a pointmap for each timestep, we can effectively adapt DUST3R's representation, previously only used for static scenes, to dynamic scenes.<n>We show that by posing the problem as a fine-tuning task, identifying several suitable datasets, and strategically training the model on this limited data, we can surprisingly enable the model to handle dynamics.
arXiv Detail & Related papers (2024-10-04T18:00:07Z)
DyG-Mamba: Continuous State Space Modeling on Dynamic Graphs [59.434893231950205]
Dynamic graph learning aims to uncover evolutionary laws in real-world systems. We propose DyG-Mamba, a new continuous state space model for dynamic graph learning. We show that DyG-Mamba achieves state-of-the-art performance on most datasets.
arXiv Detail & Related papers (2024-08-13T15:21:46Z)
Weakly Supervised Video Anomaly Detection and Localization with Spatio-Temporal Prompts [57.01985221057047]
This paper introduces a novel method that learnstemporal prompt embeddings for weakly supervised video anomaly detection and localization (WSVADL) based on pre-trained vision-language models (VLMs) Our method achieves state-of-theart performance on three public benchmarks for the WSVADL task.
arXiv Detail & Related papers (2024-08-12T03:31:29Z)
Neuromorphic Vision-based Motion Segmentation with Graph Transformer Neural Network [4.386534439007928]
We propose a novel event-based motion segmentation algorithm using a Graph Transformer Neural Network, dubbed GTNN. Our proposed algorithm processes event streams as 3D graphs by a series nonlinear transformations to unveil local and global correlations between events. We show that GTNN outperforms state-of-the-art methods in the presence of dynamic background variations, motion patterns, and multiple dynamic objects with varying sizes and velocities.
arXiv Detail & Related papers (2024-04-16T22:44:29Z)
Spatial-wise Dynamic Distillation for MLP-like Efficient Visual Fault Detection of Freight Trains [11.13191969085042]
We present a dynamic distillation framework based on multi-layer perceptron (MLP) for fault detection of freight trains. We propose a dynamic teacher that can effectively eliminate the semantic discrepancy with the student model. Our approach outperforms the current state-of-the-art detectors and achieves the highest accuracy with real-time detection at a lower computational cost.
arXiv Detail & Related papers (2023-12-10T09:18:24Z)
Uncovering the Missing Pattern: Unified Framework Towards Trajectory Imputation and Prediction [60.60223171143206]
Trajectory prediction is a crucial undertaking in understanding entity movement or human behavior from observed sequences. Current methods often assume that the observed sequences are complete while ignoring the potential for missing values. This paper presents a unified framework, the Graph-based Conditional Variational Recurrent Neural Network (GC-VRNN), which can perform trajectory imputation and prediction simultaneously.
arXiv Detail & Related papers (2023-03-28T14:27:27Z)
EasyDGL: Encode, Train and Interpret for Continuous-time Dynamic Graph Learning [92.71579608528907]
This paper aims to design an easy-to-use pipeline (termed as EasyDGL) composed of three key modules with both strong ability fitting and interpretability. EasyDGL can effectively quantify the predictive power of frequency content that a model learn from the evolving graph data.
arXiv Detail & Related papers (2023-03-22T06:35:08Z)
ACGNet: Action Complement Graph Network for Weakly-supervised Temporal Action Localization [39.377289930528555]
Weakly-trimmed temporal action localization (WTAL) in unsupervised videos has emerged as a practical but challenging task since only video-level labels are available. Existing approaches typically leverage off-the-shelf segment-level features, which suffer from spatial incompleteness and temporal incoherence. In this paper, we tackle this problem by enhancing segment-level representations with a simple yet effective graph convolutional network.
arXiv Detail & Related papers (2021-12-21T04:18:44Z)
MotionHint: Self-Supervised Monocular Visual Odometry with Motion Constraints [70.76761166614511]
We present a novel self-supervised algorithm named MotionHint for monocular visual odometry (VO) Our MotionHint algorithm can be easily applied to existing open-sourced state-of-the-art SSM-VO systems.
arXiv Detail & Related papers (2021-09-14T15:35:08Z)

This list is automatically generated from the titles and abstracts of the papers in this site.