VIFSS: View-Invariant and Figure Skating-Specific Pose Representation Learning for Temporal Action Segmentation
- URL: http://arxiv.org/abs/2508.10281v1
- Date: Thu, 14 Aug 2025 02:15:21 GMT
- Title: VIFSS: View-Invariant and Figure Skating-Specific Pose Representation Learning for Temporal Action Segmentation
- Authors: Ryota Tanaka, Tomohiro Suzuki, Keisuke Fujii
- Abstract summary: We propose a new TAS framework for figure skating jumps that explicitly incorporates both the three-dimensional nature and the semantic procedure of jump movements. Our method achieves over 92% F1@50 on element-level TAS, which requires recognizing both jump types and rotation levels.
- Score: 5.453385501324681
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Understanding human actions from videos plays a critical role across various domains, including sports analytics. In figure skating, accurately recognizing the type and timing of jumps a skater performs is essential for objective performance evaluation. However, this task typically requires expert-level knowledge due to the fine-grained and complex nature of jump procedures. While recent approaches have attempted to automate this task using Temporal Action Segmentation (TAS), there are two major limitations to TAS for figure skating: the annotated data is insufficient, and existing methods do not account for the inherent three-dimensional aspects and procedural structure of jump actions. In this work, we propose a new TAS framework for figure skating jumps that explicitly incorporates both the three-dimensional nature and the semantic procedure of jump movements. First, we propose a novel View-Invariant, Figure Skating-Specific pose representation learning approach (VIFSS) that combines contrastive learning as pre-training and action classification as fine-tuning. For view-invariant contrastive pre-training, we construct FS-Jump3D, the first publicly available 3D pose dataset specialized for figure skating jumps. Second, we introduce a fine-grained annotation scheme that marks the "entry (preparation)" and "landing" phases, enabling TAS models to learn the procedural structure of jumps. Extensive experiments demonstrate the effectiveness of our framework. Our method achieves over 92% F1@50 on element-level TAS, which requires recognizing both jump types and rotation levels. Furthermore, we show that view-invariant contrastive pre-training is particularly effective when fine-tuning data is limited, highlighting the practicality of our approach in real-world scenarios.
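The F1@50 figure quoted in the abstract is a segment-level score: a predicted segment counts as correct only if it overlaps a ground-truth segment of the same class with an intersection-over-union of at least 50%. The sketch below follows the segmental F1 definition commonly used in the TAS literature; the abstract does not spell out the authors' exact evaluation protocol, so details such as whether background frames are scored are assumptions. The function names (`to_segments`, `f1_at_k`) and the labels `bg`, `entry`, `3A`, and `landing` are hypothetical and chosen only for illustration.

```python
from itertools import groupby


def to_segments(labels):
    """Collapse per-frame labels into (label, start, end) runs; end is exclusive."""
    segments, start = [], 0
    for label, run in groupby(labels):
        length = len(list(run))
        segments.append((label, start, start + length))
        start += length
    return segments


def f1_at_k(pred, truth, threshold=0.5):
    """Segmental F1 at an IoU threshold (threshold=0.5 corresponds to F1@50).

    Assumption: every label, including background, is scored as a segment.
    """
    pred_segs, true_segs = to_segments(pred), to_segments(truth)
    matched = [False] * len(true_segs)
    tp = fp = 0
    for label, p_start, p_end in pred_segs:
        best_iou, best_idx = 0.0, -1
        for i, (t_label, t_start, t_end) in enumerate(true_segs):
            if t_label != label:
                continue
            inter = max(0, min(p_end, t_end) - max(p_start, t_start))
            union = max(p_end, t_end) - min(p_start, t_start)
            iou = inter / union
            if iou > best_iou:
                best_iou, best_idx = iou, i
        # A ground-truth segment can be matched by at most one prediction.
        if best_idx >= 0 and best_iou >= threshold and not matched[best_idx]:
            tp += 1
            matched[best_idx] = True
        else:
            fp += 1
    fn = matched.count(False)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical per-frame labels for one short clip.
truth = ["bg", "bg", "entry", "entry", "3A", "3A", "3A", "landing", "bg"]
pred = ["bg", "bg", "bg", "entry", "3A", "3A", "3A", "landing", "bg"]
print(f1_at_k(pred, truth, threshold=0.5))
```

Because the score is computed over segments, not frames, a prediction whose boundaries are a few frames off is still counted as correct, while over-segmentation (one jump split into several fragments) is penalized as extra false positives.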
Related papers
- Universal Pose Pretraining for Generalizable Vision-Language-Action Policies [83.39008378156647]
Existing Vision-Language-Action (VLA) models often suffer from feature collapse and low training efficiency. We propose Pose-VLA, a decoupled paradigm that separates VLA training into a pre-training phase for extracting universal 3D spatial priors. Our framework follows a two-stage pre-training pipeline, establishing fundamental spatial grounding via poses followed by motion alignment.
arXiv Detail & Related papers (2026-02-23T11:00:08Z)
- Exploring the Temporal Consistency for Point-Level Weakly-Supervised Temporal Action Localization [66.80402022104074]
Point-supervised Temporal Action Localization (PTAL) adopts a lightly frame-annotated paradigm (i.e., labeling only a single frame per action instance) to train a model to locate action instances within unsupervised videos. Most existing approaches design the task head of models with only a point-trimmed snippet-level classification, without explicitly modeling the temporal relationships among the frames of an action. We propose a multi-task learning framework that fully utilizes point supervision to boost the model's temporal understanding capability for action localization.
arXiv Detail & Related papers (2026-02-05T14:46:21Z)
- Skeleton-Snippet Contrastive Learning with Multiscale Feature Fusion for Action Localization [8.574131591092138]
We develop a snippet discrimination pretext task for self-supervised pretraining. We also build on strong backbones of skeleton-based action recognition models by fusing intermediate features with a U-shaped module. Our approach consistently improves existing skeleton-based contrastive learning methods for action localization on BABEL.
arXiv Detail & Related papers (2025-12-18T13:15:52Z)
- YourSkatingCoach: A Figure Skating Video Benchmark for Fine-Grained Element Analysis [10.444961818248624]
The dataset contains 454 videos of jump elements, the detected skater skeletons in each video, and gold labels for the start and end frames of each jump, which together form a video benchmark for figure skating.
We propose air time detection, a novel motion analysis task, the goal of which is to accurately detect the duration of the air time of a jump.
To verify the generalizability of the fine-grained labels, we apply the same process to other sports as cross-sports tasks, but for the coarse-grained task of action classification.
arXiv Detail & Related papers (2024-10-27T12:52:28Z)
- 3D Pose-Based Temporal Action Segmentation for Figure Skating: A Fine-Grained and Jump Procedure-Aware Annotation Approach [5.453385501324681]
In figure skating, technical judgments are made by watching skaters' 3D movements, and part of the judging procedure can be regarded as a Temporal Action Segmentation (TAS) task.
There is a lack of datasets and effective methods for TAS tasks requiring 3D pose data.
In this study, we first created the FS-Jump3D dataset of complex and dynamic figure skating jumps using optical markerless motion capture.
We also propose a new fine-grained figure skating jump TAS dataset annotation method with which TAS models can learn jump procedures.
arXiv Detail & Related papers (2024-08-29T15:42:06Z)
- Test-Time Zero-Shot Temporal Action Localization [58.84919541314969]
ZS-TAL seeks to identify and locate actions in untrimmed videos unseen during training.
Training-based ZS-TAL approaches assume the availability of labeled data for supervised learning.
We introduce a novel method that performs Test-Time adaptation for Temporal Action localization (T3AL).
arXiv Detail & Related papers (2024-04-08T11:54:49Z)
- D$^2$ST-Adapter: Disentangled-and-Deformable Spatio-Temporal Adapter for Few-shot Action Recognition [64.153799533257]
D$^2$ST-Adapter is structured with an internal dual-pathway architecture that enables built-in disentangled encoding of spatial and temporal features. Our method is particularly well-suited to challenging scenarios where temporal dynamics are critical for action recognition.
arXiv Detail & Related papers (2023-12-03T15:40:10Z)
- Multi-body SE(3) Equivariance for Unsupervised Rigid Segmentation and Motion Estimation [49.56131393810713]
We present an SE(3) equivariant architecture and a training strategy to tackle this task in an unsupervised manner.
Our method excels in both model performance and computational efficiency, with only 0.25M parameters and 0.92G FLOPs.
arXiv Detail & Related papers (2023-06-08T22:55:32Z)
- An Effective Motion-Centric Paradigm for 3D Single Object Tracking in Point Clouds [50.19288542498838]
3D single object tracking in LiDAR point clouds (LiDAR SOT) plays a crucial role in autonomous driving.
Current approaches all follow the Siamese paradigm based on appearance matching.
We introduce a motion-centric paradigm to handle LiDAR SOT from a new perspective.
arXiv Detail & Related papers (2023-03-21T17:28:44Z)
- Few-Shot Classification with Contrastive Learning [10.236150550121163]
We propose a novel contrastive learning-based framework that seamlessly integrates contrastive learning into both stages.
In the meta-training stage, we propose a cross-view episodic training mechanism to perform the nearest centroid classification on two different views of the same episode.
These two strategies force the model to overcome the bias between views and promote the transferability of representations.
arXiv Detail & Related papers (2022-09-17T02:39:09Z)
- FineGym: A Hierarchical Video Dataset for Fine-grained Action Understanding [118.32912239230272]
FineGym is a new action recognition dataset built on top of gymnastic videos.
It provides temporal annotations at both action and sub-action levels with a three-level semantic hierarchy.
This new level of granularity presents significant challenges for action recognition.
arXiv Detail & Related papers (2020-04-14T17:55:21Z)
This list is automatically generated from the titles and abstracts of the papers on this site.