ASMa: Asymmetric Spatio-temporal Masking for Skeleton Action Representation Learning
- URL: http://arxiv.org/abs/2602.06251v1
- Date: Thu, 05 Feb 2026 22:59:35 GMT
- Title: ASMa: Asymmetric Spatio-temporal Masking for Skeleton Action Representation Learning
- Authors: Aman Anand, Amir Eskandari, Elyas Rahsno, Farhana Zulkernine
- Abstract summary: Self-supervised learning (SSL) has shown remarkable success in skeleton-based action recognition. Existing SSL methods rely on data augmentations that predominantly focus on masking high-motion frames and high-degree joints. We propose Asymmetric Spatio-temporal Masking (ASMa) for Skeleton Action Representation Learning, a novel combination of masking strategies to learn a full spectrum of motion dynamics.
- Score: 0.410492188035848
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Self-supervised learning (SSL) has shown remarkable success in skeleton-based action recognition by leveraging data augmentations to learn meaningful representations. However, existing SSL methods rely on data augmentations that predominantly focus on masking high-motion frames and high-degree joints, such as joints with degree 3 or 4. This results in biased and incomplete feature representations that struggle to generalize across varied motion patterns. To address this, we propose Asymmetric Spatio-temporal Masking (ASMa) for Skeleton Action Representation Learning, a novel combination of masking strategies to learn the full spectrum of spatio-temporal dynamics inherent in human actions. ASMa employs two complementary masking strategies: one that selectively masks high-degree joints and low-motion frames, and another that masks low-degree joints and high-motion frames. These masking strategies ensure more balanced and comprehensive skeleton representation learning. Furthermore, we introduce a learnable feature alignment module to effectively align the representations learned from both masked views. To facilitate deployment in resource-constrained settings and on low-resource devices, we compress the learned and aligned representation into a lightweight model using knowledge distillation. Extensive experiments on the NTU RGB+D 60, NTU RGB+D 120, and PKU-MMD datasets demonstrate that our approach outperforms existing SSL methods with an average improvement of 2.7-4.4% in fine-tuning and up to 5.9% in transfer learning to noisy datasets, and achieves competitive performance compared to fully supervised baselines. Our distilled model achieves 91.4% parameter reduction and 3x faster inference on edge devices while maintaining competitive accuracy, enabling practical deployment in resource-constrained scenarios.
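The two complementary masking strategies described in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the function names (`joint_degrees`, `frame_motion`, `asymmetric_masks`), the ratio parameters, the deterministic top-k selection, and the motion score (mean joint displacement between consecutive frames). The paper's actual selection rule may well differ, for example by sampling rather than ranking.

```python
import numpy as np

def joint_degrees(num_joints, edges):
    """Degree of each joint in the skeleton graph given as (a, b) bone pairs."""
    deg = np.zeros(num_joints, dtype=int)
    for a, b in edges:
        deg[a] += 1
        deg[b] += 1
    return deg

def frame_motion(seq):
    """Per-frame motion score for seq of shape (T, J, C), assuming T >= 2.

    Frame t is scored by the mean joint displacement from frame t-1;
    frame 0 reuses the score of frame 1 (an arbitrary convention).
    """
    disp = np.linalg.norm(np.diff(seq, axis=0), axis=-1)  # (T-1, J)
    motion = disp.mean(axis=1)                            # (T-1,)
    return np.concatenate([motion[:1], motion])           # (T,)

def asymmetric_masks(seq, edges, joint_ratio=0.4, frame_ratio=0.4):
    """Return two boolean keep-masks of shape (T, J); False marks a masked entry.

    view_a masks high-degree joints and low-motion frames,
    view_b masks low-degree joints and high-motion frames.
    """
    T, J, _ = seq.shape
    n_j = max(1, int(round(joint_ratio * J)))
    n_f = max(1, int(round(frame_ratio * T)))
    # Ascending orders; stable sorting makes tie-breaking deterministic.
    deg_order = np.argsort(joint_degrees(J, edges), kind="stable")
    mot_order = np.argsort(frame_motion(seq), kind="stable")

    def build(joints, frames):
        keep = np.ones((T, J), dtype=bool)
        keep[:, joints] = False  # hide the selected joints in every frame
        keep[frames, :] = False  # hide every joint in the selected frames
        return keep

    view_a = build(deg_order[-n_j:], mot_order[:n_f])
    view_b = build(deg_order[:n_j], mot_order[-n_f:])
    return view_a, view_b

# Toy example: 5 joints, 4 frames, motion growing over time (hypothetical data).
edges = [(0, 1), (1, 2), (1, 3), (3, 4)]  # joint 1 has the highest degree (3)
t = np.arange(4, dtype=float)
seq = (t ** 2)[:, None, None] * np.ones((4, 5, 3))
view_a, view_b = asymmetric_masks(seq, edges, joint_ratio=0.2, frame_ratio=0.25)
```

In this toy case `view_a` hides joint 1 and the first frame, while `view_b` hides joint 0 and the last frame, so each view exposes what the other hides. Multiplying `seq` by `view_a[..., None]` zeroes out the masked entries before the sequence is passed to an encoder.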
Related papers
- Dual-Branch Center-Surrounding Contrast: Rethinking Contrastive Learning for 3D Point Clouds [55.5576033344795]
We propose a novel Dual-Branch Center-Surrounding Contrast (CSCon) framework for 3D point clouds. Under the FULL and ALL protocols, CSCon achieves performance comparable to generative methods. Our method attains state-of-the-art results, even surpassing cross-modal approaches.
arXiv Detail & Related papers (2025-12-09T14:56:35Z)
- MIRAM: Masked Image Reconstruction Across Multiple Scales for Breast Lesion Risk Prediction [2.0199924721373392]
Masked image modeling (MIM) has emerged as a more potent SSL technique. This research paper introduces a scalable and practical SSL approach centered around more challenging pretext tasks. Our hypothesis posits that reconstructing high-resolution images enables the model to attend to finer spatial details.
arXiv Detail & Related papers (2025-03-10T10:32:55Z)
- MaskLLM: Learnable Semi-Structured Sparsity for Large Language Models [87.64417894918506]
This work introduces MaskLLM, a learnable pruning method that establishes Semi-structured (or "N:M") Sparsity in Large Language Models. MaskLLM explicitly models N:M patterns as a learnable distribution through Gumbel Softmax sampling.
arXiv Detail & Related papers (2024-09-26T02:37:41Z)
- Static for Dynamic: Towards a Deeper Understanding of Dynamic Facial Expressions Using Static Expression Data [85.71013961405036]
We propose a unified dual-modal learning framework that integrates SFER data as a complementary resource for DFER. S4D employs dual-modal self-supervised pre-training on facial images and videos using a shared Transformer (ViT) encoder-decoder architecture. Experiments demonstrate that S4D achieves a deeper understanding of DFER, setting new state-of-the-art performance.
arXiv Detail & Related papers (2024-09-10T01:57:57Z)
- MLAE: Masked LoRA Experts for Visual Parameter-Efficient Fine-Tuning [45.93128932828256]
Masked LoRA Experts (MLAE) is an innovative approach that applies the concept of masking to visual PEFT.
Our method incorporates a cellular decomposition strategy that transforms a low-rank matrix into independent rank-1 submatrices.
We show that MLAE achieves new state-of-the-art (SOTA) performance with an average accuracy score of 78.8% on the VTAB-1k benchmark and 90.9% on the FGVC benchmark.
arXiv Detail & Related papers (2024-05-29T08:57:23Z)
- Maximum Manifold Capacity Representations in State Representation Learning [8.938418994111716]
Manifold-based self-supervised learning (SSL) builds on the manifold hypothesis.
DeepInfomax with an unbalanced atlas (DIM-UA) has emerged as a powerful tool.
MMCR presents a new frontier for SSL by optimizing class separability via manifold compression.
We present an innovative integration of MMCR into existing SSL methods, incorporating a discerning regularization strategy.
arXiv Detail & Related papers (2024-05-22T17:19:30Z)
- Skeleton2vec: A Self-supervised Learning Framework with Contextualized Target Representations for Skeleton Sequence [56.092059713922744]
We show that using high-level contextualized features as prediction targets can achieve superior performance.
Specifically, we propose Skeleton2vec, a simple and efficient self-supervised 3D action representation learning framework.
Our proposed Skeleton2vec outperforms previous methods and achieves state-of-the-art results.
arXiv Detail & Related papers (2024-01-01T12:08:35Z)
- MA2CL: Masked Attentive Contrastive Learning for Multi-Agent Reinforcement Learning [128.19212716007794]
We propose an effective framework called Multi-Agent Masked Attentive Contrastive Learning (MA2CL).
MA2CL encourages the learned representation to be predictive at both the temporal and agent levels by reconstructing the masked agent observation in latent space.
Our method significantly improves the performance and sample efficiency of different MARL algorithms and outperforms other methods in various vision-based and state-based scenarios.
arXiv Detail & Related papers (2023-06-03T05:32:19Z)
- Self-supervised Action Representation Learning from Partial Spatio-Temporal Skeleton Sequences [29.376328807860993]
We propose a Partial Spatio-Temporal Learning (PSTL) framework to exploit the local relationship between different skeleton joints and video frames.
Our method achieves state-of-the-art performance on NTU RGB+D 60, NTU RGB+D 120 and PKU-MMD under various downstream tasks.
arXiv Detail & Related papers (2023-02-17T17:35:05Z)
- Imposing Consistency for Optical Flow Estimation [73.53204596544472]
Imposing consistency through proxy tasks has been shown to enhance data-driven learning.
This paper introduces novel and effective consistency strategies for optical flow estimation.
arXiv Detail & Related papers (2022-04-14T22:58:30Z)
- Mask-based Latent Reconstruction for Reinforcement Learning [58.43247393611453]
Mask-based Latent Reconstruction (MLR) is proposed to predict the complete state representations in the latent space from the observations with spatially and temporally masked pixels.
Extensive experiments show that our MLR significantly improves the sample efficiency in deep reinforcement learning.
arXiv Detail & Related papers (2022-01-28T13:07:11Z)
This list is automatically generated from the titles and abstracts of the papers in this site.