Zero-Shot Skeleton-Based Action Recognition With Prototype-Guided Feature Alignment
- URL: http://arxiv.org/abs/2507.00566v2
- Date: Thu, 24 Jul 2025 07:56:39 GMT
- Title: Zero-Shot Skeleton-Based Action Recognition With Prototype-Guided Feature Alignment
- Authors: Kai Zhou, Shuhai Zhang, Zeng You, Jinwu Hu, Mingkui Tan, Fei Liu
- Abstract summary: Zero-shot skeleton-based action recognition aims to classify unseen skeleton-based human actions without prior exposure to such categories during training. Previous studies typically use two-stage training: pre-training skeleton encoders on seen action categories using cross-entropy loss and then aligning pre-extracted skeleton and text features. We propose a prototype-guided feature alignment paradigm for zero-shot skeleton-based action recognition, termed PGFA.
- Score: 33.06899506252672
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Zero-shot skeleton-based action recognition aims to classify unseen skeleton-based human actions without prior exposure to such categories during training. This task is extremely challenging due to the difficulty in generalizing from known to unknown actions. Previous studies typically use two-stage training: pre-training skeleton encoders on seen action categories using cross-entropy loss and then aligning pre-extracted skeleton and text features, enabling knowledge transfer to unseen classes through skeleton-text alignment and language models' generalization. However, their efficacy is hindered by 1) insufficient discrimination for skeleton features, as the fixed skeleton encoder fails to capture necessary alignment information for effective skeleton-text alignment; 2) the neglect of alignment bias between skeleton and unseen text features during testing. To this end, we propose a prototype-guided feature alignment paradigm for zero-shot skeleton-based action recognition, termed PGFA. Specifically, we develop an end-to-end cross-modal contrastive training framework to improve skeleton-text alignment, ensuring sufficient discrimination for skeleton features. Additionally, we introduce a prototype-guided text feature alignment strategy to mitigate the adverse impact of the distribution discrepancy during testing. We provide a theoretical analysis to support our prototype-guided text feature alignment strategy and empirically evaluate our overall PGFA on three well-known datasets. Compared with the top competitor SMIE method, our PGFA achieves absolute accuracy improvements of 22.96%, 12.53%, and 18.54% on the NTU-60, NTU-120, and PKU-MMD datasets, respectively.
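For intuition, the sketch below (PyTorch) illustrates the two ingredients the abstract describes: end-to-end skeleton-text contrastive training, and a test-time prototype-guided shift of text features. The encoders, the symmetric InfoNCE form, and the global-offset correction rule are our illustrative assumptions, not the authors' released implementation.

```python
# Illustrative sketch only; function names and the alignment rule are assumptions.
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(skel_feats, text_feats, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired skeleton/text features."""
    s = F.normalize(skel_feats, dim=-1)            # (B, D)
    t = F.normalize(text_feats, dim=-1)            # (B, D)
    logits = s @ t.T / temperature                 # (B, B) cosine similarities
    targets = torch.arange(s.size(0), device=s.device)
    # Each skeleton should match its own description, and vice versa.
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.T, targets))

def prototype_guided_shift(unseen_text_feats, seen_text_feats, seen_skel_protos):
    """Shift unseen-class text features by the mean seen-class modality gap.

    seen_skel_protos: per-class means of skeleton features on seen classes.
    A simple global-offset correction; one plausible reading of the
    prototype-guided alignment strategy, used here purely for illustration.
    """
    gap = (seen_skel_protos - seen_text_feats).mean(dim=0, keepdim=True)
    return unseen_text_feats + gap
```

In such a setup, an unseen action would be predicted at test time by cosine similarity between a skeleton feature and the shifted unseen-class text features.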
Related papers
- Frequency-Semantic Enhanced Variational Autoencoder for Zero-Shot Skeleton-based Action Recognition [11.11236920942621]
Zero-shot skeleton-based action recognition aims to identify actions beyond the categories encountered during training. Previous approaches have primarily focused on aligning visual and semantic representations. We propose a Frequency-Semantic Enhanced Variational Autoencoder (FS-VAE) to explore skeleton semantic representation learning with frequency decomposition.
arXiv Detail & Related papers (2025-06-27T12:44:08Z)
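The summary above only names frequency decomposition; a minimal sketch of one common realization splits each joint trajectory into low- and high-frequency parts with an rFFT over time (the cutoff and transform choice are our assumptions, not details from FS-VAE):

```python
import torch

def frequency_decompose(seq, cutoff=4):
    """Split a skeleton sequence (T, J, 3) into low/high-frequency parts.

    Low frequencies capture slow, global motion; high frequencies capture
    fast, fine-grained motion. The rFFT over time and the fixed cutoff are
    illustrative choices only.
    """
    spec = torch.fft.rfft(seq, dim=0)      # (T//2+1, J, 3), complex
    low = spec.clone()
    low[cutoff:] = 0                       # keep only the slow components
    high = spec - low
    return (torch.fft.irfft(low, n=seq.size(0), dim=0),
            torch.fft.irfft(high, n=seq.size(0), dim=0))
```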
- TDSM: Triplet Diffusion for Skeleton-Text Matching in Zero-Shot Action Recognition [25.341177384559174]
In zero-shot skeleton-based action recognition, aligning skeleton features with the text features of action labels is essential for accurately predicting unseen actions. Our framework is designed as a Triplet Diffusion for Skeleton-Text Matching (TDSM) method, which aligns skeleton features with text prompts through reverse diffusion. To enhance discriminative power, we introduce a novel triplet diffusion (TD) loss that encourages TDSM to pull correct skeleton-text matches together while pushing apart incorrect ones.
arXiv Detail & Related papers (2024-11-16T08:55:18Z)
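The TD loss described above has a standard pull/push structure; a minimal sketch with cosine distances follows, omitting the reverse-diffusion machinery entirely (the margin value and distance choice are assumptions):

```python
import torch.nn.functional as F

def triplet_skeleton_text_loss(text_anchor, pos_skel, neg_skel, margin=0.2):
    """Triplet loss in the spirit of the TD loss (illustrative sketch).

    text_anchor: features of the ground-truth action text prompt.
    pos_skel:    (denoised) skeleton features of the matching sample.
    neg_skel:    skeleton features of a mismatched sample.
    """
    a = F.normalize(text_anchor, dim=-1)
    p = F.normalize(pos_skel, dim=-1)
    n = F.normalize(neg_skel, dim=-1)
    d_pos = 1 - (a * p).sum(-1)   # cosine distance to the correct match
    d_neg = 1 - (a * n).sum(-1)   # cosine distance to the incorrect match
    return F.relu(d_pos - d_neg + margin).mean()
```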
- Affinity-Graph-Guided Contrastive Learning for Pretext-Free Medical Image Segmentation with Minimal Annotation [55.325956390997]
This paper proposes an affinity-graph-guided semi-supervised contrastive learning framework (Semi-AGCL) for medical image segmentation.
The framework first designs an average-patch-entropy-driven inter-patch sampling method, which can provide a robust initial feature space.
With merely 10% of the complete annotation set, our model approaches the accuracy of the fully annotated baseline, manifesting a marginal deviation of only 2.52%.
arXiv Detail & Related papers (2024-10-14T10:44:47Z)
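As a rough illustration of "average-patch-entropy-driven" sampling, the sketch below scores image patches by intensity entropy; the histogram binning and patch size are our assumptions, not details from the paper:

```python
import torch

def patch_entropy(image, patch=16, bins=32):
    """Per-patch intensity entropy (illustrative sketch).

    image: (H, W) grayscale tensor scaled to [0, 1].
    Returns an (H//patch, W//patch) grid of entropies; high-entropy
    patches could then be sampled preferentially as informative views.
    """
    H, W = image.shape
    ph, pw = H // patch, W // patch
    patches = image[:ph * patch, :pw * patch].reshape(ph, patch, pw, patch)
    patches = patches.permute(0, 2, 1, 3).reshape(ph, pw, -1)
    ent = torch.zeros(ph, pw)
    for i in range(ph):
        for j in range(pw):
            hist = torch.histc(patches[i, j], bins=bins, min=0.0, max=1.0)
            p = hist / hist.sum()
            p = p[p > 0]
            ent[i, j] = -(p * p.log()).sum()
    return ent
```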
- Fine-Grained Side Information Guided Dual-Prompts for Zero-Shot Skeleton Action Recognition [18.012159340628557]
We propose a novel method via Side information and dual-prompts learning for skeleton-based zero-shot action recognition (STAR) at the fine-grained level.
Our method achieves state-of-the-art performance in both ZSL and GZSL settings on benchmark datasets.
arXiv Detail & Related papers (2024-04-11T05:51:06Z)
- Skeleton2vec: A Self-supervised Learning Framework with Contextualized Target Representations for Skeleton Sequence [56.092059713922744]
We show that using high-level contextualized features as prediction targets can achieve superior performance.
Specifically, we propose Skeleton2vec, a simple and efficient self-supervised 3D action representation learning framework.
Our proposed Skeleton2vec outperforms previous methods and achieves state-of-the-art results.
arXiv Detail & Related papers (2024-01-01T12:08:35Z)
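A minimal sketch of masked prediction with contextualized targets, assuming an EMA teacher provides the high-level targets the summary mentions (the masking granularity, the hypothetical `student`/`teacher` modules, and the MSE loss are illustrative):

```python
import torch
import torch.nn.functional as F

def masked_prediction_step(student, teacher, seq, mask_ratio=0.5):
    """One masked-prediction step with contextualized targets (sketch).

    Instead of regressing raw joints, the student predicts the features a
    momentum (EMA) teacher produces from the full sequence, at the masked
    time steps. seq: (T, J, C) skeleton sequence; both modules map it to
    (T, D) frame-level features.
    """
    mask = torch.rand(seq.size(0)) < mask_ratio    # mask whole frames
    corrupted = seq.clone()
    corrupted[mask] = 0
    with torch.no_grad():
        target = teacher(seq)                      # (T, D) contextual targets
    pred = student(corrupted)                      # (T, D)
    return F.mse_loss(pred[mask], target[mask])    # regress masked frames only
```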
- A Generically Contrastive Spatiotemporal Representation Enhancement for 3D Skeleton Action Recognition [10.403751563214113]
We propose a Contrastive Spatiotemporal Representation Enhancement (CSRE) framework to obtain more discriminative representations from the sequences. Specifically, we decompose the representation into spatial- and temporal-specific features to explore fine-grained motion patterns. To explicitly exploit the latent data distributions, we employ the attentive features in contrastive learning, which models the cross-sequence semantic relations.
arXiv Detail & Related papers (2023-12-23T02:54:41Z)
- FigCaps-HF: A Figure-to-Caption Generative Framework and Benchmark with Human Feedback [69.4639239117551]
FigCaps-HF is a new framework for figure-caption generation that incorporates domain expert feedback to generate captions optimized for reader preferences. Our framework comprises 1) an automatic method for evaluating the quality of figure-caption pairs and 2) a novel reinforcement learning with human feedback (RLHF) method to optimize a generative figure-to-caption model for reader preferences.
arXiv Detail & Related papers (2023-07-20T13:40:22Z)
- SkeletonMAE: Graph-based Masked Autoencoder for Skeleton Sequence Pre-training [110.55093254677638]
We propose an efficient skeleton sequence learning framework, named Skeleton Sequence Learning (SSL).
In this paper, we build an asymmetric graph-based encoder-decoder pre-training architecture named SkeletonMAE.
Our SSL generalizes well across different datasets and outperforms the state-of-the-art self-supervised skeleton-based action recognition methods.
arXiv Detail & Related papers (2023-07-17T13:33:11Z)
- Learning from Temporal Spatial Cubism for Cross-Dataset Skeleton-based Action Recognition [88.34182299496074]
Action labels are available only on the source dataset and unavailable on the target dataset during training.
We utilize a self-supervision scheme to reduce the domain shift between two skeleton-based action datasets.
By segmenting and permuting temporal segments or human body parts, we design two self-supervised learning classification tasks.
arXiv Detail & Related papers (2022-07-17T07:05:39Z)
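The segment-permutation pretext task described above can be sketched as follows; the segment count and labeling scheme are illustrative assumptions:

```python
import itertools
import torch

def segment_permutation_task(seq, num_segments=3):
    """Build one sample of the temporal-permutation pretext task (sketch).

    seq: (T, J, C) skeleton sequence, T divisible by num_segments.
    Returns the shuffled sequence and the permutation index a classifier
    is trained to predict, encouraging domain-invariant temporal structure.
    """
    segs = list(seq.chunk(num_segments, dim=0))
    perm = torch.randperm(num_segments)
    shuffled = torch.cat([segs[i] for i in perm], dim=0)
    all_perms = list(itertools.permutations(range(num_segments)))
    label = all_perms.index(tuple(perm.tolist()))  # pretext class label
    return shuffled, label
```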
- SimMC: Simple Masked Contrastive Learning of Skeleton Representations for Unsupervised Person Re-Identification [63.903237777588316]
We present a generic Simple Masked Contrastive learning (SimMC) framework to learn effective representations from unlabeled 3D skeletons for person re-ID.
Specifically, to fully exploit skeleton features within each skeleton sequence, we first devise a masked prototype contrastive learning (MPC) scheme.
Then, we propose the masked intra-sequence contrastive learning (MIC) to capture intra-sequence pattern consistency between subsequences.
arXiv Detail & Related papers (2022-04-21T00:19:38Z)
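A minimal sketch of the prototype-level contrast in the spirit of MPC, assuming integer pseudo-labels from an off-the-shelf clustering step (the clustering itself and the loss form are our assumptions):

```python
import torch
import torch.nn.functional as F

def masked_prototype_contrastive_loss(feats, cluster_ids, temperature=0.07):
    """Prototype contrastive loss over masked skeleton features (sketch).

    feats: (N, D) features of randomly masked skeleton sequences.
    cluster_ids: (N,) integer pseudo-labels from clustering the features.
    Each sample is pulled toward its cluster prototype (the cluster mean)
    and pushed away from the other prototypes.
    """
    feats = F.normalize(feats, dim=-1)
    num_clusters = int(cluster_ids.max()) + 1
    protos = torch.stack([feats[cluster_ids == c].mean(0)
                          for c in range(num_clusters)])
    protos = F.normalize(protos, dim=-1)
    logits = feats @ protos.T / temperature    # (N, K) similarities
    return F.cross_entropy(logits, cluster_ids)
```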
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information and is not responsible for any consequences arising from its use.