Learning by Neighbor-Aware Semantics, Deciding by Open-form Flows: Towards Robust Zero-Shot Skeleton Action Recognition
- URL: http://arxiv.org/abs/2511.09388v1
- Date: Thu, 13 Nov 2025 01:51:08 GMT
- Title: Learning by Neighbor-Aware Semantics, Deciding by Open-form Flows: Towards Robust Zero-Shot Skeleton Action Recognition
- Authors: Yang Chen, Miaoge Li, Zhijie Rao, Deze Zeng, Song Guo, Jingcai Guo,
- Abstract summary: We propose a novel method for zero-shot skeleton action recognition, termed $texttt$textbfFlora$$.<n>Specifically, we attune textual semantics by incorporating direction-aware regional semantics, and a cross-modal consistency objective.<n>Experiments on three benchmark datasets validate the effectiveness of our method, showing particularly impressive performance even when trained with only 10% of the seen data.
- Score: 41.77490816513839
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recognizing unseen skeleton action categories remains highly challenging due to the absence of corresponding skeletal priors. Existing approaches generally follow an "align-then-classify" paradigm but face two fundamental issues, i.e., (i) fragile point-to-point alignment arising from imperfect semantics, and (ii) rigid classifiers restricted by static decision boundaries and coarse-grained anchors. To address these issues, we propose a novel method for zero-shot skeleton action recognition, termed $\texttt{$\textbf{Flora}$}$, which builds upon $\textbf{F}$lexib$\textbf{L}$e neighb$\textbf{O}$r-aware semantic attunement and open-form dist$\textbf{R}$ibution-aware flow cl$\textbf{A}$ssifier. Specifically, we flexibly attune textual semantics by incorporating neighboring inter-class contextual cues to form direction-aware regional semantics, coupled with a cross-modal geometric consistency objective that ensures stable and robust point-to-region alignment. Furthermore, we employ noise-free flow matching to bridge the modality distribution gap between semantic and skeleton latent embeddings, while a condition-free contrastive regularization enhances discriminability, leading to a distribution-aware classifier with fine-grained decision boundaries achieved through token-level velocity predictions. Extensive experiments on three benchmark datasets validate the effectiveness of our method, showing particularly impressive performance even when trained with only 10\% of the seen data. Code is available at https://github.com/cseeyangchen/Flora.
Related papers
- Punctuation-aware Hybrid Trainable Sparse Attention for Large Language Models [44.28116882776357]
We present textbfPunctuation-aware textbfHybrid textbfSparse textbfAttention textbf(PHSA), a trainable sparse attention framework that leverages punctuation tokens as semantic boundary anchors.<n>Specifically, we design a dual-branch aggregation mechanism that fuses global semantic representations with punctuation-enhanced boundary features, preserving the core semantic structure while introducing almost no additional computational overhead.
arXiv Detail & Related papers (2026-01-06T08:47:16Z) - The Finer the Better: Towards Granular-aware Open-set Domain Generalization [31.197204515055756]
Open-Set Domain Generalization tackles the realistic scenario where deployed models encounter both domain shifts and novel object categories.<n>Existing methods still fall into the dilemma between structural risk of known-classes and open-space risk from unknown-classes.<n>We propose a Semantic-enhanced CLIP framework that explicitly addresses this dilemma through fine-grained semantic enhancement.
arXiv Detail & Related papers (2025-11-21T06:19:19Z) - Ambiguity-aware Point Cloud Segmentation by Adaptive Margin Contrastive Learning [65.94127546086156]
We propose an adaptive margin contrastive learning method for semantic segmentation on point clouds.<n>We first design AMContrast3D, a method comprising contrastive learning into an ambiguity estimation framework.<n>Inspired by the insight of joint training, we propose AMContrast3D++ integrating with two branches trained in parallel.
arXiv Detail & Related papers (2025-07-09T07:00:32Z) - Frequency-Semantic Enhanced Variational Autoencoder for Zero-Shot Skeleton-based Action Recognition [11.11236920942621]
Zero-shot skeleton-based action recognition aims to identify actions beyond the categories encountered during training.<n>Previous approaches have primarily focused on aligning visual and semantic representations.<n>We propose a Frequency-Semantic Enhanced Variational Autoencoder (FS-VAE) to explore the skeleton semantic representation learning with frequency decomposition.
arXiv Detail & Related papers (2025-06-27T12:44:08Z) - Selecting and Pruning: A Differentiable Causal Sequentialized State-Space Model for Two-View Correspondence Learning [36.25732435294088]
Two-view correspondence learning aims to discern true and false correspondences between image pairs.<n>Inspired by Mamba's inherent selectivity, we propose textbfCorrMamba, a textbfCorrespondence filter.<n>Our method surpasses the previous SOTA by $2.58$ absolute percentage points in AUC@20textdegree.
arXiv Detail & Related papers (2025-03-23T04:44:21Z) - Bridging the Skeleton-Text Modality Gap: Diffusion-Powered Modality Alignment for Zero-shot Skeleton-based Action Recognition [25.341177384559174]
In zero-shot skeleton-based action recognition, aligning skeleton features with the text features of action labels is essential.<n>Previous methods focus on direct alignment between skeleton and text latent spaces.<n>We present a diffusion-based skeleton-text alignment framework for ZSAR.
arXiv Detail & Related papers (2024-11-16T08:55:18Z) - SLAck: Semantic, Location, and Appearance Aware Open-Vocabulary Tracking [89.43370214059955]
Open-vocabulary Multiple Object Tracking (MOT) aims to generalize trackers to novel categories not in the training set.
We present a unified framework that jointly considers semantics, location, and appearance priors in the early steps of association.
Our method eliminates complex post-processings for fusing different cues and boosts the association performance significantly for large-scale open-vocabulary tracking.
arXiv Detail & Related papers (2024-09-17T14:36:58Z) - Spatial Semantic Recurrent Mining for Referring Image Segmentation [63.34997546393106]
We propose Stextsuperscript2RM to achieve high-quality cross-modality fusion.
It follows a working strategy of trilogy: distributing language feature, spatial semantic recurrent coparsing, and parsed-semantic balancing.
Our proposed method performs favorably against other state-of-the-art algorithms.
arXiv Detail & Related papers (2024-05-15T00:17:48Z) - Semantic Connectivity-Driven Pseudo-labeling for Cross-domain
Segmentation [89.41179071022121]
Self-training is a prevailing approach in cross-domain semantic segmentation.
We propose a novel approach called Semantic Connectivity-driven pseudo-labeling.
This approach formulates pseudo-labels at the connectivity level and thus can facilitate learning structured and low-noise semantics.
arXiv Detail & Related papers (2023-12-11T12:29:51Z) - Composed Image Retrieval with Text Feedback via Multi-grained
Uncertainty Regularization [73.04187954213471]
We introduce a unified learning approach to simultaneously modeling the coarse- and fine-grained retrieval.
The proposed method has achieved +4.03%, +3.38%, and +2.40% Recall@50 accuracy over a strong baseline.
arXiv Detail & Related papers (2022-11-14T14:25:40Z) - Unsupervised Semantic Segmentation by Distilling Feature Correspondences [94.73675308961944]
Unsupervised semantic segmentation aims to discover and localize semantically meaningful categories within image corpora without any form of annotation.
We present STEGO, a novel framework that distills unsupervised features into high-quality discrete semantic labels.
STEGO yields a significant improvement over the prior state of the art, on both the CocoStuff and Cityscapes challenges.
arXiv Detail & Related papers (2022-03-16T06:08:47Z) - Exploiting a Joint Embedding Space for Generalized Zero-Shot Semantic
Segmentation [25.070027668717422]
Generalized zero-shot semantic segmentation (GZS3) predicts pixel-wise semantic labels for seen and unseen classes.
Most GZS3 methods adopt a generative approach that synthesizes visual features of unseen classes from corresponding semantic ones.
We propose a discriminative approach to address limitations in a unified framework.
arXiv Detail & Related papers (2021-08-14T13:33:58Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.