Related papers: Learning by Neighbor-Aware Semantics, Deciding by Open-form Flows: Towards Robust Zero-Shot Skeleton Action Recognition

URL: http://arxiv.org/abs/2511.09388v1
Date: Thu, 13 Nov 2025 01:51:08 GMT
Title: Learning by Neighbor-Aware Semantics, Deciding by Open-form Flows: Towards Robust Zero-Shot Skeleton Action Recognition
Authors: Yang Chen, Miaoge Li, Zhijie Rao, Deze Zeng, Song Guo, Jingcai Guo
Abstract summary: We propose a novel method for zero-shot skeleton action recognition, termed Flora. Specifically, we attune textual semantics by incorporating direction-aware regional semantics, and a cross-modal consistency objective. Experiments on three benchmark datasets validate the effectiveness of our method, showing particularly impressive performance even when trained with only 10% of the seen data.
Score: 41.77490816513839
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Recognizing unseen skeleton action categories remains highly challenging due to the absence of corresponding skeletal priors. Existing approaches generally follow an "align-then-classify" paradigm but face two fundamental issues, i.e., (i) fragile point-to-point alignment arising from imperfect semantics, and (ii) rigid classifiers restricted by static decision boundaries and coarse-grained anchors. To address these issues, we propose a novel method for zero-shot skeleton action recognition, termed $\texttt{$\textbf{Flora}$}$, which builds upon $\textbf{F}$lexib$\textbf{L}$e neighb$\textbf{O}$r-aware semantic attunement and open-form dist$\textbf{R}$ibution-aware flow cl$\textbf{A}$ssifier. Specifically, we flexibly attune textual semantics by incorporating neighboring inter-class contextual cues to form direction-aware regional semantics, coupled with a cross-modal geometric consistency objective that ensures stable and robust point-to-region alignment. Furthermore, we employ noise-free flow matching to bridge the modality distribution gap between semantic and skeleton latent embeddings, while a condition-free contrastive regularization enhances discriminability, leading to a distribution-aware classifier with fine-grained decision boundaries achieved through token-level velocity predictions. Extensive experiments on three benchmark datasets validate the effectiveness of our method, showing particularly impressive performance even when trained with only 10\% of the seen data. Code is available at https://github.com/cseeyangchen/Flora.

Related papers

Punctuation-aware Hybrid Trainable Sparse Attention for Large Language Models [44.28116882776357]
We present Punctuation-aware Hybrid Sparse Attention (PHSA), a trainable sparse attention framework that leverages punctuation tokens as semantic boundary anchors. Specifically, we design a dual-branch aggregation mechanism that fuses global semantic representations with punctuation-enhanced boundary features, preserving the core semantic structure while introducing almost no additional computational overhead.
arXiv Detail & Related papers (2026-01-06T08:47:16Z)
The Finer the Better: Towards Granular-aware Open-set Domain Generalization [31.197204515055756]
Open-Set Domain Generalization tackles the realistic scenario where deployed models encounter both domain shifts and novel object categories. Existing methods still fall into the dilemma between structural risk of known-classes and open-space risk from unknown-classes. We propose a Semantic-enhanced CLIP framework that explicitly addresses this dilemma through fine-grained semantic enhancement.
arXiv Detail & Related papers (2025-11-21T06:19:19Z)
Ambiguity-aware Point Cloud Segmentation by Adaptive Margin Contrastive Learning [65.94127546086156]
We propose an adaptive margin contrastive learning method for semantic segmentation on point clouds. We first design AMContrast3D, a method that incorporates contrastive learning into an ambiguity estimation framework. Inspired by the insight of joint training, we propose AMContrast3D++, which integrates two branches trained in parallel.
arXiv Detail & Related papers (2025-07-09T07:00:32Z)
Frequency-Semantic Enhanced Variational Autoencoder for Zero-Shot Skeleton-based Action Recognition [11.11236920942621]
Zero-shot skeleton-based action recognition aims to identify actions beyond the categories encountered during training. Previous approaches have primarily focused on aligning visual and semantic representations. We propose a Frequency-Semantic Enhanced Variational Autoencoder (FS-VAE) to explore the skeleton semantic representation learning with frequency decomposition.
arXiv Detail & Related papers (2025-06-27T12:44:08Z)
Selecting and Pruning: A Differentiable Causal Sequentialized State-Space Model for Two-View Correspondence Learning [36.25732435294088]
Two-view correspondence learning aims to discern true and false correspondences between image pairs. Inspired by Mamba's inherent selectivity, we propose CorrMamba, a Correspondence filter. Our method surpasses the previous SOTA by 2.58 absolute percentage points in AUC@20°.
arXiv Detail & Related papers (2025-03-23T04:44:21Z)
Bridging the Skeleton-Text Modality Gap: Diffusion-Powered Modality Alignment for Zero-shot Skeleton-based Action Recognition [25.341177384559174]
In zero-shot skeleton-based action recognition, aligning skeleton features with the text features of action labels is essential. Previous methods focus on direct alignment between skeleton and text latent spaces. We present a diffusion-based skeleton-text alignment framework for ZSAR.
arXiv Detail & Related papers (2024-11-16T08:55:18Z)
SLAck: Semantic, Location, and Appearance Aware Open-Vocabulary Tracking [89.43370214059955]
Open-vocabulary Multiple Object Tracking (MOT) aims to generalize trackers to novel categories not in the training set. We present a unified framework that jointly considers semantics, location, and appearance priors in the early steps of association. Our method eliminates complex post-processings for fusing different cues and boosts the association performance significantly for large-scale open-vocabulary tracking.
arXiv Detail & Related papers (2024-09-17T14:36:58Z)
Spatial Semantic Recurrent Mining for Referring Image Segmentation [63.34997546393106]
We propose S²RM to achieve high-quality cross-modality fusion. It follows a working strategy of trilogy: distributing language feature, spatial semantic recurrent coparsing, and parsed-semantic balancing. Our proposed method performs favorably against other state-of-the-art algorithms.
arXiv Detail & Related papers (2024-05-15T00:17:48Z)
Semantic Connectivity-Driven Pseudo-labeling for Cross-domain Segmentation [89.41179071022121]
Self-training is a prevailing approach in cross-domain semantic segmentation. We propose a novel approach called Semantic Connectivity-driven pseudo-labeling. This approach formulates pseudo-labels at the connectivity level and thus can facilitate learning structured and low-noise semantics.
arXiv Detail & Related papers (2023-12-11T12:29:51Z)
Composed Image Retrieval with Text Feedback via Multi-grained Uncertainty Regularization [73.04187954213471]
We introduce a unified learning approach to simultaneously model the coarse- and fine-grained retrieval. The proposed method has achieved +4.03%, +3.38%, and +2.40% Recall@50 accuracy over a strong baseline.
arXiv Detail & Related papers (2022-11-14T14:25:40Z)
Unsupervised Semantic Segmentation by Distilling Feature Correspondences [94.73675308961944]
Unsupervised semantic segmentation aims to discover and localize semantically meaningful categories within image corpora without any form of annotation. We present STEGO, a novel framework that distills unsupervised features into high-quality discrete semantic labels. STEGO yields a significant improvement over the prior state of the art, on both the CocoStuff and Cityscapes challenges.
arXiv Detail & Related papers (2022-03-16T06:08:47Z)
Exploiting a Joint Embedding Space for Generalized Zero-Shot Semantic Segmentation [25.070027668717422]
Generalized zero-shot semantic segmentation (GZS3) predicts pixel-wise semantic labels for seen and unseen classes. Most GZS3 methods adopt a generative approach that synthesizes visual features of unseen classes from corresponding semantic ones. We propose a discriminative approach to address limitations in a unified framework.
arXiv Detail & Related papers (2021-08-14T13:33:58Z)

This list is automatically generated from the titles and abstracts of the papers on this site.