Bridging the Skeleton-Text Modality Gap: Diffusion-Powered Modality Alignment for Zero-shot Skeleton-based Action Recognition
- URL: http://arxiv.org/abs/2411.10745v4
- Date: Wed, 16 Jul 2025 11:28:48 GMT
- Title: Bridging the Skeleton-Text Modality Gap: Diffusion-Powered Modality Alignment for Zero-shot Skeleton-based Action Recognition
- Authors: Jeonghyeok Do, Munchurl Kim
- Abstract summary: In zero-shot skeleton-based action recognition, aligning skeleton features with the text features of action labels is essential. Previous methods focus on direct alignment between skeleton and text latent spaces. We present a diffusion-based skeleton-text alignment framework for ZSAR.
- Score: 25.341177384559174
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In zero-shot skeleton-based action recognition (ZSAR), aligning skeleton features with the text features of action labels is essential for accurately predicting unseen actions. ZSAR faces a fundamental challenge in bridging the modality gap between these two types of features, which severely limits generalization to unseen actions. Previous methods focus on direct alignment between the skeleton and text latent spaces, but the gap between these spaces hinders robust generalization. Motivated by the success of diffusion models in multi-modal alignment (e.g., text-to-image, text-to-video), we present the first diffusion-based skeleton-text alignment framework for ZSAR. Our approach, Triplet Diffusion for Skeleton-Text Matching (TDSM), exploits the cross-modal alignment power of diffusion models rather than their generative capability. Specifically, TDSM aligns skeleton features with text prompts by incorporating text features into the reverse diffusion process: skeleton features are denoised under text guidance, forming a unified skeleton-text latent space for robust matching. To enhance discriminative power, we introduce a triplet diffusion (TD) loss that encourages correct skeleton-text matches while pushing apart pairs from different action classes. TDSM outperforms very recent state-of-the-art methods by large margins of 2.36 to 13.05 percentage points, demonstrating superior accuracy and scalability in zero-shot settings through effective skeleton-text matching.
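The triplet diffusion (TD) loss described in the abstract can be sketched in a few lines. The following is a minimal, hypothetical illustration (the one-step noise schedule, the `denoiser` interface, and the margin formulation are simplifying assumptions, not the paper's implementation): a matching text condition should yield a low denoising error, while text from a different action class should yield a margin-separated higher error.

```python
import numpy as np

rng = np.random.default_rng(0)

def mse(a, b):
    return float(np.mean((a - b) ** 2))

def triplet_diffusion_loss(denoiser, skel, txt_pos, txt_neg, margin=1.0):
    """Hypothetical sketch of a triplet diffusion (TD) loss.

    A noisy skeleton feature is denoised under text guidance; the
    denoising error should be small for the matching action label
    (txt_pos) and at least `margin` larger for a different class (txt_neg).
    """
    noise = rng.standard_normal(skel.shape)
    alpha = 0.7                              # toy noise level (single timestep)
    noisy = alpha * skel + (1 - alpha) * noise

    err_pos = mse(denoiser(noisy, txt_pos), noise)   # matched skeleton-text pair
    err_neg = mse(denoiser(noisy, txt_neg), noise)   # mismatched pair
    # small error for matches, margin-separated larger error for mismatches
    return err_pos + max(0.0, margin - (err_neg - err_pos))
```

In this simplified view, classification of an unseen action would amount to picking the text prompt under which the skeleton feature denoises best.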
Related papers
- TripleFDS: Triple Feature Disentanglement and Synthesis for Scene Text Editing [56.73004765030206]
Scene Text Editing (STE) aims to naturally modify text in images while preserving visual consistency. We propose TripleFDS, a novel framework for STE with disentangled modular attributes. TripleFDS achieves state-of-the-art image fidelity (SSIM of 44.54) and text accuracy (ACC of 93.58%) on the mainstream STE benchmarks.
arXiv Detail & Related papers (2025-11-17T14:15:03Z) - Learning by Neighbor-Aware Semantics, Deciding by Open-form Flows: Towards Robust Zero-Shot Skeleton Action Recognition [41.77490816513839]
We propose a novel method for zero-shot skeleton action recognition, termed Flora. Specifically, we attune textual semantics by incorporating direction-aware regional semantics and a cross-modal consistency objective. Experiments on three benchmark datasets validate the effectiveness of our method, showing particularly impressive performance even when trained with only 10% of the seen data.
arXiv Detail & Related papers (2025-11-12T14:54:53Z) - Zero-Shot Skeleton-Based Action Recognition With Prototype-Guided Feature Alignment [33.06899506252672]
Zero-shot skeleton-based action recognition aims to classify unseen skeleton-based human actions without prior exposure to such categories during training. Previous studies typically use two-stage training: pre-training skeleton encoders on seen action categories using cross-entropy loss and then aligning pre-extracted skeleton and text features. We propose a prototype-guided feature alignment paradigm for zero-shot skeleton-based action recognition, termed PGFA.
arXiv Detail & Related papers (2025-07-01T08:34:35Z) - Frequency-Semantic Enhanced Variational Autoencoder for Zero-Shot Skeleton-based Action Recognition [11.11236920942621]
Zero-shot skeleton-based action recognition aims to identify actions beyond the categories encountered during training. Previous approaches have primarily focused on aligning visual and semantic representations. We propose a Frequency-Semantic Enhanced Variational Autoencoder (FS-VAE) to explore skeleton semantic representation learning with frequency decomposition.
arXiv Detail & Related papers (2025-06-27T12:44:08Z) - Rethinking Cross-Modal Interaction in Multimodal Diffusion Transformers [79.94246924019984]
Multimodal Diffusion Transformers (MM-DiTs) have achieved remarkable progress in text-driven visual generation. We propose Temperature-Adjusted Cross-modal Attention (TACA), a parameter-efficient method that dynamically rebalances multimodal interactions. Our findings highlight the importance of balancing cross-modal attention in improving semantic fidelity in text-to-image diffusion models.
arXiv Detail & Related papers (2025-06-09T17:54:04Z) - Enhancing Cross-Tokenizer Knowledge Distillation with Contextual Dynamical Mapping [85.48043537327258]
Contextual Dynamic Mapping (CDM) is a novel cross-tokenizer distillation framework.
It uses contextual information to enhance sequence alignment precision and dynamically improve vocabulary mapping.
Our method shows significant advantages over existing cross-tokenizer distillation baselines across diverse benchmarks.
arXiv Detail & Related papers (2025-02-16T12:46:07Z) - Zero-Shot Skeleton-based Action Recognition with Dual Visual-Text Alignment [11.72557768532557]
The key to zero-shot action recognition lies in aligning visual features with semantic vectors representing action categories.
Our approach achieves state-of-the-art performance on several popular zero-shot skeleton-based action recognition benchmarks.
arXiv Detail & Related papers (2024-09-22T06:44:58Z) - SA-DVAE: Improving Zero-Shot Skeleton-Based Action Recognition by Disentangled Variational Autoencoders [7.618223798662929]
We propose SA-DVAE -- Semantic Alignment via Disentangled Variational Autoencoders.
We implement this idea via a pair of modality-specific variational autoencoders coupled with a total correlation penalty.
Experiments show that SA-DVAE improves performance over existing methods.
arXiv Detail & Related papers (2024-07-18T12:35:46Z) - Part-aware Unified Representation of Language and Skeleton for Zero-shot Action Recognition [57.97930719585095]
We introduce Part-aware Unified Representation between Language and Skeleton (PURLS) to explore visual-semantic alignment at both local and global scales.
Our approach is evaluated on various skeleton/language backbones and three large-scale datasets.
The results showcase the universality and superior performance of PURLS, surpassing prior skeleton-based solutions and standard baselines from other domains.
arXiv Detail & Related papers (2024-06-19T08:22:32Z) - Spatial Semantic Recurrent Mining for Referring Image Segmentation [63.34997546393106]
We propose S²RM to achieve high-quality cross-modality fusion.
It follows a three-stage strategy: distributing language features, spatial semantic recurrent coparsing, and parsed-semantic balancing.
Our proposed method performs favorably against other state-of-the-art algorithms.
arXiv Detail & Related papers (2024-05-15T00:17:48Z) - High-Fidelity Speech Synthesis with Minimal Supervision: All Using Diffusion Models [56.00939852727501]
Minimally-supervised speech synthesis decouples TTS by combining two types of discrete speech representations.
The non-autoregressive framework enhances controllability, and the duration diffusion model enables diversified prosodic expression.
arXiv Detail & Related papers (2023-09-27T09:27:03Z) - MaskDiffusion: Boosting Text-to-Image Consistency with Conditional Mask [84.84034179136458]
A crucial factor leading to the text-image mismatch issue is the inadequate cross-modality relation learning.
We propose an adaptive mask, which is conditioned on the attention maps and the prompt embeddings, to dynamically adjust the contribution of each text token to the image features.
Our method, termed MaskDiffusion, is training-free and hot-pluggable for popular pre-trained diffusion models.
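The adaptive mask described above can be illustrated with a minimal sketch. The code below is a hypothetical simplification (the thresholding rule, the mean-pooled attention score, and the function names are assumptions, not MaskDiffusion's actual formulation): tokens that receive little cross-attention are down-weighted so they contribute less to the image features.

```python
import numpy as np

def adaptive_token_mask(attn, prompt_emb, tau=None):
    """Hypothetical sketch of an attention-conditioned token mask.

    attn:       (H*W, T) cross-attention from image locations to T text tokens.
    prompt_emb: (T, D) text token embeddings.
    Tokens whose mean attention falls below a threshold are zeroed out,
    reducing their contribution to the conditioned image features.
    """
    score = attn.mean(axis=0)                 # per-token mean attention
    if tau is None:
        tau = score.mean()                    # data-dependent threshold
    mask = (score >= tau).astype(attn.dtype)  # keep strongly-attended tokens
    return prompt_emb * mask[:, None]         # suppress weakly-attended tokens
```

Because such a mask is computed from quantities already produced during sampling, it can be applied to a pre-trained diffusion model without any retraining, which is what "training-free and hot-pluggable" refers to.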
arXiv Detail & Related papers (2023-09-08T15:53:37Z) - Three ways to improve feature alignment for open vocabulary detection [88.65076922242184]
A key problem in zero-shot open-vocabulary detection is how to align visual and text features so that the detector performs well on unseen classes.
Previous approaches train the feature pyramid and detection head from scratch, which breaks the vision-text feature alignment established during pretraining.
We propose three methods to alleviate these issues. Firstly, a simple scheme is used to augment the text embeddings which prevents overfitting to a small number of classes seen during training.
Secondly, the feature pyramid network and the detection head are modified to include trainable shortcuts.
Finally, a self-training approach is used to leverage a larger corpus of
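The first of the three methods, augmenting the text embeddings to prevent overfitting to the few seen classes, can be sketched as follows. This is a hypothetical illustration using simple Gaussian jitter (the function name, noise model, and parameters are assumptions, not the paper's exact scheme):

```python
import numpy as np

def augment_text_embeddings(emb, n_aug=4, sigma=0.1, seed=0):
    """Hypothetical sketch of text-embedding augmentation for open-vocab
    detection: jitter each class embedding with small Gaussian noise so the
    detector cannot overfit to the exact embeddings of the seen classes.

    emb: (C, D) one embedding per class; returns (n_aug * C, D).
    """
    rng = np.random.default_rng(seed)
    noise = rng.normal(0.0, sigma, size=(n_aug,) + emb.shape)
    aug = emb[None] + noise                              # (n_aug, C, D)
    aug = aug / np.linalg.norm(aug, axis=-1, keepdims=True)  # re-normalize
    return aug.reshape(-1, emb.shape[-1])                # pooled augmentations
```

Training against a pool of perturbed embeddings rather than a single fixed vector per class is one simple way to keep the vision-text alignment from collapsing onto the seen-class embeddings.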
arXiv Detail & Related papers (2023-03-23T17:59:53Z) - Self-supervised Action Representation Learning from Partial Spatio-Temporal Skeleton Sequences [29.376328807860993]
We propose a Partial Spatio-Temporal Learning (PSTL) framework to exploit the local relationship between different skeleton joints and video frames.
Our method achieves state-of-the-art performance on NTU RGB+D 60, NTU RGB+D 120, and PKU-MMD under various downstream tasks.
arXiv Detail & Related papers (2023-02-17T17:35:05Z) - Self-supervised Character-to-Character Distillation for Text Recognition [54.12490492265583]
We propose a novel self-supervised Character-to-Character Distillation method, CCD, which enables versatile augmentations to facilitate text representation learning.
CCD achieves state-of-the-art results, with average performance gains of 1.38% in text recognition, 1.7% in text segmentation, 0.24 dB (PSNR) and 0.0321 (SSIM) in text super-resolution.
arXiv Detail & Related papers (2022-11-01T05:48:18Z) - Contrastive Learning from Spatio-Temporal Mixed Skeleton Sequences for Self-Supervised Skeleton-Based Action Recognition [21.546894064451898]
We show that directly extending contrastive pairs based on normal augmentations brings limited returns in terms of performance.
We propose SkeleMixCLR: a contrastive learning framework with a spatio-temporal skeleton mixing augmentation (SkeleMix) to complement current contrastive learning approaches.
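A skeleton mixing augmentation of this kind can be sketched in a few lines. The sketch below is a hypothetical simplification (the crop-and-paste rule, fractions, and shapes are assumptions, not SkeleMix's exact definition): a random subset of joints over a random span of frames is taken from one sequence and pasted into another.

```python
import numpy as np

def skelemix(seq_a, seq_b, joint_frac=0.5, frame_frac=0.5, seed=0):
    """Hypothetical sketch of a spatio-temporal skeleton mixing augmentation.

    seq_a, seq_b: (T, J, C) skeleton sequences (frames, joints, coords).
    A random joint subset over a random frame span is cropped from seq_b
    and pasted into a copy of seq_a.
    """
    rng = np.random.default_rng(seed)
    T, J, _ = seq_a.shape
    joints = rng.choice(J, size=max(1, int(J * joint_frac)), replace=False)
    t0 = int(rng.integers(0, T))
    t1 = min(T, t0 + max(1, int(T * frame_frac)))
    mixed = seq_a.copy()
    mixed[t0:t1, joints] = seq_b[t0:t1, joints]   # paste the cropped region
    return mixed
```

The mixed sequence then serves as an additional view in the contrastive objective, alongside the usual augmentations.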
arXiv Detail & Related papers (2022-07-07T03:18:09Z) - SimMC: Simple Masked Contrastive Learning of Skeleton Representations for Unsupervised Person Re-Identification [63.903237777588316]
We present a generic Simple Masked Contrastive learning (SimMC) framework to learn effective representations from unlabeled 3D skeletons for person re-ID.
Specifically, to fully exploit skeleton features within each skeleton sequence, we first devise a masked prototype contrastive learning (MPC) scheme.
Then, we propose the masked intra-sequence contrastive learning (MIC) to capture intra-sequence pattern consistency between subsequences.
arXiv Detail & Related papers (2022-04-21T00:19:38Z) - Contextualized Semantic Distance between Highly Overlapped Texts [85.1541170468617]
Overlap frequently occurs in paired texts in natural language processing tasks such as text editing and semantic similarity evaluation.
This paper aims to address the issue with a mask-and-predict strategy.
We take the words in the longest common subsequence as neighboring words and use masked language modeling (MLM) to predict the distributions on their positions.
Experiments on Semantic Textual Similarity show NDD to be more sensitive to various semantic differences, especially on highly overlapped paired texts.
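The distance between two texts can then be built from the predicted distributions at aligned positions. The sketch below is a hypothetical simplification (the divergence choice and function name are assumptions; it presumes smoothed, zero-free distributions): each aligned position contributes a symmetric KL divergence between the two MLM distributions.

```python
import math

def distribution_distance(dists_a, dists_b):
    """Hypothetical sketch of an NDD-style distance between two texts.

    dists_a, dists_b: lists of MLM-predicted vocabulary distributions at
    aligned (shared) positions. Assumes smoothed distributions (no zeros
    in q where p is nonzero). Returns the summed symmetric KL divergence.
    """
    def kl(p, q):
        return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

    return sum((kl(p, q) + kl(q, p)) / 2 for p, q in zip(dists_a, dists_b))
```

Identical texts yield identical distributions and hence a distance of zero, while a small edit that shifts the predicted distributions at even one position produces a strictly positive distance, which is why such a measure is sensitive to semantic differences in highly overlapped pairs.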
arXiv Detail & Related papers (2021-10-04T03:59:15Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.