TDSM: Triplet Diffusion for Skeleton-Text Matching in Zero-Shot Action Recognition
- URL: http://arxiv.org/abs/2411.10745v3
- Date: Mon, 10 Mar 2025 04:40:07 GMT
- Title: TDSM: Triplet Diffusion for Skeleton-Text Matching in Zero-Shot Action Recognition
- Authors: Jeonghyeok Do, Munchurl Kim,
- Abstract summary: In zero-shot skeleton-based action recognition, aligning skeleton features with the text features of action labels is essential for accurately predicting unseen actions.<n>Our framework is designed as a Triplet Diffusion for Skeleton-Text Matching ( TDSM) method which aligns skeleton features with text prompts through reverse diffusion.<n>To enhance discriminative power, we introduce a novel triplet diffusion (TD) loss that encourages our TDSM to correct skeleton-text matches while pushing apart incorrect ones.
- Score: 25.341177384559174
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We firstly present a diffusion-based action recognition with zero-shot learning for skeleton inputs. In zero-shot skeleton-based action recognition, aligning skeleton features with the text features of action labels is essential for accurately predicting unseen actions. Previous methods focus on direct alignment between skeleton and text latent spaces, but the modality gaps between these spaces hinder robust generalization learning. Motivated from the remarkable performance of text-to-image diffusion models, we leverage their alignment capabilities between different modalities mostly by focusing on the training process during reverse diffusion rather than using their generative power. Based on this, our framework is designed as a Triplet Diffusion for Skeleton-Text Matching (TDSM) method which aligns skeleton features with text prompts through reverse diffusion, embedding the prompts into the unified skeleton-text latent space to achieve robust matching. To enhance discriminative power, we introduce a novel triplet diffusion (TD) loss that encourages our TDSM to correct skeleton-text matches while pushing apart incorrect ones. Our TDSM significantly outperforms the very recent state-of-the-art methods with large margins of 2.36%-point to 13.05%-point, demonstrating superior accuracy and scalability in zero-shot settings through effective skeleton-text matching.
Related papers
- Zero-Shot Skeleton-Based Action Recognition With Prototype-Guided Feature Alignment [33.06899506252672]
Zero-shot skeleton-based action recognition aims to classify unseen skeleton-based human actions without prior exposure to such categories during training.<n>Previous studies typically use two-stage training: pre-training skeleton encoders on seen action categories using cross-entropy loss and then aligning pre-extracted skeleton and text features.<n>We propose a prototype-guided feature alignment paradigm for zero-shot skeleton-based action recognition, termed PGFA.
arXiv Detail & Related papers (2025-07-01T08:34:35Z) - Frequency-Semantic Enhanced Variational Autoencoder for Zero-Shot Skeleton-based Action Recognition [11.11236920942621]
Zero-shot skeleton-based action recognition aims to identify actions beyond the categories encountered during training.<n>Previous approaches have primarily focused on aligning visual and semantic representations.<n>We propose a Frequency-Semantic Enhanced Variational Autoencoder (FS-VAE) to explore the skeleton semantic representation learning with frequency decomposition.
arXiv Detail & Related papers (2025-06-27T12:44:08Z) - Rethinking Cross-Modal Interaction in Multimodal Diffusion Transformers [79.94246924019984]
Multimodal Diffusion Transformers (MM-DiTs) have achieved remarkable progress in text-driven visual generation.<n>We propose textbfTemperature-Adjusted Cross-modal Attention (TACA), a parameter-efficient method that dynamically rebalances multimodal interactions.<n>Our findings highlight the importance of balancing cross-modal attention in improving semantic fidelity in text-to-image diffusion models.
arXiv Detail & Related papers (2025-06-09T17:54:04Z) - Enhancing Cross-Tokenizer Knowledge Distillation with Contextual Dynamical Mapping [85.48043537327258]
Contextual Dynamic Mapping (CDM) is a novel cross-tokenizer distillation framework.
It uses contextual information to enhance sequence alignment precision and dynamically improve vocabulary mapping.
Our method shows significant advantages over existing cross-tokenizer distillation baselines across diverse benchmarks.
arXiv Detail & Related papers (2025-02-16T12:46:07Z) - Zero-Shot Skeleton-based Action Recognition with Dual Visual-Text Alignment [11.72557768532557]
Key to zero-shot action recognition lies in aligning visual features with semantic vectors representing action categories.
Our approach achieves state-of-the-art performances on several popular zero-shot skeleton-based action recognition benchmarks.
arXiv Detail & Related papers (2024-09-22T06:44:58Z) - SA-DVAE: Improving Zero-Shot Skeleton-Based Action Recognition by Disentangled Variational Autoencoders [7.618223798662929]
We propose SA-DVAE -- Semantic Alignment via Disentangled Variational Autoencoders.
We implement this idea via a pair of modality-specific variational autoencoders coupled with a total correction penalty.
Experiments show that SA-DAVE produces improved performance over existing methods.
arXiv Detail & Related papers (2024-07-18T12:35:46Z) - Part-aware Unified Representation of Language and Skeleton for Zero-shot Action Recognition [57.97930719585095]
We introduce Part-aware Unified Representation between Language and Skeleton (PURLS) to explore visual-semantic alignment at both local and global scales.
Our approach is evaluated on various skeleton/language backbones and three large-scale datasets.
The results showcase the universality and superior performance of PURLS, surpassing prior skeleton-based solutions and standard baselines from other domains.
arXiv Detail & Related papers (2024-06-19T08:22:32Z) - Spatial Semantic Recurrent Mining for Referring Image Segmentation [63.34997546393106]
We propose Stextsuperscript2RM to achieve high-quality cross-modality fusion.
It follows a working strategy of trilogy: distributing language feature, spatial semantic recurrent coparsing, and parsed-semantic balancing.
Our proposed method performs favorably against other state-of-the-art algorithms.
arXiv Detail & Related papers (2024-05-15T00:17:48Z) - High-Fidelity Speech Synthesis with Minimal Supervision: All Using
Diffusion Models [56.00939852727501]
Minimally-supervised speech synthesis decouples TTS by combining two types of discrete speech representations.
Non-autoregressive framework enhances controllability, and duration diffusion model enables diversified prosodic expression.
arXiv Detail & Related papers (2023-09-27T09:27:03Z) - MaskDiffusion: Boosting Text-to-Image Consistency with Conditional Mask [84.84034179136458]
A crucial factor leading to the text-image mismatch issue is the inadequate cross-modality relation learning.
We propose an adaptive mask, which is conditioned on the attention maps and the prompt embeddings, to dynamically adjust the contribution of each text token to the image features.
Our method, termed MaskDiffusion, is training-free and hot-pluggable for popular pre-trained diffusion models.
arXiv Detail & Related papers (2023-09-08T15:53:37Z) - Three ways to improve feature alignment for open vocabulary detection [88.65076922242184]
Key problem in zero-shot open vocabulary detection is how to align visual and text features, so that the detector performs well on unseen classes.
Previous approaches train the feature pyramid and detection head from scratch, which breaks the vision-text feature alignment established during pretraining.
We propose three methods to alleviate these issues. Firstly, a simple scheme is used to augment the text embeddings which prevents overfitting to a small number of classes seen during training.
Secondly, the feature pyramid network and the detection head are modified to include trainable shortcuts.
Finally, a self-training approach is used to leverage a larger corpus of
arXiv Detail & Related papers (2023-03-23T17:59:53Z) - Self-supervised Action Representation Learning from Partial
Spatio-Temporal Skeleton Sequences [29.376328807860993]
We propose a Partial Spatio-Temporal Learning (PSTL) framework to exploit the local relationship between different skeleton joints and video frames.
Our method achieves state-of-the-art performance on NTURGB+D 60, NTURGBMM+D 120 and PKU-D under various downstream tasks.
arXiv Detail & Related papers (2023-02-17T17:35:05Z) - Self-supervised Character-to-Character Distillation for Text Recognition [54.12490492265583]
We propose a novel self-supervised Character-to-Character Distillation method, CCD, which enables versatile augmentations to facilitate text representation learning.
CCD achieves state-of-the-art results, with average performance gains of 1.38% in text recognition, 1.7% in text segmentation, 0.24 dB (PSNR) and 0.0321 (SSIM) in text super-resolution.
arXiv Detail & Related papers (2022-11-01T05:48:18Z) - Contrastive Learning from Spatio-Temporal Mixed Skeleton Sequences for
Self-Supervised Skeleton-Based Action Recognition [21.546894064451898]
We show that directly extending contrastive pairs based on normal augmentations brings limited returns in terms of performance.
We propose SkeleMixCLR: a contrastive learning framework with atemporal skeleton mixing augmentation (SkeleMix) to complement current contrastive learning approaches.
arXiv Detail & Related papers (2022-07-07T03:18:09Z) - SimMC: Simple Masked Contrastive Learning of Skeleton Representations
for Unsupervised Person Re-Identification [63.903237777588316]
We present a generic Simple Masked Contrastive learning (SimMC) framework to learn effective representations from unlabeled 3D skeletons for person re-ID.
Specifically, to fully exploit skeleton features within each skeleton sequence, we first devise a masked prototype contrastive learning (MPC) scheme.
Then, we propose the masked intra-sequence contrastive learning (MIC) to capture intra-sequence pattern consistency between subsequences.
arXiv Detail & Related papers (2022-04-21T00:19:38Z) - Contextualized Semantic Distance between Highly Overlapped Texts [85.1541170468617]
Overlapping frequently occurs in paired texts in natural language processing tasks like text editing and semantic similarity evaluation.
This paper aims to address the issue with a mask-and-predict strategy.
We take the words in the longest common sequence as neighboring words and use masked language modeling (MLM) to predict the distributions on their positions.
Experiments on Semantic Textual Similarity show NDD to be more sensitive to various semantic differences, especially on highly overlapped paired texts.
arXiv Detail & Related papers (2021-10-04T03:59:15Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.