StgcDiff: Spatial-Temporal Graph Condition Diffusion for Sign Language Transition Generation
- URL: http://arxiv.org/abs/2506.13156v1
- Date: Mon, 16 Jun 2025 07:09:51 GMT
- Title: StgcDiff: Spatial-Temporal Graph Condition Diffusion for Sign Language Transition Generation
- Authors: Jiashu He, Jiayi He, Shengeng Tang, Huixia Ben, Lechao Cheng, Richang Hong
- Abstract summary: We propose StgcDiff, a graph-based conditional diffusion framework that generates smooth transitions between discrete signs. Specifically, we train an encoder-decoder architecture to learn a structure-aware representation of spatial-temporal skeleton sequences. We design the Sign-GCN module as the key component in our framework, which effectively models the spatial-temporal features.
- Score: 33.695308849489784
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Sign language transition generation seeks to convert discrete sign language segments into continuous sign videos by synthesizing smooth transitions. However, most existing methods merely concatenate isolated signs, resulting in poor visual coherence and semantic accuracy in the generated videos. Unlike textual languages, sign language is inherently rich in spatial-temporal cues, making it more complex to model. To address this, we propose StgcDiff, a graph-based conditional diffusion framework that generates smooth transitions between discrete signs by capturing the unique spatial-temporal dependencies of sign language. Specifically, we first train an encoder-decoder architecture to learn a structure-aware representation of spatial-temporal skeleton sequences. Next, we optimize a diffusion denoiser conditioned on the representations learned by the pre-trained encoder, which is tasked with predicting transition frames from noise. Additionally, we design the Sign-GCN module as the key component in our framework, which effectively models the spatial-temporal features. Extensive experiments conducted on the PHOENIX14T, USTC-CSL100, and USTC-SLR500 datasets demonstrate the superior performance of our method.
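To make the pipeline concrete, below is a minimal PyTorch sketch of a Sign-GCN-style block: a spatial graph convolution over skeleton joints followed by a temporal convolution over frames. All names, shapes, and the adjacency construction are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class SignGCNBlock(nn.Module):
    """Hypothetical Sign-GCN-style block: mix channels, propagate features
    over the skeleton graph, then convolve over time (shapes assumed)."""

    def __init__(self, in_ch, out_ch, adj, t_kernel=9):
        super().__init__()
        self.register_buffer("adj", adj)             # (J, J) normalized adjacency
        self.spatial = nn.Conv2d(in_ch, out_ch, 1)   # per-joint channel mixing
        self.temporal = nn.Conv2d(out_ch, out_ch, (t_kernel, 1),
                                  padding=(t_kernel // 2, 0))
        self.act = nn.ReLU()

    def forward(self, x):                            # x: (B, C, T, J)
        x = self.spatial(x)
        x = torch.einsum("bctj,jk->bctk", x, self.adj)  # spread over joints
        return self.act(self.temporal(x))               # temporal dependencies

# Toy usage: 17-joint skeleton, 3D coordinates, 32 frames
J = 17
adj = torch.eye(J) + torch.rand(J, J).gt(0.8).float()  # stand-in skeleton graph
adj = adj / adj.sum(-1, keepdim=True)                  # row-normalize
out = SignGCNBlock(3, 64, adj)(torch.randn(2, 3, 32, J))  # -> (2, 64, 32, 17)
```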
Related papers
- AutoSign: Direct Pose-to-Text Translation for Continuous Sign Language Recognition [0.0]
Continuously recognizing sign gestures and converting them to glosses plays a key role in bridging the gap between hearing and hearing-impaired communities.
We propose AutoSign, an autoregressive decoder-only transformer that directly translates pose sequences to natural language text.
By eliminating the multi-stage pipeline, AutoSign achieves substantial improvements on the Isharah-1000 dataset.
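As a rough illustration of the decoder-only idea (not AutoSign's actual code), pose frames can be projected into the token embedding space and used as a prefix for autoregressive text generation; every dimension and module choice below is an assumption.

```python
import torch
import torch.nn as nn

class PoseToTextDecoder(nn.Module):
    """Hypothetical pose-to-text decoder: pose frames become a prefix and
    text tokens are predicted autoregressively under a causal mask."""

    def __init__(self, pose_dim=150, d_model=256, vocab=1000, layers=4):
        super().__init__()
        self.pose_proj = nn.Linear(pose_dim, d_model)   # pose -> token space
        self.tok_emb = nn.Embedding(vocab, d_model)
        block = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.blocks = nn.TransformerEncoder(block, layers)
        self.lm_head = nn.Linear(d_model, vocab)

    def forward(self, poses, tokens):                # poses: (B, Tp, pose_dim)
        x = torch.cat([self.pose_proj(poses),        # tokens: (B, Tt)
                       self.tok_emb(tokens)], dim=1)
        L = x.size(1)                                # causal attention mask
        mask = torch.triu(torch.full((L, L), float("-inf")), diagonal=1)
        h = self.blocks(x, mask=mask)
        return self.lm_head(h[:, poses.size(1):])    # logits at text positions

logits = PoseToTextDecoder()(torch.randn(2, 10, 150),
                             torch.randint(0, 1000, (2, 5)))
```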
arXiv Detail & Related papers (2025-07-26T07:28:33Z) - Advanced Sign Language Video Generation with Compressed and Quantized Multi-Condition Tokenization [13.619845845897947]
SignViP is a novel framework that incorporates multiple fine-grained conditions for improved generation fidelity.
SignViP achieves state-of-the-art performance across metrics, including video quality, temporal coherence, and semantic fidelity.
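The tokenization step might look something like a vector quantizer applied per condition stream; this is a generic VQ sketch under assumed shapes, not SignViP's actual tokenizer.

```python
import torch
import torch.nn as nn

class ConditionQuantizer(nn.Module):
    """Generic VQ sketch: map each fine-grained condition feature (e.g.
    pose or hand shape) to its nearest codebook entry, yielding discrete
    tokens a generator can consume."""

    def __init__(self, dim=64, codes=512):
        super().__init__()
        self.codebook = nn.Embedding(codes, dim)

    def forward(self, z):                              # z: (B, T, dim)
        d = (z.unsqueeze(-2) - self.codebook.weight).pow(2).sum(-1)
        idx = d.argmin(-1)                             # (B, T) token ids
        return self.codebook(idx), idx                 # quantized feats + ids

quantized, ids = ConditionQuantizer()(torch.randn(2, 16, 64))
```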
arXiv Detail & Related papers (2025-06-19T02:56:06Z) - Signs as Tokens: A Retrieval-Enhanced Multilingual Sign Language Generator [55.94334001112357]
We introduce a multilingual sign language model, Signs as Tokens (SOKE), which can generate 3D sign avatars autoregressively from text inputs.
We propose a retrieval-enhanced SLG approach, which incorporates external sign dictionaries to provide accurate word-level signs.
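A dictionary retrieval step of this kind can be sketched as a nearest-neighbor lookup from a word embedding to stored word-level sign clips; the data layout below is invented for illustration.

```python
import torch
import torch.nn.functional as F

def retrieve_sign(word_emb, dict_keys, dict_signs, k=1):
    """Hypothetical lookup: return the stored sign clip(s) whose key
    embedding is most similar to the query word embedding."""
    sims = F.cosine_similarity(word_emb.unsqueeze(0), dict_keys, dim=-1)
    return [dict_signs[i] for i in sims.topk(k).indices.tolist()]

# Toy dictionary: 100 entries, 32-dim keys, each sign a (T, J, 3) pose clip
keys = torch.randn(100, 32)
signs = [torch.randn(16, 17, 3) for _ in range(100)]
best = retrieve_sign(torch.randn(32), keys, signs)
```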
arXiv Detail & Related papers (2024-11-26T18:28:09Z) - Discrete to Continuous: Generating Smooth Transition Poses from Sign Language Observation [45.214169930573775]
We propose a conditional diffusion model to synthesize contextually smooth transition frames.
Our approach transforms the unsupervised problem of transition frame generation into a supervised training task.
Experiments on the PHOENIX14T, USTC-CSL100, and USTC-SLR500 datasets demonstrate the effectiveness of our method.
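The supervision recipe can be illustrated by carving samples out of continuous sequences: hide a span of real frames between two context windows and train the model to reconstruct it. This is a sketch of the general recipe, not the paper's exact data pipeline.

```python
import torch

def carve_transition(seq, ctx_len=8, gap_len=4):
    """Turn unsupervised transition generation into a supervised task:
    the frames between two observed context windows become the target."""
    prefix = seq[:ctx_len]
    target = seq[ctx_len:ctx_len + gap_len]            # "missing" transition
    suffix = seq[ctx_len + gap_len:ctx_len + gap_len + ctx_len]
    return (prefix, suffix), target

# Toy continuous clip: 24 frames, 17 joints, 3D coordinates
(ctx_a, ctx_b), tgt = carve_transition(torch.randn(24, 17, 3))
```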
arXiv Detail & Related papers (2024-11-25T15:06:49Z) - MS2SL: Multimodal Spoken Data-Driven Continuous Sign Language Production [93.32354378820648]
We propose a unified framework for continuous sign language production, easing communication between sign and non-sign language users.
A sequence diffusion model, utilizing embeddings extracted from text or speech, is crafted to generate sign predictions step by step.
Experiments on How2Sign and PHOENIX14T datasets demonstrate that our model achieves competitive performance in sign language production.
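Step-by-step generation of this sort usually follows a standard diffusion sampling loop; the sketch below uses a deterministic DDIM-style update with a placeholder denoiser and schedule, so none of it should be read as MS2SL's actual implementation.

```python
import torch

@torch.no_grad()
def sample_signs(denoiser, cond, steps=50, shape=(1, 32, 150)):
    """Generic conditional diffusion sampler: start from noise and let a
    denoiser conditioned on text/speech embeddings refine sign poses."""
    x = torch.randn(shape)
    betas = torch.linspace(1e-4, 0.02, steps)
    alphas = torch.cumprod(1.0 - betas, dim=0)
    for t in reversed(range(steps)):
        eps = denoiser(x, torch.tensor([t]), cond)     # predicted noise
        a = alphas[t]
        a_prev = alphas[t - 1] if t > 0 else torch.tensor(1.0)
        x0 = (x - (1 - a).sqrt() * eps) / a.sqrt()     # estimate clean poses
        x = a_prev.sqrt() * x0 + (1 - a_prev).sqrt() * eps
    return x

# Dummy denoiser just to show the call signature
poses = sample_signs(lambda x, t, c: torch.zeros_like(x),
                     cond=torch.randn(1, 77, 256))
```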
arXiv Detail & Related papers (2024-07-04T13:53:50Z) - Enhancing Video-Language Representations with Structural Spatio-Temporal Alignment [130.15775113897553]
Finsta is a fine-grained structural spatio-temporal alignment learning method.
It consistently improves 13 existing strong-performing video-language models.
arXiv Detail & Related papers (2024-06-27T15:23:36Z) - Multi-Modal Zero-Shot Sign Language Recognition [51.07720650677784]
We propose a multi-modal Zero-Shot Sign Language Recognition model.
A Transformer-based model along with a C3D model is used for hand detection and deep features extraction.
A semantic space is used to map the visual features to the lingual embedding of the class labels.
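Mapping visual features to label embeddings is the classic zero-shot recipe: score a feature against every class's text embedding and take the most similar, so unseen classes only need an embedding. A minimal sketch (the projection into the semantic space is assumed already trained):

```python
import torch
import torch.nn.functional as F

def zero_shot_classify(visual_feat, label_embs):
    """Pick the class whose label embedding is most similar to the
    (already projected) visual feature."""
    sims = F.cosine_similarity(visual_feat.unsqueeze(0), label_embs, dim=-1)
    return sims.argmax().item()

# Toy semantic space: 300-dim embeddings for 20 sign classes
labels = F.normalize(torch.randn(20, 300), dim=-1)
pred = zero_shot_classify(torch.randn(300), labels)
```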
arXiv Detail & Related papers (2021-09-02T09:10:39Z) - PiSLTRc: Position-informed Sign Language Transformer with Content-aware Convolution [0.42970700836450487]
We propose a new model architecture, namely PiSLTRc, with two distinctive characteristics.
We explicitly select relevant features using a novel content-aware neighborhood gathering method.
We aggregate these features with position-informed temporal convolution layers, thus generating robust neighborhood-enhanced sign representation.
Compared with the vanilla Transformer model, our model performs consistently better on three large-scale sign language benchmarks.
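A guess at what content-aware neighborhood gathering could look like in code: instead of a fixed local window, each frame gathers its k most similar frames by feature similarity (the paper's actual mechanism may differ).

```python
import torch

def gather_neighbors(feats, k=5):
    """For every frame, gather the k most feature-similar frames."""
    sim = feats @ feats.transpose(-1, -2)      # (B, T, T) frame similarity
    idx = sim.topk(k, dim=-1).indices          # (B, T, k) neighbor indices
    batch = torch.arange(feats.size(0))[:, None, None]
    return feats[batch, idx]                   # (B, T, k, C) gathered feats

neighbors = gather_neighbors(torch.randn(2, 16, 64))
```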
arXiv Detail & Related papers (2021-07-27T05:01:27Z) - RealTranS: End-to-End Simultaneous Speech Translation with Convolutional Weighted-Shrinking Transformer [33.876412404781846]
RealTranS is an end-to-end model for simultaneous speech translation.
It maps speech features into text space with a weighted-shrinking operation and a semantic encoder.
Experiments show that RealTranS with the Wait-K-Stride-N strategy outperforms prior end-to-end models.
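One plausible reading of Wait-K-Stride-N, sketched as a read/write schedule: read k source chunks up front, then write n target tokens for every n further chunks read. This is an interpretation for illustration; the paper's exact policy may differ.

```python
def wait_k_stride_n(src_len, tgt_len, k=3, n=2):
    """Emit a READ/WRITE action schedule for simultaneous translation."""
    actions, read, written = [], 0, 0
    while written < tgt_len:
        goal = min(k + (written // n) * n, src_len)
        while read < goal:                     # catch up on source chunks
            actions.append("READ")
            read += 1
        actions.append("WRITE")                # emit one target token
        written += 1
    return actions

print(wait_k_stride_n(src_len=8, tgt_len=6))   # READ x3, then interleaved
```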
arXiv Detail & Related papers (2021-06-09T06:35:46Z) - Pose-based Sign Language Recognition using GCN and BERT [0.0]
Word-level sign language recognition (WSLR) is the first important step towards understanding and interpreting sign language.
Recognizing signs from videos is a challenging task, as the meaning of a word depends on a combination of subtle body motions, hand configurations, and other movements.
Recent pose-based architectures for WSLR either model both the spatial and temporal dependencies among the poses in different frames simultaneously, or model only the temporal information without fully utilizing the spatial information.
We tackle the problem of WSLR using a novel pose-based approach, which captures spatial and temporal information separately and performs late fusion.
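Late fusion here means combining decisions rather than features: each branch produces class scores that are merged at the end. The branches below are lightweight stand-ins for the paper's GCN and BERT components, purely to show the fusion pattern.

```python
import torch
import torch.nn as nn

class LateFusionWSLR(nn.Module):
    """Late-fusion sketch: a spatial branch scores per-frame pose cues, a
    temporal branch scores dynamics; their class scores are averaged."""

    def __init__(self, feat_dim=150, classes=100):
        super().__init__()
        self.spatial_head = nn.Linear(feat_dim, classes)      # GCN stand-in
        self.temporal_head = nn.GRU(feat_dim, classes,
                                    batch_first=True)         # BERT stand-in

    def forward(self, poses):                  # poses: (B, T, feat_dim)
        s = self.spatial_head(poses).mean(1)   # pool per-frame scores
        t, _ = self.temporal_head(poses)
        return 0.5 * s + 0.5 * t[:, -1]        # fuse decisions, not features

logits = LateFusionWSLR()(torch.randn(2, 16, 150))
```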
arXiv Detail & Related papers (2020-12-01T19:10:50Z) - TSPNet: Hierarchical Feature Learning via Temporal Semantic Pyramid for Sign Language Translation [101.6042317204022]
Sign language translation (SLT) aims to interpret sign video sequences into text-based natural language sentences.
Existing SLT models usually represent sign visual features in a frame-wise manner.
We develop a novel hierarchical sign video feature learning method via a temporal semantic pyramid network, called TSPNet.
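A temporal pyramid can be sketched as pooling frame features over sliding windows at several scales, so segment features cover different temporal extents; the window sizes and pooling below are assumptions, not TSPNet's exact design.

```python
import torch

def temporal_pyramid(feats, windows=(8, 12, 16), stride=2):
    """Pool frame features over multi-scale sliding windows, producing
    one set of segment features per scale."""
    levels = []
    for w in windows:
        segs = feats.unfold(1, w, stride)      # (B, S, C, w) windows
        levels.append(segs.mean(-1))           # average-pool each window
    return levels

levels = temporal_pyramid(torch.randn(2, 64, 256))  # 3 scales of segments
```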
arXiv Detail & Related papers (2020-10-12T05:58:09Z)