SignAligner: Harmonizing Complementary Pose Modalities for Coherent Sign Language Generation
- URL: http://arxiv.org/abs/2506.11621v1
- Date: Fri, 13 Jun 2025 09:44:42 GMT
- Title: SignAligner: Harmonizing Complementary Pose Modalities for Coherent Sign Language Generation
- Authors: Xu Wang, Shengeng Tang, Lechao Cheng, Feng Li, Shuo Wang, Richang Hong
- Abstract summary: We introduce PHOENIX14T+, an extended version of the widely-used RWTH-PHOENIX-Weather 2014T dataset, featuring three new sign representations: Pose, Hamer and Smplerx. We also propose a novel method, SignAligner, for realistic sign language generation, consisting of three stages: text-driven pose modalities co-generation, online collaborative correction of multimodality, and realistic sign video synthesis.
- Score: 41.240893601941536
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Sign language generation aims to produce diverse sign representations based on spoken language. However, achieving realistic and naturalistic generation remains a significant challenge due to the complexity of sign language, which encompasses intricate hand gestures, facial expressions, and body movements. In this work, we introduce PHOENIX14T+, an extended version of the widely-used RWTH-PHOENIX-Weather 2014T dataset, featuring three new sign representations: Pose, Hamer and Smplerx. We also propose a novel method, SignAligner, for realistic sign language generation, consisting of three stages: text-driven pose modalities co-generation, online collaborative correction of multimodality, and realistic sign video synthesis. First, by incorporating text semantics, we design a joint sign language generator to simultaneously produce posture coordinates, gesture actions, and body movements. The text encoder, based on a Transformer architecture, extracts semantic features, while a cross-modal attention mechanism integrates these features to generate diverse sign language representations, ensuring accurate mapping and controlling the diversity of modal features. Next, online collaborative correction is introduced to refine the generated pose modalities using a dynamic loss weighting strategy and cross-modal attention, facilitating the complementarity of information across modalities, eliminating spatiotemporal conflicts, and ensuring semantic coherence and action consistency. Finally, the corrected pose modalities are fed into a pre-trained video generation network to produce high-fidelity sign language videos. Extensive experiments demonstrate that SignAligner significantly improves both the accuracy and expressiveness of the generated sign videos.
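The abstract describes the two learnable stages precisely enough to sketch. Below is a minimal PyTorch sketch of text-driven co-generation (one shared Transformer text encoder feeding three modality decoders through cross-modal attention) and one plausible form of the dynamic loss weighting; all module names, tensor shapes, and the weighting rule are assumptions for illustration, not the authors' released code.

```python
# Hedged sketch of SignAligner's first two stages; shapes and names assumed.
import torch
import torch.nn as nn

class PoseCoGenerator(nn.Module):
    """Text-driven co-generation: a shared text encoder feeds three
    modality decoders (pose / hand / body) via cross-modal attention."""
    def __init__(self, d_model=256, n_heads=4, out_dims=(50, 42, 85)):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.text_encoder = nn.TransformerEncoder(layer, num_layers=2)
        # One learned query per modality, broadcast over output frames.
        self.queries = nn.ParameterList(
            nn.Parameter(torch.randn(1, 1, d_model)) for _ in out_dims)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.heads = nn.ModuleList(nn.Linear(d_model, d) for d in out_dims)

    def forward(self, text_emb, n_frames):
        # text_emb: (B, L, d_model) pre-embedded spoken-language tokens
        sem = self.text_encoder(text_emb)
        outputs = []
        for q, head in zip(self.queries, self.heads):
            q_seq = q.expand(text_emb.size(0), n_frames, -1)
            fused, _ = self.attn(q_seq, sem, sem)   # cross-modal attention
            outputs.append(head(fused))             # (B, T, joint_dim)
        return outputs  # [pose, hand (Hamer-style), body (Smplerx-style)]

def dynamic_weighted_loss(losses):
    # One plausible "dynamic loss weighting": weight each modality's loss
    # by its current share of the total, so the worst-fitting modality
    # receives more gradient. The paper's exact scheme may differ.
    totals = [l.detach() for l in losses]
    s = sum(totals)
    return sum((t / s) * l for t, l in zip(totals, losses))
```

The third stage would then feed the corrected pose, hand, and body streams into the pre-trained video generator, which this sketch omits.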
Related papers
- Motion is the Choreographer: Learning Latent Pose Dynamics for Seamless Sign Language Generation [24.324964949728045]
We propose a new paradigm for sign language video generation that decouples motion semantics from signer identity.
First, we construct a signer-independent multimodal motion lexicon, where each gloss is stored as identity-agnostic pose, gesture, and 3D mesh sequences.
This compact representation enables our second key innovation: a discrete-to-continuous motion synthesis stage that transforms retrieved gloss sequences into temporally coherent motion trajectories.
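A short sketch of this retrieve-then-smooth idea, assuming the lexicon maps each gloss to an identity-agnostic pose clip; the linear cross-fade at the seams is an illustrative stand-in for the paper's discrete-to-continuous synthesis stage, not its actual method.

```python
import numpy as np

def synthesize(glosses, lexicon, blend=5):
    """Concatenate per-gloss pose clips and cross-fade at the seams to
    obtain a temporally coherent trajectory. Each clip: (T_i, J, 3),
    with every T_i assumed to be at least `blend` frames."""
    clips = [lexicon[g] for g in glosses]
    out = clips[0]
    for nxt in clips[1:]:
        alpha = np.linspace(0, 1, blend)[:, None, None]
        seam = (1 - alpha) * out[-blend:] + alpha * nxt[:blend]
        out = np.concatenate([out[:-blend], seam, nxt[blend:]], axis=0)
    return out
```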
arXiv Detail & Related papers (2025-08-06T03:23:10Z)
- DeepGesture: A conversational gesture synthesis system based on emotions and semantics [0.0]
DeepGesture is a diffusion-based gesture synthesis framework.
It generates expressive co-speech gestures conditioned on multimodal signals.
We show that DeepGesture produces gestures with improved human-likeness and contextual appropriateness.
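For readers unfamiliar with diffusion-based gesture synthesis, here is a generic DDPM-style reverse loop conditioned on a fused multimodal embedding; the noise schedule and the `model` interface are textbook assumptions, not DeepGesture's implementation.

```python
import torch

@torch.no_grad()
def sample_gestures(model, cond, T=50, shape=(1, 120, 64)):
    # cond: fused speech/text/emotion embedding; model predicts the noise.
    betas = torch.linspace(1e-4, 0.02, T)
    alphas = 1.0 - betas
    alpha_bar = torch.cumprod(alphas, dim=0)
    x = torch.randn(shape)                      # start from pure noise
    for t in reversed(range(T)):
        eps = model(x, torch.tensor([t]), cond)
        x = (x - betas[t] / torch.sqrt(1 - alpha_bar[t]) * eps) \
            / torch.sqrt(alphas[t])
        if t > 0:                               # add noise except at t=0
            x = x + torch.sqrt(betas[t]) * torch.randn_like(x)
    return x  # (batch, frames, gesture_dim)
```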
arXiv Detail & Related papers (2025-07-03T20:04:04Z)
- HOP: Heterogeneous Topology-based Multimodal Entanglement for Co-Speech Gesture Generation [42.30003982604611]
Co-speech gestures are crucial non-verbal cues that enhance speech clarity and expressiveness in human communication.
We propose a novel method named HOP for co-speech gesture generation, capturing heterogeneous entanglement between gesture motion, audio rhythm, and text semantics.
HOP achieves state-of-the-art performance, offering more natural and expressive co-speech gesture generation.
arXiv Detail & Related papers (2025-03-03T04:47:39Z)
- Signs as Tokens: A Retrieval-Enhanced Multilingual Sign Language Generator [55.94334001112357]
We introduce a multilingual sign language model, Signs as Tokens (SOKE), which can generate 3D sign avatars autoregressively from text inputs.
We propose a retrieval-enhanced SLG approach, which incorporates external sign dictionaries to provide accurate word-level signs.
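At its core, the retrieval enhancement reduces to a lookup-with-fallback pattern; this toy sketch uses hypothetical `dictionary` and `generator` interfaces to show the idea, not SOKE's actual pipeline.

```python
def generate_sign_sequence(words, dictionary, generator):
    """Prefer an exact dictionary sign when one exists; otherwise fall
    back to the autoregressive generator."""
    tokens = []
    for word in words:
        if word in dictionary:
            tokens.extend(dictionary[word])        # accurate word-level sign
        else:
            tokens.extend(generator.decode(word))  # model fallback
    return tokens
```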
arXiv Detail & Related papers (2024-11-26T18:28:09Z)
- MS2SL: Multimodal Spoken Data-Driven Continuous Sign Language Production [93.32354378820648]
We propose a unified framework for continuous sign language production, easing communication between sign and non-sign language users.
A sequence diffusion model, utilizing embeddings extracted from text or speech, is crafted to generate sign predictions step by step.
Experiments on How2Sign and PHOENIX14T datasets demonstrate that our model achieves competitive performance in sign language production.
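A sequence diffusion model of this kind is typically trained by noising ground-truth sign sequences and regressing the noise, conditioned on the text or speech embedding; the sketch below shows that standard training step under assumed shapes and is not MS2SL's actual code.

```python
import torch
import torch.nn.functional as F

def diffusion_train_step(model, x0, cond, alpha_bar):
    """x0: (B, T, D) ground-truth sign sequence; cond: text/speech
    embedding; alpha_bar: (T_steps,) cumulative noise schedule."""
    t = torch.randint(0, len(alpha_bar), (x0.size(0),))  # random step
    noise = torch.randn_like(x0)
    a = alpha_bar[t].view(-1, 1, 1)
    x_t = torch.sqrt(a) * x0 + torch.sqrt(1 - a) * noise  # forward noising
    return F.mse_loss(model(x_t, t, cond), noise)         # regress noise
```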
arXiv Detail & Related papers (2024-07-04T13:53:50Z)
- Self-Supervised Representation Learning with Spatial-Temporal Consistency for Sign Language Recognition [96.62264528407863]
We propose a self-supervised contrastive learning framework to excavate rich context via spatial-temporal consistency.
Inspired by the complementary property of motion and joint modalities, we first introduce first-order motion information into sign language modeling.
Our method is evaluated with extensive experiments on four public benchmarks, and achieves new state-of-the-art performance with a notable margin.
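"First-order motion information" plausibly means temporal differences of joint positions; the one-liner below (NumPy, hypothetical shapes) shows how such a motion modality can be derived from the joint modality and then fed to the model alongside it.

```python
import numpy as np

def first_order_motion(joints):
    # joints: (T, J, 2) keypoint sequence -> (T-1, J, 2) frame-to-frame
    # velocities, a complementary modality to the raw joint positions.
    return np.diff(joints, axis=0)
```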
arXiv Detail & Related papers (2024-06-15T04:50:19Z)
- SignAvatar: Sign Language 3D Motion Reconstruction and Generation [10.342253593687781]
SignAvatar is a framework capable of both word-level sign language reconstruction and generation.
We contribute the ASL3DWord dataset, composed of 3D joint rotation data for the body, hands, and face.
arXiv Detail & Related papers (2024-05-13T17:48:22Z)
- Improving Joint Speech-Text Representations Without Alignment [92.60384956736536]
We show that joint speech-text encoders naturally achieve consistent representations across modalities by disregarding sequence length.
We argue that consistency losses could forgive length differences and simply assume the best alignment.
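One way a consistency loss can disregard sequence length is to pool each modality's sequence to a single vector before comparing; the snippet below is a hedged illustration of that idea, not the paper's exact objective.

```python
import torch.nn.functional as F

def modality_consistency_loss(speech_seq, text_seq):
    # speech_seq: (B, T_s, D), text_seq: (B, T_t, D); lengths may differ.
    # Mean-pooling removes the length mismatch before the comparison.
    return F.mse_loss(speech_seq.mean(dim=1), text_seq.mean(dim=1))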
arXiv Detail & Related papers (2023-08-11T13:28:48Z)
- Cross-modality Data Augmentation for End-to-End Sign Language Translation [66.46877279084083]
End-to-end sign language translation (SLT) aims to convert sign language videos into spoken language texts directly without intermediate representations.
It has been a challenging task due to the modality gap between sign videos and texts and the scarcity of labeled data.
We propose a novel Cross-modality Data Augmentation (XmDA) framework to transfer the powerful gloss-to-text translation capabilities to end-to-end sign language translation.
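One plausible reading of "transferring gloss-to-text capability" is distillation: a pretrained gloss-to-text teacher supervises the end-to-end student on the same targets. The sketch below shows that flavor with hypothetical tensor shapes; XmDA's actual mechanisms may differ.

```python
import torch.nn.functional as F

def xmda_style_loss(student_logits, teacher_logits, labels, alpha=0.5):
    """student_logits, teacher_logits: (B, T, V); labels: (B, T).
    Cross-entropy on gold text plus KL toward the gloss-to-text teacher."""
    ce = F.cross_entropy(student_logits.flatten(0, 1), labels.flatten())
    kd = F.kl_div(F.log_softmax(student_logits, dim=-1),
                  F.softmax(teacher_logits, dim=-1), reduction='batchmean')
    return (1 - alpha) * ce + alpha * kd
```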
arXiv Detail & Related papers (2023-05-18T16:34:18Z)
- Everybody Sign Now: Translating Spoken Language to Photo Realistic Sign Language Video [43.45785951443149]
To be truly understandable by Deaf communities, an automatic Sign Language Production system must generate a photo-realistic signer.
We propose SignGAN, the first SLP model to produce photo-realistic continuous sign language videos directly from spoken language.
A pose-conditioned human synthesis model is then introduced to generate a photo-realistic sign language video from the skeletal pose sequence.
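The pose-conditioned synthesis stage reduces to rendering one frame per skeletal pose given a target-signer appearance code; `generator` here is a hypothetical interface, not SignGAN's released model.

```python
import torch

def render_video(generator, pose_seq, style):
    # pose_seq: (T, J, 2) skeletal poses; style: appearance embedding of
    # the target signer. Each frame is synthesized independently from its
    # pose, conditioned on the shared appearance code.
    frames = [generator(p.unsqueeze(0), style) for p in pose_seq]
    return torch.cat(frames, dim=0)  # (T, 3, H, W)
```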
arXiv Detail & Related papers (2020-11-19T14:31:06Z)