Motion is the Choreographer: Learning Latent Pose Dynamics for Seamless Sign Language Generation
- URL: http://arxiv.org/abs/2508.04049v1
- Date: Wed, 06 Aug 2025 03:23:10 GMT
- Title: Motion is the Choreographer: Learning Latent Pose Dynamics for Seamless Sign Language Generation
- Authors: Jiayi He, Xu Wang, Shengeng Tang, Yaxiong Wang, Lechao Cheng, Dan Guo,
- Abstract summary: We propose a new paradigm for sign language video generation that decouples motion semantics from signer identity. First, we construct a signer-independent multimodal motion lexicon, where each gloss is stored as identity-agnostic pose, gesture, and 3D mesh sequences. This compact representation enables our second key innovation: a discrete-to-continuous motion synthesis stage that transforms retrieved gloss sequences into temporally coherent motion trajectories.
- Score: 24.324964949728045
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Sign language video generation requires producing natural signing motions with realistic appearances under precise semantic control, yet faces two critical challenges: excessive signer-specific data requirements and poor generalization. We propose a new paradigm for sign language video generation that decouples motion semantics from signer identity through a two-phase synthesis framework. First, we construct a signer-independent multimodal motion lexicon, where each gloss is stored as identity-agnostic pose, gesture, and 3D mesh sequences, requiring only one recording per sign. This compact representation enables our second key innovation: a discrete-to-continuous motion synthesis stage that transforms retrieved gloss sequences into temporally coherent motion trajectories, followed by identity-aware neural rendering to produce photorealistic videos of arbitrary signers. Unlike prior work constrained by signer-specific datasets, our method treats motion as a first-class citizen: the learned latent pose dynamics serve as a portable "choreography layer" that can be visually realized through different human appearances. Extensive experiments demonstrate that disentangling motion from identity is not just viable but advantageous - enabling both high-quality synthesis and unprecedented flexibility in signer personalization.
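The two-phase pipeline reads as: retrieve an identity-agnostic motion clip per gloss, stitch the discrete clips into one temporally coherent trajectory, then render that trajectory with a chosen signer's appearance. Below is a minimal NumPy sketch of the retrieval-and-stitching stage; the lexicon contents, crossfade length, and rendering hook are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

# Hypothetical signer-independent lexicon: gloss -> (T, J, 3) pose clip.
# The paper's entries also carry gesture and 3D mesh tracks; poses alone
# suffice to illustrate the discrete-to-continuous step.
LEXICON = {
    "HELLO": np.random.rand(30, 21, 3),
    "WORLD": np.random.rand(24, 21, 3),
}

def stitch(glosses, blend=8):
    """Retrieve per-gloss clips and crossfade neighbors into one trajectory."""
    out = LEXICON[glosses[0]].copy()
    for g in glosses[1:]:
        nxt = LEXICON[g]
        k = min(blend, len(out), len(nxt))
        w = np.linspace(0.0, 1.0, k)[:, None, None]  # ramp 0 -> 1 over the seam
        seam = (1 - w) * out[-k:] + w * nxt[:k]      # linear crossfade
        out = np.concatenate([out[:-k], seam, nxt[k:]], axis=0)
    return out  # continuous (T_total, J, 3) motion trajectory

motion = stitch(["HELLO", "WORLD"])
print(motion.shape)  # (46, 21, 3): 30 + 24 frames minus 8 overlapping ones
# An identity-aware renderer (not sketched here) would map `motion` plus a
# target signer's appearance embedding to photorealistic video frames.
```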
Related papers
- Text2Lip: Progressive Lip-Synced Talking Face Generation from Text via Viseme-Guided Rendering [53.2204901422631]
Text2Lip is a viseme-centric framework that constructs an interpretable phonetic-visual bridge.
We show that Text2Lip outperforms existing approaches in semantic fidelity, visual realism, and modality robustness.
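As a rough illustration of what a phonetic-visual bridge means in practice, the sketch below maps phonemes to viseme classes that would condition rendering; the table and its groupings are invented for illustration, not Text2Lip's actual inventory.

```python
# Illustrative phoneme -> viseme lookup: text is first phonemized, then each
# phoneme maps to a mouth-shape class that drives the lip renderer.
PHONEME_TO_VISEME = {
    "p": "BILABIAL", "b": "BILABIAL", "m": "BILABIAL",
    "f": "LABIODENTAL", "v": "LABIODENTAL",
    "aa": "OPEN", "iy": "SPREAD", "uw": "ROUNDED",
}

def text_to_viseme_track(phonemes):
    """Map a phoneme sequence to the viseme track guiding rendering."""
    return [PHONEME_TO_VISEME.get(p, "NEUTRAL") for p in phonemes]

print(text_to_viseme_track(["b", "aa", "m"]))  # ['BILABIAL', 'OPEN', 'BILABIAL']
```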
arXiv Detail & Related papers (2025-08-04T12:50:22Z)
- X-NeMo: Expressive Neural Motion Reenactment via Disentangled Latent Attention [52.94097577075215]
X-NeMo is a zero-shot diffusion-based portrait animation pipeline.
It animates a static portrait using facial movements from a driving video of a different individual.
arXiv Detail & Related papers (2025-07-30T22:46:52Z)
- SignAligner: Harmonizing Complementary Pose Modalities for Coherent Sign Language Generation [41.240893601941536]
We introduce PHOENIX14T+, an extended version of the widely used RWTH-PHOENIX-Weather 2014T dataset, featuring three new sign representations: Pose, HaMeR, and SMPLer-X.
We also propose a novel method, SignAligner, for realistic sign language generation, consisting of three stages: text-driven co-generation of pose modalities, online collaborative correction of the modalities, and realistic sign video synthesis.
arXiv Detail & Related papers (2025-06-13T09:44:42Z)
- EMO2: End-Effector Guided Audio-Driven Avatar Video Generation [17.816939983301474]
We propose a novel audio-driven talking head method capable of simultaneously generating highly expressive facial expressions and hand gestures.
In the first stage, we generate hand poses directly from audio input, leveraging the strong correlation between audio signals and hand movements.
In the second stage, we employ a diffusion model to synthesize video frames, incorporating the hand poses generated in the first stage to produce realistic facial expressions and body movements.
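Schematically, the two stages split as below; the model internals are stubbed out, and all names, shapes, and placeholder outputs are assumptions rather than EMO2's actual interfaces.

```python
import numpy as np

def audio_to_hand_poses(audio_feats):
    """Stage 1 (stub): map audio features (T, F) to hand poses (T, J, 3),
    exploiting the strong audio-to-gesture correlation."""
    return np.zeros((audio_feats.shape[0], 42, 3))  # placeholder: 21 joints/hand

def render_frames(audio_feats, hand_poses):
    """Stage 2 (stub): a diffusion model denoises video frames conditioned
    on the audio and the stage-1 hand poses."""
    return np.zeros((hand_poses.shape[0], 256, 256, 3))  # placeholder frames

audio = np.random.rand(120, 80)   # e.g. 120 mel-spectrogram steps
frames = render_frames(audio, audio_to_hand_poses(audio))
print(frames.shape)               # (120, 256, 256, 3)
```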
arXiv Detail & Related papers (2025-01-18T07:51:29Z)
- MASA: Motion-aware Masked Autoencoder with Semantic Alignment for Sign Language Recognition [94.56755080185732]
We propose a Motion-Aware masked autoencoder with Semantic Alignment (MASA) that integrates rich motion cues and global semantic information.
Our framework can simultaneously learn local motion cues and global semantic features for comprehensive sign language representation.
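Read as an objective, this combines masked pose reconstruction (local motion cues) with alignment of a pooled motion feature to a semantic embedding (global semantics). A hedged PyTorch sketch, where the mask ratio, mean pooling, and loss weight are assumptions:

```python
import torch
import torch.nn.functional as F

def masa_loss(encoder, decoder, poses, text_emb, mask_ratio=0.5, lam=0.1):
    """Masked reconstruction + semantic alignment, schematically.
    poses: (B, T, D) joint features; text_emb: (B, D) gloss/text embedding."""
    mask = torch.rand(poses.shape[:2]).unsqueeze(-1) < mask_ratio  # (B, T, 1)
    feats = encoder(poses * (~mask))               # encode only visible frames
    recon = decoder(feats)                         # reconstruct all frames
    m = mask.expand_as(poses)
    rec_loss = F.mse_loss(recon[m], poses[m])      # score only masked frames
    motion_feat = feats.mean(dim=1)                # (B, D) global motion feature
    align = 1 - F.cosine_similarity(motion_feat, text_emb, dim=-1).mean()
    return rec_loss + lam * align                  # local + global objectives

# Stand-ins for real motion encoder/decoder networks:
enc, dec = torch.nn.Linear(64, 64), torch.nn.Linear(64, 64)
loss = masa_loss(enc, dec, torch.randn(4, 32, 64), torch.randn(4, 64))
```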
arXiv Detail & Related papers (2024-05-31T08:06:05Z)
- AniTalker: Animate Vivid and Diverse Talking Faces through Identity-Decoupled Facial Motion Encoding [24.486705010561067]
The paper introduces AniTalker, a framework designed to generate lifelike talking faces from a single portrait.
AniTalker effectively captures a wide range of facial dynamics, including subtle expressions and head movements.
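The identity-decoupling idea reduces to two encoders whose latents are recombined at render time, so one portrait's identity can carry another clip's motion. A toy sketch; the architecture, fusion by addition, and all dimensions are assumptions:

```python
import torch
import torch.nn as nn

class DecoupledEncoder(nn.Module):
    """Toy split of a portrait into an identity code (who the person is)
    and per-frame motion codes (how the face moves)."""
    def __init__(self, dim=128):
        super().__init__()
        self.id_enc = nn.Linear(3 * 64 * 64, dim)      # from a source portrait
        self.motion_enc = nn.Linear(3 * 64 * 64, dim)  # from driving frames

    def forward(self, portrait, driving):
        ident = self.id_enc(portrait.flatten(1))       # (B, dim)
        motion = self.motion_enc(driving.flatten(2))   # (B, T, dim)
        return ident.unsqueeze(1) + motion             # fused codes for rendering

enc = DecoupledEncoder()
fused = enc(torch.randn(2, 3, 64, 64), torch.randn(2, 8, 3, 64, 64))
print(fused.shape)  # (2, 8, 128)
```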
arXiv Detail & Related papers (2024-05-06T02:32:41Z)
- From Audio to Photoreal Embodiment: Synthesizing Humans in Conversations [107.88375243135579]
Given speech audio, we output multiple possibilities of gestural motion for an individual, including face, body, and hands.
We visualize the generated motion using highly photorealistic avatars that can express crucial nuances in gestures.
Experiments show our model generates appropriate and diverse gestures, outperforming both diffusion- and VQ-only methods.
arXiv Detail & Related papers (2024-01-03T18:55:16Z)
- SpeechAct: Towards Generating Whole-body Motion from Speech [33.10601371020488]
This paper addresses the problem of generating whole-body motion from speech.
We present a novel hybrid point representation to achieve accurate and continuous motion generation.
We also propose a contrastive motion learning method to encourage the model to produce more distinctive representations.
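The contrastive term can be pictured as an InfoNCE objective that pulls together motion embeddings from the same speech clip and pushes apart embeddings from different clips; a minimal sketch, where the temperature and pairing scheme are assumptions:

```python
import torch
import torch.nn.functional as F

def motion_contrastive(anchor, positive, temperature=0.07):
    """InfoNCE over a batch: row i of `anchor` should match row i of
    `positive` (e.g. two motion encodings of the same speech clip)."""
    a = F.normalize(anchor, dim=-1)
    p = F.normalize(positive, dim=-1)
    logits = a @ p.t() / temperature     # (B, B) cosine-similarity matrix
    targets = torch.arange(a.size(0))    # matching pairs sit on the diagonal
    return F.cross_entropy(logits, targets)

loss = motion_contrastive(torch.randn(8, 256), torch.randn(8, 256))
print(loss.item())
```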
arXiv Detail & Related papers (2023-11-29T07:57:30Z)
- MoDi: Unconditional Motion Synthesis from Diverse Data [51.676055380546494]
We present MoDi, an unconditional generative model that synthesizes diverse motions.
Our model is trained in a completely unsupervised setting from a diverse, unstructured and unlabeled motion dataset.
We show that despite the lack of any structure in the dataset, the latent space can be semantically clustered.
arXiv Detail & Related papers (2022-06-16T09:06:25Z)
- Hierarchical Style-based Networks for Motion Synthesis [150.226137503563]
We propose a self-supervised method for generating long-range, diverse, and plausible behaviors that reach a specific goal location.
Our method learns to model human motion by decomposing the long-range generation task in a hierarchical manner.
On a large-scale skeleton dataset, we show that the proposed method can synthesize long-range, diverse, and plausible motion.
arXiv Detail & Related papers (2020-08-24T02:11:02Z)