HoloGest: Decoupled Diffusion and Motion Priors for Generating Holisticly Expressive Co-speech Gestures
- URL: http://arxiv.org/abs/2503.13229v1
- Date: Mon, 17 Mar 2025 14:42:31 GMT
- Title: HoloGest: Decoupled Diffusion and Motion Priors for Generating Holisticly Expressive Co-speech Gestures
- Authors: Yongkang Cheng, Shaoli Huang
- Abstract summary: HoloGest is a novel neural network framework for the automatic generation of high-quality, expressive co-speech gestures. Our system learns a robust prior with low audio dependency and high motion reliance, enabling stable global motion and detailed finger movements. Our model achieves a level of realism close to the ground truth, providing an immersive user experience.
- Score: 8.50717565369252
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Animating virtual characters with holistic co-speech gestures is a challenging but critical task. Previous systems have primarily focused on the weak correlation between audio and gestures, leading to physically unnatural outcomes that degrade the user experience. To address this problem, we introduce HoloGest, a novel neural network framework based on decoupled diffusion and motion priors for the automatic generation of high-quality, expressive co-speech gestures. Our system leverages large-scale human motion datasets to learn a robust prior with low audio dependency and high motion reliance, enabling stable global motion and detailed finger movements. To improve the generation efficiency of diffusion-based models, we integrate implicit joint constraints with explicit geometric and conditional constraints, capturing complex motion distributions between large strides. This integration significantly enhances generation speed while maintaining high-quality motion. Furthermore, we design a shared embedding space for gesture-transcription text alignment, enabling the generation of semantically correct gesture actions. Extensive experiments and user feedback demonstrate the effectiveness and potential applications of our model, with our method achieving a level of realism close to the ground truth, providing an immersive user experience. Our code, model, and demo are available at https://cyk990422.github.io/HoloGest.github.io/.
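To make the decoupling idea concrete, here is a minimal PyTorch sketch of a denoiser that leans on a frozen motion prior and only lightly mixes in audio and transcript conditions, mirroring the "low audio dependency, high motion reliance" design described in the abstract. All module choices, dimensions, and the small conditioning weights are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch (not the HoloGest code): a gesture denoiser built around a
# frozen motion prior, with weak audio/text conditioning. Shapes and weights are assumed.
import torch
import torch.nn as nn

class GestureDenoiser(nn.Module):
    """Predicts clean motion x0 from noisy motion x_t, relying mostly on a motion prior."""
    def __init__(self, motion_dim=165, latent_dim=256, audio_dim=128, text_dim=256):
        super().__init__()
        # Stand-in for a prior pretrained on large-scale motion capture, then frozen.
        self.motion_prior = nn.GRU(motion_dim, latent_dim, batch_first=True)
        for p in self.motion_prior.parameters():
            p.requires_grad = False
        self.time_embed = nn.Sequential(nn.Linear(1, latent_dim), nn.SiLU(),
                                        nn.Linear(latent_dim, latent_dim))
        self.audio_proj = nn.Linear(audio_dim, latent_dim)
        self.text_proj = nn.Linear(text_dim, latent_dim)
        self.decoder = nn.Sequential(nn.Linear(latent_dim * 2, 512), nn.SiLU(),
                                     nn.Linear(512, motion_dim))

    def forward(self, x_t, t, audio_feat, text_emb):
        # x_t: (B, T, motion_dim), t: (B,), audio_feat: (B, T, audio_dim), text_emb: (B, text_dim)
        prior_latent, _ = self.motion_prior(x_t)                      # high motion reliance
        cond = (self.time_embed(t[:, None, None].float())
                + 0.3 * self.audio_proj(audio_feat)                    # low audio dependency
                + 0.3 * self.text_proj(text_emb)[:, None, :])
        return self.decoder(torch.cat([prior_latent, cond], dim=-1))   # predicted clean motion x0

# Toy usage with random inputs (batch of 2 clips, 60 frames each).
x0 = GestureDenoiser()(torch.randn(2, 60, 165), torch.randint(0, 1000, (2,)),
                       torch.randn(2, 60, 128), torch.randn(2, 256))
```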
Related papers
- Audio-driven Gesture Generation via Deviation Feature in the Latent Space [2.8952735126314733]
We introduce a weakly supervised framework that learns latent representation deviations, tailored for co-speech gesture video generation.
Our approach employs a diffusion model to integrate latent motion features, enabling more precise and nuanced gesture representation.
Experiments show our method significantly improves video quality, surpassing current state-of-the-art techniques.
arXiv Detail & Related papers (2025-03-27T15:37:16Z) - InterDance: Reactive 3D Dance Generation with Realistic Duet Interactions [67.37790144477503]
We propose InterDance, a large-scale duet dance dataset that significantly enhances motion quality, data scale, and the variety of dance genres.
We introduce a diffusion-based framework with an interaction refinement guidance strategy to optimize the realism of interactions progressively.
arXiv Detail & Related papers (2024-12-22T11:53:51Z) - Ditto: Motion-Space Diffusion for Controllable Realtime Talking Head Synthesis [27.43583075023949]
We introduce Ditto, a diffusion-based framework that enables controllable realtime talking head synthesis.
Our key innovation lies in bridging motion generation and photorealistic neural rendering through an explicit identity-agnostic motion space.
This design substantially reduces the complexity of diffusion learning while enabling precise control over the synthesized talking heads.
arXiv Detail & Related papers (2024-11-29T07:01:31Z) - ExpGest: Expressive Speaker Generation Using Diffusion Model and Hybrid Audio-Text Guidance [11.207513771079705]
We introduce ExpGest, a novel framework leveraging synchronized text and audio information to generate expressive full-body gestures.
Unlike AdaIN or one-hot encoding methods, we design a noise emotion classifier for optimizing adversarial direction noise.
We show that ExpGest achieves more expressive, natural, and controllable global motion in speakers compared to state-of-the-art models.
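The "noise emotion classifier" summarized above reads like classifier-style guidance applied in the diffusion noise space. The sketch below shows one plausible form under that assumption, with a hypothetical classifier and a simplified guidance rule (the usual noise-scaling factor is omitted); it is not ExpGest's actual formulation.

```python
# Hedged sketch of emotion-classifier guidance on predicted diffusion noise.
import torch
import torch.nn as nn
import torch.nn.functional as F

def emotion_guided_noise(eps_pred, x_t, emotion_classifier, target_emotion, scale=2.0):
    """Nudge the predicted noise so the denoised gesture drifts toward a target emotion."""
    x_t = x_t.detach().requires_grad_(True)
    logits = emotion_classifier(x_t)                              # classifier applied to noisy motion
    log_prob = F.log_softmax(logits, dim=-1)[:, target_emotion].sum()
    grad, = torch.autograd.grad(log_prob, x_t)
    return eps_pred - scale * grad                                # simplified classifier guidance step

# Toy usage with a hypothetical linear classifier over flattened motion frames (8 emotion classes).
clf = nn.Sequential(nn.Flatten(), nn.Linear(60 * 165, 8))
eps_hat = emotion_guided_noise(torch.randn(2, 60, 165), torch.randn(2, 60, 165), clf, target_emotion=3)
```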
arXiv Detail & Related papers (2024-10-12T07:01:17Z) - Co-Speech Gesture Video Generation via Motion-Decoupled Diffusion Model [17.98911328064481]
Co-speech gestures can achieve superior visual effects in human-machine interaction.
We present a novel motion-decoupled framework to generate co-speech gesture videos.
Our proposed framework significantly outperforms existing approaches in both motion and video-related evaluations.
arXiv Detail & Related papers (2024-04-02T11:40:34Z) - SpeechAct: Towards Generating Whole-body Motion from Speech [33.10601371020488]
This paper addresses the problem of generating whole-body motion from speech.
We present a novel hybrid point representation to achieve accurate and continuous motion generation.
We also propose a contrastive motion learning method to encourage the model to produce more distinctive representations.
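A common way to realize contrastive motion learning is an InfoNCE objective over paired embeddings; the sketch below assumes speech-motion pairs, which may differ from SpeechAct's exact formulation.

```python
# Minimal symmetric InfoNCE sketch for contrastive motion learning (assumed pairing).
import torch
import torch.nn.functional as F

def contrastive_motion_loss(speech_emb, motion_emb, temperature=0.07):
    # speech_emb, motion_emb: (B, D) embeddings of time-aligned speech/motion clips
    speech_emb = F.normalize(speech_emb, dim=-1)
    motion_emb = F.normalize(motion_emb, dim=-1)
    logits = speech_emb @ motion_emb.t() / temperature    # (B, B) similarity matrix
    targets = torch.arange(speech_emb.size(0), device=speech_emb.device)
    # Matched clips sit on the diagonal (positives); all other pairs act as negatives.
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

loss = contrastive_motion_loss(torch.randn(16, 256), torch.randn(16, 256))
```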
arXiv Detail & Related papers (2023-11-29T07:57:30Z) - Universal Humanoid Motion Representations for Physics-Based Control [71.46142106079292]
We present a universal motion representation that encompasses a comprehensive range of motor skills for physics-based humanoid control.
We first learn a motion imitator that can imitate all human motion from a large, unstructured motion dataset.
We then create our motion representation by distilling skills directly from the imitator.
arXiv Detail & Related papers (2023-10-06T20:48:43Z) - UnifiedGesture: A Unified Gesture Synthesis Model for Multiple Skeletons [16.52004713662265]
We present a novel diffusion model-based speech-driven gesture synthesis approach, trained on multiple gesture datasets with different skeletons.
We then capture the correlation between speech and gestures based on a diffusion model architecture using cross-local attention and self-attention.
Experiments show that UnifiedGesture outperforms recent approaches on speech-driven gesture generation in terms of CCA, FGD, and human-likeness.
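Cross-local attention can be read as attention restricted to a temporal window. Below is a rough sketch under that reading; the window size, tensor shapes, and the use of nn.MultiheadAttention are assumptions rather than UnifiedGesture's implementation.

```python
# Sketch of window-limited cross-attention between gesture frames and speech features.
import torch
import torch.nn as nn

class CrossLocalAttention(nn.Module):
    """Gesture frames attend only to speech features within a local temporal window."""
    def __init__(self, dim=256, heads=4, window=8):
        super().__init__()
        self.window = window
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, gesture_tokens, speech_tokens):
        # gesture_tokens, speech_tokens: (B, T, dim) with aligned time axes
        T = gesture_tokens.size(1)
        idx = torch.arange(T, device=gesture_tokens.device)
        # True entries are masked out: speech frames farther than `window` steps away.
        mask = (idx[:, None] - idx[None, :]).abs() > self.window
        out, _ = self.attn(gesture_tokens, speech_tokens, speech_tokens, attn_mask=mask)
        return out

out = CrossLocalAttention()(torch.randn(2, 60, 256), torch.randn(2, 60, 256))
```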
arXiv Detail & Related papers (2023-09-13T16:07:25Z) - Priority-Centric Human Motion Generation in Discrete Latent Space [59.401128190423535]
We introduce a Priority-Centric Motion Discrete Diffusion Model (M2DM) for text-to-motion generation.
M2DM incorporates a global self-attention mechanism and a regularization term to counteract code collapse.
We also present a motion discrete diffusion model that employs an innovative noise schedule, determined by the significance of each motion token.
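One plausible reading of a significance-driven noise schedule over discrete tokens is to corrupt low-importance codes first, and more of them at later steps. The sketch below illustrates that reading only; the importance measure, codebook size, and mask id are hypothetical, not M2DM's exact rule.

```python
# Illustrative priority-centric corruption schedule for discrete motion tokens.
import torch

MASK_ID = 1024  # hypothetical id of a [MASK] code outside a 1024-entry codebook

def priority_corrupt(tokens, importance, t, T):
    """tokens: (B, L) discrete motion codes; importance: (B, L), higher = more significant.
    Masks the least significant codes first, masking more of them as step t/T grows."""
    B, L = tokens.shape
    n_mask = int(L * t / T)
    order = importance.argsort(dim=-1)         # least significant positions first
    to_mask = order[:, :n_mask]
    noisy = tokens.clone()
    noisy.scatter_(1, to_mask, MASK_ID)
    return noisy

noisy = priority_corrupt(torch.randint(0, 1024, (2, 50)), torch.rand(2, 50), t=30, T=100)
```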
arXiv Detail & Related papers (2023-08-28T10:40:16Z) - MoFusion: A Framework for Denoising-Diffusion-based Motion Synthesis [73.52948992990191]
MoFusion is a new denoising-diffusion-based framework for high-quality conditional human motion synthesis.
We present ways to introduce well-known kinematic losses for motion plausibility within the motion diffusion framework.
We demonstrate the effectiveness of MoFusion compared to the state of the art on established benchmarks in the literature.
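Kinematic losses of the kind mentioned above are typically bone-length and velocity terms computed on the denoised motion. A hedged example follows; the skeleton layout and loss weights are assumptions, not MoFusion's exact losses.

```python
# Example kinematic losses on predicted vs. ground-truth joint positions (assumed skeleton).
import torch

def kinematic_losses(pred_joints, gt_joints, bone_pairs, w_bone=1.0, w_vel=0.5):
    # pred_joints, gt_joints: (B, T, J, 3); bone_pairs: list of (parent, child) joint indices
    parent, child = map(list, zip(*bone_pairs))
    pred_bones = (pred_joints[:, :, child] - pred_joints[:, :, parent]).norm(dim=-1)
    gt_bones = (gt_joints[:, :, child] - gt_joints[:, :, parent]).norm(dim=-1)
    bone_loss = (pred_bones - gt_bones).abs().mean()          # keep limb lengths consistent
    pred_vel = pred_joints[:, 1:] - pred_joints[:, :-1]
    gt_vel = gt_joints[:, 1:] - gt_joints[:, :-1]
    vel_loss = (pred_vel - gt_vel).norm(dim=-1).mean()        # match frame-to-frame velocities
    return w_bone * bone_loss + w_vel * vel_loss

loss = kinematic_losses(torch.randn(2, 30, 24, 3), torch.randn(2, 30, 24, 3), [(0, 1), (1, 2), (2, 3)])
```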
arXiv Detail & Related papers (2022-12-08T18:59:48Z) - Audio-Driven Co-Speech Gesture Video Generation [92.15661971086746]
We define and study this challenging problem of audio-driven co-speech gesture video generation.
Our key insight is that the co-speech gestures can be decomposed into common motion patterns and subtle rhythmic dynamics.
We propose a novel framework, Audio-driveN Gesture vIdeo gEneration (ANGIE), to effectively capture the reusable co-speech gesture patterns.
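The decomposition into common motion patterns plus subtle rhythmic dynamics could be sketched as a quantized pattern codebook refined by a small audio-driven residual. The toy module below illustrates this under assumed sizes and a hypothetical residual predictor; it is not ANGIE's architecture.

```python
# Toy decomposition: quantized common gesture patterns plus an audio-driven rhythmic residual.
import torch
import torch.nn as nn

class PatternPlusRhythm(nn.Module):
    """Coarse gesture from a codebook of common patterns, refined by an audio-driven residual."""
    def __init__(self, motion_dim=128, codebook_size=512, audio_dim=64):
        super().__init__()
        self.codebook = nn.Embedding(codebook_size, motion_dim)  # reusable motion patterns
        self.rhythm = nn.Linear(audio_dim, motion_dim)           # subtle rhythmic dynamics

    def forward(self, pattern_ids, audio_feat):
        # pattern_ids: (B, T) quantized pattern indices; audio_feat: (B, T, audio_dim)
        coarse = self.codebook(pattern_ids)
        return coarse + 0.1 * self.rhythm(audio_feat)            # small rhythm-driven refinement

motion = PatternPlusRhythm()(torch.randint(0, 512, (2, 60)), torch.randn(2, 60, 64))
```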
arXiv Detail & Related papers (2022-12-05T15:28:22Z) - Learning Hierarchical Cross-Modal Association for Co-Speech Gesture Generation [107.10239561664496]
We propose a novel framework named Hierarchical Audio-to-Gesture (HA2G) for co-speech gesture generation.
The proposed method renders realistic co-speech gestures and outperforms previous methods by a clear margin.
arXiv Detail & Related papers (2022-03-24T16:33:29Z)