Audio-driven Gesture Generation via Deviation Feature in the Latent Space
- URL: http://arxiv.org/abs/2503.21616v1
- Date: Thu, 27 Mar 2025 15:37:16 GMT
- Title: Audio-driven Gesture Generation via Deviation Feature in the Latent Space
- Authors: Jiahui Chen, Yang Huan, Runhua Shi, Chanfan Ding, Xiaoqi Mo, Siyu Xiong, Yinong He,
- Abstract summary: We introduce a weakly supervised framework that learns latent representation deviations, tailored for co-speech gesture video generation. Our approach employs a diffusion model to integrate latent motion features, enabling more precise and nuanced gesture representation. Experiments show our method significantly improves video quality, surpassing current state-of-the-art techniques.
- Score: 2.8952735126314733
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Gestures are essential for enhancing co-speech communication, offering visual emphasis and complementing verbal interactions. While prior work has concentrated on point-level motion or fully supervised data-driven methods, we focus on co-speech gestures, advocating for weakly supervised learning and pixel-level motion deviations. We introduce a weakly supervised framework that learns latent representation deviations, tailored for co-speech gesture video generation. Our approach employs a diffusion model to integrate latent motion features, enabling more precise and nuanced gesture representation. By leveraging weakly supervised deviations in latent space, we effectively generate hand gestures and mouth movements, crucial for realistic video production. Experiments show our method significantly improves video quality, surpassing current state-of-the-art techniques.
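The abstract describes the method only at a high level. As a rough illustration of the core idea, here is a minimal PyTorch sketch of conditioning a denoiser on the deviation between a current latent and a reference latent together with an audio feature; every module name, dimension, and architectural choice below is an assumption for illustration, not the authors' implementation.

```python
# Hypothetical sketch (PyTorch): conditioning a denoiser on latent deviations.
# Names, dimensions, and the denoiser architecture are illustrative assumptions,
# not the released code of the paper.
import torch
import torch.nn as nn

class DeviationConditionedDenoiser(nn.Module):
    def __init__(self, latent_dim=256, audio_dim=128, hidden_dim=512):
        super().__init__()
        # The deviation of the current latent from a reference latent is
        # concatenated with the audio feature to form the conditioning signal.
        self.cond_proj = nn.Linear(latent_dim + audio_dim, hidden_dim)
        self.denoise = nn.Sequential(
            nn.Linear(latent_dim + hidden_dim, hidden_dim),
            nn.SiLU(),
            nn.Linear(hidden_dim, latent_dim),
        )

    def forward(self, noisy_latent, ref_latent, cur_latent, audio_feat):
        deviation = cur_latent - ref_latent  # motion expressed in latent space
        cond = self.cond_proj(torch.cat([deviation, audio_feat], dim=-1))
        return self.denoise(torch.cat([noisy_latent, cond], dim=-1))

# Toy usage: predict the noise added to a gesture latent.
model = DeviationConditionedDenoiser()
noisy = torch.randn(4, 256)
ref, cur = torch.randn(4, 256), torch.randn(4, 256)
audio = torch.randn(4, 128)
eps_pred = model(noisy, ref, cur, audio)  # (4, 256)
```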
Related papers
- TokenMotion: Decoupled Motion Control via Token Disentanglement for Human-centric Video Generation [7.900728371180723]
We present TokenMotion, the first DiT-based video diffusion framework that enables fine-grained control over camera motion.
Our approach introduces a unified modeling framework utilizing a decouple-and-fuse strategy, bridged by a human-aware dynamic mask.
Our work represents a significant advancement in controllable video generation, with particular relevance for creative production applications.
arXiv Detail & Related papers (2025-04-11T00:41:25Z) - Understanding Co-speech Gestures in-the-wild [52.5993021523165]
We introduce a new framework for co-speech gesture understanding in the wild.
We propose three new tasks and benchmarks to evaluate a model's capability to comprehend gesture-text-speech associations.
We present a new approach that learns a tri-modal speech-text-video-gesture representation to solve these tasks.
arXiv Detail & Related papers (2025-03-28T17:55:52Z) - HoloGest: Decoupled Diffusion and Motion Priors for Generating Holisticly Expressive Co-speech Gestures [8.50717565369252]
HoleGest is a novel neural network framework for automatic generation of high-quality, expressive co-speech gestures. Our system learns a robust prior with low audio dependency and high motion reliance, enabling stable global motion and detailed finger movements. Our model achieves a level of realism close to the ground truth, providing an immersive user experience.
arXiv Detail & Related papers (2025-03-17T14:42:31Z) - EMO2: End-Effector Guided Audio-Driven Avatar Video Generation [17.816939983301474]
We propose a novel audio-driven talking head method capable of simultaneously generating highly expressive facial expressions and hand gestures. In the first stage, we generate hand poses directly from audio input, leveraging the strong correlation between audio signals and hand movements. In the second stage, we employ a diffusion model to synthesize video frames, incorporating the hand poses generated in the first stage to produce realistic facial expressions and body movements.
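The summary outlines a two-stage pipeline: audio to hand poses, then pose-conditioned frame synthesis. The sketch below mirrors that structure in PyTorch under assumed shapes and modules (AudioToPose and PoseConditionedFrameDenoiser are hypothetical names); it is not the EMO2 implementation.

```python
# Illustrative two-stage audio-to-video sketch: stage 1 maps audio to hand
# poses, stage 2 denoises frame latents conditioned on those poses.
# All shapes and modules are assumptions for illustration.
import torch
import torch.nn as nn

class AudioToPose(nn.Module):
    def __init__(self, audio_dim=128, pose_dim=42):  # e.g. 21 hand joints x 2D
        super().__init__()
        self.rnn = nn.GRU(audio_dim, 256, batch_first=True)
        self.head = nn.Linear(256, pose_dim)

    def forward(self, audio_seq):            # (B, T, audio_dim)
        h, _ = self.rnn(audio_seq)
        return self.head(h)                  # (B, T, pose_dim)

class PoseConditionedFrameDenoiser(nn.Module):
    def __init__(self, frame_dim=512, pose_dim=42):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(frame_dim + pose_dim, 512),
            nn.SiLU(),
            nn.Linear(512, frame_dim),
        )

    def forward(self, noisy_frames, poses):  # (B, T, frame_dim), (B, T, pose_dim)
        return self.net(torch.cat([noisy_frames, poses], dim=-1))

audio = torch.randn(2, 50, 128)
poses = AudioToPose()(audio)                                              # stage 1
frames = PoseConditionedFrameDenoiser()(torch.randn(2, 50, 512), poses)   # stage 2
```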
arXiv Detail & Related papers (2025-01-18T07:51:29Z) - Self-Supervised Learning of Deviation in Latent Representation for Co-speech Gesture Video Generation [8.84657964527764]
We explore the representation of gestures in co-speech with a focus on self-supervised representation and pixel-level motion deviation.
Our approach leverages self-supervised deviation in latent representation to facilitate hand gesture generation.
Results of our first experiment demonstrate that our method enhances the quality of generated videos.
arXiv Detail & Related papers (2024-09-26T09:33:20Z) - MotionClone: Training-Free Motion Cloning for Controllable Video Generation [41.621147782128396]
MotionClone is a training-free framework that enables motion cloning from reference videos to versatile motion-controlled video generation.
MotionClone exhibits proficiency in both global camera motion and local object motion, with notable superiority in terms of motion fidelity, textual alignment, and temporal consistency.
arXiv Detail & Related papers (2024-06-08T03:44:25Z) - Co-Speech Gesture Video Generation via Motion-Decoupled Diffusion Model [17.98911328064481]
Co-speech gestures can achieve superior visual effects in human-machine interaction.
We present a novel motion-decoupled framework to generate co-speech gesture videos.
Our proposed framework significantly outperforms existing approaches in both motion and video-related evaluations.
arXiv Detail & Related papers (2024-04-02T11:40:34Z) - ConvoFusion: Multi-Modal Conversational Diffusion for Co-Speech Gesture Synthesis [50.69464138626748]
We present ConvoFusion, a diffusion-based approach for multi-modal gesture synthesis.
Our method proposes two guidance objectives that allow the users to modulate the impact of different conditioning modalities.
Our method is versatile in that it can be trained to generate either monologue gestures or conversational gestures.
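One common way to let users modulate the impact of different conditioning modalities is per-modality classifier-free guidance; the sketch below uses that standard rule as a stand-in, since the summary does not specify ConvoFusion's exact guidance objectives. The denoiser callable, modality names, and guidance scales are assumptions.

```python
# Hedged sketch: per-modality guidance weighting. The denoiser is queried with
# and without each conditioning modality, and user-chosen scales control how
# strongly speech vs. text steer the sample. This is standard classifier-free
# guidance, not necessarily the paper's formulation.
import torch

def guided_noise(denoiser, x_t, t, speech, text, w_speech=2.0, w_text=1.0):
    eps_uncond = denoiser(x_t, t, speech=None, text=None)
    eps_speech = denoiser(x_t, t, speech=speech, text=None)
    eps_text = denoiser(x_t, t, speech=None, text=text)
    return (eps_uncond
            + w_speech * (eps_speech - eps_uncond)
            + w_text * (eps_text - eps_uncond))

# Toy denoiser standing in for a gesture diffusion model.
denoiser = lambda x_t, t, speech=None, text=None: torch.zeros_like(x_t)
eps = guided_noise(denoiser, torch.randn(4, 64), t=10,
                   speech=torch.randn(4, 32), text=torch.randn(4, 32))
```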
arXiv Detail & Related papers (2024-03-26T17:59:52Z) - EC^2: Emergent Communication for Embodied Control [72.99894347257268]
Embodied control requires agents to leverage multi-modal pre-training to quickly learn how to act in new environments.
We propose Emergent Communication for Embodied Control (EC2), a novel scheme to pre-train video-language representations for few-shot embodied control.
EC2 is shown to consistently outperform previous contrastive learning methods for both videos and texts as task inputs.
arXiv Detail & Related papers (2023-04-19T06:36:02Z) - Structured Video-Language Modeling with Temporal Grouping and Spatial Grounding [112.3913646778859]
We propose a simple yet effective video-language modeling framework, S-ViLM.
It includes two novel designs, inter-clip spatial grounding and intra-clip temporal grouping, to promote learning region-object alignment and temporal-aware features.
S-ViLM surpasses the state-of-the-art methods substantially on four representative downstream tasks.
arXiv Detail & Related papers (2023-03-28T22:45:07Z) - Look at What I'm Doing: Self-Supervised Spatial Grounding of Narrations in Instructional Videos [78.34818195786846]
We introduce the task of spatially localizing narrated interactions in videos.
Key to our approach is the ability to learn to spatially localize interactions with self-supervision on a large corpus of videos with accompanying transcribed narrations.
We propose a multilayer cross-modal attention network that enables effective optimization of a contrastive loss during training.
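As a hedged illustration of cross-modal attention trained with a contrastive loss, the following PyTorch sketch attends from a narration embedding over video-region features and applies an InfoNCE-style objective; the single attention layer, pooling, and dimensions are assumptions rather than the paper's multilayer architecture.

```python
# Sketch: cross-modal attention between a narration embedding and video-region
# features, trained with an InfoNCE-style contrastive loss over the batch.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalGrounding(nn.Module):
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, narration, regions):
        # narration: (B, 1, D) sentence embedding; regions: (B, R, D) region features.
        attended, weights = self.attn(narration, regions, regions)
        return attended.squeeze(1), weights   # (B, D), (B, 1, R) attention over regions

def contrastive_loss(text_emb, video_emb, temperature=0.07):
    text_emb = F.normalize(text_emb, dim=-1)
    video_emb = F.normalize(video_emb, dim=-1)
    logits = text_emb @ video_emb.t() / temperature  # (B, B) similarity matrix
    labels = torch.arange(logits.size(0))            # matching pairs lie on the diagonal
    return F.cross_entropy(logits, labels)

model = CrossModalGrounding()
attended, w = model(torch.randn(8, 1, 256), torch.randn(8, 20, 256))
loss = contrastive_loss(torch.randn(8, 256), attended)
```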
arXiv Detail & Related papers (2021-10-20T14:45:13Z) - Hierarchical Contrastive Motion Learning for Video Action Recognition [100.9807616796383]
We present hierarchical contrastive motion learning, a new self-supervised learning framework to extract effective motion representations from raw video frames.
Our approach progressively learns a hierarchy of motion features that correspond to different abstraction levels in a network.
Our motion learning module is lightweight and flexible to be embedded into various backbone networks.
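A minimal sketch of the general recipe, assuming a SimCLR-style setup: a contrastive objective applied to motion features taken from several levels of a feature hierarchy, with per-level projection heads. The backbone is omitted and all shapes are illustrative; this is not the paper's progressive training scheme.

```python
# Sketch: contrastive (InfoNCE-style) losses summed over multiple hierarchy
# levels of motion features from two augmented views of the same clips.
import torch
import torch.nn as nn
import torch.nn.functional as F

def info_nce(a, b, temperature=0.1):
    a, b = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature
    return F.cross_entropy(logits, torch.arange(a.size(0)))

# Per-level projection heads over motion features of increasing abstraction.
dims = [64, 128, 256]
heads = nn.ModuleList(nn.Linear(d, 128) for d in dims)

# Two augmented views of the same clips, with features at each hierarchy level.
view1 = [torch.randn(16, d) for d in dims]
view2 = [torch.randn(16, d) for d in dims]

loss = sum(info_nce(h(x1), h(x2)) for h, x1, x2 in zip(heads, view1, view2))
```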
arXiv Detail & Related papers (2020-07-20T17:59:22Z)