Freeform Body Motion Generation from Speech
- URL: http://arxiv.org/abs/2203.02291v1
- Date: Fri, 4 Mar 2022 13:03:22 GMT
- Title: Freeform Body Motion Generation from Speech
- Authors: Jing Xu, Wei Zhang, Yalong Bai, Qibin Sun, Tao Mei
- Abstract summary: Body motion generation from speech is inherently difficult due to the non-deterministic mapping from speech to body motions.
We introduce FreeMo, a novel freeform motion generation model equipped with a two-stream architecture.
Experiments demonstrate superior performance over several baselines.
- Score: 53.50388964591343
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: People naturally conduct spontaneous body motions to enhance their speeches
while giving talks. Body motion generation from speech is inherently difficult
due to the non-deterministic mapping from speech to body motions. Most existing
works map speech to motion in a deterministic way by conditioning on certain
styles, leading to sub-optimal results. Motivated by studies in linguistics, we
decompose the co-speech motion into two complementary parts: pose modes and
rhythmic dynamics. Accordingly, we introduce a novel freeform motion generation
model (FreeMo) equipped with a two-stream architecture: a pose mode branch
for primary posture generation, and a rhythmic motion branch for rhythmic
dynamics synthesis. On one hand, diverse pose modes are generated by
conditional sampling in a latent space, guided by speech semantics. On the
other hand, rhythmic dynamics are synced with the speech prosody. Extensive
experiments demonstrate superior performance over several baselines in
terms of motion diversity, quality, and synchronization with speech. Code and
pre-trained models will be publicly available through
https://github.com/TheTempAccount/Co-Speech-Motion-Generation.
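The two-stream decomposition described in the abstract can be sketched in a few lines. This is a minimal illustrative sketch, not the authors' implementation: the function names (`pose_mode_branch`, `rhythmic_branch`, `freemo_sketch`) are placeholders, and random projections and a sinusoid stand in for the learned latent decoder and rhythmic synthesis network.

```python
import numpy as np

rng = np.random.default_rng(0)

def pose_mode_branch(semantic_feat, latent_dim=8, pose_dim=6):
    """Pose mode branch: sample a primary posture by conditional
    sampling in a latent space, guided by speech semantics.
    A random linear map stands in for the learned decoder."""
    z = rng.standard_normal(latent_dim)           # latent sample
    cond = np.concatenate([z, semantic_feat])     # condition on semantics
    W = rng.standard_normal((pose_dim, cond.size)) * 0.1
    return W @ cond                               # pose mode (pose_dim,)

def rhythmic_branch(prosody, pose_dim=6):
    """Rhythmic motion branch: per-frame offsets whose magnitude
    follows the speech prosody (here, a per-frame energy signal)."""
    T = prosody.shape[0]
    phase = np.sin(np.linspace(0.0, 2.0 * np.pi, T))[:, None]
    return 0.05 * phase * prosody[:, None] * np.ones((T, pose_dim))

def freemo_sketch(semantic_feat, prosody):
    """Combine the two streams: a sampled pose mode plus
    prosody-synced rhythmic dynamics gives the final motion."""
    pose = pose_mode_branch(semantic_feat)        # (pose_dim,)
    dynamics = rhythmic_branch(prosody)           # (T, pose_dim)
    return pose[None, :] + dynamics               # (T, pose_dim)

semantics = rng.standard_normal(4)                # dummy semantic feature
prosody = np.abs(rng.standard_normal(30))         # dummy per-frame energy
motion = freemo_sketch(semantics, prosody)
print(motion.shape)                               # (30, 6)
```

Because the pose mode is drawn from a latent distribution rather than regressed deterministically, repeated calls with the same speech input yield diverse postures, while the rhythmic stream keeps the fine-grained motion synced to prosody.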
Related papers
- ExpGest: Expressive Speaker Generation Using Diffusion Model and Hybrid Audio-Text Guidance [11.207513771079705]
We introduce ExpGest, a novel framework leveraging synchronized text and audio information to generate expressive full-body gestures.
Unlike AdaIN or one-hot encoding methods, we design a noise emotion classifier for optimizing adversarial direction noise.
We show that ExpGest achieves more expressive, natural, and controllable global motion in speakers compared to state-of-the-art models.
arXiv Detail & Related papers (2024-10-12T07:01:17Z)
- Co-Speech Gesture Video Generation via Motion-Decoupled Diffusion Model [17.98911328064481]
Co-speech gestures can achieve superior visual effects in human-machine interaction.
We present a novel motion-decoupled framework to generate co-speech gesture videos.
Our proposed framework significantly outperforms existing approaches in both motion and video-related evaluations.
arXiv Detail & Related papers (2024-04-02T11:40:34Z)
- ConvoFusion: Multi-Modal Conversational Diffusion for Co-Speech Gesture Synthesis [50.69464138626748]
We present ConvoFusion, a diffusion-based approach for multi-modal gesture synthesis.
Our method proposes two guidance objectives that allow the users to modulate the impact of different conditioning modalities.
Our method is versatile in that it can be trained to generate either monologue gestures or conversational gestures.
arXiv Detail & Related papers (2024-03-26T17:59:52Z)
- Freetalker: Controllable Speech and Text-Driven Gesture Generation Based on Diffusion Models for Enhanced Speaker Naturalness [45.90256126021112]
We introduce FreeTalker, the first framework for generating both spontaneous (e.g., co-speech gestures) and non-spontaneous (e.g., moving around the podium) speaker motions.
Specifically, we train a diffusion-based model for speaker motion generation that employs unified representations of both speech-driven gestures and text-driven motions.
arXiv Detail & Related papers (2024-01-07T13:01:29Z)
- Speech2Lip: High-fidelity Speech to Lip Generation by Learning from a Short Video [91.92782707888618]
We present a decomposition-composition framework named Speech to Lip (Speech2Lip) that disentangles speech-sensitive and speech-insensitive motion/appearance.
We show that our model can be trained on a video just a few minutes long and achieve state-of-the-art performance in both visual quality and speech-visual synchronization.
arXiv Detail & Related papers (2023-09-09T14:52:39Z)
- QPGesture: Quantization-Based and Phase-Guided Motion Matching for Natural Speech-Driven Gesture Generation [8.604430209445695]
Speech-driven gesture generation is highly challenging due to the random jitters of human motion.
We introduce a novel quantization-based and phase-guided motion-matching framework.
Our method outperforms recent approaches on speech-driven gesture generation.
arXiv Detail & Related papers (2023-05-18T16:31:25Z)
- Audio-Driven Co-Speech Gesture Video Generation [92.15661971086746]
We define and study this challenging problem of audio-driven co-speech gesture video generation.
Our key insight is that the co-speech gestures can be decomposed into common motion patterns and subtle rhythmic dynamics.
We propose a novel framework, Audio-driveN Gesture vIdeo gEneration (ANGIE), to effectively capture the reusable co-speech gesture patterns.
arXiv Detail & Related papers (2022-12-05T15:28:22Z)
- GenerSpeech: Towards Style Transfer for Generalizable Out-Of-Domain Text-to-Speech Synthesis [68.42632589736881]
This paper proposes GenerSpeech, a text-to-speech model towards high-fidelity zero-shot style transfer of OOD custom voice.
GenerSpeech decomposes the speech variation into the style-agnostic and style-specific parts by introducing two components.
Our evaluations on zero-shot style transfer demonstrate that GenerSpeech surpasses the state-of-the-art models in terms of audio quality and style similarity.
arXiv Detail & Related papers (2022-05-15T08:16:02Z)
- Learning Hierarchical Cross-Modal Association for Co-Speech Gesture Generation [107.10239561664496]
We propose a novel framework named Hierarchical Audio-to-Gesture (HA2G) for co-speech gesture generation.
The proposed method renders realistic co-speech gestures and outperforms previous methods by a clear margin.
arXiv Detail & Related papers (2022-03-24T16:33:29Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information and is not responsible for any consequences of its use.