QPGesture: Quantization-Based and Phase-Guided Motion Matching for
Natural Speech-Driven Gesture Generation
- URL: http://arxiv.org/abs/2305.11094v1
- Date: Thu, 18 May 2023 16:31:25 GMT
- Title: QPGesture: Quantization-Based and Phase-Guided Motion Matching for
Natural Speech-Driven Gesture Generation
- Authors: Sicheng Yang, Zhiyong Wu, Minglei Li, Zhensong Zhang, Lei Hao, Weihong
Bao, Haolin Zhuang
- Abstract summary: Speech-driven gesture generation is highly challenging due to the random jitters of human motion.
We introduce a novel quantization-based and phase-guided motion-matching framework.
Our method outperforms recent approaches on speech-driven gesture generation.
- Score: 8.604430209445695
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Speech-driven gesture generation is highly challenging due to the random
jitters of human motion. In addition, there is an inherent asynchronous
relationship between human speech and gestures. To tackle these challenges, we
introduce a novel quantization-based and phase-guided motion-matching
framework. Specifically, we first present a gesture VQ-VAE module that learns a
codebook summarizing meaningful gesture units. With each code representing a
unique gesture, the problem of random jitter is effectively alleviated. We then
use Levenshtein distance to align diverse gestures with different speech:
computed over quantized audio, it serves as a similarity metric between the
speech corresponding to candidate gestures and the input speech, matching more
appropriate gestures to the speech and resolving the speech-gesture alignment
problem. Moreover, we introduce phase to guide optimal gesture matching based
on the semantics of the context or the rhythm of the audio; phase determines
when text-based or speech-based gestures should be performed, making the
generated gestures more natural.
Extensive experiments show that our method outperforms recent approaches on
speech-driven gesture generation. Our code, database, pre-trained models, and
demos are available at https://github.com/YoungSeng/QPGesture.
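For illustration, here is a minimal sketch of the matching idea described in the abstract: gestures are represented by codes from a learned VQ-VAE codebook, and a candidate gesture is retrieved by comparing the quantized codes of the input speech against the quantized codes of the speech that accompanied each database gesture, using Levenshtein distance so that speech and gesture need not be strictly synchronous. All names below (levenshtein, GestureEntry, match_gesture) are hypothetical and only sketch the idea under these assumptions; they are not the authors' implementation.

    # Sketch of Levenshtein-based candidate selection over quantized audio codes.
    # Hypothetical names; not from the QPGesture codebase.
    from dataclasses import dataclass
    from typing import List, Sequence

    def levenshtein(a: Sequence[int], b: Sequence[int]) -> int:
        """Edit distance between two code sequences (insert/delete/substitute, cost 1 each)."""
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, start=1):
            curr = [i]
            for j, cb in enumerate(b, start=1):
                curr.append(min(
                    prev[j] + 1,               # deletion
                    curr[j - 1] + 1,           # insertion
                    prev[j - 1] + (ca != cb),  # substitution
                ))
            prev = curr
        return prev[-1]

    @dataclass
    class GestureEntry:
        gesture_code: int       # index into the learned gesture VQ-VAE codebook
        audio_codes: List[int]  # quantized codes of the speech that accompanied this gesture

    def match_gesture(query_audio_codes: List[int],
                      database: List[GestureEntry]) -> GestureEntry:
        """Return the database gesture whose accompanying speech is closest to the
        query, measured by Levenshtein distance over quantized audio codes."""
        return min(database, key=lambda e: levenshtein(query_audio_codes, e.audio_codes))

Because Levenshtein distance tolerates insertions and deletions, it can absorb the inherent asynchrony between speech and gesture that a frame-by-frame metric would penalize.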
Related papers
- ExpGest: Expressive Speaker Generation Using Diffusion Model and Hybrid Audio-Text Guidance [11.207513771079705]
We introduce ExpGest, a novel framework leveraging synchronized text and audio information to generate expressive full-body gestures.
Unlike AdaIN or one-hot encoding methods, we design a noise emotion classifier for optimizing adversarial direction noise.
We show that ExpGest achieves more expressive, natural, and controllable global motion for speakers than state-of-the-art models.
arXiv Detail & Related papers (2024-10-12T07:01:17Z) - ConvoFusion: Multi-Modal Conversational Diffusion for Co-Speech Gesture Synthesis [50.69464138626748]
We present ConvoFusion, a diffusion-based approach for multi-modal gesture synthesis.
Our method proposes two guidance objectives that allow the users to modulate the impact of different conditioning modalities.
Our method is versatile in that it can be trained to generate either monologue gestures or conversational gestures.
arXiv Detail & Related papers (2024-03-26T17:59:52Z) - Freetalker: Controllable Speech and Text-Driven Gesture Generation Based
on Diffusion Models for Enhanced Speaker Naturalness [45.90256126021112]
We introduce FreeTalker, the first framework for generating both spontaneous (e.g., co-speech gestures) and non-spontaneous (e.g., moving around the podium) speaker motions.
Specifically, we train a diffusion-based model for speaker motion generation that employs unified representations of both speech-driven gestures and text-driven motions.
arXiv Detail & Related papers (2024-01-07T13:01:29Z) - EMAGE: Towards Unified Holistic Co-Speech Gesture Generation via Expressive Masked Audio Gesture Modeling [57.08286593059137]
We propose EMAGE, a framework to generate full-body human gestures from audio and masked gestures.
We first introduce BEAT2 (BEAT-SMPLX-FLAME), a new mesh-level holistic co-speech dataset.
Experiments demonstrate that EMAGE generates holistic gestures with state-of-the-art performance.
arXiv Detail & Related papers (2023-12-31T02:25:41Z) - LivelySpeaker: Towards Semantic-Aware Co-Speech Gesture Generation [41.42316077949012]
We introduce LivelySpeaker, a framework that realizes semantics-aware co-speech gesture generation.
Our method decouples the task into two stages: script-based gesture generation and audio-guided rhythm refinement.
Our novel two-stage generation framework also enables several applications, such as changing the gesticulation style.
arXiv Detail & Related papers (2023-09-17T15:06:11Z) - Audio-Driven Co-Speech Gesture Video Generation [92.15661971086746]
We define and study the challenging problem of audio-driven co-speech gesture video generation.
Our key insight is that the co-speech gestures can be decomposed into common motion patterns and subtle rhythmic dynamics.
We propose a novel framework, Audio-driveN Gesture vIdeo gEneration (ANGIE), to effectively capture the reusable co-speech gesture patterns.
arXiv Detail & Related papers (2022-12-05T15:28:22Z) - Learning Hierarchical Cross-Modal Association for Co-Speech Gesture
Generation [107.10239561664496]
We propose a novel framework named Hierarchical Audio-to-Gesture (HA2G) for co-speech gesture generation.
The proposed method renders realistic co-speech gestures and outperforms previous methods by a clear margin.
arXiv Detail & Related papers (2022-03-24T16:33:29Z) - Freeform Body Motion Generation from Speech [53.50388964591343]
Body motion generation from speech is inherently difficult due to the non-deterministic mapping from speech to body motions.
We introduce FreeMo, a novel freeform motion generation model equipped with a two-stream architecture.
Experiments demonstrate superior performance against several baselines.
arXiv Detail & Related papers (2022-03-04T13:03:22Z) - Speech Drives Templates: Co-Speech Gesture Synthesis with Learned
Templates [30.32106465591015]
Co-speech gesture generation aims to synthesize a gesture sequence that not only looks real but also matches the input speech audio.
Our method generates the movements of a complete upper body, including arms, hands, and the head.
arXiv Detail & Related papers (2021-08-18T07:53:36Z) - Gesticulator: A framework for semantically-aware speech-driven gesture
generation [17.284154896176553]
We present a model designed to produce arbitrary beat and semantic gestures together.
Our deep-learning based model takes both acoustic and semantic representations of speech as input, and generates gestures as a sequence of joint angle rotations as output.
The resulting gestures can be applied to both virtual agents and humanoid robots.
arXiv Detail & Related papers (2020-01-25T14:42:23Z)