Speech Drives Templates: Co-Speech Gesture Synthesis with Learned
Templates
- URL: http://arxiv.org/abs/2108.08020v1
- Date: Wed, 18 Aug 2021 07:53:36 GMT
- Title: Speech Drives Templates: Co-Speech Gesture Synthesis with Learned
Templates
- Authors: Shenhan Qian, Zhi Tu, YiHao Zhi, Wen Liu, Shenghua Gao
- Abstract summary: Co-speech gesture generation aims to synthesize a gesture sequence that not only looks realistic but also matches the input speech audio.
Our method generates the movements of a complete upper body, including arms, hands, and the head.
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Co-speech gesture generation aims to synthesize a gesture sequence that
not only looks realistic but also matches the input speech audio. Our method
generates the movements of a complete upper body, including arms, hands, and
the head. Although recent data-driven methods achieve great success, challenges
still exist, such as limited variety, poor fidelity, and lack of objective
metrics. Motivated by the fact that the speech cannot fully determine the
gesture, we design a method that learns a set of gesture template vectors to
model the latent conditions, which relieve the ambiguity. For our method, the
template vector determines the general appearance of a generated gesture
sequence, while the speech audio drives subtle movements of the body, both
indispensable for synthesizing a realistic gesture sequence. Due to the
intractability of an objective metric for gesture-speech synchronization, we
adopt the lip-sync error as a proxy metric to tune and evaluate the
synchronization ability of our model. Extensive experiments show the
superiority of our method in both objective and subjective evaluations on
fidelity and synchronization.
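The abstract describes the core mechanism: a learned bank of gesture template vectors supplies the latent condition that speech alone cannot determine, and the speech audio then drives the frame-level motion. The following is a minimal PyTorch sketch of that idea under stated assumptions; the module names, feature dimensions, the GRU audio encoder, and the use of mel-spectrogram inputs are illustrative choices, not the authors' released implementation.
```python
# Minimal sketch of a template-conditioned gesture decoder (assumed design,
# not the paper's released code): a learned template vector fixes the general
# appearance of the sequence, while per-frame audio features drive fine motion.
import torch
import torch.nn as nn


class TemplateConditionedGestureDecoder(nn.Module):
    def __init__(self, num_templates=64, template_dim=128,
                 audio_dim=80, hidden_dim=256, num_joints=54):
        super().__init__()
        # Learned gesture template vectors modeling the latent condition.
        self.templates = nn.Parameter(torch.randn(num_templates, template_dim))
        self.audio_encoder = nn.GRU(audio_dim, hidden_dim, batch_first=True)
        # Fuse the selected template with the audio encoding at every frame.
        self.fuse = nn.Linear(hidden_dim + template_dim, hidden_dim)
        self.pose_head = nn.Linear(hidden_dim, num_joints * 3)  # e.g. 3D joints

    def forward(self, audio_feats, template_id):
        """audio_feats: (B, T, audio_dim), e.g. mel-spectrogram frames.
        template_id: (B,) index selecting one learned template per sequence."""
        h, _ = self.audio_encoder(audio_feats)           # (B, T, hidden_dim)
        z = self.templates[template_id]                  # (B, template_dim)
        z = z.unsqueeze(1).expand(-1, h.size(1), -1)     # broadcast over time
        fused = torch.relu(self.fuse(torch.cat([h, z], dim=-1)))
        return self.pose_head(fused)                     # (B, T, num_joints * 3)


# Usage: the template index sets the overall look of the gesture sequence,
# while the audio drives the frame-level movement.
model = TemplateConditionedGestureDecoder()
audio = torch.randn(2, 200, 80)             # 2 clips, 200 frames of mel features
poses = model(audio, torch.tensor([3, 17]))
print(poses.shape)                          # torch.Size([2, 200, 162])
```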
Related papers
- Emphasizing Semantic Consistency of Salient Posture for Speech-Driven Gesture Generation [44.78811546051805]
Speech-driven gesture generation aims at synthesizing a gesture sequence synchronized with the input speech signal.
Previous methods leverage neural networks to directly map a compact audio representation to the gesture sequence.
We propose a novel speech-driven gesture generation method by emphasizing the semantic consistency of salient posture.
arXiv Detail & Related papers (2024-10-17T17:22:59Z) - Speech2UnifiedExpressions: Synchronous Synthesis of Co-Speech Affective Face and Body Expressions from Affordable Inputs [67.27840327499625]
We present a multimodal learning-based method to simultaneously synthesize co-speech facial expressions and upper-body gestures for digital characters.
Our approach learns from sparse face landmarks and upper-body joints, estimated directly from video data, to generate plausible emotive character motions.
arXiv Detail & Related papers (2024-06-26T04:53:11Z) - ConvoFusion: Multi-Modal Conversational Diffusion for Co-Speech Gesture Synthesis [50.69464138626748]
We present ConvoFusion, a diffusion-based approach for multi-modal gesture synthesis.
Our method proposes two guidance objectives that allow the users to modulate the impact of different conditioning modalities.
Our method is versatile in that it can be trained to generate either monologue gestures or conversational gestures.
arXiv Detail & Related papers (2024-03-26T17:59:52Z) - LivelySpeaker: Towards Semantic-Aware Co-Speech Gesture Generation [41.42316077949012]
We introduce LivelySpeaker, a framework that realizes semantics-aware co-speech gesture generation.
Our method decouples the task into two stages: script-based gesture generation and audio-guided rhythm refinement.
Our novel two-stage generation framework also enables several applications, such as changing the gesticulation style.
arXiv Detail & Related papers (2023-09-17T15:06:11Z) - QPGesture: Quantization-Based and Phase-Guided Motion Matching for
Natural Speech-Driven Gesture Generation [8.604430209445695]
Speech-driven gesture generation is highly challenging due to the random jitters of human motion.
We introduce a novel quantization-based and phase-guided motion-matching framework.
Our method outperforms recent approaches on speech-driven gesture generation.
arXiv Detail & Related papers (2023-05-18T16:31:25Z) - Generating Holistic 3D Human Motion from Speech [97.11392166257791]
We build a high-quality dataset of 3D holistic body meshes with synchronous speech.
We then define a novel speech-to-motion generation framework in which the face, body, and hands are modeled separately.
arXiv Detail & Related papers (2022-12-08T17:25:19Z) - Learning Hierarchical Cross-Modal Association for Co-Speech Gesture
Generation [107.10239561664496]
We propose a novel framework named Hierarchical Audio-to-Gesture (HA2G) for co-speech gesture generation.
The proposed method renders realistic co-speech gestures and outperforms previous methods by a clear margin.
arXiv Detail & Related papers (2022-03-24T16:33:29Z) - Freeform Body Motion Generation from Speech [53.50388964591343]
Body motion generation from speech is inherently difficult due to the non-deterministic mapping from speech to body motions.
We introduce a novel freeform motion generation model (FreeMo) equipped with a two-stream architecture.
Experiments demonstrate superior performance over several baselines.
arXiv Detail & Related papers (2022-03-04T13:03:22Z) - Towards Multi-Scale Style Control for Expressive Speech Synthesis [60.08928435252417]
The proposed method employs a multi-scale reference encoder to extract both the global-scale utterance-level and the local-scale quasi-phoneme-level style features of the target speech.
During training, the multi-scale style model can be jointly trained with the speech synthesis model in an end-to-end fashion.
arXiv Detail & Related papers (2021-04-08T05:50:09Z) - Sequence-to-Sequence Predictive Model: From Prosody To Communicative
Gestures [2.578242050187029]
We develop a model based on a recurrent neural network with attention mechanism.
We find that the model predicts certain gesture classes better than others.
We also find that a model trained on the data of one speaker also works for the other speaker in the same conversation.
arXiv Detail & Related papers (2020-08-17T21:55:22Z) - Gesticulator: A framework for semantically-aware speech-driven gesture
generation [17.284154896176553]
We present a model designed to produce arbitrary beat and semantic gestures together.
Our deep-learning based model takes both acoustic and semantic representations of speech as input, and generates gestures as a sequence of joint angle rotations as output.
The resulting gestures can be applied to both virtual agents and humanoid robots.
arXiv Detail & Related papers (2020-01-25T14:42:23Z)
This list is automatically generated from the titles and abstracts of the papers in this site.