Speech Gesture Generation from the Trimodal Context of Text, Audio, and
Speaker Identity
- URL: http://arxiv.org/abs/2009.02119v1
- Date: Fri, 4 Sep 2020 11:42:45 GMT
- Title: Speech Gesture Generation from the Trimodal Context of Text, Audio, and
Speaker Identity
- Authors: Youngwoo Yoon, Bok Cha, Joo-Haeng Lee, Minsu Jang, Jaeyeon Lee,
Jaehong Kim, Geehyuk Lee
- Abstract summary: We present an automatic gesture generation model that uses the multimodal context of speech text, audio, and speaker identity to reliably generate gestures.
Experiments with the introduced metric and subjective human evaluation showed that the proposed gesture generation model is better than existing end-to-end generation models.
- Score: 21.61168067832304
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: For human-like agents, including virtual avatars and social robots, making
proper gestures while speaking is crucial in human-agent interaction.
Co-speech gestures enhance interaction experiences and make the agents look
alive. However, it is difficult to generate human-like gestures due to the lack
of understanding of how people gesture. Data-driven approaches attempt to learn
gesticulation skills from human demonstrations, but the ambiguous and
individual nature of gestures hinders learning. In this paper, we present an
automatic gesture generation model that uses the multimodal context of speech
text, audio, and speaker identity to reliably generate gestures. By
incorporating a multimodal context and an adversarial training scheme, the
proposed model outputs gestures that are human-like and that match with speech
content and rhythm. We also introduce a new quantitative evaluation metric for
gesture generation models. Experiments with the introduced metric and
subjective human evaluation showed that the proposed gesture generation model
is better than existing end-to-end generation models. We further confirm that
our model is able to work with synthesized audio in a scenario where contexts
are constrained, and show that different gesture styles can be generated for
the same speech by specifying different speaker identities in the style
embedding space that is learned from videos of various speakers. All the code
and data are available at
https://github.com/ai4r/Gesture-Generation-from-Trimodal-Context.
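The abstract describes encoders for three input modalities (speech text, audio, and a speaker-identity style embedding) whose outputs drive a gesture decoder trained with an adversarial scheme. Below is a minimal, hypothetical PyTorch sketch of how such a trimodal generator could be wired up; all module names, feature dimensions, and the frame-level text/audio alignment are illustrative assumptions, the adversarial discriminator and the proposed evaluation metric are omitted, and the authors' actual implementation is the one in the repository linked above.

```python
# Hypothetical sketch of a trimodal co-speech gesture generator (not the authors' code).
# Dimensions and module choices are illustrative assumptions; see the linked repository
# (ai4r/Gesture-Generation-from-Trimodal-Context) for the real model.
import torch
import torch.nn as nn

class TrimodalGestureGenerator(nn.Module):
    def __init__(self, vocab_size=20000, n_speakers=100,
                 text_dim=64, audio_dim=32, style_dim=16, pose_dim=27, hidden=256):
        super().__init__()
        # Text context: word embeddings, assumed already aligned to motion frames.
        self.word_emb = nn.Embedding(vocab_size, text_dim)
        # Audio context: a small encoder over per-frame spectral features (80-dim assumed).
        self.audio_enc = nn.Sequential(nn.Linear(80, audio_dim), nn.ReLU())
        # Speaker identity: a learned style-embedding space, one vector per speaker.
        self.style_emb = nn.Embedding(n_speakers, style_dim)
        # Recurrent decoder maps the fused trimodal context to a pose sequence.
        self.decoder = nn.GRU(text_dim + audio_dim + style_dim, hidden,
                              num_layers=2, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, pose_dim)

    def forward(self, word_ids, audio_feats, speaker_ids):
        # word_ids: (B, T), audio_feats: (B, T, 80), speaker_ids: (B,)
        T = word_ids.size(1)
        style = self.style_emb(speaker_ids).unsqueeze(1).expand(-1, T, -1)
        ctx = torch.cat([self.word_emb(word_ids),
                         self.audio_enc(audio_feats), style], dim=-1)
        h, _ = self.decoder(ctx)
        return self.out(h)  # (B, T, pose_dim) joint rotations/coordinates per frame

# Toy usage: the same speech with two speaker identities yields two gesture styles,
# mirroring the style-control claim in the abstract.
gen = TrimodalGestureGenerator()
words = torch.randint(0, 20000, (1, 34))      # 34 frames of aligned word ids
audio = torch.randn(1, 34, 80)                # 34 frames of spectral features
style_a = gen(words, audio, torch.tensor([3]))
style_b = gen(words, audio, torch.tensor([42]))
print(style_a.shape, style_b.shape)           # torch.Size([1, 34, 27]) each
```

In the paper, the generated poses are additionally judged by an adversarial discriminator and evaluated with the newly introduced quantitative metric; both are left out of this sketch.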
Related papers
- ConvoFusion: Multi-Modal Conversational Diffusion for Co-Speech Gesture Synthesis [50.69464138626748]
We present ConvoFusion, a diffusion-based approach for multi-modal gesture synthesis.
Our method proposes two guidance objectives that allow the users to modulate the impact of different conditioning modalities.
Our method is versatile in that it can be trained to generate either monologue gestures or conversational gestures.
arXiv Detail & Related papers (2024-03-26T17:59:52Z) - Freetalker: Controllable Speech and Text-Driven Gesture Generation Based
on Diffusion Models for Enhanced Speaker Naturalness [45.90256126021112]
We introduce FreeTalker, the first framework for generating both spontaneous (e.g., co-speech gestures) and non-spontaneous (e.g., moving around the podium) speaker motions.
Specifically, we train a diffusion-based model for speaker motion generation that employs unified representations of both speech-driven gestures and text-driven motions.
arXiv Detail & Related papers (2024-01-07T13:01:29Z) - Audio is all in one: speech-driven gesture synthetics using WavLM pre-trained model [2.827070255699381]
diffmotion-v2 is a speech-conditional diffusion-based generative model built on the pre-trained WavLM model.
It can produce individual and stylized full-body co-speech gestures using only raw speech audio.
arXiv Detail & Related papers (2023-08-11T08:03:28Z) - Generating Holistic 3D Human Motion from Speech [97.11392166257791]
We build a high-quality dataset of 3D holistic body meshes with synchronous speech.
We then define a novel speech-to-motion generation framework in which the face, body, and hands are modeled separately.
arXiv Detail & Related papers (2022-12-08T17:25:19Z) - Audio-Driven Co-Speech Gesture Video Generation [92.15661971086746]
We define and study this challenging problem of audio-driven co-speech gesture video generation.
Our key insight is that the co-speech gestures can be decomposed into common motion patterns and subtle rhythmic dynamics.
We propose a novel framework, Audio-driveN Gesture vIdeo gEneration (ANGIE), to effectively capture the reusable co-speech gesture patterns.
arXiv Detail & Related papers (2022-12-05T15:28:22Z) - Zero-Shot Style Transfer for Gesture Animation driven by Text and Speech
using Adversarial Disentanglement of Multimodal Style Encoding [3.2116198597240846]
We propose an efficient yet effective machine learning approach to synthesize gestures driven by prosodic features and text in the style of different speakers.
Our model performs zero-shot multimodal style transfer driven by multimodal data from the PATS database containing videos of various speakers.
arXiv Detail & Related papers (2022-08-03T08:49:55Z) - Learning Hierarchical Cross-Modal Association for Co-Speech Gesture
Generation [107.10239561664496]
We propose a novel framework named Hierarchical Audio-to-Gesture (HA2G) for co-speech gesture generation.
The proposed method renders realistic co-speech gestures and outperforms previous methods by a clear margin.
arXiv Detail & Related papers (2022-03-24T16:33:29Z) - Style Transfer for Co-Speech Gesture Animation: A Multi-Speaker
Conditional-Mixture Approach [46.50460811211031]
The key challenge is to learn a model that generates gestures for a speaking agent 'A' in the gesturing style of a target speaker 'B'.
We propose Mix-StAGE, which trains a single model for multiple speakers while learning unique style embeddings for each speaker's gestures.
As Mix-StAGE disentangles style and content of gestures, gesturing styles for the same input speech can be altered by simply switching the style embeddings.
arXiv Detail & Related papers (2020-07-24T15:01:02Z) - Gesticulator: A framework for semantically-aware speech-driven gesture
generation [17.284154896176553]
We present a model designed to produce arbitrary beat and semantic gestures together.
Our deep-learning based model takes both acoustic and semantic representations of speech as input, and generates gestures as a sequence of joint angle rotations as output.
The resulting gestures can be applied to both virtual agents and humanoid robots.
arXiv Detail & Related papers (2020-01-25T14:42:23Z)