Real-time Gesture Animation Generation from Speech for Virtual Human
Interaction
- URL: http://arxiv.org/abs/2208.03244v1
- Date: Fri, 5 Aug 2022 15:56:34 GMT
- Title: Real-time Gesture Animation Generation from Speech for Virtual Human
Interaction
- Authors: Manuel Rebol, Christian Gütl, Krzysztof Pietroszek
- Abstract summary: We propose a real-time system for synthesizing gestures directly from speech.
Our data-driven approach is based on Generative Adversarial Neural Networks.
The model generates speaker-specific gestures by taking consecutive audio input chunks of two seconds in length.
- Score: 9.453554184019108
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We propose a real-time system for synthesizing gestures directly from speech.
Our data-driven approach is based on Generative Adversarial Neural Networks to
model the speech-gesture relationship. We utilize the large amount of speaker
video data available online to train our 3D gesture model. Our model generates
speaker-specific gestures by taking consecutive audio input chunks of two
seconds in length. We animate the predicted gestures on a virtual avatar. We
achieve a delay below three seconds between the time of audio input and gesture
animation. Code and videos are available at
https://github.com/mrebol/Gestures-From-Speech
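The processing loop implied by the abstract — capture a two-second audio chunk, run the trained generator, stream the predicted poses to the avatar — can be illustrated with a minimal sketch. This is not the authors' released code; the audio capture, generator, avatar interface, sample rate, frame rate, and joint count below are hypothetical placeholders standing in for their GAN pipeline.

```python
# Minimal sketch of the chunked real-time inference loop described in the
# abstract. NOT the authors' implementation; all components are placeholders.
import time
import numpy as np

SAMPLE_RATE = 16_000           # assumed audio sample rate
CHUNK_SECONDS = 2              # consecutive 2-second audio chunks (per abstract)
CHUNK_SAMPLES = SAMPLE_RATE * CHUNK_SECONDS
NUM_JOINTS = 49                # hypothetical upper-body joint count
FPS = 30                       # assumed output frame rate

def next_audio_chunk() -> np.ndarray:
    """Placeholder for microphone capture; returns one 2-second audio chunk."""
    return np.random.randn(CHUNK_SAMPLES).astype(np.float32)

def generator(audio_chunk: np.ndarray) -> np.ndarray:
    """Stand-in for the trained GAN generator: audio chunk -> 3D keypoints.
    Returns an array of shape (frames, joints, 3)."""
    return np.zeros((FPS * CHUNK_SECONDS, NUM_JOINTS, 3), dtype=np.float32)

def send_to_avatar(poses: np.ndarray) -> None:
    """Placeholder for streaming predicted poses to the virtual-avatar renderer."""
    pass

if __name__ == "__main__":
    t0 = time.time()
    chunk = next_audio_chunk()   # blocks until 2 s of audio is available
    poses = generator(chunk)     # inference on the latest chunk
    send_to_avatar(poses)        # animate; end-to-end delay should stay below 3 s
    print(f"chunk processed in {time.time() - t0:.3f} s")
```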
Related papers
- CoCoGesture: Toward Coherent Co-speech 3D Gesture Generation in the Wild [44.401536230814465]
CoCoGesture is a novel framework enabling vivid and diverse gesture synthesis from unseen human speech prompts.
Our key insight is built upon a custom-designed pretrain-finetune training paradigm.
The proposed CoCoGesture outperforms state-of-the-art methods on zero-shot speech-to-gesture generation.
arXiv Detail & Related papers (2024-05-27T06:47:14Z) - Generating Holistic 3D Human Motion from Speech [97.11392166257791]
We build a high-quality dataset of 3D holistic body meshes with synchronous speech.
We then define a novel speech-to-motion generation framework in which the face, body, and hands are modeled separately.
arXiv Detail & Related papers (2022-12-08T17:25:19Z) - Audio-Driven Co-Speech Gesture Video Generation [92.15661971086746]
We define and study this challenging problem of audio-driven co-speech gesture video generation.
Our key insight is that the co-speech gestures can be decomposed into common motion patterns and subtle rhythmic dynamics.
We propose a novel framework, Audio-driveN Gesture vIdeo gEneration (ANGIE), to effectively capture the reusable co-speech gesture patterns.
arXiv Detail & Related papers (2022-12-05T15:28:22Z) - Learning Hierarchical Cross-Modal Association for Co-Speech Gesture
Generation [107.10239561664496]
We propose a novel framework named Hierarchical Audio-to-Gesture (HA2G) for co-speech gesture generation.
The proposed method renders realistic co-speech gestures and outperforms previous methods by a clear margin.
arXiv Detail & Related papers (2022-03-24T16:33:29Z) - Learning Speech-driven 3D Conversational Gestures from Video [106.15628979352738]
We propose the first approach to automatically and jointly synthesize synchronous 3D conversational body and hand gestures.
Our algorithm uses a CNN architecture that leverages the inherent correlation between facial expression and hand gestures.
We also contribute a new way to create a large corpus of more than 33 hours of annotated body, hand, and face data from in-the-wild videos of talking people.
arXiv Detail & Related papers (2021-02-13T01:05:39Z) - Audio- and Gaze-driven Facial Animation of Codec Avatars [149.0094713268313]
We describe the first approach to animate Codec Avatars in real-time using audio and/or eye tracking.
Our goal is to display expressive conversations between individuals that exhibit important social signals.
arXiv Detail & Related papers (2020-08-11T22:28:48Z) - Speech2Video Synthesis with 3D Skeleton Regularization and Expressive
Body Poses [36.00309828380724]
We propose a novel approach to convert given speech audio to a photo-realistic speaking video of a specific person.
We achieve this by first generating 3D skeleton movements from the audio sequence using a recurrent neural network (RNN).
To make the skeleton movement realistic and expressive, we embed the knowledge of an articulated 3D human skeleton and a learned dictionary of personal speech iconic gestures into the generation process.
arXiv Detail & Related papers (2020-07-17T19:30:14Z)
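As an illustration of the audio-to-skeleton stage described in the Speech2Video summary above, the following is a minimal sketch of a recurrent model that maps per-frame audio features to 3D joint positions. The LSTM layout, feature dimension, and joint count are assumptions for illustration only, not the paper's actual architecture.

```python
# Hedged sketch of an audio -> 3D skeleton recurrent model.
# All dimensions and layer choices are illustrative assumptions.
import torch
import torch.nn as nn

class AudioToSkeletonRNN(nn.Module):
    """Maps a sequence of per-frame audio features to 3D joint positions."""
    def __init__(self, audio_dim: int = 40, hidden_dim: int = 256, num_joints: int = 21):
        super().__init__()
        self.num_joints = num_joints
        self.rnn = nn.LSTM(audio_dim, hidden_dim, num_layers=2, batch_first=True)
        self.head = nn.Linear(hidden_dim, num_joints * 3)   # x, y, z per joint

    def forward(self, audio_feats: torch.Tensor) -> torch.Tensor:
        # audio_feats: (batch, frames, audio_dim), e.g. per-frame MFCCs
        h, _ = self.rnn(audio_feats)
        out = self.head(h)                                   # (batch, frames, joints * 3)
        return out.view(audio_feats.size(0), -1, self.num_joints, 3)

model = AudioToSkeletonRNN()
mfcc = torch.randn(1, 120, 40)      # 1 clip, 120 frames of 40-dim audio features
skeleton = model(mfcc)              # (1, 120, 21, 3) joint trajectory
```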
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information it contains and is not responsible for any consequences of its use.