Real-time Gesture Animation Generation from Speech for Virtual Human Interaction
- URL: http://arxiv.org/abs/2208.03244v1
- Date: Fri, 5 Aug 2022 15:56:34 GMT
- Title: Real-time Gesture Animation Generation from Speech for Virtual Human Interaction
- Authors: Manuel Rebol, Christian Gütl, Krzysztof Pietroszek
- Abstract summary: We propose a real-time system for synthesizing gestures directly from speech.
Our data-driven approach is based on Generative Adversarial Neural Networks.
The model generates speaker-specific gestures by taking consecutive audio input chunks of two seconds in length.
- Score: 9.453554184019108
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We propose a real-time system for synthesizing gestures directly from speech.
Our data-driven approach is based on Generative Adversarial Neural Networks to
model the speech-gesture relationship. We utilize the large amount of speaker
video data available online to train our 3D gesture model. Our model generates
speaker-specific gestures by taking consecutive audio input chunks of two
seconds in length. We animate the predicted gestures on a virtual avatar. We
achieve a delay below three seconds between the time of audio input and gesture
animation. Code and videos are available at
https://github.com/mrebol/Gestures-From-Speech
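The following is a minimal, illustrative sketch of the real-time loop the abstract describes: consecutive two-second audio chunks are fed to a trained generator network that predicts a pose sequence for the avatar. It is not the authors' code; the generator architecture, joint count, frame rate, and sampling rate are assumptions for illustration only.

```python
# Hedged sketch of the chunked audio -> gesture pipeline (not the paper's implementation).
import numpy as np
import torch
import torch.nn as nn

SAMPLE_RATE = 16_000          # assumed audio sampling rate
CHUNK_SECONDS = 2             # matches the 2-second chunks in the abstract
CHUNK_SAMPLES = SAMPLE_RATE * CHUNK_SECONDS
NUM_JOINTS = 49               # hypothetical number of body/hand keypoints
FRAMES_PER_CHUNK = 30         # hypothetical pose frames predicted per chunk


class GestureGenerator(nn.Module):
    """Placeholder GAN-style generator: raw audio chunk -> 3D pose sequence."""

    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv1d(1, 64, kernel_size=1024, stride=512), nn.ReLU(),
            nn.AdaptiveAvgPool1d(FRAMES_PER_CHUNK),
        )
        self.decoder = nn.Linear(64, NUM_JOINTS * 3)

    def forward(self, audio):                        # audio: (B, CHUNK_SAMPLES)
        feats = self.encoder(audio.unsqueeze(1))     # (B, 64, FRAMES_PER_CHUNK)
        poses = self.decoder(feats.transpose(1, 2))  # (B, FRAMES_PER_CHUNK, J*3)
        return poses.view(-1, FRAMES_PER_CHUNK, NUM_JOINTS, 3)


def stream_gestures(audio_stream, generator):
    """Consume an iterable of audio chunks and yield predicted pose sequences."""
    generator.eval()
    with torch.no_grad():
        for chunk in audio_stream:                   # each chunk: CHUNK_SAMPLES floats
            audio = torch.from_numpy(chunk).float().unsqueeze(0)
            yield generator(audio)[0].numpy()        # (FRAMES_PER_CHUNK, J, 3)


if __name__ == "__main__":
    # Stand-in for a live microphone: three chunks of silence.
    fake_stream = (np.zeros(CHUNK_SAMPLES, dtype=np.float32) for _ in range(3))
    for poses in stream_gestures(fake_stream, GestureGenerator()):
        print("pose sequence for avatar:", poses.shape)
```

In the actual system, the chunks would come from a live microphone and the predicted poses would be retargeted onto the virtual avatar, with the reported end-to-end delay staying below three seconds.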
Related papers
- AV-Flow: Transforming Text to Audio-Visual Human-like Interactions [101.31009576033776]
AV-Flow is an audio-visual generative model that animates photo-realistic 4D talking avatars given only text input.
We demonstrate human-like speech synthesis, synchronized lip motion, lively facial expressions and head pose.
arXiv Detail & Related papers (2025-02-18T18:56:18Z)
- GaussianSpeech: Audio-Driven Gaussian Avatars [76.10163891172192]
We introduce GaussianSpeech, a novel approach that synthesizes high-fidelity animation sequences of photo-realistic, personalized 3D human head avatars from spoken audio.
We propose a compact and efficient 3DGS-based avatar representation that generates expression-dependent color and leverages wrinkle- and perceptually-based losses to synthesize facial details.
arXiv Detail & Related papers (2024-11-27T18:54:08Z)
- CoCoGesture: Toward Coherent Co-speech 3D Gesture Generation in the Wild [42.09889990430308]
CoCoGesture is a novel framework enabling vivid and diverse gesture synthesis from unseen human speech prompts.
Our key insight is built upon a custom-designed pretrain-finetune training paradigm.
Our proposed CoCoGesture outperforms state-of-the-art methods on zero-shot speech-to-gesture generation.
arXiv Detail & Related papers (2024-05-27T06:47:14Z)
- Audio-Driven Co-Speech Gesture Video Generation [92.15661971086746]
We define and study this challenging problem of audio-driven co-speech gesture video generation.
Our key insight is that the co-speech gestures can be decomposed into common motion patterns and subtle rhythmic dynamics.
We propose a novel framework, Audio-driveN Gesture vIdeo gEneration (ANGIE), to effectively capture the reusable co-speech gesture patterns.
arXiv Detail & Related papers (2022-12-05T15:28:22Z)
- Learning Hierarchical Cross-Modal Association for Co-Speech Gesture Generation [107.10239561664496]
We propose a novel framework named Hierarchical Audio-to-Gesture (HA2G) for co-speech gesture generation.
The proposed method renders realistic co-speech gestures and outperforms previous methods by a clear margin.
arXiv Detail & Related papers (2022-03-24T16:33:29Z)
- Learning Speech-driven 3D Conversational Gestures from Video [106.15628979352738]
We propose the first approach to automatically and jointly synthesize synchronous 3D conversational body and hand gestures.
Our algorithm uses a CNN architecture that leverages the inherent correlation between facial expression and hand gestures.
We also contribute a new way to create a large corpus of more than 33 hours of annotated body, hand, and face data from in-the-wild videos of talking people.
arXiv Detail & Related papers (2021-02-13T01:05:39Z)
- Audio- and Gaze-driven Facial Animation of Codec Avatars [149.0094713268313]
We describe the first approach to animate Codec Avatars in real-time using audio and/or eye tracking.
Our goal is to display expressive conversations between individuals that exhibit important social signals.
arXiv Detail & Related papers (2020-08-11T22:28:48Z)
- Speech2Video Synthesis with 3D Skeleton Regularization and Expressive Body Poses [36.00309828380724]
We propose a novel approach to convert given speech audio to a photo-realistic speaking video of a specific person.
We achieve this by first generating 3D skeleton movements from the audio sequence using a recurrent neural network (RNN); a rough sketch of this audio-to-skeleton step appears below.
To make the skeleton movement realistic and expressive, we embed the knowledge of an articulated 3D human skeleton and a learned dictionary of personal speech iconic gestures into the generation process.
arXiv Detail & Related papers (2020-07-17T19:30:14Z)
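As referenced in the Speech2Video entry above, the first stage maps an audio sequence to 3D skeleton movements with an RNN. Below is a minimal illustrative sketch of such a stage; it is not the authors' code, and the feature dimension, joint count, and architecture are assumptions (the learned gesture dictionary and skeleton constraints are omitted).

```python
# Hedged sketch of an audio-to-skeleton RNN stage (assumed, not Speech2Video's implementation).
import torch
import torch.nn as nn

AUDIO_FEAT_DIM = 80    # assumed per-frame audio features (e.g. mel bins)
NUM_JOINTS = 18        # hypothetical skeleton joint count


class AudioToSkeletonRNN(nn.Module):
    def __init__(self, hidden=256):
        super().__init__()
        self.rnn = nn.LSTM(AUDIO_FEAT_DIM, hidden, num_layers=2, batch_first=True)
        self.head = nn.Linear(hidden, NUM_JOINTS * 3)

    def forward(self, audio_feats):                  # (B, T, AUDIO_FEAT_DIM)
        hidden, _ = self.rnn(audio_feats)            # (B, T, hidden)
        joints = self.head(hidden)                   # (B, T, NUM_JOINTS * 3)
        return joints.view(audio_feats.size(0), -1, NUM_JOINTS, 3)


if __name__ == "__main__":
    model = AudioToSkeletonRNN()
    dummy = torch.randn(1, 100, AUDIO_FEAT_DIM)      # 100 audio frames
    print(model(dummy).shape)                        # torch.Size([1, 100, 18, 3])
```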
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.