Real-time Gesture Animation Generation from Speech for Virtual Human Interaction
- URL: http://arxiv.org/abs/2208.03244v1
- Date: Fri, 5 Aug 2022 15:56:34 GMT
- Title: Real-time Gesture Animation Generation from Speech for Virtual Human Interaction
- Authors: Manuel Rebol, Christian Gütl, Krzysztof Pietroszek
- Abstract summary: We propose a real-time system for synthesizing gestures directly from speech.
Our data-driven approach is based on Generative Adversarial Neural Networks.
The model generates speaker-specific gestures by taking consecutive audio input chunks of two seconds in length.
- Score: 9.453554184019108
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We propose a real-time system for synthesizing gestures directly from speech.
Our data-driven approach is based on Generative Adversarial Neural Networks to
model the speech-gesture relationship. We utilize the large amount of speaker
video data available online to train our 3D gesture model. Our model generates
speaker-specific gestures by taking consecutive audio input chunks of two
seconds in length. We animate the predicted gestures on a virtual avatar. We
achieve a delay below three seconds between the time of audio input and gesture
animation. Code and videos are available at
https://github.com/mrebol/Gestures-From-Speech
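The following is a minimal, illustrative sketch of the real-time loop the abstract describes: consecutive two-second audio chunks are fed to a trained generator network that predicts a pose sequence for the avatar. It is not the authors' code; the generator architecture, joint count, frame rate, and sampling rate are assumptions for illustration only.

```python
# Hedged sketch of the chunked audio -> gesture pipeline (not the paper's implementation).
import numpy as np
import torch
import torch.nn as nn

SAMPLE_RATE = 16_000          # assumed audio sampling rate
CHUNK_SECONDS = 2             # matches the 2-second chunks in the abstract
CHUNK_SAMPLES = SAMPLE_RATE * CHUNK_SECONDS
NUM_JOINTS = 49               # hypothetical number of body/hand keypoints
FRAMES_PER_CHUNK = 30         # hypothetical pose frames predicted per chunk


class GestureGenerator(nn.Module):
    """Placeholder GAN-style generator: raw audio chunk -> 3D pose sequence."""

    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv1d(1, 64, kernel_size=1024, stride=512), nn.ReLU(),
            nn.AdaptiveAvgPool1d(FRAMES_PER_CHUNK),
        )
        self.decoder = nn.Linear(64, NUM_JOINTS * 3)

    def forward(self, audio):                        # audio: (B, CHUNK_SAMPLES)
        feats = self.encoder(audio.unsqueeze(1))     # (B, 64, FRAMES_PER_CHUNK)
        poses = self.decoder(feats.transpose(1, 2))  # (B, FRAMES_PER_CHUNK, J*3)
        return poses.view(-1, FRAMES_PER_CHUNK, NUM_JOINTS, 3)


def stream_gestures(audio_stream, generator):
    """Consume an iterable of audio chunks and yield predicted pose sequences."""
    generator.eval()
    with torch.no_grad():
        for chunk in audio_stream:                   # each chunk: CHUNK_SAMPLES floats
            audio = torch.from_numpy(chunk).float().unsqueeze(0)
            yield generator(audio)[0].numpy()        # (FRAMES_PER_CHUNK, J, 3)


if __name__ == "__main__":
    # Stand-in for a live microphone: three chunks of silence.
    fake_stream = (np.zeros(CHUNK_SAMPLES, dtype=np.float32) for _ in range(3))
    for poses in stream_gestures(fake_stream, GestureGenerator()):
        print("pose sequence for avatar:", poses.shape)
```

In the actual system, the chunks would come from a live microphone and the predicted poses would be retargeted onto the virtual avatar, with the reported end-to-end delay staying below three seconds.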
Related papers
- AV-Flow: Transforming Text to Audio-Visual Human-like Interactions [101.31009576033776]
AV-Flow is an audio-visual generative model that animates photo-realistic 4D talking avatars given only text input.
We demonstrate human-like speech synthesis, synchronized lip motion, lively facial expressions and head pose.
arXiv Detail & Related papers (2025-02-18T18:56:18Z)
- GaussianSpeech: Audio-Driven Gaussian Avatars [76.10163891172192]
We introduce GaussianSpeech, a novel approach that synthesizes high-fidelity animation sequences of photo-realistic, personalized 3D human head avatars from spoken audio.
We propose a compact and efficient 3DGS-based avatar representation that generates expression-dependent color and leverages wrinkle- and perceptually-based losses to synthesize facial details.
arXiv Detail & Related papers (2024-11-27T18:54:08Z)
- CoCoGesture: Toward Coherent Co-speech 3D Gesture Generation in the Wild [42.09889990430308]
CoCoGesture is a novel framework enabling vivid and diverse gesture synthesis from unseen human speech prompts.
Our key insight is built upon a custom-designed pretrain-finetune training paradigm.
Our proposed CoCoGesture outperforms state-of-the-art methods on zero-shot speech-to-gesture generation.
arXiv Detail & Related papers (2024-05-27T06:47:14Z)
- Audio-Driven Co-Speech Gesture Video Generation [92.15661971086746]
We define and study this challenging problem of audio-driven co-speech gesture video generation.
Our key insight is that the co-speech gestures can be decomposed into common motion patterns and subtle rhythmic dynamics.
We propose a novel framework, Audio-driveN Gesture vIdeo gEneration (ANGIE), to effectively capture the reusable co-speech gesture patterns.
arXiv Detail & Related papers (2022-12-05T15:28:22Z)
- Learning Hierarchical Cross-Modal Association for Co-Speech Gesture Generation [107.10239561664496]
We propose a novel framework named Hierarchical Audio-to-Gesture (HA2G) for co-speech gesture generation.
The proposed method renders realistic co-speech gestures and outperforms previous methods by a clear margin.
arXiv Detail & Related papers (2022-03-24T16:33:29Z)
- Learning Speech-driven 3D Conversational Gestures from Video [106.15628979352738]
We propose the first approach to automatically and jointly synthesize synchronous 3D conversational body and hand gestures.
Our algorithm uses a CNN architecture that leverages the inherent correlation between facial expression and hand gestures.
We also contribute a new way to create a large corpus of more than 33 hours of annotated body, hand, and face data from in-the-wild videos of talking people.
arXiv Detail & Related papers (2021-02-13T01:05:39Z)
- Audio- and Gaze-driven Facial Animation of Codec Avatars [149.0094713268313]
We describe the first approach to animate Codec Avatars in real-time using audio and/or eye tracking.
Our goal is to display expressive conversations between individuals that exhibit important social signals.
arXiv Detail & Related papers (2020-08-11T22:28:48Z)
- Speech2Video Synthesis with 3D Skeleton Regularization and Expressive Body Poses [36.00309828380724]
We propose a novel approach to convert given speech audio to a photo-realistic speaking video of a specific person.
We achieve this by first generating 3D skeleton movements from the audio sequence using a recurrent neural network (RNN); a rough sketch of this audio-to-skeleton step appears below.
To make the skeleton movement realistic and expressive, we embed the knowledge of an articulated 3D human skeleton and a learned dictionary of personal speech iconic gestures into the generation process.
arXiv Detail & Related papers (2020-07-17T19:30:14Z)
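As referenced in the Speech2Video entry above, the first stage maps an audio sequence to 3D skeleton movements with an RNN. Below is a minimal illustrative sketch of such a stage; it is not the authors' code, and the feature dimension, joint count, and architecture are assumptions (the learned gesture dictionary and skeleton constraints are omitted).

```python
# Hedged sketch of an audio-to-skeleton RNN stage (assumed, not Speech2Video's implementation).
import torch
import torch.nn as nn

AUDIO_FEAT_DIM = 80    # assumed per-frame audio features (e.g. mel bins)
NUM_JOINTS = 18        # hypothetical skeleton joint count


class AudioToSkeletonRNN(nn.Module):
    def __init__(self, hidden=256):
        super().__init__()
        self.rnn = nn.LSTM(AUDIO_FEAT_DIM, hidden, num_layers=2, batch_first=True)
        self.head = nn.Linear(hidden, NUM_JOINTS * 3)

    def forward(self, audio_feats):                  # (B, T, AUDIO_FEAT_DIM)
        hidden, _ = self.rnn(audio_feats)            # (B, T, hidden)
        joints = self.head(hidden)                   # (B, T, NUM_JOINTS * 3)
        return joints.view(audio_feats.size(0), -1, NUM_JOINTS, 3)


if __name__ == "__main__":
    model = AudioToSkeletonRNN()
    dummy = torch.randn(1, 100, AUDIO_FEAT_DIM)      # 100 audio frames
    print(model(dummy).shape)                        # torch.Size([1, 100, 18, 3])
```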
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.