Speech2Video Synthesis with 3D Skeleton Regularization and Expressive
Body Poses
- URL: http://arxiv.org/abs/2007.09198v5
- Date: Thu, 8 Oct 2020 23:19:03 GMT
- Title: Speech2Video Synthesis with 3D Skeleton Regularization and Expressive
Body Poses
- Authors: Miao Liao, Sibo Zhang, Peng Wang, Hao Zhu, Xinxin Zuo, and Ruigang
Yang
- Abstract summary: We propose a novel approach to convert given speech audio to a photo-realistic speaking video of a specific person.
We achieve this by first generating 3D skeleton movements from the audio sequence using a recurrent neural network (RNN).
To make the skeleton movement realistic and expressive, we embed the knowledge of an articulated 3D human skeleton and a learned dictionary of personal speech iconic gestures into the generation process.
- Score: 36.00309828380724
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we propose a novel approach to convert given speech audio to a
photo-realistic speaking video of a specific person, where the output video has
synchronized, realistic, and expressive rich body dynamics. We achieve this by
first generating 3D skeleton movements from the audio sequence using a
recurrent neural network (RNN), and then synthesizing the output video via a
conditional generative adversarial network (GAN). To make the skeleton movement
realistic and expressive, we embed the knowledge of an articulated 3D human
skeleton and a learned dictionary of personal speech iconic gestures into the
generation process in both learning and testing pipelines. The former prevents
the generation of unreasonable body distortion, while the latter helps our model
quickly learn meaningful body movement through a few recorded videos. To
produce photo-realistic and high-resolution video with motion details, we
propose to insert part attention mechanisms into the conditional GAN, where each
detailed part, e.g., the head and hands, is automatically zoomed in and given its
own discriminator. To validate our approach, we collect a dataset with 20
high-quality videos from 1 male and 1 female model reading various documents
under different topics. Compared with previous SoTA pipelines handling similar
tasks, our approach achieves better results in a user study.
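
A minimal sketch of the two-stage pipeline described in the abstract, assuming PyTorch: a recurrent network maps audio features to 3D joint positions, and zoomed-in crops of detailed parts (head, hands) are each scored by their own discriminator during GAN training. All module names, dimensions, and the crop_part helper are illustrative assumptions, not the authors' released code; the sketch also omits the articulated 3D skeleton constraints and the learned dictionary of iconic gestures.

import torch
import torch.nn as nn

class AudioToSkeletonRNN(nn.Module):
    """Stage 1: map a sequence of audio features to 3D joint positions."""
    def __init__(self, audio_dim=128, hidden_dim=256, num_joints=54):
        super().__init__()
        self.rnn = nn.LSTM(audio_dim, hidden_dim, num_layers=2, batch_first=True)
        self.head = nn.Linear(hidden_dim, num_joints * 3)  # (x, y, z) per joint

    def forward(self, audio_feats):                 # audio_feats: (B, T, audio_dim)
        h, _ = self.rnn(audio_feats)                # (B, T, hidden_dim)
        joints = self.head(h)                       # (B, T, num_joints * 3)
        return joints.reshape(joints.shape[0], joints.shape[1], -1, 3)

class PartDiscriminator(nn.Module):
    """PatchGAN-style discriminator for one zoomed-in part (e.g. head or a hand)."""
    def __init__(self, in_ch=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(128, 1, 4, stride=1, padding=1),
        )

    def forward(self, img_crop):
        return self.net(img_crop)

def part_attention_loss(frame, part_names, discriminators, crop_part):
    """Stage 2 idea: each detailed part is cropped ('zoomed in') and scored by
    its own discriminator; the per-part adversarial losses are summed."""
    loss = frame.new_zeros(())
    for name in part_names:                         # e.g. ["head", "left_hand", "right_hand"]
        crop = crop_part(frame, name)               # hypothetical crop guided by the skeleton
        loss = loss + torch.mean((discriminators[name](crop) - 1.0) ** 2)  # LSGAN-style term
    return loss

In use, the skeleton sequence predicted by AudioToSkeletonRNN would condition the video generator, and part_attention_loss would be added to the usual full-frame adversarial loss so that fine details of the face and hands receive dedicated supervision.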
Related papers
- Speech2UnifiedExpressions: Synchronous Synthesis of Co-Speech Affective Face and Body Expressions from Affordable Inputs [67.27840327499625]
We present a multimodal learning-based method to simultaneously synthesize co-speech facial expressions and upper-body gestures for digital characters.
Our approach learns from sparse face landmarks and upper-body joints, estimated directly from video data, to generate plausible emotive character motions.
arXiv Detail & Related papers (2024-06-26T04:53:11Z) - FaceDiffuser: Speech-Driven 3D Facial Animation Synthesis Using
Diffusion [0.0]
We present FaceDiffuser, a non-deterministic deep learning model to generate speech-driven facial animations.
Our method is based on the diffusion technique and uses the pre-trained large speech representation model HuBERT to encode the audio input.
We also introduce a new in-house dataset built around a blendshape-based rigged character.
arXiv Detail & Related papers (2023-09-20T13:33:00Z) - Speech2Lip: High-fidelity Speech to Lip Generation by Learning from a
Short Video [91.92782707888618]
We present a decomposition-composition framework named Speech to Lip (Speech2Lip) that disentangles speech-sensitive and speech-insensitive motion/appearance.
We show that our model can be trained on a video just a few minutes long and achieves state-of-the-art performance in both visual quality and speech-visual synchronization.
arXiv Detail & Related papers (2023-09-09T14:52:39Z) - Generating Holistic 3D Human Motion from Speech [97.11392166257791]
We build a high-quality dataset of 3D holistic body meshes with synchronous speech.
We then define a novel speech-to-motion generation framework in which the face, body, and hands are modeled separately.
arXiv Detail & Related papers (2022-12-08T17:25:19Z) - A Novel Speech-Driven Lip-Sync Model with CNN and LSTM [12.747541089354538]
We present a combined deep neural network of one-dimensional convolutions and LSTM to generate displacement of a 3D template face model from variable-length speech input.
In order to enhance the robustness of the network to different sound signals, we adapt a trained speech recognition model to extract speech features.
We show that our model is able to generate smooth and natural lip movements synchronized with speech.
arXiv Detail & Related papers (2022-05-02T13:57:50Z) - Live Speech Portraits: Real-Time Photorealistic Talking-Head Animation [12.552355581481999]
We first present a live system that generates personalized photorealistic talking-head animation, driven only by audio signals, at over 30 fps.
The first stage is a deep neural network that extracts deep audio features along with a manifold projection to project the features to the target person's speech space.
In the second stage, we learn facial dynamics and motions from the projected audio features.
In the final stage, we generate conditional feature maps from previous predictions and send them with a candidate image set to an image-to-image translation network to synthesize photorealistic renderings.
arXiv Detail & Related papers (2021-09-22T08:47:43Z) - End-to-End Video-To-Speech Synthesis using Generative Adversarial
Networks [54.43697805589634]
We propose a new end-to-end video-to-speech model based on Generative Adversarial Networks (GANs).
Our model consists of an encoder-decoder architecture that receives raw video as input and generates speech.
We show that this model is able to reconstruct speech with remarkable realism for constrained datasets such as GRID.
arXiv Detail & Related papers (2021-04-27T17:12:30Z) - Learning Speech-driven 3D Conversational Gestures from Video [106.15628979352738]
We propose the first approach to automatically and jointly synthesize synchronous 3D conversational body and hand gestures.
Our algorithm uses a CNN architecture that leverages the inherent correlation between facial expression and hand gestures.
We also contribute a new way to create a large corpus of more than 33 hours of annotated body, hand, and face data from in-the-wild videos of talking people.
arXiv Detail & Related papers (2021-02-13T01:05:39Z) - Neural Human Video Rendering by Learning Dynamic Textures and
Rendering-to-Video Translation [99.64565200170897]
We propose a novel human video synthesis method by explicitly disentangling the learning of time-coherent fine-scale details from the embedding of the human in 2D screen space.
We show several applications of our approach, such as human reenactment and novel view synthesis from monocular video, where we show significant improvement over the state of the art both qualitatively and quantitatively.
arXiv Detail & Related papers (2020-01-14T18:06:27Z)