Audio2Head: Audio-driven One-shot Talking-head Generation with Natural Head Motion
- URL: http://arxiv.org/abs/2107.09293v1
- Date: Tue, 20 Jul 2021 07:22:42 GMT
- Title: Audio2Head: Audio-driven One-shot Talking-head Generation with Natural Head Motion
- Authors: Suzhen Wang, Lincheng Li, Yu Ding, Changjie Fan, Xin Yu
- Abstract summary: We propose an audio-driven talking-head method to generate photo-realistic talking-head videos from a single reference image.
We first design a head pose predictor by modeling rigid 6D head movements with a motion-aware recurrent neural network (RNN).
Then, we develop a motion field generator to produce the dense motion fields from input audio, head poses, and a reference image.
- Score: 34.406907667904996
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: We propose an audio-driven talking-head method that generates photo-realistic
talking-head videos from a single reference image. In this work, we tackle two
key challenges: (i) producing natural head motions that match the speech prosody,
and (ii) maintaining the appearance of the speaker under large head motions while
keeping the non-face regions stable. We first design a head pose predictor that
models rigid 6D head movements with a motion-aware recurrent neural network (RNN).
The predicted head poses act as the low-frequency holistic movements of the talking
head, allowing the subsequent networks to focus on generating detailed facial
movements. To depict all of the image motion arising from the audio, we exploit a
keypoint-based dense motion field representation. We then develop a motion field
generator that produces dense motion fields from the input audio, the head poses,
and the reference image. Because this keypoint-based representation models the
motion of the facial regions, the head, and the background jointly, our method can
better constrain the spatial and temporal consistency of the generated videos.
Finally, an image generation network renders photo-realistic talking-head videos
from the estimated keypoint-based motion fields and the input reference image.
Extensive experiments demonstrate that our method produces videos with plausible
head motions, synchronized facial expressions, and stable backgrounds, and
outperforms the state of the art.
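To make the pipeline described in the abstract concrete, below is a minimal PyTorch sketch of its first two stages: a motion-aware RNN that predicts rigid 6D head poses from audio, and a motion field generator that predicts keypoint motion from the audio, the poses, and the reference image. The 80-dimensional audio features, the 10-keypoint representation, the LSTM/MLP backbones, and all module sizes are assumptions made for illustration and do not come from the paper; the warping-based image generation network is omitted.

```python
# Illustrative sketch only: shapes, dimensions, and backbones are assumptions,
# not the authors' released Audio2Head configuration.
import torch
import torch.nn as nn


class HeadPosePredictor(nn.Module):
    """Motion-aware RNN: maps per-frame audio features to rigid 6D head poses
    (3 rotation + 3 translation parameters), conditioned on the previous pose."""

    def __init__(self, audio_dim=80, hidden_dim=256):
        super().__init__()
        self.rnn = nn.LSTM(audio_dim + 6, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, 6)

    def forward(self, audio_feats, init_pose):
        # audio_feats: (B, T, audio_dim); init_pose: (B, 6)
        poses, prev, state = [], init_pose, None
        for t in range(audio_feats.size(1)):
            x = torch.cat([audio_feats[:, t], prev], dim=-1).unsqueeze(1)
            out, state = self.rnn(x, state)
            prev = self.head(out[:, 0])
            poses.append(prev)
        return torch.stack(poses, dim=1)  # (B, T, 6) pose sequence


class MotionFieldGenerator(nn.Module):
    """Predicts K sparse keypoints per frame from the reference image, audio,
    and head poses; a dense flow field would then be interpolated from the
    keypoint motion to drive the (omitted) image generation network."""

    def __init__(self, audio_dim=80, num_kp=10):
        super().__init__()
        self.image_enc = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.mlp = nn.Sequential(
            nn.Linear(64 + audio_dim + 6, 256), nn.ReLU(),
            nn.Linear(256, num_kp * 2))
        self.num_kp = num_kp

    def forward(self, ref_image, audio_feats, poses):
        # ref_image: (B, 3, H, W); audio_feats: (B, T, audio_dim); poses: (B, T, 6)
        img_code = self.image_enc(ref_image)                 # (B, 64)
        T = audio_feats.size(1)
        img_code = img_code.unsqueeze(1).expand(-1, T, -1)   # (B, T, 64)
        x = torch.cat([img_code, audio_feats, poses], dim=-1)
        kp = self.mlp(x).view(-1, T, self.num_kp, 2).tanh()
        return kp  # (B, T, K, 2) keypoints in normalized image coordinates


# Toy usage: a 2-second clip at 25 fps -> T = 50 audio frames.
B, T = 1, 50
ref = torch.randn(B, 3, 256, 256)
audio = torch.randn(B, T, 80)
poses = HeadPosePredictor()(audio, init_pose=torch.zeros(B, 6))
keypoints = MotionFieldGenerator()(ref, audio, poses)
print(poses.shape, keypoints.shape)  # (1, 50, 6) and (1, 50, 10, 2)
```

In a full implementation, the per-frame keypoints would be expanded into a dense flow field (for example via first-order-motion-style local interpolation) that warps the reference image inside the image generation network to render the final frames.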
Related papers
- From Audio to Photoreal Embodiment: Synthesizing Humans in Conversations [107.88375243135579]
Given speech audio, we output multiple possibilities of gestural motion for an individual, including face, body, and hands.
We visualize the generated motion using highly photorealistic avatars that can express crucial nuances in gestures.
Experiments show our model generates appropriate and diverse gestures, outperforming both diffusion- and VQ-only methods.
arXiv Detail & Related papers (2024-01-03T18:55:16Z)
- FaceTalk: Audio-Driven Motion Diffusion for Neural Parametric Head Models [85.16273912625022]
We introduce FaceTalk, a novel generative approach designed for synthesizing high-fidelity 3D motion sequences of talking human heads from an audio signal.
To the best of our knowledge, this is the first work to propose a generative approach for realistic and high-quality motion synthesis of human heads.
arXiv Detail & Related papers (2023-12-13T19:01:07Z)
- VividTalk: One-Shot Audio-Driven Talking Head Generation Based on 3D Hybrid Prior [28.737324182301652]
We propose a two-stage generic framework that supports generating talking-head videos with high visual quality.
In the first stage, we map the audio to a mesh by learning two motions, namely non-rigid expression motion and rigid head motion.
In the second stage, we propose a dual-branch motion-VAE and a generator to transform the meshes into dense motion and synthesize high-quality video frame by frame.
arXiv Detail & Related papers (2023-12-04T12:25:37Z)
- FONT: Flow-guided One-shot Talking Head Generation with Natural Head Motions [14.205344055665414]
The Flow-guided One-shot model (FONT) achieves natural head motions over the generated talking heads.
A head pose prediction module is designed to generate head pose sequences from the source face and the driving audio.
arXiv Detail & Related papers (2023-03-31T03:25:06Z)
- Pose-Controllable 3D Facial Animation Synthesis using Hierarchical Audio-Vertex Attention [52.63080543011595]
A novel pose-controllable 3D facial animation synthesis method is proposed by utilizing hierarchical audio-vertex attention.
The proposed method can produce more realistic facial expressions and head posture movements.
arXiv Detail & Related papers (2023-02-24T09:36:31Z)
- Diffused Heads: Diffusion Models Beat GANs on Talking-Face Generation [54.68893964373141]
Talking face generation has historically struggled to produce head movements and natural facial expressions without guidance from additional reference videos.
Recent developments in diffusion-based generative models allow for more realistic and stable data synthesis.
We present an autoregressive diffusion model that requires only one identity image and audio sequence to generate a video of a realistic talking human head.
arXiv Detail & Related papers (2023-01-06T14:16:54Z)
- Live Speech Portraits: Real-Time Photorealistic Talking-Head Animation [12.552355581481999]
We present the first live system that generates personalized photorealistic talking-head animation driven only by audio signals, at over 30 fps.
The first stage is a deep neural network that extracts deep audio features along with a manifold projection to project the features to the target person's speech space.
In the second stage, we learn facial dynamics and motions from the projected audio features.
In the final stage, we generate conditional feature maps from previous predictions and send them with a candidate image set to an image-to-image translation network to synthesize photorealistic renderings.
arXiv Detail & Related papers (2021-09-22T08:47:43Z)
- MakeItTalk: Speaker-Aware Talking-Head Animation [49.77977246535329]
We present a method that generates expressive talking heads from a single facial image with audio as the only input.
Based on this intermediate representation, our method is able to synthesize photorealistic videos of entire talking heads with full range of motion.
arXiv Detail & Related papers (2020-04-27T17:56:15Z)
- Audio-driven Talking Face Video Generation with Learning-based Personalized Head Pose [67.31838207805573]
We propose a deep neural network model that takes an audio signal A of a source person and a short video V of a target person as input.
It outputs a synthesized high-quality talking face video with a personalized head pose.
Our method can generate high-quality talking face videos with more distinguishing head movement effects than state-of-the-art methods.
arXiv Detail & Related papers (2020-02-24T10:02:10Z)