VividTalk: One-Shot Audio-Driven Talking Head Generation Based on 3D
Hybrid Prior
- URL: http://arxiv.org/abs/2312.01841v2
- Date: Thu, 7 Dec 2023 03:14:22 GMT
- Title: VividTalk: One-Shot Audio-Driven Talking Head Generation Based on 3D
Hybrid Prior
- Authors: Xusen Sun, Longhao Zhang, Hao Zhu, Peng Zhang, Bang Zhang, Xinya Ji,
Kangneng Zhou, Daiheng Gao, Liefeng Bo, Xun Cao
- Abstract summary: We propose a two-stage generic framework that supports generating high-visual quality talking head videos.
In the first stage, we map the audio to a mesh by learning two kinds of motion: non-rigid expression motion and rigid head motion.
In the second stage, we propose a dual-branch motion-VAE and a generator to transform the meshes into dense motion and synthesize high-quality video frame by frame.
- Score: 28.737324182301652
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Audio-driven talking head generation has drawn much attention in recent
years, and many efforts have been made in lip-sync, expressive facial
expressions, natural head pose generation, and high video quality. However, no
existing model leads or ties across all of these aspects, owing to the one-to-many mapping
between audio and motion. In this paper, we propose VividTalk, a two-stage
generic framework that supports generating high-visual quality talking head
videos with all the above properties. Specifically, in the first stage, we map
the audio to mesh by learning two motions, including non-rigid expression
motion and rigid head motion. For expression motion, both blendshapes and vertices
are adopted as intermediate representations to maximize the representational
ability of the model. For natural head motion, a novel learnable head pose
codebook with a two-phase training mechanism is proposed. In the second stage,
we propose a dual-branch motion-VAE and a generator to transform the meshes
into dense motion and synthesize high-quality video frame by frame. Extensive
experiments show that the proposed VividTalk can generate high-visual quality
talking head videos with lip synchronization and realism enhanced by a large margin, and
outperforms previous state-of-the-art works in objective and subjective
comparisons.
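
The two-stage design above can be summarized as a rough sketch. The code below is an illustrative assumption of how such a pipeline could be wired in PyTorch, not the authors' released implementation; the module names (Audio2MeshStage, MotionVAE2Video), the soft codebook query, and all layer sizes are hypothetical stand-ins for the components described in the abstract (blendshape-plus-vertex expression motion, a learnable head pose codebook, and a dual-branch motion-VAE followed by a frame-by-frame generator).

```python
# Hypothetical sketch of the two-stage pipeline described in the abstract.
# Names, shapes, and wiring are illustrative assumptions, not the released code.
import torch
import torch.nn as nn

class Audio2MeshStage(nn.Module):
    """Stage 1 (sketch): audio -> non-rigid expression motion + rigid head motion."""
    def __init__(self, audio_dim=768, n_blendshapes=52, n_vertices=5023, n_poses=256):
        super().__init__()
        self.audio_encoder = nn.GRU(audio_dim, 256, batch_first=True)
        # Expression branch: blendshape coefficients and per-vertex offsets.
        self.blendshape_head = nn.Linear(256, n_blendshapes)
        self.vertex_head = nn.Linear(256, n_vertices * 3)
        # Head pose branch: a learnable codebook queried softly by audio features
        # (a rough stand-in for the paper's two-phase-trained pose codebook).
        self.pose_codebook = nn.Embedding(n_poses, 6)       # 3 rotations + 3 translations
        self.pose_query = nn.Linear(256, n_poses)

    def forward(self, audio_feats):                          # (B, T, audio_dim)
        h, _ = self.audio_encoder(audio_feats)               # (B, T, 256)
        blendshapes = self.blendshape_head(h)                # (B, T, n_blendshapes)
        vertex_offsets = self.vertex_head(h)                 # (B, T, n_vertices*3)
        weights = self.pose_query(h).softmax(dim=-1)         # (B, T, n_poses)
        head_pose = weights @ self.pose_codebook.weight      # (B, T, 6)
        return blendshapes, vertex_offsets, head_pose


class MotionVAE2Video(nn.Module):
    """Stage 2 (sketch): driven mesh + reference image -> coarse dense motion field."""
    def __init__(self, mesh_dim=5023 * 3, motion_dim=128):
        super().__init__()
        # "Dual branch" is approximated by a mesh encoder and an image encoder
        # whose features are fused before predicting a 2D motion field.
        self.mesh_branch = nn.Linear(mesh_dim, motion_dim)
        self.image_branch = nn.Conv2d(3, motion_dim, kernel_size=7, stride=4, padding=3)
        self.to_flow = nn.Conv2d(motion_dim, 2, kernel_size=3, padding=1)

    def forward(self, driven_mesh, reference_image):         # (B, mesh_dim), (B, 3, H, W)
        mesh_code = self.mesh_branch(driven_mesh)             # (B, motion_dim)
        image_feat = self.image_branch(reference_image)       # (B, motion_dim, H/4, W/4)
        fused = image_feat + mesh_code[:, :, None, None]
        return self.to_flow(fused)                            # coarse dense motion per frame
```

In this reading, the audio features query the pose codebook with a softmax over entries; that is only one plausible way to realize a learnable head pose codebook, chosen here for brevity.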
Related papers
- DAWN: Dynamic Frame Avatar with Non-autoregressive Diffusion Framework for Talking Head Video Generation [50.66658181705527]
We present DAWN, a framework that enables all-at-once generation of dynamic-length video sequences.
DAWN consists of two main components: (1) audio-driven holistic facial dynamics generation in the latent motion space, and (2) audio-driven head pose and blink generation.
Our method generates authentic and vivid videos with precise lip motions, and natural pose/blink movements.
arXiv Detail & Related papers (2024-10-17T16:32:36Z)
- Landmark-guided Diffusion Model for High-fidelity and Temporally Coherent Talking Head Generation [22.159117464397806]
We introduce a two-stage diffusion-based model for talking head generation.
The first stage involves generating synchronized facial landmarks based on the given speech.
In the second stage, these generated landmarks serve as a condition in the denoising process, aiming to optimize mouth jitter issues and generate high-fidelity, well-synchronized, and temporally coherent talking head videos.
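
As a hedged illustration of that second stage, the sketch below shows one generic way to condition a denoising network on speech-predicted landmarks; the concatenation-based conditioning and all names are assumptions for illustration, not the paper's architecture.

```python
# Hypothetical sketch: conditioning a denoiser on speech-predicted landmarks.
# The concatenation-based conditioning below is an assumption, not the paper's design.
import torch
import torch.nn as nn

class LandmarkConditionedDenoiser(nn.Module):
    def __init__(self, img_channels=3, n_landmarks=68, hidden=64):
        super().__init__()
        # Landmark coordinates (x, y per point) are embedded and broadcast over the image grid.
        self.landmark_embed = nn.Linear(n_landmarks * 2, hidden)
        self.net = nn.Sequential(
            nn.Conv2d(img_channels + hidden, hidden, 3, padding=1),
            nn.ReLU(),
            nn.Conv2d(hidden, img_channels, 3, padding=1),
        )

    def forward(self, noisy_frame, landmarks, t):
        # noisy_frame: (B, 3, H, W); landmarks: (B, n_landmarks*2); t: diffusion step (unused in this toy sketch)
        B, _, H, W = noisy_frame.shape
        cond = self.landmark_embed(landmarks)[:, :, None, None].expand(-1, -1, H, W)
        return self.net(torch.cat([noisy_frame, cond], dim=1))  # predicted noise
```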
arXiv Detail & Related papers (2024-08-03T10:19:38Z)
- FaceTalk: Audio-Driven Motion Diffusion for Neural Parametric Head Models [85.16273912625022]
We introduce FaceTalk, a novel generative approach designed for synthesizing high-fidelity 3D motion sequences of talking human heads from audio signal.
To the best of our knowledge, this is the first work to propose a generative approach for realistic and high-quality motion synthesis of human heads.
arXiv Detail & Related papers (2023-12-13T19:01:07Z)
- Speech2Lip: High-fidelity Speech to Lip Generation by Learning from a Short Video [91.92782707888618]
We present a decomposition-composition framework named Speech to Lip (Speech2Lip) that disentangles speech-sensitive and speech-insensitive motion/appearance.
We show that our model can be trained on a video just a few minutes long and achieve state-of-the-art performance in both visual quality and speech-visual synchronization.
arXiv Detail & Related papers (2023-09-09T14:52:39Z)
- A Comprehensive Multi-scale Approach for Speech and Dynamics Synchrony in Talking Head Generation [0.0]
We propose a multi-scale audio-visual synchrony loss and a multi-scale autoregressive GAN to better handle short and long-term correlation between speech and head motion.
Our generator operates in the facial landmark domain, which is a standard low-dimensional head representation.
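
A minimal sketch of what a multi-scale audio-visual synchrony loss can look like is given below; the window sizes and the cosine-similarity formulation are illustrative assumptions rather than the loss proposed in the paper.

```python
# Hypothetical multi-scale synchrony loss: compare audio and motion embeddings
# pooled over windows of several lengths. Window sizes are illustrative only.
import torch
import torch.nn.functional as F

def multiscale_sync_loss(audio_emb, motion_emb, scales=(5, 15, 45)):
    """audio_emb, motion_emb: (B, T, D) per-frame embeddings from two encoders."""
    loss = 0.0
    for w in scales:
        # Average-pool both streams over windows of w frames, then penalize
        # low cosine similarity between temporally aligned windows.
        a = F.avg_pool1d(audio_emb.transpose(1, 2), kernel_size=w, stride=w)
        m = F.avg_pool1d(motion_emb.transpose(1, 2), kernel_size=w, stride=w)
        loss = loss + (1.0 - F.cosine_similarity(a, m, dim=1)).mean()
    return loss / len(scales)
```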
arXiv Detail & Related papers (2023-07-04T08:29:59Z)
- High-Fidelity and Freely Controllable Talking Head Video Generation [31.08828907637289]
We propose a novel model that produces high-fidelity talking head videos with free control over head pose and expression.
We introduce a novel motion-aware multi-scale feature alignment module to effectively transfer the motion without face distortion.
We evaluate our model on challenging datasets and demonstrate its state-of-the-art performance.
arXiv Detail & Related papers (2023-04-20T09:02:41Z)
- Audio2Head: Audio-driven One-shot Talking-head Generation with Natural Head Motion [34.406907667904996]
We propose an audio-driven talking-head method to generate photo-realistic talking-head videos from a single reference image.
We first design a head pose predictor by modeling rigid 6D head movements with a motion-aware recurrent neural network (RNN).
Then, we develop a motion field generator to produce the dense motion fields from input audio, head poses, and a reference image.
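
A minimal sketch of an RNN-based rigid 6D head pose predictor in the spirit of this description follows; the GRU layer sizes, the audio-feature interface, and the residual-over-initial-pose formulation are assumptions for illustration.

```python
# Hypothetical sketch of an audio-driven rigid head pose predictor.
# Sizes and the audio-feature interface are assumptions for illustration.
import torch
import torch.nn as nn

class HeadPosePredictor(nn.Module):
    def __init__(self, audio_dim=80, hidden=256):
        super().__init__()
        self.rnn = nn.GRU(audio_dim, hidden, num_layers=2, batch_first=True)
        self.to_pose = nn.Linear(hidden, 6)   # 3 rotation angles + 3 translations

    def forward(self, audio_feats, init_pose=None):
        # audio_feats: (B, T, audio_dim), e.g. per-frame mel features
        h, _ = self.rnn(audio_feats)
        delta = self.to_pose(h)               # per-frame pose offsets, (B, T, 6)
        pose = delta if init_pose is None else init_pose[:, None, :] + delta
        return pose
```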
arXiv Detail & Related papers (2021-07-20T07:22:42Z)
- Pose-Controllable Talking Face Generation by Implicitly Modularized Audio-Visual Representation [96.66010515343106]
We propose a clean yet effective framework to generate pose-controllable talking faces.
We operate on raw face images, using only a single photo as an identity reference.
Our model has multiple advanced capabilities including extreme view robustness and talking face frontalization.
arXiv Detail & Related papers (2021-04-22T15:10:26Z)
- Talking-head Generation with Rhythmic Head Motion [46.6897675583319]
We propose a 3D-aware generative network with a hybrid embedding module and a non-linear composition module.
Our approach achieves controllable, photo-realistic, and temporally coherent talking-head videos with natural head movements.
arXiv Detail & Related papers (2020-07-16T18:13:40Z)
- Audio-driven Talking Face Video Generation with Learning-based Personalized Head Pose [67.31838207805573]
We propose a deep neural network model that takes an audio signal A of a source person and a short video V of a target person as input.
It outputs a synthesized high-quality talking face video with personalized head pose.
Our method can generate high-quality talking face videos with more distinguishing head movement effects than state-of-the-art methods.
arXiv Detail & Related papers (2020-02-24T10:02:10Z)