FONT: Flow-guided One-shot Talking Head Generation with Natural Head
Motions
- URL: http://arxiv.org/abs/2303.17789v1
- Date: Fri, 31 Mar 2023 03:25:06 GMT
- Title: FONT: Flow-guided One-shot Talking Head Generation with Natural Head
Motions
- Authors: Jin Liu, Xi Wang, Xiaomeng Fu, Yesheng Chai, Cai Yu, Jiao Dai, Jizhong
Han
- Abstract summary: The Flow-guided One-shot model (FONT) generates talking heads with natural head motions.
A head pose prediction module generates head pose sequences from the source face and driving audio.
- Score: 14.205344055665414
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: One-shot talking head generation has received growing attention in recent
years, with various creative and practical applications. An ideal generated
talking head video should be natural and vivid, containing natural head pose
changes. However, it is challenging to infer head pose sequences from driving
audio, since there is a natural gap between the audio and visual modalities. In
this work, we propose a Flow-guided One-shot model that achieves NaTural head
motions (FONT) over generated talking heads. Specifically, the head pose
prediction module is designed to generate head pose sequences from the source
face and driving audio. We add a random sampling operation and a structural
similarity constraint to model the diversity of the one-to-many mapping between
the audio and visual modalities, thus predicting natural head poses. Then we
develop a
keypoint predictor that produces unsupervised keypoints from the source face,
driving audio and pose sequences to describe the facial structure information.
Finally, a flow-guided occlusion-aware generator is employed to produce
photo-realistic talking head videos from the estimated keypoints and source
face. Extensive experimental results prove that FONT generates talking heads
with natural head poses and synchronized mouth shapes, outperforming other
compared methods.
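To make the pipeline concrete, below is a minimal PyTorch sketch of how the three stages described in the abstract could be wired together. All class names, feature dimensions, and layer choices (GRU, MLP, 15 keypoints) are illustrative assumptions, not the paper's actual architecture.

```python
# A minimal sketch of the three-stage pipeline the abstract describes:
# (1) pose prediction from audio + source face, (2) keypoint prediction,
# (3) flow-guided, occlusion-aware warping. All sizes are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class HeadPosePredictor(nn.Module):
    """Maps per-frame audio features and a source-face embedding to a 6D head
    pose sequence. A random noise vector is concatenated at every step so the
    one-to-many audio-to-pose mapping does not collapse to a single mean pose."""
    def __init__(self, audio_dim=128, face_dim=128, noise_dim=16, pose_dim=6):
        super().__init__()
        self.noise_dim = noise_dim
        self.rnn = nn.GRU(audio_dim + face_dim + noise_dim, 256, batch_first=True)
        self.head = nn.Linear(256, pose_dim)

    def forward(self, audio_feats, face_emb):
        # audio_feats: (B, T, audio_dim); face_emb: (B, face_dim)
        B, T, _ = audio_feats.shape
        noise = torch.randn(B, T, self.noise_dim, device=audio_feats.device)
        face = face_emb.unsqueeze(1).expand(B, T, -1)
        hidden, _ = self.rnn(torch.cat([audio_feats, face, noise], dim=-1))
        return self.head(hidden)  # (B, T, 6): rotation + translation per frame

class KeypointPredictor(nn.Module):
    """Predicts K unsupervised 2D keypoints per frame from the source face,
    audio features, and the predicted pose sequence."""
    def __init__(self, audio_dim=128, face_dim=128, pose_dim=6, num_kp=15):
        super().__init__()
        self.num_kp = num_kp
        self.mlp = nn.Sequential(
            nn.Linear(audio_dim + face_dim + pose_dim, 256), nn.ReLU(),
            nn.Linear(256, num_kp * 2))

    def forward(self, audio_feats, face_emb, poses):
        B, T, _ = audio_feats.shape
        face = face_emb.unsqueeze(1).expand(B, T, -1)
        x = torch.cat([audio_feats, face, poses], dim=-1)
        return self.mlp(x).view(B, T, self.num_kp, 2)  # normalized coordinates

def warp_with_occlusion(source, grid, occlusion):
    """Core of a flow-guided occlusion-aware generator: warp the source image
    with a dense sampling grid, then mask out regions invisible in the source,
    which an inpainting branch would fill in.
    source: (B, 3, H, W); grid: (B, H, W, 2) in [-1, 1]; occlusion: (B, 1, H, W)."""
    warped = F.grid_sample(source, grid, align_corners=True)
    return warped * occlusion

# Hypothetical shapes for a 2-second clip at 25 fps:
audio = torch.randn(1, 50, 128)                      # per-frame audio features
face = torch.randn(1, 128)                           # source-face identity embedding
poses = HeadPosePredictor()(audio, face)             # (1, 50, 6)
keypoints = KeypointPredictor()(audio, face, poses)  # (1, 50, 15, 2)
```

During training, the structural similarity constraint mentioned above would act as a loss on the predicted pose sequences, and the dense sampling grid fed to warp_with_occlusion would be estimated from keypoint displacements between the source face and each driven frame, in the spirit of first-order-motion-style generators.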
Related papers
- PoseTalk: Text-and-Audio-based Pose Control and Motion Refinement for One-Shot Talking Head Generation [17.158581488104186]
Previous audio-driven talking head generation (THG) methods generate head poses from driving audio.
We propose PoseTalk, a THG system that can freely generate lip-synchronized talking head videos with free head poses conditioned on text prompts and audio.
arXiv Detail & Related papers (2024-09-04T12:30:25Z) - OSM-Net: One-to-Many One-shot Talking Head Generation with Spontaneous
Head Motions [14.220727407255966]
One-shot talking head generation has no explicit head movement reference.
We propose OSM-Net, a one-to-many one-shot talking head generation network with natural head motions.
arXiv Detail & Related papers (2023-09-28T03:51:54Z) - DisCoHead: Audio-and-Video-Driven Talking Head Generation by
Disentangled Control of Head Pose and Facial Expressions [21.064765388027727]
DisCoHead is a novel method to disentangle and control head pose and facial expressions without supervision.
DisCoHead successfully generates realistic audio-and-video-driven talking heads and outperforms state-of-the-art methods.
arXiv Detail & Related papers (2023-03-14T08:22:18Z) - OPT: One-shot Pose-Controllable Talking Head Generation [14.205344055665414]
One-shot talking head generation produces lip-synced talking heads based on arbitrary audio and one source face.
We present the One-shot Pose-controllable Talking head generation network (OPT).
OPT generates high-quality pose-controllable talking heads with no identity mismatch problem, outperforming previous SOTA methods.
arXiv Detail & Related papers (2023-02-16T10:26:52Z) - GeneFace: Generalized and High-Fidelity Audio-Driven 3D Talking Face
Synthesis [62.297513028116576]
GeneFace is a general and high-fidelity NeRF-based talking face generation method.
A head-aware torso-NeRF is proposed to eliminate the head-torso separation problem.
arXiv Detail & Related papers (2023-01-31T05:56:06Z) - Diffused Heads: Diffusion Models Beat GANs on Talking-Face Generation [54.68893964373141]
Talking face generation has historically struggled to produce head movements and natural facial expressions without guidance from additional reference videos.
Recent developments in diffusion-based generative models allow for more realistic and stable data synthesis.
We present an autoregressive diffusion model that requires only one identity image and audio sequence to generate a video of a realistic talking human head.
arXiv Detail & Related papers (2023-01-06T14:16:54Z) - DialogueNeRF: Towards Realistic Avatar Face-to-Face Conversation Video
Generation [54.84137342837465]
Face-to-face conversations account for the vast majority of daily conversations.
Most existing methods have focused on single-person talking head generation.
We propose a novel unified framework based on the neural radiance field (NeRF).
arXiv Detail & Related papers (2022-03-15T14:16:49Z) - DFA-NeRF: Personalized Talking Head Generation via Disentangled Face
Attributes Neural Rendering [69.9557427451339]
We propose a framework based on neural radiance field to pursue high-fidelity talking head generation.
Specifically, the neural radiance field takes lip-movement features and personalized attributes as two disentangled conditions.
We show that our method achieves significantly better results than state-of-the-art methods.
arXiv Detail & Related papers (2022-01-03T18:23:38Z) - Audio2Head: Audio-driven One-shot Talking-head Generation with Natural
Head Motion [34.406907667904996]
We propose an audio-driven talking-head method to generate photo-realistic talking-head videos from a single reference image.
We first design a head pose predictor by modeling rigid 6D head movements with a motion-aware recurrent neural network (RNN).
Then, we develop a motion field generator to produce the dense motion fields from input audio, head poses, and a reference image.
arXiv Detail & Related papers (2021-07-20T07:22:42Z) - Pose-Controllable Talking Face Generation by Implicitly Modularized
Audio-Visual Representation [96.66010515343106]
We propose a clean yet effective framework to generate pose-controllable talking faces.
We operate on raw face images, using only a single photo as an identity reference.
Our model has multiple advanced capabilities including extreme view robustness and talking face frontalization.
arXiv Detail & Related papers (2021-04-22T15:10:26Z) - Audio-driven Talking Face Video Generation with Learning-based
Personalized Head Pose [67.31838207805573]
We propose a deep neural network model that takes an audio signal A of a source person and a short video V of a target person as input,
and outputs a synthesized high-quality talking face video with personalized head pose.
Our method can generate high-quality talking face videos with more distinguishing head movement effects than state-of-the-art methods.
arXiv Detail & Related papers (2020-02-24T10:02:10Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information and is not responsible for any consequences arising from its use.