DisCoHead: Audio-and-Video-Driven Talking Head Generation by
Disentangled Control of Head Pose and Facial Expressions
- URL: http://arxiv.org/abs/2303.07697v1
- Date: Tue, 14 Mar 2023 08:22:18 GMT
- Title: DisCoHead: Audio-and-Video-Driven Talking Head Generation by
Disentangled Control of Head Pose and Facial Expressions
- Authors: Geumbyeol Hwang, Sunwon Hong, Seunghyun Lee, Sungwoo Park, Gyeongsu
Chae
- Abstract summary: DisCoHead is a novel method to disentangle and control head pose and facial expressions without supervision.
DisCoHead successfully generates realistic audio-and-video-driven talking heads and outperforms state-of-the-art methods.
- Score: 21.064765388027727
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: For realistic talking head generation, creating natural head motion while
maintaining accurate lip synchronization is essential. To fulfill this
challenging task, we propose DisCoHead, a novel method to disentangle and
control head pose and facial expressions without supervision. DisCoHead uses a
single geometric transformation as a bottleneck to isolate and extract head
motion from a head-driving video. Either an affine or a thin-plate spline
transformation can be used and both work well as geometric bottlenecks. We
enhance the efficiency of DisCoHead by integrating a dense motion estimator and
the encoder of a generator which are originally separate modules. Taking a step
further, we also propose a neural mix approach where dense motion is estimated
and applied implicitly by the encoder. After applying the disentangled head
motion to a source identity, DisCoHead controls the mouth region according to
speech audio, and it blinks eyes and moves eyebrows following a separate
driving video of the eye region, via the weight modulation of convolutional
neural networks. The experiments using multiple datasets show that DisCoHead
successfully generates realistic audio-and-video-driven talking heads and
outperforms state-of-the-art methods. Project page:
https://deepbrainai-research.github.io/discohead/
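Two mechanisms in the abstract lend themselves to short illustrations. First, the geometric bottleneck: a regressor compresses the driving frame into the handful of parameters of a single affine (or thin-plate spline) transform, so only global head motion, not fine expression detail, can pass through to warp the source. Below is a minimal PyTorch sketch of the affine variant under that reading; it is not the authors' released code, and names such as AffineBottleneck are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AffineBottleneck(nn.Module):
    """Hypothetical sketch: squeeze the driving frame down to a single
    2x3 affine transform, then warp the source features with it. The
    6-parameter transform is the information bottleneck that lets head
    motion through while blocking expression detail."""
    def __init__(self):
        super().__init__()
        # Tiny encoder; the paper's actual architecture is not specified here.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.to_theta = nn.Linear(64, 6)
        # Start at the identity transform so early training is stable.
        nn.init.zeros_(self.to_theta.weight)
        with torch.no_grad():
            self.to_theta.bias.copy_(torch.tensor([1., 0., 0., 0., 1., 0.]))

    def forward(self, driving_frame, source_features):
        # 6 numbers = one 2x3 affine matrix per sample: the bottleneck.
        theta = self.to_theta(self.encoder(driving_frame)).view(-1, 2, 3)
        grid = F.affine_grid(theta, source_features.shape, align_corners=False)
        # Warp source features by the head motion extracted from the driver.
        return F.grid_sample(source_features, grid, align_corners=False)
```

Second, the conditioning path: the mouth follows speech audio and the eyes follow a separate eye-region video via weight modulation of convolutional networks. One standard way to realize weight modulation is a StyleGAN2-style modulated convolution, sketched below with the same caveats; the demodulation step and the grouped-convolution trick are conventions of that technique, not details confirmed by the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModulatedConv2d(nn.Module):
    """Hypothetical sketch: a conditioning embedding rescales the conv
    kernels per sample (modulation), followed by the usual variance-
    normalizing demodulation."""
    def __init__(self, in_ch, out_ch, k, cond_dim):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_ch, in_ch, k, k) * 0.01)
        self.affine = nn.Linear(cond_dim, in_ch)  # condition -> per-channel scale
        self.pad = k // 2

    def forward(self, x, cond):
        b, in_ch, h, w = x.shape
        scale = self.affine(cond).view(b, 1, in_ch, 1, 1) + 1.0
        w_mod = self.weight.unsqueeze(0) * scale                 # modulate
        demod = torch.rsqrt(w_mod.pow(2).sum(dim=[2, 3, 4]) + 1e-8)
        w_mod = w_mod * demod.view(b, -1, 1, 1, 1)               # demodulate
        # Grouped-conv trick: fold batch into channels so each sample
        # gets its own modulated kernels in one conv call.
        x = x.view(1, b * in_ch, h, w)
        w_mod = w_mod.view(-1, in_ch, *w_mod.shape[3:])
        out = F.conv2d(x, w_mod, padding=self.pad, groups=b)
        return out.view(b, -1, h, w)
```

In use, an audio embedding would be passed as cond for mouth-region layers and an eye-video embedding for eye-region layers, mirroring the two separate driving signals the abstract describes.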
Related papers
- GaussianHeads: End-to-End Learning of Drivable Gaussian Head Avatars from Coarse-to-fine Representations [54.94362657501809]
We propose a new method to generate highly dynamic and deformable human head avatars from multi-view imagery in real time.
At the core of our method is a hierarchical representation of head models that captures the complex dynamics of facial expressions and head movements.
We train this coarse-to-fine facial avatar model along with the head pose as a learnable parameter in an end-to-end framework.
arXiv Detail & Related papers (2024-09-18T13:05:43Z)
- PoseTalk: Text-and-Audio-based Pose Control and Motion Refinement for One-Shot Talking Head Generation [17.158581488104186]
Previous audio-driven talking head generation (THG) methods generate head poses from driving audio.
We propose PoseTalk, a THG system that can freely generate lip-synchronized talking head videos with free head poses conditioned on text prompts and audio.
arXiv Detail & Related papers (2024-09-04T12:30:25Z)
- OSM-Net: One-to-Many One-shot Talking Head Generation with Spontaneous Head Motions [14.220727407255966]
One-shot talking head generation has no explicit head movement reference.
We propose OSM-Net, a one-to-many one-shot talking head generation network with natural head motions.
arXiv Detail & Related papers (2023-09-28T03:51:54Z)
- FONT: Flow-guided One-shot Talking Head Generation with Natural Head Motions [14.205344055665414]
The Flow-guided One-shot model achieves natural head motions in generated talking heads.
A head pose prediction module is designed to generate head pose sequences from the source face and driving audio.
arXiv Detail & Related papers (2023-03-31T03:25:06Z)
- HS-Diffusion: Semantic-Mixing Diffusion for Head Swapping [150.06405071177048]
We propose a semantic-mixing diffusion model for head swapping (HS-Diffusion).
We blend the semantic layouts of the source head and source body, then inpaint the transition region with a semantic layout generator.
We construct a new image-based head swapping benchmark and design two tailored metrics.
arXiv Detail & Related papers (2022-12-13T10:04:01Z)
- DFA-NeRF: Personalized Talking Head Generation via Disentangled Face Attributes Neural Rendering [69.9557427451339]
We propose a framework based on neural radiance fields to pursue high-fidelity talking head generation.
Specifically, the neural radiance field takes lip-movement features and personalized attributes as two disentangled conditions.
We show that our method achieves significantly better results than state-of-the-art methods.
arXiv Detail & Related papers (2022-01-03T18:23:38Z)
- Audio2Head: Audio-driven One-shot Talking-head Generation with Natural Head Motion [34.406907667904996]
We propose an audio-driven talking-head method to generate photo-realistic talking-head videos from a single reference image.
We first design a head pose predictor by modeling rigid 6D head movements with a motion-aware recurrent neural network (RNN); see the sketch after this entry.
Then, we develop a motion field generator to produce the dense motion fields from input audio, head poses, and a reference image.
arXiv Detail & Related papers (2021-07-20T07:22:42Z)
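As a rough illustration of the pose predictor described in the Audio2Head entry above, the sketch below maps per-frame audio features to rigid 6D head poses with an autoregressive GRU fed its own previous prediction; every name and dimension here is an assumption, not taken from the paper.

```python
import torch
import torch.nn as nn

class PoseRNN(nn.Module):
    """Hypothetical sketch: an autoregressive GRU predicts a rigid 6D head
    pose (3 rotations + 3 translations) per audio frame, conditioned on the
    previous pose so the motion stays temporally smooth."""
    def __init__(self, audio_dim=80, hidden=256):
        super().__init__()
        self.rnn = nn.GRU(audio_dim + 6, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 6)

    def forward(self, audio_feats, init_pose):
        # audio_feats: (B, T, audio_dim); init_pose: (B, 6) from the reference.
        h, prev, poses = None, init_pose, []
        for t in range(audio_feats.shape[1]):
            step = torch.cat([audio_feats[:, t], prev], dim=-1).unsqueeze(1)
            out, h = self.rnn(step, h)
            prev = self.head(out[:, 0])  # next 6D pose
            poses.append(prev)
        return torch.stack(poses, dim=1)  # (B, T, 6) head pose sequence
```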
- HeadGAN: One-shot Neural Head Synthesis and Editing [70.30831163311296]
HeadGAN is a system that conditions synthesis on 3D face representations adapted to the facial geometry of any reference image.
The 3D face representation enables HeadGAN to serve both as an efficient method for compression and reconstruction and as a tool for expression and pose editing.
arXiv Detail & Related papers (2020-12-15T12:51:32Z)
- Talking-head Generation with Rhythmic Head Motion [46.6897675583319]
We propose a 3D-aware generative network with a hybrid embedding module and a non-linear composition module.
Our approach achieves controllable, photo-realistic, and temporally coherent talking-head videos with natural head movements.
arXiv Detail & Related papers (2020-07-16T18:13:40Z)
- Audio-driven Talking Face Video Generation with Learning-based Personalized Head Pose [67.31838207805573]
We propose a deep neural network model that takes an audio signal A of a source person and a short video V of a target person as input.
It outputs a synthesized high-quality talking face video with personalized head pose.
Our method can generate high-quality talking face videos with more distinguishing head movement effects than state-of-the-art methods.
arXiv Detail & Related papers (2020-02-24T10:02:10Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.