Talking Head Generation with Probabilistic Audio-to-Visual Diffusion Priors
- URL: http://arxiv.org/abs/2212.04248v1
- Date: Wed, 7 Dec 2022 17:55:41 GMT
- Title: Talking Head Generation with Probabilistic Audio-to-Visual Diffusion Priors
- Authors: Zhentao Yu, Zixin Yin, Deyu Zhou, Duomin Wang, Finn Wong, Baoyuan Wang
- Abstract summary: We introduce a simple and novel framework for one-shot audio-driven talking head generation.
We probabilistically sample all the holistic lip-irrelevant facial motions to semantically match the input audio.
Thanks to the probabilistic nature of the diffusion prior, a key advantage of our framework is that it can synthesize diverse facial motion sequences.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we introduce a simple and novel framework for one-shot
audio-driven talking head generation. Unlike prior works that require
additional driving sources for controlled synthesis in a deterministic manner,
we instead probabilistically sample all the holistic lip-irrelevant facial
motions (i.e. pose, expression, blink, gaze, etc.) to semantically match the
input audio while still maintaining both the photo-realism of audio-lip
synchronization and the overall naturalness. This is achieved by our newly
proposed audio-to-visual diffusion prior trained on top of the mapping between
audio and disentangled non-lip facial representations. Thanks to the
probabilistic nature of the diffusion prior, one big advantage of our framework
is that it can synthesize diverse facial motion sequences given the same audio clip,
which is quite user-friendly for many real applications. Through comprehensive
evaluations on public benchmarks, we conclude that (1) our diffusion prior
significantly outperforms an auto-regressive prior on almost all evaluated
metrics; (2) our overall system is competitive with prior works in terms of
audio-lip synchronization while effectively sampling rich and natural-looking
lip-irrelevant facial motions that remain semantically harmonized with the
audio input.
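The sampling idea behind the diffusion prior can be illustrated with a toy sketch: starting from pure Gaussian noise, a reverse denoising process conditioned on an audio embedding is run to produce motion coefficients, so different random seeds yield different (diverse) motion samples for the same audio. This is a minimal stand-in, not the paper's implementation; the `denoise` function, the dimensions, and the noise schedule are all hypothetical placeholders for the trained audio-conditioned network.

```python
import numpy as np

def sample_motion(audio_emb, steps=50, dim=16, seed=None):
    """Toy DDPM-style reverse process over lip-irrelevant motion
    coefficients (e.g. pose/expression), conditioned on audio."""
    rng = np.random.default_rng(seed)
    # Linear noise schedule beta_t and cumulative products alpha_bar_t.
    betas = np.linspace(1e-4, 0.02, steps)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)

    def denoise(x, t, cond):
        # Hypothetical stand-in for the trained noise predictor
        # eps_theta(x_t, t, audio); a real system uses a neural network.
        return 0.1 * x + 0.01 * cond.mean()

    x = rng.standard_normal(dim)  # x_T ~ N(0, I): start from pure noise
    for t in reversed(range(steps)):
        eps = denoise(x, t, audio_emb)
        # Standard DDPM posterior-mean update.
        x = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
        if t > 0:
            # Stochastic step: this injected noise is what makes
            # repeated sampling yield diverse motion sequences.
            x = x + np.sqrt(betas[t]) * rng.standard_normal(dim)
    return x

audio = np.random.default_rng(0).standard_normal(32)
m1 = sample_motion(audio, seed=1)
m2 = sample_motion(audio, seed=2)
```

Because the reverse process is stochastic, `m1` and `m2` differ even though both are conditioned on the same audio embedding, mirroring the "diverse motions from the same clip" property the abstract describes.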
Related papers
- RealTalk: Real-time and Realistic Audio-driven Face Generation with 3D Facial Prior-guided Identity Alignment Network [63.77823518278202]
RealTalk combines an audio-to-expression transformer with a high-fidelity expression-to-face framework.
In the first component, we consider both identity and intra-personal variation features related to speaking lip movements.
In the second component, we design a lightweight facial identity alignment (FIA) module.
This novel design allows us to generate fine details in real-time, without depending on sophisticated and inefficient feature alignment modules.
arXiv Detail & Related papers (2024-06-26T12:09:59Z) - Hallo: Hierarchical Audio-Driven Visual Synthesis for Portrait Image Animation [29.87407471246318]
This research delves into the complexities of synchronizing facial movements and creating visually appealing, temporally consistent animations.
Our innovative approach embraces the end-to-end diffusion paradigm and introduces a hierarchical audio-driven visual synthesis module.
The proposed hierarchical audio-driven visual synthesis offers adaptive control over expression and pose diversity, enabling more effective personalization tailored to different identities.
arXiv Detail & Related papers (2024-06-13T04:33:20Z) - FaceChain-ImagineID: Freely Crafting High-Fidelity Diverse Talking Faces from Disentangled Audio [45.71036380866305]
We abstract the process by which people hear speech, extract meaningful cues, and create dynamically audio-consistent talking faces from a single audio clip.
Specifically, it involves two critical challenges: one is to effectively decouple identity, content, and emotion from entangled audio, and the other is to maintain intra-video diversity and inter-video consistency.
We introduce Controllable Coherent Frame generation, which flexibly integrates three trainable adapters with frozen Latent Diffusion Models.
arXiv Detail & Related papers (2023-01-10T05:11:25Z) - DiffTalk: Crafting Diffusion Models for Generalized Audio-Driven Portraits Animation [78.08004432704826]
We model talking head generation as an audio-driven, temporally coherent denoising process (DiffTalk).
In this paper, we investigate the control mechanism of the talking face, and incorporate reference face images and landmarks as conditions for personality-aware generalized synthesis.
Our DiffTalk can be gracefully tailored for higher-resolution synthesis with negligible extra computational cost.
arXiv Detail & Related papers (2022-08-23T12:49:01Z) - StyleTalker: One-shot Style-based Audio-driven Talking Head Video Generation [47.06075725469252]
StyleTalker is an audio-driven talking head generation model.
It can synthesize a video of a talking person from a single reference image.
Our model is able to synthesize talking head videos with impressive perceptual quality.
arXiv Detail & Related papers (2022-03-31T17:57:10Z) - Audio-Visual Speech Codecs: Rethinking Audio-Visual Speech Enhancement by Re-Synthesis [67.73554826428762]
We propose a novel audio-visual speech enhancement framework for high-fidelity telecommunications in AR/VR.
Our approach leverages audio-visual speech cues to generate the codes of a neural speech codec, enabling efficient synthesis of clean, realistic speech from noisy signals.
arXiv Detail & Related papers (2022-03-31T17:57:10Z) - Towards Realistic Visual Dubbing with Heterogeneous Sources [22.250010330418398]
Few-shot visual dubbing involves synchronizing the lip movements with arbitrary speech input for any talking head.
We propose a simple yet efficient two-stage framework with a higher flexibility of mining heterogeneous data.
Our method makes it possible to independently utilize the training corpus for two-stage sub-networks.
arXiv Detail & Related papers (2022-01-17T07:57:24Z) - Joint Learning of Visual-Audio Saliency Prediction and Sound Source Localization on Multi-face Videos [101.83513408195692]
We propose a multitask learning method for visual-audio saliency prediction and sound source localization on multi-face videos.
The proposed method outperforms 12 state-of-the-art saliency prediction methods, and achieves competitive results in sound source localization.
arXiv Detail & Related papers (2021-11-05T14:35:08Z) - MeshTalk: 3D Face Animation from Speech using Cross-Modality Disentanglement [142.9900055577252]
We propose a generic audio-driven facial animation approach that achieves highly realistic motion synthesis results for the entire face.
Our approach ensures highly accurate lip motion while also producing plausible animation of the parts of the face that are uncorrelated with the audio signal, such as eye blinks and eyebrow motion.
arXiv Detail & Related papers (2021-04-16T17:05:40Z) - VisualVoice: Audio-Visual Speech Separation with Cross-Modal Consistency [111.55430893354769]
Given a video, the goal is to extract the speech associated with a face in spite of simultaneous background sounds and/or other human speakers.
Our approach jointly learns audio-visual speech separation and cross-modal speaker embeddings from unlabeled video.
It yields state-of-the-art results on five benchmark datasets for audio-visual speech separation and enhancement.
arXiv Detail & Related papers (2021-01-08T18:25:24Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.