DiffTalk: Crafting Diffusion Models for Generalized Audio-Driven
Portraits Animation
- URL: http://arxiv.org/abs/2301.03786v2
- Date: Thu, 20 Apr 2023 08:51:11 GMT
- Title: DiffTalk: Crafting Diffusion Models for Generalized Audio-Driven
Portraits Animation
- Authors: Shuai Shen, Wenliang Zhao, Zibin Meng, Wanhua Li, Zheng Zhu, Jie Zhou,
Jiwen Lu
- Abstract summary: We model talking head generation as an audio-driven, temporally coherent denoising process (DiffTalk).
In this paper, we investigate the control mechanism of the talking face, and incorporate reference face images and landmarks as conditions for personality-aware generalized synthesis.
Our DiffTalk can be gracefully tailored for higher-resolution synthesis with negligible extra computational cost.
- Score: 78.08004432704826
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Talking head synthesis is a promising approach for the video production
industry. Recently, much effort has been devoted in this research area to improving generation quality or enhancing model generalization. However,
there are few works able to address both issues simultaneously, which is
essential for practical applications. To this end, in this paper, we turn our attention to the emerging and powerful Latent Diffusion Models, and model talking head generation as an audio-driven, temporally coherent denoising process (DiffTalk). More specifically, instead of employing audio signals as
the single driving factor, we investigate the control mechanism of the talking
face, and incorporate reference face images and landmarks as conditions for
personality-aware generalized synthesis. In this way, the proposed DiffTalk is
capable of producing high-quality talking head videos in synchronization with
the source audio, and more importantly, it can be naturally generalized across
different identities without any further fine-tuning. Additionally, our
DiffTalk can be gracefully tailored for higher-resolution synthesis with
negligible extra computational cost. Extensive experiments show that the
proposed DiffTalk efficiently synthesizes high-fidelity audio-driven talking
head videos for generalized novel identities. For more video results, please
refer to \url{https://sstzal.github.io/DiffTalk/}.
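The abstract describes DiffTalk as a latent diffusion model whose reverse (denoising) steps are conditioned jointly on the driving audio, a reference face image of the target identity, and facial landmarks. The sketch below is a minimal illustration of that conditioning pattern only, not the authors' implementation: the module, feature dimensions, and noise schedule are illustrative assumptions, and all conditions are flattened into vectors purely to keep the example short.

```python
# Minimal sketch (illustrative, not the DiffTalk code) of an audio/landmark/reference-
# conditioned denoising step in latent space. All names and dimensions are assumptions.
import torch
import torch.nn as nn

class ConditionalDenoiser(nn.Module):
    """Toy stand-in for a conditional denoising network epsilon_theta(z_t, t, conditions)."""
    def __init__(self, latent_dim=64, audio_dim=32, lm_dim=136, ref_dim=64, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim + audio_dim + lm_dim + ref_dim + 1, hidden),
            nn.SiLU(),
            nn.Linear(hidden, latent_dim),  # predicts the noise added to z_t
        )

    def forward(self, z_t, t, audio, landmarks, reference):
        # Concatenate the noisy latent with all conditions and the timestep.
        cond = torch.cat([z_t, audio, landmarks, reference, t.float().unsqueeze(-1)], dim=-1)
        return self.net(cond)

@torch.no_grad()
def ddpm_step(model, z_t, t, audio, landmarks, reference, betas):
    """One reverse-diffusion step z_t -> z_{t-1}, driven by audio, landmarks and reference."""
    beta_t = betas[t]
    alpha_t = 1.0 - beta_t
    alpha_bar_t = torch.cumprod(1.0 - betas, dim=0)[t]
    eps = model(z_t, torch.full((z_t.shape[0],), t), audio, landmarks, reference)
    mean = (z_t - beta_t / torch.sqrt(1.0 - alpha_bar_t) * eps) / torch.sqrt(alpha_t)
    noise = torch.randn_like(z_t) if t > 0 else torch.zeros_like(z_t)
    return mean + torch.sqrt(beta_t) * noise

if __name__ == "__main__":
    torch.manual_seed(0)
    B, T_steps = 2, 50
    betas = torch.linspace(1e-4, 0.02, T_steps)
    model = ConditionalDenoiser()
    z = torch.randn(B, 64)           # start from pure noise in latent space
    audio = torch.randn(B, 32)       # audio features for the current window
    landmarks = torch.randn(B, 136)  # 68 x/y facial landmarks, flattened
    reference = torch.randn(B, 64)   # encoded reference frame of the target identity
    for t in reversed(range(T_steps)):
        z = ddpm_step(model, z, t, audio, landmarks, reference, betas)
    print("denoised latent:", z.shape)  # a real system would decode this to an image
```

In an actual latent diffusion setup, z would be an image latent from a pretrained autoencoder and the denoiser a UNet attending to the conditions; keeping the process in latent space is also what makes higher-resolution synthesis cheap, as the abstract notes.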
Related papers
- LaDTalk: Latent Denoising for Synthesizing Talking Head Videos with High Frequency Details [14.22392871407274]
We present an effective post-processing approach to synthesize photo-realistic talking head videos.
Specifically, we employ a pretrained Wav2Lip model as our foundation model, leveraging its robust audio-lip alignment capabilities.
Results indicate that our method, LaDTalk, achieves new state-of-the-art video quality and out-of-domain lip synchronization performance.
arXiv Detail & Related papers (2024-10-01T18:32:02Z)
- High-fidelity and Lip-synced Talking Face Synthesis via Landmark-based Diffusion Model [89.29655924125461]
We propose a novel landmark-based diffusion model for talking face generation.
We first establish the less ambiguous mapping from audio to the landmark motion of the lips and jaw.
Then, we introduce an innovative conditioning module called TalkFormer to align the synthesized motion with the motion represented by landmarks.
arXiv Detail & Related papers (2024-08-10T02:58:28Z)
- Seeing and Hearing: Open-domain Visual-Audio Generation with Diffusion Latent Aligners [69.70590867769408]
Video and audio content creation is a core technique for the movie industry and professional users.
Existing diffusion-based methods tackle video and audio generation separately, which hinders the transfer of these techniques from academia to industry.
In this work, we aim to fill this gap with a carefully designed optimization-based framework for cross-visual-audio and joint-visual-audio generation.
arXiv Detail & Related papers (2024-02-27T17:57:04Z)
- AE-NeRF: Audio Enhanced Neural Radiance Field for Few Shot Talking Head Synthesis [42.203900183584665]
We present Audio Enhanced Neural Radiance Field (AE-NeRF) to generate realistic portraits of a new speaker from a few-shot dataset.
AE-NeRF surpasses the state of the art in image fidelity, audio-lip synchronization, and generalization ability, even with a limited training set or few training iterations.
arXiv Detail & Related papers (2023-12-18T04:14:38Z)
- Neural Text to Articulate Talk: Deep Text to Audiovisual Speech Synthesis achieving both Auditory and Photo-realism [26.180371869137257]
The state of the art in talking face generation focuses mainly on lip-syncing, conditioned on audio clips.
NEUral Text to ARticulate Talk (NEUTART) is a talking face generator that uses a joint audiovisual feature space.
The model produces photorealistic talking face videos with human-like articulation and well-synced audiovisual streams.
arXiv Detail & Related papers (2023-12-11T18:41:55Z)
- DiffTalker: Co-driven audio-image diffusion for talking faces via intermediate landmarks [34.80705897511651]
We present DiffTalker, a novel model designed to generate lifelike talking faces through audio and landmark co-driving.
Experiments showcase DiffTalker's superior performance in producing clear and geometrically accurate talking faces.
arXiv Detail & Related papers (2023-09-14T08:22:34Z)
- ReliTalk: Relightable Talking Portrait Generation from a Single Video [62.47116237654984]
ReliTalk is a novel framework for relightable audio-driven talking portrait generation from monocular videos.
Our key insight is to decompose the portrait's reflectance from implicitly learned audio-driven facial normals and images.
arXiv Detail & Related papers (2023-09-05T17:59:42Z)
- DiffV2S: Diffusion-based Video-to-Speech Synthesis with Vision-guided Speaker Embedding [52.84475402151201]
We present a vision-guided speaker embedding extractor using a self-supervised pre-trained model and a prompt tuning technique.
We further develop a diffusion-based video-to-speech synthesis model, called DiffV2S, conditioned on those speaker embeddings and the visual representation extracted from the input video.
Our experimental results show that DiffV2S achieves state-of-the-art performance compared to previous video-to-speech synthesis techniques.
arXiv Detail & Related papers (2023-08-15T14:07:41Z)
- Talking Head Generation with Probabilistic Audio-to-Visual Diffusion Priors [18.904856604045264]
We introduce a simple and novel framework for one-shot audio-driven talking head generation.
We probabilistically sample holistic, lip-irrelevant facial motions that semantically match the input audio.
Thanks to the probabilistic nature of the diffusion prior, a key advantage of our framework is that it can synthesize diverse facial motion sequences.
arXiv Detail & Related papers (2022-12-07T17:55:41Z)
- AD-NeRF: Audio Driven Neural Radiance Fields for Talking Head Synthesis [55.24336227884039]
We present a novel framework to generate high-fidelity talking head video.
We use neural scene representation networks to bridge the gap between audio input and video output.
Our framework can (1) produce high-fidelity and natural results, and (2) support free adjustment of audio signals, viewing directions, and background images.
arXiv Detail & Related papers (2021-03-20T02:58:13Z)