Related papers: DiffTalk: Crafting Diffusion Models for Generalized Audio-Driven Portraits Animation

URL: http://arxiv.org/abs/2301.03786v2
Date: Thu, 20 Apr 2023 08:51:11 GMT
Title: DiffTalk: Crafting Diffusion Models for Generalized Audio-Driven Portraits Animation
Authors: Shuai Shen, Wenliang Zhao, Zibin Meng, Wanhua Li, Zheng Zhu, Jie Zhou, Jiwen Lu
Abstract summary: We model the Talking head generation as an audio-driven temporally coherent denoising process (DiffTalk). In this paper, we investigate the control mechanism of the talking face, and incorporate reference face images and landmarks as conditions for personality-aware generalized synthesis. Our DiffTalk can be gracefully tailored for higher-resolution synthesis with negligible extra computational cost.
Score: 78.08004432704826
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Talking head synthesis is a promising approach for the video production industry. Recently, a lot of effort has been devoted in this research area to improve the generation quality or enhance the model generalization. However, there are few works able to address both issues simultaneously, which is essential for practical applications. To this end, in this paper, we turn attention to the emerging powerful Latent Diffusion Models, and model the Talking head generation as an audio-driven temporally coherent denoising process (DiffTalk). More specifically, instead of employing audio signals as the single driving factor, we investigate the control mechanism of the talking face, and incorporate reference face images and landmarks as conditions for personality-aware generalized synthesis. In this way, the proposed DiffTalk is capable of producing high-quality talking head videos in synchronization with the source audio, and more importantly, it can be naturally generalized across different identities without any further fine-tuning. Additionally, our DiffTalk can be gracefully tailored for higher-resolution synthesis with negligible extra computational cost. Extensive experiments show that the proposed DiffTalk efficiently synthesizes high-fidelity audio-driven talking head videos for generalized novel identities. For more video results, please refer to https://sstzal.github.io/DiffTalk/.

Related papers

ThinkSound: Chain-of-Thought Reasoning in Multimodal Large Language Models for Audio Generation and Editing [52.33281620699459]
ThinkSound is a novel framework that leverages Chain-of-Thought (CoT) reasoning to enable stepwise, interactive audio generation and editing for videos. Our approach decomposes the process into three complementary stages: semantically coherent, interactive object-centric refinement through precise user interactions, and targeted editing guided by natural language instructions. Experiments demonstrate that ThinkSound achieves state-of-the-art performance in video-to-audio generation across both audio metrics and CoT metrics.
arXiv Detail & Related papers (2025-06-26T16:32:06Z)
OmniTalker: Real-Time Text-Driven Talking Head Generation with In-Context Audio-Visual Style Replication [19.688375369516923]
We introduce an end-to-end unified framework that simultaneously generates synchronized speech and talking head videos from text and reference video in real-time zero-shot scenarios. Our method surpasses existing approaches in generation quality, particularly excelling in style preservation and audio-video synchronization.
arXiv Detail & Related papers (2025-04-03T09:48:13Z)
LaDTalk: Latent Denoising for Synthesizing Talking Head Videos with High Frequency Details [14.22392871407274]
We present an effective post-processing approach to synthesize photo-realistic talking head videos. Specifically, we employ a pretrained Wav2Lip model as our foundation model, leveraging its robust audio-lip alignment capabilities. Results indicate that our method, LaDTalk, achieves new state-of-the-art video quality and out-of-domain lip synchronization performance.
arXiv Detail & Related papers (2024-10-01T18:32:02Z)
KAN-Based Fusion of Dual-Domain for Audio-Driven Facial Landmarks Generation [8.111156834055821]
Reconstructing a talking face using audio significantly contributes to fields such as education, healthcare, online conversations, virtual assistants, and virtual reality. Recently, researchers have proposed a new approach of constructing the entire face, including face pose, neck, and shoulders. We propose the KFusion of Dual-Domain model, a robust model that generates landmarks from audio.
arXiv Detail & Related papers (2024-09-09T05:20:02Z)
High-fidelity and Lip-synced Talking Face Synthesis via Landmark-based Diffusion Model [89.29655924125461]
We propose a novel landmark-based diffusion model for talking face generation. We first establish the less ambiguous mapping from audio to landmark motion of lip and jaw. Then, we introduce an innovative conditioning module called TalkFormer to align the synthesized motion with the motion represented by landmarks.
arXiv Detail & Related papers (2024-08-10T02:58:28Z)
Seeing and Hearing: Open-domain Visual-Audio Generation with Diffusion Latent Aligners [69.70590867769408]
Video and audio content creation serves as the core technique for the movie industry and professional users. Existing diffusion-based methods tackle video and audio generation separately, which hinders the technique transfer from academia to industry. In this work, we aim at filling the gap, with a carefully designed optimization-based framework for cross-visual-audio and joint-visual-audio generation.
arXiv Detail & Related papers (2024-02-27T17:57:04Z)
AE-NeRF: Audio Enhanced Neural Radiance Field for Few Shot Talking Head Synthesis [42.203900183584665]
We present Audio Enhanced Neural Radiance Field (AE-NeRF) to generate realistic portraits of a new speaker with a few-shot dataset. AE-NeRF surpasses the state-of-the-art on image fidelity, audio-lip synchronization, and generalization ability, even with a limited training set or limited training iterations.
arXiv Detail & Related papers (2023-12-18T04:14:38Z)
Neural Text to Articulate Talk: Deep Text to Audiovisual Speech Synthesis achieving both Auditory and Photo-realism [26.180371869137257]
State of the art in talking face generation focuses mainly on lip-syncing, being conditioned on audio clips. NEUral Text to ARticulate Talk (NEUTART) is a talking face generator that uses a joint audiovisual feature space. Model produces photorealistic talking face videos with human-like articulation and well-synced audiovisual streams.
arXiv Detail & Related papers (2023-12-11T18:41:55Z)
DiffTalker: Co-driven audio-image diffusion for talking faces via intermediate landmarks [34.80705897511651]
We present DiffTalker, a novel model designed to generate lifelike talking faces through audio and landmark co-driving. Experiments showcase DiffTalker's superior performance in producing clear and geometrically accurate talking faces.
arXiv Detail & Related papers (2023-09-14T08:22:34Z)
ReliTalk: Relightable Talking Portrait Generation from a Single Video [62.47116237654984]
ReliTalk is a novel framework for relightable audio-driven talking portrait generation from monocular videos. Our key insight is to decompose the portrait's reflectance from implicitly learned audio-driven facial normals and images.
arXiv Detail & Related papers (2023-09-05T17:59:42Z)
DiffV2S: Diffusion-based Video-to-Speech Synthesis with Vision-guided Speaker Embedding [52.84475402151201]
We present a vision-guided speaker embedding extractor using a self-supervised pre-trained model and prompt tuning technique. We further develop a diffusion-based video-to-speech synthesis model, called DiffV2S, conditioned on those speaker embeddings and the visual representation extracted from the input video. Our experimental results show that DiffV2S achieves state-of-the-art performance compared to previous video-to-speech synthesis techniques.
arXiv Detail & Related papers (2023-08-15T14:07:41Z)
DAE-Talker: High Fidelity Speech-Driven Talking Face Generation with Diffusion Autoencoder [55.58582254514431]
We propose DAE-Talker to synthesize full video frames and produce natural head movements that align with the content of speech. We also introduce pose modelling in speech2latent for pose controllability. Our experiments show that DAE-Talker outperforms existing popular methods in lip-sync, video fidelity, and pose naturalness.
arXiv Detail & Related papers (2023-03-30T17:18:31Z)
Talking Head Generation with Probabilistic Audio-to-Visual Diffusion Priors [18.904856604045264]
We introduce a simple and novel framework for one-shot audio-driven talking head generation. We probabilistically sample all the holistic lip-irrelevant facial motions to semantically match the input audio. Thanks to the probabilistic nature of the diffusion prior, one big advantage of our framework is it can synthesize diverse facial motion sequences.
arXiv Detail & Related papers (2022-12-07T17:55:41Z)
AD-NeRF: Audio Driven Neural Radiance Fields for Talking Head Synthesis [55.24336227884039]
We present a novel framework to generate high-fidelity talking head video. We use neural scene representation networks to bridge the gap between audio input and video output. Our framework can (1) produce high-fidelity and natural results, and (2) support free adjustment of audio signals, viewing directions, and background images.
arXiv Detail & Related papers (2021-03-20T02:58:13Z)

This list is automatically generated from the titles and abstracts of the papers on this site.