DiffTalker: Co-driven audio-image diffusion for talking faces via
intermediate landmarks
- URL: http://arxiv.org/abs/2309.07509v1
- Date: Thu, 14 Sep 2023 08:22:34 GMT
- Title: DiffTalker: Co-driven audio-image diffusion for talking faces via
intermediate landmarks
- Authors: Zipeng Qi, Xulong Zhang, Ning Cheng, Jing Xiao, Jianzong Wang
- Abstract summary: We present DiffTalker, a novel model designed to generate lifelike talking faces through audio and landmark co-driving.
Experiments showcase DiffTalker's superior performance in producing clear and geometrically accurate talking faces.
- Score: 34.80705897511651
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Generating realistic talking faces is a complex and widely discussed task
with numerous applications. In this paper, we present DiffTalker, a novel model
designed to generate lifelike talking faces through audio and landmark
co-driving. DiffTalker addresses the challenges associated with directly
applying diffusion models to audio control, which are traditionally trained on
text-image pairs. DiffTalker consists of two agent networks: a
transformer-based landmarks completion network for geometric accuracy and a
diffusion-based face generation network for texture details. Landmarks play a
pivotal role in establishing a seamless connection between the audio and image
domains, facilitating the incorporation of knowledge from pre-trained diffusion
models. This innovative approach efficiently produces articulate-speaking
faces. Experimental results showcase DiffTalker's superior performance in
producing clear and geometrically accurate talking faces, all without the need
for additional alignment between audio and image features.
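
The abstract names two agent networks: a transformer-based landmark completion network and a landmark-conditioned diffusion face generator. The following is a minimal PyTorch sketch of that two-network layout; all class names, layer counts, and dimensions are illustrative assumptions, not the released model.

```python
import torch
import torch.nn as nn

# Hypothetical sizes; the paper does not publish these exact dimensions.
AUDIO_DIM, LMK_PTS, HID = 128, 68, 256

class LandmarkCompletion(nn.Module):
    """Transformer that completes a full landmark set from audio features
    plus reference-frame landmarks (a sketch, not the released network)."""
    def __init__(self):
        super().__init__()
        self.audio_proj = nn.Linear(AUDIO_DIM, HID)
        self.lmk_proj = nn.Linear(2, HID)           # (x, y) per landmark
        enc = nn.TransformerEncoderLayer(HID, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc, num_layers=2)
        self.head = nn.Linear(HID, 2)               # predicted (x, y)

    def forward(self, audio_feats, ref_lmks):
        # audio_feats: (B, T, AUDIO_DIM); ref_lmks: (B, LMK_PTS, 2)
        tokens = torch.cat([self.audio_proj(audio_feats),
                            self.lmk_proj(ref_lmks)], dim=1)
        out = self.encoder(tokens)
        return self.head(out[:, -LMK_PTS:])         # completed landmarks

class LandmarkConditionedDenoiser(nn.Module):
    """Toy stand-in for the diffusion face generator: one denoising step
    conditioned on the completed landmarks instead of on text."""
    def __init__(self, img_dim=64 * 64 * 3):
        super().__init__()
        self.lmk_enc = nn.Linear(LMK_PTS * 2, HID)
        self.net = nn.Sequential(
            nn.Linear(img_dim + HID + 1, HID), nn.SiLU(),
            nn.Linear(HID, img_dim))

    def forward(self, noisy_img, t, lmks):
        cond = self.lmk_enc(lmks.flatten(1))
        x = torch.cat([noisy_img, cond, t[:, None]], dim=1)
        return self.net(x)                          # predicted noise

audio = torch.randn(1, 10, AUDIO_DIM)
ref = torch.rand(1, LMK_PTS, 2)
lmks = LandmarkCompletion()(audio, ref)
noise_pred = LandmarkConditionedDenoiser()(
    torch.randn(1, 64 * 64 * 3), torch.rand(1), lmks)
print(lmks.shape, noise_pred.shape)                 # (1, 68, 2) (1, 12288)
```

The landmarks act as the bridge the abstract describes: the diffusion network never sees raw audio, only the geometry predicted from it, which is what lets a text-image pretrained diffusion backbone be reused without audio-image alignment.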
Related papers
- JEAN: Joint Expression and Audio-guided NeRF-based Talking Face Generation [24.2065254076207]
We introduce a novel method for joint expression and audio-guided talking face generation.
Our method can synthesize high-fidelity talking face videos, achieving state-of-the-art facial expression transfer.
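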
arXiv Detail & Related papers (2024-09-18T17:18:13Z)
- KAN-Based Fusion of Dual-Domain for Audio-Driven Facial Landmarks Generation [8.111156834055821]
Reconstructing a talking face using audio significantly contributes to fields such as education, healthcare, online conversations, virtual assistants, and virtual reality.
Recently, researchers have proposed a new approach that constructs the entire face, including the face pose, neck, and shoulders.
We propose the KFusion of Dual-Domain model, a robust model that generates landmarks from audio.
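
A rough sketch of the dual-domain idea, assuming two audio feature streams fused into one landmark sequence; plain GRU and linear layers stand in here for the paper's KAN blocks, and all dimensions are hypothetical:

```python
import torch
import torch.nn as nn

class DualDomainLandmarks(nn.Module):
    """Fuse two audio feature domains into per-frame landmarks (sketch)."""
    def __init__(self, dim_a=80, dim_b=128, hid=256, n_pts=68):
        super().__init__()
        self.n_pts = n_pts
        self.branch_a = nn.GRU(dim_a, hid, batch_first=True)  # e.g. mel features
        self.branch_b = nn.GRU(dim_b, hid, batch_first=True)  # e.g. SSL features
        self.fuse = nn.Linear(2 * hid, hid)
        self.head = nn.Linear(hid, n_pts * 2)

    def forward(self, feats_a, feats_b):
        ha, _ = self.branch_a(feats_a)                 # (B, T, hid)
        hb, _ = self.branch_b(feats_b)
        h = torch.tanh(self.fuse(torch.cat([ha, hb], dim=-1)))
        return self.head(h).unflatten(-1, (self.n_pts, 2))

lmks = DualDomainLandmarks()(torch.randn(1, 25, 80), torch.randn(1, 25, 128))
print(lmks.shape)  # torch.Size([1, 25, 68, 2])
```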
arXiv Detail & Related papers (2024-09-09T05:20:02Z)
- High-fidelity and Lip-synced Talking Face Synthesis via Landmark-based Diffusion Model [89.29655924125461]
We propose a novel landmark-based diffusion model for talking face generation.
We first establish the less ambiguous mapping from audio to the landmark motion of the lip and jaw.
Then, we introduce an innovative conditioning module called TalkFormer to align the synthesized motion with the motion represented by landmarks.
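
A minimal sketch of a TalkFormer-style conditioning step, assuming cross-attention from frame features to landmark-motion tokens; the dimensions and the use of a single attention layer are illustrative assumptions:

```python
import torch
import torch.nn as nn

class TalkFormerLike(nn.Module):
    """Frame features attend to landmark-motion tokens (sketch)."""
    def __init__(self, dim=256, n_heads=4):
        super().__init__()
        self.lmk_embed = nn.Linear(68 * 2, dim)    # one landmark frame -> token
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, frame_tokens, lmk_seq):
        # frame_tokens: (B, N, dim) latents of frames being denoised
        # lmk_seq:      (B, T, 68, 2) landmark motion predicted from audio
        lmk_tokens = self.lmk_embed(lmk_seq.flatten(2))
        aligned, _ = self.attn(frame_tokens, lmk_tokens, lmk_tokens)
        return self.norm(frame_tokens + aligned)   # residual conditioning

out = TalkFormerLike()(torch.randn(1, 16, 256), torch.rand(1, 25, 68, 2))
print(out.shape)  # torch.Size([1, 16, 256])
```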
arXiv Detail & Related papers (2024-08-10T02:58:28Z)
- RealTalk: Real-time and Realistic Audio-driven Face Generation with 3D Facial Prior-guided Identity Alignment Network [48.95833484103569]
RealTalk comprises an audio-to-expression transformer and a high-fidelity expression-to-face framework.
In the first component, we consider both identity and intra-personal variation features related to speaking lip movements.
In the second component, we design a lightweight facial identity alignment (FIA) module.
This novel design allows us to generate fine details in real-time, without depending on sophisticated and inefficient feature alignment modules.
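
The summary gives no internals for the FIA module, so the following is only a guess at what a "lightweight" identity alignment could look like: an audio-to-expression transformer plus a cheap scale-and-shift of face features by an identity embedding instead of heavy feature warping. All names and sizes are hypothetical.

```python
import torch
import torch.nn as nn

class AudioToExpression(nn.Module):
    """Sketch: transformer mapping audio to per-frame expression coeffs."""
    def __init__(self, a_dim=128, dim=256, n_exp=64):
        super().__init__()
        self.proj = nn.Linear(a_dim, dim)
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.enc = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(dim, n_exp)

    def forward(self, audio):                    # (B, T, a_dim)
        return self.head(self.enc(self.proj(audio)))

class FIALike(nn.Module):
    """Hypothetical lightweight alignment: modulate face features with an
    identity embedding via one learned scale/shift."""
    def __init__(self, feat=256, id_dim=512):
        super().__init__()
        self.to_scale_shift = nn.Linear(id_dim, 2 * feat)

    def forward(self, face_feats, id_embed):     # (B, feat), (B, id_dim)
        scale, shift = self.to_scale_shift(id_embed).chunk(2, dim=-1)
        return face_feats * (1 + scale) + shift

exp = AudioToExpression()(torch.randn(1, 25, 128))
out = FIALike()(torch.randn(1, 256), torch.randn(1, 512))
print(exp.shape, out.shape)  # (1, 25, 64) (1, 256)
```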
arXiv Detail & Related papers (2024-06-26T12:09:59Z)
- FaceChain-ImagineID: Freely Crafting High-Fidelity Diverse Talking Faces from Disentangled Audio [45.71036380866305]
We abstract the process of people hearing speech, extracting meaningful cues, and creating dynamically audio-consistent talking faces from a single audio clip.
Specifically, it involves two critical challenges: one is to effectively decouple identity, content, and emotion from entangled audio, and the other is to maintain intra-video diversity and inter-video consistency.
We introduce the Controllable Coherent Frame generation, which involves the flexible integration of three trainable adapters with frozen Latent Diffusion Models.
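
A minimal sketch of the general recipe named here: freeze a pretrained diffusion block and train only small residual adapters, one per decoupled cue (identity, content, emotion). The adapter design and sizes below are assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Small bottleneck adapter injected around a frozen denoiser block."""
    def __init__(self, dim=256, bottleneck=32):
        super().__init__()
        self.down, self.up = nn.Linear(dim, bottleneck), nn.Linear(bottleneck, dim)

    def forward(self, h, cond):
        return h + self.up(torch.relu(self.down(h + cond)))  # residual path

# Freeze the base model, train only the adapters -- the usual recipe when
# attaching adapters to a frozen Latent Diffusion Model.
base = nn.Linear(256, 256)                 # stand-in for one frozen LDM block
for p in base.parameters():
    p.requires_grad_(False)
adapters = nn.ModuleList(Adapter() for _ in range(3))  # id / content / emotion

h = torch.randn(1, 256)
conds = [torch.randn(1, 256) for _ in range(3)]        # decoupled audio cues
for ad, c in zip(adapters, conds):
    h = ad(base(h), c)
print(h.shape, sum(p.requires_grad for p in base.parameters()))  # (1, 256) 0
```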
arXiv Detail & Related papers (2024-03-04T09:59:48Z)
- SelfTalk: A Self-Supervised Commutative Training Diagram to Comprehend 3D Talking Faces [28.40393487247833]
Speech-driven 3D face animation techniques are extending their applications to various multimedia fields.
Previous research has produced promising, realistic lip movements and facial expressions from audio signals.
We propose a novel framework, SelfTalk, which involves self-supervision in a cross-modal network system to learn 3D talking faces.
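
One way to read a "commutative training diagram": text-like features recovered by a lip reader from generated faces should match features recovered directly from the audio, giving a supervision signal without labels. The toy modules below are stand-ins chosen to illustrate that loop, not SelfTalk's actual networks.

```python
import torch
import torch.nn as nn

# Two paths around the diagram should agree (sketch with toy modules).
animator = nn.Linear(128, 64)      # audio features -> 3D face motion
lip_reader = nn.Linear(64, 32)     # face motion -> text-like features
recognizer = nn.Linear(128, 32)    # audio features -> text-like features

audio = torch.randn(8, 128)
path_a = lip_reader(animator(audio))   # audio -> face -> "text"
path_b = recognizer(audio)             # audio -> "text"
consistency_loss = nn.functional.mse_loss(path_a, path_b)
consistency_loss.backward()            # self-supervised; no text labels
print(float(consistency_loss))
```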
arXiv Detail & Related papers (2023-06-19T09:39:10Z)
- Identity-Preserving Talking Face Generation with Landmark and Appearance Priors [106.79923577700345]
Existing person-generic methods have difficulty in generating realistic and lip-synced videos.
We propose a two-stage framework consisting of audio-to-landmark generation and landmark-to-video rendering procedures.
Our method can produce more realistic, lip-synced, and identity-preserving videos than existing person-generic talking face generation methods.
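
A compact sketch of the two-stage composition described here, audio-to-landmark followed by landmark-to-video; module names, the GRU/MLP choices, and all sizes are hypothetical (a real renderer would be convolutional and use the appearance priors):

```python
import torch
import torch.nn as nn

class AudioToLandmarks(nn.Module):
    """Stage 1 sketch: audio features -> lip landmark sequence."""
    def __init__(self, a_dim=80, hid=128, n_pts=20):
        super().__init__()
        self.rnn = nn.GRU(a_dim, hid, batch_first=True)
        self.head = nn.Linear(hid, n_pts * 2)
        self.n_pts = n_pts

    def forward(self, audio):
        h, _ = self.rnn(audio)
        return self.head(h).unflatten(-1, (self.n_pts, 2))

class LandmarksToFrame(nn.Module):
    """Stage 2 sketch: render a frame from landmarks plus a reference image."""
    def __init__(self, n_pts=20, img=64 * 64 * 3):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(n_pts * 2 + img, 512),
                                 nn.ReLU(), nn.Linear(512, img))

    def forward(self, lmks, ref_img):
        return self.net(torch.cat([lmks.flatten(1), ref_img], dim=1))

lmks = AudioToLandmarks()(torch.randn(1, 25, 80))      # (1, 25, 20, 2)
frame = LandmarksToFrame()(lmks[:, 0], torch.rand(1, 64 * 64 * 3))
print(lmks.shape, frame.shape)
```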
arXiv Detail & Related papers (2023-05-15T01:31:32Z)
- DiffTalk: Crafting Diffusion Models for Generalized Audio-Driven Portraits Animation [78.08004432704826]
We model talking head generation as an audio-driven, temporally coherent denoising process (DiffTalk).
In this paper, we investigate the control mechanism of the talking face, and incorporate reference face images and landmarks as conditions for personality-aware generalized synthesis.
Our DiffTalk can be gracefully tailored for higher-resolution synthesis with negligible extra computational cost.
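
A sketch of the conditioning pattern this summary describes: the denoiser receives the noisy latent together with audio, reference-face, and landmark conditions. The single conv layer and channel counts are illustrative stand-ins for a latent-space UNet.

```python
import torch
import torch.nn as nn

class ConditionedDenoiser(nn.Module):
    """One denoising step with audio / reference / landmark conditions."""
    def __init__(self, z=4, a=64, r=4, l=2):
        super().__init__()
        self.net = nn.Conv2d(z + a + r + l, z, kernel_size=3, padding=1)

    def forward(self, z_t, audio_map, ref_map, lmk_map):
        return self.net(torch.cat([z_t, audio_map, ref_map, lmk_map], dim=1))

eps = ConditionedDenoiser()(
    torch.randn(1, 4, 32, 32), torch.randn(1, 64, 32, 32),
    torch.randn(1, 4, 32, 32), torch.randn(1, 2, 32, 32))
print(eps.shape)  # torch.Size([1, 4, 32, 32])
```

Because the conditions enter as extra channels rather than a fixed output head, the same denoiser can be run at a higher latent resolution, which is consistent with the summary's claim of cheap higher-resolution synthesis.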
arXiv Detail & Related papers (2023-01-10T05:11:25Z)
- DialogueNeRF: Towards Realistic Avatar Face-to-Face Conversation Video Generation [54.84137342837465]
Face-to-face conversations account for the vast majority of daily conversations.
Most existing methods have focused on single-person talking head generation.
We propose a novel unified framework based on neural radiance fields (NeRF).
arXiv Detail & Related papers (2022-03-15T14:16:49Z)