Related papers: DiffTalker: Co-driven audio-image diffusion for talking faces via intermediate landmarks

URL: http://arxiv.org/abs/2309.07509v1
Date: Thu, 14 Sep 2023 08:22:34 GMT
Title: DiffTalker: Co-driven audio-image diffusion for talking faces via intermediate landmarks
Authors: Zipeng Qi, Xulong Zhang, Ning Cheng, Jing Xiao, Jianzong Wang
Abstract summary: We present DiffTalker, a novel model designed to generate lifelike talking faces through audio and landmark co-driving. Experiments showcase DiffTalker's superior performance in producing clear and geometrically accurate talking faces.
Score: 34.80705897511651
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Generating realistic talking faces is a complex and widely discussed task with numerous applications. In this paper, we present DiffTalker, a novel model designed to generate lifelike talking faces through audio and landmark co-driving. DiffTalker addresses the challenges associated with directly applying diffusion models to audio control, which are traditionally trained on text-image pairs. DiffTalker consists of two agent networks: a transformer-based landmarks completion network for geometric accuracy and a diffusion-based face generation network for texture details. Landmarks play a pivotal role in establishing a seamless connection between the audio and image domains, facilitating the incorporation of knowledge from pre-trained diffusion models. This innovative approach efficiently produces articulate-speaking faces. Experimental results showcase DiffTalker's superior performance in producing clear and geometrically accurate talking faces, all without the need for additional alignment between audio and image features.

Related papers

Text2Lip: Progressive Lip-Synced Talking Face Generation from Text via Viseme-Guided Rendering [53.2204901422631]
Text2Lip is a viseme-centric framework that constructs an interpretable phonetic-visual bridge. We show that Text2Lip outperforms existing approaches in semantic fidelity, visual realism, and modality robustness.
arXiv Detail & Related papers (2025-08-04T12:50:22Z)
Shushing! Let's Imagine an Authentic Speech from the Silent Video [15.426152742881365]
Vision-guided speech generation aims to produce authentic speech from facial appearance or lip motions without relying on auditory signals. Despite recent progress, existing methods struggle to achieve unified cross-modal alignment across semantics, timbre, and emotional prosody from visual cues. We introduce ImaginTalk, a novel cross-modal diffusion framework that generates faithful speech using only visual input.
arXiv Detail & Related papers (2025-03-19T06:28:17Z)
JEAN: Joint Expression and Audio-guided NeRF-based Talking Face Generation [24.2065254076207]
We introduce a novel method for joint expression and audio-guided talking face generation. Our method can synthesize high-fidelity talking face videos, achieving state-of-the-art facial expression transfer.
arXiv Detail & Related papers (2024-09-18T17:18:13Z)
KAN-Based Fusion of Dual-Domain for Audio-Driven Facial Landmarks Generation [8.111156834055821]
Reconstructing a talking face using audio significantly contributes to fields such as education, healthcare, online conversations, virtual assistants, and virtual reality. Recently, researchers have proposed a new approach of constructing the entire face, including face pose, neck, and shoulders. We propose the KFusion of Dual-Domain model, a robust model that generates landmarks from audio.
arXiv Detail & Related papers (2024-09-09T05:20:02Z)
High-fidelity and Lip-synced Talking Face Synthesis via Landmark-based Diffusion Model [89.29655924125461]
We propose a novel landmark-based diffusion model for talking face generation. We first establish the less ambiguous mapping from audio to landmark motion of lip and jaw. Then, we introduce an innovative conditioning module called TalkFormer to align the synthesized motion with the motion represented by landmarks.
arXiv Detail & Related papers (2024-08-10T02:58:28Z)
RealTalk: Real-time and Realistic Audio-driven Face Generation with 3D Facial Prior-guided Identity Alignment Network [48.95833484103569]
RealTalk is an audio-to-expression transformer and a high-fidelity expression-to-face framework. In the first component, we consider both identity and intra-personal variation features related to speaking lip movements. In the second component, we design a lightweight facial identity alignment (FIA) module. This novel design allows us to generate fine details in real-time, without depending on sophisticated and inefficient feature alignment modules.
arXiv Detail & Related papers (2024-06-26T12:09:59Z)
FaceChain-ImagineID: Freely Crafting High-Fidelity Diverse Talking Faces from Disentangled Audio [45.71036380866305]
We abstract the process of people hearing speech, extracting meaningful cues, and creating dynamically audio-consistent talking faces from a single audio. Specifically, it involves two critical challenges: one is to effectively decouple identity, content, and emotion from entangled audio, and the other is to maintain intra-video diversity and inter-video consistency. We introduce the Controllable Coherent Frame generation, which involves the flexible integration of three trainable adapters with frozen Latent Diffusion Models.
arXiv Detail & Related papers (2024-03-04T09:59:48Z)
SelfTalk: A Self-Supervised Commutative Training Diagram to Comprehend 3D Talking Faces [28.40393487247833]
Speech-driven 3D face animation techniques are extending their applications to various multimedia fields. Previous research has generated promising realistic lip movements and facial expressions from audio signals. We propose a novel framework, SelfTalk, which involves self-supervision in a cross-modal network system to learn 3D talking faces.
arXiv Detail & Related papers (2023-06-19T09:39:10Z)
Identity-Preserving Talking Face Generation with Landmark and Appearance Priors [106.79923577700345]
Existing person-generic methods have difficulty in generating realistic and lip-synced videos. We propose a two-stage framework consisting of audio-to-landmark generation and landmark-to-video rendering procedures. Our method can produce more realistic, lip-synced, and identity-preserving videos than existing person-generic talking face generation methods.
arXiv Detail & Related papers (2023-05-15T01:31:32Z)
DiffTalk: Crafting Diffusion Models for Generalized Audio-Driven Portraits Animation [78.08004432704826]
We model talking head generation as an audio-driven, temporally coherent denoising process (DiffTalk). In this paper, we investigate the control mechanism of the talking face, and incorporate reference face images and landmarks as conditions for personality-aware generalized synthesis. Our DiffTalk can be gracefully tailored for higher-resolution synthesis with negligible extra computational cost.
arXiv Detail & Related papers (2023-01-10T05:11:25Z)
DialogueNeRF: Towards Realistic Avatar Face-to-Face Conversation Video Generation [54.84137342837465]
Face-to-face conversations account for the vast majority of daily conversations. Most existing methods focused on single-person talking head generation. We propose a novel unified framework based on neural radiance field (NeRF).
arXiv Detail & Related papers (2022-03-15T14:16:49Z)

This list is automatically generated from the titles and abstracts of the papers on this site.