DiffTalker: Co-driven audio-image diffusion for talking faces via
intermediate landmarks
- URL: http://arxiv.org/abs/2309.07509v1
- Date: Thu, 14 Sep 2023 08:22:34 GMT
- Title: DiffTalker: Co-driven audio-image diffusion for talking faces via
intermediate landmarks
- Authors: Zipeng Qi, Xulong Zhang, Ning Cheng, Jing Xiao, Jianzong Wang
- Abstract summary: We present DiffTalker, a novel model designed to generate lifelike talking faces through audio and landmark co-driving.
Experiments showcase DiffTalker's superior performance in producing clear and geometrically accurate talking faces.
- Score: 34.80705897511651
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Generating realistic talking faces is a complex and widely discussed task
with numerous applications. In this paper, we present DiffTalker, a novel model
designed to generate lifelike talking faces through audio and landmark
co-driving. DiffTalker addresses the challenges associated with directly
applying diffusion models to audio control, which are traditionally trained on
text-image pairs. DiffTalker consists of two agent networks: a
transformer-based landmarks completion network for geometric accuracy and a
diffusion-based face generation network for texture details. Landmarks play a
pivotal role in establishing a seamless connection between the audio and image
domains, facilitating the incorporation of knowledge from pre-trained diffusion
models. This innovative approach efficiently produces articulate-speaking
faces. Experimental results showcase DiffTalker's superior performance in
producing clear and geometrically accurate talking faces, all without the need
for additional alignment between audio and image features.
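
The two-stage flow described in the abstract (complete the landmarks from audio, then condition a diffusion model on them) can be illustrated with a toy sketch. This is not the authors' implementation: `complete_landmarks`, `noise_image`, and `build_denoiser_input` are hypothetical names, the landmark rule is a hand-written stand-in for the transformer, and conditioning by concatenation is an assumption. Only the noising formula is the standard DDPM forward step with a linear beta schedule.

```python
import math
import random


def complete_landmarks(audio_feat, upper_face_landmarks, n_mouth=20):
    """Hypothetical stand-in for the landmark completion network.

    The paper uses a transformer here; this toy rule only shows the
    interface: audio features plus known landmarks in, full set out.
    """
    # Assumption: mouth opening grows with mean absolute audio energy.
    energy = sum(abs(a) for a in audio_feat) / len(audio_feat)
    mouth = [
        (0.5 + 0.2 * math.cos(2 * math.pi * k / n_mouth),
         0.8 + 0.1 * energy * math.sin(2 * math.pi * k / n_mouth))
        for k in range(n_mouth)
    ]
    return list(upper_face_landmarks) + mouth


def alpha_bar(t, num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_s) for s = 0..t, linear schedule."""
    prod = 1.0
    for s in range(t + 1):
        beta = beta_start + (beta_end - beta_start) * s / (num_steps - 1)
        prod *= 1.0 - beta
    return prod


def noise_image(x0, t, num_steps, rng):
    """Standard DDPM forward step: x_t = sqrt(ab) * x0 + sqrt(1 - ab) * eps."""
    ab = alpha_bar(t, num_steps)
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [math.sqrt(ab) * x + math.sqrt(1.0 - ab) * e for x, e in zip(x0, eps)]
    return xt, eps


def build_denoiser_input(xt, landmarks):
    """Assumed conditioning: append flattened landmarks to the noisy image."""
    flat = [coord for point in landmarks for coord in point]
    return xt + flat


if __name__ == "__main__":
    rng = random.Random(0)
    upper = [(0.3, 0.3), (0.7, 0.3)]          # two known (x, y) landmarks
    full = complete_landmarks([0.1, -0.2, 0.3], upper)
    xt, eps = noise_image([0.0] * 16, t=10, num_steps=1000, rng=rng)
    print(len(build_denoiser_input(xt, full)))
```

In a real system, a denoising network would be trained to predict `eps` from the conditioned input. The landmarks serve as the geometric signal that links the audio to the image.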
Related papers
- RealTalk: Real-time and Realistic Audio-driven Face Generation with 3D Facial Prior-guided Identity Alignment Network [63.77823518278202]
RealTalk is an audio-to-expression transformer and a high-fidelity expression-to-face framework.
In the first component, we consider both identity and intra-personal variation features related to speaking lip movements.
In the second component, we design a lightweight facial identity alignment (FIA) module.
This novel design allows us to generate fine details in real-time, without depending on sophisticated and inefficient feature alignment modules.
(2024-06-26T12:09:59Z)
- FaceChain-ImagineID: Freely Crafting High-Fidelity Diverse Talking Faces from Disentangled Audio [45.71036380866305]
We abstract the process of people hearing speech, extracting meaningful cues, and creating dynamically audio-consistent talking faces from a single audio.
Specifically, it involves two critical challenges: one is to effectively decouple identity, content, and emotion from entangled audio, and the other is to maintain intra-video diversity and inter-video consistency.
We introduce the Controllable Coherent Frame generation, which involves the flexible integration of three trainable adapters with frozen Latent Diffusion Models.
(2024-03-04T09:59:48Z)
- DreamTalk: When Expressive Talking Head Generation Meets Diffusion Probabilistic Models [26.896633471326744]
We propose a DreamTalk framework to unlock the potential of diffusion models in generating expressive talking heads.
DreamTalk consists of a denoising network, a style-aware lip expert, and a style predictor.
Experimental results demonstrate that DreamTalk is capable of generating photo-realistic talking faces with diverse speaking styles.
(2023-12-15T13:15:42Z)
- SelfTalk: A Self-Supervised Commutative Training Diagram to Comprehend 3D Talking Faces [28.40393487247833]
Speech-driven 3D face animation is extending its applications to various multimedia fields.
Previous research has generated promising realistic lip movements and facial expressions from audio signals.
We propose a novel framework, SelfTalk, which introduces self-supervision into a cross-modal network system to learn 3D talking faces.
(2023-06-19T09:39:10Z)
- Identity-Preserving Talking Face Generation with Landmark and Appearance Priors [106.79923577700345]
Existing person-generic methods have difficulty in generating realistic and lip-synced videos.
We propose a two-stage framework consisting of audio-to-landmark generation and landmark-to-video rendering procedures.
Our method can produce more realistic, lip-synced, and identity-preserving videos than existing person-generic talking face generation methods.
(2023-05-15T01:31:32Z)
- A Survey on Audio Diffusion Models: Text To Speech Synthesis and Enhancement in Generative AI [64.71397830291838]
Generative AI has demonstrated impressive performance in various fields, among which speech synthesis is an interesting direction.
With the diffusion model as the most popular generative model, numerous works have attempted two active tasks: text to speech and speech enhancement.
This work conducts a survey on audio diffusion model, which is complementary to existing surveys.
(2023-03-23T15:17:15Z)
- DiffTalk: Crafting Diffusion Models for Generalized Audio-Driven Portraits Animation [78.08004432704826]
We model talking head generation as an audio-driven, temporally coherent denoising process (DiffTalk).
In this paper, we investigate the control mechanism of the talking face, and incorporate reference face images and landmarks as conditions for personality-aware generalized synthesis.
Our DiffTalk can be gracefully tailored for higher-resolution synthesis with negligible extra computational cost.
(2023-01-10T05:11:25Z)
- DialogueNeRF: Towards Realistic Avatar Face-to-Face Conversation Video Generation [54.84137342837465]
Face-to-face conversations account for the vast majority of daily conversations.
Most existing methods focused on single-person talking head generation.
We propose a novel unified framework based on neural radiance field (NeRF).
(2022-03-15T14:16:49Z)
- Write-a-speaker: Text-based Emotional and Rhythmic Talking-head Generation [28.157431757281692]
We propose a text-based talking-head video generation framework that synthesizes high-fidelity facial expressions and head motions.
Our framework consists of a speaker-independent stage and a speaker-specific stage.
Our algorithm achieves high-quality photo-realistic talking-head videos including various facial expressions and head motions according to speech rhythms.
(2021-04-16T09:44:12Z)
This list is automatically generated from the titles and abstracts of the papers on this site.