Related papers: DAE-Talker: High Fidelity Speech-Driven Talking Face Generation with Diffusion Autoencoder

URL: http://arxiv.org/abs/2303.17550v5
Date: Fri, 1 Mar 2024 11:43:46 GMT
Title: DAE-Talker: High Fidelity Speech-Driven Talking Face Generation with Diffusion Autoencoder
Authors: Chenpeng Du, Qi Chen, Xie Chen, Kai Yu
Abstract summary: We propose DAE-Talker to synthesize full video frames and produce natural head movements that align with the content of speech. We also introduce pose modelling in speech2latent for pose controllability. Our experiments show that DAE-Talker outperforms existing popular methods in lip-sync, video fidelity, and pose naturalness.
Score: 20.814063371439904
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: While recent research has made significant progress in speech-driven talking face generation, the quality of the generated video still lags behind that of real recordings. One reason for this is the use of handcrafted intermediate representations like facial landmarks and 3DMM coefficients, which are designed based on human knowledge and are insufficient to precisely describe facial movements. Additionally, these methods require an external pretrained model for extracting these representations, whose performance sets an upper bound on talking face generation. To address these limitations, we propose a novel method called DAE-Talker that leverages data-driven latent representations obtained from a diffusion autoencoder (DAE). DAE contains an image encoder that encodes an image into a latent vector and a DDIM image decoder that reconstructs the image from it. We train our DAE on talking face video frames and then extract their latent representations as the training target for a Conformer-based speech2latent model. This allows DAE-Talker to synthesize full video frames and produce natural head movements that align with the content of speech, rather than relying on a predetermined head pose from a template video. We also introduce pose modelling in speech2latent for pose controllability. Additionally, we propose a novel method for generating continuous video frames with the DDIM image decoder trained on individual frames, eliminating the need for modelling the joint distribution of consecutive frames directly. Our experiments show that DAE-Talker outperforms existing popular methods in lip-sync, video fidelity, and pose naturalness. We also conduct ablation studies to analyze the effectiveness of the proposed techniques and demonstrate the pose controllability of DAE-Talker.
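
The abstract describes a DDIM image decoder that reconstructs a frame from a latent vector. As a rough illustration only, the sketch below applies the standard deterministic DDIM update (eta = 0) in a loop conditioned on a latent. It is not the authors' implementation: `eps_model`, `oracle_eps`, the toy `target` image, and the `alpha_bars` schedule are hypothetical stand-ins, and the "noise predictor" here is an oracle that already knows the answer, used only so the data flow can be checked.

```python
import numpy as np

def ddim_step(x_t, eps_pred, alpha_bar_t, alpha_bar_prev):
    """One deterministic DDIM update (eta = 0) from timestep t to t-1."""
    # Estimate the clean image implied by the current noise prediction.
    x0_pred = (x_t - np.sqrt(1.0 - alpha_bar_t) * eps_pred) / np.sqrt(alpha_bar_t)
    # Re-noise that estimate to the (lower) noise level of the previous timestep.
    return np.sqrt(alpha_bar_prev) * x0_pred + np.sqrt(1.0 - alpha_bar_prev) * eps_pred

def decode_frame(latent, x_T, eps_model, alpha_bars):
    """Run the DDIM sampler from noise x_T, conditioning every step on `latent`."""
    x = x_T
    for t in range(len(alpha_bars) - 1, -1, -1):
        a_t = alpha_bars[t]
        a_prev = alpha_bars[t - 1] if t > 0 else 1.0  # final step lands on the clean image
        x = ddim_step(x, eps_model(x, t, latent), a_t, a_prev)
    return x

# Hypothetical toy setup: the "image" is a fixed function of the latent.
rng = np.random.default_rng(0)
latent = rng.normal(size=4)
target = np.outer(latent, latent)             # stand-in for the frame the latent encodes
alpha_bars = np.linspace(0.99, 0.01, 10)      # assumed schedule, strictly inside (0, 1)

def oracle_eps(x, t, z):
    # Oracle noise predictor (assumption): returns the exact noise relative to the target.
    a = alpha_bars[t]
    return (x - np.sqrt(a) * np.outer(z, z)) / np.sqrt(1.0 - a)

frame = decode_frame(latent, rng.normal(size=(4, 4)), oracle_eps, alpha_bars)
```

In a real system the oracle would be replaced by a trained network, and the latent would come from an image encoder or from a speech-to-latent model rather than from random numbers.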

Related papers

Exploring Talking Head Models With Adjacent Frame Prior for Speech-Preserving Facial Expression Manipulation [34.89590516635867]
Speech-Preserving Facial Expression Manipulation (SPFEM) is an innovative technique aimed at altering facial expressions in images and videos. Despite advancements, SPFEM still struggles with accurate lip synchronization due to the complex interplay between facial expressions and mouth shapes. We present a new framework, Talking Head Facial Expression Manipulation (THFEM), which utilizes AD-THG models to generate frames with accurately synchronized lip movements.
arXiv Detail & Related papers (2026-01-19T09:31:24Z)
DirectSwap: Mask-Free Cross-Identity Training and Benchmarking for Expression-Consistent Video Head Swapping [58.2549561389375]
Video head swapping aims to replace the entire head of a video subject, including facial identity, head shape, and hairstyle, with that of a reference image. Due to the lack of ground-truth paired swapping data, prior methods typically train on cross-frame pairs of the same person within a video. We propose DirectSwap, a mask-free, direct video head-swapping framework that extends an image U-Net into a video diffusion model.
arXiv Detail & Related papers (2025-12-10T08:31:28Z)
See the Speaker: Crafting High-Resolution Talking Faces from Speech with Prior Guidance and Region Refinement [19.653004988642163]
This work proposes a novel approach that directly extracts information from the speech, addressing key challenges in speech-to-talking face. Notably, this is the first method capable of generating high-resolution, high-quality talking face videos exclusively from a single speech input.
arXiv Detail & Related papers (2025-10-28T09:46:19Z)
Audio-Driven Universal Gaussian Head Avatars [66.56656075831954]
We introduce the first method for audio-driven universal photorealistic avatar synthesis. It combines a person-agnostic speech model with our novel Universal Head Avatar Prior. Our method is not only the first general audio-driven avatar model that can account for detailed appearance modeling and rendering.
arXiv Detail & Related papers (2025-09-23T12:46:43Z)
EmotiveTalk: Expressive Talking Head Generation through Audio Information Decoupling and Emotional Video Diffusion [49.55774551366049]
Diffusion models have revolutionized the field of talking head generation, yet still face challenges in expressiveness, controllability, and stability in long-time generation. We propose an EmotiveTalk framework to address these issues. Experimental results show that EmotiveTalk can generate expressive talking head videos, ensuring the promised controllability of emotions and stability during long-time generation.
arXiv Detail & Related papers (2024-11-23T04:38:51Z)
Speech2UnifiedExpressions: Synchronous Synthesis of Co-Speech Affective Face and Body Expressions from Affordable Inputs [67.27840327499625]
We present a multimodal learning-based method to simultaneously synthesize co-speech facial expressions and upper-body gestures for digital characters. Our approach learns from sparse face landmarks and upper-body joints, estimated directly from video data, to generate plausible emotive character motions.
arXiv Detail & Related papers (2024-06-26T04:53:11Z)
Controllable Talking Face Generation by Implicit Facial Keypoints Editing [6.036277153327655]
We present ControlTalk, a talking face generation method that controls facial expression deformation based on the driving audio. Our experiments show that our method outperforms state-of-the-art methods on widely used benchmarks, including HDTF and MEAD.
arXiv Detail & Related papers (2024-06-05T02:54:46Z)
Seeing and Hearing: Open-domain Visual-Audio Generation with Diffusion Latent Aligners [69.70590867769408]
Video and audio content creation serves as the core technique for the movie industry and professional users. Existing diffusion-based methods tackle video and audio generation separately, which hinders the technique transfer from academia to industry. In this work, we aim at filling the gap, with a carefully designed optimization-based framework for cross-visual-audio and joint-visual-audio generation.
arXiv Detail & Related papers (2024-02-27T17:57:04Z)
DiffPoseTalk: Speech-Driven Stylistic 3D Facial Animation and Head Pose Generation via Diffusion Models [24.401443462720135]
We propose DiffPoseTalk, a generative framework based on the diffusion model combined with a style encoder. In particular, our style includes the generation of head poses, thereby enhancing user perception. We address the shortage of scanned 3D talking face data by training our model on reconstructed 3DMM parameters from a high-quality, in-the-wild audio-visual dataset.
arXiv Detail & Related papers (2023-09-30T17:01:18Z)
Identity-Preserving Talking Face Generation with Landmark and Appearance Priors [106.79923577700345]
Existing person-generic methods have difficulty in generating realistic and lip-synced videos. We propose a two-stage framework consisting of audio-to-landmark generation and landmark-to-video rendering procedures. Our method can produce more realistic, lip-synced, and identity-preserving videos than existing person-generic talking face generation methods.
arXiv Detail & Related papers (2023-05-15T01:31:32Z)
Diffused Heads: Diffusion Models Beat GANs on Talking-Face Generation [54.68893964373141]
Talking face generation has historically struggled to produce head movements and natural facial expressions without guidance from additional reference videos. Recent developments in diffusion-based generative models allow for more realistic and stable data synthesis. We present an autoregressive diffusion model that requires only one identity image and audio sequence to generate a video of a realistic talking human head.
arXiv Detail & Related papers (2023-01-06T14:16:54Z)
Pose-Controllable Talking Face Generation by Implicitly Modularized Audio-Visual Representation [96.66010515343106]
We propose a clean yet effective framework to generate pose-controllable talking faces. We operate on raw face images, using only a single photo as an identity reference. Our model has multiple advanced capabilities including extreme view robustness and talking face frontalization.
arXiv Detail & Related papers (2021-04-22T15:10:26Z)
Realistic Face Reenactment via Self-Supervised Disentangling of Identity and Pose [23.211318473026243]
We propose a self-supervised hybrid model (DAE-GAN) that learns how to reenact faces naturally given large amounts of unlabeled videos. Our approach combines two deforming autoencoders with the latest advances in conditional generation. Experimental results demonstrate the superior quality of reenacted images and the flexibility of transferring facial movements between identities.
arXiv Detail & Related papers (2020-03-29T06:45:17Z)

This list is automatically generated from the titles and abstracts of the papers on this site.