Related papers: FaceChain-ImagineID: Freely Crafting High-Fidelity Diverse Talking Faces from Disentangled Audio

FaceChain-ImagineID: Freely Crafting High-Fidelity Diverse Talking Faces from Disentangled Audio

URL: http://arxiv.org/abs/2403.01901v2
Date: Mon, 1 Apr 2024 03:00:21 GMT
Title: FaceChain-ImagineID: Freely Crafting High-Fidelity Diverse Talking Faces from Disentangled Audio
Authors: Chao Xu, Yang Liu, Jiazheng Xing, Weida Wang, Mingze Sun, Jun Dan, Tianxin Huang, Siyuan Li, Zhi-Qi Cheng, Ying Tai, Baigui Sun,
Abstract summary: We abstract the process of people hearing speech, extracting meaningful cues, and creating dynamically audio-consistent talking faces from a single audio. Specifically, it involves two critical challenges: one is to effectively decouple identity, content, and emotion from entangled audio, and the other is to maintain intra-video diversity and inter-video consistency. We introduce the Controllable Coherent Frame generation, which involves the flexible integration of three trainable adapters with frozen Latent Diffusion Models.
Score: 45.71036380866305
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: In this paper, we abstract the process of people hearing speech, extracting meaningful cues, and creating various dynamically audio-consistent talking faces, termed Listening and Imagining, into the task of high-fidelity diverse talking faces generation from a single audio. Specifically, it involves two critical challenges: one is to effectively decouple identity, content, and emotion from entangled audio, and the other is to maintain intra-video diversity and inter-video consistency. To tackle the issues, we first dig out the intricate relationships among facial factors and simplify the decoupling process, tailoring a Progressive Audio Disentanglement for accurate facial geometry and semantics learning, where each stage incorporates a customized training module responsible for a specific factor. Secondly, to achieve visually diverse and audio-synchronized animation solely from input audio within a single model, we introduce the Controllable Coherent Frame generation, which involves the flexible integration of three trainable adapters with frozen Latent Diffusion Models (LDMs) to focus on maintaining facial geometry and semantics, as well as texture and temporal coherence between frames. In this way, we inherit high-quality diverse generation from LDMs while significantly improving their controllability at a low training cost. Extensive experiments demonstrate the flexibility and effectiveness of our method in handling this paradigm. The codes will be released at https://github.com/modelscope/facechain.

Related papers

Text2Lip: Progressive Lip-Synced Talking Face Generation from Text via Viseme-Guided Rendering [53.2204901422631]
Text2Lip is a viseme-centric framework that constructs an interpretable phonetic-visual bridge.<n>We show that Text2Lip outperforms existing approaches in semantic fidelity, visual realism, and modality robustness.
arXiv Detail & Related papers (2025-08-04T12:50:22Z)
AsynFusion: Towards Asynchronous Latent Consistency Models for Decoupled Whole-Body Audio-Driven Avatars [65.53676584955686]
Whole-body audio-driven avatar pose and expression generation is a critical task for creating lifelike digital humans.<n>We propose AsynFusion, a novel framework that leverages diffusion transformers to achieve cohesive expression and gesture synthesis.<n>AsynFusion achieves state-of-the-art performance in generating real-time, synchronized whole-body animations.
arXiv Detail & Related papers (2025-05-21T03:28:53Z)
OmniTalker: One-shot Real-time Text-Driven Talking Audio-Video Generation With Multimodal Style Mimicking [22.337906095079198]
We present OmniTalker, a unified framework that jointly generates synchronized talking audio-video content from input text.<n>Our framework adopts a dual-branch diffusion transformer (DiT) architecture, with one branch dedicated to audio generation and the other to video synthesis.
arXiv Detail & Related papers (2025-04-03T09:48:13Z)
EMO2: End-Effector Guided Audio-Driven Avatar Video Generation [17.816939983301474]
We propose a novel audio-driven talking head method capable of simultaneously generating highly expressive facial expressions and hand gestures. In the first stage, we generate hand poses directly from audio input, leveraging the strong correlation between audio signals and hand movements. In the second stage, we employ a diffusion model to synthesize video frames, incorporating the hand poses generated in the first stage to produce realistic facial expressions and body movements.
arXiv Detail & Related papers (2025-01-18T07:51:29Z)
PortraitTalk: Towards Customizable One-Shot Audio-to-Talking Face Generation [48.94486508604052]
We introduce a novel, customizable one-shot audio-driven talking face generation framework, named PortraitTalk.<n>Our proposed method utilizes a latent diffusion framework consisting of two main components: IdentityNet and AnimateNet.<n>Key innovation of PortraitTalk is the incorporation of text prompts through decoupled cross-attention mechanisms.
arXiv Detail & Related papers (2024-12-10T18:51:31Z)
MEMO: Memory-Guided Diffusion for Expressive Talking Video Generation [55.95148886437854]
Memory-guided EMOtion-aware diffusion (MEMO) is an end-to-end audio-driven portrait animation approach to generate talking videos. MEMO generates more realistic talking videos across diverse image and audio types, outperforming state-of-the-art methods in overall quality, audio-lip synchronization, identity consistency, and expression-emotion alignment.
arXiv Detail & Related papers (2024-12-05T18:57:26Z)
JEAN: Joint Expression and Audio-guided NeRF-based Talking Face Generation [24.2065254076207]
We introduce a novel method for joint expression and audio-guided talking face generation. Our method can synthesize high-fidelity talking face videos, achieving state-of-the-art facial expression transfer.
arXiv Detail & Related papers (2024-09-18T17:18:13Z)
High-fidelity and Lip-synced Talking Face Synthesis via Landmark-based Diffusion Model [89.29655924125461]
We propose a novel landmark-based diffusion model for talking face generation. We first establish the less ambiguous mapping from audio to landmark motion of lip and jaw. Then, we introduce an innovative conditioning module called TalkFormer to align the synthesized motion with the motion represented by landmarks.
arXiv Detail & Related papers (2024-08-10T02:58:28Z)
RealTalk: Real-time and Realistic Audio-driven Face Generation with 3D Facial Prior-guided Identity Alignment Network [48.95833484103569]
RealTalk is an audio-to-expression transformer and a high-fidelity expression-to-face framework. In the first component, we consider both identity and intra-personal variation features related to speaking lip movements. In the second component, we design a lightweight facial identity alignment (FIA) module. This novel design allows us to generate fine details in real-time, without depending on sophisticated and inefficient feature alignment modules.
arXiv Detail & Related papers (2024-06-26T12:09:59Z)
Controllable Talking Face Generation by Implicit Facial Keypoints Editing [6.036277153327655]
We present ControlTalk, a talking face generation method to control face expression deformation based on driven audio. Our experiments show that our method is superior to state-of-the-art performance on widely used benchmarks, including HDTF and MEAD.
arXiv Detail & Related papers (2024-06-05T02:54:46Z)
EMAGE: Towards Unified Holistic Co-Speech Gesture Generation via Expressive Masked Audio Gesture Modeling [57.08286593059137]
We propose EMAGE, a framework to generate full-body human gestures from audio and masked gestures. We first introduce BEAT2 (BEAT-SMPLX-FLAME), a new mesh-level holistic co-speech dataset. Experiments demonstrate that EMAGE generates holistic gestures with state-of-the-art performance.
arXiv Detail & Related papers (2023-12-31T02:25:41Z)
Cooperative Dual Attention for Audio-Visual Speech Enhancement with Facial Cues [80.53407593586411]
We focus on leveraging facial cues beyond the lip region for robust Audio-Visual Speech Enhancement (AVSE) We propose a Dual Attention Cooperative Framework, DualAVSE, to ignore speech-unrelated information, capture speech-related information with facial cues, and dynamically integrate it with the audio signal for AVSE.
arXiv Detail & Related papers (2023-11-24T04:30:31Z)
DualTalker: A Cross-Modal Dual Learning Approach for Speech-Driven 3D Facial Animation [10.73030153404956]
We propose a cross-modal dual-learning framework, termed DualTalker, to improve data usage efficiency. The framework is trained jointly with the primary task (audio-driven facial animation) and its dual task (lip reading) and shares common audio/motion encoder components. Our approach outperforms current state-of-the-art methods both qualitatively and quantitatively.
arXiv Detail & Related papers (2023-11-08T15:39:56Z)
SelfTalk: A Self-Supervised Commutative Training Diagram to Comprehend 3D Talking Faces [28.40393487247833]
Speech-driven 3D face animation technique, extending its applications to various multimedia fields. Previous research has generated promising realistic lip movements and facial expressions from audio signals. We propose a novel framework SelfTalk, by involving self-supervision in a cross-modals network system to learn 3D talking faces.
arXiv Detail & Related papers (2023-06-19T09:39:10Z)
Pose-Controllable Talking Face Generation by Implicitly Modularized Audio-Visual Representation [96.66010515343106]
We propose a clean yet effective framework to generate pose-controllable talking faces. We operate on raw face images, using only a single photo as an identity reference. Our model has multiple advanced capabilities including extreme view robustness and talking face frontalization.
arXiv Detail & Related papers (2021-04-22T15:10:26Z)
Robust One Shot Audio to Video Generation [10.957973845883162]
OneShotA2V is a novel approach to synthesize a talking person video of arbitrary length using as input: an audio signal and a single unseen image of a person. OneShotA2V leverages curriculum learning to learn movements of expressive facial components and hence generates a high-quality talking-head video of the given person.
arXiv Detail & Related papers (2020-12-14T10:50:05Z)

This list is automatically generated from the titles and abstracts of the papers in this site.