Dual Audio-Centric Modality Coupling for Talking Head Generation
- URL: http://arxiv.org/abs/2503.22728v1
- Date: Wed, 26 Mar 2025 06:46:51 GMT
- Title: Dual Audio-Centric Modality Coupling for Talking Head Generation
- Authors: Ao Fu, Ziqi Ni, Yi Zhou
- Abstract summary: The generation of audio-driven talking head videos is a key challenge in computer vision and graphics, with applications in virtual avatars and digital media. Traditional approaches often struggle with capturing the complex interaction between audio and facial dynamics, leading to lip synchronization and visual quality issues. We propose a novel NeRF-based framework, Dual Audio-Centric Modality Coupling (DAMC), which effectively integrates content and dynamic features from audio inputs.
- Score: 4.03322932416974
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The generation of audio-driven talking head videos is a key challenge in computer vision and graphics, with applications in virtual avatars and digital media. Traditional approaches often struggle with capturing the complex interaction between audio and facial dynamics, leading to lip synchronization and visual quality issues. In this paper, we propose a novel NeRF-based framework, Dual Audio-Centric Modality Coupling (DAMC), which effectively integrates content and dynamic features from audio inputs. By leveraging a dual encoder structure, DAMC captures semantic content through the Content-Aware Encoder and ensures precise visual synchronization through the Dynamic-Sync Encoder. These features are fused using a Cross-Synchronized Fusion Module (CSFM), enhancing content representation and lip synchronization. Extensive experiments show that our method outperforms existing state-of-the-art approaches in key metrics such as lip synchronization accuracy and image quality, demonstrating robust generalization across various audio inputs, including synthetic speech from text-to-speech (TTS) systems. Our results provide a promising solution for high-quality, audio-driven talking head generation and present a scalable approach for creating realistic talking heads.
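The abstract describes the DAMC architecture only at a high level: two audio encoders (Content-Aware and Dynamic-Sync) whose outputs are fused by a Cross-Synchronized Fusion Module (CSFM) before conditioning a NeRF renderer. The snippet below is a minimal, hypothetical PyTorch sketch of that dual-encoder-plus-fusion idea; the encoder types, feature dimensions, and the use of cross-attention for the fusion step are assumptions made for illustration, not the authors' implementation, and the NeRF rendering stage is omitted.

```python
# Minimal sketch of a dual audio-encoder + fusion design, loosely following the
# DAMC description (Content-Aware Encoder, Dynamic-Sync Encoder, CSFM).
# Dimensions, layer counts, and the cross-attention fusion are illustrative assumptions.
import torch
import torch.nn as nn

class AudioEncoder(nn.Module):
    """Generic transformer encoder over per-frame audio features."""
    def __init__(self, in_dim: int, d_model: int = 256, n_layers: int = 4):
        super().__init__()
        self.proj = nn.Linear(in_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, x):                      # x: (B, T, in_dim)
        return self.encoder(self.proj(x))      # (B, T, d_model)

class CrossSyncFusion(nn.Module):
    """Fuse content and dynamic streams with cross-attention (assumed CSFM stand-in)."""
    def __init__(self, d_model: int = 256):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, content, dynamic):
        # Content features query the dynamic (sync) features, then residual-add.
        fused, _ = self.attn(query=content, key=dynamic, value=dynamic)
        return self.norm(content + fused)       # (B, T, d_model)

class DualAudioCoupling(nn.Module):
    def __init__(self, content_dim: int = 768, dynamic_dim: int = 512):
        super().__init__()
        self.content_enc = AudioEncoder(content_dim)   # e.g. ASR-style semantic features
        self.dynamic_enc = AudioEncoder(dynamic_dim)   # e.g. sync/lip-motion features
        self.fusion = CrossSyncFusion()

    def forward(self, content_feat, dynamic_feat):
        c = self.content_enc(content_feat)
        d = self.dynamic_enc(dynamic_feat)
        return self.fusion(c, d)   # conditioning signal for a downstream renderer

# Example: fuse 100 audio frames of hypothetical content/dynamic features.
model = DualAudioCoupling()
cond = model(torch.randn(1, 100, 768), torch.randn(1, 100, 512))
print(cond.shape)  # torch.Size([1, 100, 256])
```

In this reading, the content stream carries semantic (what-is-said) features while the dynamic stream carries timing-sensitive sync features, and the fused sequence serves as the audio conditioning for frame rendering.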
Related papers
- OmniTalker: Real-Time Text-Driven Talking Head Generation with In-Context Audio-Visual Style Replication [19.688375369516923]
We introduce an end-to-end unified framework that simultaneously generates synchronized speech and talking head videos from text and reference video in real-time zero-shot scenarios.
Our method surpasses existing approaches in generation quality, particularly excelling in style preservation and audio-video synchronization.
arXiv Detail & Related papers (2025-04-03T09:48:13Z) - UniSync: A Unified Framework for Audio-Visual Synchronization [7.120340851879775]
We present UniSync, a novel approach for evaluating audio-visual synchronization using embedding similarities. We enhance the contrastive learning framework with a margin-based loss component and cross-speaker unsynchronized pairs (a minimal sketch of such a margin-based sync loss appears after this list). UniSync outperforms existing methods on standard datasets.
arXiv Detail & Related papers (2025-03-20T17:16:03Z) - Cosh-DiT: Co-Speech Gesture Video Synthesis via Hybrid Audio-Visual Diffusion Transformers [58.86974149731874]
Cosh-DiT is a co-speech gesture video synthesis system with hybrid Diffusion Transformers.
We introduce an audio Diffusion Transformer to synthesize expressive gesture dynamics synchronized with speech rhythms.
For realistic video synthesis conditioned on the generated speech-driven motion, we design a visual Diffusion Transformer.
arXiv Detail & Related papers (2025-03-13T01:36:05Z) - PointTalk: Audio-Driven Dynamic Lip Point Cloud for 3D Gaussian-based Talking Head Synthesis [27.97031664678664]
Methods based on radiance fields have received increasing attention due to their ability to synthesize high-fidelity talking heads. We propose a novel 3D Gaussian-based method called PointTalk, which constructs a static 3D Gaussian field of the head and deforms it in sync with the audio. Our method achieves superior fidelity and audio-lip synchronization in talking head synthesis compared to previous methods.
arXiv Detail & Related papers (2024-12-11T16:15:14Z) - MEMO: Memory-Guided Diffusion for Expressive Talking Video Generation [55.95148886437854]
Memory-guided EMOtion-aware diffusion (MEMO) is an end-to-end audio-driven portrait animation approach to generate talking videos. MEMO generates more realistic talking videos across diverse image and audio types, outperforming state-of-the-art methods in overall quality, audio-lip synchronization, identity consistency, and expression-emotion alignment.
arXiv Detail & Related papers (2024-12-05T18:57:26Z) - When Video Coding Meets Multimodal Large Language Models: A Unified Paradigm for Video Coding [118.72266141321647]
Cross-Modality Video Coding (CMVC) is a pioneering approach to explore multimodality representation and video generative models in video coding. During decoding, previously encoded components and video generation models are leveraged to create multiple encoding-decoding modes. Experiments indicate that TT2V achieves effective semantic reconstruction, while IT2V exhibits competitive perceptual consistency.
arXiv Detail & Related papers (2024-08-15T11:36:18Z) - Masked Generative Video-to-Audio Transformers with Enhanced Synchronicity [12.848371604063168]
We propose a V2A generative model, named MaskVAT, that interconnects a full-band high-quality general audio codec with a sequence-to-sequence masked generative model.
Our results show that, by combining a high-quality codec with the proper pre-trained audio-visual features and a sequence-to-sequence parallel structure, we are able to yield highly synchronized results.
arXiv Detail & Related papers (2024-07-15T01:49:59Z) - Hallo: Hierarchical Audio-Driven Visual Synthesis for Portrait Image Animation [29.87407471246318]
This research delves into the complexities of synchronizing facial movements and creating visually appealing, temporally consistent animations.
Our innovative approach embraces the end-to-end diffusion paradigm and introduces a hierarchical audio-driven visual synthesis module.
The proposed hierarchical audio-driven visual synthesis offers adaptive control over expression and pose diversity, enabling more effective personalization tailored to different identities.
arXiv Detail & Related papers (2024-06-13T04:33:20Z) - FaceChain-ImagineID: Freely Crafting High-Fidelity Diverse Talking Faces from Disentangled Audio [45.71036380866305]
We abstract the process of people hearing speech, extracting meaningful cues, and creating dynamically audio-consistent talking faces from a single audio clip.
Specifically, it involves two critical challenges: one is to effectively decouple identity, content, and emotion from entangled audio, and the other is to maintain intra-video diversity and inter-video consistency.
We introduce the Controllable Coherent Frame generation, which involves the flexible integration of three trainable adapters with frozen Latent Diffusion Models.
arXiv Detail & Related papers (2024-03-04T09:59:48Z) - Improving Audio-Visual Speech Recognition by Lip-Subword Correlation Based Visual Pre-training and Cross-Modal Fusion Encoder [58.523884148942166]
We propose two novel techniques to improve audio-visual speech recognition (AVSR) under a pre-training and fine-tuning training framework.
First, we explore the correlation between lip shapes and syllable-level subword units in Mandarin to establish good frame-level syllable boundaries from lip shapes.
Next, we propose an audio-guided cross-modal fusion encoder (CMFE) neural network to utilize main training parameters for multiple cross-modal attention layers.
arXiv Detail & Related papers (2023-08-14T08:19:24Z) - Align, Adapt and Inject: Sound-guided Unified Image Generation [50.34667929051005]
We propose a unified framework 'Align, Adapt, and Inject' (AAI) for sound-guided image generation, editing, and stylization.
Our method adapts input sound into a sound token, like an ordinary word, which can plug and play with existing Text-to-Image (T2I) models.
Our proposed AAI outperforms other text and sound-guided state-of-the-art methods.
arXiv Detail & Related papers (2023-06-20T12:50:49Z) - VCVTS: Multi-speaker Video-to-Speech synthesis via cross-modal knowledge transfer from voice conversion [77.50171525265056]
This paper proposes a novel multi-speaker Video-to-Speech (VTS) system based on cross-modal knowledge transfer from voice conversion (VC).
The Lip2Ind network can substitute the content encoder of VC to form a multi-speaker VTS system, converting silent video into acoustic units for reconstructing accurate spoken content.
arXiv Detail & Related papers (2022-02-18T08:58:45Z)
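Several entries above, most directly UniSync, score audio-visual synchronization by comparing audio and visual embeddings and training with a margin-based contrastive objective that uses unsynchronized clips as negatives. The following is a minimal, hypothetical sketch of such a loss over cosine similarities; it is an illustrative assumption about how a margin-based sync loss can look, not the released code of any listed paper.

```python
# Hypothetical margin-based contrastive loss for audio-visual synchronization,
# in the spirit of the UniSync entry above (not that paper's actual code).
# Positive pairs are synchronized audio/visual embeddings; negatives are
# unsynchronized (e.g. time-shifted or cross-speaker) pairs.
import torch
import torch.nn.functional as F

def sync_margin_loss(audio_emb, visual_emb, neg_visual_emb, margin: float = 0.2):
    """
    audio_emb, visual_emb: (B, D) embeddings of synchronized clips.
    neg_visual_emb:        (B, D) embeddings of unsynchronized clips.
    Encourages cos(a, v_pos) to exceed cos(a, v_neg) by at least `margin`.
    """
    pos_sim = F.cosine_similarity(audio_emb, visual_emb, dim=-1)      # (B,)
    neg_sim = F.cosine_similarity(audio_emb, neg_visual_emb, dim=-1)  # (B,)
    return F.relu(margin + neg_sim - pos_sim).mean()

# Toy usage with random embeddings.
a = torch.randn(8, 512)
v_pos = torch.randn(8, 512)
v_neg = torch.randn(8, 512)
print(sync_margin_loss(a, v_pos, v_neg))
```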