Audio-Plane: Audio Factorization Plane Gaussian Splatting for Real-Time Talking Head Synthesis
- URL: http://arxiv.org/abs/2503.22605v1
- Date: Fri, 28 Mar 2025 16:50:27 GMT
- Title: Audio-Plane: Audio Factorization Plane Gaussian Splatting for Real-Time Talking Head Synthesis
- Authors: Shuai Shen, Wanhua Li, Yunpeng Zhang, Weipeng Hu, Yap-Peng Tan,
- Abstract summary: We present a novel approach that leverages an Audio Factorization Plane (Audio-Plane) based Gaussian Splatting for real-time talking head generation. Our method is capable of synthesizing highly realistic talking videos in real time while ensuring precise audio-lip synchronization.
- Score: 22.042129396991253
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Talking head synthesis has become a key research area in computer graphics and multimedia, yet most existing methods often struggle to balance generation quality with computational efficiency. In this paper, we present a novel approach that leverages an Audio Factorization Plane (Audio-Plane) based Gaussian Splatting for high-quality and real-time talking head generation. Modeling a dynamic talking head requires a 4D volume representation. However, directly storing a dense 4D grid is impractical due to the high cost and lack of scalability for longer durations. We overcome this challenge with the proposed Audio-Plane, where the 4D volume representation is decomposed into audio-independent space planes and audio-dependent planes. This provides a compact and interpretable feature representation for talking heads, facilitating more precise audio-aware spatial encoding and enhanced audio-driven lip dynamic modeling. To further improve speech dynamics, we develop a dynamic splatting method that helps the network focus more effectively on modeling the dynamics of the mouth region. Extensive experiments demonstrate that by integrating these innovations with the powerful Gaussian Splatting, our method is capable of synthesizing highly realistic talking videos in real time while ensuring precise audio-lip synchronization. Synthesized results are available at https://sstzal.github.io/Audio-Plane/.
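To make the factorization idea concrete, the following is a minimal PyTorch sketch of one way a 4D feature volume could be decomposed into audio-independent spatial planes and audio-dependent planes and queried per 3D point. The plane layout, resolutions, channel counts, and summation-based fusion are illustrative assumptions, not the paper's exact design; the paper's dynamic splatting for the mouth region is omitted.

```python
# Minimal sketch of a factorized 4D feature volume: audio-independent
# spatial planes (XY, XZ, YZ) plus audio-dependent planes (XA, YA, ZA).
# Resolutions, channel counts, and the fusion rule are illustrative
# assumptions; they are not taken from the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F


class AudioFactorizedPlanes(nn.Module):
    def __init__(self, res=64, channels=32):
        super().__init__()
        # Three audio-independent spatial planes and three audio-dependent
        # planes, each stored as a learnable 2D feature grid.
        self.spatial_planes = nn.ParameterList(
            [nn.Parameter(0.1 * torch.randn(1, channels, res, res)) for _ in range(3)]
        )
        self.audio_planes = nn.ParameterList(
            [nn.Parameter(0.1 * torch.randn(1, channels, res, res)) for _ in range(3)]
        )

    @staticmethod
    def _sample(plane, coords):
        # coords: (N, 2) in [-1, 1]; bilinear lookup on a 2D feature grid.
        grid = coords.view(1, -1, 1, 2)
        feat = F.grid_sample(plane, grid, align_corners=True)  # (1, C, N, 1)
        return feat.squeeze(0).squeeze(-1).t()                 # (N, C)

    def forward(self, xyz, audio):
        # xyz:   (N, 3) point coordinates normalized to [-1, 1]
        # audio: (N, 1) per-frame audio coordinate/feature in [-1, 1]
        x, y, z = xyz.unbind(-1)
        pairs_spatial = [(x, y), (x, z), (y, z)]
        pairs_audio = [(x, audio[:, 0]), (y, audio[:, 0]), (z, audio[:, 0])]

        feat = 0
        for plane, (u, v) in zip(self.spatial_planes, pairs_spatial):
            feat = feat + self._sample(plane, torch.stack([u, v], -1))
        for plane, (u, v) in zip(self.audio_planes, pairs_audio):
            feat = feat + self._sample(plane, torch.stack([u, v], -1))
        return feat  # (N, C) feature, later decoded into Gaussian attributes


# Example query: 1024 Gaussian centers conditioned on one audio frame.
planes = AudioFactorizedPlanes()
pts = torch.rand(1024, 3) * 2 - 1
aud = torch.full((1024, 1), 0.3)
features = planes(pts, aud)
print(features.shape)  # torch.Size([1024, 32])
```

In such a design, a small decoder would map the queried feature to per-Gaussian attributes (e.g., position offsets, opacity, color), which is what keeps the representation compact compared with a dense 4D grid.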
Related papers
- 4D Gaussian Splatting: Modeling Dynamic Scenes with Native 4D Primitives [116.2042238179433]
In this paper, we frame dynamic scenes as unconstrained 4D volume learning problems.
We represent a target dynamic scene using a collection of 4D Gaussian primitives with explicit geometry and appearance features.
This approach can capture relevant information in space and time by fitting the underlying spatio-temporal volume.
Notably, our 4DGS model is the first solution that supports real-time rendering of high-resolution, novel views for complex dynamic scenes.
arXiv Detail & Related papers (2024-12-30T05:30:26Z) - PointTalk: Audio-Driven Dynamic Lip Point Cloud for 3D Gaussian-based Talking Head Synthesis [27.97031664678664]
Methods based on radiance fields have received increasing attention due to their ability to synthesize high-fidelity talking heads.
We propose a novel 3D Gaussian-based method called PointTalk, which constructs a static 3D Gaussian field of the head and deforms it in sync with the audio.
Our method achieves superior high-fidelity and audio-lip synchronization in talking head synthesis compared to previous methods.
arXiv Detail & Related papers (2024-12-11T16:15:14Z) - Both Ears Wide Open: Towards Language-Driven Spatial Audio Generation [32.24603883810094]
Controlling stereo audio with spatial contexts remains challenging due to high data costs and unstable generative models.
We first construct a large-scale, simulation-based, and GPT-assisted dataset, BEWO-1M, with abundant soundscapes and descriptions.
By leveraging spatial guidance, our model achieves the objective of generating immersive and controllable spatial audio from text.
arXiv Detail & Related papers (2024-10-14T16:18:29Z) - AV-GS: Learning Material and Geometry Aware Priors for Novel View Acoustic Synthesis [62.33446681243413]
Novel view acoustic synthesis aims to render audio at any target viewpoint, given a mono audio emitted by a sound source in a 3D scene.
Existing methods have proposed NeRF-based implicit models to exploit visual cues as a condition for synthesizing audio.
We propose a novel Audio-Visual Gaussian Splatting (AV-GS) model to characterize the entire scene environment.
Experiments validate the superiority of our AV-GS over existing alternatives on the real-world RWAS and simulation-based SoundSpaces datasets.
arXiv Detail & Related papers (2024-06-13T08:34:12Z) - Talk3D: High-Fidelity Talking Portrait Synthesis via Personalized 3D Generative Prior [29.120669908374424]
We introduce a novel audio-driven talking head synthesis framework, called Talk3D.
It faithfully reconstructs plausible facial geometries by effectively adopting a pre-trained 3D-aware generative prior.
Compared to existing methods, our method excels in generating realistic facial geometries even under extreme head poses.
arXiv Detail & Related papers (2024-03-29T12:49:40Z) - FaceTalk: Audio-Driven Motion Diffusion for Neural Parametric Head Models [85.16273912625022]
We introduce FaceTalk, a novel generative approach designed for synthesizing high-fidelity 3D motion sequences of talking human heads from audio signal.
To the best of our knowledge, this is the first work to propose a generative approach for realistic and high-quality motion synthesis of human heads.
arXiv Detail & Related papers (2023-12-13T19:01:07Z) - AV-NeRF: Learning Neural Fields for Real-World Audio-Visual Scene Synthesis [61.07542274267568]
We study a new task -- real-world audio-visual scene synthesis -- and propose a first-of-its-kind NeRF-based approach for multimodal learning.
We propose an acoustic-aware audio generation module that integrates prior knowledge of audio propagation into NeRF.
We present a coordinate transformation module that expresses a view direction relative to the sound source, enabling the model to learn sound source-centric acoustic fields.
arXiv Detail & Related papers (2023-02-04T04:17:19Z) - Real-time Neural Radiance Talking Portrait Synthesis via Audio-spatial Decomposition [61.6677901687009]
We propose an efficient NeRF-based framework that enables real-time synthesizing of talking portraits.
Our method can generate realistic and audio-lip-synchronized talking portrait videos.
arXiv Detail & Related papers (2022-11-22T16:03:11Z) - Learning to Listen: Modeling Non-Deterministic Dyadic Facial Motion [89.01668641930206]
We present a framework for modeling interactional communication in dyadic conversations.
We autoregressively output multiple possibilities of corresponding listener motion.
Our method organically captures the multimodal and non-deterministic nature of nonverbal dyadic interactions.
arXiv Detail & Related papers (2022-04-18T17:58:04Z) - AD-NeRF: Audio Driven Neural Radiance Fields for Talking Head Synthesis [55.24336227884039]
We present a novel framework to generate high-fidelity talking head video.
We use neural scene representation networks to bridge the gap between audio input and video output.
Our framework can (1) produce high-fidelity and natural results, and (2) support free adjustment of audio signals, viewing directions, and background images.
arXiv Detail & Related papers (2021-03-20T02:58:13Z)
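To illustrate the audio-to-video bridging idea used by NeRF-based methods such as AD-NeRF above, here is a minimal PyTorch sketch of an audio-conditioned radiance field: an MLP maps an encoded 3D point plus a per-frame audio feature to density, and adds the viewing direction for color. The layer widths, sinusoidal encoding, and audio feature dimension are assumptions for illustration, not the published architecture.

```python
# A minimal sketch of an audio-conditioned radiance field in the spirit of
# AD-NeRF. Layer widths, encoding frequencies, and the audio feature size
# are illustrative assumptions.
import torch
import torch.nn as nn


def positional_encoding(x, num_freqs=6):
    # Standard NeRF-style sinusoidal encoding of coordinates in [-1, 1].
    freqs = 2.0 ** torch.arange(num_freqs, device=x.device)
    angles = x[..., None] * freqs                       # (..., D, F)
    enc = torch.cat([angles.sin(), angles.cos()], dim=-1)
    return enc.flatten(-2)                              # (..., D * 2F)


class AudioConditionedNeRF(nn.Module):
    def __init__(self, audio_dim=64, hidden=256, num_freqs=6):
        super().__init__()
        pos_dim = 3 * 2 * num_freqs
        dir_dim = 3 * 2 * num_freqs
        self.num_freqs = num_freqs
        self.trunk = nn.Sequential(
            nn.Linear(pos_dim + audio_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.density_head = nn.Linear(hidden, 1)
        self.color_head = nn.Sequential(
            nn.Linear(hidden + dir_dim, hidden // 2), nn.ReLU(),
            nn.Linear(hidden // 2, 3), nn.Sigmoid(),
        )

    def forward(self, xyz, view_dir, audio_feat):
        # xyz, view_dir: (N, 3); audio_feat: (N, audio_dim) from an audio encoder.
        h = self.trunk(torch.cat(
            [positional_encoding(xyz, self.num_freqs), audio_feat], dim=-1))
        density = torch.relu(self.density_head(h))       # (N, 1)
        color = self.color_head(torch.cat(
            [h, positional_encoding(view_dir, self.num_freqs)], dim=-1))  # (N, 3)
        return density, color


# Example: query 2048 ray samples for one audio frame.
model = AudioConditionedNeRF()
pts = torch.rand(2048, 3) * 2 - 1
dirs = torch.nn.functional.normalize(torch.randn(2048, 3), dim=-1)
aud = torch.randn(2048, 64)
sigma, rgb = model(pts, dirs, aud)
print(sigma.shape, rgb.shape)  # torch.Size([2048, 1]) torch.Size([2048, 3])
```

Conditioning density and color on the audio feature is what lets such a model re-render the same head pose with different speech, which is the "free adjustment of audio signals" described above; volume rendering along camera rays would then composite these samples into pixels.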
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences.