D^3-Talker: Dual-Branch Decoupled Deformation Fields for Few-Shot 3D Talking Head Synthesis
- URL: http://arxiv.org/abs/2508.14449v1
- Date: Wed, 20 Aug 2025 06:12:33 GMT
- Title: D^3-Talker: Dual-Branch Decoupled Deformation Fields for Few-Shot 3D Talking Head Synthesis
- Authors: Yuhang Guo, Kaijun Deng, Siyang Song, Jindong Xie, Wenhui Ma, Linlin Shen
- Abstract summary: A key challenge in 3D talking head synthesis lies in the reliance on a long-duration talking head video to train a new model from scratch. Recent methods have attempted to address this issue by extracting general features from audio through pre-training models. This paper proposes D^3-Talker, a novel approach that constructs a static 3D Gaussian attribute field and employs audio and facial motion signals to independently control two distinct Gaussian attribute deformation fields.
- Score: 28.923949756720425
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: A key challenge in 3D talking head synthesis lies in the reliance on a long-duration talking head video to train a new model for each target identity from scratch. Recent methods have attempted to address this issue by extracting general features from audio through pre-training models. However, since audio contains information irrelevant to lip motion, existing approaches typically struggle to map the given audio to realistic lip behaviors in the target face when trained on only a few frames, causing poor lip synchronization and talking head image quality. This paper proposes D^3-Talker, a novel approach that constructs a static 3D Gaussian attribute field and employs audio and Facial Motion signals to independently control two distinct Gaussian attribute deformation fields, effectively decoupling the predictions of general and personalized deformations. We design a novel similarity contrastive loss function during pre-training to achieve more thorough decoupling. Furthermore, we integrate a Coarse-to-Fine module to refine the rendered images, alleviating blurriness caused by head movements and enhancing overall image quality. Extensive experiments demonstrate that D^3-Talker outperforms state-of-the-art methods in both high-fidelity rendering and accurate audio-lip synchronization with limited training data. Our code will be provided upon acceptance.
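The dual-branch idea in the abstract (a static Gaussian field deformed by an audio-driven "general" branch and a facial-motion-driven "personalized" branch, with a loss that keeps the two decoupled) can be sketched roughly as below. This is a minimal illustrative sketch, not the paper's implementation: the module names, feature dimensions, position-only offsets, and the simple cosine-dissimilarity loss are all assumptions.

```python
import torch
import torch.nn as nn

class DeformationBranch(nn.Module):
    """Small MLP mapping a per-frame driving signal plus Gaussian centers
    to per-Gaussian attribute offsets (simplified here to position only)."""
    def __init__(self, signal_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(signal_dim + 3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3),  # xyz offset per Gaussian
        )

    def forward(self, signal, xyz):
        # signal: (F,) driving feature; xyz: (N, 3) static Gaussian centers
        cond = signal.unsqueeze(0).expand(xyz.shape[0], -1)
        return self.net(torch.cat([cond, xyz], dim=-1))

class DualBranchDeformation(nn.Module):
    """Audio branch predicts the general (speech-driven) deformation;
    motion branch predicts the personalized residual. Both offsets are
    applied to the static Gaussian attribute field."""
    def __init__(self, audio_dim=64, motion_dim=32):
        super().__init__()
        self.audio_branch = DeformationBranch(audio_dim)
        self.motion_branch = DeformationBranch(motion_dim)

    def forward(self, xyz_static, audio_feat, motion_feat):
        d_general = self.audio_branch(audio_feat, xyz_static)
        d_personal = self.motion_branch(motion_feat, xyz_static)
        return xyz_static + d_general + d_personal

def decoupling_loss(d_general, d_personal):
    """Toy stand-in for a similarity-contrastive objective: penalize
    per-Gaussian cosine similarity between the two predicted offsets so
    the branches model different factors of the deformation."""
    cos = nn.functional.cosine_similarity(d_general, d_personal, dim=-1)
    return cos.abs().mean()
```

In this sketch the two branches are summed, so the contrastive-style penalty is what prevents the personalized branch from simply duplicating the audio-driven deformation; the paper applies its similarity contrastive loss during pre-training for the same purpose.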
Related papers
- StableDub: Taming Diffusion Prior for Generalized and Efficient Visual Dubbing [63.72095377128904]
The visual dubbing task aims to generate mouth movements synchronized with the driving audio. Audio-only driving paradigms inadequately capture speaker-specific lip habits. Blind-inpainting approaches produce visual artifacts when handling obstructions.
arXiv Detail & Related papers (2025-09-26T05:23:31Z) - Audio-Driven Universal Gaussian Head Avatars [66.56656075831954]
We introduce the first method for audio-driven universal photorealistic avatar synthesis. It combines a person-agnostic speech model with our novel Universal Head Avatar Prior. Our method is the first general audio-driven avatar model that accounts for detailed appearance modeling and rendering.
arXiv Detail & Related papers (2025-09-23T12:46:43Z) - GGTalker: Talking Head Synthesis with Generalizable Gaussian Priors and Identity-Specific Adaptation [20.17978153568009]
GGTalker synthesizes talking heads through a combination of generalizable priors and identity-specific adaptation. GGTalker achieves state-of-the-art performance in rendering quality, 3D consistency, lip-sync accuracy, and training efficiency.
arXiv Detail & Related papers (2025-06-26T17:37:18Z) - OT-Talk: Animating 3D Talking Head with Optimal Transportation [20.023346831300373]
OT-Talk is the first approach to leverage optimal transportation to optimize the learning model in talking head animation. Building on existing learning frameworks, we utilize a pre-trained Hubert model to extract audio features and a transformer model to process temporal sequences. Our experiments on two public audio-mesh datasets demonstrate that our method outperforms state-of-the-art techniques.
arXiv Detail & Related papers (2025-05-03T21:49:23Z) - Unlock Pose Diversity: Accurate and Efficient Implicit Keypoint-based Spatiotemporal Diffusion for Audio-driven Talking Portrait [30.481914742000697]
We propose KDTalker, the first framework to combine unsupervised implicit 3D keypoints with a spatiotemporal diffusion model. KDTalker adapts to facial information densities, allowing the diffusion process to model diverse head poses and capture fine facial details flexibly.
arXiv Detail & Related papers (2025-03-17T09:18:31Z) - S^3D-NeRF: Single-Shot Speech-Driven Neural Radiance Field for High Fidelity Talking Head Synthesis [14.437741528053504]
We design a Single-Shot Speech-Driven Radiance Field (S^3D-NeRF) method to tackle three difficulties: learning a representative appearance feature for each identity, modeling motion of different face regions with audio, and keeping the temporal consistency of the lip area.
Our S^3D-NeRF surpasses previous arts on both video fidelity and audio-lip synchronization.
arXiv Detail & Related papers (2024-08-18T03:59:57Z) - NeRFFaceSpeech: One-shot Audio-driven 3D Talking Head Synthesis via Generative Prior [5.819784482811377]
We propose NeRFFaceSpeech, a novel method that produces high-quality 3D-aware talking heads.
Our method can craft a 3D-consistent facial feature space corresponding to a single image.
We also introduce LipaintNet that can replenish the lacking information in the inner-mouth area.
arXiv Detail & Related papers (2024-05-09T13:14:06Z) - GSmoothFace: Generalized Smooth Talking Face Generation via Fine Grained 3D Face Guidance [83.43852715997596]
GSmoothFace is a novel two-stage generalized talking face generation model guided by a fine-grained 3D face model.
It can synthesize smooth lip dynamics while preserving the speaker's identity.
Both quantitative and qualitative experiments confirm the superiority of our method in terms of realism, lip synchronization, and visual quality.
arXiv Detail & Related papers (2023-12-12T16:00:55Z) - Audio-driven Talking Face Generation with Stabilized Synchronization Loss [60.01529422759644]
Talking face generation aims to create realistic videos with accurate lip synchronization and high visual quality.
We first tackle the lip leaking problem by introducing a silent-lip generator, which changes the lips of the identity reference to alleviate leakage.
Experiments show that our model outperforms state-of-the-art methods in both visual quality and lip synchronization.
arXiv Detail & Related papers (2023-07-18T15:50:04Z) - Pose-Controllable 3D Facial Animation Synthesis using Hierarchical Audio-Vertex Attention [52.63080543011595]
A novel pose-controllable 3D facial animation synthesis method is proposed by utilizing hierarchical audio-vertex attention.
The proposed method can produce more realistic facial expressions and head posture movements.
arXiv Detail & Related papers (2023-02-24T09:36:31Z) - GeneFace: Generalized and High-Fidelity Audio-Driven 3D Talking Face Synthesis [62.297513028116576]
GeneFace is a general and high-fidelity NeRF-based talking face generation method.
A head-aware torso-NeRF is proposed to eliminate the head-torso problem.
arXiv Detail & Related papers (2023-01-31T05:56:06Z) - DFA-NeRF: Personalized Talking Head Generation via Disentangled Face Attributes Neural Rendering [69.9557427451339]
We propose a framework based on neural radiance fields to pursue high-fidelity talking head generation.
Specifically, the neural radiance field takes lip movement features and personalized attributes as two disentangled conditions.
We show that our method achieves significantly better results than state-of-the-art methods.
arXiv Detail & Related papers (2022-01-03T18:23:38Z) - Pose-Controllable Talking Face Generation by Implicitly Modularized Audio-Visual Representation [96.66010515343106]
We propose a clean yet effective framework to generate pose-controllable talking faces.
We operate on raw face images, using only a single photo as an identity reference.
Our model has multiple advanced capabilities including extreme view robustness and talking face frontalization.
arXiv Detail & Related papers (2021-04-22T15:10:26Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.