MANGO: Natural Multi-speaker 3D Talking Head Generation via 2D-Lifted Enhancement
- URL: http://arxiv.org/abs/2601.01749v1
- Date: Mon, 05 Jan 2026 02:59:49 GMT
- Title: MANGO: Natural Multi-speaker 3D Talking Head Generation via 2D-Lifted Enhancement
- Authors: Lei Zhu, Lijian Lin, Ye Zhu, Jiahao Wu, Xuehan Hou, Yu Li, Yunfei Liu, Jie Chen
- Abstract summary: Existing 3D conversational avatar approaches rely on error-prone pseudo-3D labels that fail to capture fine-grained facial dynamics. We introduce MANGO, a novel two-stage framework that leverages pure image-level supervision via alternate training to mitigate the noise introduced by pseudo-3D labels. Our method achieves exceptional accuracy and realism in modeling two-person 3D dialogue motion, significantly advancing the fidelity and controllability of audio-driven talking heads.
- Score: 26.32210658603041
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Current audio-driven 3D head generation methods mainly focus on single-speaker scenarios and lack natural, bidirectional listen-and-speak interaction. Achieving seamless conversational behavior, in which speaking and listening states transition fluidly, remains a key challenge. Existing 3D conversational avatar approaches rely on error-prone pseudo-3D labels that fail to capture fine-grained facial dynamics. To address these limitations, we introduce MANGO, a novel two-stage framework that leverages pure image-level supervision via alternate training to mitigate the noise introduced by pseudo-3D labels, thereby achieving better alignment with real-world conversational behaviors. Specifically, in the first stage, a diffusion-based transformer with a dual-audio interaction module models natural 3D motion from multi-speaker audio. In the second stage, a fast 3D Gaussian renderer generates high-fidelity images and provides 2D-level photometric supervision for the 3D motions through alternate training. Additionally, we introduce MANGO-Dialog, a high-quality dataset with over 50 hours of aligned 2D-3D conversational data across 500+ identities. Extensive experiments demonstrate that our method achieves exceptional accuracy and realism in modeling two-person 3D dialogue motion, significantly advancing the fidelity and controllability of audio-driven talking heads.
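The alternate-training idea described in the abstract can be illustrated with a minimal sketch. The snippet below is not the authors' implementation: MotionGenerator, GaussianRenderer, all dimensions, and the loss weights are hypothetical stand-ins for the diffusion-based motion transformer and the fast 3D Gaussian renderer. It simply alternates between a (noisy) pseudo-3D label loss and a 2D image-level photometric loss on even and odd steps, which is the core training pattern the abstract describes.

```python
# Minimal sketch (not the paper's code) of stage-two alternate training:
# even steps supervise the predicted 3D motion with pseudo-3D labels,
# odd steps render the motion and supervise it with a 2D photometric loss.
import torch
import torch.nn as nn

class MotionGenerator(nn.Module):          # stand-in for the diffusion-based motion transformer
    def __init__(self, audio_dim=128, motion_dim=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(audio_dim * 2, 256), nn.ReLU(),
                                 nn.Linear(256, motion_dim))

    def forward(self, audio_a, audio_b):   # dual-audio input: speaker and interlocutor features
        return self.net(torch.cat([audio_a, audio_b], dim=-1))

class GaussianRenderer(nn.Module):         # stand-in for the fast 3D Gaussian renderer
    def __init__(self, motion_dim=64, img_pixels=64 * 64 * 3):
        super().__init__()
        self.net = nn.Linear(motion_dim, img_pixels)

    def forward(self, motion):
        return self.net(motion)

gen, renderer = MotionGenerator(), GaussianRenderer()
opt = torch.optim.Adam(list(gen.parameters()) + list(renderer.parameters()), lr=1e-4)

def training_step(step, audio_a, audio_b, pseudo_3d, target_image):
    motion = gen(audio_a, audio_b)
    if step % 2 == 0:
        # 3D step: supervise motion with (possibly noisy) pseudo-3D labels
        loss = nn.functional.mse_loss(motion, pseudo_3d)
    else:
        # 2D step: render the motion and apply image-level photometric supervision
        loss = nn.functional.l1_loss(renderer(motion), target_image)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```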
Related papers
- From Blurry to Believable: Enhancing Low-quality Talking Heads with 3D Generative Priors [49.37666175170832]
We introduce SuperHead, a framework for enhancing low-resolution, animatable 3D head avatars. SuperHead synthesizes high-quality geometry and textures while ensuring both 3D and temporal consistency. Experiments demonstrate that SuperHead generates avatars with fine-grained facial details under dynamic motions.
arXiv Detail & Related papers (2026-02-05T19:00:50Z) - Towards Seamless Interaction: Causal Turn-Level Modeling of Interactive 3D Conversational Head Dynamics [40.86039227407712]
We present TIMAR (Turn-level Interleaved Masked AutoRegression), a causal framework for 3D conversational head generation. It fuses multimodal information within each turn and applies turn-level causal attention to accumulate conversational history. Experiments on the DualTalk benchmark show that TIMAR reduces Fréchet Distance and MSE by 15-30% on the test set.
arXiv Detail & Related papers (2025-12-17T11:37:35Z) - VisualSpeaker: Visually-Guided 3D Avatar Lip Synthesis [70.76837748695841]
We propose VisualSpeaker, a novel method that bridges the gap using photorealistic differentiable rendering, supervised by visual speech recognition, for improved 3D facial animation. Our contribution is a perceptual lip-reading loss, derived by passing 3D Gaussian Splatting avatar renders through a pre-trained Visual Automatic Speech Recognition model during training. Evaluation on the MEAD dataset demonstrates that VisualSpeaker improves the standard Lip Vertex Error metric by 56.1% as well as the perceptual quality of the generated animations, while retaining the controllability of mesh-driven animation.
arXiv Detail & Related papers (2025-07-08T15:04:17Z) - InteractVLM: 3D Interaction Reasoning from 2D Foundational Models [85.76211596755151]
We introduce InteractVLM, a novel method to estimate 3D contact points on human bodies and objects from single in-the-wild images. Existing methods rely on 3D contact annotations collected via expensive motion-capture systems or tedious manual labeling. We propose a new task called Semantic Human Contact estimation, where human contact predictions are conditioned explicitly on object semantics.
arXiv Detail & Related papers (2025-04-07T17:59:33Z) - MMHead: Towards Fine-grained Multi-modal 3D Facial Animation [68.04052669266174]
We construct a large-scale multi-modal 3D facial animation dataset, MMHead.
MMHead consists of 49 hours of 3D facial motion sequences, speech audio, and rich hierarchical text annotations.
Based on the MMHead dataset, we establish benchmarks for two new tasks: text-induced 3D talking head animation and text-to-3D facial motion generation.
arXiv Detail & Related papers (2024-10-10T09:37:01Z) - Investigating the impact of 2D gesture representation on co-speech gesture generation [5.408549711581793]
We evaluate the impact of the dimensionality of the training data, 2D or 3D joint coordinates, on the performance of a multimodal speech-to-gesture deep generative model.
arXiv Detail & Related papers (2024-06-21T12:59:20Z) - NeRFFaceSpeech: One-shot Audio-driven 3D Talking Head Synthesis via Generative Prior [5.819784482811377]
We propose a novel method, NeRFFaceSpeech, which can produce a high-quality 3D-aware talking head.
Our method can craft a 3D-consistent facial feature space corresponding to a single image.
We also introduce LipaintNet, which fills in the missing information in the inner-mouth area.
arXiv Detail & Related papers (2024-05-09T13:14:06Z) - Learn2Talk: 3D Talking Face Learns from 2D Talking Face [15.99315075587735]
We propose a learning framework named Learn2Talk, which can construct a better 3D talking face network.
Inspired by audio-video sync networks, a 3D lip-sync expert model is devised to pursue accurate lip synchronization.
A teacher model selected from 2D talking face methods is used to guide the training of the audio-to-3D motions regression network.
arXiv Detail & Related papers (2024-04-19T13:45:14Z) - EmoVOCA: Speech-Driven Emotional 3D Talking Heads [12.161006152509653]
We propose an innovative data-driven technique for creating a synthetic dataset, called EmoVOCA. We then designed and trained an emotional 3D talking head generator that accepts a 3D face, an audio file, an emotion label, and an intensity value as inputs, and learns to animate the audio-synchronized lip movements with expressive traits of the face.
arXiv Detail & Related papers (2024-03-19T16:33:26Z) - FaceTalk: Audio-Driven Motion Diffusion for Neural Parametric Head Models [85.16273912625022]
We introduce FaceTalk, a novel generative approach designed for synthesizing high-fidelity 3D motion sequences of talking human heads from an audio signal.
To the best of our knowledge, this is the first work to propose a generative approach for realistic and high-quality motion synthesis of human heads.
arXiv Detail & Related papers (2023-12-13T19:01:07Z) - Real-time Neural Radiance Talking Portrait Synthesis via Audio-spatial Decomposition [61.6677901687009]
We propose an efficient NeRF-based framework that enables real-time synthesis of talking portraits.
Our method can generate realistic, audio-lip synchronized talking portrait videos.
arXiv Detail & Related papers (2022-11-22T16:03:11Z)
This list is automatically generated from the titles and abstracts of the papers on this site.