VASA-3D: Lifelike Audio-Driven Gaussian Head Avatars from a Single Image
- URL: http://arxiv.org/abs/2512.14677v1
- Date: Tue, 16 Dec 2025 18:44:00 GMT
- Title: VASA-3D: Lifelike Audio-Driven Gaussian Head Avatars from a Single Image
- Authors: Sicheng Xu, Guojun Chen, Jiaolong Yang, Yizhong Zhang, Yu Deng, Steve Lin, Baining Guo
- Abstract summary: VASA-3D is an audio-driven, single-shot 3D head avatar generator. This research tackles two major challenges: capturing the subtle expression details present in real human faces, and reconstructing an intricate 3D head avatar from a single portrait image.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose VASA-3D, an audio-driven, single-shot 3D head avatar generator. This research tackles two major challenges: capturing the subtle expression details present in real human faces, and reconstructing an intricate 3D head avatar from a single portrait image. To accurately model expression details, VASA-3D leverages the motion latent of VASA-1, a method that yields exceptional realism and vividness in 2D talking heads. A critical element of our work is translating this motion latent to 3D, which is accomplished by devising a 3D head model conditioned on the motion latent. Customization of this model to a single image is achieved through an optimization framework that uses numerous video frames of the reference head synthesized from the input image. The optimization employs training losses that are robust to artifacts and to the limited pose coverage in the generated training data. Our experiments show that VASA-3D produces realistic 3D talking heads that prior art cannot achieve, and it supports online generation of 512x512 free-viewpoint video at up to 75 FPS, enabling more immersive engagement with lifelike 3D avatars.
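The abstract sketches a two-part pipeline: a 3D Gaussian head model whose deformation is conditioned on a 2D motion latent, and a per-subject optimization over video frames synthesized from the single input portrait. The PyTorch sketch below illustrates that structure under stated assumptions; every name in it (MotionConditionedGaussianHead, customize_to_single_image, render_fn, the latent dimension) is a hypothetical stand-in rather than the authors' actual model or API, and the L1 loss is a simple placeholder for the paper's artifact-robust losses.
```python
import torch
import torch.nn as nn

class MotionConditionedGaussianHead(nn.Module):
    """Hypothetical sketch: canonical 3D Gaussians plus a deformation
    network driven by a motion latent (the conditioning idea in the abstract)."""

    def __init__(self, num_gaussians: int = 10_000, latent_dim: int = 512):
        super().__init__()
        # Identity-specific canonical Gaussian parameters, optimized per subject.
        self.means = nn.Parameter(0.1 * torch.randn(num_gaussians, 3))
        self.log_scales = nn.Parameter(torch.zeros(num_gaussians, 3))
        self.colors = nn.Parameter(torch.rand(num_gaussians, 3))
        self.opacity_logits = nn.Parameter(torch.zeros(num_gaussians, 1))
        # Motion latent -> per-Gaussian displacement (unbatched for brevity).
        self.deform = nn.Sequential(
            nn.Linear(latent_dim, 1024),
            nn.ReLU(),
            nn.Linear(1024, num_gaussians * 3),
        )

    def forward(self, motion_latent: torch.Tensor) -> dict:
        # Displace the canonical Gaussians according to the current expression.
        offsets = self.deform(motion_latent).view(-1, 3)
        return {
            "means": self.means + offsets,
            "scales": self.log_scales.exp(),
            "colors": self.colors,
            "opacities": torch.sigmoid(self.opacity_logits),
        }

def customize_to_single_image(model, frames, latents, poses, render_fn, steps=5000):
    """Fit the avatar to one subject using frames synthesized from the input
    portrait (e.g., by a VASA-1-style 2D generator). `render_fn` stands in for
    a differentiable Gaussian splatting renderer."""
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    for step in range(steps):
        i = step % len(frames)
        gaussians = model(latents[i])
        rendered = render_fn(gaussians, poses[i])
        # Placeholder photometric loss; the paper uses losses designed to be
        # robust to artifacts and limited pose coverage in the generated data.
        loss = (rendered - frames[i]).abs().mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return model
```
Keeping the canonical Gaussians as learnable parameters while a deformation network reads the frozen 2D motion latent is one plausible way the 2D-to-3D translation could be wired; the actual architecture in the paper may differ.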
Related papers
- From Blurry to Believable: Enhancing Low-quality Talking Heads with 3D Generative Priors [49.37666175170832]
We introduce SuperHead, a framework for enhancing low-resolution, animatable 3D head avatars. SuperHead synthesizes high-quality geometry and textures while ensuring both 3D and temporal consistency. Experiments demonstrate that SuperHead generates avatars with fine-grained facial details under dynamic motions.
arXiv Detail & Related papers (2026-02-05T19:00:50Z) - Generalizable and Animatable 3D Full-Head Gaussian Avatar from a Single Image [9.505520774467263]
Building 3D animatable head avatars from a single image is an important yet challenging problem. Existing methods generally collapse under large camera pose variations, compromising the realism of 3D avatars. We propose a new framework to tackle the novel setting of one-shot 3D full-head animatable avatar reconstruction in a single feed-forward pass.
arXiv Detail & Related papers (2026-01-19T06:56:58Z) - Avat3r: Large Animatable Gaussian Reconstruction Model for High-fidelity 3D Head Avatars [60.0866477932976]
We present Avat3r, which regresses a high-quality and animatable 3D head avatar from just a few input images. We make Large Reconstruction Models animatable and learn a powerful prior over 3D human heads from a large multi-view video dataset. We increase robustness by feeding input images with different expressions to our model during training, enabling the reconstruction of 3D head avatars from inconsistent inputs.
arXiv Detail & Related papers (2025-02-27T16:00:11Z) - Real3D-Portrait: One-shot Realistic 3D Talking Portrait Synthesis [88.17520303867099]
One-shot 3D talking portrait generation aims to reconstruct a 3D avatar from an unseen image, and then animate it with a reference video or audio.
We present Real3D-Portrait, a framework that improves one-shot 3D reconstruction with a large image-to-plane model.
Experiments show that Real3D-Portrait generalizes well to unseen identities and generates more realistic talking portrait videos.
arXiv Detail & Related papers (2024-01-16T17:04:30Z) - Articulated 3D Head Avatar Generation using Text-to-Image Diffusion Models [107.84324544272481]
The ability to generate diverse 3D articulated head avatars is vital to a plethora of applications, including augmented reality, cinematography, and education.
Recent work on text-guided 3D object generation has shown great promise in addressing these needs.
We show that our diffusion-based articulated head avatars outperform state-of-the-art approaches for this task.
arXiv Detail & Related papers (2023-07-10T19:15:32Z) - Dynamic Neural Portraits [58.480811535222834]
We present Dynamic Neural Portraits, a novel approach to the problem of full-head reenactment.
Our method generates photo-realistic video portraits by explicitly controlling head pose, facial expressions and eye gaze.
Our experiments demonstrate that the proposed method is 270 times faster than recent NeRF-based reenactment methods.
arXiv Detail & Related papers (2022-11-25T10:06:14Z) - DRaCoN -- Differentiable Rasterization Conditioned Neural Radiance Fields for Articulated Avatars [92.37436369781692]
We present DRaCoN, a framework for learning full-body volumetric avatars.
It exploits the advantages of both the 2D and 3D neural rendering techniques.
Experiments on the challenging ZJU-MoCap and Human3.6M datasets indicate that DRaCoN outperforms state-of-the-art methods.
arXiv Detail & Related papers (2022-03-29T17:59:15Z)
This list is automatically generated from the titles and abstracts of the papers on this site.