Real-time Neural Radiance Talking Portrait Synthesis via Audio-spatial
Decomposition
- URL: http://arxiv.org/abs/2211.12368v1
- Date: Tue, 22 Nov 2022 16:03:11 GMT
- Title: Real-time Neural Radiance Talking Portrait Synthesis via Audio-spatial
Decomposition
- Authors: Jiaxiang Tang, Kaisiyuan Wang, Hang Zhou, Xiaokang Chen, Dongliang He,
Tianshu Hu, Jingtuo Liu, Gang Zeng, Jingdong Wang
- Abstract summary: We propose an efficient NeRF-based framework that enables real-time synthesis of talking portraits.
Our method can generate realistic, audio-lip-synchronized talking portrait videos.
- Score: 61.6677901687009
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While dynamic Neural Radiance Fields (NeRF) have shown success in
high-fidelity 3D modeling of talking portraits, the slow training and inference
speed severely obstruct their potential usage. In this paper, we propose an
efficient NeRF-based framework that enables real-time synthesis of talking
portraits and faster convergence by leveraging the recent success of grid-based
NeRF. Our key insight is to decompose the inherently high-dimensional talking
portrait representation into three low-dimensional feature grids. Specifically,
a Decomposed Audio-spatial Encoding Module models the dynamic head with a 3D
spatial grid and a 2D audio grid. The torso is handled with another 2D grid in
a lightweight Pseudo-3D Deformable Module. Both modules are designed for
efficiency without sacrificing rendering quality. Extensive experiments
demonstrate that our method can generate realistic, audio-lip-synchronized
talking portrait videos, while also being highly efficient compared to previous
methods.
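For intuition, below is a minimal, self-contained sketch (not the authors' released code) of the decomposition idea: a 3D feature grid queried at spatial coordinates, a 2D feature grid queried at a learned low-dimensional audio coordinate, and a small MLP that fuses the two features into density and color. Dense grids with grid_sample stand in for the hash-grid encoders of grid-based NeRF, all module and dimension names are illustrative assumptions, and the torso branch (the Pseudo-3D Deformable Module) is omitted for brevity.

```python
# Illustrative sketch of a decomposed audio-spatial encoder (assumed names/dims).
import torch
import torch.nn as nn
import torch.nn.functional as F

class DecomposedAudioSpatialEncoder(nn.Module):
    def __init__(self, spatial_res=64, audio_res=32, feat_dim=8, audio_dim=64):
        super().__init__()
        # 3D spatial feature grid, shape (1, C, D, H, W)
        self.spatial_grid = nn.Parameter(
            torch.zeros(1, feat_dim, spatial_res, spatial_res, spatial_res))
        # 2D audio feature grid, indexed by a learned low-dimensional audio coordinate
        self.audio_grid = nn.Parameter(
            torch.zeros(1, feat_dim, audio_res, audio_res))
        # compress a high-dimensional per-frame audio feature to a 2D coordinate
        self.audio_to_coord = nn.Linear(audio_dim, 2)
        # small MLP head predicting density and color from the fused grid features
        self.head = nn.Sequential(
            nn.Linear(2 * feat_dim, 64), nn.ReLU(), nn.Linear(64, 4))

    def forward(self, xyz, audio_feat):
        # xyz: (N, 3) points in [-1, 1]; audio_feat: (N, audio_dim)
        n = xyz.shape[0]
        # trilinear lookup in the 3D spatial grid
        g3 = xyz.view(1, n, 1, 1, 3)
        f_spatial = F.grid_sample(self.spatial_grid, g3, align_corners=True)
        f_spatial = f_spatial.view(-1, n).t()              # (N, C)
        # map audio to a 2D coordinate in [-1, 1] and do a bilinear lookup
        uv = torch.tanh(self.audio_to_coord(audio_feat))   # (N, 2)
        g2 = uv.view(1, n, 1, 2)
        f_audio = F.grid_sample(self.audio_grid, g2, align_corners=True)
        f_audio = f_audio.view(-1, n).t()                  # (N, C)
        # fuse the two low-dimensional features and decode to (sigma, rgb)
        out = self.head(torch.cat([f_spatial, f_audio], dim=-1))
        sigma, rgb = out[:, :1], torch.sigmoid(out[:, 1:])
        return sigma, rgb
```

The point of the decomposition is that spatial and audio coordinates are looked up in separate low-dimensional grids rather than one high-dimensional joint representation, which keeps the feature tables small and each query cheap.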
Related papers
- NLDF: Neural Light Dynamic Fields for Efficient 3D Talking Head Generation [0.0]
A novel Neural Light Dynamic Fields model is proposed, aimed at generating high-quality 3D talking faces with a significant speedup.
The NLDF represents light fields based on light segments, and a deep network is used to learn the entire light beam's information at once.
The proposed method effectively represents facial light dynamics in 3D talking-video generation, and it achieves approximately 30 times faster speed than the state-of-the-art NeRF-based method.
arXiv Detail & Related papers (2024-06-17T06:53:37Z) - NeRFFaceSpeech: One-shot Audio-driven 3D Talking Head Synthesis via Generative Prior [5.819784482811377]
We propose a novel method, NeRFFaceSpeech, which enables producing high-quality 3D-aware talking heads.
Our method can craft a 3D-consistent facial feature space corresponding to a single image.
We also introduce LipaintNet, which can fill in the missing information in the inner-mouth area.
arXiv Detail & Related papers (2024-05-09T13:14:06Z) - Talk3D: High-Fidelity Talking Portrait Synthesis via Personalized 3D Generative Prior [29.120669908374424]
We introduce a novel audio-driven talking head synthesis framework, called Talk3D.
It faithfully reconstructs plausible facial geometry by effectively adopting a pre-trained 3D-aware generative prior.
Compared to existing methods, our method excels in generating realistic facial geometries even under extreme head poses.
arXiv Detail & Related papers (2024-03-29T12:49:40Z) - Efficient Region-Aware Neural Radiance Fields for High-Fidelity Talking
Portrait Synthesis [20.111316792226482]
ER-NeRF is a conditional Neural Radiance Fields (NeRF)-based architecture for talking portrait synthesis.
Our method renders high-fidelity, audio-lip-synchronized talking portraits with realistic details and higher efficiency than previous methods.
arXiv Detail & Related papers (2023-07-18T15:07:39Z) - OD-NeRF: Efficient Training of On-the-Fly Dynamic Neural Radiance Fields [63.04781030984006]
Dynamic neural radiance fields (dynamic NeRFs) have demonstrated impressive results in novel view synthesis on 3D dynamic scenes.
We propose OD-NeRF, which efficiently trains and renders dynamic NeRFs on the fly and is capable of streaming the dynamic scene.
Our algorithm achieves an interactive speed of 6 FPS for on-the-fly training and rendering on synthetic dynamic scenes, and a significant speed-up over the state of the art on real-world dynamic scenes.
arXiv Detail & Related papers (2023-05-24T07:36:47Z) - Magic3D: High-Resolution Text-to-3D Content Creation [78.40092800817311]
DreamFusion has recently demonstrated the utility of a pre-trained text-to-image diffusion model to optimize Neural Radiance Fields (NeRF), but it suffers from slow optimization and low-resolution image supervision.
In this paper, we address these limitations by utilizing a two-stage optimization framework.
Our method, dubbed Magic3D, can create high-quality 3D mesh models in 40 minutes, which is 2x faster than DreamFusion.
arXiv Detail & Related papers (2022-11-18T18:59:59Z) - NeRFPlayer: A Streamable Dynamic Scene Representation with Decomposed
Neural Radiance Fields [99.57774680640581]
We present an efficient framework capable of fast reconstruction, compact modeling, and streamable rendering.
We propose to decompose the 4D space according to temporal characteristics. Points in the 4D space are associated with probabilities of belonging to three categories: static, deforming, and new areas (a rough sketch of this per-point decomposition appears after this list).
arXiv Detail & Related papers (2022-10-28T07:11:05Z) - Neural Deformable Voxel Grid for Fast Optimization of Dynamic View
Synthesis [63.25919018001152]
We propose a fast deformable radiance field method to handle dynamic scenes.
Our method achieves performance comparable to D-NeRF with only 20 minutes of training.
arXiv Detail & Related papers (2022-06-15T17:49:08Z) - Semantic-Aware Implicit Neural Audio-Driven Video Portrait Generation [61.8546794105462]
We propose Semantic-aware Speaking Portrait NeRF (SSP-NeRF), which creates delicate audio-driven portraits using a single unified NeRF.
We first propose a Semantic-Aware Dynamic Ray Sampling module with an additional parsing branch that facilitates audio-driven volume rendering.
To enable portrait rendering in one unified neural radiance field, a Torso Deformation module is designed to stabilize the large-scale non-rigid torso motions.
arXiv Detail & Related papers (2022-01-19T18:54:41Z) - 3D-aware Image Synthesis via Learning Structural and Textural
Representations [39.681030539374994]
We propose VolumeGAN for high-fidelity 3D-aware image synthesis, which explicitly learns a structural representation and a textural representation.
Our approach achieves substantially higher image quality and better 3D control than previous methods.
arXiv Detail & Related papers (2021-12-20T18:59:40Z)
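The NeRFPlayer decomposition mentioned above can be pictured with a small, purely illustrative sketch (an assumption, not NeRFPlayer's released code): a decomposition field maps each 4D spatio-temporal point to probabilities over the static / deforming / new categories, which could then weight per-category radiance branches.

```python
# Illustrative sketch (assumption): a small MLP assigns each 4D point (x, y, z, t)
# probabilities over the three categories named in the NeRFPlayer summary.
import torch
import torch.nn as nn

class DecompositionField(nn.Module):
    def __init__(self, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(4, hidden), nn.ReLU(), nn.Linear(hidden, 3))

    def forward(self, xyzt):
        # xyzt: (N, 4) spatio-temporal samples -> (N, 3) category probabilities
        return torch.softmax(self.net(xyzt), dim=-1)

# usage: probabilities could blend the outputs of static / deforming / new branches
probs = DecompositionField()(torch.rand(1024, 4))  # (1024, 3), rows sum to 1
```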
This list is automatically generated from the titles and abstracts of the papers on this site.