ReliTalk: Relightable Talking Portrait Generation from a Single Video
- URL: http://arxiv.org/abs/2309.02434v1
- Date: Tue, 5 Sep 2023 17:59:42 GMT
- Title: ReliTalk: Relightable Talking Portrait Generation from a Single Video
- Authors: Haonan Qiu, Zhaoxi Chen, Yuming Jiang, Hang Zhou, Xiangyu Fan, Lei
Yang, Wayne Wu and Ziwei Liu
- Abstract summary: ReliTalk is a novel framework for relightable audio-driven talking portrait generation from monocular videos.
Our key insight is to decompose the portrait's reflectance from implicitly learned audio-driven facial normals and images.
- Score: 62.47116237654984
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent years have witnessed great progress in creating vivid audio-driven
portraits from monocular videos. However, how to seamlessly adapt the created
video avatars to other scenarios with different backgrounds and lighting
conditions remains unsolved. On the other hand, existing relighting studies
mostly rely on dynamically lit or multi-view data, which are too expensive
for creating video portraits. To bridge this gap, we propose ReliTalk, a novel
framework for relightable audio-driven talking portrait generation from
monocular videos. Our key insight is to decompose the portrait's reflectance
from implicitly learned audio-driven facial normals and images. Specifically,
we involve 3D facial priors derived from audio features to predict delicate
normal maps through implicit functions. These initially predicted normals then
play a crucial role in reflectance decomposition by dynamically estimating the
lighting condition of the given video. Moreover, the stereoscopic face
representation is refined with an identity-consistent loss under multiple simulated lighting conditions, addressing the ill-posed problem caused by
limited views available from a single monocular video. Extensive experiments
validate the superiority of our proposed framework on both real and synthetic
datasets. Our code is released at https://github.com/arthur-qiu/ReliTalk.
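The decomposition the abstract describes matches the common Lambertian image-formation model used in single-image face relighting, where a frame is approximated as albedo modulated by a shading map computed from surface normals and a low-order spherical-harmonics (SH) lighting estimate. The snippet below is a minimal sketch of that model only; the function names, the second-order SH parameterization, and the Lambertian assumption are illustrative and are not taken from the ReliTalk code.

```python
import numpy as np

def sh_shading(normals, sh_coeffs):
    """Shading from unit normals under second-order spherical-harmonics lighting.

    normals:   (H, W, 3) unit surface normals.
    sh_coeffs: (9,) lighting coefficients (basis constants folded in).
    Returns a (H, W) shading map.
    """
    x, y, z = normals[..., 0], normals[..., 1], normals[..., 2]
    basis = np.stack([
        np.ones_like(x),         # Y_0,0
        y, z, x,                 # Y_1,-1  Y_1,0  Y_1,1
        x * y, y * z,            # Y_2,-2  Y_2,-1
        3.0 * z ** 2 - 1.0,      # Y_2,0
        x * z, x ** 2 - y ** 2,  # Y_2,1   Y_2,2
    ], axis=-1)
    return basis @ sh_coeffs

def render(albedo, normals, sh_coeffs):
    """Lambertian image formation: per-pixel albedo times SH shading."""
    shading = sh_shading(normals, sh_coeffs)
    return albedo * shading[..., None]
```

Under such a model, relighting keeps the recovered albedo and normals fixed and swaps in new SH coefficients, and the identity-consistent loss mentioned in the abstract can be read as rendering the same face under several sampled lighting vectors and penalizing identity drift across the renders.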
Related papers
- Speech2Lip: High-fidelity Speech to Lip Generation by Learning from a
Short Video [91.92782707888618]
We present a decomposition-composition framework named Speech to Lip (Speech2Lip) that disentangles speech-sensitive and speech-insensitive motion/appearance.
We show that our model can be trained by a video of just a few minutes in length and achieve state-of-the-art performance in both visual quality and speech-visual synchronization.
arXiv Detail & Related papers (2023-09-09T14:52:39Z) - Identity-Preserving Talking Face Generation with Landmark and Appearance
Priors [106.79923577700345]
Existing person-generic methods have difficulty in generating realistic and lip-synced videos.
We propose a two-stage framework consisting of audio-to-landmark generation and landmark-to-video rendering procedures.
Our method can produce more realistic, lip-synced, and identity-preserving videos than existing person-generic talking face generation methods.
arXiv Detail & Related papers (2023-05-15T01:31:32Z) - LipSync3D: Data-Efficient Learning of Personalized 3D Talking Faces from
Video using Pose and Lighting Normalization [4.43316916502814]
We present a video-based learning framework for animating personalized 3D talking faces from audio.
We introduce two training-time data normalizations that significantly improve data sample efficiency.
Our method outperforms contemporary state-of-the-art audio-driven video reenactment methods in terms of realism, lip-sync and visual quality scores.
arXiv Detail & Related papers (2021-06-08T08:56:40Z) - Pose-Controllable Talking Face Generation by Implicitly Modularized
Audio-Visual Representation [96.66010515343106]
We propose a clean yet effective framework to generate pose-controllable talking faces.
We operate on raw face images, using only a single photo as an identity reference.
Our model has multiple advanced capabilities including extreme view robustness and talking face frontalization.
arXiv Detail & Related papers (2021-04-22T15:10:26Z) - Learning to Predict Salient Faces: A Novel Visual-Audio Saliency Model [96.24038430433885]
We propose a novel multi-modal video saliency model consisting of three branches: visual, audio and face.
Experimental results show that the proposed method outperforms 11 state-of-the-art saliency prediction works.
arXiv Detail & Related papers (2021-03-29T09:09:39Z) - Relightable 3D Head Portraits from a Smartphone Video [15.639140551193073]
We present a system for creating a relightable 3D portrait of a human head.
Our neural pipeline operates on a sequence of frames captured by a smartphone camera with the flash blinking.
A deep rendering network is trained to regress dense albedo, normals, and environmental lighting maps for arbitrary new viewpoints.
arXiv Detail & Related papers (2020-12-17T22:49:02Z) - Everybody's Talkin': Let Me Talk as You Want [134.65914135774605]
We present a method to edit a target portrait footage by taking a sequence of audio as input to synthesize a photo-realistic video.
It does not assume a person-specific rendering network, yet it is capable of translating arbitrary source audio into arbitrary video output.
arXiv Detail & Related papers (2020-01-15T09:54:23Z)
This list is automatically generated from the titles and abstracts of the papers on this site.