LipSync3D: Data-Efficient Learning of Personalized 3D Talking Faces from
Video using Pose and Lighting Normalization
- URL: http://arxiv.org/abs/2106.04185v1
- Date: Tue, 8 Jun 2021 08:56:40 GMT
- Title: LipSync3D: Data-Efficient Learning of Personalized 3D Talking Faces from
Video using Pose and Lighting Normalization
- Authors: Avisek Lahiri, Vivek Kwatra, Christian Frueh, John Lewis, Chris
Bregler
- Abstract summary: We present a video-based learning framework for animating personalized 3D talking faces from audio.
We introduce two training-time data normalizations that significantly improve data sample efficiency.
Our method outperforms contemporary state-of-the-art audio-driven video reenactment benchmarks in terms of realism, lip-sync and visual quality scores.
- Score: 4.43316916502814
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: In this paper, we present a video-based learning framework for animating
personalized 3D talking faces from audio. We introduce two training-time data
normalizations that significantly improve data sample efficiency. First, we
isolate and represent faces in a normalized space that decouples 3D geometry,
head pose, and texture. This decomposes the prediction problem into regressions
over the 3D face shape and the corresponding 2D texture atlas. Second, we
leverage facial symmetry and approximate albedo constancy of skin to isolate
and remove spatio-temporal lighting variations. Together, these normalizations
allow simple networks to generate high fidelity lip-sync videos under novel
ambient illumination while training with just a single speaker-specific video.
Further, to stabilize temporal dynamics, we introduce an auto-regressive
approach that conditions the model on its previous visual state. Human ratings
and objective metrics demonstrate that our method outperforms contemporary
state-of-the-art audio-driven video reenactment benchmarks in terms of realism,
lip-sync and visual quality scores. We illustrate several applications enabled
by our framework.
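The abstract describes these normalizations only at a high level. As a rough illustration of the symmetry-based lighting normalization, not the authors' implementation, the NumPy sketch below estimates a smooth lighting field on a pose-normalized texture atlas by comparing the atlas to its mirror image and then divides that field out; the registration assumption, blur radius, and clipping are illustrative choices.

```python
# Minimal sketch of symmetry-based lighting normalization on a pose-normalized
# 2D texture atlas (illustrative; not the paper's code). Assumes `atlas` is an
# HxWx3 float image in [0, 1], registered so that a left-right flip maps each
# facial point to its mirror-symmetric counterpart, and that skin albedo is
# approximately constant (per the abstract).
import numpy as np
from scipy.ndimage import gaussian_filter


def estimate_lighting(atlas: np.ndarray, sigma: float = 15.0) -> np.ndarray:
    """Estimate a smooth lighting field from left/right asymmetries."""
    mirrored = atlas[:, ::-1, :]              # flip across the face midline
    symmetric = 0.5 * (atlas + mirrored)      # symmetry-consistent reference
    ratio = atlas / np.clip(symmetric, 1e-3, None)
    # Lighting is assumed to vary smoothly, so keep only low frequencies.
    return np.stack(
        [gaussian_filter(ratio[..., c], sigma) for c in range(3)], axis=-1
    )


def normalize_lighting(atlas: np.ndarray) -> np.ndarray:
    """Divide out the estimated lighting to approximate an albedo-only atlas."""
    lighting = estimate_lighting(atlas)
    return np.clip(atlas / np.clip(lighting, 1e-3, None), 0.0, 1.0)


if __name__ == "__main__":
    frame_atlas = np.random.rand(256, 256, 3).astype(np.float32)  # stand-in for a real atlas
    albedo_like = normalize_lighting(frame_atlas)
    print(albedo_like.shape)  # (256, 256, 3)
```

Training the texture regressor on such a normalized atlas and re-applying a (possibly different) lighting field at synthesis time is consistent with the abstract's claim of generating videos under novel ambient illumination.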
Related papers
- Real-time 3D-aware Portrait Video Relighting [89.41078798641732]
We present the first real-time 3D-aware method for relighting in-the-wild videos of talking faces, based on Neural Radiance Fields (NeRF).
For each video frame, fast dual encoders infer an albedo tri-plane as well as a shading tri-plane conditioned on the desired lighting.
Our method runs at 32.98 fps on consumer-level hardware and achieves state-of-the-art results in terms of reconstruction quality, lighting error, lighting instability, temporal consistency and inference speed.
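As a toy illustration of the albedo/shading factorization this summary refers to (the tri-plane representation, NeRF rendering, and dual encoders are abstracted away; everything named here is an assumption for exposition):

```python
# Toy sketch: a relit frame as the per-pixel product of an albedo image and a
# shading image rendered for the target lighting (illustrative, not the paper's
# pipeline; the inputs stand in for the rendered albedo and shading tri-planes).
import numpy as np


def relight(albedo: np.ndarray, shading: np.ndarray) -> np.ndarray:
    """Compose a relit frame from albedo and shading (both HxWx3, in [0, 1])."""
    return np.clip(albedo * shading, 0.0, 1.0)


albedo = np.random.rand(64, 64, 3)    # stand-in for the rendered albedo
shading = np.random.rand(64, 64, 3)   # stand-in for shading under the target lighting
relit_frame = relight(albedo, shading)
```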
arXiv Detail & Related papers (2024-10-24T01:34:11Z)
- SAiD: Speech-driven Blendshape Facial Animation with Diffusion [6.4271091365094515]
Speech-driven 3D facial animation is challenging due to the scarcity of large-scale visual-audio datasets.
We propose SAiD, a speech-driven 3D facial animation method based on a diffusion model: a lightweight Transformer-based U-Net with a cross-modality alignment bias between audio and visual features to enhance lip synchronization.
arXiv Detail & Related papers (2023-12-25T04:40:32Z)
- GSmoothFace: Generalized Smooth Talking Face Generation via Fine Grained 3D Face Guidance [83.43852715997596]
GSmoothFace is a novel two-stage generalized talking face generation model guided by a fine-grained 3D face model.
It can synthesize smooth lip dynamics while preserving the speaker's identity.
Both quantitative and qualitative experiments confirm the superiority of our method in terms of realism, lip synchronization, and visual quality.
arXiv Detail & Related papers (2023-12-12T16:00:55Z)
- ReliTalk: Relightable Talking Portrait Generation from a Single Video [62.47116237654984]
ReliTalk is a novel framework for relightable audio-driven talking portrait generation from monocular videos.
Our key insight is to decompose the portrait's reflectance from implicitly learned audio-driven facial normals and images.
arXiv Detail & Related papers (2023-09-05T17:59:42Z)
- Audio-Driven 3D Facial Animation from In-the-Wild Videos [16.76533748243908]
Given an arbitrary audio clip, audio-driven 3D facial animation aims to generate lifelike lip motions and facial expressions for a 3D head.
Existing methods typically rely on training their models using limited public 3D datasets that contain a restricted number of audio-3D scan pairs.
We propose a novel method that leverages in-the-wild 2D talking-head videos to train our 3D facial animation model.
arXiv Detail & Related papers (2023-06-20T13:53:05Z)
- GeneFace++: Generalized and Stable Real-Time Audio-Driven 3D Talking Face Generation [71.73912454164834]
A modern talking face generation method is expected to achieve the goals of generalized audio-lip synchronization, good video quality, and high system efficiency.
NeRF has become a popular technique in this field since it could achieve high-fidelity and 3D-consistent talking face generation with a few-minute-long training video.
We propose GeneFace++ to handle these challenges by utilizing the pitch contour as an auxiliary feature and introducing a temporal loss in the facial motion prediction process.
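A minimal sketch of what such a temporal loss on predicted facial motion can look like; the motion representation (68 3D landmarks) and the plain finite-difference penalty are illustrative assumptions, not the GeneFace++ implementation:

```python
# Minimal sketch of a temporal loss that penalizes frame-to-frame jitter in
# predicted facial motion (illustrative assumption, not the GeneFace++ code).
import torch


def temporal_loss(motion: torch.Tensor) -> torch.Tensor:
    """L2 penalty on finite differences between consecutive frames of a (T, D) motion sequence."""
    return ((motion[1:] - motion[:-1]) ** 2).mean()


pred = torch.randn(100, 68 * 3, requires_grad=True)  # hypothetical: 68 3D landmarks over 100 frames
loss = temporal_loss(pred)
loss.backward()  # gradients push the predictions toward temporal smoothness
```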
arXiv Detail & Related papers (2023-05-01T12:24:09Z)
- LiP-Flow: Learning Inference-time Priors for Codec Avatars via Normalizing Flows in Latent Space [90.74976459491303]
We introduce a prior model that is conditioned on the runtime inputs and tie this prior space to the 3D face model via a normalizing flow in the latent space.
A normalizing flow bridges the two representation spaces and transforms latent samples from one domain to another, allowing us to define a latent likelihood objective.
We show that our approach leads to an expressive and effective prior, capturing facial dynamics and subtle expressions better.
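A minimal sketch of the latent-likelihood objective, assuming a single RealNVP-style affine coupling layer as the normalizing flow and a diagonal-Gaussian prior predicted from runtime inputs; both choices are illustrative assumptions, not the LiP-Flow architecture:

```python
# Minimal sketch (illustrative, not the LiP-Flow implementation): an invertible
# coupling layer maps a face-model latent into the space of a runtime-conditioned
# Gaussian prior, and the objective is the change-of-variables log-likelihood.
import math
import torch
import torch.nn as nn


class AffineCoupling(nn.Module):
    """One RealNVP-style coupling layer: invertible, with a tractable log-det."""

    def __init__(self, dim: int):
        super().__init__()
        self.half = dim // 2
        self.net = nn.Sequential(
            nn.Linear(self.half, 64), nn.ReLU(),
            nn.Linear(64, 2 * (dim - self.half)),
        )

    def forward(self, z):
        z1, z2 = z[:, :self.half], z[:, self.half:]
        log_s, t = self.net(z1).chunk(2, dim=-1)
        log_s = torch.tanh(log_s)                      # keep scales well conditioned
        y2 = z2 * torch.exp(log_s) + t
        return torch.cat([z1, y2], dim=-1), log_s.sum(dim=-1)


def latent_log_likelihood(flow, z_face, prior_mu, prior_log_var):
    """log p(z_face) = log N(flow(z_face); prior) + log|det d flow / d z_face|."""
    y, log_det = flow(z_face)
    log_prob = -0.5 * (
        (y - prior_mu) ** 2 * torch.exp(-prior_log_var)
        + prior_log_var + math.log(2 * math.pi)
    ).sum(dim=-1)
    return log_prob + log_det


flow = AffineCoupling(dim=32)
z_face = torch.randn(8, 32)                            # latents from a hypothetical face-model encoder
mu, log_var = torch.zeros(8, 32), torch.zeros(8, 32)   # prior predicted from runtime inputs
nll = -latent_log_likelihood(flow, z_face, mu, log_var).mean()
```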
arXiv Detail & Related papers (2022-03-15T13:22:57Z)
- FaceFormer: Speech-Driven 3D Facial Animation with Transformers [46.8780140220063]
Speech-driven 3D facial animation is challenging due to the complex geometry of human faces and the limited availability of 3D audio-visual data.
We propose a Transformer-based autoregressive model, FaceFormer, which encodes the long-term audio context and autoregressively predicts a sequence of animated 3D face meshes.
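A minimal sketch of such a Transformer-based autoregressive decoder; the layer sizes, the learned start token, and the 5023-vertex mesh dimension are illustrative stand-ins, not the FaceFormer implementation:

```python
# Minimal sketch (illustrative, not the FaceFormer code): audio features serve
# as encoder memory, and each 3D face frame is predicted conditioned on all
# previously generated frames.
import torch
import torch.nn as nn


class TinyFaceDecoder(nn.Module):
    def __init__(self, audio_dim=128, vertex_dim=5023 * 3, d_model=128):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, d_model)
        self.face_proj = nn.Linear(vertex_dim, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, vertex_dim)
        self.start = nn.Parameter(torch.zeros(1, 1, d_model))  # learned "start frame" token

    @torch.no_grad()
    def generate(self, audio_feats):                  # audio_feats: (1, T, audio_dim)
        memory = self.audio_proj(audio_feats)         # (1, T, d_model)
        tokens = self.start                           # grows as frames are generated
        frames = []
        for _ in range(audio_feats.shape[1]):
            hidden = self.decoder(tokens, memory)     # attend to audio and past frames
            frame = self.head(hidden[:, -1:])         # next frame's vertex positions
            frames.append(frame)
            tokens = torch.cat([tokens, self.face_proj(frame)], dim=1)
        return torch.cat(frames, dim=1)               # (1, T, vertex_dim)


model = TinyFaceDecoder()
audio = torch.randn(1, 30, 128)       # 30 frames of hypothetical audio features
meshes = model.generate(audio)        # (1, 30, 5023 * 3)
```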
arXiv Detail & Related papers (2021-12-10T04:21:59Z)
- Audio- and Gaze-driven Facial Animation of Codec Avatars [149.0094713268313]
We describe the first approach to animate Codec Avatars in real-time using audio and/or eye tracking.
Our goal is to display expressive conversations between individuals that exhibit important social signals.
arXiv Detail & Related papers (2020-08-11T22:28:48Z)