StableFace: Analyzing and Improving Motion Stability for Talking Face
Generation
- URL: http://arxiv.org/abs/2208.13717v1
- Date: Mon, 29 Aug 2022 16:56:35 GMT
- Title: StableFace: Analyzing and Improving Motion Stability for Talking Face
Generation
- Authors: Jun Ling, Xu Tan, Liyang Chen, Runnan Li, Yuchao Zhang, Sheng Zhao, Li
Song
- Abstract summary: We study the motion jittering problem based on a state-of-the-art pipeline that uses 3D face representations to bridge the input audio and output video.
We find that several issues can lead to jitters in synthesized talking face video.
- Score: 38.25025849434312
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While previous speech-driven talking face generation methods have made
significant progress in improving the visual quality and lip-sync quality of
the synthesized videos, they pay less attention to lip motion jitters which
greatly undermine the realism of talking face videos. What causes motion
jitters, and how to mitigate the problem? In this paper, we conduct systematic
analyses on the motion jittering problem based on a state-of-the-art pipeline
that uses 3D face representations to bridge the input audio and output video,
and improve the motion stability with a series of effective designs. We find
that several issues can lead to jitters in synthesized talking face video: 1)
jitters from the input 3D face representations; 2) training-inference mismatch;
3) lack of dependency modeling among video frames. Accordingly, we propose
three effective solutions to address these issues: 1) we propose a Gaussian-based
adaptive smoothing module that smooths the 3D face representations to eliminate
jitters in the input; 2) we apply augmented erosions to the input data of the
neural renderer during training to simulate inference-time distortion and reduce the
mismatch; 3) we develop an audio-fused transformer generator to model
dependency among video frames. Besides, considering there is no off-the-shelf
metric for measuring motion jitters in talking face video, we devise an
objective metric, the Motion Stability Index (MSI), to quantitatively measure
motion jitters by calculating the reciprocal of the variance of acceleration.
Extensive experimental results show the superiority of our method in
motion-stable talking face video generation, with better quality than previous systems.
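As a rough, hypothetical sketch (not the authors' released code), the snippet below illustrates two ideas from the abstract: Gaussian smoothing of per-frame 3D face representations along the time axis, and a Motion Stability Index computed as the reciprocal of the variance of frame-wise acceleration. The choice of input signal (tracked landmark trajectories for MSI), the kernel width, and the epsilon term are assumptions rather than details taken from the paper.

```python
# Hypothetical sketch of two ideas from the StableFace abstract; parameter choices
# (sigma, eps) and the use of landmark trajectories for MSI are assumptions.
import numpy as np
from scipy.ndimage import gaussian_filter1d


def gaussian_smooth_coeffs(coeffs: np.ndarray, sigma: float = 1.0) -> np.ndarray:
    """Smooth per-frame 3D face coefficients of shape (num_frames, num_coeffs) along time."""
    return gaussian_filter1d(coeffs, sigma=sigma, axis=0)


def motion_stability_index(traj: np.ndarray, eps: float = 1e-8) -> float:
    """Reciprocal of the variance of acceleration for a trajectory of shape
    (num_frames, num_points, dims); higher MSI means less jitter."""
    velocity = np.diff(traj, axis=0)          # first-order temporal difference
    acceleration = np.diff(velocity, axis=0)  # second-order temporal difference
    return float(1.0 / (acceleration.var() + eps))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    t = np.linspace(0, 2 * np.pi, 100)
    smooth = np.stack([np.sin(t), np.cos(t)], axis=-1)[:, None, :]  # (100, 1, 2)
    jittery = smooth + 0.05 * rng.standard_normal(smooth.shape)
    print("MSI (smooth): ", motion_stability_index(smooth))
    print("MSI (jittery):", motion_stability_index(jittery))
```

In this toy check, adding random jitter to an otherwise smooth trajectory sharply lowers the MSI, which matches the intended use of the metric as a motion-stability score.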
Related papers
- SAiD: Speech-driven Blendshape Facial Animation with Diffusion [6.4271091365094515]
Speech-driven 3D facial animation is challenging due to the scarcity of large-scale visual-audio datasets.
We propose SAiD, a speech-driven 3D facial animation approach using a diffusion model: a lightweight Transformer-based U-Net with a cross-modality alignment bias between audio and visual features to enhance lip synchronization.
arXiv Detail & Related papers (2023-12-25T04:40:32Z) - GSmoothFace: Generalized Smooth Talking Face Generation via Fine Grained
3D Face Guidance [83.43852715997596]
GSmoothFace is a novel two-stage generalized talking face generation model guided by a fine-grained 3D face model.
It can synthesize smooth lip dynamics while preserving the speaker's identity.
Both quantitative and qualitative experiments confirm the superiority of our method in terms of realism, lip synchronization, and visual quality.
arXiv Detail & Related papers (2023-12-12T16:00:55Z) - Speech2Lip: High-fidelity Speech to Lip Generation by Learning from a
Short Video [91.92782707888618]
We present a decomposition-composition framework named Speech to Lip (Speech2Lip) that disentangles speech-sensitive and speech-insensitive motion/appearance.
We show that our model can be trained by a video of just a few minutes in length and achieve state-of-the-art performance in both visual quality and speech-visual synchronization.
arXiv Detail & Related papers (2023-09-09T14:52:39Z) - GeneFace++: Generalized and Stable Real-Time Audio-Driven 3D Talking
Face Generation [71.73912454164834]
A modern talking face generation method is expected to achieve the goals of generalized audio-lip synchronization, good video quality, and high system efficiency.
NeRF has become a popular technique in this field since it can achieve high-fidelity and 3D-consistent talking face generation from a few-minute-long training video.
We propose GeneFace++ to handle these challenges by utilizing the pitch contour as an auxiliary feature and introducing a temporal loss in the facial motion prediction process.
arXiv Detail & Related papers (2023-05-01T12:24:09Z) - SadTalker: Learning Realistic 3D Motion Coefficients for Stylized
Audio-Driven Single Image Talking Face Animation [33.651156455111916]
We present SadTalker, which generates 3D motion coefficients (head pose, expression) of the 3DMM from audio.
Specifically, we present ExpNet to learn accurate facial expressions from audio by distilling both coefficients and 3D-rendered faces.
arXiv Detail & Related papers (2022-11-22T11:35:07Z) - Render In-between: Motion Guided Video Synthesis for Action
Interpolation [53.43607872972194]
We propose a motion-guided frame-upsampling framework that is capable of producing realistic human motion and appearance.
A novel motion model is trained to infer the non-linear skeletal motion between frames by leveraging a large-scale motion-capture dataset.
Our pipeline only requires low-frame-rate videos and unpaired human motion data but does not require high-frame-rate videos for training.
arXiv Detail & Related papers (2021-11-01T15:32:51Z) - PIRenderer: Controllable Portrait Image Generation via Semantic Neural
Rendering [56.762094966235566]
A Portrait Image Neural Renderer is proposed to control the face motions with the parameters of three-dimensional morphable face models.
The proposed model can generate photo-realistic portrait images with accurate movements according to intuitive modifications.
Our model can generate coherent videos with convincing movements from only a single reference image and a driving audio stream.
arXiv Detail & Related papers (2021-09-17T07:24:16Z) - LipSync3D: Data-Efficient Learning of Personalized 3D Talking Faces from
Video using Pose and Lighting Normalization [4.43316916502814]
We present a video-based learning framework for animating personalized 3D talking faces from audio.
We introduce two training-time data normalizations that significantly improve data sample efficiency.
Our method outperforms contemporary state-of-the-art audio-driven video reenactment benchmarks in terms of realism, lip-sync and visual quality scores.
arXiv Detail & Related papers (2021-06-08T08:56:40Z)