A Neural Lip-Sync Framework for Synthesizing Photorealistic Virtual News Anchors
- URL: http://arxiv.org/abs/2002.08700v2
- Date: Wed, 5 May 2021 10:01:18 GMT
- Title: A Neural Lip-Sync Framework for Synthesizing Photorealistic Virtual News Anchors
- Authors: Ruobing Zheng, Zhou Zhu, Bo Song, Changjiang Ji
- Abstract summary: Lip sync has emerged as a promising technique for generating mouth movements from audio signals.
This paper presents a novel lip-sync framework specially designed for producing high-fidelity virtual news anchors.
- Score: 8.13692293541489
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Lip sync has emerged as a promising technique for generating mouth movements
from audio signals. However, synthesizing a high-resolution and photorealistic
virtual news anchor is still challenging. Lack of natural appearance, visual
consistency, and processing efficiency are the main problems with existing
methods. This paper presents a novel lip-sync framework specially designed for
producing high-fidelity virtual news anchors. A pair of Temporal Convolutional
Networks are used to learn the cross-modal sequential mapping from audio
signals to mouth movements, followed by a neural rendering network that
translates the synthetic facial map into a high-resolution and photorealistic
appearance. This fully trainable framework provides end-to-end processing that
outperforms traditional graphics-based methods in many low-delay applications.
Experiments also show the framework has advantages over modern neural-based
methods in both visual appearance and efficiency.
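The two-stage design described in the abstract (temporal convolutional mapping from audio features to mouth movements, followed by a neural rendering network that turns a synthetic facial map into a photorealistic frame) can be outlined roughly as follows. This is a minimal PyTorch-style sketch under assumed settings: the module names, feature dimensions, and layer counts are illustrative, and only one of the paper's pair of Temporal Convolutional Networks is shown; it is not the authors' implementation.

```python
# Minimal sketch of the two-stage pipeline described in the abstract.
# NOTE: names, layer sizes, and feature dimensions are illustrative assumptions,
# not the authors' released architecture.
import torch
import torch.nn as nn

class AudioToMouthTCN(nn.Module):
    """Temporal convolutional stack mapping audio features to mouth-movement trajectories."""
    def __init__(self, audio_dim=80, landmark_dim=40, channels=256, num_layers=4):
        super().__init__()
        blocks, in_ch = [], audio_dim
        for i in range(num_layers):
            dilation = 2 ** i  # exponentially growing receptive field over the audio sequence
            blocks += [
                nn.Conv1d(in_ch, channels, kernel_size=3,
                          padding=dilation, dilation=dilation),
                nn.ReLU(),
            ]
            in_ch = channels
        self.tcn = nn.Sequential(*blocks)
        self.head = nn.Conv1d(channels, landmark_dim, kernel_size=1)

    def forward(self, audio_feats):               # (batch, audio_dim, time)
        return self.head(self.tcn(audio_feats))   # (batch, landmark_dim, time)

class FacialMapRenderer(nn.Module):
    """Image-to-image network translating a synthetic facial map into an RGB frame."""
    def __init__(self, in_channels=3, base=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, base, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(base, base * 2, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(base * 2, base, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(base, 3, 4, stride=2, padding=1), nn.Tanh(),
        )

    def forward(self, facial_map):                # (batch, 3, H, W) synthetic facial map
        return self.net(facial_map)               # (batch, 3, H, W) rendered frame

if __name__ == "__main__":
    audio = torch.randn(1, 80, 100)               # e.g. 100 frames of mel-spectrogram features
    mouth = AudioToMouthTCN()(audio)              # (1, 40, 100) mouth-movement sequence
    frame = FacialMapRenderer()(torch.randn(1, 3, 256, 256))
    print(mouth.shape, frame.shape)
```

Dilated 1D convolutions give each predicted mouth pose a growing temporal receptive field over the audio sequence while keeping inference cheap, which fits the low-delay emphasis of the abstract.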
Related papers
- LawDNet: Enhanced Audio-Driven Lip Synthesis via Local Affine Warping Deformation [0.4537124110113416]
LawDNet is a novel deep-learning architecture enhancing lip synthesis through a Local Affine Warping Deformation mechanism.
LawDNet incorporates a dual-stream discriminator for improved frame-to-frame continuity and employs face normalization techniques to handle pose and scene variations.
arXiv Detail & Related papers (2024-09-14T06:04:21Z)
- High-fidelity and Lip-synced Talking Face Synthesis via Landmark-based Diffusion Model [89.29655924125461]
We propose a novel landmark-based diffusion model for talking face generation.
We first establish the less ambiguous mapping from audio to landmark motion of lip and jaw.
Then, we introduce an innovative conditioning module called TalkFormer to align the synthesized motion with the motion represented by landmarks.
arXiv Detail & Related papers (2024-08-10T02:58:28Z)
- RealTalk: Real-time and Realistic Audio-driven Face Generation with 3D Facial Prior-guided Identity Alignment Network [48.95833484103569]
RealTalk consists of an audio-to-expression transformer and a high-fidelity expression-to-face framework.
In the first component, we consider both identity and intra-personal variation features related to speaking lip movements.
In the second component, we design a lightweight facial identity alignment (FIA) module.
This novel design allows us to generate fine details in real-time, without depending on sophisticated and inefficient feature alignment modules.
arXiv Detail & Related papers (2024-06-26T12:09:59Z)
- Hallo: Hierarchical Audio-Driven Visual Synthesis for Portrait Image Animation [29.87407471246318]
This research delves into the complexities of synchronizing facial movements and creating visually appealing, temporally consistent animations.
Our innovative approach embraces the end-to-end diffusion paradigm and introduces a hierarchical audio-driven visual synthesis module.
The proposed hierarchical audio-driven visual synthesis offers adaptive control over expression and pose diversity, enabling more effective personalization tailored to different identities.
arXiv Detail & Related papers (2024-06-13T04:33:20Z)
- OpFlowTalker: Realistic and Natural Talking Face Generation via Optical Flow Guidance [13.050998759819933]
"OpFlowTalker" is a novel approach that utilizes predicted optical flow changes from audio inputs rather than direct image predictions.
It smooths image transitions and aligns changes with semantic content.
We also developed an optical flow synchronization module that regulates both full-face and lip movements (a generic flow-warping sketch appears after this list).
arXiv Detail & Related papers (2024-05-23T15:42:34Z)
- Pose-Controllable 3D Facial Animation Synthesis using Hierarchical Audio-Vertex Attention [52.63080543011595]
A novel pose-controllable 3D facial animation synthesis method is proposed by utilizing hierarchical audio-vertex attention.
The proposed method can produce more realistic facial expressions and head posture movements.
arXiv Detail & Related papers (2023-02-24T09:36:31Z)
- DFA-NeRF: Personalized Talking Head Generation via Disentangled Face Attributes Neural Rendering [69.9557427451339]
We propose a framework based on neural radiance fields to pursue high-fidelity talking head generation.
Specifically, the neural radiance field takes lip movement features and personalized attributes as two disentangled conditions.
We show that our method achieves significantly better results than state-of-the-art methods.
arXiv Detail & Related papers (2022-01-03T18:23:38Z)
- Fast Training of Neural Lumigraph Representations using Meta Learning [109.92233234681319]
We develop a new neural rendering approach with the goal of quickly learning a high-quality representation which can also be rendered in real-time.
Our approach, MetaNLR++, accomplishes this by using a unique combination of a neural shape representation and 2D CNN-based image feature extraction, aggregation, and re-projection.
We show that MetaNLR++ achieves similar or better photorealistic novel view synthesis results in a fraction of the time that competing methods require.
arXiv Detail & Related papers (2021-06-28T18:55:50Z)
- Neural Human Video Rendering by Learning Dynamic Textures and Rendering-to-Video Translation [99.64565200170897]
We propose a novel human video synthesis method by explicitly disentangling the learning of time-coherent fine-scale details from the embedding of the human in 2D screen space.
We show several applications of our approach, such as human reenactment and novel view synthesis from monocular video, where we show significant improvement over the state of the art both qualitatively and quantitatively.
arXiv Detail & Related papers (2020-01-14T18:06:27Z)
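As a brief aside on the optical-flow guidance idea in the OpFlowTalker entry above: the core operation of warping a previous frame with a predicted dense flow field, rather than predicting pixels directly, can be sketched as below. This is a generic, hedged illustration; the function name, flow convention, and tensor shapes are assumptions and do not come from that paper.

```python
# Hedged illustration of flow-based frame warping (generic technique, not
# OpFlowTalker's implementation): sample the previous frame at positions
# displaced by a predicted optical-flow field.
import torch
import torch.nn.functional as F

def warp_by_flow(prev_frame, flow):
    """prev_frame: (B, C, H, W); flow: (B, 2, H, W) pixel offsets (dx, dy)."""
    b, _, h, w = prev_frame.shape
    # Base sampling grid in pixel coordinates.
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    base = torch.stack((xs, ys), dim=0).to(prev_frame)          # (2, H, W), channel 0 = x
    coords = base.unsqueeze(0) + flow                           # displaced sample positions
    # Normalize to [-1, 1] as required by grid_sample.
    coords_x = 2.0 * coords[:, 0] / (w - 1) - 1.0
    coords_y = 2.0 * coords[:, 1] / (h - 1) - 1.0
    grid = torch.stack((coords_x, coords_y), dim=-1)            # (B, H, W, 2)
    return F.grid_sample(prev_frame, grid, align_corners=True)

if __name__ == "__main__":
    frame = torch.rand(1, 3, 64, 64)
    zero_flow = torch.zeros(1, 2, 64, 64)   # zero flow should reproduce the input frame
    print(torch.allclose(warp_by_flow(frame, zero_flow), frame, atol=1e-5))
```

With a zero flow field the warp reproduces the input frame, which serves as a quick sanity check for the coordinate normalization.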
This list is automatically generated from the titles and abstracts of the papers on this site.