Speech2Lip: High-fidelity Speech to Lip Generation by Learning from a Short Video
- URL: http://arxiv.org/abs/2309.04814v1
- Date: Sat, 9 Sep 2023 14:52:39 GMT
- Title: Speech2Lip: High-fidelity Speech to Lip Generation by Learning from a Short Video
- Authors: Xiuzhe Wu, Pengfei Hu, Yang Wu, Xiaoyang Lyu, Yan-Pei Cao, Ying Shan, Wenming Yang, Zhongqian Sun, Xiaojuan Qi
- Abstract summary: We present a decomposition-synthesis-composition framework named Speech to Lip (Speech2Lip) that disentangles speech-sensitive and speech-insensitive motion/appearance.
We show that our model can be trained on a video of just a few minutes in length and achieve state-of-the-art performance in both visual quality and speech-visual synchronization.
- Score: 91.92782707888618
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Synthesizing realistic videos from a given speech signal is still an open
challenge. Previous works have been plagued by issues such as inaccurate lip
shape generation and poor image quality. The key reason is that the input speech
mainly drives the motion and appearance of only limited facial areas (e.g., the
lip area). Therefore, directly learning a mapping function from speech
to the entire head image is prone to ambiguity, particularly when using a short
video for training. We thus propose a decomposition-synthesis-composition
framework named Speech to Lip (Speech2Lip) that disentangles speech-sensitive
and speech-insensitive motion/appearance to facilitate effective learning from
limited training data, resulting in the generation of natural-looking videos.
First, given a fixed head pose (i.e., canonical space), we present a
speech-driven implicit model for lip image generation which concentrates on
learning speech-sensitive motion and appearance. Next, to model the major
speech-insensitive motion (i.e., head movement), we introduce a geometry-aware
mutual explicit mapping (GAMEM) module that establishes geometric mappings
between different head poses. This allows us to paste lip images generated in
the canonical space onto head images with arbitrary poses and synthesize
talking videos with natural head movements. In addition, a Blend-Net and a
contrastive sync loss are introduced to enhance the overall synthesis
performance. Quantitative and qualitative results on three benchmarks
demonstrate that our model can be trained on a video of just a few minutes in
length and achieve state-of-the-art performance in both visual quality and
speech-visual synchronization. Code: https://github.com/CVMI-Lab/Speech2Lip.
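Illustrative sketch (not taken from the released code): the abstract describes a decomposition-synthesis-composition pipeline in which a speech-driven model renders the lip region in a canonical head pose, the GAMEM module maps it to the current head pose, and a Blend-Net composites the result into the full frame. The PyTorch-style code below is a minimal sketch of how such a pipeline could be wired together; all module names, shapes, and interfaces are assumptions rather than the authors' implementation (see https://github.com/CVMI-Lab/Speech2Lip for the official code).

    import torch
    import torch.nn as nn
    import torch.nn.functional as F


    class CanonicalLipGenerator(nn.Module):
        """Speech-driven generator: renders the lip region in a fixed (canonical)
        head pose, so it only needs to model speech-sensitive motion/appearance."""

        def __init__(self, audio_dim=256, hidden=512, lip_hw=96):
            super().__init__()
            self.lip_hw = lip_hw
            self.mlp = nn.Sequential(
                nn.Linear(audio_dim, hidden), nn.ReLU(),
                nn.Linear(hidden, hidden), nn.ReLU(),
                nn.Linear(hidden, 3 * lip_hw * lip_hw),
            )

        def forward(self, audio_feat):                      # (B, audio_dim)
            x = self.mlp(audio_feat)
            return x.view(-1, 3, self.lip_hw, self.lip_hw)  # canonical-space lip image


    class BlendNet(nn.Module):
        """Blends the pose-warped lip patch into the target-pose head image."""

        def __init__(self):
            super().__init__()
            self.refine = nn.Conv2d(6, 3, kernel_size=3, padding=1)

        def forward(self, warped_lip, head_img):
            x = torch.cat([warped_lip, head_img], dim=1)    # (B, 6, H, W)
            return torch.sigmoid(self.refine(x))            # composited frame


    def gamem_warp(canonical_lip, grid):
        """Stand-in for the geometry-aware mutual explicit mapping (GAMEM):
        warp the canonical-space lip image to the current head pose using a
        dense sampling grid with values in [-1, 1], shaped (B, H, W, 2)."""
        return F.grid_sample(canonical_lip, grid, align_corners=False)


    def synthesize_frame(audio_feat, head_img, grid, lip_gen, blend_net):
        lip_canonical = lip_gen(audio_feat)          # speech-sensitive part only
        lip_posed = gamem_warp(lip_canonical, grid)  # map canonical -> target pose
        return blend_net(lip_posed, head_img)        # paste + refine the full frame


    # Toy usage with random tensors (1 frame, 256-d audio feature, 256x256 head image).
    lip_gen, blend_net = CanonicalLipGenerator(), BlendNet()
    audio = torch.randn(1, 256)
    head = torch.rand(1, 3, 256, 256)
    grid = torch.rand(1, 256, 256, 2) * 2 - 1        # normalized sampling grid
    frame = synthesize_frame(audio, head, grid, lip_gen, blend_net)
    print(frame.shape)                               # torch.Size([1, 3, 256, 256])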
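The abstract also mentions a contrastive sync loss but does not specify its form. The sketch below shows a generic SyncNet-style contrastive synchronization objective, assuming cosine-distance embeddings of audio windows and lip crops and a hinge margin for mismatched pairs; the function name, margin, and embedding dimensions are illustrative assumptions only.

    import torch
    import torch.nn.functional as F


    def contrastive_sync_loss(audio_emb, lip_emb, is_synced, margin=0.5):
        """Generic contrastive synchronization loss (illustrative, not the paper's
        exact formulation). Matched audio/lip pairs (is_synced == 1) are pulled
        together in cosine distance; mismatched pairs are pushed beyond `margin`.

        audio_emb, lip_emb: (B, D) embeddings of an audio window and a lip crop
        (temporally aligned for positives, shifted for negatives).
        is_synced: (B,) float tensor of 1s (positives) and 0s (negatives).
        """
        dist = 1.0 - F.cosine_similarity(audio_emb, lip_emb, dim=-1)   # (B,)
        pos = is_synced * dist.pow(2)                                  # pull together
        neg = (1.0 - is_synced) * F.relu(margin - dist).pow(2)         # push apart
        return (pos + neg).mean()


    # Toy usage: 4 pairs, half synced, half deliberately mismatched.
    audio_emb = F.normalize(torch.randn(4, 128), dim=-1)
    lip_emb = F.normalize(torch.randn(4, 128), dim=-1)
    is_synced = torch.tensor([1.0, 1.0, 0.0, 0.0])
    print(float(contrastive_sync_loss(audio_emb, lip_emb, is_synced)))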
Related papers
- JEAN: Joint Expression and Audio-guided NeRF-based Talking Face Generation [24.2065254076207]
We introduce a novel method for joint expression and audio-guided talking face generation.
Our method can synthesize high-fidelity talking face videos, achieving state-of-the-art facial expression transfer.
arXiv Detail & Related papers (2024-09-18T17:18:13Z)
- PoseTalk: Text-and-Audio-based Pose Control and Motion Refinement for One-Shot Talking Head Generation [17.158581488104186]
Previous audio-driven talking head generation (THG) methods generate head poses from driving audio.
We propose PoseTalk, a THG system that can freely generate lip-synchronized talking head videos with free head poses conditioned on text prompts and audio.
arXiv Detail & Related papers (2024-09-04T12:30:25Z)
- High-fidelity and Lip-synced Talking Face Synthesis via Landmark-based Diffusion Model [89.29655924125461]
We propose a novel landmark-based diffusion model for talking face generation.
We first establish the less ambiguous mapping from audio to the landmark motion of the lip and jaw.
Then, we introduce an innovative conditioning module called TalkFormer to align the synthesized motion with the motion represented by landmarks.
arXiv Detail & Related papers (2024-08-10T02:58:28Z)
- Learning to Dub Movies via Hierarchical Prosody Models [167.6465354313349]
Given a piece of text, a video clip, and a reference audio, the movie dubbing (also known as visual voice cloning, V2C) task aims to generate speech that matches the speaker's emotion in the video, using the desired speaker's voice as reference.
We propose a novel movie dubbing architecture that tackles these problems via hierarchical prosody modelling, bridging the visual information to the corresponding speech prosody at three levels: lip, face, and scene.
arXiv Detail & Related papers (2022-12-08T03:29:04Z)
- Audio-Driven Co-Speech Gesture Video Generation [92.15661971086746]
We define and study this challenging problem of audio-driven co-speech gesture video generation.
Our key insight is that the co-speech gestures can be decomposed into common motion patterns and subtle rhythmic dynamics.
We propose a novel framework, Audio-driveN Gesture vIdeo gEneration (ANGIE), to effectively capture the reusable co-speech gesture patterns.
arXiv Detail & Related papers (2022-12-05T15:28:22Z)
- A Novel Speech-Driven Lip-Sync Model with CNN and LSTM [12.747541089354538]
We present a combined deep neural network of one-dimensional convolutions and LSTM that generates displacements of a 3D template face model from variable-length speech input.
To enhance the robustness of the network to different sound signals, we adapt a trained speech recognition model to extract speech features.
We show that our model is able to generate smooth and natural lip movements synchronized with speech.
arXiv Detail & Related papers (2022-05-02T13:57:50Z)
- FACIAL: Synthesizing Dynamic Talking Face with Implicit Attribute Learning [23.14865405847467]
We propose a talking face generation method that takes an audio signal as input and a short target video clip as reference.
The method synthesizes a photo-realistic video of the target face with natural lip motions, head poses, and eye blinks that are in-sync with the input audio signal.
Experimental results and user studies show that our method can generate realistic talking face videos with better quality than state-of-the-art methods.
arXiv Detail & Related papers (2021-08-18T02:10:26Z)
- Pose-Controllable Talking Face Generation by Implicitly Modularized Audio-Visual Representation [96.66010515343106]
We propose a clean yet effective framework to generate pose-controllable talking faces.
We operate on raw face images, using only a single photo as an identity reference.
Our model has multiple advanced capabilities including extreme view robustness and talking face frontalization.
arXiv Detail & Related papers (2021-04-22T15:10:26Z)
- Audio-driven Talking Face Video Generation with Learning-based Personalized Head Pose [67.31838207805573]
We propose a deep neural network model that takes an audio signal A of a source person and a short video V of a target person as input.
It outputs a synthesized high-quality talking face video with a personalized head pose.
Our method can generate high-quality talking face videos with more distinguishing head movement effects than state-of-the-art methods.
arXiv Detail & Related papers (2020-02-24T10:02:10Z)