One-shot Talking Face Generation from Single-speaker Audio-Visual Correlation Learning
- URL: http://arxiv.org/abs/2112.02749v1
- Date: Mon, 6 Dec 2021 02:53:51 GMT
- Title: One-shot Talking Face Generation from Single-speaker Audio-Visual Correlation Learning
- Authors: Suzhen Wang, Lincheng Li, Yu Ding, Xin Yu
- Abstract summary: Learning a consistent speech style from a specific speaker is much easier and leads to authentic mouth movements.
We propose a novel one-shot talking face generation framework by exploring consistent correlations between audio and visual motions from a specific speaker.
Thanks to our learned consistent speaking style, our method generates authentic mouth shapes and vivid movements.
- Score: 20.51814865676907
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Audio-driven one-shot talking face generation methods are usually trained on
video resources of various persons. However, their generated videos often suffer
from unnatural mouth shapes and asynchronous lips because those methods struggle to
learn a consistent speech style from different speakers. We observe that it
would be much easier to learn a consistent speech style from a specific
speaker, which leads to authentic mouth movements. Hence, we propose a novel
one-shot talking face generation framework by exploring consistent correlations
between audio and visual motions from a specific speaker and then transferring
audio-driven motion fields to a reference image. Specifically, we develop an
Audio-Visual Correlation Transformer (AVCT) that aims to infer talking motions
represented by keypoint-based dense motion fields from input audio. In
particular, considering audio may come from different identities in deployment,
we incorporate phonemes to represent audio signals. In this manner, our AVCT
can inherently generalize to audio spoken by other identities. Moreover, as
face keypoints are used to represent speakers, AVCT is agnostic to the
appearance of the training speaker, and thus allows us to manipulate face
images of different identities readily. Considering different face shapes lead
to different motions, a motion field transfer module is exploited to reduce the
audio-driven dense motion field gap between the training identity and the
one-shot reference. Once we obtain the dense motion field of the reference
image, we employ an image renderer to generate its talking face videos from an
audio clip. Thanks to our learned consistent speaking style, our method
generates authentic mouth shapes and vivid movements. Extensive experiments
demonstrate that our synthesized videos outperform the state-of-the-art in
terms of visual quality and lip-sync.
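The pipeline described in the abstract (a phoneme-conditioned transformer predicting keypoint-based motion, a motion field transfer step for the one-shot reference, and an image renderer) can be illustrated with a minimal PyTorch sketch. All module names, tensor shapes, the keypoint count, and the simple rescaling used for motion transfer below are illustrative assumptions for exposition, not the paper's actual architecture or code.

```python
# Hypothetical sketch of the described pipeline:
# phonemes -> transformer -> keypoint motion -> transfer to reference -> renderer.
# Every dimension and module here is an assumption, not the paper's implementation.
import torch
import torch.nn as nn


class AudioVisualCorrelationTransformer(nn.Module):
    """Maps a phoneme sequence to per-frame keypoint displacements."""

    def __init__(self, num_phonemes=40, d_model=256, num_keypoints=10):
        super().__init__()
        self.phoneme_embed = nn.Embedding(num_phonemes, d_model)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=4, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=4)
        # Predict (dx, dy) offsets for each keypoint at each audio frame.
        self.to_keypoint_motion = nn.Linear(d_model, num_keypoints * 2)

    def forward(self, phoneme_ids):            # (B, T)
        x = self.phoneme_embed(phoneme_ids)    # (B, T, d_model)
        x = self.encoder(x)                    # (B, T, d_model)
        motion = self.to_keypoint_motion(x)    # (B, T, K*2)
        return motion.view(*motion.shape[:2], -1, 2)  # (B, T, K, 2)


def transfer_motion_field(driving_motion, source_kp, reference_kp):
    """Adapt driving keypoint motion to the reference face shape.

    A simple proportional rescaling; the module in the paper is learned,
    so this is only a stand-in for illustration.
    """
    scale = reference_kp.std(dim=-2, keepdim=True) / (
        source_kp.std(dim=-2, keepdim=True) + 1e-6
    )
    return driving_motion * scale.unsqueeze(1)  # broadcast over time


class ImageRenderer(nn.Module):
    """Placeholder renderer: a real one would warp the reference image
    with a dense flow built from the keypoint motion, then refine it."""

    def __init__(self):
        super().__init__()
        self.refine = nn.Conv2d(3, 3, kernel_size=3, padding=1)

    def forward(self, reference_image, keypoint_motion):
        return self.refine(reference_image)


if __name__ == "__main__":
    avct = AudioVisualCorrelationTransformer()
    phonemes = torch.randint(0, 40, (1, 25))      # 25 audio frames
    motion = avct(phonemes)                        # (1, 25, 10, 2)
    src_kp = torch.rand(1, 10, 2)                  # training identity keypoints
    ref_kp = torch.rand(1, 10, 2)                  # one-shot reference keypoints
    motion_ref = transfer_motion_field(motion, src_kp, ref_kp)
    frame = ImageRenderer()(torch.rand(1, 3, 256, 256), motion_ref[:, 0])
    print(motion_ref.shape, frame.shape)
```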
Related papers
- JEAN: Joint Expression and Audio-guided NeRF-based Talking Face Generation [24.2065254076207]
We introduce a novel method for joint expression and audio-guided talking face generation.
Our method can synthesize high-fidelity talking face videos, achieving state-of-the-art facial expression transfer.
arXiv Detail & Related papers (2024-09-18T17:18:13Z)
- FaceChain-ImagineID: Freely Crafting High-Fidelity Diverse Talking Faces from Disentangled Audio [45.71036380866305]
We abstract the process of people hearing speech, extracting meaningful cues, and creating dynamically audio-consistent talking faces from a single audio clip.
Specifically, it involves two critical challenges: one is to effectively decouple identity, content, and emotion from entangled audio, and the other is to maintain intra-video diversity and inter-video consistency.
We introduce the Controllable Coherent Frame generation, which involves the flexible integration of three trainable adapters with frozen Latent Diffusion Models.
arXiv Detail & Related papers (2024-03-04T09:59:48Z)
- Identity-Preserving Talking Face Generation with Landmark and Appearance Priors [106.79923577700345]
Existing person-generic methods have difficulty in generating realistic and lip-synced videos.
We propose a two-stage framework consisting of audio-to-landmark generation and landmark-to-video rendering procedures.
Our method can produce more realistic, lip-synced, and identity-preserving videos than existing person-generic talking face generation methods.
arXiv Detail & Related papers (2023-05-15T01:31:32Z)
- Imitator: Personalized Speech-driven 3D Facial Animation [63.57811510502906]
State-of-the-art methods deform the face topology of the target actor to sync with the input audio without considering the identity-specific speaking style and facial idiosyncrasies of the target actor.
We present Imitator, a speech-driven facial expression synthesis method, which learns identity-specific details from a short input video.
We show that our approach produces temporally coherent facial expressions from input audio while preserving the speaking style of the target actors.
arXiv Detail & Related papers (2022-12-30T19:00:02Z)
- LA-VocE: Low-SNR Audio-visual Speech Enhancement using Neural Vocoders [53.30016986953206]
We propose LA-VocE, a new two-stage approach that predicts mel-spectrograms from noisy audio-visual speech via a transformer-based architecture.
We train and evaluate our framework on thousands of speakers and 11+ different languages, and study our model's ability to adapt to different levels of background noise and speech interference.
arXiv Detail & Related papers (2022-11-20T15:27:55Z)
- Pose-Controllable Talking Face Generation by Implicitly Modularized Audio-Visual Representation [96.66010515343106]
We propose a clean yet effective framework to generate pose-controllable talking faces.
We operate on raw face images, using only a single photo as an identity reference.
Our model has multiple advanced capabilities including extreme view robustness and talking face frontalization.
arXiv Detail & Related papers (2021-04-22T15:10:26Z)
- VisualVoice: Audio-Visual Speech Separation with Cross-Modal Consistency [111.55430893354769]
Given a video, the goal is to extract the speech associated with a face in spite of simultaneous background sounds and/or other human speakers.
Our approach jointly learns audio-visual speech separation and cross-modal speaker embeddings from unlabeled video.
It yields state-of-the-art results on five benchmark datasets for audio-visual speech separation and enhancement.
arXiv Detail & Related papers (2021-01-08T18:25:24Z)
- FaceFilter: Audio-visual speech separation using still images [41.97445146257419]
This paper aims to separate a target speaker's speech from a mixture of two speakers using a deep audio-visual speech separation network.
Unlike previous works that used lip movement on video clips or pre-enrolled speaker information as an auxiliary conditional feature, we use a single face image of the target speaker.
arXiv Detail & Related papers (2020-05-14T15:42:31Z)
- MakeItTalk: Speaker-Aware Talking-Head Animation [49.77977246535329]
We present a method that generates expressive talking heads from a single facial image with audio as the only input.
Based on this intermediate representation, our method is able to synthesize photorealistic videos of entire talking heads with full range of motion.
arXiv Detail & Related papers (2020-04-27T17:56:15Z)