Identity-Preserving Talking Face Generation with Landmark and Appearance
Priors
- URL: http://arxiv.org/abs/2305.08293v1
- Date: Mon, 15 May 2023 01:31:32 GMT
- Title: Identity-Preserving Talking Face Generation with Landmark and Appearance
Priors
- Authors: Weizhi Zhong, Chaowei Fang, Yinqi Cai, Pengxu Wei, Gangming Zhao,
Liang Lin, Guanbin Li
- Abstract summary: Existing person-generic methods struggle to generate realistic, lip-synced videos while preserving identity.
We propose a two-stage framework consisting of audio-to-landmark generation and landmark-to-video rendering procedures.
Our method can produce more realistic, lip-synced, and identity-preserving videos than existing person-generic talking face generation methods.
- Score: 106.79923577700345
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Generating talking face videos from audio has attracted substantial
research interest. A few person-specific methods can generate vivid videos but
require videos of the target speaker for training or fine-tuning. Existing
person-generic methods struggle to generate realistic, lip-synced videos while
preserving identity information. To tackle this problem, we propose a two-stage
framework
consisting of audio-to-landmark generation and landmark-to-video rendering
procedures. First, we devise a novel Transformer-based landmark generator to
infer lip and jaw landmarks from the audio. Prior landmark characteristics of
the speaker's face are employed to make the generated landmarks coincide with
the facial outline of the speaker. Then, a video rendering model is built to
translate the generated landmarks into face images. During this stage, prior
appearance information is extracted from the lower-half occluded target face
and static reference images, which helps generate realistic and
identity-preserving visual content. To effectively exploit the prior
information in the static reference images, we align them with the target
face's pose and expression using motion fields. Moreover,
auditory features are reused to guarantee that the generated face images are
well synchronized with the audio. Extensive experiments demonstrate that our
method can produce more realistic, lip-synced, and identity-preserving videos
than existing person-generic talking face generation methods.
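To make the two-stage design easier to picture, below is a minimal PyTorch-style sketch of the pipeline, assuming a transformer-based audio-to-landmark generator and a simple convolutional landmark-to-video renderer. All module names, dimensions, and layer choices here are illustrative assumptions rather than the authors' actual architecture.

```python
# Minimal sketch of a two-stage talking-face pipeline:
# stage 1 maps audio (plus prior speaker landmarks) to lip/jaw landmarks,
# stage 2 renders a frame from the landmark sketch, the lower-half occluded
# target frame, and an aligned static reference image.
# All sizes and layers are illustrative guesses, not the paper's architecture.
import torch
import torch.nn as nn


class AudioToLandmark(nn.Module):
    """Stage 1: infer per-frame lip/jaw landmarks from audio features,
    conditioned on the speaker's prior (reference) landmarks."""

    def __init__(self, audio_dim=80, lm_points=40, d_model=256, n_layers=4):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, d_model)
        # The prior landmarks are injected as a single extra token.
        self.prior_proj = nn.Linear(lm_points * 2, d_model)
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=n_layers)
        self.head = nn.Linear(d_model, lm_points * 2)  # (x, y) per landmark

    def forward(self, audio_feats, prior_landmarks):
        # audio_feats: (B, T, audio_dim); prior_landmarks: (B, lm_points * 2)
        tokens = self.audio_proj(audio_feats)
        prior_tok = self.prior_proj(prior_landmarks).unsqueeze(1)
        h = self.encoder(torch.cat([prior_tok, tokens], dim=1))
        return self.head(h[:, 1:])  # (B, T, lm_points * 2), one set per frame


class LandmarkToVideo(nn.Module):
    """Stage 2: render a face frame from a landmark sketch image, the
    lower-half occluded target frame, and an aligned reference image."""

    def __init__(self, ch=64):
        super().__init__()
        # Three RGB inputs concatenated channel-wise -> 9 input channels.
        self.net = nn.Sequential(
            nn.Conv2d(9, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, 3, 3, padding=1), nn.Tanh(),
        )

    def forward(self, landmark_sketch, occluded_target, aligned_reference):
        x = torch.cat([landmark_sketch, occluded_target, aligned_reference], dim=1)
        return self.net(x)


if __name__ == "__main__":
    B, T = 2, 5
    stage1, stage2 = AudioToLandmark(), LandmarkToVideo()
    landmarks = stage1(torch.randn(B, T, 80), torch.randn(B, 80))
    frame = stage2(torch.randn(B, 3, 128, 128), torch.randn(B, 3, 128, 128),
                   torch.randn(B, 3, 128, 128))
    print(landmarks.shape, frame.shape)  # (2, 5, 80), (2, 3, 128, 128)
```

In the paper's actual pipeline, the renderer additionally warps the static reference images toward the target pose and expression using motion fields and re-injects auditory features for synchronization; this sketch omits both for brevity.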
Related papers
- JEAN: Joint Expression and Audio-guided NeRF-based Talking Face Generation [24.2065254076207] (2024-09-18)
We introduce a novel method for joint expression and audio-guided talking face generation.
Our method can synthesize high-fidelity talking face videos, achieving state-of-the-art facial expression transfer.
- High-fidelity and Lip-synced Talking Face Synthesis via Landmark-based Diffusion Model [89.29655924125461] (2024-08-10)
We propose a novel landmark-based diffusion model for talking face generation.
We first establish a less ambiguous mapping from audio to the landmark motion of the lips and jaw.
Then, we introduce an innovative conditioning module called TalkFormer to align the synthesized motion with the motion represented by landmarks.
- Imitator: Personalized Speech-driven 3D Facial Animation [63.57811510502906] (2022-12-30)
State-of-the-art methods deform the face topology of the target actor to sync with the input audio, without considering the actor's identity-specific speaking style and facial idiosyncrasies.
We present Imitator, a speech-driven facial expression synthesis method, which learns identity-specific details from a short input video.
We show that our approach produces temporally coherent facial expressions from input audio while preserving the speaking style of the target actors.
- One-shot Talking Face Generation from Single-speaker Audio-Visual Correlation Learning [20.51814865676907] (2021-12-06)
Learning a consistent speaking style from a specific speaker is much easier and leads to authentic mouth movements.
We propose a novel one-shot talking face generation framework by exploring consistent correlations between audio and visual motions from a specific speaker.
Thanks to our learned consistent speaking style, our method generates authentic mouth shapes and vivid movements.
- Speech2Video: Cross-Modal Distillation for Speech to Video Generation [21.757776580641902] (2021-07-10)
Speech-to-video generation can spark interesting applications in the entertainment, customer service, and human-computer-interaction industries.
The challenge mainly lies in disentangling the distinct visual attributes from audio signals.
We propose a light-weight, cross-modal distillation method to extract disentangled emotional and identity information from unlabelled video inputs.
- Pose-Controllable Talking Face Generation by Implicitly Modularized Audio-Visual Representation [96.66010515343106] (2021-04-22)
We propose a clean yet effective framework to generate pose-controllable talking faces.
We operate on raw face images, using only a single photo as an identity reference.
Our model has multiple advanced capabilities including extreme view robustness and talking face frontalization.
- Identity-Preserving Realistic Talking Face Generation [4.848016645393023] (2020-05-25)
We propose a method for identity-preserving realistic facial animation from speech.
We impose eye blinks on facial landmarks using unsupervised learning.
We also use LSGAN to generate the facial texture from person-specific facial landmarks.
- MakeItTalk: Speaker-Aware Talking-Head Animation [49.77977246535329] (2020-04-27)
We present a method that generates expressive talking heads from a single facial image with audio as the only input.
Based on this intermediate representation, our method is able to synthesize photorealistic videos of entire talking heads with full range of motion.
- Everybody's Talkin': Let Me Talk as You Want [134.65914135774605] (2020-01-15)
We present a method to edit target portrait footage by taking a sequence of audio as input to synthesize a photo-realistic video.
It does not assume a person-specific rendering network, yet it is capable of translating arbitrary source audio into arbitrary video output.