Memories are One-to-Many Mapping Alleviators in Talking Face Generation
- URL: http://arxiv.org/abs/2212.05005v3
- Date: Tue, 5 Mar 2024 07:03:48 GMT
- Title: Memories are One-to-Many Mapping Alleviators in Talking Face Generation
- Authors: Anni Tang, Tianyu He, Xu Tan, Jun Ling, Li Song
- Abstract summary: Talking face generation aims at generating photo-realistic video portraits of a target person driven by input audio.
In this paper, we propose MemFace to complement the missing information with an implicit memory and an explicit memory.
Our experimental results show that our proposed MemFace surpasses all the state-of-the-art results across multiple scenarios consistently and significantly.
- Score: 31.55290250247604
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Talking face generation aims at generating photo-realistic video portraits of
a target person driven by input audio. Due to its nature of one-to-many mapping
from the input audio to the output video (e.g., one speech content may have
multiple feasible visual appearances), learning a deterministic mapping like
previous works brings ambiguity during training, and thus causes inferior
visual results. Although this one-to-many mapping could be alleviated in part
by a two-stage framework (i.e., an audio-to-expression model followed by a
neural-rendering model), it is still insufficient since the prediction is
produced without enough information (e.g., emotions, wrinkles, etc.). In this
paper, we propose MemFace to complement the missing information with an
implicit memory and an explicit memory that follow the sense of the two stages
respectively. More specifically, the implicit memory is employed in the
audio-to-expression model to capture high-level semantics in the
audio-expression shared space, while the explicit memory is employed in the
neural-rendering model to help synthesize pixel-level details. Our experimental
results show that our proposed MemFace surpasses all the state-of-the-art
results across multiple scenarios consistently and significantly.
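The architecture described in the abstract pairs an audio-to-expression model (complemented by an implicit memory) with a neural-rendering model (complemented by an explicit memory). As a minimal sketch of the first stage only, the code below illustrates how attention over a set of learned memory slots can supply expression semantics that the audio alone underdetermines. It is not the authors' implementation: the module names, memory-slot count, and feature dimensions (80-dim audio features, 256-dim shared space, 64-dim expression coefficients) are assumptions chosen for illustration.

```python
# Illustrative sketch (not the MemFace code) of an audio-to-expression model
# with a learned "implicit memory": audio features query memory keys, and the
# retrieved values complement the audio pathway before decoding expressions.
import torch
import torch.nn as nn


class ImplicitMemory(nn.Module):
    """Attention over learned memory slots (hypothetical layout and sizes)."""

    def __init__(self, num_slots: int = 1000, dim: int = 256):
        super().__init__()
        self.keys = nn.Parameter(torch.randn(num_slots, dim))    # queried by audio features
        self.values = nn.Parameter(torch.randn(num_slots, dim))  # retrieved expression semantics

    def forward(self, query: torch.Tensor) -> torch.Tensor:
        # query: (batch, time, dim) features in the audio-expression shared space
        attn = torch.softmax(query @ self.keys.t() / query.shape[-1] ** 0.5, dim=-1)
        return attn @ self.values


class AudioToExpression(nn.Module):
    """Stage 1: predicts expression coefficients from audio, aided by the memory."""

    def __init__(self, audio_dim: int = 80, dim: int = 256, expr_dim: int = 64):
        super().__init__()
        self.encoder = nn.Linear(audio_dim, dim)
        self.memory = ImplicitMemory(dim=dim)
        self.decoder = nn.Linear(2 * dim, expr_dim)

    def forward(self, audio_feats: torch.Tensor) -> torch.Tensor:
        h = self.encoder(audio_feats)      # project audio into the shared space
        retrieved = self.memory(h)         # complement the missing information
        return self.decoder(torch.cat([h, retrieved], dim=-1))


# Usage: 1 clip, 100 frames of 80-dim audio features -> 64-dim expression coefficients
expr = AudioToExpression()(torch.randn(1, 100, 80))
print(expr.shape)  # torch.Size([1, 100, 64])
```

In this sketch the retrieved memory values are concatenated with the audio features rather than replacing them, mirroring the paper's idea that the memory complements information the audio cannot determine on its own.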
Related papers
- High-fidelity and Lip-synced Talking Face Synthesis via Landmark-based Diffusion Model [89.29655924125461]
We propose a novel landmark-based diffusion model for talking face generation.
We first establish a less ambiguous mapping from audio to the landmark motion of the lips and jaw.
Then, we introduce an innovative conditioning module called TalkFormer to align the synthesized motion with the motion represented by landmarks.
arXiv Detail & Related papers (2024-08-10T02:58:28Z) - GSmoothFace: Generalized Smooth Talking Face Generation via Fine Grained 3D Face Guidance [83.43852715997596]
GSmoothFace is a novel two-stage generalized talking face generation model guided by a fine-grained 3D face model.
It can synthesize smooth lip dynamics while preserving the speaker's identity.
Both quantitative and qualitative experiments confirm the superiority of our method in terms of realism, lip synchronization, and visual quality.
arXiv Detail & Related papers (2023-12-12T16:00:55Z) - Speech2Lip: High-fidelity Speech to Lip Generation by Learning from a Short Video [91.92782707888618]
We present a decomposition-composition framework named Speech to Lip (Speech2Lip) that disentangles speech-sensitive and speech-insensitive motion/appearance.
We show that our model can be trained on a video of just a few minutes in length and achieve state-of-the-art performance in both visual quality and speech-visual synchronization.
arXiv Detail & Related papers (2023-09-09T14:52:39Z) - Identity-Preserving Talking Face Generation with Landmark and Appearance Priors [106.79923577700345]
Existing person-generic methods have difficulty in generating realistic and lip-synced videos.
We propose a two-stage framework consisting of audio-to-landmark generation and landmark-to-video rendering procedures.
Our method can produce more realistic, lip-synced, and identity-preserving videos than existing person-generic talking face generation methods.
arXiv Detail & Related papers (2023-05-15T01:31:32Z) - Visual Speech-Aware Perceptual 3D Facial Expression Reconstruction from Videos [32.48058491211032]
We present the first method for visual speech-aware perceptual reconstruction of 3D mouth expressions.
We do this by proposing a "lipread" loss, which guides the fitting process so that the elicited perception from the 3D reconstructed talking head resembles that of the original video footage.
arXiv Detail & Related papers (2022-07-22T14:07:46Z) - Speech2Video: Cross-Modal Distillation for Speech to Video Generation [21.757776580641902]
Speech-to-video generation can spark interesting applications in the entertainment, customer service, and human-computer interaction industries.
The challenge mainly lies in disentangling the distinct visual attributes from audio signals.
We propose a lightweight, cross-modal distillation method to extract disentangled emotional and identity information from unlabelled video inputs.
arXiv Detail & Related papers (2021-07-10T10:27:26Z) - Pose-Controllable Talking Face Generation by Implicitly Modularized Audio-Visual Representation [96.66010515343106]
We propose a clean yet effective framework to generate pose-controllable talking faces.
We operate on raw face images, using only a single photo as an identity reference.
Our model has multiple advanced capabilities including extreme view robustness and talking face frontalization.
arXiv Detail & Related papers (2021-04-22T15:10:26Z) - Learning to Predict Salient Faces: A Novel Visual-Audio Saliency Model [96.24038430433885]
We propose a novel multi-modal video saliency model consisting of three branches: visual, audio and face.
Experimental results show that the proposed method outperforms 11 state-of-the-art saliency prediction works.
arXiv Detail & Related papers (2021-03-29T09:09:39Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information presented here and is not responsible for any consequences.