That's What I Said: Fully-Controllable Talking Face Generation
- URL: http://arxiv.org/abs/2304.03275v2
- Date: Mon, 18 Sep 2023 12:45:41 GMT
- Title: That's What I Said: Fully-Controllable Talking Face Generation
- Authors: Youngjoon Jang, Kyeongha Rho, Jong-Bin Woo, Hyeongkeun Lee, Jihwan Park, Youshin Lim, Byeong-Yeol Kim, Joon Son Chung
- Abstract summary: We propose two key ideas. The first is to establish a canonical space where every face has the same motion patterns but different identities.
The second is to navigate a multimodal motion space that only represents motion-related features while eliminating identity information.
Our method can generate natural-looking talking faces with fully controllable facial attributes and accurate lip synchronisation.
- Score: 16.570649208028343
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: The goal of this paper is to synthesise talking faces with controllable
facial motions. To achieve this goal, we propose two key ideas. The first is to
establish a canonical space where every face has the same motion patterns but
different identities. The second is to navigate a multimodal motion space that
only represents motion-related features while eliminating identity information.
To disentangle identity and motion, we introduce an orthogonality constraint
between the two different latent spaces. From this, our method can generate
natural-looking talking faces with fully controllable facial attributes and
accurate lip synchronisation. Extensive experiments demonstrate that our method
achieves state-of-the-art results in terms of both visual quality and lip-sync
score. To the best of our knowledge, we are the first to develop a talking face
generation framework that can accurately manifest full target facial motions
including lip, head pose, and eye movements in the generated video without any
additional supervision beyond RGB video with audio.
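The core technical device described above is the orthogonality constraint that keeps the identity and motion latent spaces from encoding the same information. The snippet below is a minimal, hedged sketch of one common way to express such a constraint in PyTorch; the function name, tensor shapes, and loss weight are illustrative assumptions rather than the authors' implementation.
```python
# Illustrative sketch only -- not the authors' code. Shapes and weights are assumptions.
import torch
import torch.nn.functional as F

def orthogonality_loss(id_codes: torch.Tensor, motion_codes: torch.Tensor) -> torch.Tensor:
    """Penalise correlation between identity and motion embeddings.

    id_codes:     (batch, d_id)     features from an identity encoder
    motion_codes: (batch, d_motion) features from an audio/visual motion encoder
    """
    # L2-normalise so the penalty measures shared directions, not magnitudes
    id_n = F.normalize(id_codes, dim=-1)
    mo_n = F.normalize(motion_codes, dim=-1)
    # Batch-averaged cross-correlation between the two latent spaces (d_id x d_motion)
    cross = id_n.transpose(0, 1) @ mo_n / id_codes.shape[0]
    # Driving every entry towards zero discourages the spaces from sharing information
    return (cross ** 2).sum()

# Hypothetical usage inside a training step:
#   loss = recon_loss + sync_loss + 0.1 * orthogonality_loss(id_z, motion_z)
```
Driving the batch cross-correlation matrix towards zero is only one way to impose orthogonality between two embedding spaces; the paper's exact formulation may differ.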
Related papers
- Make Your Actor Talk: Generalizable and High-Fidelity Lip Sync with Motion and Appearance Disentanglement [38.17828583069966]
We aim to edit the lip movements in a talking video according to the given speech while preserving the personal identity and visual details.
To capture motion-agnostic visual details, we use separate encoders to encode the lip, non-lip appearance and motion, and then integrate them with a learned fusion module.
arXiv Detail & Related papers (2024-06-12T11:22:03Z)
- Emotional Conversation: Empowering Talking Faces with Cohesive Expression, Gaze and Pose Generation [12.044308738509402]
We propose a two-stage audio-driven talking face generation framework that employs 3D facial landmarks as intermediate variables.
This framework achieves collaborative alignment of expression, gaze, and pose with emotions through self-supervised learning.
Our model significantly advances the state-of-the-art performance in both visual quality and emotional alignment.
arXiv Detail & Related papers (2024-06-12T06:00:00Z)
- SPEAK: Speech-Driven Pose and Emotion-Adjustable Talking Head Generation [13.459396544300137]
We propose a novel one-shot Talking Head Generation framework (SPEAK) that distinguishes itself from general Talking Face Generation.
We introduce an Inter-Reconstructed Feature Disentanglement (IRFD) module to decouple facial features into three latent spaces.
We then design a face editing module that combines speech content and facial latent codes into a single latent space.
arXiv Detail & Related papers (2024-05-12T11:41:44Z)
- AniTalker: Animate Vivid and Diverse Talking Faces through Identity-Decoupled Facial Motion Encoding [24.486705010561067]
The paper introduces AniTalker, a framework designed to generate lifelike talking faces from a single portrait.
AniTalker effectively captures a wide range of facial dynamics, including subtle expressions and head movements.
arXiv Detail & Related papers (2024-05-06T02:32:41Z)
- Speech2Lip: High-fidelity Speech to Lip Generation by Learning from a Short Video [91.92782707888618]
We present a decomposition-composition framework named Speech to Lip (Speech2Lip) that disentangles speech-sensitive and speech-insensitive motion/appearance.
We show that our model can be trained on a video just a few minutes long and achieve state-of-the-art performance in both visual quality and speech-visual synchronization.
arXiv Detail & Related papers (2023-09-09T14:52:39Z)
- DF-3DFace: One-to-Many Speech Synchronized 3D Face Animation with Diffusion [68.85904927374165]
We propose DF-3DFace, a diffusion-driven speech-to-3D face mesh synthesis.
It captures the complex one-to-many relationships between speech and 3D face based on diffusion.
It also achieves more realistic facial animation than state-of-the-art methods.
arXiv Detail & Related papers (2023-08-23T04:14:55Z)
- Identity-Preserving Talking Face Generation with Landmark and Appearance Priors [106.79923577700345]
Existing person-generic methods have difficulty in generating realistic and lip-synced videos.
We propose a two-stage framework consisting of audio-to-landmark generation and landmark-to-video rendering procedures.
Our method can produce more realistic, lip-synced, and identity-preserving videos than existing person-generic talking face generation methods.
arXiv Detail & Related papers (2023-05-15T01:31:32Z)
- Audio-Driven Talking Face Generation with Diverse yet Realistic Facial Animations [61.65012981435094]
DIRFA is a novel method that can generate talking faces with diverse yet realistic facial animations from the same driving audio.
To accommodate fair variation of plausible facial animations for the same audio, we design a transformer-based probabilistic mapping network.
We show that DIRFA can generate talking faces with realistic facial animations effectively.
arXiv Detail & Related papers (2023-04-18T12:36:15Z)
- Pose-Controllable Talking Face Generation by Implicitly Modularized Audio-Visual Representation [96.66010515343106]
We propose a clean yet effective framework to generate pose-controllable talking faces.
We operate on raw face images, using only a single photo as an identity reference.
Our model has multiple advanced capabilities including extreme view robustness and talking face frontalization.
arXiv Detail & Related papers (2021-04-22T15:10:26Z)
- Identity-Preserving Realistic Talking Face Generation [4.848016645393023]
We propose a method for identity-preserving realistic facial animation from speech.
We impose eye blinks on facial landmarks using unsupervised learning.
We also use LSGAN to generate the facial texture from person-specific facial landmarks.
arXiv Detail & Related papers (2020-05-25T18:08:28Z)
- Audio-driven Talking Face Video Generation with Learning-based Personalized Head Pose [67.31838207805573]
We propose a deep neural network model that takes an audio signal A of a source person and a short video V of a target person as input.
It outputs a synthesized high-quality talking face video with a personalized head pose.
Our method can generate high-quality talking face videos with more distinguishing head movement effects than state-of-the-art methods.
arXiv Detail & Related papers (2020-02-24T10:02:10Z)
This list is automatically generated from the titles and abstracts of the papers on this site.