Visual Speech-Aware Perceptual 3D Facial Expression Reconstruction from
Videos
- URL: http://arxiv.org/abs/2207.11094v1
- Date: Fri, 22 Jul 2022 14:07:46 GMT
- Title: Visual Speech-Aware Perceptual 3D Facial Expression Reconstruction from
Videos
- Authors: Panagiotis P. Filntisis, George Retsinas, Foivos
Paraperas-Papantoniou, Athanasios Katsamanis, Anastasios Roussos, Petros
Maragos
- Abstract summary: We present the first method for visual speech-aware perceptual reconstruction of 3D mouth expressions.
We do this by proposing a "lipread" loss, which guides the fitting process so that the elicited perception from the 3D reconstructed talking head resembles that of the original video footage.
- Score: 32.48058491211032
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The recent state of the art on monocular 3D face reconstruction from image
data has made some impressive advancements, thanks to the advent of Deep
Learning. However, it has mostly focused on input coming from a single RGB
image, overlooking the following important factors: a) Nowadays, the vast
majority of facial image data of interest do not originate from single images
but rather from videos, which contain rich dynamic information. b) Furthermore,
these videos typically capture individuals in some form of verbal communication
(public talks, teleconferences, audiovisual human-computer interactions,
interviews, monologues/dialogues in movies, etc). When existing 3D face
reconstruction methods are applied in such videos, the artifacts in the
reconstruction of the shape and motion of the mouth area are often severe,
since they do not match well with the speech audio.
To overcome the aforementioned limitations, we present the first method for
visual speech-aware perceptual reconstruction of 3D mouth expressions. We do
this by proposing a "lipread" loss, which guides the fitting process so that
the elicited perception from the 3D reconstructed talking head resembles that
of the original video footage. We demonstrate that, interestingly, the lipread
loss is better suited for 3D reconstruction of mouth movements compared to
traditional landmark losses, and even direct 3D supervision. Furthermore, the
devised method does not rely on any text transcriptions or corresponding audio,
rendering it ideal for training in unlabeled datasets. We verify the efficiency
of our method through exhaustive objective evaluations on three large-scale
datasets, as well as subjective evaluation with two web-based user studies.
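
To make the role of the lipread loss concrete, below is a minimal sketch (not the authors' implementation) of how such a perceptual loss could be computed: mouth crops from the differentiably rendered 3D head and from the original footage are both passed through a frozen, pretrained lipreading network, and the discrepancy between the resulting visual-speech features is penalized. The names lipread_net and crop_mouth are hypothetical placeholders for components the paper does not specify here.

```python
import torch
import torch.nn.functional as F

def lipread_loss(rendered_frames, original_frames, lipread_net, crop_mouth):
    """Sketch of a perceptual "lipread" loss (assumed interface, not the paper's code).

    rendered_frames, original_frames: (T, C, H, W) frame tensors; the rendered
    frames must come from a differentiable renderer so gradients can reach the
    3D expression parameters. lipread_net is a frozen, pretrained lipreading
    feature extractor; crop_mouth returns mouth-region crops in the format
    lipread_net expects.
    """
    # Visual-speech features of the original footage serve as the fixed target.
    with torch.no_grad():
        target_feats = lipread_net(crop_mouth(original_frames))

    # Features of the rendered talking head; gradients flow back through the
    # (frozen) lipreading network into the rendered frames.
    pred_feats = lipread_net(crop_mouth(rendered_frames))

    # Penalize the perceptual discrepancy between the two feature sequences.
    return 1.0 - F.cosine_similarity(pred_feats, target_feats, dim=-1).mean()
```

In a fitting pipeline, a term like this would be combined with the usual landmark and regularization losses; the abstract's claim is that it supervises mouth articulation better than landmark terms or even direct 3D supervision, and that it needs no transcriptions or audio.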
Related papers
- Learn2Talk: 3D Talking Face Learns from 2D Talking Face [15.99315075587735]
We propose a learning framework named Learn2Talk, which can construct a better 3D talking face network.
Inspired by the audio-video sync network, a 3D sync-lip expert model is devised for the pursuit of lip-sync.
A teacher model selected from 2D talking face methods is used to guide the training of the audio-to-3D motions regression network.
arXiv Detail & Related papers (2024-04-19T13:45:14Z)
- Real3D-Portrait: One-shot Realistic 3D Talking Portrait Synthesis [88.17520303867099]
One-shot 3D talking portrait generation aims to reconstruct a 3D avatar from an unseen image, and then animate it with a reference video or audio.
We present Real3D-Portrait, a framework that improves the one-shot 3D reconstruction power with a large image-to-plane model.
Experiments show that Real3D-Portrait generalizes well to unseen identities and generates more realistic talking portrait videos.
arXiv Detail & Related papers (2024-01-16T17:04:30Z)
- Neural Text to Articulate Talk: Deep Text to Audiovisual Speech Synthesis achieving both Auditory and Photo-realism [26.180371869137257]
The state of the art in talking face generation focuses mainly on lip-syncing, conditioned on audio clips.
NEUral Text to ARticulate Talk (NEUTART) is a talking face generator that uses a joint audiovisual feature space.
The model produces photorealistic talking face videos with human-like articulation and well-synced audiovisual streams.
arXiv Detail & Related papers (2023-12-11T18:41:55Z)
- Speech2Lip: High-fidelity Speech to Lip Generation by Learning from a Short Video [91.92782707888618]
We present a decomposition-composition framework named Speech to Lip (Speech2Lip) that disentangles speech-sensitive and speech-insensitive motion/appearance.
We show that our model can be trained on a video just a few minutes long and achieves state-of-the-art performance in both visual quality and speech-visual synchronization.
arXiv Detail & Related papers (2023-09-09T14:52:39Z)
- Identity-Preserving Talking Face Generation with Landmark and Appearance Priors [106.79923577700345]
Existing person-generic methods have difficulty in generating realistic and lip-synced videos.
We propose a two-stage framework consisting of audio-to-landmark generation and landmark-to-video rendering procedures.
Our method can produce more realistic, lip-synced, and identity-preserving videos than existing person-generic talking face generation methods.
arXiv Detail & Related papers (2023-05-15T01:31:32Z)
- AVFace: Towards Detailed Audio-Visual 4D Face Reconstruction [33.78412925549308]
3D face reconstruction from 2D images is an under-constrained problem due to the ambiguity of depth.
We propose AVFace that incorporates both modalities and accurately reconstructs the 4D facial and lip motion of any speaker.
arXiv Detail & Related papers (2023-04-25T19:41:10Z)
- EMOCA: Emotion Driven Monocular Face Capture and Animation [59.15004328155593]
We introduce a novel deep perceptual emotion consistency loss during training, which helps ensure that the reconstructed 3D expression matches the expression depicted in the input image.
On the task of in-the-wild emotion recognition, our purely geometric approach is on par with the best image-based methods, highlighting the value of 3D geometry in analyzing human behavior.
arXiv Detail & Related papers (2022-04-24T15:58:35Z)
- Depth-Aware Generative Adversarial Network for Talking Head Video Generation [15.43672834991479]
Talking head video generation aims to produce a synthetic human face video that contains the identity and pose information respectively from a given source image and a driving video.
Existing works for this task heavily rely on 2D representations (e.g. appearance and motion) learned from the input images.
In this paper, we introduce a self-supervised geometry learning method to automatically recover the dense 3D geometry (i.e., depth) from the face videos.
arXiv Detail & Related papers (2022-03-13T09:32:22Z)
- LipSync3D: Data-Efficient Learning of Personalized 3D Talking Faces from Video using Pose and Lighting Normalization [4.43316916502814]
We present a video-based learning framework for animating personalized 3D talking faces from audio.
We introduce two training-time data normalizations that significantly improve data sample efficiency.
Our method outperforms contemporary state-of-the-art audio-driven video reenactment benchmarks in terms of realism, lip-sync and visual quality scores.
arXiv Detail & Related papers (2021-06-08T08:56:40Z)
- Learning to Predict Salient Faces: A Novel Visual-Audio Saliency Model [96.24038430433885]
We propose a novel multi-modal video saliency model consisting of three branches: visual, audio and face.
Experimental results show that the proposed method outperforms 11 state-of-the-art saliency prediction works.
arXiv Detail & Related papers (2021-03-29T09:09:39Z)
- Audio-driven Talking Face Video Generation with Learning-based Personalized Head Pose [67.31838207805573]
We propose a deep neural network model that takes an audio signal A of a source person and a short video V of a target person as input.
It outputs a synthesized high-quality talking face video with a personalized head pose.
Our method can generate high-quality talking face videos with more distinguishing head movement effects than state-of-the-art methods.
arXiv Detail & Related papers (2020-02-24T10:02:10Z)