Neural Emotion Director: Speech-preserving semantic control of facial
expressions in "in-the-wild" videos
- URL: http://arxiv.org/abs/2112.00585v1
- Date: Wed, 1 Dec 2021 15:55:04 GMT
- Title: Neural Emotion Director: Speech-preserving semantic control of facial
expressions in "in-the-wild" videos
- Authors: Foivos Paraperas Papantoniou, Panagiotis P. Filntisis, Petros Maragos,
Anastasios Roussos
- Abstract summary: We introduce a novel deep learning method for photo-realistic manipulation of the emotional state of actors in "in-the-wild" videos.
The proposed method is based on a parametric 3D face representation of the actor in the input scene that offers a reliable disentanglement of the facial identity from the head pose and facial expressions.
It then uses a novel deep domain translation framework that alters the facial expressions in a consistent and plausible manner, taking into account their dynamics.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we introduce a novel deep learning method for photo-realistic
manipulation of the emotional state of actors in "in-the-wild" videos. The
proposed method is based on a parametric 3D face representation of the actor in
the input scene that offers a reliable disentanglement of the facial identity
from the head pose and facial expressions. It then uses a novel deep domain
translation framework that alters the facial expressions in a consistent and
plausible manner, taking into account their dynamics. Finally, the altered
facial expressions are used to photo-realistically manipulate the facial region
in the input scene based on a specially designed neural face renderer. To the
best of our knowledge, our method is the first to be capable of controlling the
actor's facial expressions by even using as a sole input the semantic labels of
the manipulated emotions, while at the same time preserving the speech-related
lip movements. We conduct extensive qualitative and quantitative evaluations
and comparisons, which demonstrate the effectiveness of our approach and the
especially promising results that we obtain. Our method opens a plethora of new
possibilities for useful applications of neural rendering technologies, ranging
from movie post-production and video games to photo-realistic affective
avatars.
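The abstract describes a three-stage pipeline: fit a parametric 3D face model that disentangles identity, pose, and expression; translate the expression sequence into the target emotion domain while preserving speech-related lip motion; then photo-realistically re-render the face. The toy sketch below illustrates that flow; all function names, the coefficient layout (mouth coefficients at indices 3-5), and the emotion prototypes are illustrative assumptions, not the paper's actual implementation.

```python
# Toy sketch of the three-stage pipeline from the abstract:
# (1) fit a parametric face model, (2) translate expressions toward a
# target emotion while copying speech-related (mouth) coefficients
# through unchanged, (3) re-render. Coefficient layout is hypothetical:
# indices 0-2 drive the upper face, indices 3-5 drive the mouth.
MOUTH = slice(3, 6)

EMOTION_PROTOTYPES = {  # hypothetical target expression coefficients
    "happy":   [0.9, 0.7, 0.4, 0.0, 0.0, 0.0],
    "sad":     [-0.8, -0.5, -0.3, 0.0, 0.0, 0.0],
    "neutral": [0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
}

def fit_3dmm(frame):
    """Stage 1 stand-in: a real fitter would disentangle identity,
    pose, and expression from pixels; here the 'frame' dict already
    carries those fields."""
    return dict(frame)

def translate_expression(expr, emotion, strength=1.0):
    """Stage 2: blend non-mouth coefficients toward the emotion
    prototype; keep mouth coefficients intact to preserve speech."""
    proto = EMOTION_PROTOTYPES[emotion]
    out = [(1 - strength) * e + strength * p for e, p in zip(expr, proto)]
    out[MOUTH] = expr[MOUTH]  # speech-related lip motion passes through
    return out

def neural_render(frame, expr):
    """Stage 3 stand-in: attach the altered expression to the frame."""
    return {**frame, "expression": expr}

def manipulate_video(frames, emotion):
    edited = []
    for f in frames:
        params = fit_3dmm(f)
        new_expr = translate_expression(params["expression"], emotion)
        edited.append(neural_render(f, new_expr))
    return edited

video = [{"identity": [0.1], "pose": [0.0, 0.0, 0.0],
          "expression": [0.0, 0.0, 0.0, 0.5, -0.2, 0.1]}]
edited = manipulate_video(video, "happy")
```

Note how the mouth coefficients of the edited frame match the input exactly while the upper-face coefficients move to the "happy" prototype, mirroring the paper's claim of emotion control with speech preservation.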
Related papers
- GaussianHeads: End-to-End Learning of Drivable Gaussian Head Avatars from Coarse-to-fine Representations [54.94362657501809]
We propose a new method to generate highly dynamic and deformable human head avatars from multi-view imagery in real-time.
At the core of our method is a hierarchical representation of head models that captures the complex dynamics of facial expressions and head movements.
We train this coarse-to-fine facial avatar model along with the head pose as a learnable parameter in an end-to-end framework.
arXiv Detail & Related papers (2024-09-18T13:05:43Z)
- Knowledge-Enhanced Facial Expression Recognition with Emotional-to-Neutral Transformation [66.53435569574135]
Existing facial expression recognition methods typically fine-tune a pre-trained visual encoder using discrete labels.
We observe that the rich knowledge in text embeddings, generated by vision-language models, is a promising alternative for learning discriminative facial expression representations.
We propose a novel knowledge-enhanced FER method with an emotional-to-neutral transformation.
arXiv Detail & Related papers (2024-09-13T07:28:57Z)
- Towards Localized Fine-Grained Control for Facial Expression Generation [54.82883891478555]
Humans, particularly their faces, are central to content generation due to their ability to convey rich expressions and intent.
Current generative models mostly generate flat neutral expressions and characterless smiles without authenticity.
We propose the use of AUs (action units) for facial expression control in face generation.
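To make the AU-based control concrete: FACS action units are localized facial muscle activations (e.g. AU6 cheek raiser, AU12 lip corner puller), and a generator can be conditioned on a vector of their intensities. The index layout below is a hypothetical assumption for illustration; the AU codes themselves are standard FACS.

```python
# Hypothetical sketch of encoding active FACS action units as a
# fixed-length intensity vector for conditioning a face generator.
# The slot assignment (AU_INDEX) is an illustrative assumption.
AU_INDEX = {"AU1": 0, "AU4": 1, "AU6": 2, "AU12": 3, "AU15": 4}

def au_condition(active):
    """Build an intensity vector (clamped to [0, 1]) from a dict of
    active AUs. A Duchenne smile, for example, combines AU6 (cheek
    raiser) with AU12 (lip corner puller)."""
    vec = [0.0] * len(AU_INDEX)
    for au, intensity in active.items():
        vec[AU_INDEX[au]] = max(0.0, min(1.0, intensity))
    return vec

smile = au_condition({"AU6": 0.8, "AU12": 1.0})
```

Because each AU maps to a specific facial region, this kind of conditioning gives the localized, fine-grained control the entry above argues for, rather than a single global emotion label.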
arXiv Detail & Related papers (2024-07-25T18:29:48Z)
- CSTalk: Correlation Supervised Speech-driven 3D Emotional Facial Animation Generation [13.27632316528572]
Speech-driven 3D facial animation has been studied for years, but its practical results still fall short of expectations.
Main challenges lie in data limitations, lip alignment, and the naturalness of facial expressions.
This paper proposes a method called CSTalk that models the correlations among different regions of facial movements and supervises the training of the generative model to generate realistic expressions.
arXiv Detail & Related papers (2024-04-29T11:19:15Z)
- Imitator: Personalized Speech-driven 3D Facial Animation [63.57811510502906]
State-of-the-art methods deform the face topology of the target actor to sync the input audio without considering the identity-specific speaking style and facial idiosyncrasies of the target actor.
We present Imitator, a speech-driven facial expression synthesis method, which learns identity-specific details from a short input video.
We show that our approach produces temporally coherent facial expressions from input audio while preserving the speaking style of the target actors.
arXiv Detail & Related papers (2022-12-30T19:00:02Z)
- PERI: Part Aware Emotion Recognition In The Wild [4.206175795966693]
This paper focuses on emotion recognition using visual features.
We create part aware spatial (PAS) images by extracting key regions from the input image using a mask generated from both body pose and facial landmarks.
We provide our results on the publicly available in the wild EMOTIC dataset.
arXiv Detail & Related papers (2022-10-18T20:01:40Z)
- Continuously Controllable Facial Expression Editing in Talking Face Videos [34.83353695337335]
Speech-related expressions and emotion-related expressions are often highly coupled.
Traditional image-to-image translation methods cannot work well in our application.
We propose a high-quality facial expression editing method for talking face videos.
arXiv Detail & Related papers (2022-09-17T09:05:47Z)
- Emotion Separation and Recognition from a Facial Expression by Generating the Poker Face with Vision Transformers [57.1091606948826]
We propose a novel FER model, named Poker Face Vision Transformer or PF-ViT, to address these challenges.
PF-ViT aims to separate and recognize the disturbance-agnostic emotion from a static facial image via generating its corresponding poker face.
PF-ViT utilizes vanilla Vision Transformers, and its components are pre-trained as Masked Autoencoders on a large facial expression dataset.
arXiv Detail & Related papers (2022-07-22T13:39:06Z)
- Deep Semantic Manipulation of Facial Videos [5.048861360606916]
This paper proposes the first method to perform photorealistic manipulation of facial expressions in videos.
Our method supports semantic video manipulation based on neural rendering and 3D-based facial expression modelling.
The proposed method is based on a disentangled representation and estimation of the 3D facial shape and activity.
arXiv Detail & Related papers (2021-11-15T16:55:16Z)
- Real-time Facial Expression Recognition "In The Wild" by Disentangling 3D Expression from Identity [6.974241731162878]
This paper proposes a novel method for human emotion recognition from a single RGB image.
We construct a large-scale dataset of facial videos, rich in facial dynamics, identities, expressions, appearance and 3D pose variations.
Our proposed framework runs at 50 frames per second and is capable of robustly estimating parameters of 3D expression variation.
arXiv Detail & Related papers (2020-05-12T01:32:55Z)
- MakeItTalk: Speaker-Aware Talking-Head Animation [49.77977246535329]
We present a method that generates expressive talking heads from a single facial image with audio as the only input.
Based on this intermediate representation, our method is able to synthesize photorealistic videos of entire talking heads with full range of motion.
arXiv Detail & Related papers (2020-04-27T17:56:15Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.