Related papers: Neural Emotion Director: Speech-preserving semantic control of facial expressions in "in-the-wild" videos

Neural Emotion Director: Speech-preserving semantic control of facial expressions in "in-the-wild" videos

URL: http://arxiv.org/abs/2112.00585v1
Date: Wed, 1 Dec 2021 15:55:04 GMT
Title: Neural Emotion Director: Speech-preserving semantic control of facial expressions in "in-the-wild" videos
Authors: Foivos Paraperas Papantoniou, Panagiotis P. Filntisis, Petros Maragos, Anastasios Roussos
Abstract summary: We introduce a novel deep learning method for photo-realistic manipulation of the emotional state of actors in "in-the-wild" videos. The proposed method is based on a parametric 3D face representation of the actor in the input scene that offers a reliable disentanglement of the facial identity from the head pose and facial expressions. It then uses a novel deep domain translation framework that alters the facial expressions in a consistent and plausible manner, taking into account their dynamics.
Score: 31.746152261362777
License: http://creativecommons.org/licenses/by/4.0/
Abstract: In this paper, we introduce a novel deep learning method for photo-realistic manipulation of the emotional state of actors in "in-the-wild" videos. The proposed method is based on a parametric 3D face representation of the actor in the input scene that offers a reliable disentanglement of the facial identity from the head pose and facial expressions. It then uses a novel deep domain translation framework that alters the facial expressions in a consistent and plausible manner, taking into account their dynamics. Finally, the altered facial expressions are used to photo-realistically manipulate the facial region in the input scene based on an especially-designed neural face renderer. To the best of our knowledge, our method is the first to be capable of controlling the actor's facial expressions by even using as a sole input the semantic labels of the manipulated emotions, while at the same time preserving the speech-related lip movements. We conduct extensive qualitative and quantitative evaluations and comparisons, which demonstrate the effectiveness of our approach and the especially promising results that we obtain. Our method opens a plethora of new possibilities for useful applications of neural rendering technologies, ranging from movie post-production and video games to photo-realistic affective avatars.

Related papers

EmoDiffusion: Enhancing Emotional 3D Facial Animation with Latent Diffusion Models [66.67979602235015]
EmoDiffusion is a novel approach that disentangles different emotions in speech to generate rich 3D emotional facial expressions. We capture facial expressions under the guidance of animation experts using LiveLinkFace on an iPhone.
arXiv Detail & Related papers (2025-03-14T02:54:22Z)
GaussianHeads: End-to-End Learning of Drivable Gaussian Head Avatars from Coarse-to-fine Representations [54.94362657501809]
We propose a new method to generate highly dynamic and deformable human head avatars from multi-view imagery in real-time. At the core of our method is a hierarchical representation of head models that allows to capture the complex dynamics of facial expressions and head movements. We train this coarse-to-fine facial avatar model along with the head pose as a learnable parameter in an end-to-end framework.
arXiv Detail & Related papers (2024-09-18T13:05:43Z)
Knowledge-Enhanced Facial Expression Recognition with Emotional-to-Neutral Transformation [66.53435569574135]
Existing facial expression recognition methods typically fine-tune a pre-trained visual encoder using discrete labels. We observe that the rich knowledge in text embeddings, generated by vision-language models, is a promising alternative for learning discriminative facial expression representations. We propose a novel knowledge-enhanced FER method with an emotional-to-neutral transformation.
arXiv Detail & Related papers (2024-09-13T07:28:57Z)
Towards Localized Fine-Grained Control for Facial Expression Generation [54.82883891478555]
Humans, particularly their faces, are central to content generation due to their ability to convey rich expressions and intent. Current generative models mostly generate flat neutral expressions and characterless smiles without authenticity. We propose the use of AUs (action units) for facial expression control in face generation.
arXiv Detail & Related papers (2024-07-25T18:29:48Z)
CSTalk: Correlation Supervised Speech-driven 3D Emotional Facial Animation Generation [13.27632316528572]
Speech-driven 3D facial animation technology has been developed for years, but its practical application still lacks expectations. Main challenges lie in data limitations, lip alignment, and the naturalness of facial expressions. This paper proposes a method called CSTalk that models the correlations among different regions of facial movements and supervises the training of the generative model to generate realistic expressions.
arXiv Detail & Related papers (2024-04-29T11:19:15Z)
Imitator: Personalized Speech-driven 3D Facial Animation [63.57811510502906]
State-of-the-art methods deform the face topology of the target actor to sync the input audio without considering the identity-specific speaking style and facial idiosyncrasies of the target actor. We present Imitator, a speech-driven facial expression synthesis method, which learns identity-specific details from a short input video. We show that our approach produces temporally coherent facial expressions from input audio while preserving the speaking style of the target actors.
arXiv Detail & Related papers (2022-12-30T19:00:02Z)
PERI: Part Aware Emotion Recognition In The Wild [4.206175795966693]
This paper focuses on emotion recognition using visual features. We create part aware spatial (PAS) images by extracting key regions from the input image using a mask generated from both body pose and facial landmarks. We provide our results on the publicly available in the wild EMOTIC dataset.
arXiv Detail & Related papers (2022-10-18T20:01:40Z)
Continuously Controllable Facial Expression Editing in Talking Face Videos [34.83353695337335]
Speech-related expressions and emotion-related expressions are often highly coupled. Traditional image-to-image translation methods cannot work well in our application. We propose a high-quality facial expression editing method for talking face videos.
arXiv Detail & Related papers (2022-09-17T09:05:47Z)
Emotion Separation and Recognition from a Facial Expression by Generating the Poker Face with Vision Transformers [57.1091606948826]
We propose a novel FER model, named Poker Face Vision Transformer or PF-ViT, to address these challenges. PF-ViT aims to separate and recognize the disturbance-agnostic emotion from a static facial image via generating its corresponding poker face. PF-ViT utilizes vanilla Vision Transformers, and its components are pre-trained as Masked Autoencoders on a large facial expression dataset.
arXiv Detail & Related papers (2022-07-22T13:39:06Z)
Deep Semantic Manipulation of Facial Videos [5.048861360606916]
This paper proposes the first method to perform photorealistic manipulation of facial expressions in videos. Our method supports semantic video manipulation based on neural rendering and 3D-based facial expression modelling. The proposed method is based on a disentangled representation and estimation of the 3D facial shape and activity.
arXiv Detail & Related papers (2021-11-15T16:55:16Z)
Real-time Facial Expression Recognition "In The Wild'' by Disentangling 3D Expression from Identity [6.974241731162878]
This paper proposes a novel method for human emotion recognition from a single RGB image. We construct a large-scale dataset of facial videos, rich in facial dynamics, identities, expressions, appearance and 3D pose variations. Our proposed framework runs at 50 frames per second and is capable of robustly estimating parameters of 3D expression variation.
arXiv Detail & Related papers (2020-05-12T01:32:55Z)
MakeItTalk: Speaker-Aware Talking-Head Animation [49.77977246535329]
We present a method that generates expressive talking heads from a single facial image with audio as the only input. Based on this intermediate representation, our method is able to synthesize photorealistic videos of entire talking heads with full range of motion.
arXiv Detail & Related papers (2020-04-27T17:56:15Z)

This list is automatically generated from the titles and abstracts of the papers in this site.