Invertable Frowns: Video-to-Video Facial Emotion Translation
- URL: http://arxiv.org/abs/2109.08061v1
- Date: Thu, 16 Sep 2021 15:43:51 GMT
- Title: Invertable Frowns: Video-to-Video Facial Emotion Translation
- Authors: Ian Magnusson and Aruna Sankaranarayanan and Andrew Lippman
- Abstract summary: We present Wav2Lip-Emotion, a video-to-video translation architecture that modifies facial expressions of emotion in videos of speakers.
We extend an existing multi-modal lip synchronization architecture to modify the speaker's emotion using L1 reconstruction and pre-trained emotion objectives.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present Wav2Lip-Emotion, a video-to-video translation architecture that
modifies facial expressions of emotion in videos of speakers. Previous work
modifies emotion in images, uses a single image to produce a video with
animated emotion, or puppets facial expressions in videos with landmarks from a
reference video. However, many use cases such as modifying an actor's
performance in post-production, coaching individuals to be more animated
speakers, or touching up emotion in a teleconference require a video-to-video
translation approach. We explore a method to maintain speakers' lip movements,
identity, and pose while translating their expressed emotion. Our approach
extends an existing multi-modal lip synchronization architecture to modify the
speaker's emotion using L1 reconstruction and pre-trained emotion objectives.
We also propose a novel automated emotion evaluation approach and corroborate
it with a user study. Both evaluations find that we succeed in modifying emotion
while maintaining lip synchronization. Visual quality is somewhat diminished, and
model variants trade off greater emotion modification against visual quality.
Nevertheless, we demonstrate (1) that facial expressions of emotion
can be modified with nothing other than L1 reconstruction and pre-trained
emotion objectives and (2) that our automated emotion evaluation approach
aligns with human judgements.
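As a rough illustration of the training signal the abstract describes, the sketch below combines an L1 reconstruction term with a cross-entropy term computed from a pre-trained emotion classifier's output. It is a minimal sketch, not the authors' implementation: the function names, the `emotion_weight` parameter, the convex weighting of the two terms, and the use of cross-entropy for the emotion objective are all assumptions made for illustration.

```python
import numpy as np


def l1_reconstruction_loss(generated, target):
    """Mean absolute pixel difference between generated and reference frames."""
    return float(np.mean(np.abs(generated - target)))


def emotion_loss(emotion_probs, target_emotion):
    """Cross-entropy of classifier probabilities against the desired emotion.

    emotion_probs: array of shape (batch, num_emotions), assumed to come from
    a frozen, pre-trained emotion classifier applied to the generated frames.
    target_emotion: integer index of the emotion to translate towards.
    """
    eps = 1e-12  # guards against log(0)
    return float(-np.mean(np.log(emotion_probs[:, target_emotion] + eps)))


def combined_loss(generated, target, emotion_probs, target_emotion,
                  emotion_weight=0.5):
    """Hypothetical weighted sum of the two objectives (weighting is assumed)."""
    rec = l1_reconstruction_loss(generated, target)
    emo = emotion_loss(emotion_probs, target_emotion)
    return (1.0 - emotion_weight) * rec + emotion_weight * emo
```

Under this assumed weighting, raising `emotion_weight` pushes the generator towards stronger emotion modification at the expense of pixel-level fidelity, which mirrors the trade-off between model variants that the abstract reports.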
Related papers
- EmoFace: Audio-driven Emotional 3D Face Animation [3.573880705052592]
EmoFace is a novel audio-driven methodology for creating facial animations with vivid emotional dynamics.
Our approach can generate facial expressions with multiple emotions, and has the ability to generate random yet natural blinks and eye movements.
Our proposed methodology can be applied to producing dialogue animations for non-playable characters in video games and to driving avatars in virtual reality environments.
arXiv Detail & Related papers (2024-07-17T11:32:16Z)
- EmoSpeaker: One-shot Fine-grained Emotion-Controlled Talking Face Generation [34.5592743467339]
We propose a visual attribute-guided audio decoupler to generate fine-grained facial animations.
To achieve more precise emotional expression, we introduce a fine-grained emotion coefficient prediction module.
Our proposed method, EmoSpeaker, outperforms existing emotional talking face generation methods in terms of expression variation and lip synchronization.
arXiv Detail & Related papers (2024-02-02T14:04:18Z)
- Attention-based Interactive Disentangling Network for Instance-level Emotional Voice Conversion [81.1492897350032]
Emotional Voice Conversion aims to manipulate speech according to a given emotion while preserving its non-emotion components.
We propose an Attention-based Interactive diseNtangling Network (AINN) that leverages instance-wise emotional knowledge for voice conversion.
arXiv Detail & Related papers (2023-12-29T08:06:45Z)
- DREAM-Talk: Diffusion-based Realistic Emotional Audio-driven Method for Single Image Talking Face Generation [75.90730434449874]
We introduce DREAM-Talk, a two-stage diffusion-based audio-driven framework, tailored for generating diverse expressions and accurate lip-sync concurrently.
Given the strong correlation between lip motion and audio, we then refine the dynamics with enhanced lip-sync accuracy using audio features and emotion style.
Both quantitatively and qualitatively, DREAM-Talk outperforms state-of-the-art methods in terms of expressiveness, lip-sync accuracy and perceptual quality.
arXiv Detail & Related papers (2023-12-21T05:03:18Z)
- Emotional Speech-Driven Animation with Content-Emotion Disentanglement [51.34635009347183]
We propose EMOTE, which generates 3D talking-head avatars that maintain lip-sync from speech while enabling explicit control over the expression of emotion.
EMOTE produces speech-driven facial animations with better lip-sync than state-of-the-art methods trained on the same data.
arXiv Detail & Related papers (2023-06-15T09:31:31Z)
- Emotionally Enhanced Talking Face Generation [52.07451348895041]
We build a talking face generation framework conditioned on a categorical emotion to generate videos with appropriate expressions.
We show that our model can adapt to arbitrary identities, emotions, and languages.
Our proposed framework is equipped with a user-friendly web interface with a real-time experience for talking face generation with emotions.
arXiv Detail & Related papers (2023-03-21T02:33:27Z)
- Learning to Dub Movies via Hierarchical Prosody Models [167.6465354313349]
Given a piece of text, a video clip, and a reference audio, the movie dubbing task (also known as visual voice cloning, V2C) aims to generate speech that matches the speaker's emotion presented in the video, using the desired speaker's voice as reference.
We propose a novel movie dubbing architecture to tackle these problems via hierarchical prosody modelling, which bridges the visual information to corresponding speech prosody from three aspects: lip, face, and scene.
arXiv Detail & Related papers (2022-12-08T03:29:04Z)
- EAMM: One-Shot Emotional Talking Face via Audio-Based Emotion-Aware Motion Model [32.19539143308341]
We propose the Emotion-Aware Motion Model (EAMM) to generate one-shot emotional talking faces.
By incorporating the results from both modules, our method can generate satisfactory talking face results on arbitrary subjects.
arXiv Detail & Related papers (2022-05-30T17:39:45Z)
- Audio-Driven Emotional Video Portraits [79.95687903497354]
We present Emotional Video Portraits (EVP), a system for synthesizing high-quality video portraits with vivid emotional dynamics driven by audio.
Specifically, we propose the Cross-Reconstructed Emotion Disentanglement technique to decompose speech into two decoupled spaces.
With the disentangled features, dynamic 2D emotional facial landmarks can be deduced.
Then we propose the Target-Adaptive Face Synthesis technique to generate the final high-quality video portraits.
arXiv Detail & Related papers (2021-04-15T13:37:13Z)
- Speech Driven Talking Face Generation from a Single Image and an Emotion Condition [28.52180268019401]
We propose a novel approach to rendering visual emotion expression in speech-driven talking face generation.
We design an end-to-end talking face generation system that takes a speech utterance, a single face image, and a categorical emotion label as input.
Objective evaluation on image quality, audiovisual synchronization, and visual emotion expression shows that the proposed system outperforms a state-of-the-art baseline system.
arXiv Detail & Related papers (2020-08-08T20:46:31Z)
This list is automatically generated from the titles and abstracts of the papers on this site.