Invertable Frowns: Video-to-Video Facial Emotion Translation
- URL: http://arxiv.org/abs/2109.08061v1
- Date: Thu, 16 Sep 2021 15:43:51 GMT
- Title: Invertable Frowns: Video-to-Video Facial Emotion Translation
- Authors: Ian Magnusson and Aruna Sankaranarayanan and Andrew Lippman
- Abstract summary: We present Wav2Lip-Emotion, a video-to-video translation architecture that modifies facial expressions of emotion in videos of speakers.
We extend an existing multi-modal lip synchronization architecture to modify the speaker's emotion using L1 reconstruction and pre-trained emotion objectives.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present Wav2Lip-Emotion, a video-to-video translation architecture that
modifies facial expressions of emotion in videos of speakers. Previous work
modifies emotion in images, uses a single image to produce a video with
animated emotion, or puppets facial expressions in videos with landmarks from a
reference video. However, many use cases such as modifying an actor's
performance in post-production, coaching individuals to be more animated
speakers, or touching up emotion in a teleconference require a video-to-video
translation approach. We explore a method to maintain speakers' lip movements,
identity, and pose while translating their expressed emotion. Our approach
extends an existing multi-modal lip synchronization architecture to modify the
speaker's emotion using L1 reconstruction and pre-trained emotion objectives.
We also propose a novel automated emotion evaluation approach and corroborate
it with a user study. Both find that we succeed in modifying emotion while
maintaining lip synchronization. Visual quality is somewhat diminished, and model
variants trade off greater emotion modification against visual quality.
Nevertheless, we demonstrate (1) that facial expressions of emotion
can be modified with nothing other than L1 reconstruction and pre-trained
emotion objectives and (2) that our automated emotion evaluation approach
aligns with human judgements.
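As a rough sketch of the objective described above, the example below combines an L1 reconstruction term with a loss from a frozen, pre-trained emotion classifier. This is an illustrative assumption, not the authors' released code: the names `generator` and `emotion_classifier` and the loss weights are hypothetical.

```python
# Minimal sketch (an assumption, not the paper's code) of combining an L1
# reconstruction loss with a pre-trained emotion objective, as the abstract
# describes. `generator` and `emotion_classifier` are hypothetical callables.
import torch.nn.functional as F


def emotion_translation_loss(generator, emotion_classifier,
                             frames, audio, target_emotion,
                             lambda_recon=1.0, lambda_emotion=0.1):
    """Combined training loss for one batch.

    frames:         ground-truth face frames, shape (B, C, H, W)
    audio:          aligned audio features for the lip-sync branch
    target_emotion: integer emotion labels, shape (B,)
    """
    generated = generator(frames, audio)  # emotion-translated frames

    # L1 reconstruction keeps identity, pose, and lip movements close to the input.
    recon_loss = F.l1_loss(generated, frames)

    # The emotion classifier is pre-trained and frozen (requires_grad=False on its
    # parameters); gradients still flow through it back to the generator.
    emotion_logits = emotion_classifier(generated)
    emotion_loss = F.cross_entropy(emotion_logits, target_emotion)

    return lambda_recon * recon_loss + lambda_emotion * emotion_loss
```

How the two terms are weighted is presumably where the reported trade-off between emotion modification and visual quality shows up; the full Wav2Lip-Emotion training likely also retains the pre-trained lip-sync expert loss inherited from Wav2Lip and may differ in detail from this sketch.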
Related papers
- EmoDiffusion: Enhancing Emotional 3D Facial Animation with Latent Diffusion Models [66.67979602235015]
EmoDiffusion is a novel approach that disentangles different emotions in speech to generate rich 3D emotional facial expressions.
We capture facial expressions under the guidance of animation experts using LiveLinkFace on an iPhone.
arXiv Detail & Related papers (2025-03-14T02:54:22Z)
- Audio-Driven Emotional 3D Talking-Head Generation [47.6666060652434]
We present a novel system for synthesizing high-fidelity, audio-driven video portraits with accurate emotional expressions.
We propose a pose sampling method that generates natural idle-state (non-speaking) videos in response to silent audio inputs.
arXiv Detail & Related papers (2024-10-07T08:23:05Z)
- EMOdiffhead: Continuously Emotional Control in Talking Head Generation via Diffusion [5.954758598327494]
EMOdiffhead is a novel method for emotional talking head video generation.
It enables fine-grained control of emotion categories and intensities.
It achieves state-of-the-art performance compared to other emotion portrait animation methods.
arXiv Detail & Related papers (2024-09-11T13:23:22Z)
- EmoSpeaker: One-shot Fine-grained Emotion-Controlled Talking Face Generation [34.5592743467339]
We propose a visual attribute-guided audio decoupler to generate fine-grained facial animations.
To achieve more precise emotional expression, we introduce a fine-grained emotion coefficient prediction module.
Our proposed method, EmoSpeaker, outperforms existing emotional talking face generation methods in terms of expression variation and lip synchronization.
arXiv Detail & Related papers (2024-02-02T14:04:18Z)
- Attention-based Interactive Disentangling Network for Instance-level Emotional Voice Conversion [81.1492897350032]
Emotional Voice Conversion aims to manipulate speech according to a given emotion while preserving non-emotion components.
We propose an Attention-based Interactive diseNtangling Network (AINN) that leverages instance-wise emotional knowledge for voice conversion.
arXiv Detail & Related papers (2023-12-29T08:06:45Z)
- DREAM-Talk: Diffusion-based Realistic Emotional Audio-driven Method for Single Image Talking Face Generation [75.90730434449874]
We introduce DREAM-Talk, a two-stage diffusion-based audio-driven framework, tailored for generating diverse expressions and accurate lip-sync concurrently.
Given the strong correlation between lip motion and audio, we then refine the dynamics with enhanced lip-sync accuracy using audio features and emotion style.
Both quantitatively and qualitatively, DREAM-Talk outperforms state-of-the-art methods in terms of expressiveness, lip-sync accuracy and perceptual quality.
arXiv Detail & Related papers (2023-12-21T05:03:18Z)
- Emotional Speech-Driven Animation with Content-Emotion Disentanglement [51.34635009347183]
We propose EMOTE, which generates 3D talking-head avatars that maintain lip-sync from speech while enabling explicit control over the expression of emotion.
EMOTE produces speech-driven facial animations with better lip-sync than state-of-the-art methods trained on the same data.
arXiv Detail & Related papers (2023-06-15T09:31:31Z)
- Learning to Dub Movies via Hierarchical Prosody Models [167.6465354313349]
Given a piece of text, a video clip, and a reference audio, the movie dubbing task (also known as visual voice clone, V2C) aims to generate speech that matches the speaker's emotion presented in the video, using the desired speaker voice as reference.
We propose a novel movie dubbing architecture to tackle these problems via hierarchical prosody modelling, which bridges the visual information to corresponding speech prosody from three aspects: lip, face, and scene.
arXiv Detail & Related papers (2022-12-08T03:29:04Z)
- EAMM: One-Shot Emotional Talking Face via Audio-Based Emotion-Aware Motion Model [32.19539143308341]
We propose the Emotion-Aware Motion Model (EAMM) to generate one-shot emotional talking faces.
By incorporating the results from both modules, our method can generate satisfactory talking face results on arbitrary subjects.
arXiv Detail & Related papers (2022-05-30T17:39:45Z)
- Audio-Driven Emotional Video Portraits [79.95687903497354]
We present Emotional Video Portraits (EVP), a system for synthesizing high-quality video portraits with vivid emotional dynamics driven by audios.
Specifically, we propose the Cross-Reconstructed Emotion Disentanglement technique to decompose speech into two decoupled spaces.
With the disentangled features, dynamic 2D emotional facial landmarks can be deduced.
Then we propose the Target-Adaptive Face Synthesis technique to generate the final high-quality video portraits.
arXiv Detail & Related papers (2021-04-15T13:37:13Z)