Audio-Driven Emotional 3D Talking-Head Generation
- URL: http://arxiv.org/abs/2410.17262v1
- Date: Mon, 07 Oct 2024 08:23:05 GMT
- Title: Audio-Driven Emotional 3D Talking-Head Generation
- Authors: Wenqing Wang, Yun Fu,
- Abstract summary: We present a novel system for synthesizing high-fidelity, audio-driven video portraits with accurate emotional expressions.
We propose a pose sampling method that generates natural idle-state (non-speaking) videos in response to silent audio inputs.
- Score: 47.6666060652434
- License:
- Abstract: Audio-driven video portrait synthesis is a crucial and useful technology in virtual human interaction and film-making applications. Recent advancements have focused on improving the image fidelity and lip-synchronization. However, generating accurate emotional expressions is an important aspect of realistic talking-head generation, which has remained underexplored in previous works. We present a novel system in this paper for synthesizing high-fidelity, audio-driven video portraits with accurate emotional expressions. Specifically, we utilize a variational autoencoder (VAE)-based audio-to-motion module to generate facial landmarks. These landmarks are concatenated with emotional embeddings to produce emotional landmarks through our motion-to-emotion module. These emotional landmarks are then used to render realistic emotional talking-head video using a Neural Radiance Fields (NeRF)-based emotion-to-video module. Additionally, we propose a pose sampling method that generates natural idle-state (non-speaking) videos in response to silent audio inputs. Extensive experiments demonstrate that our method obtains more accurate emotion generation with higher fidelity.
Related papers
- EMOdiffhead: Continuously Emotional Control in Talking Head Generation via Diffusion [5.954758598327494]
EMOdiffhead is a novel method for emotional talking head video generation.
It enables fine-grained control of emotion categories and intensities.
It achieves state-of-the-art performance compared to other emotion portrait animation methods.
arXiv Detail & Related papers (2024-09-11T13:23:22Z) - EmoFace: Audio-driven Emotional 3D Face Animation [3.573880705052592]
EmoFace is a novel audio-driven methodology for creating facial animations with vivid emotional dynamics.
Our approach can generate facial expressions with multiple emotions, and has the ability to generate random yet natural blinks and eye movements.
Our proposed methodology can be applied in producing dialogues animations of non-playable characters in video games, and driving avatars in virtual reality environments.
arXiv Detail & Related papers (2024-07-17T11:32:16Z) - EmoSpeaker: One-shot Fine-grained Emotion-Controlled Talking Face
Generation [34.5592743467339]
We propose a visual attribute-guided audio decoupler to generate fine-grained facial animations.
To achieve more precise emotional expression, we introduce a fine-grained emotion coefficient prediction module.
Our proposed method, EmoSpeaker, outperforms existing emotional talking face generation methods in terms of expression variation and lip synchronization.
arXiv Detail & Related papers (2024-02-02T14:04:18Z) - DREAM-Talk: Diffusion-based Realistic Emotional Audio-driven Method for
Single Image Talking Face Generation [75.90730434449874]
We introduce DREAM-Talk, a two-stage diffusion-based audio-driven framework, tailored for generating diverse expressions and accurate lip-sync concurrently.
Given the strong correlation between lip motion and audio, we then refine the dynamics with enhanced lip-sync accuracy using audio features and emotion style.
Both quantitatively and qualitatively, DREAM-Talk outperforms state-of-the-art methods in terms of expressiveness, lip-sync accuracy and perceptual quality.
arXiv Detail & Related papers (2023-12-21T05:03:18Z) - Emotional Speech-Driven Animation with Content-Emotion Disentanglement [51.34635009347183]
We propose EMOTE, which generates 3D talking-head avatars that maintain lip-sync from speech while enabling explicit control over the expression of emotion.
EmOTE produces speech-driven facial animations with better lip-sync than state-of-the-art methods trained on the same data.
arXiv Detail & Related papers (2023-06-15T09:31:31Z) - EAMM: One-Shot Emotional Talking Face via Audio-Based Emotion-Aware
Motion Model [32.19539143308341]
We propose the Emotion-Aware Motion Model (EAMM) to generate one-shot emotional talking faces.
By incorporating the results from both modules, our method can generate satisfactory talking face results on arbitrary subjects.
arXiv Detail & Related papers (2022-05-30T17:39:45Z) - Audio-Driven Emotional Video Portraits [79.95687903497354]
We present Emotional Video Portraits (EVP), a system for synthesizing high-quality video portraits with vivid emotional dynamics driven by audios.
Specifically, we propose the Cross-Reconstructed Emotion Disentanglement technique to decompose speech into two decoupled spaces.
With the disentangled features, dynamic 2D emotional facial landmarks can be deduced.
Then we propose the Target-Adaptive Face Synthesis technique to generate the final high-quality video portraits.
arXiv Detail & Related papers (2021-04-15T13:37:13Z) - Seen and Unseen emotional style transfer for voice conversion with a new
emotional speech dataset [84.53659233967225]
Emotional voice conversion aims to transform emotional prosody in speech while preserving the linguistic content and speaker identity.
We propose a novel framework based on variational auto-encoding Wasserstein generative adversarial network (VAW-GAN)
We show that the proposed framework achieves remarkable performance by consistently outperforming the baseline framework.
arXiv Detail & Related papers (2020-10-28T07:16:18Z) - An End-to-End Visual-Audio Attention Network for Emotion Recognition in
User-Generated Videos [64.91614454412257]
We propose to recognize video emotions in an end-to-end manner based on convolutional neural networks (CNNs)
Specifically, we develop a deep Visual-Audio Attention Network (VAANet), a novel architecture that integrates spatial, channel-wise, and temporal attentions into a visual 3D CNN and temporal attentions into an audio 2D CNN.
arXiv Detail & Related papers (2020-02-12T15:33:59Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.