EmotionGesture: Audio-Driven Diverse Emotional Co-Speech 3D Gesture
Generation
- URL: http://arxiv.org/abs/2305.18891v2
- Date: Wed, 3 Jan 2024 06:55:36 GMT
- Title: EmotionGesture: Audio-Driven Diverse Emotional Co-Speech 3D Gesture
Generation
- Authors: Xingqun Qi, Chen Liu, Lincheng Li, Jie Hou, Haoran Xin, Xin Yu
- Abstract summary: EmotionGesture is a novel framework for synthesizing vivid and diverse emotional co-speech 3D gestures from audio.
Our framework outperforms the state-of-the-art, achieving vivid and diverse emotional co-speech 3D gestures.
- Score: 24.547098909937034
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Generating vivid and diverse 3D co-speech gestures is crucial for various
applications in animating virtual avatars. While most existing methods can
generate gestures from audio directly, they usually overlook that emotion is
one of the key factors of authentic co-speech gesture generation. In this work,
we propose EmotionGesture, a novel framework for synthesizing vivid and diverse
emotional co-speech 3D gestures from audio. Since emotion is often entangled
with the rhythmic beat in speech audio, we first develop an Emotion-Beat Mining
module (EBM) to extract the emotion and audio beat features as well as model
their correlation via a transcript-based visual-rhythm alignment. Then, we
propose an initial-pose-based Spatial-Temporal Prompter (STP) to generate
future gestures from the given initial poses. STP effectively models the
spatial-temporal correlations between the initial poses and the future
gestures, thus producing a spatially and temporally coherent pose prompt. Once
we obtain the pose prompt, emotion features, and audio beat features, we
generate 3D co-speech gestures through a transformer architecture. However,
since the poses in existing datasets often contain jittering artifacts,
training on them directly leads to unstable generated gestures. To address
this issue, we propose an effective objective function, dubbed Motion-Smooth
Loss. Specifically, we model the frame-to-frame motion offset to compensate
for the jittery ground truth, forcing the generated gestures to be smooth.
Finally, we present an emotion-conditioned VAE to sample emotion features,
enabling us to generate diverse emotional results. Extensive experiments
demonstrate that our framework outperforms the state-of-the-art, achieving
vivid and diverse emotional co-speech 3D gestures. Our code and dataset will be
released at the project page:
https://xingqunqi-lab.github.io/Emotion-Gesture-Web/
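The abstract describes the Motion-Smooth Loss and the emotion-conditioned VAE sampling only at a high level. The snippet below is a minimal sketch of one plausible reading, not the authors' released implementation: the function names, tensor shapes, the L1 penalty on first-order frame offsets, and the Gaussian reparameterization are all assumptions made for illustration.

```python
import torch


def motion_smooth_loss(pred_poses: torch.Tensor, gt_poses: torch.Tensor) -> torch.Tensor:
    """Sketch of a motion-offset smoothness objective (assumed formulation).

    Instead of matching jittery ground-truth poses frame by frame, the loss
    compares frame-to-frame motion offsets, which damps high-frequency jitter
    while still following the overall trajectory.
    Expected shapes: (batch, frames, joints, 3).
    """
    pred_offset = pred_poses[:, 1:] - pred_poses[:, :-1]  # per-frame motion
    gt_offset = gt_poses[:, 1:] - gt_poses[:, :-1]
    return torch.mean(torch.abs(pred_offset - gt_offset))


def sample_emotion_feature(mu: torch.Tensor, logvar: torch.Tensor) -> torch.Tensor:
    """Reparameterized sampling from an emotion-conditioned latent (assumed).

    Drawing different latent samples for the same audio is what would yield
    diverse emotional gestures, as in a standard conditional VAE.
    """
    std = torch.exp(0.5 * logvar)
    return mu + std * torch.randn_like(std)
```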
Related papers
- Audio-Driven Emotional 3D Talking-Head Generation [47.6666060652434]
We present a novel system for synthesizing high-fidelity, audio-driven video portraits with accurate emotional expressions.
We propose a pose sampling method that generates natural idle-state (non-speaking) videos in response to silent audio inputs.
arXiv Detail & Related papers (2024-10-07T08:23:05Z)
- DEEPTalk: Dynamic Emotion Embedding for Probabilistic Speech-Driven 3D Face Animation [14.07086606183356]
Speech-driven 3D facial animation has garnered lots of attention thanks to its broad range of applications.
Current methods fail to capture the nuanced emotional undertones conveyed through speech and produce monotonous facial motion.
We introduce DEEPTalk, a novel approach that generates diverse and emotionally rich 3D facial expressions directly from speech inputs.
arXiv Detail & Related papers (2024-08-12T08:56:49Z)
- EmoSpeaker: One-shot Fine-grained Emotion-Controlled Talking Face Generation [34.5592743467339]
We propose a visual attribute-guided audio decoupler to generate fine-grained facial animations.
To achieve more precise emotional expression, we introduce a fine-grained emotion coefficient prediction module.
Our proposed method, EmoSpeaker, outperforms existing emotional talking face generation methods in terms of expression variation and lip synchronization.
arXiv Detail & Related papers (2024-02-02T14:04:18Z)
- DREAM-Talk: Diffusion-based Realistic Emotional Audio-driven Method for Single Image Talking Face Generation [75.90730434449874]
We introduce DREAM-Talk, a two-stage diffusion-based audio-driven framework, tailored for generating diverse expressions and accurate lip-sync concurrently.
Given the strong correlation between lip motion and audio, we then refine the dynamics with enhanced lip-sync accuracy using audio features and emotion style.
Both quantitatively and qualitatively, DREAM-Talk outperforms state-of-the-art methods in terms of expressiveness, lip-sync accuracy and perceptual quality.
arXiv Detail & Related papers (2023-12-21T05:03:18Z)
- Emotional Speech-driven 3D Body Animation via Disentangled Latent Diffusion [45.081371413693425]
Existing methods for synthesizing 3D human gestures from speech have shown promising results.
We present AMUSE, an emotional speech-driven body animation model based on latent diffusion.
arXiv Detail & Related papers (2023-12-07T17:39:25Z)
- Weakly-Supervised Emotion Transition Learning for Diverse 3D Co-speech Gesture Generation [43.04371187071256]
We present a novel method to generate vivid and emotional 3D co-speech gestures in 3D avatars.
We use ChatGPT-4 and an audio inpainting approach to construct high-fidelity emotion-transition human speech.
Our method outperforms the state-of-the-art models constructed by adapting single emotion-conditioned counterparts.
arXiv Detail & Related papers (2023-11-29T11:10:40Z)
- Emotional Speech-Driven Animation with Content-Emotion Disentanglement [51.34635009347183]
We propose EMOTE, which generates 3D talking-head avatars that maintain lip-sync from speech while enabling explicit control over the expression of emotion.
EMOTE produces speech-driven facial animations with better lip-sync than state-of-the-art methods trained on the same data.
arXiv Detail & Related papers (2023-06-15T09:31:31Z)
- Audio-Driven Co-Speech Gesture Video Generation [92.15661971086746]
We define and study this challenging problem of audio-driven co-speech gesture video generation.
Our key insight is that the co-speech gestures can be decomposed into common motion patterns and subtle rhythmic dynamics.
We propose a novel framework, Audio-driveN Gesture vIdeo gEneration (ANGIE), to effectively capture the reusable co-speech gesture patterns.
arXiv Detail & Related papers (2022-12-05T15:28:22Z)
- EMOVIE: A Mandarin Emotion Speech Dataset with a Simple Emotional Text-to-Speech Model [56.75775793011719]
We introduce and publicly release a Mandarin emotion speech dataset of 9,724 samples with audio files and human-labeled emotion annotations.
Unlike models that require additional reference audio as input, our model can predict emotion labels from the input text alone and generate more expressive speech conditioned on the emotion embedding.
In the experiment phase, we first validate the effectiveness of our dataset by an emotion classification task. Then we train our model on the proposed dataset and conduct a series of subjective evaluations.
arXiv Detail & Related papers (2021-06-17T08:34:21Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.