Controllable Expressive 3D Facial Animation via Diffusion in a Unified Multimodal Space
- URL: http://arxiv.org/abs/2506.10007v1
- Date: Mon, 14 Apr 2025 01:38:01 GMT
- Title: Controllable Expressive 3D Facial Animation via Diffusion in a Unified Multimodal Space
- Authors: Kangwei Liu, Junwu Liu, Xiaowei Yi, Jinlin Guo, Yun Cao
- Abstract summary: We present a diffusion-based framework for controllable expressive 3D facial animation. Our approach introduces two key innovations: (1) a FLAME-centered multimodal emotion binding strategy, and (2) an attention-based latent diffusion model. Our method achieves a 21.6% improvement in emotion similarity while preserving physiologically plausible facial dynamics.
- Score: 7.165879904419689
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Audio-driven emotional 3D facial animation encounters two significant challenges: (1) reliance on single-modal control signals (videos, text, or emotion labels) without leveraging their complementary strengths for comprehensive emotion manipulation, and (2) deterministic regression-based mapping that constrains the stochastic nature of emotional expressions and non-verbal behaviors, limiting the expressiveness of synthesized animations. To address these challenges, we present a diffusion-based framework for controllable expressive 3D facial animation. Our approach introduces two key innovations: (1) a FLAME-centered multimodal emotion binding strategy that aligns diverse modalities (text, audio, and emotion labels) through contrastive learning, enabling flexible emotion control from multiple signal sources, and (2) an attention-based latent diffusion model with content-aware attention and emotion-guided layers, which enriches motion diversity while maintaining temporal coherence and natural facial dynamics. Extensive experiments demonstrate that our method outperforms existing approaches across most metrics, achieving a 21.6% improvement in emotion similarity while preserving physiologically plausible facial dynamics. Project Page: https://kangweiiliu.github.io/Control_3D_Animation.
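As a concrete illustration of the first innovation, the sketch below shows one plausible form of FLAME-centered multimodal emotion binding: text, audio, and emotion-label features are projected into a shared space and contrastively aligned to the matching FLAME expression embedding with a symmetric InfoNCE loss. This is a minimal sketch under assumed interfaces; the module names, feature dimensions, and the CLIP-style temperature are illustrative assumptions, not details taken from the paper.

```python
# Hypothetical sketch of FLAME-centered multimodal emotion binding (not the authors' code).
# Text, audio, and emotion-label embeddings are contrastively aligned to a FLAME
# expression embedding so that any single modality can later condition the diffusion model.
import torch
import torch.nn as nn
import torch.nn.functional as F


class EmotionBinder(nn.Module):
    """Projects each control modality into a shared, FLAME-centered emotion space."""

    def __init__(self, text_dim=768, audio_dim=512, num_labels=8,
                 flame_expr_dim=100, embed_dim=256):  # all dimensions are assumptions
        super().__init__()
        self.text_proj = nn.Linear(text_dim, embed_dim)
        self.audio_proj = nn.Linear(audio_dim, embed_dim)
        self.label_embed = nn.Embedding(num_labels, embed_dim)
        self.flame_proj = nn.Linear(flame_expr_dim, embed_dim)  # FLAME expression params act as the anchor
        self.logit_scale = nn.Parameter(torch.tensor(2.659))    # CLIP-style learnable temperature, log(1/0.07)

    @staticmethod
    def info_nce(anchor, other, scale):
        """Symmetric InfoNCE between two batches of L2-normalized embeddings."""
        logits = scale * anchor @ other.t()                      # (B, B) similarity matrix
        targets = torch.arange(anchor.size(0), device=anchor.device)
        return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

    def forward(self, flame_expr, text_feat, audio_feat, emo_label):
        scale = self.logit_scale.exp()
        f = F.normalize(self.flame_proj(flame_expr), dim=-1)
        t = F.normalize(self.text_proj(text_feat), dim=-1)
        a = F.normalize(self.audio_proj(audio_feat), dim=-1)
        e = F.normalize(self.label_embed(emo_label), dim=-1)
        # Bind every control modality to the FLAME anchor.
        return self.info_nce(f, t, scale) + self.info_nce(f, a, scale) + self.info_nce(f, e, scale)


if __name__ == "__main__":
    binder = EmotionBinder()
    loss = binder(torch.randn(4, 100),         # FLAME expression parameters
                  torch.randn(4, 768),         # text encoder features (assumed)
                  torch.randn(4, 512),         # audio encoder features (assumed)
                  torch.randint(0, 8, (4,)))   # discrete emotion labels
    print(float(loss))
```

In this reading, the shared embedding learned by such a binder would serve as the emotion condition for the attention-based latent diffusion model, which is how a single framework could accept text, audio, or label control interchangeably.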
Related papers
- Think-Before-Draw: Decomposing Emotion Semantics & Fine-Grained Controllable Expressive Talking Head Generation [7.362433184546492]
Emotional talking-head generation has emerged as a pivotal research area at the intersection of computer vision and multimodal artificial intelligence. This study proposes the Think-Before-Draw framework to address two key challenges.
arXiv Detail & Related papers (2025-07-17T03:33:46Z)
- MEDTalk: Multimodal Controlled 3D Facial Animation with Dynamic Emotions by Disentangled Embedding [48.54455964043634]
MEDTalk is a novel framework for fine-grained and dynamic emotional talking head generation. We integrate audio and speech text, predicting frame-wise intensity variations and dynamically adjusting static emotion features to generate realistic emotional expressions. Our generated results can be conveniently integrated into the industrial production pipeline.
arXiv Detail & Related papers (2025-07-08T15:14:27Z)
- Disentangle Identity, Cooperate Emotion: Correlation-Aware Emotional Talking Portrait Generation [63.94836524433559]
DICE-Talk is a framework that disentangles identity from emotion and cooperates emotions with similar characteristics. First, we develop a disentangled emotion embedder that jointly models audio-visual emotional cues through cross-modal attention. Second, we introduce a correlation-enhanced emotion conditioning module with learnable Emotion Banks. Third, we design an emotion discrimination objective that enforces affective consistency during the diffusion process.
arXiv Detail & Related papers (2025-04-25T05:28:21Z)
- EmoDiffusion: Enhancing Emotional 3D Facial Animation with Latent Diffusion Models [66.67979602235015]
EmoDiffusion is a novel approach that disentangles different emotions in speech to generate rich 3D emotional facial expressions. We capture facial expressions under the guidance of animation experts using LiveLinkFace on an iPhone.
arXiv Detail & Related papers (2025-03-14T02:54:22Z)
- X-Dyna: Expressive Dynamic Human Image Animation [49.896933584815926]
X-Dyna is a zero-shot, diffusion-based pipeline for animating a single human image. It generates realistic, context-aware dynamics for both the subject and the surrounding environment.
arXiv Detail & Related papers (2025-01-17T08:10:53Z)
- When Words Smile: Generating Diverse Emotional Facial Expressions from Text [72.19705878257204]
We introduce an end-to-end text-to-expression model that explicitly focuses on emotional dynamics. Our model learns expressive facial variations in a continuous latent space and generates expressions that are diverse, fluid, and emotionally coherent.
arXiv Detail & Related papers (2024-12-03T15:39:05Z)
- GMTalker: Gaussian Mixture-based Audio-Driven Emotional Talking Video Portraits [60.05683966405544]
We present GMTalker, a Gaussian mixture-based emotional talking portraits generation framework. Specifically, we propose a continuous and disentangled latent space, achieving more flexible emotion manipulation. We also introduce a normalizing flow-based motion generator pretrained on a large dataset to generate diverse head poses, blinks, and eyeball movements.
arXiv Detail & Related papers (2023-12-12T19:03:04Z)
- Weakly-Supervised Emotion Transition Learning for Diverse 3D Co-speech Gesture Generation [43.04371187071256]
We present a novel method to generate vivid and emotional 3D co-speech gestures in 3D avatars.
We use ChatGPT-4 and an audio inpainting approach to construct high-fidelity emotion transition human speeches.
Our method outperforms the state-of-the-art models constructed by adapting single emotion-conditioned counterparts.
arXiv Detail & Related papers (2023-11-29T11:10:40Z)
- Emotional Speech-Driven Animation with Content-Emotion Disentanglement [51.34635009347183]
We propose EMOTE, which generates 3D talking-head avatars that maintain lip-sync from speech while enabling explicit control over the expression of emotion.
EMOTE produces speech-driven facial animations with better lip-sync than state-of-the-art methods trained on the same data.
arXiv Detail & Related papers (2023-06-15T09:31:31Z)
- Expressive Speech-driven Facial Animation with controllable emotions [12.201573788014622]
This paper presents a novel deep learning-based approach for expressive facial animation generation from speech.
It can exhibit wide-spectrum facial expressions with controllable emotion type and intensity.
It enables emotion-controllable facial animation, where the target expression can be continuously adjusted.
arXiv Detail & Related papers (2023-01-05T11:17:19Z)