Weakly-Supervised Emotion Transition Learning for Diverse 3D Co-speech Gesture Generation
- URL: http://arxiv.org/abs/2311.17532v3
- Date: Wed, 27 Mar 2024 15:01:22 GMT
- Title: Weakly-Supervised Emotion Transition Learning for Diverse 3D Co-speech Gesture Generation
- Authors: Xingqun Qi, Jiahao Pan, Peng Li, Ruibin Yuan, Xiaowei Chi, Mengfei Li, Wenhan Luo, Wei Xue, Shanghang Zhang, Qifeng Liu, Yike Guo
- Abstract summary: We present a novel method to generate vivid and emotional 3D co-speech gestures for 3D avatars.
We use ChatGPT-4 and an audio inpainting approach to construct high-fidelity emotion-transition human speech.
Our method outperforms state-of-the-art models constructed by adapting single emotion-conditioned counterparts.
- Score: 43.04371187071256
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Generating vivid and emotional 3D co-speech gestures is crucial for virtual avatar animation in human-machine interaction applications. While existing methods can generate gestures that follow a single emotion label, they overlook that modeling long gesture sequences with emotion transitions is more practical in real scenes. In addition, the lack of large-scale datasets pairing emotion-transition speech with corresponding 3D human gestures also limits progress on this task. To fulfill this goal, we first incorporate ChatGPT-4 and an audio inpainting approach to construct high-fidelity emotion-transition human speech. Since obtaining realistic 3D pose annotations corresponding to the dynamically inpainted emotion-transition audio is extremely difficult, we propose a novel weakly supervised training strategy to encourage authentic gesture transitions. Specifically, to enhance the coordination of transition gestures w.r.t. different emotional ones, we model the temporal association representation between two different emotional gesture sequences as style guidance and infuse it into the transition generation. We further devise an emotion mixture mechanism that provides weak supervision based on a learnable mixed emotion label for transition gestures. Finally, we present a keyframe sampler that supplies effective initial posture cues in long sequences, enabling us to generate diverse gestures. Extensive experiments demonstrate that our method outperforms state-of-the-art models constructed by adapting single emotion-conditioned counterparts on our newly defined emotion transition task and datasets. Our code and dataset will be released on the project page: https://xingqunqi-lab.github.io/Emo-Transition-Gesture/.
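The abstract states only that the emotion mixture mechanism weakly supervises transition gestures with a learnable mixed emotion label; the exact parameterization is not given. A minimal sketch of one plausible form, assuming a learnable scalar blends the one-hot labels of the two flanking emotional clips (the function name and sigmoid parameterization are illustrative assumptions, not the authors' implementation):

```python
import numpy as np

def mixed_emotion_label(label_a, label_b, alpha_logit):
    """Blend the emotion labels of the two flanking gesture clips into a
    soft label for the unannotated transition segment.

    label_a / label_b : one-hot emotion vectors of the source and target clips
    alpha_logit       : learnable scalar; sigmoid(alpha_logit) is the mix weight

    Hypothetical sketch -- the paper only says the mixed label is learnable;
    this particular parameterization is an assumption.
    """
    alpha = 1.0 / (1.0 + np.exp(-alpha_logit))  # sigmoid -> weight in (0, 1)
    return alpha * np.asarray(label_a) + (1.0 - alpha) * np.asarray(label_b)

# Example: a transition from "happy" (index 0) to "sad" (index 1)
happy = [1.0, 0.0, 0.0]
sad = [0.0, 1.0, 0.0]
soft_label = mixed_emotion_label(happy, sad, alpha_logit=0.0)  # alpha = 0.5
# soft_label sums to 1 and spreads mass over both emotions, so a standard
# classification loss on the transition frames yields a weak supervision signal
```

During training, `alpha_logit` would be optimized jointly with the generator, letting the model decide how strongly each flanking emotion should supervise the transition frames.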
Related papers
- EMOdiffhead: Continuously Emotional Control in Talking Head Generation via Diffusion [5.954758598327494]
EMOdiffhead is a novel method for emotional talking head video generation.
It enables fine-grained control of emotion categories and intensities.
It achieves state-of-the-art performance compared to other emotion portrait animation methods.
arXiv Detail & Related papers (2024-09-11T13:23:22Z)
- DEEPTalk: Dynamic Emotion Embedding for Probabilistic Speech-Driven 3D Face Animation [14.07086606183356]
Speech-driven 3D facial animation has garnered lots of attention thanks to its broad range of applications.
Current methods fail to capture the nuanced emotional undertones conveyed through speech and produce monotonous facial motion.
We introduce DEEPTalk, a novel approach that generates diverse and emotionally rich 3D facial expressions directly from speech inputs.
arXiv Detail & Related papers (2024-08-12T08:56:49Z)
- ConvoFusion: Multi-Modal Conversational Diffusion for Co-Speech Gesture Synthesis [50.69464138626748]
We present ConvoFusion, a diffusion-based approach for multi-modal gesture synthesis.
Our method proposes two guidance objectives that allow the users to modulate the impact of different conditioning modalities.
Our method is versatile in that it can be trained either for generating monologue gestures or even the conversational gestures.
arXiv Detail & Related papers (2024-03-26T17:59:52Z)
- EmoSpeaker: One-shot Fine-grained Emotion-Controlled Talking Face Generation [34.5592743467339]
We propose a visual attribute-guided audio decoupler to generate fine-grained facial animations.
To achieve more precise emotional expression, we introduce a fine-grained emotion coefficient prediction module.
Our proposed method, EmoSpeaker, outperforms existing emotional talking face generation methods in terms of expression variation and lip synchronization.
arXiv Detail & Related papers (2024-02-02T14:04:18Z)
- Attention-based Interactive Disentangling Network for Instance-level Emotional Voice Conversion [81.1492897350032]
Emotional Voice Conversion aims to manipulate a speech according to a given emotion while preserving non-emotion components.
We propose an Attention-based Interactive diseNtangling Network (AINN) that leverages instance-wise emotional knowledge for voice conversion.
arXiv Detail & Related papers (2023-12-29T08:06:45Z)
- Emotional Speech-driven 3D Body Animation via Disentangled Latent Diffusion [45.081371413693425]
Existing methods for synthesizing 3D human gestures from speech have shown promising results.
We present AMUSE, an emotional speech-driven body animation model based on latent diffusion.
arXiv Detail & Related papers (2023-12-07T17:39:25Z)
- Emotional Speech-Driven Animation with Content-Emotion Disentanglement [51.34635009347183]
We propose EMOTE, which generates 3D talking-head avatars that maintain lip-sync from speech while enabling explicit control over the expression of emotion.
EMOTE produces speech-driven facial animations with better lip-sync than state-of-the-art methods trained on the same data.
arXiv Detail & Related papers (2023-06-15T09:31:31Z)
- EmotionGesture: Audio-Driven Diverse Emotional Co-Speech 3D Gesture Generation [24.547098909937034]
EmotionGesture is a novel framework for generating vivid and diverse emotional co-speech 3D gestures from audio.
Our framework outperforms the state of the art, achieving vivid and diverse emotional co-speech 3D gestures.
arXiv Detail & Related papers (2023-05-30T09:47:29Z)
- Audio-Driven Co-Speech Gesture Video Generation [92.15661971086746]
We define and study this challenging problem of audio-driven co-speech gesture video generation.
Our key insight is that the co-speech gestures can be decomposed into common motion patterns and subtle rhythmic dynamics.
We propose a novel framework, Audio-driveN Gesture vIdeo gEneration (ANGIE), to effectively capture the reusable co-speech gesture patterns.
arXiv Detail & Related papers (2022-12-05T15:28:22Z)
- Seen and Unseen emotional style transfer for voice conversion with a new emotional speech dataset [84.53659233967225]
Emotional voice conversion aims to transform emotional prosody in speech while preserving the linguistic content and speaker identity.
We propose a novel framework based on a variational auto-encoding Wasserstein generative adversarial network (VAW-GAN).
We show that the proposed framework achieves remarkable performance by consistently outperforming the baseline framework.
arXiv Detail & Related papers (2020-10-28T07:16:18Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.