EAI-Avatar: Emotion-Aware Interactive Talking Head Generation
- URL: http://arxiv.org/abs/2508.18337v2
- Date: Wed, 24 Sep 2025 06:28:07 GMT
- Title: EAI-Avatar: Emotion-Aware Interactive Talking Head Generation
- Authors: Haijie Yang, Zhenyu Zhang, Hao Tang, Jianjun Qian, Jian Yang
- Abstract summary: We propose EAI-Avatar, a novel emotion-aware talking head generation framework for dyadic interactions. Our method produces temporally consistent virtual avatars with rich emotional variations that seamlessly transition between speaking and listening states.
- Score: 35.56554951482687
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Generative models have advanced rapidly, enabling impressive talking head generation that brings AI to life. However, most existing methods focus solely on one-way portrait animation. Even the few that support bidirectional conversational interactions lack precise emotion-adaptive capabilities, significantly limiting their practical applicability. In this paper, we propose EAI-Avatar, a novel emotion-aware talking head generation framework for dyadic interactions. Leveraging the dialogue generation capability of large language models (LLMs, e.g., GPT-4), our method produces temporally consistent virtual avatars with rich emotional variations that seamlessly transition between speaking and listening states. Specifically, we design a Transformer-based head mask generator that learns temporally consistent motion features in a latent mask space, capable of generating arbitrary-length, temporally consistent mask sequences to constrain head motions. Furthermore, we introduce an interactive talking tree structure to represent dialogue state transitions, where each tree node contains information such as child/parent/sibling nodes and the current character's emotional state. By performing reverse-level traversal, we extract rich historical emotional cues from the current node to guide expression synthesis. Extensive experiments demonstrate the superior performance and effectiveness of our method.
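The interactive talking tree is described only at a high level in the abstract; as a reading aid, the Python sketch below illustrates one plausible form of such a structure and of a reverse-level traversal that gathers historical emotional cues from the current node. All names (TalkTreeNode, reverse_level_history, the string-valued emotion labels) are hypothetical assumptions, not taken from the paper or its code.

```python
from dataclasses import dataclass, field
from typing import List, Optional


# Hypothetical sketch of the "interactive talking tree": each node records its
# parent/children/sibling links plus the active character and that character's
# emotional state at the corresponding dialogue turn.
@dataclass
class TalkTreeNode:
    speaker: str                                  # e.g. "agent" (speaking) or "user" (listening)
    emotion: str                                  # e.g. "neutral", "happy", "sad"
    parent: Optional["TalkTreeNode"] = None
    sibling: Optional["TalkTreeNode"] = None
    children: List["TalkTreeNode"] = field(default_factory=list)

    def add_child(self, child: "TalkTreeNode") -> "TalkTreeNode":
        """Append a dialogue turn, wiring up parent and previous-sibling links."""
        child.parent = self
        if self.children:
            child.sibling = self.children[-1]
        self.children.append(child)
        return child


def reverse_level_history(node: TalkTreeNode, max_turns: int = 8) -> List[str]:
    """Walk from the current node back toward the root (one reading of the
    paper's reverse-level traversal) and collect the emotional states of
    earlier turns, oldest first, to condition expression synthesis."""
    history: List[str] = []
    cur: Optional[TalkTreeNode] = node
    while cur is not None and len(history) < max_turns:
        history.append(cur.emotion)
        cur = cur.parent
    return history[::-1]


# Minimal usage example: a two-turn exchange.
root = TalkTreeNode(speaker="user", emotion="neutral")
reply = root.add_child(TalkTreeNode(speaker="agent", emotion="happy"))
print(reverse_level_history(reply))  # ['neutral', 'happy']
```

How the extracted emotion history is fed into the mask generator and expression synthesis is not specified in the abstract, so the conditioning step is omitted here.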
Related papers
- MIBURI: Towards Expressive Interactive Gesture Synthesis [62.45332399212876]
Embodied Conversational Agents (ECAs) aim to emulate human face-to-face interaction through speech, gestures, and facial expressions.
Existing solutions for ECAs produce rigid, low-diversity motions that are unsuitable for human-like interaction.
We present MIBURI, the first online, causal framework for generating expressive full-body gestures and facial expressions synchronized with real-time spoken dialogue.
arXiv Detail & Related papers (2026-03-03T18:59:51Z)
- AUHead: Realistic Emotional Talking Head Generation via Action Units Control [67.20660861826357]
Realistic talking-head video generation is critical for virtual avatars, film production, and interactive systems.
Current methods struggle with nuanced emotional expressions due to the lack of fine-grained emotion control.
We introduce a novel two-stage method to disentangle emotion control, i.e. Action Units (AUs), from audio and achieve controllable generation.
arXiv Detail & Related papers (2026-02-10T08:45:51Z)
- EmoCAST: Emotional Talking Portrait via Emotive Text Description [56.42674612728354]
EmoCAST is a diffusion-based framework for precise text-driven emotional synthesis.
In appearance modeling, emotional prompts are integrated through a text-guided decoupled emotive module.
EmoCAST achieves state-of-the-art performance in generating realistic, emotionally expressive, and audio-synchronized talking-head videos.
arXiv Detail & Related papers (2025-08-28T10:02:06Z)
- Taming Transformer for Emotion-Controllable Talking Face Generation [61.835295250047196]
We propose a novel method to tackle the emotion-controllable talking face generation task discretely.
Specifically, we employ two pre-training strategies to disentangle audio into independent components and quantize videos into combinations of visual tokens.
We conduct experiments on the MEAD dataset that controls the emotion of videos conditioned on multiple emotional audios.
arXiv Detail & Related papers (2025-08-20T02:16:52Z)
- MEDTalk: Multimodal Controlled 3D Facial Animation with Dynamic Emotions by Disentangled Embedding [48.54455964043634]
MEDTalk is a novel framework for fine-grained and dynamic emotional talking head generation.
We integrate audio and speech text, predicting frame-wise intensity variations and dynamically adjusting static emotion features to generate realistic emotional expressions.
Our generated results can be conveniently integrated into the industrial production pipeline.
arXiv Detail & Related papers (2025-07-08T15:14:27Z)
- Emotional Face-to-Speech [13.725558939494407]
Existing face-to-speech methods offer great promise in capturing identity characteristics but struggle to generate diverse vocal styles with emotional expression.
We introduce DEmoFace, a novel generative framework that leverages a discrete diffusion transformer (DiT) with curriculum learning.
We develop an enhanced predictor-free guidance to handle diverse conditioning scenarios, enabling multi-conditional generation and disentangling complex attributes effectively.
arXiv Detail & Related papers (2025-02-03T04:48:50Z)
- INFP: Audio-Driven Interactive Head Generation in Dyadic Conversations [11.101103116878438]
We propose INFP, a novel audio-driven head generation framework for dyadic interaction.
INFP comprises a Motion-Based Head Imitation stage and an Audio-Guided Motion Generation stage.
To facilitate this line of research, we introduce DyConv, a large scale dataset of rich dyadic conversations collected from the Internet.
arXiv Detail & Related papers (2024-12-05T10:20:34Z)
- EmotiveTalk: Expressive Talking Head Generation through Audio Information Decoupling and Emotional Video Diffusion [49.55774551366049]
Diffusion models have revolutionized the field of talking head generation, yet still face challenges in expressiveness, controllability, and stability in long-time generation.
We propose an EmotiveTalk framework to address these issues.
Experimental results show that EmotiveTalk can generate expressive talking head videos, ensuring the promised controllability of emotions and stability during long-time generation.
arXiv Detail & Related papers (2024-11-23T04:38:51Z)
- DEEPTalk: Dynamic Emotion Embedding for Probabilistic Speech-Driven 3D Face Animation [14.07086606183356]
Speech-driven 3D facial animation has garnered lots of attention thanks to its broad range of applications.
Current methods fail to capture the nuanced emotional undertones conveyed through speech and produce monotonous facial motion.
We introduce DEEPTalk, a novel approach that generates diverse and emotionally rich 3D facial expressions directly from speech inputs.
arXiv Detail & Related papers (2024-08-12T08:56:49Z)
- Weakly-Supervised Emotion Transition Learning for Diverse 3D Co-speech Gesture Generation [43.04371187071256]
We present a novel method to generate vivid and emotional 3D co-speech gestures in 3D avatars.
We use the ChatGPT-4 and an audio inpainting approach to construct the high-fidelity emotion transition human speeches.
Our method outperforms the state-of-the-art models constructed by adapting single emotion-conditioned counterparts.
arXiv Detail & Related papers (2023-11-29T11:10:40Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.