Avatar Forcing: Real-Time Interactive Head Avatar Generation for Natural Conversation
- URL: http://arxiv.org/abs/2601.00664v1
- Date: Fri, 02 Jan 2026 11:58:48 GMT
- Title: Avatar Forcing: Real-Time Interactive Head Avatar Generation for Natural Conversation
- Authors: Taekyung Ki, Sangwon Jang, Jaehyeong Jo, Jaehong Yoon, Sung Ju Hwang
- Abstract summary: Talking head generation creates lifelike avatars from static portraits for virtual communication and content creation. Current models do not yet convey the feeling of truly interactive communication, often generating one-way responses that lack emotional engagement. We propose Avatar Forcing, a new framework for interactive head avatar generation that models real-time user-avatar interactions through diffusion forcing.
- Score: 71.38488610271247
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Talking head generation creates lifelike avatars from static portraits for virtual communication and content creation. However, current models do not yet convey the feeling of truly interactive communication, often generating one-way responses that lack emotional engagement. We identify two key challenges toward truly interactive avatars: generating motion in real time under causal constraints and learning expressive, vibrant reactions without additional labeled data. To address these challenges, we propose Avatar Forcing, a new framework for interactive head avatar generation that models real-time user-avatar interactions through diffusion forcing. This design allows the avatar to process real-time multimodal inputs, including the user's audio and motion, with low latency for instant reactions to both verbal and non-verbal cues such as speech, nods, and laughter. Furthermore, we introduce a direct preference optimization method that leverages synthetic losing samples constructed by dropping user conditions, enabling label-free learning of expressive interaction. Experimental results demonstrate that our framework enables real-time interaction with low latency (approximately 500ms), achieving a 6.8x speedup over the baseline, and produces reactive and expressive avatar motion that is preferred over the baseline in more than 80% of comparisons.
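The abstract names two concrete mechanisms: causal, chunk-by-chunk motion generation via diffusion forcing, and label-free direct preference optimization (DPO) where the losing sample is synthesized by dropping the user's conditions. Below is a minimal Python sketch of both ideas, written only from the abstract; every name in it (the denoiser signature, `CONTEXT`, `chunk`, `num_steps`, the DPO log-probability inputs) is a hypothetical placeholder, not the authors' actual interface.

```python
# Hypothetical sketch based only on the abstract; not the authors' code.
import torch
import torch.nn.functional as F

CONTEXT = 32  # assumed number of past frames the denoiser can attend to


def stream_avatar(model, portrait, audio_chunks, motion_chunks, chunk=8):
    """Causal rollout in the spirit of diffusion forcing: each new chunk of
    avatar-motion latents is denoised conditioned on already-generated past
    frames plus the user's latest audio/motion, so frames stream out
    chunk-by-chunk instead of after the whole clip is finished."""
    history = torch.empty(0, model.latent_dim)
    for audio, motion in zip(audio_chunks, motion_chunks):
        x = torch.randn(chunk, model.latent_dim)      # fresh noise for the new frames
        for t in reversed(range(model.num_steps)):    # assumed few-step sampler
            x = model.denoise(x, t, portrait, history[-CONTEXT:], audio, motion)
        history = torch.cat([history, x])
        yield x                                       # hand off to the renderer


def dpo_loss(logp_win, logp_lose, ref_logp_win, ref_logp_lose, beta=0.1):
    """Standard DPO objective. The 'winning' sample is generated with the
    user's audio/motion; the synthetic 'losing' sample is the same model run
    with those conditions dropped, so no preference labels are required."""
    margin = beta * ((logp_win - ref_logp_win) - (logp_lose - ref_logp_lose))
    return -F.logsigmoid(margin).mean()
```

The two claims from the abstract are visible in this sketch: latency is bounded by one chunk rather than the full clip, and the preference pair costs no human labels because the losing sample is simply the unconditioned generation.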
Related papers
- Making Avatars Interact: Towards Text-Driven Human-Object Interaction for Controllable Talking Avatars [32.76524805419984]
Existing methods can generate full-body talking avatars with simple human motion. This challenge stems from the need for environmental perception and the control-quality dilemma in GHOI generation. We propose a novel dual-stream framework, InteractAvatar, which decouples perception and planning from video synthesis for grounded human-object interaction.
arXiv Detail & Related papers (2026-02-02T02:12:09Z) - JoyAvatar: Unlocking Highly Expressive Avatars via Harmonized Text-Audio Conditioning [18.72712280434528]
JoyAvatar is a framework capable of generating long-duration avatar videos. We introduce a twin-teacher enhanced training algorithm that enables the model to transfer inherent text controllability. During training, we dynamically modulate the strength of multi-modal conditions.
arXiv Detail & Related papers (2026-01-31T13:00:57Z) - StreamAvatar: Streaming Diffusion Models for Real-Time Interactive Human Avatars [32.75338796722652]
We propose a two-stage autoregressive adaptation and acceleration framework to adapt a high-fidelity human video diffusion model for real-time, interactive streaming. We develop a one-shot, interactive human avatar model capable of generating both natural talking and listening behaviors with coherent gestures. Our method achieves state-of-the-art performance, surpassing existing approaches in generation quality, real-time efficiency, and interaction naturalness.
arXiv Detail & Related papers (2025-12-26T15:41:24Z) - Audio Driven Real-Time Facial Animation for Social Telepresence [65.66220599734338]
We present an audio-driven real-time system for animating photorealistic 3D facial avatars with minimal latency. Central to our approach is an encoder model that transforms audio signals into latent facial expression sequences in real time. We capture the rich spectrum of facial expressions necessary for natural communication while achieving real-time performance.
arXiv Detail & Related papers (2025-10-01T17:57:05Z) - EAI-Avatar: Emotion-Aware Interactive Talking Head Generation [35.56554951482687]
We propose EAI-Avatar, a novel emotion-aware talking head generation framework for dyadic interactions. Our method produces temporally consistent virtual avatars with rich emotional variations that seamlessly transition between speaking and listening states.
arXiv Detail & Related papers (2025-08-25T13:07:03Z) - SmartAvatar: Text- and Image-Guided Human Avatar Generation with VLM AI Agents [91.26239311240873]
SmartAvatar is a vision-language-agent-driven framework for generating fully rigged, animation-ready 3D human avatars. A key innovation is an autonomous verification loop, where the agent renders draft avatars. The generated avatars are fully rigged and support pose manipulation with consistent identity and appearance.
arXiv Detail & Related papers (2025-06-05T03:49:01Z) - Dyadic Interaction Modeling for Social Behavior Generation [6.626277726145613]
We present an effective framework for creating 3D facial motions in dyadic interactions.
The heart of our framework is Dyadic Interaction Modeling (DIM), a pre-training approach.
Experiments demonstrate the superiority of our framework in generating listener motions.
arXiv Detail & Related papers (2024-03-14T03:21:33Z) - From Audio to Photoreal Embodiment: Synthesizing Humans in Conversations [107.88375243135579]
Given speech audio, we output multiple possibilities of gestural motion for an individual, including face, body, and hands.
We visualize the generated motion using highly photorealistic avatars that can express crucial nuances in gestures.
Experiments show our model generates appropriate and diverse gestures, outperforming both diffusion- and VQ-only methods.
arXiv Detail & Related papers (2024-01-03T18:55:16Z) - Drivable Volumetric Avatars using Texel-Aligned Features [52.89305658071045]
Photorealistic telepresence requires both high-fidelity body modeling and faithful driving to enable dynamically synthesized appearance.
We propose an end-to-end framework that addresses two core challenges in modeling and driving full-body avatars of real people.
arXiv Detail & Related papers (2022-07-20T09:28:16Z) - DialogueNeRF: Towards Realistic Avatar Face-to-Face Conversation Video Generation [54.84137342837465]
Face-to-face conversations account for the vast majority of daily conversations.
Most existing methods focus on single-person talking head generation.
We propose a novel unified framework based on a neural radiance field (NeRF).
arXiv Detail & Related papers (2022-03-15T14:16:49Z)
This list is automatically generated from the titles and abstracts of the papers on this site.