JoyAvatar: Unlocking Highly Expressive Avatars via Harmonized Text-Audio Conditioning
- URL: http://arxiv.org/abs/2602.00702v1
- Date: Sat, 31 Jan 2026 13:00:57 GMT
- Title: JoyAvatar: Unlocking Highly Expressive Avatars via Harmonized Text-Audio Conditioning
- Authors: Ruikui Wang, Jinheng Feng, Lang Tian, Huaishao Luo, Chaochao Li, Liangbo Zhou, Huan Zhang, Youzheng Wu, Xiaodong He
- Abstract summary: JoyAvatar is a framework capable of generating long-duration avatar videos. We introduce a twin-teacher enhanced training algorithm that enables the model to transfer inherent text-controllability from the foundation model. During training, we dynamically modulate the strength of multi-modal conditions.
- Score: 18.72712280434528
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Existing video avatar models have demonstrated impressive capabilities in scenarios such as talking, public speaking, and singing. However, the majority of these methods exhibit limited alignment with respect to text instructions, particularly when the prompts involve complex elements including large full-body movement, dynamic camera trajectory, background transitions, or human-object interactions. To overcome this limitation, we present JoyAvatar, a framework capable of generating long-duration avatar videos, featuring two key technical innovations. First, we introduce a twin-teacher enhanced training algorithm that enables the model to transfer inherent text-controllability from the foundation model while simultaneously learning audio-visual synchronization. Second, during training, we dynamically modulate the strength of multi-modal conditions (e.g., audio and text) based on the denoising timestep, aiming to mitigate conflicts between the heterogeneous conditioning signals. These two designs substantially expand the avatar model's capacity to generate natural, temporally coherent full-body motions and dynamic camera movements, while preserving basic avatar capabilities such as accurate lip-sync and identity consistency. GSB evaluation results demonstrate that our JoyAvatar model outperforms state-of-the-art models such as OmniHuman-1.5 and KlingAvatar 2.0. Moreover, our approach enables complex applications including multi-person dialogues and role-playing of non-human subjects. Video samples are provided at https://joyavatar.github.io/.
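The abstract does not give implementation details, but the timestep-dependent condition modulation can be illustrated with a minimal sketch. Everything below (the linear schedule, `condition_weights`, `denoise_step`, and the model's keyword arguments) is a hypothetical illustration, not the authors' code; it assumes text guidance dominates at high-noise timesteps (where global layout, camera motion, and body movement are formed) and audio guidance dominates at low-noise timesteps (where fine lip articulation is refined).

```python
# Hypothetical sketch of timestep-dependent multi-modal condition
# modulation. The schedule direction and all names are assumptions,
# not the paper's implementation.
import torch


def condition_weights(t: torch.Tensor, num_steps: int = 1000):
    """Return (audio_weight, text_weight) for denoising timestep t.

    Assumption: t is a batch of integer timesteps in [0, num_steps],
    where num_steps corresponds to pure noise and 0 to a clean sample.
    """
    s = t.float() / num_steps   # 1.0 = pure noise, 0.0 = clean
    text_w = s                  # text strong early (global structure)
    audio_w = 1.0 - s           # audio strong late (lip refinement)
    return audio_w, text_w


def denoise_step(model, x_t, t, audio_emb, text_emb):
    """One denoising step with modulated conditioning (sketch only).

    `model` is any conditional denoiser taking audio/text embeddings
    of shape (batch, seq, dim); the keyword interface is assumed.
    """
    audio_w, text_w = condition_weights(t)
    # Scale each conditioning stream before it enters cross-attention,
    # broadcasting the per-sample weight over sequence and channel dims.
    eps = model(
        x_t, t,
        audio_cond=audio_w.view(-1, 1, 1) * audio_emb,
        text_cond=text_w.view(-1, 1, 1) * text_emb,
    )
    return eps
```

Under this reading, the two conditions are never forced to compete at full strength on the same timestep, which is one plausible way to realize the conflict mitigation the abstract describes; the actual schedule in the paper may differ.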
Related papers
- Avatar Forcing: Real-Time Interactive Head Avatar Generation for Natural Conversation [71.38488610271247]
Talking head generation creates lifelike avatars from static portraits for virtual communication and content creation. Current models do not yet convey the feeling of truly interactive communication, often generating one-way responses that lack emotional engagement. We propose Avatar Forcing, a new framework for interactive head avatar generation that models real-time user-avatar interactions through diffusion forcing.
arXiv Detail & Related papers (2026-01-02T11:58:48Z) - AvatarSync: Rethinking Talking-Head Animation through Phoneme-Guided Autoregressive Perspective [15.69417162113696]
AvatarSync is an autoregressive framework built on phoneme representations that generates realistic talking-head animations from a single reference image. We show that AvatarSync outperforms existing talking-head animation methods in visual fidelity, temporal consistency, and computational efficiency.
arXiv Detail & Related papers (2025-09-15T15:34:02Z) - Kling-Avatar: Grounding Multimodal Instructions for Cascaded Long-Duration Avatar Animation Synthesis [48.47254451688591]
We introduce Kling-Avatar, a novel framework that unifies multimodal instruction understanding with portrait generation. Our method is capable of generating vivid, fluent, long-duration videos at up to 1080p and 48 fps. These results establish Kling-Avatar as a new benchmark for semantically grounded, high-fidelity audio-driven avatar synthesis.
arXiv Detail & Related papers (2025-09-11T16:34:57Z) - OmniHuman-1.5: Instilling an Active Mind in Avatars via Cognitive Simulation [29.41106195298283]
Existing video avatar models can produce fluid human animations, yet they struggle to move beyond mere physical likeness to capture a character's authentic essence. We propose a framework designed to generate character animations that are not only physically plausible but also semantically coherent and expressive.
arXiv Detail & Related papers (2025-08-26T17:15:26Z) - OmniAvatar: Efficient Audio-Driven Avatar Video Generation with Adaptive Body Animation [11.71823020976487]
We introduce OmniAvatar, an audio-driven full-body video generation model. It enhances human animation with improved lip-sync accuracy and natural movements. Experiments show it surpasses existing models in both facial and semi-body video generation.
arXiv Detail & Related papers (2025-06-23T17:33:03Z) - SmartAvatar: Text- and Image-Guided Human Avatar Generation with VLM AI Agents [91.26239311240873]
SmartAvatar is a vision-language-agent-driven framework for generating fully rigged, animation-ready 3D human avatars. A key innovation is an autonomous verification loop, in which the agent renders draft avatars. The generated avatars are fully rigged and support pose manipulation with consistent identity and appearance.
arXiv Detail & Related papers (2025-06-05T03:49:01Z) - Allo-AVA: A Large-Scale Multimodal Conversational AI Dataset for Allocentric Avatar Gesture Animation [1.9797215742507548]
Allo-AVA is a dataset specifically designed for text and audio-driven avatar gesture animation in an allocentric (third person point-of-view) context.
This resource enables the development and evaluation of more natural, context-aware avatar animation models.
arXiv Detail & Related papers (2024-10-21T20:50:51Z) - TALK-Act: Enhance Textural-Awareness for 2D Speaking Avatar Reenactment with Diffusion Model [100.35665852159785]
We propose the Motion-Enhanced Textural-Aware ModeLing for SpeaKing Avatar Reenactment (TALK-Act) framework.
Our key idea is to enhance the textural awareness with explicit motion guidance in diffusion modeling.
Our model can achieve high-fidelity 2D avatar reenactment with only 30 seconds of person-specific data.
arXiv Detail & Related papers (2024-10-14T16:38:10Z) - From Audio to Photoreal Embodiment: Synthesizing Humans in Conversations [107.88375243135579]
Given speech audio, we output multiple possibilities of gestural motion for an individual, including face, body, and hands.
We visualize the generated motion using highly photorealistic avatars that can express crucial nuances in gestures.
Experiments show our model generates appropriate and diverse gestures, outperforming both diffusion- and VQ-only methods.
arXiv Detail & Related papers (2024-01-03T18:55:16Z) - Dancing Avatar: Pose and Text-Guided Human Motion Videos Synthesis with Image Diffusion Model [57.855362366674264]
We propose Dancing Avatar, designed to fabricate human motion videos driven by poses and textual cues.
Our approach employs a pretrained T2I diffusion model to generate each video frame in an autoregressive fashion.
arXiv Detail & Related papers (2023-08-15T13:00:42Z)