OmniHuman-1.5: Instilling an Active Mind in Avatars via Cognitive Simulation
- URL: http://arxiv.org/abs/2508.19209v1
- Date: Tue, 26 Aug 2025 17:15:26 GMT
- Title: OmniHuman-1.5: Instilling an Active Mind in Avatars via Cognitive Simulation
- Authors: Jianwen Jiang, Weihong Zeng, Zerong Zheng, Jiaqi Yang, Chao Liang, Wang Liao, Han Liang, Yuan Zhang, Mingyuan Gao,
- Abstract summary: Existing video avatar models can produce fluid human animations, yet they struggle to move beyond mere physical likeness to capture a character's authentic essence.<n>textbfwe propose a framework designed to generate character animations that are not only physically plausible but also semantically coherent and expressive.
- Score: 29.41106195298283
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Existing video avatar models can produce fluid human animations, yet they struggle to move beyond mere physical likeness to capture a character's authentic essence. Their motions typically synchronize with low-level cues like audio rhythm, lacking a deeper semantic understanding of emotion, intent, or context. To bridge this gap, \textbf{we propose a framework designed to generate character animations that are not only physically plausible but also semantically coherent and expressive.} Our model, \textbf{OmniHuman-1.5}, is built upon two key technical contributions. First, we leverage Multimodal Large Language Models to synthesize a structured textual representation of conditions that provides high-level semantic guidance. This guidance steers our motion generator beyond simplistic rhythmic synchronization, enabling the production of actions that are contextually and emotionally resonant. Second, to ensure the effective fusion of these multimodal inputs and mitigate inter-modality conflicts, we introduce a specialized Multimodal DiT architecture with a novel Pseudo Last Frame design. The synergy of these components allows our model to accurately interpret the joint semantics of audio, images, and text, thereby generating motions that are deeply coherent with the character, scene, and linguistic content. Extensive experiments demonstrate that our model achieves leading performance across a comprehensive set of metrics, including lip-sync accuracy, video quality, motion naturalness and semantic consistency with textual prompts. Furthermore, our approach shows remarkable extensibility to complex scenarios, such as those involving multi-person and non-human subjects. Homepage: \href{https://omnihuman-lab.github.io/v1_5/}
Related papers
- MIBURI: Towards Expressive Interactive Gesture Synthesis [62.45332399212876]
Embodied Conversational Agents (ECAs) aim to emulate human face-to-face interaction through speech, gestures, and facial expressions.<n>Existing solutions for ECAs produce rigid, low-diversity motions that are unsuitable for human-like interaction.<n>We present MIBURI, the first online, causal framework for generating expressive full-body gestures and facial expressions synchronized with real-time spoken dialogue.
arXiv Detail & Related papers (2026-03-03T18:59:51Z) - SynchroRaMa : Lip-Synchronized and Emotion-Aware Talking Face Generation via Multi-Modal Emotion Embedding [22.47072342385842]
SynchroRaMa is a novel framework that integrates a multi-modal emotion embedding by combining emotional signals from text and audio.<n>SynchroRaMa includes an audio-to-motion (A2M) module that generates motion frames aligned with the input audio.<n>Experiments on benchmark datasets demonstrate that SynchroRaMa outperforms the state-of-the-art.
arXiv Detail & Related papers (2025-09-24T10:21:29Z) - Think2Sing: Orchestrating Structured Motion Subtitles for Singing-Driven 3D Head Animation [69.50178144839275]
Singing involves richer emotional nuance, dynamic prosody, and lyric-based semantics.<n>Existing speech-driven approaches often produce oversimplified, emotionally flat, and semantically inconsistent results.<n>Think2Sing generates semantically coherent and temporally consistent 3D head animations conditioned on both lyrics and acoustics.
arXiv Detail & Related papers (2025-09-02T12:59:27Z) - DeepGesture: A conversational gesture synthesis system based on emotions and semantics [0.0]
DeepGesture is a diffusion-based gesture synthesis framework.<n>It generates expressive co-speech gestures conditioned on multimodal signals.<n>We show that DeepGesture produces gestures with improved human-likeness and contextual appropriateness.
arXiv Detail & Related papers (2025-07-03T20:04:04Z) - AsynFusion: Towards Asynchronous Latent Consistency Models for Decoupled Whole-Body Audio-Driven Avatars [65.53676584955686]
Whole-body audio-driven avatar pose and expression generation is a critical task for creating lifelike digital humans.<n>We propose AsynFusion, a novel framework that leverages diffusion transformers to achieve cohesive expression and gesture synthesis.<n>AsynFusion achieves state-of-the-art performance in generating real-time, synchronized whole-body animations.
arXiv Detail & Related papers (2025-05-21T03:28:53Z) - Towards Multi-Task Multi-Modal Models: A Video Generative Perspective [5.495245220300184]
This thesis chronicles our endeavor to build multi-task models for generating videos and other modalities under diverse conditions.
We unveil a novel approach to mapping bidirectionally between visual observation and interpretable lexical terms.
Our scalable visual token representation proves beneficial across generation, compression, and understanding tasks.
arXiv Detail & Related papers (2024-05-26T23:56:45Z) - Dynamic Typography: Bringing Text to Life via Video Diffusion Prior [73.72522617586593]
We present an automated text animation scheme, termed "Dynamic Typography"
It deforms letters to convey semantic meaning and infuses them with vibrant movements based on user prompts.
Our technique harnesses vector graphics representations and an end-to-end optimization-based framework.
arXiv Detail & Related papers (2024-04-17T17:59:55Z) - From Audio to Photoreal Embodiment: Synthesizing Humans in Conversations [107.88375243135579]
Given speech audio, we output multiple possibilities of gestural motion for an individual, including face, body, and hands.
We visualize the generated motion using highly photorealistic avatars that can express crucial nuances in gestures.
Experiments show our model generates appropriate and diverse gestures, outperforming both diffusion- and VQ-only methods.
arXiv Detail & Related papers (2024-01-03T18:55:16Z) - TEMOS: Generating diverse human motions from textual descriptions [53.85978336198444]
We address the problem of generating diverse 3D human motions from textual descriptions.
We propose TEMOS, a text-conditioned generative model leveraging variational autoencoder (VAE) training with human motion data.
We show that TEMOS framework can produce both skeleton-based animations as in prior work, as well more expressive SMPL body motions.
arXiv Detail & Related papers (2022-04-25T14:53:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.