AsynFusion: Towards Asynchronous Latent Consistency Models for Decoupled Whole-Body Audio-Driven Avatars
- URL: http://arxiv.org/abs/2505.15058v1
- Date: Wed, 21 May 2025 03:28:53 GMT
- Title: AsynFusion: Towards Asynchronous Latent Consistency Models for Decoupled Whole-Body Audio-Driven Avatars
- Authors: Tianbao Zhang, Jian Zhao, Yuer Li, Zheng Zhu, Ping Hu, Zhaoxin Fan, Wenjun Wu, Xuelong Li
- Abstract summary: Whole-body audio-driven avatar pose and expression generation is a critical task for creating lifelike digital humans. We propose AsynFusion, a novel framework that leverages diffusion transformers to achieve cohesive expression and gesture synthesis. AsynFusion achieves state-of-the-art performance in generating real-time, synchronized whole-body animations.
- Score: 65.53676584955686
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Whole-body audio-driven avatar pose and expression generation is a critical task for creating lifelike digital humans and enhancing the capabilities of interactive virtual agents, with wide-ranging applications in virtual reality, digital entertainment, and remote communication. Existing approaches often generate audio-driven facial expressions and gestures independently, which introduces a significant limitation: the lack of seamless coordination between facial and gestural elements, resulting in less natural and cohesive animations. To address this limitation, we propose AsynFusion, a novel framework that leverages diffusion transformers to achieve harmonious expression and gesture synthesis. The proposed method is built upon a dual-branch DiT architecture, which enables the parallel generation of facial expressions and gestures. Within the model, we introduce a Cooperative Synchronization Module to facilitate bidirectional feature interaction between the two modalities, and an Asynchronous LCM Sampling strategy to reduce computational overhead while maintaining high-quality outputs. Extensive experiments demonstrate that AsynFusion achieves state-of-the-art performance in generating real-time, synchronized whole-body animations, consistently outperforming existing methods in both quantitative and qualitative evaluations.
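To make the described design concrete, below is a minimal PyTorch sketch of a dual-branch denoiser with bidirectional cross-attention between an expression branch and a gesture branch, and a per-branch step budget during few-step sampling. Every name here (CooperativeSync, DualBranchDenoiser, asynchronous_sampling), the additive audio conditioning, and the step-offset scheme are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn


def make_branch(dim: int, layers: int) -> nn.TransformerEncoder:
    """One lightweight transformer branch (a stand-in for a DiT branch)."""
    layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
    return nn.TransformerEncoder(layer, num_layers=layers)


class CooperativeSync(nn.Module):
    """Bidirectional cross-attention between the expression and gesture
    branches -- an assumed realization of the Cooperative Synchronization
    Module described in the abstract."""

    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.expr_to_gest = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gest_to_expr = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, expr: torch.Tensor, gest: torch.Tensor):
        # Each branch attends over the other's features; the result is added
        # back as a residual so neither branch loses its own signal.
        e2g, _ = self.expr_to_gest(gest, expr, expr)  # gesture side reads expression
        g2e, _ = self.gest_to_expr(expr, gest, gest)  # expression side reads gesture
        return expr + g2e, gest + e2g


class DualBranchDenoiser(nn.Module):
    """Two parallel branches with a sync block applied to their outputs."""

    def __init__(self, dim: int = 128, layers: int = 2):
        super().__init__()
        self.expr_branch = make_branch(dim, layers)
        self.gest_branch = make_branch(dim, layers)
        self.sync = CooperativeSync(dim)

    def forward(self, expr_latent, gest_latent, audio_feat):
        # Audio conditioning by simple addition -- an assumption; the paper
        # may use a different conditioning mechanism.
        expr = self.expr_branch(expr_latent + audio_feat)
        gest = self.gest_branch(gest_latent + audio_feat)
        return self.sync(expr, gest)


@torch.no_grad()
def asynchronous_sampling(model, audio_feat, steps_expr=4, steps_gest=2):
    """Few-step, consistency-style refinement with a different step budget
    per branch -- one plausible reading of 'Asynchronous LCM Sampling'."""
    B, T, D = audio_feat.shape
    expr = torch.randn(B, T, D)
    gest = torch.randn(B, T, D)
    for step in range(max(steps_expr, steps_gest)):
        new_expr, new_gest = model(expr, gest, audio_feat)
        # Each branch is only refined on the steps assigned to it.
        if step < steps_expr:
            expr = new_expr
        if step < steps_gest:
            gest = new_gest
    return expr, gest


if __name__ == "__main__":
    model = DualBranchDenoiser()
    audio = torch.randn(1, 64, 128)  # (batch, frames, feature dim)
    expr, gest = asynchronous_sampling(model, audio)
    print(expr.shape, gest.shape)  # torch.Size([1, 64, 128]) for both
```

The per-branch step budget lets the cheaper branch stop refining early while the other continues, which is one way an asynchronous sampling schedule could reduce per-modality compute; the paper's actual strategy may differ.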
Related papers
- MirrorMe: Towards Realtime and High Fidelity Audio-Driven Halfbody Animation [21.216297567167036]
MirrorMe is a real-time, controllable framework built on the LTX video model. MirrorMe compresses video spatially and temporally for efficient latent-space denoising. Experiments on the EMTD Benchmark demonstrate MirrorMe's state-of-the-art performance in fidelity, lip-sync accuracy, and temporal stability.
arXiv Detail & Related papers (2025-06-27T09:57:23Z) - MotionRAG-Diff: A Retrieval-Augmented Diffusion Framework for Long-Term Music-to-Dance Generation [10.203209816178552]
MotionRAG-Diff is a hybrid framework that integrates Retrieval-Augmented Generation and diffusion-based refinement. Our method introduces three core innovations. It achieves state-of-the-art performance in motion quality, diversity, and music-motion synchronization accuracy.
arXiv Detail & Related papers (2025-06-03T09:12:48Z) - ReCoM: Realistic Co-Speech Motion Generation with Recurrent Embedded Transformer [58.49950218437718]
We present ReCoM, an efficient framework for generating high-fidelity and generalizable human body motions synchronized with speech. The core innovation lies in the Recurrent Embedded Transformer (RET), which integrates Dynamic Embedding Regularization (DER) into a Vision Transformer (ViT) core architecture. To enhance model robustness, we incorporate the proposed DER strategy, which equips the model with dual capabilities of noise resistance and cross-domain generalization.
arXiv Detail & Related papers (2025-03-27T16:39:40Z) - Cosh-DiT: Co-Speech Gesture Video Synthesis via Hybrid Audio-Visual Diffusion Transformers [58.86974149731874]
Cosh-DiT is a Co-speech gesture video system with hybrid Diffusion Transformers. We introduce an audio Diffusion Transformer to synthesize expressive gesture dynamics synchronized with speech rhythms. For realistic video synthesis conditioned on the generated speech-driven motion, we design a visual Diffusion Transformer.
arXiv Detail & Related papers (2025-03-13T01:36:05Z) - SyncDiff: Synchronized Motion Diffusion for Multi-Body Human-Object Interaction Synthesis [22.14972920585117]
We introduce SyncDiff, a novel method for multi-body interaction synthesis using a synchronized motion diffusion strategy. To enhance motion fidelity, we propose a frequency-domain motion decomposition scheme. We also introduce a new set of alignment scores to emphasize the synchronization of different body motions.
arXiv Detail & Related papers (2024-12-28T10:12:12Z) - Hallo: Hierarchical Audio-Driven Visual Synthesis for Portrait Image Animation [29.87407471246318]
This research delves into the complexities of synchronizing facial movements and creating visually appealing, temporally consistent animations.
Our innovative approach embraces the end-to-end diffusion paradigm and introduces a hierarchical audio-driven visual synthesis module.
The proposed hierarchical audio-driven visual synthesis offers adaptive control over expression and pose diversity, enabling more effective personalization tailored to different identities.
arXiv Detail & Related papers (2024-06-13T04:33:20Z) - FaceChain-ImagineID: Freely Crafting High-Fidelity Diverse Talking Faces from Disentangled Audio [45.71036380866305]
We abstract the process of people hearing speech, extracting meaningful cues, and creating dynamically audio-consistent talking faces from a single audio input.
Specifically, it involves two critical challenges: one is to effectively decouple identity, content, and emotion from entangled audio, and the other is to maintain intra-video diversity and inter-video consistency.
We introduce Controllable Coherent Frame generation, which flexibly integrates three trainable adapters with frozen Latent Diffusion Models.
arXiv Detail & Related papers (2024-03-04T09:59:48Z) - DiffSHEG: A Diffusion-Based Approach for Real-Time Speech-driven Holistic 3D Expression and Gesture Generation [72.85685916829321]
DiffSHEG is a Diffusion-based approach for Speech-driven Holistic 3D Expression and Gesture generation of arbitrary length.
By enabling the real-time generation of expressive and synchronized motions, DiffSHEG showcases its potential for various applications in the development of digital humans and embodied agents.
arXiv Detail & Related papers (2024-01-09T11:38:18Z) - Pose-Controllable 3D Facial Animation Synthesis using Hierarchical Audio-Vertex Attention [52.63080543011595]
A novel pose-controllable 3D facial animation synthesis method is proposed that utilizes hierarchical audio-vertex attention.
The proposed method can produce more realistic facial expressions and head posture movements.
arXiv Detail & Related papers (2023-02-24T09:36:31Z) - MoFusion: A Framework for Denoising-Diffusion-based Motion Synthesis [73.52948992990191]
MoFusion is a new denoising-diffusion-based framework for high-quality conditional human motion synthesis.
We present ways to introduce well-known kinematic losses for motion plausibility within the motion diffusion framework.
We demonstrate the effectiveness of MoFusion compared to the state of the art on established benchmarks in the literature.
arXiv Detail & Related papers (2022-12-08T18:59:48Z)