OmniAvatar: Efficient Audio-Driven Avatar Video Generation with Adaptive Body Animation
- URL: http://arxiv.org/abs/2506.18866v1
- Date: Mon, 23 Jun 2025 17:33:03 GMT
- Title: OmniAvatar: Efficient Audio-Driven Avatar Video Generation with Adaptive Body Animation
- Authors: Qijun Gan, Ruizi Yang, Jianke Zhu, Shaofei Xue, Steven Hoi
- Abstract summary: We introduce OmniAvatar, an audio-driven full-body video generation model. It enhances human animation with improved lip-sync accuracy and natural movements. Experiments show it surpasses existing models in both facial and semi-body video generation.
- Score: 11.71823020976487
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Significant progress has been made in audio-driven human animation, while most existing methods focus mainly on facial movements, limiting their ability to create full-body animations with natural synchronization and fluidity. They also struggle with precise prompt control for fine-grained generation. To tackle these challenges, we introduce OmniAvatar, an innovative audio-driven full-body video generation model that enhances human animation with improved lip-sync accuracy and natural movements. OmniAvatar introduces a pixel-wise multi-hierarchical audio embedding strategy to better capture audio features in the latent space, enhancing lip-syncing across diverse scenes. To preserve the capability for prompt-driven control of foundation models while effectively incorporating audio features, we employ a LoRA-based training approach. Extensive experiments show that OmniAvatar surpasses existing models in both facial and semi-body video generation, offering precise text-based control for creating videos in various domains, such as podcasts, human interactions, dynamic scenes, and singing. Our project page is https://omni-avatar.github.io/.
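The abstract's LoRA-based training idea can be illustrated with a minimal sketch. This is a hypothetical NumPy illustration, not the authors' implementation: the frozen base weight stands in for a foundation-model layer, and only the two low-rank factors would be trained, which is what lets prompt-driven control survive while audio conditioning is learned.

```python
import numpy as np

class LoRALinear:
    """Illustrative LoRA-style linear layer: y = W x + (alpha/r) * B (A x).

    W is frozen (the pretrained foundation weight); only the low-rank
    factors A and B would receive gradients during audio fine-tuning.
    """

    def __init__(self, in_dim, out_dim, rank=4, alpha=1.0, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.standard_normal((out_dim, in_dim))      # frozen base weight
        self.A = rng.standard_normal((rank, in_dim)) * 0.01  # trainable down-projection
        self.B = np.zeros((out_dim, rank))                   # trainable up-projection
        self.scale = alpha / rank

    def forward(self, x):
        # Base path plus scaled low-rank correction.
        return self.W @ x + self.scale * (self.B @ (self.A @ x))
```

Initializing `B` to zero makes the adapter a no-op at the start of fine-tuning, so the adapted model initially behaves exactly like the frozen foundation model.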
Related papers
- MOSPA: Human Motion Generation Driven by Spatial Audio [56.735282455483954]
We introduce the first comprehensive Spatial Audio-Driven Human Motion dataset, which contains diverse and high-quality spatial audio and motion data. We develop a simple yet effective diffusion-based generative framework for human MOtion generation driven by SPatial Audio, termed MOSPA. Once trained, MOSPA can generate diverse, realistic human motions conditioned on varying spatial audio inputs.
arXiv Detail & Related papers (2025-07-16T06:33:11Z) - FantasyTalking: Realistic Talking Portrait Generation via Coherent Motion Synthesis [12.987186425491242]
We propose a novel framework to generate high-fidelity, coherent talking portraits with controllable motion dynamics. In the first stage, we employ a clip-level training scheme to establish coherent global motion. In the second stage, we refine lip movements at the frame level using a lip-tracing mask, ensuring precise synchronization with audio signals.
arXiv Detail & Related papers (2025-04-07T08:56:01Z) - AudCast: Audio-Driven Human Video Generation by Cascaded Diffusion Transformers [83.90298286498306]
Existing methods mostly focus on driving facial movements, leading to incoherent head and body dynamics. We propose AudCast, a general audio-driven human video generation framework adopting a cascaded Diffusion-Transformers (DiTs) paradigm. Our framework generates high-fidelity audio-driven holistic human videos with temporal coherence and fine facial and hand details.
arXiv Detail & Related papers (2025-03-25T16:38:23Z) - Versatile Multimodal Controls for Expressive Talking Human Animation [26.61771541877306]
VersaAnimator is a versatile framework that synthesizes expressive talking human videos from arbitrary portrait images. We introduce a token2pose translator to smoothly map 3D motion tokens to 2D pose sequences.
arXiv Detail & Related papers (2025-03-10T08:38:25Z) - OmniHuman-1: Rethinking the Scaling-Up of One-Stage Conditioned Human Animation Models [25.45077656291886]
We propose a Diffusion Transformer-based framework that scales up data by mixing motion-related conditions into the training phase. These designs enable OmniHuman to fully leverage data-driven motion generation, ultimately achieving highly realistic human video generation. Compared to existing end-to-end audio-driven methods, OmniHuman not only produces more realistic videos, but also offers greater flexibility in inputs.
arXiv Detail & Related papers (2025-02-03T05:17:32Z) - GaussianSpeech: Audio-Driven Gaussian Avatars [76.10163891172192]
We introduce GaussianSpeech, a novel approach that synthesizes high-fidelity animation sequences of photo-realistic, personalized 3D human head avatars from spoken audio. We propose a compact and efficient 3DGS-based avatar representation that generates expression-dependent color and leverages wrinkle- and perceptually-based losses to synthesize facial details.
arXiv Detail & Related papers (2024-11-27T18:54:08Z) - Allo-AVA: A Large-Scale Multimodal Conversational AI Dataset for Allocentric Avatar Gesture Animation [1.9797215742507548]
Allo-AVA is a dataset specifically designed for text and audio-driven avatar gesture animation in an allocentric (third person point-of-view) context.
This resource enables the development and evaluation of more natural, context-aware avatar animation models.
arXiv Detail & Related papers (2024-10-21T20:50:51Z) - LinguaLinker: Audio-Driven Portraits Animation with Implicit Facial Control Enhancement [8.973545189395953]
This study focuses on the creation of visually compelling, time-synchronized animations through diffusion-based techniques.
We process audio features separately and derive the corresponding control gates, which implicitly govern the movements in the mouth, eyes, and head, irrespective of the portrait's origin.
The significant improvements in the fidelity of animated portraits, the accuracy of lip-syncing, and the appropriate motion variations achieved by our method render it a versatile tool for animating any portrait in any language.
arXiv Detail & Related papers (2024-07-26T08:30:06Z) - Action2Sound: Ambient-Aware Generation of Action Sounds from Egocentric Videos [87.32349247938136]
Existing approaches implicitly assume total correspondence between the video and audio during training.
We propose a novel ambient-aware audio generation model, AV-LDM.
Our approach is the first to ground video-to-audio generation faithfully in the observed visual content.
arXiv Detail & Related papers (2024-06-13T16:10:19Z) - MeshTalk: 3D Face Animation from Speech using Cross-Modality Disentanglement [142.9900055577252]
We propose a generic audio-driven facial animation approach that achieves highly realistic motion synthesis results for the entire face.
Our approach ensures highly accurate lip motion, while also producing plausible animation of the parts of the face that are uncorrelated with the audio signal, such as eye blinks and eyebrow motion.
arXiv Detail & Related papers (2021-04-16T17:05:40Z) - Audio- and Gaze-driven Facial Animation of Codec Avatars [149.0094713268313]
We describe the first approach to animate Codec Avatars in real-time using audio and/or eye tracking.
Our goal is to display expressive conversations between individuals that exhibit important social signals.
arXiv Detail & Related papers (2020-08-11T22:28:48Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this information and is not responsible for any consequences of its use.