Hallo4: High-Fidelity Dynamic Portrait Animation via Direct Preference Optimization and Temporal Motion Modulation
- URL: http://arxiv.org/abs/2505.23525v1
- Date: Thu, 29 May 2025 15:04:00 GMT
- Title: Hallo4: High-Fidelity Dynamic Portrait Animation via Direct Preference Optimization and Temporal Motion Modulation
- Authors: Jiahao Cui, Yan Chen, Mingwang Xu, Hanlin Shang, Yuxuan Chen, Yun Zhan, Zilong Dong, Yao Yao, Jingdong Wang, Siyu Zhu,
- Abstract summary: First, we introduce direct preference optimization tailored for human-centric animation. Second, the proposed temporal motion modulation resolves spatiotemporal resolution mismatches. Experiments demonstrate clear improvements in lip-audio synchronization, expression vividness, and body motion coherence over baseline methods.
- Score: 26.597877504216196
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Generating highly dynamic and photorealistic portrait animations driven by audio and skeletal motion remains challenging due to the need for precise lip synchronization, natural facial expressions, and high-fidelity body motion dynamics. We propose a human-preference-aligned diffusion framework that addresses these challenges through two key innovations. First, we introduce direct preference optimization tailored for human-centric animation, leveraging a curated dataset of human preferences to align generated outputs with perceptual metrics for portrait motion-video alignment and naturalness of expression. Second, the proposed temporal motion modulation resolves spatiotemporal resolution mismatches by reshaping motion conditions into dimensionally aligned latent features through temporal channel redistribution and proportional feature expansion, preserving the fidelity of high-frequency motion details in diffusion-based synthesis. The proposed mechanism is complementary to existing UNet- and DiT-based portrait diffusion approaches, and experiments demonstrate clear improvements in lip-audio synchronization, expression vividness, and body motion coherence over baseline methods, alongside notable gains in human preference metrics. Our model and source code can be found at: https://github.com/xyz123xyz456/hallo4.
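For intuition, here is a minimal sketch of a Diffusion-DPO-style preference objective of the kind the abstract describes. The paper's exact loss, weighting, and preference-data pipeline are not given here, so the function names, the `beta` scale, and the use of noise-prediction errors are assumptions:

```python
import torch
import torch.nn.functional as F

def diffusion_dpo_loss(eps_theta_w, eps_theta_l, eps_ref_w, eps_ref_l,
                       eps_w, eps_l, beta=1000.0):
    """Diffusion-DPO-style preference loss (sketch).

    eps_theta_*: noise predictions of the trainable model on the noised
                 human-preferred (w) and rejected (l) clips.
    eps_ref_*:   predictions of a frozen reference model on the same inputs.
    eps_w/eps_l: the true injected noise for each clip.
    """
    # Per-sample denoising error, reduced over all but the batch dim.
    def err(pred, target):
        return ((pred - target) ** 2).flatten(1).mean(dim=1)

    # How much the policy improves over the reference on each clip.
    adv_w = err(eps_theta_w, eps_w) - err(eps_ref_w, eps_w)
    adv_l = err(eps_theta_l, eps_l) - err(eps_ref_l, eps_l)

    # Logistic objective: improve more on the preferred clip than on the
    # rejected one, with the frozen reference acting as a KL anchor.
    return -F.logsigmoid(-beta * (adv_w - adv_l)).mean()
```

The second mechanism, temporal motion modulation, is described as reshaping motion conditions into dimensionally aligned latent features. A minimal sketch, under the assumption that this amounts to folding groups of motion frames into channels (temporal channel redistribution) and projecting them to the latent width (proportional feature expansion); the paper's exact operator may differ:

```python
import torch
import torch.nn as nn

class TemporalMotionModulation(nn.Module):
    """Hypothetical reshaping of a dense motion condition so that it
    matches a temporally compressed video latent.
    """
    def __init__(self, motion_ch, latent_ch, t_ratio=4):
        super().__init__()
        self.t_ratio = t_ratio
        # Proportional feature expansion to the latent channel width.
        self.expand = nn.Conv3d(motion_ch * t_ratio, latent_ch, kernel_size=1)

    def forward(self, motion):  # (B, C, T, H, W), T divisible by t_ratio
        b, c, t, h, w = motion.shape
        # Temporal channel redistribution: fold each group of t_ratio
        # consecutive frames into the channel axis, so the time axis
        # matches the latent's compressed length T // t_ratio.
        m = motion.reshape(b, c, t // self.t_ratio, self.t_ratio, h, w)
        m = m.permute(0, 1, 3, 2, 4, 5).reshape(
            b, c * self.t_ratio, t // self.t_ratio, h, w)
        return self.expand(m)  # (B, latent_ch, T // t_ratio, H, W)
```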
Related papers
- M2DAO-Talker: Harmonizing Multi-granular Motion Decoupling and Alternating Optimization for Talking-head Generation [65.08520614570288]
We reformulate talking-head generation into a unified framework comprising video preprocessing, motion representation, and rendering reconstruction. M2DAO-Talker achieves state-of-the-art performance, with a 2.43 dB PSNR improvement in generation quality and a 0.64 gain in user-evaluated video realness.
arXiv Detail & Related papers (2025-07-11T04:48:12Z)
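The 2.43 dB figure above is standard peak signal-to-noise ratio. For reference, a minimal sketch assuming frames normalized to [0, 1]:

```python
import torch

def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio in dB between two images or videos."""
    mse = torch.mean((pred - target) ** 2)
    return 10.0 * torch.log10(max_val ** 2 / mse)
```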
- HyperMotion: DiT-Based Pose-Guided Human Image Animation of Complex Motions [12.46263584777151]
We introduce the Open-HyperMotionX dataset and HyperMotionX Bench, which provide high-quality human pose annotations and curated video clips. We also propose a simple yet powerful DiT-based video generation baseline and design a spatial low-frequency enhanced RoPE. Our method significantly improves structural stability and appearance consistency in highly dynamic human motion sequences.
arXiv Detail & Related papers (2025-05-29T01:30:46Z)
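HyperMotion's summary mentions a spatial low-frequency enhanced RoPE. For orientation, a minimal sketch of standard rotary position embedding; the paper's low-frequency enhancement is not specified here, so only a comment marks where it would act:

```python
import torch

def apply_rope(x, pos, base=10000.0):
    """Standard rotary position embedding over the last dim (sketch).

    x:   (..., seq, dim) activations, dim even.
    pos: (seq,) positions. HyperMotion's spatial low-frequency
    enhancement would reweight `freqs` below; that step is omitted.
    """
    half = x.shape[-1] // 2
    freqs = base ** (-torch.arange(half, dtype=torch.float32) / half)
    angles = pos.float()[:, None] * freqs[None, :]      # (seq, half)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)
```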
- AsynFusion: Towards Asynchronous Latent Consistency Models for Decoupled Whole-Body Audio-Driven Avatars [65.53676584955686]
Whole-body audio-driven avatar pose and expression generation is a critical task for creating lifelike digital humans. We propose AsynFusion, a novel framework that leverages diffusion transformers to achieve cohesive expression and gesture synthesis. AsynFusion achieves state-of-the-art performance in generating real-time, synchronized whole-body animations.
arXiv Detail & Related papers (2025-05-21T03:28:53Z)
- REWIND: Real-Time Egocentric Whole-Body Motion Diffusion with Exemplar-Based Identity Conditioning [95.07708090428814]
We present REWIND, a one-step diffusion model for real-time, high-fidelity human motion estimation from egocentric image inputs. We introduce cascaded body-hand denoising diffusion, which effectively models the correlation between egocentric body and hand motions. We also propose a novel identity conditioning method based on a small set of pose exemplars of the target identity, which further enhances motion estimation quality.
arXiv Detail & Related papers (2025-04-07T11:44:11Z)
- EvAnimate: Event-conditioned Image-to-Video Generation for Human Animation [58.41979933166173]
EvAnimate is the first method to leverage event streams as robust and precise motion cues for conditional human image animation. High-quality and temporally coherent animations are achieved through a dual-branch architecture. Experimental results show that EvAnimate achieves high temporal fidelity and robust performance in scenarios where traditional video-derived cues fall short.
arXiv Detail & Related papers (2025-03-24T11:05:41Z)
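EvAnimate conditions on event streams; a common way to feed events to a network is a spatiotemporal voxel grid. A minimal sketch, with the caveat that EvAnimate's actual event encoding is not described in this summary:

```python
import torch

def events_to_voxel(xs, ys, ts, ps, bins, height, width):
    """Accumulate an event stream into a (bins, H, W) voxel grid.

    xs, ys: integer pixel coordinates; ts: timestamps;
    ps: polarities in {-1, +1}. Each event adds its polarity to the
    temporal bin its timestamp falls into.
    """
    voxel = torch.zeros(bins, height, width)
    t_rel = (ts - ts.min()) / (ts.max() - ts.min() + 1e-9)
    b = (t_rel * (bins - 1)).round().long()
    flat = b * (height * width) + ys.long() * width + xs.long()
    voxel.view(-1).index_add_(0, flat, ps.float())
    return voxel
```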
- FLOAT: Generative Motion Latent Flow Matching for Audio-driven Talking Portrait [3.3672851080270374]
FLOAT is an audio-driven talking portrait video generation method based on a flow-matching generative model. Our method supports speech-driven emotion enhancement, enabling natural incorporation of expressive motions.
arXiv Detail & Related papers (2024-12-02T02:50:07Z)
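FLOAT builds on flow matching. For background, here is the generic conditional flow-matching objective with a linear probability path, as a sketch; FLOAT's exact parameterization and conditioning are assumptions here:

```python
import torch

def flow_matching_loss(model, x1, cond):
    """Generic conditional flow-matching loss (sketch).

    model(x_t, t, cond) predicts a velocity; x1 is a clean motion latent,
    cond an audio/emotion conditioning tensor.
    """
    x0 = torch.randn_like(x1)                      # noise endpoint
    t = torch.rand(x1.shape[0], *([1] * (x1.dim() - 1)), device=x1.device)
    xt = (1.0 - t) * x0 + t * x1                   # point on the linear path
    v_target = x1 - x0                             # constant path velocity
    v_pred = model(xt, t.flatten(), cond)
    return torch.mean((v_pred - v_target) ** 2)
```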
- Loopy: Taming Audio-Driven Portrait Avatar with Long-Term Motion Dependency [15.841490425454344]
We propose an end-to-end, audio-only conditioned video diffusion model named Loopy. Specifically, we design an inter- and intra-clip temporal module and an audio-to-latents module, enabling the model to leverage long-term motion information.
arXiv Detail & Related papers (2024-09-04T11:55:14Z)
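Loopy's audio-to-latents module maps audio features into the diffusion latent space. A hypothetical cross-attention sketch; the module's real architecture is not given in this summary, so every name below is illustrative:

```python
import torch
import torch.nn as nn

class AudioToLatents(nn.Module):
    """Hypothetical audio-to-latents block: video latent tokens attend
    to audio feature tokens via cross-attention.
    """
    def __init__(self, latent_dim, audio_dim, n_heads=8):
        super().__init__()
        self.proj = nn.Linear(audio_dim, latent_dim)
        self.attn = nn.MultiheadAttention(latent_dim, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(latent_dim)

    def forward(self, latents, audio):  # (B, N, D), (B, T, A)
        a = self.proj(audio)
        out, _ = self.attn(self.norm(latents), a, a)
        return latents + out            # residual injection of audio cues
```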
- Hallo: Hierarchical Audio-Driven Visual Synthesis for Portrait Image Animation [29.87407471246318]
This research delves into the complexities of synchronizing facial movements and creating visually appealing, temporally consistent animations.
Our approach adopts the end-to-end diffusion paradigm and introduces a hierarchical audio-driven visual synthesis module.
The proposed hierarchical audio-driven visual synthesis offers adaptive control over expression and pose diversity, enabling more effective personalization tailored to different identities.
arXiv Detail & Related papers (2024-06-13T04:33:20Z)
- FaceTalk: Audio-Driven Motion Diffusion for Neural Parametric Head Models [85.16273912625022]
We introduce FaceTalk, a novel generative approach designed for synthesizing high-fidelity 3D motion sequences of talking human heads from an audio signal.
To the best of our knowledge, this is the first work to propose a generative approach for realistic and high-quality motion synthesis of human heads.
arXiv Detail & Related papers (2023-12-13T19:01:07Z)
- DiffusionPhase: Motion Diffusion in Frequency Domain [69.811762407278]
We introduce a learning-based method for generating high-quality human motion sequences from text descriptions.
Existing techniques struggle with motion diversity and smooth transitions in generating arbitrary-length motion sequences.
We develop a network encoder that converts the motion space into a compact yet expressive parameterized phase space.
arXiv Detail & Related papers (2023-12-07T04:39:22Z)
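DiffusionPhase parameterizes motion in a phase space. A DeepPhase-style sketch of extracting per-channel amplitude, phase, and frequency via the FFT, as an assumption about what such a parameterization can look like; the paper's learned encoder is not reproduced here:

```python
import torch

def motion_to_phase(x):
    """Per-channel (amplitude, phase, frequency) from the dominant
    FFT bin of a motion sequence. x: (batch, time, channels).
    """
    n = x.shape[1]
    spec = torch.fft.rfft(x, dim=1)                    # (B, F, C) complex
    power = spec.abs() ** 2
    idx = power[:, 1:, :].argmax(dim=1) + 1            # skip DC bin -> (B, C)
    re = torch.gather(spec.real, 1, idx.unsqueeze(1)).squeeze(1)
    im = torch.gather(spec.imag, 1, idx.unsqueeze(1)).squeeze(1)
    amplitude = 2.0 * torch.sqrt(re ** 2 + im ** 2) / n
    phase = torch.atan2(im, re)
    freq = idx.float() / n                             # cycles per frame
    return amplitude, phase, freq
```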
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.