AlignHuman: Improving Motion and Fidelity via Timestep-Segment Preference Optimization for Audio-Driven Human Animation
- URL: http://arxiv.org/abs/2506.11144v1
- Date: Wed, 11 Jun 2025 05:33:03 GMT
- Title: AlignHuman: Improving Motion and Fidelity via Timestep-Segment Preference Optimization for Audio-Driven Human Animation
- Authors: Chao Liang, Jianwen Jiang, Wang Liao, Jiaqi Yang, Zerong Zheng, Weihong Zeng, Han Liang,
- Abstract summary: We propose AlignHuman, a framework that combines Preference Optimization as a post-training technique with a divide-and-conquer training strategy. LoRAs are trained using their respective preference data and activated in the corresponding intervals during inference to enhance motion naturalness and fidelity. Experiments demonstrate that AlignHuman improves strong baselines and reduces NFEs during inference, achieving a 3.3$\times$ speedup.
- Score: 24.745851101654612
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Recent advancements in human video generation and animation tasks, driven by diffusion models, have achieved significant progress. However, expressive and realistic human animation remains challenging due to the trade-off between motion naturalness and visual fidelity. To address this, we propose \textbf{AlignHuman}, a framework that combines Preference Optimization as a post-training technique with a divide-and-conquer training strategy to jointly optimize these competing objectives. Our key insight stems from an analysis of the denoising process across timesteps: (1) early denoising timesteps primarily control motion dynamics, while (2) fidelity and human structure can be effectively managed by later timesteps, even if early steps are skipped. Building on this observation, we propose timestep-segment preference optimization (TPO) and introduce two specialized LoRAs as expert alignment modules, each targeting a specific dimension in its corresponding timestep interval. The LoRAs are trained using their respective preference data and activated in the corresponding intervals during inference to enhance motion naturalness and fidelity. Extensive experiments demonstrate that AlignHuman improves strong baselines and reduces NFEs during inference, achieving a 3.3$\times$ speedup (from 100 NFEs to 30 NFEs) with minimal impact on generation quality. Homepage: \href{https://alignhuman.github.io/}{https://alignhuman.github.io/}
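To make the timestep-segment mechanism concrete, the sketch below shows how the two expert LoRAs could be switched across denoising intervals at inference time. This is a minimal, self-contained illustration under stated assumptions: the step count, the split point, and all names (denoise_step, sample, motion_lora, fidelity_lora, SPLIT) are hypothetical stand-ins, not the authors' released code.

```python
# Hypothetical sketch of timestep-segment LoRA switching during sampling.
# Early (high-noise) timesteps shape motion dynamics; later (low-noise)
# timesteps refine fidelity and human structure, per the paper's analysis.

NUM_STEPS = 30  # reduced NFEs, as reported in the abstract (100 -> 30)
SPLIT = 20      # hypothetical boundary between the two timestep segments

def denoise_step(latent, t, lora_name):
    """Stand-in for one denoising step with the named LoRA active."""
    print(f"step t={t:3d} using {lora_name}")
    return latent  # a real model would predict and remove noise here

def sample(latent):
    for i in range(NUM_STEPS):
        t = NUM_STEPS - 1 - i  # iterate from high noise down to low noise
        lora = "motion_lora" if t >= SPLIT else "fidelity_lora"
        latent = denoise_step(latent, t, lora)
    return latent

if __name__ == "__main__":
    sample(latent=0.0)
```

Under this scheme, each expert LoRA is only ever active on the timestep interval whose preference data it was trained with, so the motion and fidelity objectives are optimized without interfering with each other.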
Related papers
- M2DAO-Talker: Harmonizing Multi-granular Motion Decoupling and Alternating Optimization for Talking-head Generation [65.08520614570288]
We reformulate talking head generation into a unified framework comprising video preprocessing, motion representation, and rendering reconstruction. M2DAO-Talker achieves state-of-the-art performance, with a 2.43 dB PSNR improvement in generation quality and a 0.64 gain in user-evaluated video realness.
arXiv Detail & Related papers (2025-07-11T04:48:12Z) - GGMotion: Group Graph Dynamics-Kinematics Networks for Human Motion Prediction [9.723217255594793]
GGMotion is a group graph dynamics-kinematics network that models human topology in groups to better leverage dynamics and kinematics priors. Inter-group and intra-group interaction modules are employed to capture the dependencies of joints at different scales. Our approach achieves a significant performance margin in short-term motion prediction.
arXiv Detail & Related papers (2025-07-10T08:02:01Z) - Zero-Shot Temporal Interaction Localization for Egocentric Videos [13.70694228506315]
We propose a novel zero-shot TIL approach dubbed EgoLoc to locate the timings of grasp actions for human-object interaction in egocentric videos. By absorbing both 2D and 3D observations, EgoLoc directly samples high-quality initial guesses around the possible contact/separation timestamps of HOI. EgoLoc achieves better temporal interaction localization for egocentric videos compared to state-of-the-art baselines.
arXiv Detail & Related papers (2025-06-04T07:52:46Z) - Hallo4: High-Fidelity Dynamic Portrait Animation via Direct Preference Optimization and Temporal Motion Modulation [26.597877504216196]
We introduce direct preference optimization tailored for human-centric animation. The proposed temporal motion modulation resolves resolution mismatches. Experiments demonstrate clear improvements in lip-audio synchronization, expression vividness, and body motion coherence over baseline methods.
arXiv Detail & Related papers (2025-05-29T15:04:00Z) - SuperFlow++: Enhanced Spatiotemporal Consistency for Cross-Modal Data Pretraining [62.433137130087445]
SuperFlow++ is a novel framework that integrates pretraining and downstream tasks using consecutive camera pairs. We show that SuperFlow++ outperforms state-of-the-art methods across diverse tasks and driving conditions. With strong generalizability and computational efficiency, SuperFlow++ establishes a new benchmark for data-efficient LiDAR-based perception in autonomous driving.
arXiv Detail & Related papers (2025-03-25T17:59:57Z) - EvAnimate: Event-conditioned Image-to-Video Generation for Human Animation [58.41979933166173]
EvAnimate is the first method leveraging event streams as robust and precise motion cues for conditional human image animation. High-quality and temporally coherent animations are achieved through a dual-branch architecture. Experimental results show EvAnimate achieves high temporal fidelity and robust performance in scenarios where traditional video-derived cues fall short.
arXiv Detail & Related papers (2025-03-24T11:05:41Z) - GestureLSM: Latent Shortcut based Co-Speech Gesture Generation with Spatial-Temporal Modeling [32.47567372398872]
GestureLSM is a flow-matching-based approach for Co-Speech Gesture Generation with spatial-temporal modeling. It achieves state-of-the-art performance on BEAT2 while significantly reducing inference time compared to existing methods.
arXiv Detail & Related papers (2025-01-31T05:34:59Z) - ALOcc: Adaptive Lifting-based 3D Semantic Occupancy and Cost Volume-based Flow Prediction [89.89610257714006]
Existing methods prioritize higher accuracy to cater to the demands of these tasks.
We introduce a series of targeted improvements for 3D semantic occupancy prediction and flow estimation.
Our architecture framework, named ALOcc, achieves an optimal tradeoff between speed and accuracy.
arXiv Detail & Related papers (2024-11-12T11:32:56Z) - MotionRL: Align Text-to-Motion Generation to Human Preferences with Multi-Reward Reinforcement Learning [99.09906827676748]
We introduce MotionRL, the first approach to utilize Multi-Reward Reinforcement Learning (RL) for optimizing text-to-motion generation tasks.
Our novel approach uses reinforcement learning to fine-tune the motion generator based on human preferences and prior knowledge of the human perception model.
In addition, MotionRL introduces a novel multi-objective optimization strategy to approximate Pareto optimality between text adherence, motion quality, and human preferences.
arXiv Detail & Related papers (2024-10-09T03:27:14Z) - TIMotion: Temporal and Interactive Framework for Efficient Human-Human Motion Generation [30.734182958106327]
Current methods fall into two main categories: single-person-based methods and separate modeling-based methods. We introduce TIMotion (Temporal and Interactive Modeling), an efficient and effective framework for human-human motion generation.
arXiv Detail & Related papers (2024-08-30T09:22:07Z) - Text-guided 3D Human Motion Generation with Keyframe-based Parallel Skip Transformer [62.29951737214263]
Existing algorithms directly generate the full sequence which is expensive and prone to errors.
We propose KeyMotion, which generates plausible human motion sequences corresponding to input text.
We use a Variational Autoencoder (VAE) with Kullback-Leibler regularization to project the keyframes into a latent space.
For the reverse diffusion, we propose a novel Parallel Skip Transformer that performs cross-modal attention between the design latents and text condition.
arXiv Detail & Related papers (2024-05-24T11:12:37Z) - Motion-DVAE: Unsupervised learning for fast human motion denoising [18.432026846779372]
We introduce Motion-DVAE, a motion prior that captures the short-term dependencies of human motion.
Together with Motion-DVAE, we introduce an unsupervised learned denoising method that unifies regression- and optimization-based approaches.
arXiv Detail & Related papers (2023-06-09T12:18:48Z) - GeneFace++: Generalized and Stable Real-Time Audio-Driven 3D Talking Face Generation [71.73912454164834]
A modern talking face generation method is expected to achieve the goals of generalized audio-lip synchronization, good video quality, and high system efficiency.
NeRF has become a popular technique in this field since it can achieve high-fidelity and 3D-consistent talking face generation with a few-minute-long training video.
We propose GeneFace++ to handle these challenges by utilizing the rendering pitch contour as an auxiliary feature and introducing a temporal loss in the facial motion prediction process.
arXiv Detail & Related papers (2023-05-01T12:24:09Z)