FantasyTalking2: Timestep-Layer Adaptive Preference Optimization for Audio-Driven Portrait Animation
- URL: http://arxiv.org/abs/2508.11255v1
- Date: Fri, 15 Aug 2025 06:43:46 GMT
- Title: FantasyTalking2: Timestep-Layer Adaptive Preference Optimization for Audio-Driven Portrait Animation
- Authors: MengChao Wang, Qiang Wang, Fan Jiang, Mu Xu
- Abstract summary: We introduce Talking-Critic, a reward model that learns human-aligned reward functions to quantify how well generated videos satisfy multidimensional expectations. We also propose Timestep-Layer adaptive multi-expert Preference Optimization (TLPO), a novel framework for aligning diffusion-based portrait animation models with fine-grained, multidimensional preferences. Experiments demonstrate Talking-Critic significantly outperforms existing methods in aligning with human preference ratings.
- Score: 7.550875699205677
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advances in audio-driven portrait animation have demonstrated impressive capabilities. However, existing methods struggle to align with fine-grained human preferences across multiple dimensions, such as motion naturalness, lip-sync accuracy, and visual quality. This is due to the difficulty of optimizing among competing preference objectives, which often conflict with one another, and the scarcity of large-scale, high-quality datasets with multidimensional preference annotations. To address these challenges, we first introduce Talking-Critic, a multimodal reward model that learns human-aligned reward functions to quantify how well generated videos satisfy multidimensional expectations. Leveraging this model, we curate Talking-NSQ, a large-scale multidimensional human preference dataset containing 410K preference pairs. Finally, we propose Timestep-Layer adaptive multi-expert Preference Optimization (TLPO), a novel framework for aligning diffusion-based portrait animation models with fine-grained, multidimensional preferences. TLPO decouples preferences into specialized expert modules, which are then fused across timesteps and network layers, enabling comprehensive, fine-grained enhancement across all dimensions without mutual interference. Experiments demonstrate that Talking-Critic significantly outperforms existing methods in aligning with human preference ratings. Meanwhile, TLPO achieves substantial improvements over baseline models in lip-sync accuracy, motion naturalness, and visual quality, exhibiting superior performance in both qualitative and quantitative evaluations. Our project page: https://fantasy-amap.github.io/fantasy-talking2/
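The abstract pins down TLPO's structure just precisely enough to sketch: per-dimension expert modules whose outputs are fused with weights conditioned on the diffusion timestep and the network layer. The following is a minimal PyTorch sketch under those assumptions only; the LoRA-style experts, the softmax gate, and all class and parameter names are hypothetical illustrations, not the authors' implementation.

```python
# Hypothetical sketch of timestep-layer adaptive multi-expert fusion.
# Experts, gate, and every name here are illustrative assumptions drawn only
# from the abstract's description, not the authors' released code.
import torch
import torch.nn as nn

class PreferenceExpert(nn.Module):
    """LoRA-style expert for one preference dimension
    (e.g. lip-sync accuracy, motion naturalness, visual quality)."""
    def __init__(self, dim: int, rank: int = 8):
        super().__init__()
        self.down = nn.Linear(dim, rank, bias=False)
        self.up = nn.Linear(rank, dim, bias=False)
        nn.init.zeros_(self.up.weight)  # starts as a no-op residual

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        return self.up(self.down(h))

class TimestepLayerGate(nn.Module):
    """Maps (diffusion timestep, layer index) to per-expert fusion weights."""
    def __init__(self, num_experts: int, num_layers: int, emb_dim: int = 64):
        super().__init__()
        self.t_proj = nn.Sequential(nn.Linear(1, emb_dim), nn.SiLU())
        self.layer_emb = nn.Embedding(num_layers, emb_dim)
        self.to_weights = nn.Linear(emb_dim, num_experts)

    def forward(self, t: torch.Tensor, layer_idx: int) -> torch.Tensor:
        e = self.t_proj(t[:, None].float()) + self.layer_emb.weight[layer_idx]
        return torch.softmax(self.to_weights(e), dim=-1)  # (batch, num_experts)

class TLPOAdaptedLayer(nn.Module):
    """Wraps one frozen base layer and fuses expert residuals with gate
    weights, so each (timestep, layer) pair gets its own preference mixture."""
    def __init__(self, base: nn.Module, dim: int, num_experts: int,
                 gate: TimestepLayerGate, layer_idx: int):
        super().__init__()
        self.base, self.gate, self.layer_idx = base, gate, layer_idx
        self.experts = nn.ModuleList(
            PreferenceExpert(dim) for _ in range(num_experts))

    def forward(self, h: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        w = self.gate(t, self.layer_idx)                       # (B, E)
        deltas = torch.stack([e(h) for e in self.experts], 1)  # (B, E, ..., D)
        while w.dim() < deltas.dim():                          # broadcast weights
            w = w.unsqueeze(-1)
        return self.base(h) + (w * deltas).sum(dim=1)
```

One plausible training setup, consistent with the abstract's "without mutual interference" claim, would freeze the base model and backpropagate each dimension's preference loss only through its own expert and the shared gate; the paper's actual recipe may differ.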
Related papers
- JavisDiT++: Unified Modeling and Optimization for Joint Audio-Video Generation [112.614973927778]
Joint audio-video generation (JAVG) produces synchronized and semantically aligned sound and vision from textual descriptions. This paper presents JavisDiT++, a framework for unified modeling and optimization of JAVG. Our model achieves state-of-the-art performance with only around 1M public training entries.
arXiv Detail & Related papers (2026-02-22T12:44:28Z)
- CounterVid: Counterfactual Video Generation for Mitigating Action and Temporal Hallucinations in Video-Language Models [66.56549019393042]
Video-language models (VLMs) achieve strong multimodal understanding but remain prone to hallucinations, especially when reasoning about actions and temporal order. We propose a scalable framework for counterfactual video generation that synthesizes videos differing only in actions or temporal structure while preserving scene context.
arXiv Detail & Related papers (2026-01-08T10:03:07Z)
- Mind the Generative Details: Direct Localized Detail Preference Optimization for Video Diffusion Models [65.16788152626499]
LocalDPO is a novel framework for aligning video diffusion models with human preferences. We show that LocalDPO consistently improves video fidelity, temporal coherence and human preference scores over other post-training approaches.
arXiv Detail & Related papers (2026-01-07T16:32:17Z)
- SynPO: Synergizing Descriptiveness and Preference Optimization for Video Detailed Captioning [69.34975070207763]
We leverage preference learning to enhance the performance of vision-language models in fine-grained video captioning. We propose a novel optimization method offering significant advantages over DPO and its variants. Results demonstrate that SynPO consistently outperforms DPO variants while achieving a 20% improvement in training efficiency.
arXiv Detail & Related papers (2025-06-01T04:51:49Z)
- Aligning Anime Video Generation with Human Feedback [31.701968335565393]
Anime video generation faces significant challenges due to the scarcity of anime data and unusual motion patterns. Existing reward models, designed primarily for real-world videos, fail to capture the unique appearance and consistency requirements of anime. We propose a pipeline to enhance anime video generation by leveraging human feedback for better alignment.
arXiv Detail & Related papers (2025-04-14T09:49:34Z)
- Multi-Step Alignment as Markov Games: An Optimistic Online Gradient Descent Approach with Convergence Guarantees [91.88803125231189]
Reinforcement Learning from Human Feedback (RLHF) has been highly successful in aligning large language models with human preferences. While prevalent methods like DPO have demonstrated strong performance, they frame interactions with the language model as a bandit problem. In this paper, we address these challenges by modeling the alignment problem as a two-player constant-sum Markov game.
arXiv Detail & Related papers (2025-02-18T09:33:48Z)
- Diffusion Model as a Noise-Aware Latent Reward Model for Step-Level Preference Optimization [49.302188710680866]
Preference optimization for diffusion models aims to align them with human preferences for images. We show that pre-trained diffusion models are naturally suited for step-level reward modeling in the noisy latent space. We introduce Latent Preference Optimization (LPO), a step-level preference optimization method conducted directly in the noisy latent space.
arXiv Detail & Related papers (2025-02-03T04:51:28Z)
- CHiP: Cross-modal Hierarchical Direct Preference Optimization for Multimodal LLMs [107.21334626890713]
Multimodal Large Language Models (MLLMs) still struggle with hallucinations despite their impressive capabilities. We propose Cross-modal Hierarchical Direct Preference Optimization (CHiP) to address these limitations. We evaluate CHiP through both quantitative and qualitative analyses, with results across multiple benchmarks demonstrating its effectiveness in reducing hallucinations.
arXiv Detail & Related papers (2025-01-28T02:05:38Z)
- Personalized Preference Fine-tuning of Diffusion Models [75.22218338096316]
We introduce PPD, a multi-reward optimization objective that aligns diffusion models with personalized preferences. With PPD, a diffusion model learns the individual preferences of a population of users in a few-shot way. Our approach achieves an average win rate of 76% over Stable Cascade, generating images that more accurately reflect specific user preferences.
arXiv Detail & Related papers (2025-01-11T22:38:41Z)
- VideoDPO: Omni-Preference Alignment for Video Diffusion Generation [48.36302380755874]
Direct Preference Optimization (DPO) has demonstrated significant improvements in language and image generation. We propose a VideoDPO pipeline by making several key adjustments. Our experiments demonstrate substantial improvements in both visual quality and semantic alignment. (A generic sketch of the DPO-style diffusion loss this family of papers shares follows after this list.)
arXiv Detail & Related papers (2024-12-18T18:59:49Z)
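Several entries above (LocalDPO, LPO, VideoDPO) adapt Direct Preference Optimization to diffusion models, and FantasyTalking2's TLPO targets the same preference-alignment setting. As a shared reference point, here is a minimal sketch of a Diffusion-DPO style paired loss on noisy latents; the function, shapes, and the beta value are illustrative assumptions rather than any one paper's implementation.

```python
# Minimal Diffusion-DPO style paired loss, sketched as a common reference for
# the DPO-based entries above. Names, shapes, and beta are illustrative only.
import torch
import torch.nn.functional as F

def diffusion_dpo_loss(model, ref_model, x_w, x_l, alphas_cumprod, beta=5000.0):
    """x_w / x_l: clean latents of the preferred / rejected sample, (B, C, H, W)
    (video latents would add a frame axis). model/ref_model predict the noise."""
    B = x_w.shape[0]
    t = torch.randint(0, len(alphas_cumprod), (B,), device=x_w.device)
    noise = torch.randn_like(x_w)
    a = alphas_cumprod.to(x_w.device)[t].view(B, 1, 1, 1)

    def denoise_errors(x0):
        x_t = a.sqrt() * x0 + (1.0 - a).sqrt() * noise  # DDPM forward process
        e_pol = (model(x_t, t) - noise).pow(2).mean(dim=(1, 2, 3))
        with torch.no_grad():  # frozen reference model
            e_ref = (ref_model(x_t, t) - noise).pow(2).mean(dim=(1, 2, 3))
        return e_pol, e_ref

    ew_pol, ew_ref = denoise_errors(x_w)  # preferred ("win") sample
    el_pol, el_ref = denoise_errors(x_l)  # rejected ("lose") sample
    # The policy should beat the reference on the winner and not on the loser.
    margin = (ew_pol - ew_ref) - (el_pol - el_ref)
    return -F.logsigmoid(-beta * margin).mean()
```

Sharing the same timestep and noise across the winner and loser keeps the comparison paired, which is the standard trick these diffusion-DPO variants rely on.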