Identity-Preserving Image-to-Video Generation via Reward-Guided Optimization
- URL: http://arxiv.org/abs/2510.14255v3
- Date: Thu, 23 Oct 2025 11:15:51 GMT
- Title: Identity-Preserving Image-to-Video Generation via Reward-Guided Optimization
- Authors: Liao Shen, Wentao Jiang, Yiran Zhu, Jiahe Li, Tiezheng Ge, Zhiguo Cao, Bo Zheng
- Abstract summary: We propose Identity-Preserving Reward-guided Optimization (IPRO) for image-to-video (I2V) generation. IPRO is a novel video diffusion framework based on reinforcement learning to enhance identity preservation. Our method backpropagates the reward signal through the last steps of the sampling chain, enabling richer feedback.
- Score: 38.70220886362519
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advances in image-to-video (I2V) generation have achieved remarkable progress in synthesizing high-quality, temporally coherent videos from static images. Human-centric video generation accounts for a large portion of I2V applications. However, existing I2V models have difficulty maintaining identity consistency between the input human image and the generated video, especially when the person in the video exhibits significant expression changes and movements. The problem becomes acute when the human face occupies only a small fraction of the image. Since humans are highly sensitive to identity variations, this poses a critical yet under-explored challenge in I2V generation. In this paper, we propose Identity-Preserving Reward-guided Optimization (IPRO), a novel video diffusion framework based on reinforcement learning that enhances identity preservation. Rather than introducing auxiliary modules or altering model architectures, our approach is a direct and effective tuning algorithm that optimizes diffusion models with a face identity scorer. To improve performance and accelerate convergence, our method backpropagates the reward signal through the last steps of the sampling chain, enabling richer gradient feedback. We also propose a novel facial scoring mechanism that treats the faces in ground-truth videos as facial feature pools, providing multi-angle facial information to improve generalization. A KL-divergence regularization is further incorporated to stabilize training and prevent overfitting to the reward signal. Extensive experiments on the Wan 2.2 I2V model and our in-house I2V model demonstrate the effectiveness of our method. Our project and code are available at https://ipro-alimama.github.io/.
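The abstract names three concrete mechanisms: a face-identity reward built over a pool of ground-truth face embeddings, backpropagation through only the last steps of the sampling chain, and a KL regularizer toward the pre-trained model. Below is a minimal PyTorch-style sketch of how these pieces could fit together; it assumes a diffusers-like scheduler interface, and every identifier (model, ref_model, scheduler, face_embedder, decode, feature_pool, k_grad_steps) is a hypothetical placeholder rather than the authors' released code. The MSE between the tuned and frozen models' noise predictions stands in for the KL term, a common surrogate when both denoising policies are Gaussian with shared variance.

```python
import torch
import torch.nn.functional as F

def identity_reward(frames, feature_pool, face_embedder):
    """Score generated frames against a pool of ground-truth face embeddings.

    feature_pool: (P, D) embeddings taken from multiple ground-truth frames,
    supplying the multi-angle facial information the abstract describes.
    """
    gen = F.normalize(face_embedder(frames), dim=-1)    # (T, D) per-frame faces
    pool = F.normalize(feature_pool, dim=-1)            # (P, D) reference pool
    sims = gen @ pool.T                                 # (T, P) cosine similarities
    return sims.max(dim=1).values.mean()                # best pool match per frame

def reward_guided_step(model, ref_model, scheduler, x_T, cond,
                       feature_pool, face_embedder, decode,
                       k_grad_steps=3, kl_weight=0.01):
    """One tuning step: gradients flow only through the last k_grad_steps
    of the sampling chain; earlier denoising steps run under no_grad."""
    timesteps = scheduler.timesteps
    x = x_T
    with torch.no_grad():                               # gradient-free prefix
        for t in timesteps[:-k_grad_steps]:
            x = scheduler.step(model(x, t, cond), t, x).prev_sample
    kl_term = x.new_zeros(())
    for t in timesteps[-k_grad_steps:]:                 # differentiable suffix
        eps = model(x, t, cond)
        with torch.no_grad():
            eps_ref = ref_model(x, t, cond)             # frozen reference model
        kl_term = kl_term + F.mse_loss(eps, eps_ref)    # KL surrogate term
        x = scheduler.step(eps, t, x).prev_sample
    frames = decode(x)                                  # latents -> pixel frames
    reward = identity_reward(frames, feature_pool, face_embedder)
    return -reward + kl_weight * kl_term                # loss to minimize
```

Truncating backpropagation to the last few sampling steps bounds activation memory to k_grad_steps differentiable forward passes while still giving the face scorer a dense gradient path into the model, which matches the abstract's claim of faster convergence than reward-only feedback.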
Related papers
- Preserving Source Video Realism: High-Fidelity Face Swapping for Cinematic Quality [48.231357260785195]
We present LivingSwap, the first video-reference-guided face swapping model. By combining video conditioning with video reference guidance, the model performs temporal stitching to ensure stable identity and high-fidelity reconstruction. Our method integrates the target identity with the source video's expressions, lighting, and motion, while significantly reducing manual effort in production.
arXiv Detail & Related papers (2025-12-08T19:00:04Z) - BachVid: Training-Free Video Generation with Consistent Background and Character [62.46376250180513]
Diffusion Transformers (DiTs) have recently driven significant progress in text-to-video (T2V) generation. Existing methods typically rely on reference images or extensive training, and often only address character consistency. We introduce BachVid, the first training-free method that achieves consistent video generation without needing any reference images.
arXiv Detail & Related papers (2025-10-24T17:56:37Z) - VividFace: High-Quality and Efficient One-Step Diffusion For Video Face Enhancement [51.83206132052461]
Video Face Enhancement (VFE) seeks to reconstruct high-quality facial regions from degraded video sequences. Current methods that rely on video super-resolution and generative frameworks face three fundamental challenges. We propose VividFace, a novel and efficient one-step diffusion framework for video face enhancement.
arXiv Detail & Related papers (2025-09-28T02:39:48Z) - Identity-Preserving Text-to-Video Generation via Training-Free Prompt, Image, and Guidance Enhancement [58.85593321752693]
Identity-preserving text-to-video (IPT2V) generation creates videos faithful to both a reference subject image and a text prompt. We introduce a Training-Free Prompt, Image, and Guidance Enhancement framework that bridges the semantic gap between the video description and the reference image. We win first place in the ACM Multimedia 2025 Identity-Preserving Video Generation Challenge.
arXiv Detail & Related papers (2025-09-01T11:03:13Z) - Mask-Free Audio-driven Talking Face Generation for Enhanced Visual Quality and Identity Preservation [54.52905471078152]
We propose a mask-free talking face generation approach that retains the 2D-based face editing formulation. We transform the input images to have closed mouths using a two-step landmark-based approach trained in an unpaired manner.
arXiv Detail & Related papers (2025-07-28T16:03:36Z) - EchoVideo: Identity-Preserving Human Video Generation by Multimodal Feature Fusion [3.592206475366951]
Existing methods struggle with "copy-paste" artifacts and low-similarity issues. We propose EchoVideo, which integrates high-level semantic features from text to capture clean facial identity representations. It achieves excellent results in generating videos with high quality, controllability, and fidelity.
arXiv Detail & Related papers (2025-01-23T08:06:11Z) - VividFace: A Diffusion-Based Hybrid Framework for High-Fidelity Video Face Swapping [43.30061680192465]
We present the first diffusion-based framework specifically designed for video face swapping. Our approach incorporates a specially designed diffusion model coupled with a VidFaceVAE. Our framework achieves superior performance in identity preservation, temporal consistency, and visual quality compared to existing methods.
arXiv Detail & Related papers (2024-12-15T18:58:32Z) - Identity-Preserving Text-to-Video Generation by Frequency Decomposition [52.19475797580653]
Identity-preserving text-to-video (IPT2V) generation aims to create high-fidelity videos with consistent human identity. This paper pushes the technical frontier of IPT2V in two directions that have not been resolved in the literature. We propose ConsisID, a tuning-free DiT-based controllable IPT2V model that keeps human identity consistent in the generated video.
arXiv Detail & Related papers (2024-11-26T13:58:24Z) - PersonalVideo: High ID-Fidelity Video Customization without Dynamic and Semantic Degradation [36.21554597804604]
Identity-specific human video generation with customized ID images is still under-explored. The key challenge lies in consistently maintaining high ID fidelity while preserving the original motion dynamics and semantic following. We propose a novel framework, dubbed PersonalVideo, that applies a mixture of reward supervision on synthesized videos.
arXiv Detail & Related papers (2024-11-26T02:25:38Z) - ID-Aligner: Enhancing Identity-Preserving Text-to-Image Generation with Reward Feedback Learning [57.91881829308395]
Identity-preserving text-to-image generation (ID-T2I) has received significant attention due to its wide range of application scenarios like AI portrait and advertising.
We present ID-Aligner, a general feedback learning framework to enhance ID-T2I performance.
arXiv Detail & Related papers (2024-04-23T18:41:56Z) - VLOGGER: Multimodal Diffusion for Embodied Avatar Synthesis [40.869862603815875]
VLOGGER is a method for audio-driven human video generation from a single input image.
We use a novel diffusion-based architecture that augments text-to-image models with both spatial and temporal controls.
We show applications in video editing and personalization.
arXiv Detail & Related papers (2024-03-13T17:59:02Z)