Identity-GRPO: Optimizing Multi-Human Identity-preserving Video Generation via Reinforcement Learning
- URL: http://arxiv.org/abs/2510.14256v2
- Date: Fri, 17 Oct 2025 06:33:55 GMT
- Title: Identity-GRPO: Optimizing Multi-Human Identity-preserving Video Generation via Reinforcement Learning
- Authors: Xiangyu Meng, Zixian Zhang, Zhenghao Zhang, Junchao Liao, Long Qin, Weizhi Wang
- Abstract summary: Identity-GRPO is a human feedback-driven optimization pipeline for multi-human identity-preserving video generation. We employ a GRPO variant tailored for multi-human consistency, which greatly enhances VACE and Phantom. Experiments show that Identity-GRPO achieves up to 18.9% improvement in human consistency metrics over baseline methods.
- Score: 13.0209477024596
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: While advanced methods like VACE and Phantom have advanced video generation for specific subjects in diverse scenarios, they struggle with multi-human identity preservation in dynamic interactions, where consistent identities across multiple characters are critical. To address this, we propose Identity-GRPO, a human feedback-driven optimization pipeline for refining multi-human identity-preserving video generation. First, we construct a video reward model trained on a large-scale preference dataset containing human-annotated and synthetic distortion data, with pairwise annotations focused on maintaining human consistency throughout the video. We then employ a GRPO variant tailored for multi-human consistency, which greatly enhances both VACE and Phantom. Through extensive ablation studies, we evaluate the impact of annotation quality and design choices on policy optimization. Experiments show that Identity-GRPO achieves up to 18.9% improvement in human consistency metrics over baseline methods, offering actionable insights for aligning reinforcement learning with personalized video generation.
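The core of the GRPO variant described in the abstract can be sketched in its simplest form: sample a group of videos per prompt, score each with the consistency reward model, and weight sample log-likelihoods by group-relative advantages (no learned value baseline). This is an illustrative sketch only, not the paper's implementation; the reward values, function names, and the unclipped objective are assumptions for demonstration.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one sampled group (the core of GRPO):
    each sample's advantage is its reward minus the group mean,
    scaled by the group std, so no critic/value network is needed."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

def grpo_loss(log_probs, rewards):
    """Toy policy-gradient objective: minimize the negative
    advantage-weighted log-likelihood of the sampled generations
    (a sketch; practical GRPO adds clipping and a KL penalty)."""
    adv = group_relative_advantages(rewards)
    return -float(np.mean(adv * np.asarray(log_probs, dtype=np.float64)))

# Example: a group of 4 sampled videos scored by a (hypothetical)
# identity-consistency reward model, with toy policy log-probs.
consistency_rewards = [0.62, 0.71, 0.55, 0.80]
log_probs = [-1.2, -0.9, -1.5, -0.8]
loss = grpo_loss(log_probs, consistency_rewards)
```

Videos above the group mean get positive advantages and their likelihood is pushed up; below-mean samples are pushed down, which is how a pairwise consistency reward becomes a training signal.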
Related papers
- PSR: Scaling Multi-Subject Personalized Image Generation with Pairwise Subject-Consistency Rewards [86.1965460124838]
We propose a scalable multi-subject data generation pipeline. We first enable single-subject personalization models to acquire knowledge of multi-image and multi-subject scenarios. To enhance both subject consistency and text controllability, we design a set of Pairwise Subject-Consistency Rewards.
arXiv Detail & Related papers (2025-12-01T03:25:49Z)
- Identity-Preserving Image-to-Video Generation via Reward-Guided Optimization [39.46059491176915]
We propose Identity-Preserving Reward-guided Optimization (IPRO) for image-to-video (I2V) generation. IPRO is based on reinforcement learning to enhance identity preservation. Our method backpropagates the reward signal through the last steps of the sampling chain, enabling richer feedback.
arXiv Detail & Related papers (2025-10-16T03:13:47Z)
- DisCo: Reinforcement with Diversity Constraints for Multi-Human Generation [60.741022906593685]
DisCo is the first RL-based framework to directly optimize identity diversity in multi-human generation. DisCo fine-tunes flow-matching models via Group-Relative Policy Optimization. On the DiverseHumans Testset, DisCo achieves 98.6 Unique Face Accuracy and near-perfect Global Identity Spread.
arXiv Detail & Related papers (2025-10-01T19:28:51Z)
- Identity-Preserving Text-to-Video Generation via Training-Free Prompt, Image, and Guidance Enhancement [58.85593321752693]
Identity-preserving text-to-video (IPT2V) generation creates videos faithful to both a reference subject image and a text prompt. We introduce a Training-Free Prompt, Image, and Guidance Enhancement framework that bridges the semantic gap between the video description and the reference image. We win first place in the ACM Multimedia 2025 Identity-Preserving Video Generation Challenge.
arXiv Detail & Related papers (2025-09-01T11:03:13Z)
- From Large Angles to Consistent Faces: Identity-Preserving Video Generation via Mixture of Facial Experts [69.44297222099175]
We introduce a Mixture of Facial Experts (MoFE) that captures distinct but mutually reinforcing aspects of facial attributes. To mitigate dataset limitations, we have tailored a data processing pipeline centered on two key aspects: Face Constraints and Identity Consistency. We have curated and refined a Large Face Angles (LFA) dataset from existing open-source human video datasets.
arXiv Detail & Related papers (2025-08-13T04:10:16Z)
- Multi-identity Human Image Animation with Structural Video Diffusion [64.20452431561436]
We present Structural Video Diffusion, a novel framework for generating realistic multi-human videos. Our approach introduces identity-specific embeddings to maintain consistent appearances across individuals. We expand an existing human video dataset with 25K new videos featuring diverse multi-human and object interaction scenarios.
arXiv Detail & Related papers (2025-04-05T10:03:49Z)
- MagicID: Hybrid Preference Optimization for ID-Consistent and Dynamic-Preserved Video Customization [24.398759596367103]
Video identity customization seeks to produce high-fidelity videos that maintain consistent identity and exhibit significant dynamics based on users' reference images. We introduce MagicID, a novel framework designed to promote the generation of identity-consistent and dynamically rich videos tailored to user preferences. Experiments show that MagicID successfully achieves consistent identity and natural dynamics, surpassing existing methods across various metrics.
arXiv Detail & Related papers (2025-03-16T23:15:09Z)
- VisionReward: Fine-Grained Multi-Dimensional Human Preference Learning for Image and Video Generation [70.68566282567207]
We present VisionReward, a framework for learning human visual preferences in both image and video generation. VisionReward can significantly outperform existing image and video reward models on both machine metrics and human evaluation.
arXiv Detail & Related papers (2024-12-30T16:24:09Z)
- OpenHumanVid: A Large-Scale High-Quality Dataset for Enhancing Human-Centric Video Generation [27.516068877910254]
We introduce OpenHumanVid, a large-scale and high-quality human-centric video dataset. Our findings yield two critical insights: First, the incorporation of a large-scale, high-quality dataset substantially enhances evaluation metrics for generated human videos. Second, the effective alignment of text with human appearance, human motion, and facial motion is essential for producing high-quality video outputs.
arXiv Detail & Related papers (2024-11-28T07:01:06Z)
- ID-Animator: Zero-Shot Identity-Preserving Human Video Generation [16.438935466843304]
ID-Animator is a zero-shot human-video generation approach that can perform personalized video generation given a single reference facial image without further training.
Our method is highly compatible with popular pre-trained T2V models like AnimateDiff and various community backbone models.
arXiv Detail & Related papers (2024-04-23T17:59:43Z)
- Multimodal Large Language Model is a Human-Aligned Annotator for Text-to-Image Generation [87.50120181861362]
VisionPrefer is a high-quality and fine-grained preference dataset that captures multiple preference aspects.
We train a reward model, VP-Score, over VisionPrefer to guide the training of text-to-image generative models; the preference prediction accuracy of VP-Score is comparable to that of human annotators.
arXiv Detail & Related papers (2024-04-23T14:53:15Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it presents and is not responsible for any consequences of its use.