Identity-GRPO: Optimizing Multi-Human Identity-preserving Video Generation via Reinforcement Learning
- URL: http://arxiv.org/abs/2510.14256v2
- Date: Fri, 17 Oct 2025 06:33:55 GMT
- Title: Identity-GRPO: Optimizing Multi-Human Identity-preserving Video Generation via Reinforcement Learning
- Authors: Xiangyu Meng, Zixian Zhang, Zhenghao Zhang, Junchao Liao, Long Qin, Weizhi Wang
- Abstract summary: Identity-GRPO is a human feedback-driven optimization pipeline for multi-human identity-preserving video generation. We employ a GRPO variant tailored for multi-human consistency, which greatly enhances VACE and Phantom. Experiments show that Identity-GRPO achieves up to 18.9% improvement in human consistency metrics over baseline methods.
- Score: 13.0209477024596
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: While advanced methods like VACE and Phantom have advanced video generation for specific subjects in diverse scenarios, they struggle with multi-human identity preservation in dynamic interactions, where consistent identities across multiple characters are critical. To address this, we propose Identity-GRPO, a human feedback-driven optimization pipeline for refining multi-human identity-preserving video generation. First, we construct a video reward model trained on a large-scale preference dataset containing human-annotated and synthetic distortion data, with pairwise annotations focused on maintaining human consistency throughout the video. We then employ a GRPO variant tailored for multi-human consistency, which greatly enhances both VACE and Phantom. Through extensive ablation studies, we evaluate the impact of annotation quality and design choices on policy optimization. Experiments show that Identity-GRPO achieves up to 18.9% improvement in human consistency metrics over baseline methods, offering actionable insights for aligning reinforcement learning with personalized video generation.
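The core of the GRPO variant described in the abstract can be sketched in its simplest form: sample a group of videos per prompt, score each with the consistency reward model, and weight sample log-likelihoods by group-relative advantages (no learned value baseline). This is an illustrative sketch only, not the paper's implementation; the reward values, function names, and the unclipped objective are assumptions for demonstration.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one sampled group (the core of GRPO):
    each sample's advantage is its reward minus the group mean,
    scaled by the group std, so no critic/value network is needed."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

def grpo_loss(log_probs, rewards):
    """Toy policy-gradient objective: minimize the negative
    advantage-weighted log-likelihood of the sampled generations
    (a sketch; practical GRPO adds clipping and a KL penalty)."""
    adv = group_relative_advantages(rewards)
    return -float(np.mean(adv * np.asarray(log_probs, dtype=np.float64)))

# Example: a group of 4 sampled videos scored by a (hypothetical)
# identity-consistency reward model, with toy policy log-probs.
consistency_rewards = [0.62, 0.71, 0.55, 0.80]
log_probs = [-1.2, -0.9, -1.5, -0.8]
loss = grpo_loss(log_probs, consistency_rewards)
```

Videos above the group mean get positive advantages and their likelihood is pushed up; below-mean samples are pushed down, which is how a pairwise consistency reward becomes a training signal.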
Related papers
- PSR: Scaling Multi-Subject Personalized Image Generation with Pairwise Subject-Consistency Rewards [86.1965460124838]
We propose a scalable multi-subject data generation pipeline. We first enable single-subject personalization models to acquire knowledge of multi-image and multi-subject scenarios. To enhance both subject consistency and text controllability, we design a set of Pairwise Subject-Consistency Rewards.
arXiv Detail & Related papers (2025-12-01T03:25:49Z)
- Identity-Preserving Image-to-Video Generation via Reward-Guided Optimization [39.46059491176915]
We propose Identity-Preserving Reward-guided Optimization (IPRO) for image-to-video (I2V) generation. IPRO is based on reinforcement learning to enhance identity preservation. Our method backpropagates the reward signal through the last steps of the sampling chain, enabling richer feedback.
arXiv Detail & Related papers (2025-10-16T03:13:47Z)
- DisCo: Reinforcement with Diversity Constraints for Multi-Human Generation [60.741022906593685]
DisCo is the first RL-based framework to directly optimize identity diversity in multi-human generation. DisCo fine-tunes flow-matching models via Group-Relative Policy Optimization. On the DiverseHumans Testset, DisCo achieves 98.6 Unique Face Accuracy and near-perfect Global Identity Spread.
arXiv Detail & Related papers (2025-10-01T19:28:51Z)
- Identity-Preserving Text-to-Video Generation via Training-Free Prompt, Image, and Guidance Enhancement [58.85593321752693]
Identity-preserving text-to-video (IPT2V) generation creates videos faithful to both a reference subject image and a text prompt. We introduce a Training-Free Prompt, Image, and Guidance Enhancement framework that bridges the semantic gap between the video description and the reference image. We win first place in the ACM Multimedia 2025 Identity-Preserving Video Generation Challenge.
arXiv Detail & Related papers (2025-09-01T11:03:13Z)
- From Large Angles to Consistent Faces: Identity-Preserving Video Generation via Mixture of Facial Experts [69.44297222099175]
We introduce a Mixture of Facial Experts (MoFE) that captures distinct but mutually reinforcing aspects of facial attributes. To mitigate dataset limitations, we have tailored a data processing pipeline centered on two key aspects: Face Constraints and Identity Consistency. We have curated and refined a Large Face Angles (LFA) dataset from existing open-source human video datasets.
arXiv Detail & Related papers (2025-08-13T04:10:16Z)
- Multi-identity Human Image Animation with Structural Video Diffusion [64.20452431561436]
We present Structural Video Diffusion, a novel framework for generating realistic multi-human videos. Our approach introduces identity-specific embeddings to maintain consistent appearances across individuals. We expand an existing human video dataset with 25K new videos featuring diverse multi-human and object interaction scenarios.
arXiv Detail & Related papers (2025-04-05T10:03:49Z)
- MagicID: Hybrid Preference Optimization for ID-Consistent and Dynamic-Preserved Video Customization [24.398759596367103]
Video identity customization seeks to produce high-fidelity videos that maintain consistent identity and exhibit significant dynamics based on users' reference images. We introduce MagicID, a novel framework designed to promote the generation of identity-consistent and dynamically rich videos tailored to user preferences. Experiments show that MagicID successfully achieves consistent identity and natural dynamics, surpassing existing methods across various metrics.
arXiv Detail & Related papers (2025-03-16T23:15:09Z)
- VisionReward: Fine-Grained Multi-Dimensional Human Preference Learning for Image and Video Generation [70.68566282567207]
We present VisionReward, a framework for learning human visual preferences in both image and video generation. VisionReward can significantly outperform existing image and video reward models on both machine metrics and human evaluation.
arXiv Detail & Related papers (2024-12-30T16:24:09Z)
- OpenHumanVid: A Large-Scale High-Quality Dataset for Enhancing Human-Centric Video Generation [27.516068877910254]
We introduce OpenHumanVid, a large-scale and high-quality human-centric video dataset. Our findings yield two critical insights: First, the incorporation of a large-scale, high-quality dataset substantially enhances evaluation metrics for generated human videos. Second, the effective alignment of text with human appearance, human motion, and facial motion is essential for producing high-quality video outputs.
arXiv Detail & Related papers (2024-11-28T07:01:06Z)
- ID-Animator: Zero-Shot Identity-Preserving Human Video Generation [16.438935466843304]
ID-Animator is a zero-shot human-video generation approach that can perform personalized video generation given a single reference facial image without further training.
Our method is highly compatible with popular pre-trained T2V models like AnimateDiff and various community backbone models.
arXiv Detail & Related papers (2024-04-23T17:59:43Z)
- Multimodal Large Language Model is a Human-Aligned Annotator for Text-to-Image Generation [87.50120181861362]
VisionPrefer is a high-quality and fine-grained preference dataset that captures multiple preference aspects.
We train a reward model, VP-Score, over VisionPrefer to guide the training of text-to-image generative models; the preference prediction accuracy of VP-Score is comparable to that of human annotators.
arXiv Detail & Related papers (2024-04-23T14:53:15Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it presents and is not responsible for any consequences of its use.