Proteus-ID: ID-Consistent and Motion-Coherent Video Customization
- URL: http://arxiv.org/abs/2506.23729v1
- Date: Mon, 30 Jun 2025 11:05:32 GMT
- Title: Proteus-ID: ID-Consistent and Motion-Coherent Video Customization
- Authors: Guiyu Zhang, Chen Shi, Zijian Jiang, Xunzhi Xiang, Jingjing Qian, Shaoshuai Shi, Li Jiang
- Abstract summary: Video identity customization seeks to synthesize realistic, temporally coherent videos of a specific subject, given a single reference image and a text prompt. This task presents two core challenges: maintaining identity consistency while aligning with the described appearance and actions, and generating natural, fluid motion without unrealistic stiffness. We introduce Proteus-ID, a novel diffusion-based framework for identity-consistent and motion-coherent video customization.
- Score: 17.792780924370103
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video identity customization seeks to synthesize realistic, temporally coherent videos of a specific subject, given a single reference image and a text prompt. This task presents two core challenges: (1) maintaining identity consistency while aligning with the described appearance and actions, and (2) generating natural, fluid motion without unrealistic stiffness. To address these challenges, we introduce Proteus-ID, a novel diffusion-based framework for identity-consistent and motion-coherent video customization. First, we propose a Multimodal Identity Fusion (MIF) module that unifies visual and textual cues into a joint identity representation using a Q-Former, providing coherent guidance to the diffusion model and eliminating modality imbalance. Second, we present a Time-Aware Identity Injection (TAII) mechanism that dynamically modulates identity conditioning across denoising steps, improving fine-detail reconstruction. Third, we propose Adaptive Motion Learning (AML), a self-supervised strategy that reweights the training loss based on optical-flow-derived motion heatmaps, enhancing motion realism without requiring additional inputs. To support this task, we construct Proteus-Bench, a high-quality dataset comprising 200K curated clips for training and 150 individuals from diverse professions and ethnicities for evaluation. Extensive experiments demonstrate that Proteus-ID outperforms prior methods in identity preservation, text alignment, and motion quality, establishing a new benchmark for video identity customization. Codes and data are publicly available at https://grenoble-zhang.github.io/Proteus-ID/.
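The abstract describes AML and TAII only at a high level. As a rough, non-authoritative sketch of the general ideas (the function names, heatmap normalization, `alpha` weight, and injection schedule below are all assumptions, not the paper's published formulation):

```python
import torch
import torch.nn.functional as F

def motion_heatmap(flow: torch.Tensor) -> torch.Tensor:
    """Turn a dense optical-flow field (B, 2, H, W) into a per-pixel
    motion heatmap in [0, 1] by normalizing the flow magnitude."""
    mag = flow.norm(dim=1, keepdim=True)                     # (B, 1, H, W)
    return mag / (mag.amax(dim=(2, 3), keepdim=True) + 1e-6)

def motion_weighted_loss(noise_pred, noise_target, flow, alpha=1.0):
    """AML-style denoising loss: up-weight high-motion regions so the
    model is pushed toward realistic dynamics (hypothetical weighting)."""
    weight = 1.0 + alpha * motion_heatmap(flow)              # up-weight moving pixels
    per_pixel = F.mse_loss(noise_pred, noise_target, reduction="none")
    return (weight * per_pixel).mean()

def identity_scale(t: int, num_steps: int) -> float:
    """TAII-style schedule (assumed form): strengthen identity
    conditioning at late, low-noise steps where fine details emerge."""
    return 0.5 + 0.5 * (1.0 - t / num_steps)
```

In a latent-diffusion setup the heatmap would additionally need to be resized to the latent resolution before weighting the loss.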
Related papers
- Identity-Preserving Text-to-Video Generation Guided by Simple yet Effective Spatial-Temporal Decoupled Representations [66.97034863216892]
Identity-preserving text-to-video (IPT2V) generation aims to create high-fidelity videos with consistent human identity.
Current end-to-end frameworks suffer from a critical spatial-temporal trade-off.
We propose a simple yet effective spatial-temporal decoupled framework that decomposes representations into spatial features for layouts and temporal features for motion dynamics.
arXiv Detail & Related papers (2025-07-07T06:54:44Z)
- Subject-driven Video Generation via Disentangled Identity and Motion [52.54835936914813]
We propose to train a subject-driven customized video generation model by decoupling subject-specific learning from temporal dynamics, in a zero-shot manner without additional tuning.
Our method achieves strong subject consistency and scalability, outperforming existing video customization models in zero-shot settings.
arXiv Detail & Related papers (2025-04-23T06:48:31Z)
- Motion by Queries: Identity-Motion Trade-offs in Text-to-Video Generation [47.61288672890036]
We investigate how self-attention query features govern motion, structure, and identity in text-to-video models.
We demonstrate two applications: a zero-shot motion transfer method and a training-free technique for consistent multi-shot video generation.
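As a loose, generic illustration of why queries carry motion (not this paper's implementation): a cross-clip attention step can take queries recorded from a source video while keeping keys and values from the current generation.

```python
import torch

def motion_transfer_attention(q_src: torch.Tensor,
                              k_tgt: torch.Tensor,
                              v_tgt: torch.Tensor) -> torch.Tensor:
    """Self-attention where queries come from a source clip's features
    (carrying motion/structure) while keys and values come from the
    current generation (carrying appearance). Shapes: (B, heads, N, d)."""
    scale = q_src.shape[-1] ** -0.5
    attn = (q_src @ k_tgt.transpose(-2, -1)) * scale
    return attn.softmax(dim=-1) @ v_tgt
```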
arXiv Detail & Related papers (2024-12-10T18:49:39Z)
- MotionCharacter: Identity-Preserving and Motion Controllable Human Video Generation [7.474418338825595]
MotionCharacter is an efficient and high-fidelity human video generation framework.
We introduce an ID-preserving module to maintain identity fidelity while allowing flexible attribute modifications.
We also introduce ID-consistency and region-aware loss mechanisms, significantly enhancing identity consistency and detail fidelity.
arXiv Detail & Related papers (2024-11-27T12:15:52Z)
- PersonalVideo: High ID-Fidelity Video Customization without Dynamic and Semantic Degradation [36.21554597804604]
Identity-specific human video generation with customized ID images is still under-explored.
The key challenge lies in maintaining high ID fidelity while preserving the original motion dynamics and semantic following.
We propose a novel framework, dubbed PersonalVideo, that applies a mixture of reward supervision on synthesized videos.
arXiv Detail & Related papers (2024-11-26T02:25:38Z)
- ID-Aligner: Enhancing Identity-Preserving Text-to-Image Generation with Reward Feedback Learning [57.91881829308395]
Identity-preserving text-to-image generation (ID-T2I) has received significant attention due to its wide range of application scenarios, such as AI portraits and advertising.
We present ID-Aligner, a general feedback learning framework to enhance ID-T2I performance.
arXiv Detail & Related papers (2024-04-23T18:41:56Z)
- Infinite-ID: Identity-preserved Personalization via ID-semantics Decoupling Paradigm [31.06269858216316]
We propose Infinite-ID, an ID-semantics decoupling paradigm for identity-preserved personalization.
We introduce an identity-enhanced training scheme, incorporating an additional image cross-attention module to capture sufficient ID information.
We also introduce a feature interaction mechanism that combines a mixed attention module with an AdaIN-mean operation to seamlessly merge the two streams.
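AdaIN itself has a standard closed form; the "AdaIN-mean" variant is specific to Infinite-ID, so the merge below is only one plausible reading, with the averaging step and the function names assumed rather than taken from the paper.

```python
import torch

def adain(x: torch.Tensor, y: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    """Adaptive Instance Normalization: renormalize x with y's
    channel-wise statistics. x, y: (B, C, H, W)."""
    mu_x = x.mean(dim=(2, 3), keepdim=True)
    std_x = x.std(dim=(2, 3), keepdim=True) + eps
    mu_y = y.mean(dim=(2, 3), keepdim=True)
    std_y = y.std(dim=(2, 3), keepdim=True)
    return std_y * (x - mu_x) / std_x + mu_y

def adain_mean_merge(id_stream: torch.Tensor, text_stream: torch.Tensor) -> torch.Tensor:
    """One plausible reading of an 'AdaIN-mean' merge: apply AdaIN, then
    average the two streams (the exact operation is defined in the paper)."""
    return 0.5 * (adain(id_stream, text_stream) + text_stream)
```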
arXiv Detail & Related papers (2024-03-18T13:39:53Z)
- MagDiff: Multi-Alignment Diffusion for High-Fidelity Video Generation and Editing [90.06041718086317]
We propose a unified Multi-alignment Diffusion, dubbed MagDiff, for both high-fidelity video generation and editing.
The proposed MagDiff introduces three types of alignments, including subject-driven alignment, adaptive prompts alignment, and high-fidelity alignment.
arXiv Detail & Related papers (2023-11-29T03:36:07Z)
- Multi-Stage Spatio-Temporal Aggregation Transformer for Video Person Re-identification [78.08536797239893]
We propose a novel Multi-Stage Spatial-Temporal Aggregation Transformer (MSTAT) with two newly designed proxy embedding modules.
MSTAT consists of three stages that encode the attribute-associated, the identity-associated, and the attribute-identity-associated information from the video clips.
We show that MSTAT can achieve state-of-the-art accuracies on various standard benchmarks.
arXiv Detail & Related papers (2023-01-02T05:17:31Z)
- An Identity-Preserved Framework for Human Motion Transfer [3.6286856791379463]
Human motion transfer (HMT) aims to generate a video clip for the target subject by imitating the source subject's motion.
Previous methods have achieved good results on high-quality videos, but lose sight of individualized motion information in the source and target motions.
We propose a novel identity-preserved HMT network, termed IDPres.
arXiv Detail & Related papers (2022-04-14T10:27:19Z)
- Identity-Aware Multi-Sentence Video Description [105.13845996039277]
We introduce an auxiliary task, Fill-in the Identity, which aims to predict persons' IDs consistently within a set of clips.
Key components include a gender-aware textual representation and an additional gender prediction objective in the main model.
Experiments show that our proposed Fill-in the Identity model is superior to several baselines and recent works.
arXiv Detail & Related papers (2020-08-22T09:50:43Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it contains and is not responsible for any consequences of its use.