Proteus-ID: ID-Consistent and Motion-Coherent Video Customization
- URL: http://arxiv.org/abs/2506.23729v1
- Date: Mon, 30 Jun 2025 11:05:32 GMT
- Title: Proteus-ID: ID-Consistent and Motion-Coherent Video Customization
- Authors: Guiyu Zhang, Chen Shi, Zijian Jiang, Xunzhi Xiang, Jingjing Qian, Shaoshuai Shi, Li Jiang
- Abstract summary: Video identity customization seeks to synthesize realistic, temporally coherent videos of a specific subject, given a single reference image and a text prompt. This task presents two core challenges: maintaining identity consistency while aligning with the described appearance and actions, and generating natural, fluid motion without unrealistic stiffness. We introduce Proteus-ID, a novel diffusion-based framework for identity-consistent and motion-coherent video customization.
- Score: 17.792780924370103
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video identity customization seeks to synthesize realistic, temporally coherent videos of a specific subject, given a single reference image and a text prompt. This task presents two core challenges: (1) maintaining identity consistency while aligning with the described appearance and actions, and (2) generating natural, fluid motion without unrealistic stiffness. To address these challenges, we introduce Proteus-ID, a novel diffusion-based framework for identity-consistent and motion-coherent video customization. First, we propose a Multimodal Identity Fusion (MIF) module that unifies visual and textual cues into a joint identity representation using a Q-Former, providing coherent guidance to the diffusion model and eliminating modality imbalance. Second, we present a Time-Aware Identity Injection (TAII) mechanism that dynamically modulates identity conditioning across denoising steps, improving fine-detail reconstruction. Third, we propose Adaptive Motion Learning (AML), a self-supervised strategy that reweights the training loss based on optical-flow-derived motion heatmaps, enhancing motion realism without requiring additional inputs. To support this task, we construct Proteus-Bench, a high-quality dataset comprising 200K curated clips for training and 150 individuals from diverse professions and ethnicities for evaluation. Extensive experiments demonstrate that Proteus-ID outperforms prior methods in identity preservation, text alignment, and motion quality, establishing a new benchmark for video identity customization. Codes and data are publicly available at https://grenoble-zhang.github.io/Proteus-ID/.
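The abstract describes AML and TAII only at a high level. As a rough, non-authoritative sketch of the general ideas (the function names, heatmap normalization, `alpha` weight, and injection schedule below are all assumptions, not the paper's published formulation):

```python
import torch
import torch.nn.functional as F

def motion_heatmap(flow: torch.Tensor) -> torch.Tensor:
    """Turn a dense optical-flow field (B, 2, H, W) into a per-pixel
    motion heatmap in [0, 1] by normalizing the flow magnitude."""
    mag = flow.norm(dim=1, keepdim=True)                     # (B, 1, H, W)
    return mag / (mag.amax(dim=(2, 3), keepdim=True) + 1e-6)

def motion_weighted_loss(noise_pred, noise_target, flow, alpha=1.0):
    """AML-style denoising loss: up-weight high-motion regions so the
    model is pushed toward realistic dynamics (hypothetical weighting)."""
    weight = 1.0 + alpha * motion_heatmap(flow)              # up-weight moving pixels
    per_pixel = F.mse_loss(noise_pred, noise_target, reduction="none")
    return (weight * per_pixel).mean()

def identity_scale(t: int, num_steps: int) -> float:
    """TAII-style schedule (assumed form): strengthen identity
    conditioning at late, low-noise steps where fine details emerge."""
    return 0.5 + 0.5 * (1.0 - t / num_steps)
```

In a latent-diffusion setup the heatmap would additionally need to be resized to the latent resolution before weighting the loss.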
Related papers
- Identity-Preserving Text-to-Video Generation Guided by Simple yet Effective Spatial-Temporal Decoupled Representations [66.97034863216892]
Identity-preserving text-to-video (IPT2V) generation aims to create high-fidelity videos with consistent human identity.
Current end-to-end frameworks suffer from a critical spatial-temporal trade-off.
We propose a simple yet effective spatial-temporal decoupled framework that decomposes representations into spatial features for layouts and temporal features for motion dynamics.
arXiv Detail & Related papers (2025-07-07T06:54:44Z)
- Subject-driven Video Generation via Disentangled Identity and Motion [52.54835936914813]
We propose to train a subject-driven customized video generation model by decoupling subject-specific learning from temporal dynamics, in a zero-shot manner without additional tuning.
Our method achieves strong subject consistency and scalability, outperforming existing video customization models in zero-shot settings.
arXiv Detail & Related papers (2025-04-23T06:48:31Z)
- Motion by Queries: Identity-Motion Trade-offs in Text-to-Video Generation [47.61288672890036]
We investigate how self-attention query features govern motion, structure, and identity in text-to-video models.
We demonstrate two applications: a zero-shot motion transfer method and a training-free technique for consistent multi-shot video generation.
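As a loose, generic illustration of why queries carry motion (not this paper's implementation): a cross-clip attention step can take queries recorded from a source video while keeping keys and values from the current generation.

```python
import torch

def motion_transfer_attention(q_src: torch.Tensor,
                              k_tgt: torch.Tensor,
                              v_tgt: torch.Tensor) -> torch.Tensor:
    """Self-attention where queries come from a source clip's features
    (carrying motion/structure) while keys and values come from the
    current generation (carrying appearance). Shapes: (B, heads, N, d)."""
    scale = q_src.shape[-1] ** -0.5
    attn = (q_src @ k_tgt.transpose(-2, -1)) * scale
    return attn.softmax(dim=-1) @ v_tgt
```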
arXiv Detail & Related papers (2024-12-10T18:49:39Z)
- MotionCharacter: Identity-Preserving and Motion Controllable Human Video Generation [7.474418338825595]
MotionCharacter is an efficient and high-fidelity human video generation framework.
We introduce an ID-preserving module to maintain identity fidelity while allowing flexible attribute modifications.
We also introduce ID-consistency and region-aware loss mechanisms, significantly enhancing identity consistency and detail fidelity.
arXiv Detail & Related papers (2024-11-27T12:15:52Z)
- PersonalVideo: High ID-Fidelity Video Customization without Dynamic and Semantic Degradation [36.21554597804604]
Identity-specific human video generation with customized ID images is still under-explored.
The key challenge lies in maintaining high ID fidelity while preserving the original motion dynamics and semantic following.
We propose a novel framework, dubbed PersonalVideo, that applies a mixture of reward supervision on synthesized videos.
arXiv Detail & Related papers (2024-11-26T02:25:38Z)
- ID-Aligner: Enhancing Identity-Preserving Text-to-Image Generation with Reward Feedback Learning [57.91881829308395]
Identity-preserving text-to-image generation (ID-T2I) has received significant attention due to its wide range of application scenarios, such as AI portraits and advertising.
We present ID-Aligner, a general feedback learning framework to enhance ID-T2I performance.
arXiv Detail & Related papers (2024-04-23T18:41:56Z)
- Infinite-ID: Identity-preserved Personalization via ID-semantics Decoupling Paradigm [31.06269858216316]
We propose Infinite-ID, an ID-semantics decoupling paradigm for identity-preserved personalization.
We introduce an identity-enhanced training scheme, incorporating an additional image cross-attention module to capture sufficient ID information.
We also introduce a feature interaction mechanism that combines a mixed attention module with an AdaIN-mean operation to seamlessly merge the two streams.
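AdaIN itself has a standard closed form; the "AdaIN-mean" variant is specific to Infinite-ID, so the merge below is only one plausible reading, with the averaging step and the function names assumed rather than taken from the paper.

```python
import torch

def adain(x: torch.Tensor, y: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    """Adaptive Instance Normalization: renormalize x with y's
    channel-wise statistics. x, y: (B, C, H, W)."""
    mu_x = x.mean(dim=(2, 3), keepdim=True)
    std_x = x.std(dim=(2, 3), keepdim=True) + eps
    mu_y = y.mean(dim=(2, 3), keepdim=True)
    std_y = y.std(dim=(2, 3), keepdim=True)
    return std_y * (x - mu_x) / std_x + mu_y

def adain_mean_merge(id_stream: torch.Tensor, text_stream: torch.Tensor) -> torch.Tensor:
    """One plausible reading of an 'AdaIN-mean' merge: apply AdaIN, then
    average the two streams (the exact operation is defined in the paper)."""
    return 0.5 * (adain(id_stream, text_stream) + text_stream)
```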
arXiv Detail & Related papers (2024-03-18T13:39:53Z)
- MagDiff: Multi-Alignment Diffusion for High-Fidelity Video Generation and Editing [90.06041718086317]
We propose a unified Multi-alignment Diffusion, dubbed MagDiff, for both high-fidelity video generation and editing.
The proposed MagDiff introduces three types of alignments, including subject-driven alignment, adaptive prompts alignment, and high-fidelity alignment.
arXiv Detail & Related papers (2023-11-29T03:36:07Z)
- Multi-Stage Spatio-Temporal Aggregation Transformer for Video Person Re-identification [78.08536797239893]
We propose a novel Multi-Stage Spatial-Temporal Aggregation Transformer (MSTAT) with two newly designed proxy embedding modules.
MSTAT consists of three stages that encode the attribute-associated, the identity-associated, and the attribute-identity-associated information from the video clips.
We show that MSTAT can achieve state-of-the-art accuracies on various standard benchmarks.
arXiv Detail & Related papers (2023-01-02T05:17:31Z)
- An Identity-Preserved Framework for Human Motion Transfer [3.6286856791379463]
Human motion transfer (HMT) aims to generate a video clip for the target subject by imitating the source subject's motion.
Previous methods have achieved good results on high-quality videos, but lose sight of individualized motion information in the source and target motions.
We propose a novel identity-preserved HMT network, termed IDPres.
arXiv Detail & Related papers (2022-04-14T10:27:19Z)
- Identity-Aware Multi-Sentence Video Description [105.13845996039277]
We introduce an auxiliary task, Fill-in the Identity, which aims to predict persons' IDs consistently within a set of clips.
Key components include a gender-aware textual representation and an additional gender prediction objective in the main model.
Experiments show that our proposed Fill-in the Identity model is superior to several baselines and recent works.
arXiv Detail & Related papers (2020-08-22T09:50:43Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it contains and is not responsible for any consequences of its use.