Stand-In: A Lightweight and Plug-and-Play Identity Control for Video Generation
- URL: http://arxiv.org/abs/2508.07901v2
- Date: Tue, 12 Aug 2025 02:50:34 GMT
- Title: Stand-In: A Lightweight and Plug-and-Play Identity Control for Video Generation
- Authors: Bowen Xue, Qixin Yan, Wenjing Wang, Hao Liu, Chen Li
- Abstract summary: We propose Stand-In, a lightweight and plug-and-play framework for identity preservation in video generation. Our framework achieves excellent results in video quality and identity preservation, outperforming other full-parameter training methods. Our framework can be seamlessly integrated for other tasks, such as subject-driven video generation, pose-referenced video generation, stylization, and face swapping.
- Score: 12.243958169714166
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Generating high-fidelity human videos that match user-specified identities is important yet challenging in the field of generative AI. Existing methods often rely on an excessive number of training parameters and lack compatibility with other AIGC tools. In this paper, we propose Stand-In, a lightweight and plug-and-play framework for identity preservation in video generation. Specifically, we introduce a conditional image branch into the pre-trained video generation model. Identity control is achieved through restricted self-attentions with conditional position mapping, and can be learned quickly with only 2000 pairs. Despite incorporating and training just $\sim$1% additional parameters, our framework achieves excellent results in video quality and identity preservation, outperforming other full-parameter training methods. Moreover, our framework can be seamlessly integrated for other tasks, such as subject-driven video generation, pose-referenced video generation, stylization, and face swapping.
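The abstract names the mechanism (a conditional image branch, restricted self-attention, conditional position mapping) without detailing it. The sketch below is one plausible reading under stated assumptions, not the paper's implementation: the reference-image tokens act only as extra keys/values for a frozen pretrained attention layer, and only the small condition-side projections are trained, which is what would keep the added parameters near the reported $\sim$1%.

```python
# Minimal sketch (PyTorch), assuming: reference-image tokens are appended
# as extra keys/values inside the pretrained self-attention, only the new
# image-branch pieces are trained, and "conditional position mapping" is
# approximated by a learned position offset. None of these names come
# from the Stand-In codebase.
import torch
import torch.nn as nn

class ConditionalImageBranchAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        # Pretrained video self-attention: kept frozen.
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        for p in self.attn.parameters():
            p.requires_grad = False
        # Lightweight trainable pieces for the identity branch (hypothetical).
        self.cond_kv = nn.Linear(dim, dim)
        self.cond_pos = nn.Parameter(torch.zeros(1, 1, dim))

    def forward(self, video_tokens: torch.Tensor,
                cond_tokens: torch.Tensor) -> torch.Tensor:
        # video_tokens: (B, N_video, D); cond_tokens: (B, N_cond, D)
        # Map condition tokens into the video token space and shift their
        # positions so they do not collide with real frame positions.
        cond = self.cond_kv(cond_tokens) + self.cond_pos
        # "Restricted" self-attention in this sketch means: video queries
        # may look at the identity tokens, but the identity tokens are
        # never queried or updated by the video stream.
        kv = torch.cat([video_tokens, cond], dim=1)
        out, _ = self.attn(video_tokens, kv, kv)
        return out
```

Freezing the base attention and training only the condition path is the design choice that a "$\sim$1% additional parameters" budget suggests, though the paper's actual split may differ.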
Related papers
- AlcheMinT: Fine-grained Temporal Control for Multi-Reference Consistent Video Generation [58.844504598618094]
We propose AlcheMinT, a unified framework that introduces explicit timestamp conditioning for subject-driven video generation. Our approach introduces a novel positional encoding mechanism that unlocks the encoding of temporal intervals, associated in our case with subject identities. We incorporate subject-descriptive text tokens to strengthen the binding between visual identity and video captions, mitigating ambiguity during generation.
arXiv Detail & Related papers (2025-12-11T18:59:34Z)
- UnityVideo: Unified Multi-Modal Multi-Task Learning for Enhancing World-Aware Video Generation [61.98887854225878]
We introduce UnityVideo, a unified framework for world-aware video generation. Our approach features two core components: (1) dynamic noising to unify heterogeneous training paradigms, and (2) a modality switcher with an in-context learner. We demonstrate that UnityVideo achieves superior video quality, consistency, and improved alignment with physical world constraints.
arXiv Detail & Related papers (2025-12-08T18:59:01Z)
- ID-Composer: Multi-Subject Video Synthesis with Hierarchical Identity Preservation [48.59900036213667]
Video generative models pretrained on large-scale datasets can produce high-quality videos, but are often conditioned on text or a single image. We introduce ID-Composer, a novel framework that tackles multi-subject video generation from a text prompt and reference images.
arXiv Detail & Related papers (2025-11-01T11:29:14Z)
- BachVid: Training-Free Video Generation with Consistent Background and Character [62.46376250180513]
Diffusion Transformers (DiTs) have recently driven significant progress in text-to-video (T2V) generation. Existing methods typically rely on reference images or extensive training, and often only address character consistency. We introduce BachVid, the first training-free method that achieves consistent video generation without needing any reference images.
arXiv Detail & Related papers (2025-10-24T17:56:37Z)
- Identity-Preserving Text-to-Video Generation via Training-Free Prompt, Image, and Guidance Enhancement [58.85593321752693]
Identity-preserving text-to-video (IPT2V) generation creates videos faithful to both a reference subject image and a text prompt. We introduce a Training-Free Prompt, Image, and Guidance Enhancement framework that bridges the semantic gap between the video description and the reference image. We won first place in the ACM Multimedia 2025 Identity-Preserving Video Generation Challenge.
arXiv Detail & Related papers (2025-09-01T11:03:13Z)
- Controllable Video Generation: A Survey [72.38313362192784]
We provide a systematic review of controllable video generation, covering both theoretical foundations and recent advances in the field. We begin by introducing the key concepts and commonly used open-source video generation models. We then focus on control mechanisms in video diffusion models, analyzing how different types of conditions can be incorporated into the denoising process to guide generation.
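For readers unfamiliar with that phrase, the most common way a condition enters the denoising process is classifier-free guidance; the sketch below is generic background, not code from the survey, and `eps_theta` is a placeholder for any conditional noise predictor.

```python
# Generic illustration of classifier-free guidance: the usual way a
# condition (text, image, pose, ...) steers each denoising step.
# `eps_theta` is a placeholder noise-prediction network, not a real API.
import torch

def guided_noise_prediction(eps_theta, x_t: torch.Tensor, t: torch.Tensor,
                            cond, guidance_scale: float = 5.0) -> torch.Tensor:
    eps_cond = eps_theta(x_t, t, cond)     # conditional prediction
    eps_uncond = eps_theta(x_t, t, None)   # unconditional (null condition)
    # Extrapolate away from the unconditional prediction toward the
    # conditional one; larger scales enforce the condition more strongly.
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```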
arXiv Detail & Related papers (2025-07-22T06:05:34Z)
- Proteus-ID: ID-Consistent and Motion-Coherent Video Customization [17.792780924370103]
Video identity customization seeks to synthesize realistic, temporally coherent videos of a specific subject, given a single reference image and a text prompt. This task presents two core challenges: maintaining identity consistency while aligning with the described appearance and actions, and generating natural, fluid motion without unrealistic stiffness. We introduce Proteus-ID, a novel diffusion-based framework for identity-consistent and motion-coherent video customization.
Concat-ID is a unified framework for identity-preserving video synthesis. It relies exclusively on inherent 3D self-attention mechanisms to incorporate reference-image features. Concat-ID establishes a new benchmark for identity-preserving video synthesis.
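By way of contrast with Stand-In's dedicated image branch, the concatenation approach this summary names can be shown in a few lines; the flattened token shapes and the function itself are illustrative assumptions, not Concat-ID's actual interface.

```python
# Hedged sketch: reference-image tokens are concatenated with the video
# latent tokens along the sequence axis, so the backbone's existing 3D
# self-attention relates identity and video with no added modules.
import torch

def concat_identity_tokens(video_tokens: torch.Tensor,
                           ref_tokens: torch.Tensor) -> torch.Tensor:
    # video_tokens: (B, T*H*W, D) flattened video latents
    # ref_tokens:   (B, H*W, D)   tokens from the encoded reference image
    # The joint sequence is fed to the unmodified self-attention stack,
    # which treats the reference tokens like extra frames of the video.
    return torch.cat([ref_tokens, video_tokens], dim=1)
```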
arXiv Detail & Related papers (2025-03-18T11:17:32Z)
- VideoGen-of-Thought: Step-by-step generating multi-shot video with minimal manual intervention [70.61101071902596]
Current video generation models excel at short clips but fail to produce cohesive multi-shot narratives due to disjointed visual dynamics and fractured storylines. We introduce VideoGen-of-Thought (VGoT), a step-by-step framework that automates multi-shot video synthesis from a single sentence. VGoT generates multi-shot videos that outperform state-of-the-art baselines by 20.4% in within-shot face consistency and 17.4% in style consistency.
arXiv Detail & Related papers (2024-12-03T08:33:50Z)
- PersonalVideo: High ID-Fidelity Video Customization without Dynamic and Semantic Degradation [36.21554597804604]
Identity-specific human video generation with customized ID images is still under-explored. The key challenge lies in maintaining high ID fidelity consistently while preserving the original motion dynamics and semantic following. We propose a novel framework, dubbed $\textbf{PersonalVideo}$, that applies a mixture of reward supervision on synthesized videos.
arXiv Detail & Related papers (2024-11-26T02:25:38Z)
- ID-Animator: Zero-Shot Identity-Preserving Human Video Generation [16.438935466843304]
ID-Animator is a zero-shot human-video generation approach that can perform personalized video generation given a single reference facial image without further training.
Our method is highly compatible with popular pre-trained T2V models like AnimateDiff and various community backbone models.
arXiv Detail & Related papers (2024-04-23T17:59:43Z)
- MagDiff: Multi-Alignment Diffusion for High-Fidelity Video Generation and Editing [90.06041718086317]
We propose a unified Multi-alignment Diffusion, dubbed MagDiff, for both high-fidelity video generation and editing.
The proposed MagDiff introduces three types of alignments, including subject-driven alignment, adaptive prompts alignment, and high-fidelity alignment.
arXiv Detail & Related papers (2023-11-29T03:36:07Z)
- Multi-object Video Generation from Single Frame Layouts [84.55806837855846]
We propose a video generative framework capable of synthesizing global scenes with local objects.
Our framework is a non-trivial adaptation of image generation methods and is new to this field.
Our model has been evaluated on two widely-used video recognition benchmarks.
arXiv Detail & Related papers (2023-05-06T09:07:01Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.