Concat-ID: Towards Universal Identity-Preserving Video Synthesis
- URL: http://arxiv.org/abs/2503.14151v2
- Date: Sat, 19 Apr 2025 09:26:43 GMT
- Title: Concat-ID: Towards Universal Identity-Preserving Video Synthesis
- Authors: Yong Zhong, Zhuoyi Yang, Jiayan Teng, Xiaotao Gu, Chongxuan Li
- Abstract summary: We present Concat-ID, a unified framework for identity-preserving video synthesis. Concat-ID employs Variational Autoencoders to extract image features, which are concatenated with video latents along the sequence dimension. A novel cross-video pairing strategy and a multi-stage training regimen are introduced to balance identity consistency and facial editability.
- Score: 23.40342294656802
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present Concat-ID, a unified framework for identity-preserving video generation. Concat-ID employs Variational Autoencoders to extract image features, which are concatenated with video latents along the sequence dimension, leveraging solely 3D self-attention mechanisms without the need for additional modules. A novel cross-video pairing strategy and a multi-stage training regimen are introduced to balance identity consistency and facial editability while enhancing video naturalness. Extensive experiments demonstrate Concat-ID's superiority over existing methods in both single and multi-identity generation, as well as its seamless scalability to multi-subject scenarios, including virtual try-on and background-controllable generation. Concat-ID establishes a new benchmark for identity-preserving video synthesis, providing a versatile and scalable solution for a wide range of applications.
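The sequence-dimension concatenation at the heart of Concat-ID can be illustrated with a minimal sketch. This is not the authors' code: it uses single-head attention without learned projections, and the token counts and dimensions are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def self_attention(tokens, d):
    # single-head scaled dot-product self-attention
    # (learned Q/K/V projections omitted for brevity)
    scores = tokens @ tokens.T / np.sqrt(d)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ tokens

d = 16
video_latents = rng.standard_normal((4 * 8, d))  # 4 frames x 8 spatial tokens
image_latents = rng.standard_normal((8, d))      # reference image, 8 spatial tokens

# Core idea: concatenate the image tokens with the video tokens along the
# sequence dimension, then let ordinary self-attention mix identity and
# video information -- no extra cross-attention module is needed.
seq = np.concatenate([video_latents, image_latents], axis=0)  # (40, d)
out = self_attention(seq, d)
video_out = out[: video_latents.shape[0]]  # keep only the video positions
```

Because the reference-image tokens sit in the same sequence as the video tokens, every video position can attend to the identity features through the model's existing 3D self-attention, which is why no additional conditioning module is required.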
Related papers
- ChatReID: Open-ended Interactive Person Retrieval via Hierarchical Progressive Tuning for Vision Language Models [49.09606704563898]
Person re-identification (Re-ID) is a critical task in human-centric intelligent systems. Recent studies have successfully integrated LVLMs with person Re-ID, yielding promising results. We propose a novel, versatile, one-for-all person Re-ID framework, ChatReID.
arXiv Detail & Related papers (2025-02-27T10:34:14Z) - Magic Mirror: ID-Preserved Video Generation in Video Diffusion Transformers [42.910185323392554]
We present Magic Mirror, a framework for generating identity-preserved videos with cinematic-level quality and dynamic motion. Our method introduces three key components: (1) a dual-branch facial feature extractor that captures both identity and structural features, (2) a lightweight cross-modal adapter with Conditioned Adaptive Normalization for efficient identity integration, and (3) a two-stage training strategy combining synthetic identity pairs with video data.
arXiv Detail & Related papers (2025-01-07T16:48:31Z) - DiTCtrl: Exploring Attention Control in Multi-Modal Diffusion Transformer for Tuning-Free Multi-Prompt Longer Video Generation [54.30327187663316]
DiTCtrl is the first training-free multi-prompt video generation method under the MM-DiT architecture. We analyze MM-DiT's attention mechanism, finding that its 3D full attention behaves similarly to the cross/self-attention blocks in UNet-like diffusion models. Based on our careful design, the videos generated by DiTCtrl achieve smooth transitions and consistent object motion given multiple sequential prompts.
arXiv Detail & Related papers (2024-12-24T18:51:19Z) - VividFace: A Diffusion-Based Hybrid Framework for High-Fidelity Video Face Swapping [43.30061680192465]
We present the first diffusion-based framework specifically designed for video face swapping. Our approach incorporates a specially designed diffusion model coupled with a VidFaceVAE. Our framework achieves superior performance in identity preservation, temporal consistency, and visual quality compared to existing methods.
arXiv Detail & Related papers (2024-12-15T18:58:32Z) - StableAnimator: High-Quality Identity-Preserving Human Image Animation [64.63765800569935]
This paper presents StableAnimator, the first end-to-end ID-preserving video diffusion framework. It synthesizes high-quality videos without any post-processing, conditioned on a reference image and a sequence of poses. During inference, we propose a novel Hamilton-Jacobi-Bellman (HJB) equation-based optimization to further enhance face quality.
arXiv Detail & Related papers (2024-11-26T18:59:22Z) - InstantFamily: Masked Attention for Zero-shot Multi-ID Image Generation [0.0]
"InstantFamily" is an approach that employs a novel cross-attention mechanism and a multimodal embedding stack to achieve zero-shot multi-ID image generation.
Our method effectively preserves ID as it utilizes global and local features from a pre-trained face recognition model integrated with text conditions.
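InstantFamily's masked cross-attention can be sketched in miniature: each spatial region of the image attends only to the identity embedding assigned to it. This is a simplified illustration under assumed shapes, not the paper's implementation; the per-region mask and token counts are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)

def masked_attention(q, k, v, mask):
    # mask[i, j] = 1 where query token i may attend to key token j
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = np.where(mask.astype(bool), scores, -1e9)  # block disallowed pairs
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v

d = 8
img_tokens = rng.standard_normal((6, d))  # 6 image tokens: first 3 in ID-A's
                                          # region, last 3 in ID-B's region
id_tokens = rng.standard_normal((2, d))   # one embedding per identity (A, B)

# Each spatial region attends only to its own identity embedding, so the
# two identities do not blend into each other.
mask = np.zeros((6, 2))
mask[:3, 0] = 1  # ID-A region -> ID-A embedding
mask[3:, 1] = 1  # ID-B region -> ID-B embedding
out = masked_attention(img_tokens, id_tokens, id_tokens, mask)
```

With exactly one permitted key per query, the softmax puts all its weight there, so each region's output collapses to its own identity embedding; in the real model the mask only restricts, rather than fully determines, the attention.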
arXiv Detail & Related papers (2024-04-30T10:16:21Z) - ID-Animator: Zero-Shot Identity-Preserving Human Video Generation [16.438935466843304]
ID-Animator is a zero-shot human-video generation approach that can perform personalized video generation given a single reference facial image without further training.
Our method is highly compatible with popular pre-trained T2V models such as AnimateDiff and various community backbone models.
arXiv Detail & Related papers (2024-04-23T17:59:43Z) - Magic-Me: Identity-Specific Video Customized Diffusion [72.05925155000165]
We propose a video generation framework with controllable subject identity, termed Video Custom Diffusion (VCD).
With a specified identity defined by a few images, VCD reinforces the identity characteristics and injects frame-wise correlation for stable video outputs.
We conducted extensive experiments to verify that VCD generates stable videos with better identity preservation than the baselines.
arXiv Detail & Related papers (2024-02-14T18:13:51Z) - A Video Is Worth Three Views: Trigeminal Transformers for Video-based Person Re-identification [77.08204941207985]
Video-based person re-identification (Re-ID) aims to retrieve video sequences of the same person under non-overlapping cameras.
We propose a novel framework named Trigeminal Transformers (TMT) for video-based person Re-ID.
arXiv Detail & Related papers (2021-04-05T02:50:16Z) - Attribute-aware Identity-hard Triplet Loss for Video-based Person Re-identification [51.110453988705395]
Video-based person re-identification (Re-ID) is an important computer vision task.
We introduce a new metric learning method called Attribute-aware Identity-hard Triplet Loss (AITL).
To achieve a complete model of video-based person Re-ID, a multi-task framework with Attribute-driven Spatio-Temporal Attention (ASTA) mechanism is also proposed.
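For reference, the standard triplet loss that AITL builds on has a simple form; the sketch below shows only this generic baseline, not AITL's attribute-aware hard mining, and the margin value is an illustrative assumption.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.3):
    # standard triplet loss: pull the positive (same identity) closer to the
    # anchor than the negative (different identity), by at least `margin`
    d_ap = np.linalg.norm(anchor - positive)
    d_an = np.linalg.norm(anchor - negative)
    return max(d_ap - d_an + margin, 0.0)

# toy 2-D embeddings
a = np.array([0.0, 0.0])
p = np.array([1.0, 0.0])
n = np.array([1.0, 0.0])

loss_hard = triplet_loss(a, p, n)  # positive and negative equally far
loss_easy = triplet_loss(a, a, n)  # positive coincides with anchor
```

AITL modifies which triplets are selected (identity-hard examples, guided by attributes) rather than the loss form itself.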
arXiv Detail & Related papers (2020-06-13T09:15:38Z) - Towards Precise Intra-camera Supervised Person Re-identification [54.86892428155225]
Intra-camera supervision (ICS) for person re-identification (Re-ID) assumes that identity labels are independently annotated within each camera view.
Lack of inter-camera labels makes the ICS Re-ID problem much more challenging than the fully supervised counterpart.
Our approach performs comparably even to state-of-the-art fully supervised methods on two of the datasets.
arXiv Detail & Related papers (2020-02-12T11:56:30Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.