A Training-Free Approach for Multi-ID Customization via Attention Adjustment and Spatial Control
- URL: http://arxiv.org/abs/2511.20401v1
- Date: Tue, 25 Nov 2025 15:28:10 GMT
- Title: A Training-Free Approach for Multi-ID Customization via Attention Adjustment and Spatial Control
- Authors: Jiawei Lin, Guanlong Jiao, Jianjin Xu, et al.
- Abstract summary: Multi-ID customization is much more difficult than single-ID customization and poses two major challenges. The model often encounters the copy-paste issue during inference, leading to lower quality. We present an ID-decoupled cross-attention mechanism, injecting distinct ID embeddings into the corresponding image regions.
- Score: 7.810140287905315
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Multi-ID customization is an interesting topic in computer vision that has recently attracted considerable attention. Given the ID images of multiple individuals, its purpose is to generate a customized image that seamlessly integrates them while preserving their respective identities. Compared to single-ID customization, multi-ID customization is much more difficult and poses two major challenges. First, since a multi-ID customization model is trained to reconstruct an image from the cropped person regions, it often encounters the copy-paste issue during inference, leading to lower quality. Second, the model also suffers from inferior text controllability: the generated result simply combines multiple persons into one image, regardless of whether it aligns with the input text. In this work, we propose MultiID to tackle this challenging task in a training-free manner. Since existing single-ID customization models suffer less from the copy-paste issue, our key idea is to adapt these models to achieve multi-ID customization. To this end, we present an ID-decoupled cross-attention mechanism that injects distinct ID embeddings into the corresponding image regions, thus generating multi-ID outputs. To enhance generation controllability, we introduce three critical strategies, namely the local prompt, depth-guided spatial control, and extended self-attention, making the results more consistent with the text prompts and ID images. We also carefully build a benchmark, called IDBench, for evaluation. Extensive qualitative and quantitative results demonstrate the effectiveness of MultiID in solving the aforementioned two challenges. Its performance is comparable to, or even better than, that of training-based multi-ID customization methods.
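To make the core mechanism concrete, below is a minimal sketch of how an ID-decoupled cross-attention step could work: each identity's projected embedding is attended to separately, and its output is injected only into the image tokens covered by that identity's region mask. The function name, tensor shapes, and mask format are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch (assumed shapes, not the authors' code) of
# ID-decoupled cross-attention: each ID embedding is attended to
# separately and written only into its own spatial region.
import torch
import torch.nn.functional as F

def id_decoupled_cross_attention(q, id_kv, region_masks, base_out):
    """
    q:            (tokens, dim)  image-query tokens for one attention head
    id_kv:        list of (k, v) pairs, one per identity;
                  k, v: (id_tokens, dim) projected ID embeddings
    region_masks: (num_ids, tokens) boolean, True where identity i is injected
    base_out:     (tokens, dim) attention output from the text prompt,
                  kept outside every ID region
    """
    dim = q.shape[-1]
    out = base_out.clone()
    for (k, v), mask in zip(id_kv, region_masks):
        attn = F.softmax(q @ k.T / dim**0.5, dim=-1)  # (tokens, id_tokens)
        id_out = attn @ v                             # (tokens, dim)
        out[mask] = id_out[mask]  # inject this ID only inside its region
    return out

# Toy usage: 64 image tokens, 2 identities, feature dim 8.
q = torch.randn(64, 8)
kv = [(torch.randn(4, 8), torch.randn(4, 8)) for _ in range(2)]
masks = torch.zeros(2, 64, dtype=torch.bool)
masks[0, :32], masks[1, 32:] = True, True
print(id_decoupled_cross_attention(q, kv, masks, torch.randn(64, 8)).shape)
```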
Related papers
- Optimizing ID Consistency in Multimodal Large Models: Facial Restoration via Alignment, Entanglement, and Disentanglement [54.199726425201895]
Large multimodal editing models have demonstrated powerful editing capabilities across diverse tasks. Current facial ID preservation methods struggle to achieve consistent restoration of both facial identity and edited-element IP. We propose EditedID, an Alignment-Disentanglement-Entanglement framework for robust identity-specific facial restoration.
arXiv Detail & Related papers (2026-02-21T08:24:42Z) - WithAnyone: Towards Controllable and ID Consistent Image Generation [83.55786496542062]
Identity-consistent generation has become an important focus in text-to-image research. We develop a large-scale paired dataset tailored for multi-person scenarios. We propose a novel training paradigm with a contrastive identity loss that leverages paired data to balance fidelity with diversity.
arXiv Detail & Related papers (2025-10-16T17:59:54Z) - UMO: Scaling Multi-Identity Consistency for Image Customization via Matching Reward [15.094319754425468]
We present UMO, a framework designed to maintain high-fidelity identity preservation and alleviate identity confusion with scalability. With a "multi-to-multi matching" paradigm, UMO reformulates multi-identity generation as a global assignment optimization problem (see the assignment sketch after this list). We develop a scalable customization dataset with multi-reference images, consisting of both synthesized and real parts.
arXiv Detail & Related papers (2025-09-08T15:54:55Z) - PS-ReID: Advancing Person Re-Identification and Precise Segmentation with Multimodal Retrieval [38.530536338075684]
Person re-identification (ReID) plays a critical role in applications such as security surveillance and criminal investigations. We propose PS-ReID, a multimodal model that combines image and text inputs to enhance ReID performance. Experimental results demonstrate that PS-ReID significantly outperforms unimodal query-based models in both ReID and segmentation tasks.
arXiv Detail & Related papers (2025-03-27T15:14:03Z) - DynamicID: Zero-Shot Multi-ID Image Personalization with Flexible Facial Editability [12.692129257068085]
We present DynamicID, a tuning-free framework that inherently facilitates both single-ID and multi-ID personalized generation. Our key innovations include: 1) Semantic-Activated Attention (SAA), which employs query-level activation gating to minimize disruption to the base model when injecting ID features and to achieve multi-ID personalization without requiring multi-ID samples during training; 2) Identity-Motion Reconfigurator (IMR), which applies feature-space manipulation to effectively disentangle facial motion and identity features, supporting flexible facial editing; and 3) a task-decoupled training paradigm that reduces data dependency, together with VariFace-10k.
arXiv Detail & Related papers (2025-03-09T08:16:19Z) - ChatReID: Open-ended Interactive Person Retrieval via Hierarchical Progressive Tuning for Vision Language Models [49.09606704563898]
Person re-identification is a crucial task in computer vision, aiming to recognize individuals across non-overlapping camera views. We propose ChatReID, a novel framework that shifts the focus towards a text-side-dominated retrieval paradigm, enabling flexible and interactive re-identification. We introduce a hierarchical progressive tuning strategy, which endows the model with Re-ID ability through three stages of tuning, i.e., from person attribute understanding to fine-grained image retrieval to multi-modal task reasoning.
arXiv Detail & Related papers (2025-02-27T10:34:14Z) - Omni-ID: Holistic Identity Representation Designed for Generative Tasks [75.29174595706533]
Omni-ID encodes holistic information about an individual's appearance across diverse expressions. It consolidates information from a varied number of unstructured input images into a structured representation. It demonstrates substantial improvements over conventional representations across various generative tasks.
arXiv Detail & Related papers (2024-12-12T19:21:20Z) - Synthesizing Efficient Data with Diffusion Models for Person Re-Identification Pre-Training [51.87027943520492]
We present Diffusion-ReID, a novel paradigm that efficiently augments and generates diverse images based on known identities.
Benefiting from our proposed paradigm, we first create a new large-scale person Re-ID dataset Diff-Person, which consists of over 777K images from 5,183 identities.
arXiv Detail & Related papers (2024-06-10T06:26:03Z) - InstantFamily: Masked Attention for Zero-shot Multi-ID Image Generation [0.0]
"InstantFamily" is an approach that employs a novel cross-attention mechanism and a multimodal embedding stack to achieve zero-shot multi-ID image generation.
Our method effectively preserves ID as it utilizes global and local features from a pre-trained face recognition model integrated with text conditions.
arXiv Detail & Related papers (2024-04-30T10:16:21Z) - ID-Aligner: Enhancing Identity-Preserving Text-to-Image Generation with Reward Feedback Learning [57.91881829308395]
Identity-preserving text-to-image generation (ID-T2I) has received significant attention due to its wide range of application scenarios, such as AI portraits and advertising.
We present ID-Aligner, a general feedback learning framework to enhance ID-T2I performance.
arXiv Detail & Related papers (2024-04-23T18:41:56Z) - Identity Decoupling for Multi-Subject Personalization of Text-to-Image Models [66.05234562835136]
We present MuDI, a novel framework that enables multi-subject personalization.
Our main idea is to utilize subjects segmented by a segmentation foundation model.
Experimental results show that our MuDI can produce high-quality personalized images without identity mixing.
arXiv Detail & Related papers (2024-04-05T17:45:22Z)
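For intuition on UMO's "multi-to-multi matching" paradigm referenced above, here is a minimal sketch of casting multi-identity matching as a global assignment problem solved with the Hungarian algorithm. The inputs (face embeddings compared by cosine similarity) and the function name are assumptions for illustration; this is not UMO's released code or its reward formulation.

```python
# Sketch (assumed setup, not UMO's implementation) of global assignment
# between generated faces and reference identities: maximize total
# embedding similarity via the Hungarian algorithm.
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_ids(gen_embeds, ref_embeds):
    """gen_embeds: (n, d) embeddings of faces detected in the generated
    image; ref_embeds: (m, d) embeddings of the reference identities.
    Returns (gen_idx, ref_idx) index arrays of the optimal matching."""
    gen = gen_embeds / np.linalg.norm(gen_embeds, axis=1, keepdims=True)
    ref = ref_embeds / np.linalg.norm(ref_embeds, axis=1, keepdims=True)
    cost = -gen @ ref.T  # negated cosine similarity: minimize total cost
    return linear_sum_assignment(cost)

# Toy usage: 3 generated faces vs. 3 reference identities, 128-d embeddings.
rng = np.random.default_rng(0)
g, r = rng.normal(size=(3, 128)), rng.normal(size=(3, 128))
rows, cols = match_ids(g, r)
print(list(zip(rows.tolist(), cols.tolist())))
```

A global assignment penalizes identity confusion directly: unlike greedy per-face matching, it cannot assign two generated faces to the same reference identity.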