AnyMS: Bottom-up Attention Decoupling for Layout-guided and Training-free Multi-subject Customization
- URL: http://arxiv.org/abs/2512.23537v2
- Date: Fri, 02 Jan 2026 06:21:26 GMT
- Title: AnyMS: Bottom-up Attention Decoupling for Layout-guided and Training-free Multi-subject Customization
- Authors: Binhe Yu, Zhen Wang, Kexin Li, Yuqian Yuan, Wenqiao Zhang, Long Chen, Juncheng Li, Jun Xiao, Yueting Zhuang
- Abstract summary: We present AnyMS, a training-free framework for layout-guided multi-subject customization. AnyMS leverages three input conditions: text prompt, subject images, and layout constraints. AnyMS achieves state-of-the-art performance, supporting complex compositions and scaling to a larger number of subjects.
- Score: 55.06425570300248
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multi-subject customization aims to synthesize multiple user-specified subjects into a coherent image. To address issues such as missing subjects or subject conflicts, recent works incorporate layout guidance to provide explicit spatial constraints. However, existing methods still struggle to balance three critical objectives: text alignment, subject identity preservation, and layout control, while the reliance on additional training further limits their scalability and efficiency. In this paper, we present AnyMS, a novel training-free framework for layout-guided multi-subject customization. AnyMS leverages three input conditions (text prompt, subject images, and layout constraints) and introduces a bottom-up dual-level attention decoupling mechanism to harmonize their integration during generation. Specifically, global decoupling separates cross-attention between textual and visual conditions to ensure text alignment. Local decoupling confines each subject's attention to its designated area, which prevents subject conflicts and thus guarantees identity preservation and layout control. Moreover, AnyMS employs pre-trained image adapters to extract subject-specific features aligned with the diffusion model, removing the need for subject learning or adapter tuning. Extensive experiments demonstrate that AnyMS achieves state-of-the-art performance, supporting complex compositions and scaling to a larger number of subjects.
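The local decoupling step described in the abstract can be sketched as region-masked cross-attention: each subject's keys and values contribute only to the latent positions inside that subject's layout box. The names, shapes, and masking scheme below are illustrative assumptions for a minimal sketch, not the paper's actual implementation:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def local_decoupled_attention(q, subject_kv, region_masks):
    """Confine each subject's cross-attention to its designated area.

    q            : (L, d) queries over L = H*W latent positions
    subject_kv   : list of (K_i, V_i) pairs, each of shape (S_i, d),
                   one pair per reference subject
    region_masks : list of boolean (L,) masks, True inside subject i's box
    (Hypothetical interface; AnyMS's exact formulation may differ.)
    """
    d = q.shape[-1]
    out = np.zeros_like(q)
    for (k, v), mask in zip(subject_kv, region_masks):
        attn = softmax(q @ k.T / np.sqrt(d))   # (L, S_i) attention weights
        out += mask[:, None] * (attn @ v)      # zero contribution outside the box
    return out
```

Because each subject's contribution is zeroed outside its layout region, two subjects can never compete for the same latent positions, which is the intuition behind preventing subject conflicts.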
Related papers
- Multi-Grained Text-Guided Image Fusion for Multi-Exposure and Multi-Focus Scenarios [12.461120447513487]
Multi-grained Text-guided Image Fusion (MTIF) is a novel fusion paradigm with three key designs. First, it introduces multi-grained textual descriptions that separately capture fine details, structural cues, and semantic content. Second, it adds supervision signals at each granularity to facilitate alignment between visual and textual features. Third, it adopts a saliency-driven enrichment module to augment training data with dense semantic content.
arXiv Detail & Related papers (2025-12-23T17:55:35Z)
- 3SGen: Unified Subject, Style, and Structure-Driven Image Generation with Adaptive Task-specific Memory [54.056509629389915]
3SGen is a task-aware unified framework that performs all three conditioning modes within a single model. At its core, an Adaptive Task-specific Memory (ATM) module dynamically disentangles, stores, and retrieves condition-specific priors. We propose 3SGen-Bench, a unified image-driven generation benchmark with standardized metrics for evaluating cross-task fidelity and controllability.
arXiv Detail & Related papers (2025-12-22T11:07:27Z)
- ConsistCompose: Unified Multimodal Layout Control for Image Composition [56.909072845166264]
We present ConsistCompose, a unified framework that embeds layout coordinates directly into language prompts. We show that ConsistCompose substantially improves spatial accuracy over layout-controlled baselines.
arXiv Detail & Related papers (2025-11-23T08:14:53Z)
- MUSE: Multi-Subject Unified Synthesis via Explicit Layout Semantic Expansion [15.787883177836362]
We address the task of layout-controllable multi-subject synthesis (LMS), which requires both faithful reconstruction of reference subjects and their accurate placement in specified regions within a unified image. We propose MUSE, a unified synthesis framework that seamlessly integrates layout specifications with textual guidance through explicit semantic expansion.
arXiv Detail & Related papers (2025-08-20T05:52:26Z)
- MUSAR: Exploring Multi-Subject Customization from Single-Subject Dataset via Attention Routing [14.88610127301938]
MUSAR is a framework that achieves robust multi-subject customization while requiring only single-subject training data. It constructs diptych training pairs from single-subject images to facilitate multi-subject learning, while actively correcting the distribution bias introduced by diptych construction. Experiments demonstrate that MUSAR outperforms existing methods, even those trained on multi-subject datasets.
arXiv Detail & Related papers (2025-05-05T17:50:24Z)
- Nested Attention: Semantic-aware Attention Values for Concept Personalization [78.90196530697897]
We introduce Nested Attention, a novel mechanism that injects a rich and expressive image representation into the model's existing cross-attention layers. Our key idea is to generate query-dependent subject values, derived from nested attention layers that learn to select relevant subject features for each region in the generated image.
arXiv Detail & Related papers (2025-01-02T18:52:11Z)
- MS-Diffusion: Multi-subject Zero-shot Image Personalization with Layout Guidance [5.452759083801634]
This research introduces the MS-Diffusion framework for layout-guided zero-shot image personalization with multiple subjects. The proposed multi-subject cross-attention orchestrates inter-subject compositions while preserving the control of texts.
arXiv Detail & Related papers (2024-06-11T12:32:53Z)
- LoCo: Locally Constrained Training-Free Layout-to-Image Synthesis [24.925757148750684]
We propose a training-free approach for layout-to-image synthesis that excels in producing high-quality images aligned with both textual prompts and layout instructions.
LoCo seamlessly integrates into existing text-to-image and layout-to-image models, enhancing their performance in spatial control and addressing semantic failures observed in prior methods.
arXiv Detail & Related papers (2023-11-21T04:28:12Z)
- Cones 2: Customizable Image Synthesis with Multiple Subjects [50.54010141032032]
We study how to efficiently represent a particular subject as well as how to appropriately compose different subjects.
By rectifying the activations in the cross-attention map, the layout appoints and separates the location of different subjects in the image.
arXiv Detail & Related papers (2023-05-30T18:00:06Z)
- Image-Specific Information Suppression and Implicit Local Alignment for Text-based Person Search [61.24539128142504]
Text-based person search (TBPS) is a challenging task that aims to search pedestrian images with the same identity from an image gallery given a query text.
Most existing methods rely on explicitly generated local parts to model fine-grained correspondence between modalities.
We propose an efficient joint Multi-level Alignment Network (MANet) for TBPS, which can learn aligned image/text feature representations between modalities at multiple levels.
arXiv Detail & Related papers (2022-08-30T16:14:18Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences of its use.