MUSE: Multi-Subject Unified Synthesis via Explicit Layout Semantic Expansion
- URL: http://arxiv.org/abs/2508.14440v1
- Date: Wed, 20 Aug 2025 05:52:26 GMT
- Title: MUSE: Multi-Subject Unified Synthesis via Explicit Layout Semantic Expansion
- Authors: Fei Peng, Junqiang Wu, Yan Li, Tingting Gao, Di Zhang, Huiyuan Fu,
- Abstract summary: We address the task of layout-controllable multi-subject synthesis (LMS), which requires both faithful reconstruction of reference subjects and their accurate placement in specified regions within a unified image. We propose MUSE, a unified synthesis framework that seamlessly integrates layout specifications with textual guidance through explicit semantic expansion.
- Score: 15.787883177836362
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Existing text-to-image diffusion models have demonstrated remarkable capabilities in generating high-quality images guided by textual prompts. However, achieving multi-subject compositional synthesis with precise spatial control remains a significant challenge. In this work, we address the task of layout-controllable multi-subject synthesis (LMS), which requires both faithful reconstruction of reference subjects and their accurate placement in specified regions within a unified image. While recent advancements have separately improved layout control and subject synthesis, existing approaches struggle to simultaneously satisfy the dual requirements of spatial precision and identity preservation in this composite task. To bridge this gap, we propose MUSE, a unified synthesis framework that employs concatenated cross-attention (CCA) to seamlessly integrate layout specifications with textual guidance through explicit semantic space expansion. The proposed CCA mechanism enables bidirectional modality alignment between spatial constraints and textual descriptions without interference. Furthermore, we design a progressive two-stage training strategy that decomposes the LMS task into learnable sub-objectives for effective optimization. Extensive experiments demonstrate that MUSE achieves zero-shot end-to-end generation with superior spatial accuracy and identity consistency compared to existing solutions, advancing the frontier of controllable image synthesis. Our code and model are available at https://github.com/pf0607/MUSE.
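The abstract describes concatenated cross-attention (CCA), where layout constraints and the text prompt are fused into one expanded conditioning sequence rather than injected through separate pathways. Below is a minimal sketch of that idea under assumed interfaces: the box encoder, the per-subject reference embeddings, and all shapes are illustrative guesses, not the released implementation (see the repository linked above for the actual code).

```python
# Illustrative sketch of concatenated cross-attention (CCA): layout tokens are
# embedded into the same semantic space as the text tokens and concatenated
# along the sequence axis, so a single cross-attention call attends to both
# conditions jointly. Module names, shapes, and the box encoding are
# assumptions, not the authors' implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConcatenatedCrossAttention(nn.Module):
    def __init__(self, dim: int, ctx_dim: int, num_heads: int = 8):
        super().__init__()
        self.num_heads = num_heads
        self.to_q = nn.Linear(dim, dim, bias=False)
        self.to_k = nn.Linear(ctx_dim, dim, bias=False)
        self.to_v = nn.Linear(ctx_dim, dim, bias=False)
        self.proj = nn.Linear(dim, dim)
        # Hypothetical layout encoder: maps (x1, y1, x2, y2) boxes plus a
        # per-subject reference embedding into "layout tokens" in text space.
        self.box_mlp = nn.Sequential(
            nn.Linear(4, ctx_dim), nn.SiLU(), nn.Linear(ctx_dim, ctx_dim)
        )

    def forward(self, img_tokens, text_tokens, boxes, subject_embs):
        # img_tokens:   (B, N, dim)      latent image tokens (queries)
        # text_tokens:  (B, T, ctx_dim)  prompt embeddings
        # boxes:        (B, S, 4)        normalized subject boxes
        # subject_embs: (B, S, ctx_dim)  per-subject reference embeddings
        layout_tokens = self.box_mlp(boxes) + subject_embs           # (B, S, ctx_dim)
        context = torch.cat([text_tokens, layout_tokens], dim=1)     # expanded semantic space
        q, k, v = self.to_q(img_tokens), self.to_k(context), self.to_v(context)
        B, N, D = q.shape
        h = self.num_heads
        q = q.view(B, N, h, D // h).transpose(1, 2)
        k = k.view(B, -1, h, D // h).transpose(1, 2)
        v = v.view(B, -1, h, D // h).transpose(1, 2)
        out = F.scaled_dot_product_attention(q, k, v)                 # joint text+layout attention
        out = out.transpose(1, 2).reshape(B, N, D)
        return self.proj(out)
```

Because both conditions live in one key/value sequence, the image queries attend to layout and text jointly, which is one way to read the abstract's "semantic space expansion" without cross-condition interference.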
Related papers
- AnyMS: Bottom-up Attention Decoupling for Layout-guided and Training-free Multi-subject Customization [55.06425570300248]
We present AnyMS, a training-free framework for layout-guided multi-subject customization. AnyMS leverages three input conditions: text prompt, subject images, and layout constraints. AnyMS achieves state-of-the-art performance, supporting complex compositions and scaling to a larger number of subjects.
arXiv Detail & Related papers (2025-12-29T15:26:25Z)
- A Two-Stage System for Layout-Controlled Image Generation using Large Language Models and Diffusion Models [0.0]
Text-to-image diffusion models exhibit remarkable generative capabilities, but lack precise control over object counts and spatial arrangements. This work introduces a two-stage system to address these compositional limitations. The first stage employs a Large Language Model (LLM) to generate a structured layout from a list of objects. The second stage uses a layout-conditioned diffusion model to synthesize a photorealistic image adhering to this layout.
arXiv Detail & Related papers (2025-11-10T09:40:48Z)
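The two-stage split summarized above (an LLM emits a structured layout, then a layout-conditioned diffusion model renders it) can be illustrated with a small, stubbed pipeline. Everything below, including the JSON layout schema and both function bodies, is an assumption for illustration; the actual system would call an LLM and a diffusion model at these points.

```python
# Stubbed sketch of the two-stage pipeline: stage 1 turns an object list into
# a structured JSON layout, stage 2 consumes that layout as bounding boxes.
import json

def stage1_generate_layout(object_list):
    """Stand-in for the LLM stage: return a structured layout as JSON
    with one normalized (x1, y1, x2, y2) box per requested object."""
    n = len(object_list)
    layout = [
        {"object": name, "bbox": [i / n, 0.25, (i + 1) / n, 0.75]}
        for i, name in enumerate(object_list)
    ]
    return json.dumps(layout)

def stage2_render(prompt, layout_json):
    """Stand-in for the layout-conditioned diffusion stage: only validates
    the layout and reports what would be synthesized."""
    layout = json.loads(layout_json)
    for item in layout:
        x1, y1, x2, y2 = item["bbox"]
        assert 0.0 <= x1 < x2 <= 1.0 and 0.0 <= y1 < y2 <= 1.0
    return f"{prompt}: placing {[item['object'] for item in layout]}"

if __name__ == "__main__":
    layout = stage1_generate_layout(["a corgi", "a red ball", "a park bench"])
    print(stage2_render("a sunny park scene", layout))
```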
- UniAlignment: Semantic Alignment for Unified Image Generation, Understanding, Manipulation and Perception [54.53657134205492]
UniAlignment is a unified multimodal generation framework within a single diffusion transformer. It incorporates both intrinsic-modal semantic alignment and cross-modal semantic alignment, thereby enhancing the model's cross-modal consistency and instruction-following robustness. We present SemGen-Bench, a new benchmark specifically designed to evaluate multimodal semantic consistency under complex textual instructions.
arXiv Detail & Related papers (2025-09-28T09:11:30Z)
- Identity-Preserving Text-to-Video Generation Guided by Simple yet Effective Spatial-Temporal Decoupled Representations [66.97034863216892]
Identity-preserving text-to-video (IPT2V) generation aims to create high-fidelity videos with consistent human identity. Current end-to-end frameworks suffer from a critical spatial-temporal trade-off. We propose a simple yet effective spatial-temporal decoupled framework that decomposes representations into spatial features for layouts and temporal features for motion dynamics.
arXiv Detail & Related papers (2025-07-07T06:54:44Z)
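As a rough illustration of the spatial-temporal decoupling described in this summary, the sketch below splits a video feature tensor into a time-pooled spatial pathway and a frame-difference temporal pathway; the module, shapes, and pooling choices are assumptions, not the paper's architecture.

```python
# Minimal sketch of a spatial-temporal decoupled representation: spatial
# features capture per-frame layout/identity, temporal features capture
# frame-to-frame dynamics. All shapes and projections are illustrative.
import torch
import torch.nn as nn

class SpatialTemporalDecoupler(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.spatial_proj = nn.Linear(dim, dim)   # identity / layout pathway
        self.temporal_proj = nn.Linear(dim, dim)  # motion-dynamics pathway

    def forward(self, video_feats):
        # video_feats: (B, T, N, D) = batch, frames, tokens per frame, channels
        spatial = self.spatial_proj(video_feats.mean(dim=1))    # (B, N, D), pooled over time
        deltas = video_feats[:, 1:] - video_feats[:, :-1]       # frame differences
        temporal = self.temporal_proj(deltas.mean(dim=2))       # (B, T-1, D), pooled over space
        return spatial, temporal

# usage: s, t = SpatialTemporalDecoupler(512)(torch.randn(2, 16, 256, 512))
```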
- HCMA: Hierarchical Cross-model Alignment for Grounded Text-to-Image Generation [27.770224730465237]
We propose a Hierarchical Cross-Modal Alignment (HCMA) framework for grounded text-to-image generation. HCMA integrates two alignment modules into each diffusion sampling step. Experiments on the MS-COCO 2014 validation set show that HCMA surpasses state-of-the-art baselines.
arXiv Detail & Related papers (2025-05-10T05:02:58Z)
- Spatial Transport Optimization by Repositioning Attention Map for Training-Free Text-to-Image Synthesis [5.869767284889891]
Diffusion-based text-to-image (T2I) models have excelled in high-quality image generation. We propose STORM, a novel training-free approach for spatially coherent T2I synthesis.
arXiv Detail & Related papers (2025-03-28T06:12:25Z)
- Cross-Modal Bidirectional Interaction Model for Referring Remote Sensing Image Segmentation [50.433911327489554]
The goal of referring remote sensing image segmentation (RRSIS) is to generate a pixel-level mask of the target object identified by the referring expression. To address the aforementioned challenges, a novel RRSIS framework is proposed, termed the cross-modal bidirectional interaction model (CroBIM). To further foster research on RRSIS, we also construct RISBench, a new large-scale benchmark dataset comprising 52,472 image-language-label triplets.
arXiv Detail & Related papers (2024-10-11T08:28:04Z)
- Layered Rendering Diffusion Model for Controllable Zero-Shot Image Synthesis [15.76266032768078]
This paper introduces innovative solutions to enhance spatial controllability in diffusion models reliant on text queries. We first introduce vision guidance as a foundational spatial cue within the perturbed distribution. We propose a universal framework, Layered Rendering Diffusion (LRDiff), which constructs an image-rendering process with multiple layers.
arXiv Detail & Related papers (2023-11-30T10:36:19Z)
- SSMG: Spatial-Semantic Map Guided Diffusion Model for Free-form Layout-to-Image Generation [68.42476385214785]
We propose a novel Spatial-Semantic Map Guided (SSMG) diffusion model that adopts the feature map, derived from the layout, as guidance.
SSMG achieves superior generation quality with sufficient spatial and semantic controllability compared to previous works.
We also propose the Relation-Sensitive Attention (RSA) and Location-Sensitive Attention (LSA) mechanisms.
arXiv Detail & Related papers (2023-08-20T04:09:12Z)
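The SSMG summary above hinges on turning a layout into a dense feature map that guides the diffusion model. A minimal sketch of that rasterization step follows; the per-phrase embeddings, map resolution, and overwrite rule for overlapping boxes are illustrative assumptions rather than the paper's design (which also adds the RSA/LSA attention mechanisms not shown here).

```python
# Rasterize layout boxes into a spatial-semantic guidance map: each box
# region is filled with its phrase embedding, giving a (C, H, W) tensor a
# diffusion model could be conditioned on. Shapes are illustrative only.
import torch

def layout_to_semantic_map(boxes, phrase_embs, height=64, width=64):
    # boxes:       (S, 4) normalized (x1, y1, x2, y2)
    # phrase_embs: (S, C) one embedding per layout phrase
    S, C = phrase_embs.shape
    fmap = torch.zeros(C, height, width)
    for s in range(S):
        x1, y1, x2, y2 = boxes[s].tolist()
        r1, r2 = int(y1 * height), max(int(y2 * height), int(y1 * height) + 1)
        c1, c2 = int(x1 * width), max(int(x2 * width), int(x1 * width) + 1)
        # later boxes overwrite earlier ones where regions overlap
        fmap[:, r1:r2, c1:c2] = phrase_embs[s].view(C, 1, 1)
    return fmap

# usage: m = layout_to_semantic_map(torch.tensor([[0.1, 0.2, 0.5, 0.8]]), torch.randn(1, 16))
```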
- LAW-Diffusion: Complex Scene Generation by Diffusion with Layouts [107.11267074981905]
We propose a semantically controllable layout-AWare diffusion model, termed LAW-Diffusion.
We show that LAW-Diffusion yields the state-of-the-art generative performance, especially with coherent object relations.
arXiv Detail & Related papers (2023-08-13T08:06:18Z)
- LeftRefill: Filling Right Canvas based on Left Reference through Generalized Text-to-Image Diffusion Model [55.20469538848806]
This paper introduces LeftRefill, an innovative approach to efficiently harness large Text-to-Image (T2I) diffusion models for reference-guided image synthesis.
arXiv Detail & Related papers (2023-05-19T10:29:42Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this content (including all information) and is not responsible for any consequences.