InstanceAssemble: Layout-Aware Image Generation via Instance Assembling Attention
- URL: http://arxiv.org/abs/2509.16691v2
- Date: Tue, 28 Oct 2025 03:37:32 GMT
- Title: InstanceAssemble: Layout-Aware Image Generation via Instance Assembling Attention
- Authors: Qiang Xiang, Shuang Sun, Binglei Li, Dejia Song, Huaxia Li, Nemo Chen, Xu Tang, Yao Hu, Junping Zhang
- Abstract summary: Diffusion models have demonstrated remarkable capabilities in generating high-quality images. We propose InstanceAssemble, a novel architecture that incorporates layout conditions via instance-assembling attention. Our method adapts flexibly to existing DiT-based T2I models through lightweight LoRA modules.
- Score: 42.72578449522177
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Diffusion models have demonstrated remarkable capabilities in generating high-quality images. Recent advancements in Layout-to-Image (L2I) generation have leveraged positional conditions and textual descriptions to facilitate precise and controllable image synthesis. Despite overall progress, current L2I methods still exhibit suboptimal performance. Therefore, we propose InstanceAssemble, a novel architecture that incorporates layout conditions via instance-assembling attention, enabling position control with bounding boxes (bbox) and multimodal content control including texts and additional visual content. Our method adapts flexibly to existing DiT-based T2I models through lightweight LoRA modules. Additionally, we propose Denselayout, a comprehensive benchmark for layout-to-image generation containing 5k images with 90k instances in total. We further introduce Layout Grounding Score (LGS), an interpretable evaluation metric to more precisely assess the accuracy of L2I generation. Experiments demonstrate that our InstanceAssemble method achieves state-of-the-art performance under complex layout conditions, while exhibiting strong compatibility with diverse style LoRA modules. The code and pretrained models are publicly available at https://github.com/FireRedTeam/InstanceAssemble.
Related papers
- Draw-In-Mind: Rebalancing Designer-Painter Roles in Unified Multimodal Models Benefits Image Editing [53.197392152109636]
We introduce Draw-In-Mind (DIM), a dataset consisting of two complementary subsets: (i) DIM-T2I, containing 14M long-context image-text pairs to enhance complex instruction comprehension; and (ii) DIM-Edit, consisting of 233K chain-of-thought imaginations generated by GPT-4o, serving as explicit design blueprints for image edits. DIM-4.6B-T2I/Edit achieves competitive performance on the ImgEdit and GEdit-Bench benchmarks, outperforming much larger models such as UniWorld-V1 and Step1X-Edit.
arXiv Detail & Related papers (2025-09-02T06:06:52Z)
- LAMIC: Layout-Aware Multi-Image Composition via Scalability of Multimodal Diffusion Transformer [32.9330637921386]
LAMIC is a Layout-Aware Multi-Image Composition framework. It extends single-reference diffusion models to multi-reference scenarios in a training-free manner. It consistently outperforms existing multi-reference baselines in ID-S, BG-S, IN-R and AVG scores across all settings.
arXiv Detail & Related papers (2025-08-01T09:51:54Z)
- Draw ALL Your Imagine: A Holistic Benchmark and Agent Framework for Complex Instruction-based Image Generation [9.978181430065987]
LongBench-T2I is a benchmark for evaluating text-to-image (T2I) models under complex instructions. LongBench-T2I consists of 500 intricately designed prompts spanning nine diverse visual evaluation dimensions. Plan2Gen is a framework that facilitates complex instruction-driven image generation without requiring additional model training.
arXiv Detail & Related papers (2025-05-30T16:48:14Z)
- Efficient Multi-Instance Generation with Janus-Pro-Dirven Prompt Parsing [53.295515505026096]
Janus-Pro-driven Prompt Parsing is a prompt-parsing module that bridges text understanding and layout generation. MIGLoRA is a parameter-efficient plug-in integrating Low-Rank Adaptation into UNet (SD1.5) and DiT (SD3) backbones. The proposed method achieves state-of-the-art performance on COCO and LVIS benchmarks while maintaining parameter efficiency.
arXiv Detail & Related papers (2025-03-27T00:59:14Z)
- STAY Diffusion: Styled Layout Diffusion Model for Diverse Layout-to-Image Generation [4.769823364778397]
We propose a diffusion-based model that produces photo-realistic images and provides fine-grained control of stylized objects in scenes. Our approach learns a global condition for each layout, and a self-supervised semantic map for weight modulation. A new Styled-Mask Attention (SM Attention) is also introduced to cross-condition the global condition and image feature for capturing the objects' relationships.
arXiv Detail & Related papers (2025-03-15T17:36:24Z)
- ToLo: A Two-Stage, Training-Free Layout-To-Image Generation Framework For High-Overlap Layouts [2.799190378263432]
ToLo is a two-stage, training-free layout-to-image generation framework for high-overlap layouts. We show that ToLo significantly enhances the performance of existing methods when dealing with high-overlap layouts.
arXiv Detail & Related papers (2025-03-03T15:41:51Z)
- CreatiLayout: Siamese Multimodal Diffusion Transformer for Creative Layout-to-Image Generation [78.21134311493303]
Diffusion models have been recognized for their ability to generate images that are not only visually appealing but also of high artistic quality. Layout-to-image generation has been proposed to leverage region-specific positions and descriptions to enable more precise and controllable generation. We present a systematic solution that integrates the layout model, dataset, and planner for creative layout-to-image generation.
arXiv Detail & Related papers (2024-12-05T04:09:47Z)
- Boundary Attention Constrained Zero-Shot Layout-To-Image Generation [47.435234391588494]
Recent text-to-image diffusion models excel at generating high-resolution images from text but struggle with precise control over spatial composition and object counting.
We propose a novel zero-shot L2I approach, BACON, which eliminates the need for additional modules or fine-tuning.
We leverage pixel-to-pixel correlations in the self-attention feature maps to align cross-attention maps and combine three loss functions constrained by boundary attention to update latent features.
arXiv Detail & Related papers (2024-11-15T05:44:45Z)
- DivCon: Divide and Conquer for Complex Numerical and Spatial Reasoning in Text-to-Image Generation [0.0]
Diffusion-driven text-to-image (T2I) generation has achieved remarkable advancements in recent years. Layout is employed as an intermediary to bridge large language models and layout-based diffusion models. We introduce a divide-and-conquer approach which decouples the generation task into multiple subtasks.
arXiv Detail & Related papers (2024-03-11T03:24:44Z)
- SSMG: Spatial-Semantic Map Guided Diffusion Model for Free-form Layout-to-Image Generation [68.42476385214785]
We propose a novel Spatial-Semantic Map Guided (SSMG) diffusion model that adopts the feature map, derived from the layout, as guidance.
SSMG achieves superior generation quality with sufficient spatial and semantic controllability compared to previous works.
We also propose the Relation-Sensitive Attention (RSA) and Location-Sensitive Attention (LSA) mechanisms.
arXiv Detail & Related papers (2023-08-20T04:09:12Z)
- SceneComposer: Any-Level Semantic Image Synthesis [80.55876413285587]
We propose a new framework for conditional image synthesis from semantic layouts of any precision levels.
The framework naturally reduces to text-to-image (T2I) at the lowest level with no shape information, and it becomes segmentation-to-image (S2I) at the highest level.
We introduce several novel techniques to address the challenges coming with this new setup.
arXiv Detail & Related papers (2022-11-21T18:59:05Z)