Be Decisive: Noise-Induced Layouts for Multi-Subject Generation
- URL: http://arxiv.org/abs/2505.21488v1
- Date: Tue, 27 May 2025 17:54:24 GMT
- Title: Be Decisive: Noise-Induced Layouts for Multi-Subject Generation
- Authors: Omer Dahary, Yehonathan Cohen, Or Patashnik, Kfir Aberman, Daniel Cohen-Or
- Abstract summary: Complex prompts lead to subject leakage, causing inaccuracies in quantities, attributes, and visual features. We introduce a new approach that predicts a spatial layout aligned with the prompt, derived from the initial noise, and refines it throughout the denoising process. Our method employs a small neural network to predict and refine the evolving noise-induced layout at each denoising step.
- Score: 56.80513553424086
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Generating multiple distinct subjects remains a challenge for existing text-to-image diffusion models. Complex prompts often lead to subject leakage, causing inaccuracies in quantities, attributes, and visual features. Preventing leakage among subjects necessitates knowledge of each subject's spatial location. Recent methods provide these spatial locations via an external layout control. However, enforcing such a prescribed layout often conflicts with the innate layout dictated by the sampled initial noise, leading to misalignment with the model's prior. In this work, we introduce a new approach that predicts a spatial layout aligned with the prompt, derived from the initial noise, and refines it throughout the denoising process. By relying on this noise-induced layout, we avoid conflicts with externally imposed layouts and better preserve the model's prior. Our method employs a small neural network to predict and refine the evolving noise-induced layout at each denoising step, ensuring clear boundaries between subjects while maintaining consistency. Experimental results show that this noise-aligned strategy achieves improved text-image alignment and more stable multi-subject generation compared to existing layout-guided techniques, while preserving the rich diversity of the model's original distribution.
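The sketch below illustrates the core idea described in the abstract: a small network predicts a per-subject layout from the evolving noisy latent and refines it at every denoising step. This is a hypothetical illustration, not the authors' released code; it assumes a diffusers-style UNet/scheduler API, and the names `LayoutPredictor` and `denoise_with_noise_induced_layout` are placeholders. Blending per-subject noise predictions by the predicted masks is only a stand-in for however the paper actually enforces separation between subjects (e.g., at the attention level).

```python
# Hypothetical sketch of a noise-induced layout loop (assumptions noted above).
import torch
import torch.nn as nn


class LayoutPredictor(nn.Module):
    """Small conv net: noisy latent (B, C, H, W) -> soft per-subject masks (B, S, H, W)."""

    def __init__(self, latent_channels: int = 4, num_subjects: int = 3, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(latent_channels, hidden, 3, padding=1),
            nn.SiLU(),
            nn.Conv2d(hidden, hidden, 3, padding=1),
            nn.SiLU(),
            nn.Conv2d(hidden, num_subjects, 1),
        )

    def forward(self, latent: torch.Tensor) -> torch.Tensor:
        # Softmax over the subject dimension yields a soft partition of the image,
        # i.e. the "noise-induced layout" for the current denoising step.
        return self.net(latent).softmax(dim=1)


@torch.no_grad()
def denoise_with_noise_induced_layout(unet, scheduler, prompt_embeds, subject_embeds, latent, layout_net):
    """Toy denoising loop: re-predict the layout from the current latent at each step
    and use it to combine per-subject noise predictions with a global-prompt pass."""
    for t in scheduler.timesteps:
        masks = layout_net(latent)  # (B, S, H, W), refined every step
        # One denoising pass per subject prompt, blended by the predicted layout.
        eps_subjects = torch.stack(
            [unet(latent, t, encoder_hidden_states=e).sample for e in subject_embeds], dim=1
        )  # (B, S, C, H, W)
        eps = (masks.unsqueeze(2) * eps_subjects).sum(dim=1)
        # A pass with the full prompt keeps the overall composition coherent.
        eps = 0.5 * eps + 0.5 * unet(latent, t, encoder_hidden_states=prompt_embeds).sample
        latent = scheduler.step(eps, t, latent).prev_sample
    return latent
```

Because the layout is read off the sampled noise rather than imposed externally, the masks can only sharpen what the latent already suggests, which is how the approach avoids fighting the model's prior.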
Related papers
- Spatial Reasoners for Continuous Variables in Any Domain [49.83744014336816]
We present a framework to perform spatial reasoning over continuous variables with generative denoising models. We provide interfaces to control variable mapping from arbitrary data domains, generative model paradigms, and inference strategies.
arXiv Detail & Related papers (2025-07-14T19:46:54Z)
- Denoising as Adaptation: Noise-Space Domain Adaptation for Image Restoration [64.84134880709625]
We show that it is possible to perform domain adaptation via the noise space using diffusion models. In particular, by leveraging the unique property of how auxiliary conditional inputs influence the multi-step denoising process, we derive a meaningful diffusion loss. We present crucial strategies such as a channel-shuffling layer and residual-swapping contrastive learning in the diffusion model.
arXiv Detail & Related papers (2024-06-26T17:40:30Z)
- The Crystal Ball Hypothesis in diffusion models: Anticipating object positions from initial noise [92.53724347718173]
Diffusion models have achieved remarkable success in text-to-image generation tasks.
We identify specific regions within the initial noise image, termed trigger patches, that play a key role for object generation in the resulting images.
arXiv Detail & Related papers (2024-06-04T05:06:00Z)
- InitNO: Boosting Text-to-Image Diffusion Models via Initial Noise Optimization [27.508861002013358]
InitNO is a paradigm that refines the initial noise so that the resulting images are semantically faithful to the prompt.
A strategically crafted noise optimization pipeline is developed to guide the initial noise towards valid regions.
Our method, validated through rigorous experimentation, shows a commendable proficiency in generating images in strict accordance with text prompts.
arXiv Detail & Related papers (2024-04-06T14:56:59Z)
- Be Yourself: Bounded Attention for Multi-Subject Text-to-Image Generation [60.943159830780154]
We introduce Bounded Attention, a training-free method for bounding the information flow in the sampling process.
We demonstrate that our method empowers the generation of multiple subjects that better align with given prompts and layouts.
arXiv Detail & Related papers (2024-03-25T17:52:07Z)
- Spatial-Aware Latent Initialization for Controllable Image Generation [9.23227552726271]
Text-to-image diffusion models have demonstrated impressive ability to generate high-quality images conditioned on the textual input.
Previous research has primarily focused on aligning cross-attention maps with layout conditions.
We propose leveraging a spatial-aware initialization noise during the denoising process to achieve better layout control.
arXiv Detail & Related papers (2024-01-29T13:42:01Z)
- Denoising Diffusion Semantic Segmentation with Mask Prior Modeling [61.73352242029671]
We propose to ameliorate the semantic segmentation quality of existing discriminative approaches with a mask prior modeled by a denoising diffusion generative model.
We evaluate the proposed prior modeling with several off-the-shelf segmentors, and our experimental results on ADE20K and Cityscapes demonstrate that our approach achieves competitive quantitative performance.
arXiv Detail & Related papers (2023-06-02T17:47:01Z)
- Empowering Diffusion Models on the Embedding Space for Text Generation [38.664533078347304]
We study the optimization challenges encountered with both the embedding space and the denoising model.
The data distribution of the embeddings is itself learnable, which may lead to collapse of the embedding space and unstable training.
Based on the above analysis, we propose Difformer, an embedding diffusion model based on Transformer.
arXiv Detail & Related papers (2022-12-19T12:44:25Z)