Related papers: Obtaining Favorable Layouts for Multiple Object Generation

URL: http://arxiv.org/abs/2405.00791v1
Date: Wed, 1 May 2024 18:07:48 GMT
Title: Obtaining Favorable Layouts for Multiple Object Generation
Authors: Barak Battash, Amit Rozner, Lior Wolf, Ofir Lindenbaum
Abstract summary: Large-scale text-to-image models can generate high-quality and diverse images based on textual prompts. However, the existing state-of-the-art diffusion models face difficulty when generating images that involve multiple subjects. We propose a novel approach based on a guiding principle. We allow the diffusion model to initially propose a layout, and then we rearrange the layout grid. This is achieved by enforcing cross-attention maps (XAMs) to adhere to proposed masks and by migrating pixels from latent maps to new locations determined by us.
Score: 50.616875565173274
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large-scale text-to-image models that can generate high-quality and diverse images based on textual prompts have shown remarkable success. These models aim ultimately to create complex scenes, and addressing the challenge of multi-subject generation is a critical step towards this goal. However, the existing state-of-the-art diffusion models face difficulty when generating images that involve multiple subjects. When presented with a prompt containing more than one subject, these models may omit some subjects or merge them together. To address this challenge, we propose a novel approach based on a guiding principle. We allow the diffusion model to initially propose a layout, and then we rearrange the layout grid. This is achieved by enforcing cross-attention maps (XAMs) to adhere to proposed masks and by migrating pixels from latent maps to new locations determined by us. We introduce new loss terms aimed at reducing XAM entropy for clearer spatial definition of subjects, reducing the overlap between XAMs, and ensuring that XAMs align with their respective masks. We contrast our approach with several alternative methods and show that it more faithfully captures the desired concepts across a variety of text prompts.
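
The three loss terms named in the abstract (XAM entropy, overlap between XAMs, and alignment of XAMs with their masks) can be illustrated with a minimal sketch. The formulas below are assumptions chosen for illustration, not the paper's definitions: entropy is taken as Shannon entropy of the normalized map, overlap as the sum of element-wise minima, and mask alignment as the attention mass outside a binary mask. The function names and the toy 2x2 maps are hypothetical.

```python
import math

def normalize(xam):
    """Scale a non-negative 2D attention map so its entries sum to 1."""
    total = sum(sum(row) for row in xam)
    return [[v / total for v in row] for row in xam]

def entropy_loss(xam):
    """Assumed form: Shannon entropy of the normalized map.
    Lower values mean the attention is concentrated in fewer cells."""
    p = normalize(xam)
    return -sum(v * math.log(v) for row in p for v in row if v > 0)

def overlap_loss(xam_a, xam_b):
    """Assumed form: sum of element-wise minima of two normalized maps.
    Zero when the two subjects attend to disjoint cells."""
    pa, pb = normalize(xam_a), normalize(xam_b)
    return sum(min(a, b) for ra, rb in zip(pa, pb) for a, b in zip(ra, rb))

def mask_loss(xam, mask):
    """Assumed form: share of attention mass that falls outside a binary mask."""
    p = normalize(xam)
    return sum(v for rp, rm in zip(p, mask) for v, m in zip(rp, rm) if not m)

# Toy 2x2 maps for two hypothetical subject tokens.
cat = [[0.9, 0.1], [0.0, 0.0]]
dog = [[0.0, 0.0], [0.1, 0.9]]
cat_mask = [[1, 1], [0, 0]]
uniform = [[1.0, 1.0], [1.0, 1.0]]

total_loss = entropy_loss(cat) + overlap_loss(cat, dog) + mask_loss(cat, cat_mask)
```

In this toy case the two maps occupy different rows, so the overlap and mask terms vanish and only the entropy term contributes; a uniform map would score the maximum entropy for a 2x2 grid. How the actual method weights and applies such terms during sampling is not stated in the abstract.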

Related papers

CubeDiff: Repurposing Diffusion-Based Image Models for Panorama Generation [59.257513664564996]
We introduce a novel method for generating 360° panoramas from text prompts or images. We employ multi-view diffusion models to jointly synthesize the six faces of a cubemap. Our model allows for fine-grained text control, generates high resolution panorama images and generalizes well beyond its training set.
arXiv Detail & Related papers (2025-01-28T18:59:49Z)
Nested Attention: Semantic-aware Attention Values for Concept Personalization [78.90196530697897]
We introduce Nested Attention, a novel mechanism that injects a rich and expressive image representation into the model's existing cross-attention layers. Our key idea is to generate query-dependent subject values, derived from nested attention layers that learn to select relevant subject features for each region in the generated image.
arXiv Detail & Related papers (2025-01-02T18:52:11Z)
Scene Graph Disentanglement and Composition for Generalizable Complex Image Generation [44.457347230146404]
We leverage the scene graph, a powerful structured representation, for complex image generation. We employ the generative capabilities of variational autoencoders and diffusion models in a generalizable manner. Our method outperforms recent competitors based on text, layout, or scene graph.
arXiv Detail & Related papers (2024-10-01T07:02:46Z)
MS-Diffusion: Multi-subject Zero-shot Image Personalization with Layout Guidance [6.4680449907623006]
This research introduces the MS-Diffusion framework for layout-guided zero-shot image personalization with multi-subjects. The proposed multi-subject cross-attention orchestrates inter-subject compositions while preserving the control of texts.
arXiv Detail & Related papers (2024-06-11T12:32:53Z)
Many-to-many Image Generation with Auto-regressive Diffusion Models [59.5041405824704]
This paper introduces a domain-general framework for many-to-many image generation, capable of producing interrelated image series from a given set of images. We present MIS, a novel large-scale multi-image dataset, containing 12M synthetic multi-image samples, each with 25 interconnected images. We learn M2M, an autoregressive model for many-to-many generation, where each image is modeled within a diffusion framework.
arXiv Detail & Related papers (2024-04-03T23:20:40Z)
Be Yourself: Bounded Attention for Multi-Subject Text-to-Image Generation [60.943159830780154]
We introduce Bounded Attention, a training-free method for bounding the information flow in the sampling process. We demonstrate that our method empowers the generation of multiple subjects that better align with given prompts and layouts.
arXiv Detail & Related papers (2024-03-25T17:52:07Z)
DivCon: Divide and Conquer for Progressive Text-to-Image Generation [0.0]
Diffusion-driven text-to-image (T2I) generation has achieved remarkable advancements. Layout is employed as an intermedium to bridge large language models and layout-based diffusion models. We introduce a divide-and-conquer approach which decouples the T2I generation task into simple subtasks.
arXiv Detail & Related papers (2024-03-11T03:24:44Z)
Pick-and-Draw: Training-free Semantic Guidance for Text-to-Image Personalization [56.12990759116612]
Pick-and-Draw is a training-free semantic guidance approach to boost identity consistency and generative diversity for personalization methods. The proposed approach can be applied to any personalized diffusion models and requires as few as a single reference image.
arXiv Detail & Related papers (2024-01-30T05:56:12Z)
LayoutLLM-T2I: Eliciting Layout Guidance from LLM for Text-to-Image Generation [121.45667242282721]
We propose a coarse-to-fine paradigm to achieve layout planning and image generation. Our proposed method outperforms the state-of-the-art models in terms of photorealistic layout and image generation.
arXiv Detail & Related papers (2023-08-09T17:45:04Z)
Break-A-Scene: Extracting Multiple Concepts from a Single Image [80.47666266017207]
We introduce the task of textual scene decomposition. We propose augmenting the input image with masks that indicate the presence of target concepts. We then present a novel two-phase customization process.
arXiv Detail & Related papers (2023-05-25T17:59:04Z)
Blended Latent Diffusion [18.043090347648157]
We present an accelerated solution to the task of local text-driven editing of generic images, where the desired edits are confined to a user-provided mask. Our solution leverages a recent text-to-image Latent Diffusion Model (LDM), which speeds up diffusion by operating in a lower-dimensional latent space.
arXiv Detail & Related papers (2022-06-06T17:58:04Z)

This list is automatically generated from the titles and abstracts of the papers on this site.