MALeR: Improving Compositional Fidelity in Layout-Guided Generation
- URL: http://arxiv.org/abs/2511.06002v1
- Date: Sat, 08 Nov 2025 13:16:19 GMT
- Title: MALeR: Improving Compositional Fidelity in Layout-Guided Generation
- Authors: Shivank Saxena, Dhruv Srivastava, Makarand Tapaswi
- Abstract summary: We present MALeR, a layout-guided method for generating images with multiple subjects and attributes. Our method achieves superior performance in compositional accuracy, generation consistency, and attribute binding compared to previous work.
- Score: 18.038208275014917
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Recent advances in text-to-image models have enabled a new era of creative and controllable image generation. However, generating compositional scenes with multiple subjects and attributes remains a significant challenge. To enhance user control over subject placement, several layout-guided methods have been proposed. However, these methods face numerous challenges, particularly in compositional scenes. Unintended subjects often appear outside the layouts, generated images can be out-of-distribution and contain unnatural artifacts, or attributes bleed across subjects, leading to incorrect visual outputs. In this work, we propose MALeR, a method that addresses each of these challenges. Given a text prompt and corresponding layouts, our method prevents subjects from appearing outside the given layouts while being in-distribution. Additionally, we propose a masked, attribute-aware binding mechanism that prevents attribute leakage, enabling accurate rendering of subjects with multiple attributes, even in complex compositional scenes. Qualitative and quantitative evaluation demonstrates that our method achieves superior performance in compositional accuracy, generation consistency, and attribute binding compared to previous work. MALeR is particularly adept at generating images of scenes with multiple subjects and multiple attributes per subject.
Related papers
- VSC: Visual Search Compositional Text-to-Image Diffusion Model [15.682990658945682]
We introduce a novel compositional generation method that leverages pairwise image embeddings to improve attribute-object binding. Our approach decomposes complex prompts into sub-prompts, generates corresponding images, and computes visual prototypes that fuse with text embeddings to enhance representation. Our approaches outperform existing compositional text-to-image diffusion models on the benchmark T2I CompBench, achieving better image quality, evaluated by humans, and emerging robustness under scaling number of binding pairs in the prompt.
arXiv Detail & Related papers (2025-05-02T08:31:43Z)
- Object-level Visual Prompts for Compositional Image Generation [75.6085388740087]
We introduce a method for composing object-level visual prompts within a text-to-image diffusion model. A key challenge in this task is to preserve the identity of the objects depicted in the input visual prompts. We introduce a new KV-mixed cross-attention mechanism, in which keys and values are learned from distinct visual representations.
arXiv Detail & Related papers (2025-01-02T18:59:44Z)
- Nested Attention: Semantic-aware Attention Values for Concept Personalization [78.90196530697897]
We introduce Nested Attention, a novel mechanism that injects a rich and expressive image representation into the model's existing cross-attention layers. Our key idea is to generate query-dependent subject values, derived from nested attention layers that learn to select relevant subject features for each region in the generated image.
arXiv Detail & Related papers (2025-01-02T18:52:11Z)
- Scene Graph Disentanglement and Composition for Generalizable Complex Image Generation [44.457347230146404]
We leverage the scene graph, a powerful structured representation, for complex image generation.
We employ the generative capabilities of variational autoencoders and diffusion models in a generalizable manner.
Our method outperforms recent competitors based on text, layout, or scene graph.
arXiv Detail & Related papers (2024-10-01T07:02:46Z)
- MS-Diffusion: Multi-subject Zero-shot Image Personalization with Layout Guidance [5.452759083801634]
This research introduces the MS-Diffusion framework for layout-guided zero-shot image personalization with multi-subjects. The proposed multi-subject cross-attention orchestrates inter-subject compositions while preserving the control of texts.
arXiv Detail & Related papers (2024-06-11T12:32:53Z)
- Training-free Subject-Enhanced Attention Guidance for Compositional Text-to-image Generation [22.949365270116335]
We propose a subject-driven generation framework and introduce training-free guidance to intervene in the generative process during inference time.
Notably, our method exhibits exceptional zero-shot generation ability, especially in the challenging task of compositional generation.
arXiv Detail & Related papers (2024-05-11T08:11:25Z)
- Be Yourself: Bounded Attention for Multi-Subject Text-to-Image Generation [60.943159830780154]
We introduce Bounded Attention, a training-free method for bounding the information flow in the sampling process.
We demonstrate that our method empowers the generation of multiple subjects that better align with given prompts and layouts.
arXiv Detail & Related papers (2024-03-25T17:52:07Z)
- Decoupled Textual Embeddings for Customized Image Generation [62.98933630971543]
Customized text-to-image generation aims to learn user-specified concepts with a few images.
Existing methods usually suffer from overfitting issues and entangle the subject-unrelated information with the learned concept.
We propose the DETEX, a novel approach that learns the disentangled concept embedding for flexible customized text-to-image generation.
arXiv Detail & Related papers (2023-12-19T03:32:10Z)
- Cones 2: Customizable Image Synthesis with Multiple Subjects [50.54010141032032]
We study how to efficiently represent a particular subject as well as how to appropriately compose different subjects.
By rectifying the activations in the cross-attention map, the layout appoints and separates the location of different subjects in the image.
arXiv Detail & Related papers (2023-05-30T18:00:06Z)
- ManiCLIP: Multi-Attribute Face Manipulation from Text [104.30600573306991]
We present a novel multi-attribute face manipulation method based on textual descriptions.
Our method generates natural manipulated faces with minimal text-irrelevant attribute editing.
arXiv Detail & Related papers (2022-10-02T07:22:55Z)
This list is automatically generated from the titles and abstracts of the papers in this site.