LayoutAgent: A Vision-Language Agent Guided Compositional Diffusion for Spatial Layout Planning
- URL: http://arxiv.org/abs/2509.22720v1
- Date: Wed, 24 Sep 2025 20:41:04 GMT
- Title: LayoutAgent: A Vision-Language Agent Guided Compositional Diffusion for Spatial Layout Planning
- Authors: Zezhong Fan, Xiaohan Li, Luyi Ma, Kai Zhao, Liang Peng, Topojoy Biswas, Evren Korpeoglu, Kaushiki Nag, Kannan Achan
- Abstract summary: Designing realistic multi-object scenes requires planning spatial layouts that respect semantic relations and physical plausibility. We propose LayoutAgent, an agentic framework that unifies vision-language reasoning with compositional diffusion for layout generation. Our method first employs a vision-language model to preprocess the inputs through segmentation, object size estimation, scene graph construction, and prompt rewriting. Finally, a foreground-conditioned image generator composes the complete scene by rendering the objects into the planned layout, guided by the designed prompts.
- Score: 18.207887244259897
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Designing realistic multi-object scenes requires not only generating images, but also planning spatial layouts that respect semantic relations and physical plausibility. On one hand, while recent advances in diffusion models have enabled high-quality image generation, they lack explicit spatial reasoning, leading to unrealistic object layouts. On the other hand, traditional spatial planning methods in robotics emphasize geometric and relational consistency, but they struggle to capture the semantic richness of visual scenes. To bridge this gap, we propose LayoutAgent, an agentic framework that unifies vision-language reasoning with compositional diffusion for layout generation. Given multiple input images containing target objects, our method first employs a vision-language model to preprocess the inputs through segmentation, object size estimation, scene graph construction, and prompt rewriting. We then leverage compositional diffusion, a method traditionally used in robotics, to synthesize spatial layouts as bounding boxes that respect the object relations encoded in the scene graph. Finally, a foreground-conditioned image generator composes the complete scene by rendering the objects into the planned layout, guided by the designed prompts. Experiments demonstrate that LayoutAgent outperforms other state-of-the-art layout generation models in layout coherence, spatial realism, and aesthetic alignment.
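To make the layout-planning step concrete, below is a minimal sketch of the core idea the abstract describes: reverse diffusion over bounding boxes, where per-relation models are composed by summing their corrections along the scene-graph edges. Everything in it is an illustrative assumption rather than the paper's published code: the `RelationScore` toy network, the (x, y, w, h) box parameterization, and the plain Euler update are stand-ins.

```python
import torch

# Hypothetical relation-conditioned denoiser: given two noisy boxes and a
# timestep, predict a correction for that pair. A toy stand-in, not
# LayoutAgent's actual architecture.
class RelationScore(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(9, 128), torch.nn.SiLU(),
            torch.nn.Linear(128, 8),
        )

    def forward(self, pair: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        # pair: (B, 8) = two concatenated (x, y, w, h) boxes; t: (B, 1)
        return self.net(torch.cat([pair, t], dim=-1))

@torch.no_grad()
def compose_layout(boxes, scene_graph, models, steps=50):
    """Reverse diffusion over an (N, 4) box tensor. The compositional step:
    for each scene-graph edge (i, j, relation), add that relation model's
    correction to both endpoint boxes."""
    for k in reversed(range(1, steps + 1)):
        t = torch.full((1, 1), k / steps)
        update = torch.zeros_like(boxes)
        for i, j, rel in scene_graph:              # e.g. (0, 1, "on_top_of")
            pair = torch.cat([boxes[i], boxes[j]]).unsqueeze(0)
            d = models[rel](pair, t).squeeze(0)
            update[i] += d[:4]
            update[j] += d[4:]
        boxes = boxes - update / steps             # simple Euler step
    return boxes

# Data-flow demo with untrained toy models:
models = {"on_top_of": RelationScore(), "left_of": RelationScore()}
graph = [(0, 1, "on_top_of"), (1, 2, "left_of")]
layout = compose_layout(torch.randn(3, 4), graph, models)
```

The compositional property is that each relation model only ever sees the pair of boxes it constrains, so a new scene graph recombines the same trained models without retraining; in the full pipeline, the VLM's scene graph feeds this step and the resulting boxes condition the image generator.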
Related papers
- RoamScene3D: Immersive Text-to-3D Scene Generation via Adaptive Object-aware Roaming [79.81527946524098]
RoamScene3D is a novel framework that bridges the gap between semantic guidance and spatial generation. We employ a vision-language model (VLM) to construct a scene graph that encodes object relations. To mitigate the limitations of static 2D priors, we introduce a Motion-Injected Inpainting model that is fine-tuned on a synthetic panoramic dataset.
arXiv Detail & Related papers (2026-01-27T10:10:55Z)
- HiScene: Creating Hierarchical 3D Scenes with Isometric View Generation [50.206100327643284]
HiScene is a novel hierarchical framework that bridges the gap between 2D image generation and 3D object generation. We generate 3D content that aligns with 2D representations while maintaining compositional structure.
arXiv Detail & Related papers (2025-04-17T16:33:39Z)
- Scene Graph Disentanglement and Composition for Generalizable Complex Image Generation [44.457347230146404]
We leverage the scene graph, a powerful structured representation, for complex image generation.
We employ the generative capabilities of variational autoencoders and diffusion models in a generalizable manner.
Our method outperforms recent competitors based on text, layout, or scene graph.
arXiv Detail & Related papers (2024-10-01T07:02:46Z)
- GroundingBooth: Grounding Text-to-Image Customization [17.185571339157075]
GroundingBooth achieves zero-shot, instance-level spatial grounding on both foreground subjects and background objects. Our proposed grounding module and subject-grounded cross-attention layer enable the creation of personalized images.
arXiv Detail & Related papers (2024-09-13T03:40:58Z)
- Move Anything with Layered Scene Diffusion [77.45870343845492]
We propose SceneDiffusion to optimize a layered scene representation during the diffusion sampling process.
Our key insight is that spatial disentanglement can be obtained by jointly denoising scene renderings at different spatial layouts.
Our generated scenes support a wide range of spatial editing operations, including moving, resizing, cloning, and layer-wise appearance editing. A toy sketch of the joint-denoising idea follows this entry.
arXiv Detail & Related papers (2024-04-10T17:28:16Z)
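The joint-denoising insight quoted above can be illustrated with a toy: shared per-object latents are rendered into several layouts (here, simple slot permutations), denoised in parallel, and each object's corrections are averaged across layouts, so appearance stays tied to the object rather than to its position. The denoiser, latent sizes, and rendering below are all toy assumptions, not SceneDiffusion's implementation.

```python
import torch

# Untrained toy denoiser over a flattened 4-slot scene latent (4 x 16 = 64).
denoiser = torch.nn.Linear(64, 64)

def render(obj_latents, layout):
    """Place shared per-object latents into layout-specific slots (toy)."""
    scene = torch.zeros(len(layout), 16)
    for slot, obj in enumerate(layout):   # layout: tuple of object indices
        scene[slot] = obj_latents[obj]
    return scene.flatten()

@torch.no_grad()
def joint_denoise(obj_latents, layouts, steps=50):
    """Denoise renderings of the same objects at several layouts jointly;
    averaging each object's correction across layouts ties its appearance
    to the object, not to any one position."""
    for _ in range(steps):
        update = torch.zeros_like(obj_latents)
        for layout in layouts:
            delta = denoiser(render(obj_latents, layout)).view(len(layout), 16)
            for slot, obj in enumerate(layout):
                update[obj] += delta[slot]
        obj_latents = obj_latents - update / (len(layouts) * steps)
    return obj_latents

# Four objects, two alternative placements of the same objects:
objs = joint_denoise(torch.randn(4, 16), [(0, 1, 2, 3), (2, 0, 3, 1)])
```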
- LoCo: Locally Constrained Training-Free Layout-to-Image Synthesis [24.925757148750684]
We propose a training-free approach for layout-to-image synthesis that excels at producing high-quality images aligned with both textual prompts and layout instructions.
LoCo seamlessly integrates into existing text-to-image and layout-to-image models, enhancing their performance in spatial control and addressing semantic failures observed in prior methods.
arXiv Detail & Related papers (2023-11-21T04:28:12Z)
- LAW-Diffusion: Complex Scene Generation by Diffusion with Layouts [107.11267074981905]
We propose a semantically controllable layout-AWare diffusion model, termed LAW-Diffusion.
We show that LAW-Diffusion achieves state-of-the-art generative performance, especially with coherent object relations.
arXiv Detail & Related papers (2023-08-13T08:06:18Z)
- Zero-shot spatial layout conditioning for text-to-image diffusion models [52.24744018240424]
Large-scale text-to-image diffusion models have significantly improved the state of the art in generative image modelling.
We consider image generation from text associated with segments on the image canvas, which combines an intuitive natural language interface with precise spatial control over the generated content.
We propose ZestGuide, a zero-shot segmentation guidance approach that can be plugged into pre-trained text-to-image diffusion models.
arXiv Detail & Related papers (2023-06-23T19:24:48Z)
- Geometry Aligned Variational Transformer for Image-conditioned Layout Generation [38.747175229902396]
We propose an Image-Conditioned Variational Transformer (ICVT) that autoregressively generates various layouts in an image.
A self-attention mechanism models the contextual relationships among layout elements, while a cross-attention mechanism fuses in the visual information of the conditional images (a toy sketch of this attention pattern follows this entry).
We construct a large-scale advertisement poster layout designing dataset with delicate layout and saliency map annotations.
arXiv Detail & Related papers (2022-09-02T07:19:12Z)
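The attention pattern described for ICVT above is standard enough to sketch. This block is a generic transformer decoder layer, not the paper's code; the dimensions, norm placement, and names are assumptions.

```python
import torch
from torch import nn

# Generic decoder block mirroring the described pattern: self-attention over
# layout-element tokens, cross-attention into conditional-image features.
class LayoutDecoderBlock(nn.Module):
    def __init__(self, d=256, heads=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))
        self.n1, self.n2, self.n3 = nn.LayerNorm(d), nn.LayerNorm(d), nn.LayerNorm(d)

    def forward(self, layout_tokens, image_tokens):
        # layout_tokens: (B, N, d) embedded layout elements
        # image_tokens:  (B, M, d) visual features of the conditional image
        x = layout_tokens
        h = self.n1(x)
        x = x + self.self_attn(h, h, h)[0]            # context among elements
        x = x + self.cross_attn(self.n2(x), image_tokens, image_tokens)[0]
        return x + self.ff(self.n3(x))

# Shape check: 5 layout tokens attending over 49 image patches.
out = LayoutDecoderBlock()(torch.randn(2, 5, 256), torch.randn(2, 49, 256))
```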
- Interactive Image Synthesis with Panoptic Layout Generation [14.1026819862002]
We propose Panoptic Layout Generative Adversarial Networks (PLGAN) to address this challenge.
PLGAN employs panoptic theory, which divides object categories into "stuff" with amorphous boundaries and "things" with well-defined shapes.
We experimentally compare our PLGAN with state-of-the-art layout-based models on the COCO-Stuff, Visual Genome, and Landscape datasets.
arXiv Detail & Related papers (2022-03-04T02:45:27Z)
- Learned Spatial Representations for Few-shot Talking-Head Synthesis [68.3787368024951]
We propose a novel approach for few-shot talking-head synthesis.
We show that the learned disentangled representation leads to a significant improvement over previous methods.
arXiv Detail & Related papers (2021-04-29T17:59:42Z)
- Object-Centric Image Generation from Layouts [93.10217725729468]
We develop a layout-to-image-generation method to generate complex scenes with multiple objects.
Our method learns representations of the spatial relationships between objects in the scene, which lead to improved layout fidelity.
We introduce SceneFID, an object-centric adaptation of the popular Fréchet Inception Distance metric, which is better suited for multi-object images (a sketch follows this entry).
arXiv Detail & Related papers (2020-03-16T21:40:09Z)
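SceneFID is described above only as an object-centric adaptation of FID, so the following sketch is one plausible reading under stated assumptions: fit Gaussians to features of per-object crops rather than whole images. The `embed` callback, axis-aligned pixel-box format, and crop logic are illustrative, not the paper's definition.

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(feats_a: np.ndarray, feats_b: np.ndarray) -> float:
    """Frechet distance between Gaussians fitted to two (n, d) feature sets."""
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    covmean = sqrtm(cov_a @ cov_b)
    if np.iscomplexobj(covmean):   # sqrtm can return tiny imaginary parts
        covmean = covmean.real
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a + cov_b - 2.0 * covmean))

def scene_fid(real_imgs, fake_imgs, real_boxes, fake_boxes, embed):
    """Object-centric FID: compare feature statistics of object crops
    instead of full images. `embed` maps an image crop (H, W, C) to a
    1-D feature vector, e.g. Inception pool features."""
    def crop_feats(images, boxes_per_image):
        feats = []
        for img, boxes in zip(images, boxes_per_image):
            for x0, y0, x1, y1 in boxes:   # axis-aligned pixel boxes
                feats.append(embed(img[y0:y1, x0:x1]))
        return np.stack(feats)
    return frechet_distance(crop_feats(real_imgs, real_boxes),
                            crop_feats(fake_imgs, fake_boxes))
```

Cropping before embedding makes the score sensitive to per-object quality in cluttered scenes, which whole-image FID tends to average away.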