PlanGen: Towards Unified Layout Planning and Image Generation in Auto-Regressive Vision Language Models
- URL: http://arxiv.org/abs/2503.10127v2
- Date: Sun, 30 Mar 2025 08:24:33 GMT
- Title: PlanGen: Towards Unified Layout Planning and Image Generation in Auto-Regressive Vision Language Models
- Authors: Runze He, Bo Cheng, Yuhang Ma, Qingxiang Jia, Shanyuan Liu, Ao Ma, Xiaoyu Wu, Liebucha Wu, Dawei Leng, Yuhui Yin
- Abstract summary: We propose a unified layout planning and image generation model, PlanGen, which can pre-plan spatial layout conditions before generating images. PlanGen integrates layout conditions into the model as context without requiring specialized encoding of local captions and bounding box coordinates. In addition, PlanGen can be seamlessly extended to layout-guided image manipulation thanks to its well-designed modeling.
- Score: 10.341382572198254
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we propose a unified layout planning and image generation model, PlanGen, which can pre-plan spatial layout conditions before generating images. Unlike previous diffusion-based models that treat layout planning and layout-to-image generation as two separate models, PlanGen jointly models the two tasks in one autoregressive transformer using only next-token prediction. PlanGen integrates layout conditions into the model as context without requiring specialized encoding of local captions and bounding box coordinates, which provides significant advantages over the previous embed-and-pool operations on layout conditions, particularly when dealing with complex layouts. Unified prompting allows PlanGen to perform multi-task training related to layout, including layout planning, layout-to-image generation, and image layout understanding. In addition, PlanGen can be seamlessly extended to layout-guided image manipulation thanks to its well-designed modeling, using a teacher-forcing content-manipulation policy and negative layout guidance. Extensive experiments verify the effectiveness of PlanGen on multiple layout-related tasks, showing its great potential. Code is available at: https://360cvgroup.github.io/PlanGen.
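To make the abstract's central idea concrete, here is a minimal, hypothetical sketch of how layout conditions (local captions plus bounding boxes) could be serialized as plain-text context for a single next-token-prediction model; the tag names and coordinate convention below are our assumptions, not PlanGen's published format.

```python
# Hypothetical sketch of PlanGen-style unified prompting: layout conditions
# (local captions + bounding boxes) are serialized as ordinary text, so one
# autoregressive model handles planning and generation by next-token
# prediction alone. Tags and the [0, 1000) integer coordinates are assumed.

def serialize_layout(entities):
    """Render (caption, bbox) pairs as plain text tokens."""
    parts = []
    for caption, (x0, y0, x1, y1) in entities:
        parts.append(f"<obj>{caption}<box>({x0},{y0}),({x1},{y1})</box></obj>")
    return "".join(parts)

def build_sequence(global_caption, entities, task):
    layout = serialize_layout(entities)
    if task == "layout_planning":      # model must generate the layout tokens
        return f"<plan>{global_caption}</plan>", layout
    if task == "layout_to_image":      # layout is context; image tokens follow
        return f"<gen>{global_caption}{layout}</gen>", "<image tokens...>"
    raise ValueError(task)

prompt, target = build_sequence(
    "a cat sleeping on a red sofa",
    [("a cat", (120, 300, 520, 700)), ("a red sofa", (40, 250, 960, 900))],
    task="layout_planning",
)
print(prompt, "->", target)
```

Because the layout is just text in the sequence, planning (generating the layout) and layout-to-image (consuming it as context) differ only in where the prediction target starts.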
Related papers
- CreatiLayout: Siamese Multimodal Diffusion Transformer for Creative Layout-to-Image Generation [75.01950130227996]
Diffusion models have been recognized for their ability to generate images that are not only visually appealing but also of high artistic quality. Previous methods primarily focus on UNet-based models (e.g., SD1.5 and SDXL), and little effort has been devoted to Multimodal Diffusion Transformers (MM-DiTs). Inheriting the advantages of MM-DiT, we use a separate set of network weights to process the image and text modalities. We contribute a large-scale layout dataset, named LayoutSAM, which includes 2.7 million image-text pairs and 10.7 million entities.
arXiv Detail & Related papers (2024-12-05T04:09:47Z)
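The CreatiLayout summary hinges on MM-DiT's use of separate weights per modality combined with joint attention; below is a minimal single-head sketch of that wiring (dimensions illustrative, not the paper's implementation).

```python
# Sketch (not CreatiLayout's code) of the MM-DiT idea: image and text tokens
# keep separate projection weights but attend jointly in one attention op.
import torch

class DualStreamAttention(torch.nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        # Separate projections per modality; the attention itself is shared.
        self.qkv_img = torch.nn.Linear(dim, dim * 3)
        self.qkv_txt = torch.nn.Linear(dim, dim * 3)
        self.out_img = torch.nn.Linear(dim, dim)
        self.out_txt = torch.nn.Linear(dim, dim)

    def forward(self, img, txt):
        n_img = img.shape[1]
        q_i, k_i, v_i = self.qkv_img(img).chunk(3, dim=-1)
        q_t, k_t, v_t = self.qkv_txt(txt).chunk(3, dim=-1)
        # Concatenate streams so every token attends across both modalities.
        q = torch.cat([q_i, q_t], dim=1)
        k = torch.cat([k_i, k_t], dim=1)
        v = torch.cat([v_i, v_t], dim=1)
        attn = torch.softmax(q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5, dim=-1)
        out = attn @ v
        return self.out_img(out[:, :n_img]), self.out_txt(out[:, n_img:])

img, txt = torch.randn(1, 16, 64), torch.randn(1, 8, 64)
print([t.shape for t in DualStreamAttention()(img, txt)])
```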
- PosterLLaVa: Constructing a Unified Multi-modal Layout Generator with LLM [58.67882997399021]
Our research introduces a unified framework for automated graphic layout generation.
Our data-driven method employs structured text (JSON format) and visual instruction tuning to generate layouts.
We develop an automated text-to-poster system that generates editable posters based on users' design intentions.
arXiv Detail & Related papers (2024-06-05T03:05:52Z)
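The PosterLLaVa entry describes representing layouts as structured text in JSON so an instruction-tuned multimodal LLM can read and emit them directly; this toy sketch shows the round trip, with field names that are our assumptions rather than the paper's schema.

```python
# Sketch of the structured-text idea in PosterLLaVa: a layout expressed as
# JSON doubles as training target and model output. Field names are assumed.
import json

layout = {
    "canvas": {"width": 1024, "height": 1536},
    "elements": [
        {"type": "title", "box": [64, 80, 960, 220]},
        {"type": "image", "box": [64, 260, 960, 1100]},
        {"type": "text",  "box": [64, 1140, 960, 1400]},
    ],
}

target_text = json.dumps(layout)     # what the tuned LLM is trained to emit
decoded = json.loads(target_text)    # parsing model output back into a layout
assert decoded["elements"][0]["type"] == "title"
print(target_text)
```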
- LayoutGPT: Compositional Visual Planning and Generation with Large Language Models [98.81962282674151]
Large Language Models (LLMs) can serve as visual planners by generating layouts from text conditions.
We propose LayoutGPT, a method to compose in-context visual demonstrations in style sheet language.
arXiv Detail & Related papers (2023-05-24T17:56:16Z)
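LayoutGPT's key move is serializing layouts in a CSS-like style-sheet format and prepending serialized examples as in-context demonstrations; the helper below is a rough illustration of such a prompt builder, not the paper's exact template.

```python
# Rough sketch of a LayoutGPT-style prompt: each object becomes a CSS-like
# rule with width/height/left/top, and serialized demonstrations precede the
# query for in-context learning. Helper names and pixel values are ours.
def to_css(name, box):
    x, y, w, h = box
    return (f"{name} {{ width: {w}px; height: {h}px; "
            f"left: {x}px; top: {y}px; }}")

def build_prompt(demos, query_caption):
    lines = []
    for caption, objects in demos:           # in-context demonstrations
        lines.append(f"Prompt: {caption}\nLayout:")
        lines += [to_css(n, b) for n, b in objects]
    lines.append(f"Prompt: {query_caption}\nLayout:")  # the LLM completes this
    return "\n".join(lines)

demo = [("two dogs on a beach",
         [("dog", (10, 40, 30, 25)), ("dog", (55, 45, 30, 25))])]
print(build_prompt(demo, "three apples on a table"))
```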
- Diagnostic Benchmark and Iterative Inpainting for Layout-Guided Image Generation [147.81509219686419]
We propose a diagnostic benchmark for layout-guided image generation that examines four categories of spatial control skills: number, position, size, and shape.
Next, we propose IterInpaint, a new baseline that generates foreground and background regions step-by-step via inpainting.
We show comprehensive ablation studies on IterInpaint, including training task ratio, crop&paste vs. repaint, and generation order.
arXiv Detail & Related papers (2023-04-13T16:58:33Z)
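The IterInpaint entry describes generating foreground regions step by step and then the background via inpainting; this control-loop sketch shows the idea, with `inpaint` as a placeholder for a real layout-conditioned inpainting model rather than the paper's API.

```python
# Sketch of an IterInpaint-style control loop: objects are generated one at
# a time by inpainting their boxes, then the remaining area is filled as
# background. `inpaint` is a placeholder for a diffusion inpainting model.
from PIL import Image, ImageDraw

def inpaint(image, mask, prompt):
    # Placeholder: a real system would call an inpainting model here.
    return image

def iter_inpaint(canvas_size, layout):
    image = Image.new("RGB", canvas_size, "gray")
    for caption, box in layout:                 # foreground, step by step
        mask = Image.new("L", canvas_size, 0)
        ImageDraw.Draw(mask).rectangle(box, fill=255)
        image = inpaint(image, mask, caption)
    bg_mask = Image.new("L", canvas_size, 255)  # whatever is left = background
    for _, box in layout:
        ImageDraw.Draw(bg_mask).rectangle(box, fill=0)
    return inpaint(image, bg_mask, "background")

img = iter_inpaint((512, 512), [("a cat", (50, 200, 250, 450))])
print(img.size)
```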
- LayoutDiffusion: Improving Graphic Layout Generation by Discrete Diffusion Probabilistic Models [50.73105631853759]
We present a novel generative model named LayoutDiffusion for automatic layout generation.
It learns to reverse a mild forward process in which layouts become increasingly chaotic as the forward steps progress.
It enables two conditional layout generation tasks in a plug-and-play manner without re-training and achieves better performance than existing methods.
arXiv Detail & Related papers (2023-03-21T04:41:02Z)
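The "mild forward process" in the LayoutDiffusion entry can be pictured as progressively corrupting discrete layout tokens; the toy kernel below (resampling quantized coordinates with step-dependent probability) illustrates discrete-diffusion corruption in general, not the paper's actual transition matrix.

```python
# Toy discrete forward process: layout tokens (coordinates quantized to a
# small grid) are corrupted with probability growing in t; a reverse model
# would be trained to denoise. The corruption kernel here is illustrative.
import random

GRID = 32  # coordinates quantized to a 32x32 grid

def forward_step(tokens, t, T):
    noise_p = (t + 1) / T                  # more chaos at larger t
    out = []
    for tok in tokens:
        if random.random() < noise_p:
            tok = random.randrange(GRID)   # resample a random grid index
        out.append(tok)
    return out

random.seed(0)
layout = [4, 7, 20, 25]                    # one box: x0, y0, x1, y1
for t in range(0, 10, 3):
    print(t, forward_step(layout, t, T=10))
```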
- LayoutDiffuse: Adapting Foundational Diffusion Models for Layout-to-Image Generation [24.694298869398033]
Our method trains efficiently and generates images with both high perceptual quality and layout alignment.
Our method significantly outperforms 10 other generative models based on GANs, VQ-VAE, and diffusion models.
arXiv Detail & Related papers (2023-02-16T14:20:25Z)
- Geometry Aligned Variational Transformer for Image-conditioned Layout Generation [38.747175229902396]
We propose an Image-Conditioned Variational Transformer (ICVT) that autoregressively generates various layouts in an image.
A self-attention mechanism is adopted to model the contextual relationships within layout elements, while a cross-attention mechanism fuses in the visual information of the conditioning images.
We construct a large-scale advertisement poster layout designing dataset with delicate layout and saliency map annotations.
arXiv Detail & Related papers (2022-09-02T07:19:12Z)
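The ICVT summary names two attention roles: self-attention among layout-element tokens and cross-attention to the conditioning image's features. This block sketches only that wiring; the real model is variational and autoregressive, and all sizes here are illustrative.

```python
# Sketch of the attention wiring described in the ICVT entry: self-attention
# models relations among layout elements, cross-attention injects features
# of the conditioning image. Not the paper's full variational model.
import torch
from torch import nn

class ICVTBlock(nn.Module):
    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                nn.Linear(4 * dim, dim))

    def forward(self, elems, img_feats):
        elems = elems + self.self_attn(elems, elems, elems)[0]           # element context
        elems = elems + self.cross_attn(elems, img_feats, img_feats)[0]  # visual fusion
        return elems + self.ff(elems)

elems = torch.randn(1, 6, 64)       # 6 layout-element tokens
img_feats = torch.randn(1, 49, 64)  # 7x7 image feature grid
print(ICVTBlock()(elems, img_feats).shape)
```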
- Constrained Graphic Layout Generation via Latent Optimization [17.05026043385661]
We generate graphic layouts that can flexibly incorporate design semantics, either specified implicitly or explicitly by a user.
Our approach builds on a generative layout model based on a Transformer architecture, and formulates the layout generation as a constrained optimization problem.
We show in the experiments that our approach is capable of generating realistic layouts in both constrained and unconstrained generation tasks with a single model.
arXiv Detail & Related papers (2021-08-02T13:04:11Z)
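The entry above formulates constrained layout generation as optimization over the latent code of a frozen generator; here is a toy version with a stand-in linear "generator" and a single left-of constraint, purely to show the optimization loop.

```python
# Toy latent optimization: freeze a pretrained layout generator, then
# optimize its latent so the decoded layout satisfies a user constraint
# (box 0 must stay left of box 1). The generator is a stand-in, not the
# paper's Transformer-based model.
import torch

torch.manual_seed(0)
generator = torch.nn.Linear(16, 8)      # latent -> two boxes (x, y, w, h each)
for p in generator.parameters():
    p.requires_grad_(False)             # pretrained weights stay fixed

z = torch.randn(1, 16, requires_grad=True)
opt = torch.optim.Adam([z], lr=0.1)
for step in range(200):
    boxes = generator(z).view(2, 4)
    # Constraint loss: right edge of box 0 must not pass left edge of box 1.
    loss = torch.relu(boxes[0, 0] + boxes[0, 2] - boxes[1, 0])
    opt.zero_grad()
    loss.backward()
    opt.step()
print("final constraint violation:", float(loss))
```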
- Semantic Palette: Guiding Scene Generation with Class Proportions [34.746963256847145]
We introduce a conditional framework with novel architecture designs and learning objectives, which effectively accommodates class proportions to guide the scene generation process.
Thanks to the semantic control, we can produce layouts close to the real distribution, helping enhance the whole scene generation process.
We demonstrate the merit of our approach for data augmentation: semantic segmenters trained on real layout-image pairs augmented with generated pairs outperform models trained only on real pairs.
arXiv Detail & Related papers (2021-06-03T07:04:00Z)
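Semantic Palette conditions scene generation on class proportions; the toy below only materializes a label map whose class areas match a requested histogram, whereas the paper uses a learned conditional generator to place classes coherently.

```python
# Toy version of class-proportion conditioning: build a label map whose
# per-class pixel shares match a target histogram. A learned generator
# would arrange these pixels into a coherent scene layout.
import numpy as np

def layout_from_proportions(props, size=(64, 64), seed=0):
    rng = np.random.default_rng(seed)
    h, w = size
    counts = (np.array(list(props.values())) * h * w).astype(int)
    counts[-1] += h * w - counts.sum()     # absorb the rounding remainder
    labels = np.repeat(np.arange(len(props)), counts)
    rng.shuffle(labels)                    # placeholder for coherent placement
    return labels.reshape(h, w)

lm = layout_from_proportions({"road": 0.4, "car": 0.1, "sky": 0.5})
print({k: float((lm == i).mean()) for i, k in enumerate(["road", "car", "sky"])})
```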
- LayoutTransformer: Layout Generation and Completion with Self-attention [105.21138914859804]
We address the problem of scene layout generation for diverse domains such as images, mobile applications, documents, and 3D objects.
We propose LayoutTransformer, a novel framework that leverages self-attention to learn contextual relationships between layout elements.
Our framework allows us to generate a new layout either from an empty set or from an initial seed set of primitives, and can easily scale to support an arbitrary number of primitives per layout.
arXiv Detail & Related papers (2020-06-25T17:56:34Z)
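LayoutTransformer generates a layout autoregressively from an empty sequence or a seed set of primitives; this sketch shows that decoding loop, with a uniform sampler standing in for the trained self-attention model (the five-token-per-element layout encoding is our assumption).

```python
# Sketch of LayoutTransformer-style decoding: a layout is a flat token
# sequence (category, x, y, w, h per element) generated autoregressively,
# starting from nothing or from seed primitives. `next_token` is a uniform
# sampler standing in for the trained self-attention network.
import random

VOCAB, EOS, TOKENS_PER_ELEM = 128, 0, 5

def next_token(seq):
    # Placeholder for transformer(seq) -> distribution over VOCAB.
    return random.randrange(VOCAB)

def complete(seed_tokens, max_elems=8):
    seq = list(seed_tokens)                # completion: start from a seed set
    while len(seq) < max_elems * TOKENS_PER_ELEM:
        tok = next_token(seq)
        if tok == EOS:
            break
        seq.append(tok)
    n = len(seq) - len(seq) % TOKENS_PER_ELEM   # drop any partial element
    return [seq[i:i + TOKENS_PER_ELEM] for i in range(0, n, TOKENS_PER_ELEM)]

random.seed(0)
print(complete([]))                        # generation from scratch
print(complete([3, 10, 12, 40, 30]))       # completion from one seed element
```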