Layout-Conditioned Autoregressive Text-to-Image Generation via Structured Masking
- URL: http://arxiv.org/abs/2509.12046v1
- Date: Mon, 15 Sep 2025 15:27:29 GMT
- Title: Layout-Conditioned Autoregressive Text-to-Image Generation via Structured Masking
- Authors: Zirui Zheng, Takashi Isobe, Tong Shen, Xu Jia, Jianbin Zhao, Xiaomin Li, Mengmeng Ge, Baolu Li, Qinghe Wang, Dong Li, Dong Zhou, Yunzhi Zhuge, Huchuan Lu, Emad Barsoum
- Abstract summary: We present Structured Masking for AR-based Layout-to-Image (SMARLI), which integrates spatial layout constraints into AR-based image generation. It achieves superior layout-aware control while maintaining the structural simplicity and generation efficiency of AR models.
- Score: 58.238858463243396
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While autoregressive (AR) models have demonstrated remarkable success in image generation, extending them to layout-conditioned generation remains challenging due to the sparse nature of layout conditions and the risk of feature entanglement. We present Structured Masking for AR-based Layout-to-Image (SMARLI), a novel framework for layout-to-image generation that effectively integrates spatial layout constraints into AR-based image generation. To equip the AR model with layout control, a specially designed structured masking strategy is applied to attention computation to govern the interaction among the global prompt, layout, and image tokens. This design prevents mis-association between different regions and their descriptions while enabling sufficient injection of layout constraints into the generation process. To further enhance generation quality and layout accuracy, we incorporate a Group Relative Policy Optimization (GRPO)-based post-training scheme with specially designed layout reward functions for next-set-based AR models. Experimental results demonstrate that SMARLI is able to seamlessly integrate layout tokens with text and image tokens without compromising generation quality. It achieves superior layout-aware control while maintaining the structural simplicity and generation efficiency of AR models.
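The structured masking idea described above can be illustrated with a toy attention mask. This is a minimal sketch, not the paper's actual scheme: the token ordering ([global prompt | per-region layout tokens | image tokens]), the region-to-patch assignment, and the exact visibility rules are all assumptions made for illustration.

```python
import numpy as np

def build_structured_mask(n_prompt, region_layout_lens, image_region_ids):
    """Boolean mask M where M[q, k] = True lets query token q attend to key token k.

    Sequence layout (assumed): [global prompt | layout tokens per region | image tokens].
    """
    n_layout = sum(region_layout_lens)
    n_image = len(image_region_ids)
    n = n_prompt + n_layout + n_image
    mask = np.zeros((n, n), dtype=bool)

    # Every token may attend to the global prompt.
    mask[:, :n_prompt] = True

    # Layout tokens of region r attend only within their own region block,
    # preventing mis-association between regions and their descriptions.
    offsets = np.cumsum([0] + list(region_layout_lens))
    for s, e in zip(offsets[:-1], offsets[1:]):
        blk = slice(n_prompt + s, n_prompt + e)
        mask[blk, blk] = True

    # Each image token attends to its own region's layout tokens and,
    # causally, to itself and earlier image tokens.
    img0 = n_prompt + n_layout
    for i, r in enumerate(image_region_ids):
        s, e = offsets[r], offsets[r + 1]
        mask[img0 + i, n_prompt + s:n_prompt + e] = True
        mask[img0 + i, img0:img0 + i + 1] = True
    return mask

# 2 prompt tokens, two regions with 2 layout tokens each, 4 image tokens
# (first two assigned to region 0, last two to region 1).
mask = build_structured_mask(n_prompt=2, region_layout_lens=[2, 2],
                             image_region_ids=[0, 0, 1, 1])
```

With this toy configuration, an image token assigned to region 0 can see the global prompt and region 0's layout tokens but is blocked from region 1's layout tokens, which is the mis-association the abstract says the masking prevents.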
Related papers
- Conditional Panoramic Image Generation via Masked Autoregressive Modeling [35.624070746282186]
We propose a unified framework, Panoramic AutoRegressive model (PAR), which leverages masked autoregressive modeling to address these challenges. To address the inherent discontinuity in existing generative models, we introduce circular padding to enhance spatial coherence. Experiments demonstrate competitive performance in text-to-image generation and panorama outpainting tasks.
arXiv Detail & Related papers (2025-05-22T16:20:12Z) - Anchor Token Matching: Implicit Structure Locking for Training-free AR Image Editing [60.102602955261084]
Implicit Structure Locking (ISLock) is the first training-free editing strategy for AR visual models. Our method preserves structural blueprints by dynamically aligning self-attention patterns with reference images. Our findings pioneer the way for efficient and flexible AR-based image editing, further bridging the performance gap between diffusion and autoregressive generative models.
arXiv Detail & Related papers (2025-04-14T17:25:19Z) - CreatiLayout: Siamese Multimodal Diffusion Transformer for Creative Layout-to-Image Generation [78.21134311493303]
Diffusion models have been recognized for their ability to generate images that are not only visually appealing but also of high artistic quality. Layout-to-image generation has been proposed to leverage region-specific positions and descriptions to enable more precise and controllable generation. We present a systematic solution that integrates the layout model, dataset, and planner for creative layout-to-image generation.
arXiv Detail & Related papers (2024-12-05T04:09:47Z) - LayoutDiT: Exploring Content-Graphic Balance in Layout Generation with Diffusion Transformer [46.67415676699221]
We introduce a framework that balances content and graphic features to generate high-quality, visually appealing layouts.
Specifically, we design an adaptive factor that optimizes the model's awareness of the layout generation space.
We also introduce a graphic condition, the saliency bounding box, to bridge the modality difference between images in the visual domain and layouts in the geometric parameter domain.
arXiv Detail & Related papers (2024-07-21T17:58:21Z) - PosterLLaVa: Constructing a Unified Multi-modal Layout Generator with LLM [58.67882997399021]
Our research introduces a unified framework for automated graphic layout generation. Our data-driven method employs structured text (JSON format) and visual instruction tuning to generate layouts. We develop an automated text-to-poster system that generates editable posters based on users' design intentions.
arXiv Detail & Related papers (2024-06-05T03:05:52Z) - Retrieval-Augmented Layout Transformer for Content-Aware Layout Generation [30.101562738257588]
Content-aware graphic layout generation aims to automatically arrange visual elements along with a given content, such as an e-commerce product image.
We show that a simple retrieval augmentation can significantly improve the generation quality.
Our model, which is named Retrieval-Augmented Layout Transformer (RALF), retrieves nearest neighbor layout examples based on an input image and feeds these results into an autoregressive generator.
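The retrieval step described for RALF can be sketched in a few lines. This is a hypothetical illustration, not RALF's implementation: the feature extractor is omitted, plain L2 distance over 2-D toy features stands in for whatever similarity the model actually uses, and layouts are represented as opaque labels.

```python
import numpy as np

def retrieve_layouts(query_feat, db_feats, db_layouts, k=2):
    """Return the k stored layouts whose image features are closest (L2) to the query."""
    dists = np.linalg.norm(db_feats - query_feat, axis=1)
    nearest_idx = np.argsort(dists)[:k]
    return [db_layouts[i] for i in nearest_idx]

# Toy database: 2-D image features paired with layout examples (labels here).
db_feats = np.array([[0.0, 0.0], [1.0, 1.0], [0.2, 0.0]])
db_layouts = ["layout_a", "layout_b", "layout_c"]

nearest = retrieve_layouts(np.array([0.05, 0.0]), db_feats, db_layouts, k=2)
```

In the actual system the retrieved examples would then be encoded and fed to the autoregressive generator as additional conditioning, rather than returned directly.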
arXiv Detail & Related papers (2023-11-22T18:59:53Z) - Diagnostic Benchmark and Iterative Inpainting for Layout-Guided Image Generation [147.81509219686419]
We propose a diagnostic benchmark for layout-guided image generation that examines four categories of spatial control skills: number, position, size, and shape.
Next, we propose IterInpaint, a new baseline that generates foreground and background regions step-by-step via inpainting.
We show comprehensive ablation studies on IterInpaint, including training task ratio, crop&paste vs. repaint, and generation order.
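The step-by-step generation order that IterInpaint's ablations examine can be captured as a simple control loop. This is a hypothetical sketch of the iteration structure only: `inpaint` here is a stub recorder standing in for a real inpainting model, and the box/prompt format is invented for illustration.

```python
def iter_inpaint(canvas, regions, inpaint):
    """Generate each foreground region by inpainting in turn, then fill the background.

    regions: list of (box, prompt) pairs for foreground objects, in generation order.
    """
    for box, prompt in regions:          # step-by-step foreground inpainting
        canvas = inpaint(canvas, mask=box, prompt=prompt)
    # Finally, repaint everything outside the union of the boxes.
    return inpaint(canvas, mask="background", prompt="background scene")

# Stub "model" that just records what was painted where.
def record(canvas, mask, prompt):
    return canvas + [(mask, prompt)]

steps = iter_inpaint([], [((0, 0, 32, 32), "a cat"),
                          ((40, 0, 64, 32), "a dog")], record)
```

The ablations mentioned above (generation order, crop&paste vs. repaint) correspond to reordering `regions` and to how `inpaint` composites each result back onto the canvas.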
arXiv Detail & Related papers (2023-04-13T16:58:33Z) - ALR-GAN: Adaptive Layout Refinement for Text-to-Image Synthesis [42.86424135174045]
We propose a novel Text-to-Image Generation Network, the Adaptive Layout Refinement Generative Adversarial Network (ALR-GAN).
The ALR-GAN includes an Adaptive Layout Refinement (ALR) module and a Layout Visual Refinement (LVR) loss.
Experimental results on two widely-used datasets show that ALR-GAN performs competitively at the Text-to-Image generation task.
arXiv Detail & Related papers (2023-04-13T07:07:01Z) - Semantic Palette: Guiding Scene Generation with Class Proportions [34.746963256847145]
We introduce a conditional framework with novel architecture designs and learning objectives, which effectively accommodates class proportions to guide the scene generation process.
Thanks to the semantic control, we can produce layouts close to the real distribution, helping enhance the whole scene generation process.
We demonstrate the merit of our approach for data augmentation: semantic segmenters trained on synthetic layout-image pairs in addition to real ones outperform models trained only on real pairs.
arXiv Detail & Related papers (2021-06-03T07:04:00Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.