Laytrol: Preserving Pretrained Knowledge in Layout Control for Multimodal Diffusion Transformers
- URL: http://arxiv.org/abs/2511.07934v1
- Date: Wed, 12 Nov 2025 01:29:06 GMT
- Title: Laytrol: Preserving Pretrained Knowledge in Layout Control for Multimodal Diffusion Transformers
- Authors: Sida Huang, Siqi Huang, Ping Luo, Hongyuan Zhang
- Abstract summary: Layout-to-image generation aims to generate images that are spatially consistent with a given layout condition. Existing layout-to-image methods typically introduce the layout condition by integrating adapter modules into the base generative model. We propose the Layout Control (Laytrol) Network, whose parameters are inherited from MM-DiT to preserve the pretrained knowledge of the base model.
- Score: 30.863250877729612
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: With the development of diffusion models, enhancing spatial controllability in text-to-image generation has become a vital challenge. As a representative task for addressing this challenge, layout-to-image generation aims to generate images that are spatially consistent with the given layout condition. Existing layout-to-image methods typically introduce the layout condition by integrating adapter modules into the base generative model. However, the generated images often exhibit low visual quality and stylistic inconsistency with the base model, indicating a loss of pretrained knowledge. To alleviate this issue, we construct the Layout Synthesis (LaySyn) dataset, which leverages images synthesized by the base model itself to mitigate the distribution shift from the pretraining data. Moreover, we propose the Layout Control (Laytrol) Network, in which parameters are inherited from MM-DiT to preserve the pretrained knowledge of the base model. To effectively activate the copied parameters and avoid disturbance from unstable control conditions, we adopt a dedicated initialization scheme for Laytrol. In this scheme, the layout encoder is initialized as a pure text encoder to ensure that its output tokens remain within the data domain of MM-DiT. Meanwhile, the outputs of the layout control network are initialized to zero. In addition, we apply Object-level Rotary Position Embedding to the layout tokens to provide coarse positional information. Qualitative and quantitative experiments demonstrate the effectiveness of our method.
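The initialization scheme described in the abstract lends itself to a short sketch. Below is a minimal PyTorch illustration of the two ideas: a control block whose output projection is zero-initialized so the base model is untouched at the start of training, and a coarse object-level rotary embedding keyed to each box center. All class, function, and parameter names are illustrative assumptions, not the authors' released code; a generic transformer layer stands in for an inherited MM-DiT block, and the layout encoder's initialization from text-encoder weights (so its tokens stay in MM-DiT's data domain) is only noted in a comment.

```python
import torch
import torch.nn as nn

class LaytrolControlBlock(nn.Module):
    """Sketch of one layout-control block (hypothetical names).

    In the paper, the control branch inherits MM-DiT block weights and the
    layout encoder is initialized as a pure text encoder; here a generic
    transformer layer stands in for one such inherited block.
    """
    def __init__(self, dim: int, nhead: int = 8):
        super().__init__()
        self.block = nn.TransformerEncoderLayer(dim, nhead, batch_first=True)
        self.out_proj = nn.Linear(dim, dim)
        # Zero-init the output projection: the control signal starts at
        # exactly zero, so the base model's behavior is initially untouched.
        nn.init.zeros_(self.out_proj.weight)
        nn.init.zeros_(self.out_proj.bias)

    def forward(self, layout_tokens: torch.Tensor) -> torch.Tensor:
        return self.out_proj(self.block(layout_tokens))


def object_level_rope(tokens: torch.Tensor, boxes: torch.Tensor) -> torch.Tensor:
    """Rotate each object's layout tokens by its box center (simplified 2D RoPE).

    tokens: (B, N, D) layout tokens, one per object; D must be divisible by 4.
    boxes:  (B, N, 4) normalized (x0, y0, x1, y1) bounding boxes.
    """
    B, N, D = tokens.shape
    half = D // 2
    # Box centers serve as the coarse per-object 2D position.
    cx = (boxes[..., 0] + boxes[..., 2]) / 2          # (B, N)
    cy = (boxes[..., 1] + boxes[..., 3]) / 2
    freqs = 1.0 / (10000.0 ** (torch.arange(0, half, 2, dtype=tokens.dtype,
                                            device=tokens.device) / half))

    def rotate(x: torch.Tensor, pos: torch.Tensor) -> torch.Tensor:
        angles = pos[..., None] * freqs               # (B, N, half/2)
        cos, sin = angles.cos(), angles.sin()
        x1, x2 = x[..., 0::2], x[..., 1::2]
        return torch.stack((x1 * cos - x2 * sin,
                            x1 * sin + x2 * cos), dim=-1).flatten(-2)

    # First half of the channels encodes x, second half encodes y.
    return torch.cat((rotate(tokens[..., :half], cx),
                      rotate(tokens[..., half:], cy)), dim=-1)


if __name__ == "__main__":
    tokens = torch.randn(2, 5, 64)               # 2 images, 5 objects, dim 64
    boxes = torch.rand(2, 5, 4)                  # toy normalized boxes
    tokens = object_level_rope(tokens, boxes)    # add coarse object positions
    block = LaytrolControlBlock(dim=64)
    print(block(tokens).abs().max())             # tensor(0.) at initialization
```

Because the output projection starts at zero, the first forward pass reproduces the base model exactly; the layout signal can only perturb generation once training gives the projection useful weights, which is the property the abstract attributes to the dedicated initialization scheme.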
Related papers
- A Two-Stage System for Layout-Controlled Image Generation using Large Language Models and Diffusion Models [0.0]
Text-to-image diffusion models exhibit remarkable generative capabilities but lack precise control over object counts and spatial arrangements. This work introduces a two-stage system to address these compositional limitations. The first stage employs a Large Language Model (LLM) to generate a structured layout from a list of objects. The second stage uses a layout-conditioned diffusion model to synthesize a photorealistic image adhering to this layout.
arXiv Detail & Related papers (2025-11-10T09:40:48Z)
- EditInfinity: Image Editing with Binary-Quantized Generative Models [64.05135380710749]
We investigate the parameter-efficient adaptation of binary-quantized generative models for image editing. Specifically, we propose EditInfinity, which adapts Infinity, a binary-quantized generative model, for image editing. We propose an efficient yet effective image inversion mechanism that integrates text prompting rectification and image style preservation.
arXiv Detail & Related papers (2025-10-23T05:06:24Z)
- ScaleWeaver: Weaving Efficient Controllable T2I Generation with Multi-Scale Reference Attention [86.93601565563954]
ScaleWeaver is a framework designed to achieve high-fidelity, controllable generation on top of advanced visual autoregressive (VAR) models. The proposed Reference Attention module discards the unnecessary attention from image→condition, reducing computational cost. Experiments show that ScaleWeaver delivers high-quality generation and precise control while attaining superior efficiency over diffusion-based methods.
arXiv Detail & Related papers (2025-10-16T17:00:59Z)
- STAY Diffusion: Styled Layout Diffusion Model for Diverse Layout-to-Image Generation [4.769823364778397]
We propose a diffusion-based model that produces photo-realistic images and provides fine-grained control of stylized objects in scenes. Our approach learns a global condition for each layout and a self-supervised semantic map for weight modulation. A new Styled-Mask Attention (SM Attention) is also introduced to cross-condition the global condition and image features, capturing the objects' relationships.
arXiv Detail & Related papers (2025-03-15T17:36:24Z)
- OminiControl: Minimal and Universal Control for Diffusion Transformer [68.3243031301164]
We present OminiControl, a novel approach that rethinks how image conditions are integrated into Diffusion Transformer (DiT) architectures. OminiControl addresses the limitations of existing approaches through three key innovations.
arXiv Detail & Related papers (2024-11-22T17:55:15Z)
- PLACE: Adaptive Layout-Semantic Fusion for Semantic Image Synthesis [62.29033292210752]
Generating high-quality images with consistent semantics and layout remains a challenge.
We propose the adaPtive LAyout-semantiC fusion modulE (PLACE) that harnesses pre-trained models to alleviate the aforementioned issues.
Our approach performs favorably in terms of visual quality, semantic consistency, and layout alignment.
arXiv Detail & Related papers (2024-03-04T09:03:16Z)
- Spatial-Aware Latent Initialization for Controllable Image Generation [9.23227552726271]
Text-to-image diffusion models have demonstrated impressive ability to generate high-quality images conditioned on the textual input.
Previous research has primarily focused on aligning cross-attention maps with layout conditions.
We propose leveraging spatial-aware initialization noise during the denoising process to achieve better layout control.
arXiv Detail & Related papers (2024-01-29T13:42:01Z)
- SSMG: Spatial-Semantic Map Guided Diffusion Model for Free-form Layout-to-Image Generation [68.42476385214785]
We propose a novel Spatial-Semantic Map Guided (SSMG) diffusion model that adopts the feature map, derived from the layout, as guidance.
SSMG achieves superior generation quality with sufficient spatial and semantic controllability compared to previous works.
We also propose the Relation-Sensitive Attention (RSA) and Location-Sensitive Attention (LSA) mechanisms.
arXiv Detail & Related papers (2023-08-20T04:09:12Z)
- iEdit: Localised Text-guided Image Editing with Weak Supervision [53.082196061014734]
We propose a novel learning method for text-guided image editing.
It generates images conditioned on a source image and a textual edit prompt.
It shows favourable results against its counterparts in terms of image fidelity and CLIP alignment score, and qualitatively for editing both generated and real images.
arXiv Detail & Related papers (2023-05-10T07:39:14Z)
- LayoutDM: Discrete Diffusion Model for Controllable Layout Generation [27.955214767628107]
Controllable layout generation aims at synthesizing plausible arrangements of element bounding boxes with optional constraints.
In this work, we try to solve a broad range of layout generation tasks in a single model that is based on discrete state-space diffusion models.
Our model, named LayoutDM, naturally handles the structured layout data in the discrete representation and learns to progressively infer a noiseless layout from the initial input.
arXiv Detail & Related papers (2023-03-14T17:59:47Z)
- Global and Local Alignment Networks for Unpaired Image-to-Image Translation [170.08142745705575]
The goal of unpaired image-to-image translation is to produce an output image reflecting the target domain's style.
Due to the lack of attention to the content change in existing methods, semantic information from source images suffers from degradation during translation.
We introduce a novel approach, Global and Local Alignment Networks (GLA-Net).
Our method effectively generates sharper and more realistic images than existing approaches.
arXiv Detail & Related papers (2021-11-19T18:01:54Z)
- Learning Layout and Style Reconfigurable GANs for Controllable Image Synthesis [12.449076001538552]
This paper focuses on a recently emerged task, layout-to-image, to learn generative models capable of synthesizing photo-realistic images from a spatial layout.
Style control at the image level is the same as in vanilla GANs, while style control at the object mask level is realized by a newly proposed feature normalization scheme.
In experiments, the proposed method is evaluated on the COCO-Stuff and Visual Genome datasets, achieving state-of-the-art performance.
arXiv Detail & Related papers (2020-03-25T18:16:05Z)