Related papers: ConsistCompose: Unified Multimodal Layout Control for Image Composition

URL: http://arxiv.org/abs/2511.18333v1
Date: Sun, 23 Nov 2025 08:14:53 GMT
Title: ConsistCompose: Unified Multimodal Layout Control for Image Composition
Authors: Xuanke Shi, Boxuan Li, Xiaoyang Han, Zhongang Cai, Lei Yang, Dahua Lin, Quan Wang
Abstract summary: We present ConsistCompose, a unified framework that embeds layout coordinates directly into language prompts. We show that ConsistCompose substantially improves spatial accuracy over layout-controlled baselines.
Score: 56.909072845166264
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Unified multimodal models that couple visual understanding with image generation have advanced rapidly, yet most systems still focus on visual grounding (aligning language with image regions). Their generative counterpart, linguistic-embedded layout-grounded generation (LELG) for layout-controllable multi-instance generation, remains underexplored, which limits precise compositional control. We present ConsistCompose, a unified multimodal framework that embeds layout coordinates directly into language prompts, enabling layout-controlled multi-instance image generation from interleaved image-text input within a single generative interface. We further construct ConsistCompose3M, a 3.4M multi-instance generation dataset with layout and identity annotations (2.6M text-guided and 0.8M image-guided data pairs) that provides large-scale supervision for layout-conditioned generation. Within this framework, LELG is instantiated through instance-coordinate binding prompts and coordinate-aware classifier-free guidance, which translate linguistic layout cues into precise spatial control without task-specific branches. Experiments on COCO-Position and MS-Bench show that ConsistCompose substantially improves spatial accuracy over layout-controlled baselines while preserving identity fidelity and competitive general multimodal understanding, establishing a unified paradigm for layout-controllable multimodal image generation.

Related papers

MoGen: A Unified Collaborative Framework for Controllable Multi-Object Image Generation [76.94658056824422]
MoGen is a user-friendly multi-object image generation method. First, we design a Regional Semantic Anchor (RSA) module that precisely anchors phrase units in language descriptions to their corresponding image regions. We introduce an Adaptive Multi-modal Guidance (AMG) module, which adaptively parses and integrates various combinations of multi-source control signals.
arXiv Detail & Related papers (2026-01-09T05:57:48Z)
AnyMS: Bottom-up Attention Decoupling for Layout-guided and Training-free Multi-subject Customization [55.06425570300248]
We present AnyMS, a training-free framework for layout-guided multi-subject customization. AnyMS leverages three input conditions: text prompt, subject images, and layout constraints. AnyMS achieves state-of-the-art performance, supporting complex compositions and scaling to a larger number of subjects.
arXiv Detail & Related papers (2025-12-29T15:26:25Z)
Canvas-to-Image: Compositional Image Generation with Multimodal Controls [51.44122945214702]
We introduce Canvas-to-Image, a unified framework that consolidates heterogeneous controls into a single canvas interface. Our key idea is to encode diverse control signals into a single composite canvas image that the model can interpret for integrated visual-spatial reasoning.
arXiv Detail & Related papers (2025-11-26T18:59:56Z)
ContextGen: Contextual Layout Anchoring for Identity-Consistent Multi-Instance Generation [24.487453636504707]
We introduce ContextGen, a novel Diffusion Transformer framework for multi-instance generation. We show that ContextGen sets a new state-of-the-art, outperforming existing methods in control precision, identity fidelity, and overall visual quality.
arXiv Detail & Related papers (2025-10-13T04:21:19Z)
Multimodal Representation Alignment for Image Generation: Text-Image Interleaved Control Is Easier Than You Think [38.258453761376586]
We propose Dream Engine, an efficient framework designed for arbitrary text-image interleaved control in image generation models. Our approach utilizes a two-stage training paradigm, consisting of joint text-image alignment and multimodal interleaved instruction tuning. Our experiments demonstrate that this training method is effective, achieving a 0.69 overall score on the GenEval benchmark.
arXiv Detail & Related papers (2025-02-27T15:08:39Z)
UNIC-Adapter: Unified Image-instruction Adapter with Multi-modal Transformer for Image Generation [64.8341372591993]
We propose a new approach to unify controllable generation within a single framework. Specifically, we propose the unified image-instruction adapter (UNIC-Adapter) built on the Multi-Modal-Diffusion Transformer architecture. Our UNIC-Adapter effectively extracts multi-modal instruction information by incorporating both conditional images and task instructions.
arXiv Detail & Related papers (2024-12-25T15:19:02Z)
CreatiLayout: Siamese Multimodal Diffusion Transformer for Creative Layout-to-Image Generation [78.21134311493303]
Diffusion models have been recognized for their ability to generate images that are not only visually appealing but also of high artistic quality. Layout-to-image generation has been proposed to leverage region-specific positions and descriptions to enable more precise and controllable generation. We present a systematic solution that integrates the layout model, dataset, and planner for creative layout-to-image generation.
arXiv Detail & Related papers (2024-12-05T04:09:47Z)
OmniBooth: Learning Latent Control for Image Synthesis with Multi-modal Instruction [32.08995899903304]
We present OmniBooth, an image generation framework that enables spatial control with instance-level multi-modal customization. Our approach significantly expands the scope of text-to-image generation, and elevates it to a more versatile and practical dimension in controllability.
arXiv Detail & Related papers (2024-10-07T11:26:13Z)
Kosmos-G: Generating Images in Context with Multimodal Large Language Models [117.0259361818715]
Current subject-driven image generation methods require test-time tuning and cannot accept interleaved multi-image and text input. This paper presents Kosmos-G, a model that leverages the advanced multimodal perception capabilities of Multimodal Large Language Models. Kosmos-G demonstrates an impressive capability of zero-shot subject-driven generation with interleaved multi-image and text input.
arXiv Detail & Related papers (2023-10-04T17:28:44Z)

This list is automatically generated from the titles and abstracts of the papers on this site.