ConsistCompose: Unified Multimodal Layout Control for Image Composition
- URL: http://arxiv.org/abs/2511.18333v1
- Date: Sun, 23 Nov 2025 08:14:53 GMT
- Title: ConsistCompose: Unified Multimodal Layout Control for Image Composition
- Authors: Xuanke Shi, Boxuan Li, Xiaoyang Han, Zhongang Cai, Lei Yang, Dahua Lin, Quan Wang
- Abstract summary: We present ConsistCompose, a unified framework that embeds layout coordinates directly into language prompts. We show that ConsistCompose substantially improves spatial accuracy over layout-controlled baselines.
- Score: 56.909072845166264
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Unified multimodal models that couple visual understanding with image generation have advanced rapidly, yet most systems still focus on visual grounding (aligning language with image regions), while its generative counterpart, linguistic-embedded layout-grounded generation (LELG) for layout-controllable multi-instance generation, remains underexplored, which limits precise compositional control. We present ConsistCompose, a unified multimodal framework that embeds layout coordinates directly into language prompts, enabling layout-controlled multi-instance image generation from interleaved image-text inputs within a single generative interface. We further construct ConsistCompose3M, a 3.4M multi-instance generation dataset with layout and identity annotations (2.6M text-guided and 0.8M image-guided data pairs) that provides large-scale supervision for layout-conditioned generation. Within this framework, LELG is instantiated through instance-coordinate binding prompts and coordinate-aware classifier-free guidance, which translate linguistic layout cues into precise spatial control without task-specific branches. Experiments on COCO-Position and MS-Bench show that ConsistCompose substantially improves spatial accuracy over layout-controlled baselines while preserving identity fidelity and competitive general multimodal understanding, establishing a unified paradigm for layout-controllable multimodal image generation.
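The abstract describes embedding layout coordinates directly into the language prompt and binding them to individual instances. The snippet below is only a rough sketch of what such a binding could look like. The `<box>` tag, the normalized `(x0, y0, x1, y1)` convention, and the names `Instance` and `bind_layout` are illustrative assumptions, not the prompt format used by ConsistCompose, which this listing does not specify.

```python
from dataclasses import dataclass


@dataclass
class Instance:
    phrase: str  # text describing one instance
    box: tuple[float, float, float, float]  # (x0, y0, x1, y1), normalized to [0, 1]


def bind_layout(caption: str, instances: list[Instance], precision: int = 2) -> str:
    """Append each instance phrase followed by its box written as plain text.

    The <box>...</box> wrapper is a hypothetical convention chosen for this sketch.
    """
    parts = [caption.strip()]
    for inst in instances:
        x0, y0, x1, y1 = inst.box
        # Reject boxes that leave the unit square or have no area.
        if not (0.0 <= x0 < x1 <= 1.0 and 0.0 <= y0 < y1 <= 1.0):
            raise ValueError(f"box out of range or degenerate: {inst.box}")
        coords = ", ".join(f"{v:.{precision}f}" for v in inst.box)
        parts.append(f"{inst.phrase} <box>({coords})</box>")
    return " ".join(parts)


prompt = bind_layout(
    "A park scene.",
    [
        Instance("a red kite", (0.10, 0.05, 0.40, 0.30)),
        Instance("a brown dog", (0.55, 0.60, 0.90, 0.95)),
    ],
)
print(prompt)
```

Keeping the coordinates as ordinary text next to each phrase means the same string can be passed through a single text interface, which is the general idea the abstract points to. How the actual model tokenizes or weights those coordinates is not stated here.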
Related papers
- MoGen: A Unified Collaborative Framework for Controllable Multi-Object Image Generation [76.94658056824422]
MoGen is a user-friendly multi-object image generation method. First, we design a Regional Semantic Anchor (RSA) module that precisely anchors phrase units in language descriptions to their corresponding image regions. We introduce an Adaptive Multi-modal Guidance (AMG) module, which adaptively parses and integrates various combinations of multi-source control signals.
arXiv Detail & Related papers (2026-01-09T05:57:48Z)
- AnyMS: Bottom-up Attention Decoupling for Layout-guided and Training-free Multi-subject Customization [55.06425570300248]
We present AnyMS, a training-free framework for layout-guided multi-subject customization. AnyMS leverages three input conditions: text prompt, subject images, and layout constraints. AnyMS achieves state-of-the-art performance, supporting complex compositions and scaling to a larger number of subjects.
arXiv Detail & Related papers (2025-12-29T15:26:25Z)
- Canvas-to-Image: Compositional Image Generation with Multimodal Controls [51.44122945214702]
We introduce Canvas-to-Image, a unified framework that consolidates heterogeneous controls into a single canvas interface. Our key idea is to encode diverse control signals into a single composite canvas image that the model can interpret for integrated visual-spatial reasoning.
arXiv Detail & Related papers (2025-11-26T18:59:56Z)
- ContextGen: Contextual Layout Anchoring for Identity-Consistent Multi-Instance Generation [24.487453636504707]
We introduce ContextGen, a novel Diffusion Transformer framework for multi-instance generation. We show that ContextGen sets a new state-of-the-art, outperforming existing methods in control precision, identity fidelity, and overall visual quality.
arXiv Detail & Related papers (2025-10-13T04:21:19Z)
- Multimodal Representation Alignment for Image Generation: Text-Image Interleaved Control Is Easier Than You Think [38.258453761376586]
We propose Dream Engine, an efficient framework designed for arbitrary text-image interleaved control in image generation models. Our approach utilizes a two-stage training paradigm, consisting of joint text-image alignment and multimodal interleaved instruction tuning. Our experiments demonstrate that this training method is effective, achieving a 0.69 overall score on the GenEval benchmark.
arXiv Detail & Related papers (2025-02-27T15:08:39Z)
- UNIC-Adapter: Unified Image-instruction Adapter with Multi-modal Transformer for Image Generation [64.8341372591993]
We propose a new approach to unify controllable generation within a single framework. Specifically, we propose the unified image-instruction adapter (UNIC-Adapter) built on the Multi-Modal-Diffusion Transformer architecture. Our UNIC-Adapter effectively extracts multi-modal instruction information by incorporating both conditional images and task instructions.
arXiv Detail & Related papers (2024-12-25T15:19:02Z)
- CreatiLayout: Siamese Multimodal Diffusion Transformer for Creative Layout-to-Image Generation [78.21134311493303]
Diffusion models have been recognized for their ability to generate images that are not only visually appealing but also of high artistic quality. Layout-to-image generation has been proposed to leverage region-specific positions and descriptions to enable more precise and controllable generation. We present a systematic solution that integrates the layout model, dataset, and planner for creative layout-to-image generation.
arXiv Detail & Related papers (2024-12-05T04:09:47Z)
- OmniBooth: Learning Latent Control for Image Synthesis with Multi-modal Instruction [32.08995899903304]
We present OmniBooth, an image generation framework that enables spatial control with instance-level multi-modal customization.
Our approach significantly expands the scope of text-to-image generation, and elevates it to a more versatile and practical dimension in controllability.
arXiv Detail & Related papers (2024-10-07T11:26:13Z)
- Kosmos-G: Generating Images in Context with Multimodal Large Language Models [117.0259361818715]
Current subject-driven image generation methods require test-time tuning and cannot accept interleaved multi-image and text input.
This paper presents Kosmos-G, a model that leverages the advanced multimodal perception capabilities of Multimodal Large Language Models.
Kosmos-G demonstrates an impressive capability of zero-shot subject-driven generation with interleaved multi-image and text input.
arXiv Detail & Related papers (2023-10-04T17:28:44Z)
This list is automatically generated from the titles and abstracts of the papers on this site.