MoGen: A Unified Collaborative Framework for Controllable Multi-Object Image Generation
- URL: http://arxiv.org/abs/2601.05546v1
- Date: Fri, 09 Jan 2026 05:57:48 GMT
- Title: MoGen: A Unified Collaborative Framework for Controllable Multi-Object Image Generation
- Authors: Yanfeng Li, Yue Sun, Keren Fu, Sio-Kei Im, Xiaoming Liu, Guangtao Zhai, Xiaohong Liu, Tao Tan
- Abstract summary: MoGen is a user-friendly multi-object image generation method. First, we design a Regional Semantic Anchor (RSA) module that precisely anchors phrase units in language descriptions to their corresponding image regions. Second, we introduce an Adaptive Multi-modal Guidance (AMG) module that adaptively parses and integrates various combinations of multi-source control signals.
- Score: 76.94658056824422
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Existing multi-object image generation methods struggle to precisely align localized generation regions with their corresponding semantics in the language description, frequently resulting in inconsistent object quantities and attribute aliasing. To mitigate this limitation, mainstream approaches typically rely on external control signals to explicitly constrain the spatial layout, local semantics, and visual attributes of images. However, this strong dependency makes the input format rigid, rendering it incompatible with users' heterogeneous resource conditions and diverse constraint requirements. To address these challenges, we propose MoGen, a user-friendly multi-object image generation method. First, we design a Regional Semantic Anchor (RSA) module that precisely anchors phrase units in language descriptions to their corresponding image regions during generation, enabling text-to-image generation that follows quantity specifications for multiple objects. Building on this foundation, we introduce an Adaptive Multi-modal Guidance (AMG) module, which adaptively parses various combinations of multi-source control signals and integrates them into a structured intent. This intent then selectively constrains scene layouts and object attributes, achieving dynamic fine-grained control. Experimental results demonstrate that MoGen significantly outperforms existing methods in generation quality, quantity consistency, and fine-grained control, while exhibiting superior accessibility and control flexibility. Code is available at: https://github.com/Tear-kitty/MoGen/tree/master.
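The abstract only names the two modules, so a brief illustrative sketch may help. The Python snippet below shows one plausible reading of the RSA idea: cross-attention logits are biased so that each phrase's tokens attend inside the image region they are anchored to and are suppressed elsewhere. The function name, mask construction, and bias strength are assumptions for illustration, not the authors' implementation; see the linked repository for the actual code.

```python
import torch
import torch.nn.functional as F

def rsa_attention(attn_logits: torch.Tensor,
                  phrase_region_mask: torch.Tensor,
                  strength: float = 5.0) -> torch.Tensor:
    """Hypothetical RSA-style anchoring: bias cross-attention so each
    phrase's tokens attend inside their anchored image region.

    attn_logits:        (num_patches, num_tokens) raw cross-attention logits
    phrase_region_mask: (num_patches, num_tokens) binary mask, 1 where the
                        token's phrase is anchored to that patch
    """
    bias = strength * (2.0 * phrase_region_mask - 1.0)  # +s inside, -s outside
    return F.softmax(attn_logits + bias, dim=-1)

# Toy example: 4 image patches, 3 text tokens. Token 0 ("cat") is anchored
# to patches 0-1; tokens 1-2 ("red ball") to patches 2-3.
logits = torch.randn(4, 3)
mask = torch.tensor([[1., 0., 0.],
                     [1., 0., 0.],
                     [0., 1., 1.],
                     [0., 1., 1.]])
print(rsa_attention(logits, mask))
```

Likewise, AMG's "structured intent" can be pictured as a container populated from whichever subset of control signals the user supplies, with guidance skipping whatever is absent. All names below are hypothetical:

```python
from dataclasses import dataclass
from typing import Optional, Sequence

@dataclass
class StructuredIntent:
    # Hypothetical container: only the signals the user actually supplies
    # are populated; guidance skips fields left as None.
    layout_boxes: Optional[Sequence] = None       # per-object (x1, y1, x2, y2)
    attribute_phrases: Optional[Sequence] = None  # per-object attribute text
    reference_images: Optional[Sequence] = None   # per-object visual exemplars

def parse_controls(boxes=None, attributes=None, refs=None) -> StructuredIntent:
    """Fold whatever combination of control signals is given into one intent."""
    return StructuredIntent(layout_boxes=boxes,
                            attribute_phrases=attributes,
                            reference_images=refs)

# A layout-only user constrains positions but leaves attributes free:
print(parse_controls(boxes=[(0.1, 0.2, 0.4, 0.8), (0.5, 0.3, 0.9, 0.9)]))
```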
Related papers
- Canvas-to-Image: Compositional Image Generation with Multimodal Controls [51.44122945214702]
We introduce Canvas-to-Image, a unified framework that consolidates heterogeneous controls into a single canvas interface. Our key idea is to encode diverse control signals into a single composite canvas image that the model can interpret for integrated visual-spatial reasoning.
arXiv Detail & Related papers (2025-11-26T18:59:56Z)
- ConsistCompose: Unified Multimodal Layout Control for Image Composition [56.909072845166264]
We present ConsistCompose, a unified framework that embeds layout coordinates directly into language prompts. We show that ConsistCompose substantially improves spatial accuracy over layout-controlled baselines.
arXiv Detail & Related papers (2025-11-23T08:14:53Z)
- Condition Weaving Meets Expert Modulation: Towards Universal and Controllable Image Generation [17.898556887669997]
We propose a Unified image-to-image Generation (UniGen) framework that supports diverse conditional inputs. A Condition Modulated Expert (CoMoE) module aggregates semantically similar patch features for visual representation and conditional modeling. We also propose WeaveNet, a dynamic, snake-like connection mechanism that enables effective interaction between global text-level control from the backbone and fine-grained control from conditional branches.
arXiv Detail & Related papers (2025-08-24T13:47:10Z)
- ControlThinker: Unveiling Latent Semantics for Controllable Image Generation through Visual Reasoning [76.2503352325492]
ControlThinker is a novel framework that employs a "comprehend-then-generate" paradigm. Latent semantics from control images are mined to enrich text prompts. This enriched semantic understanding then seamlessly aids image generation without the need for additional complex modifications.
arXiv Detail & Related papers (2025-06-04T05:56:19Z)
- UNIC-Adapter: Unified Image-instruction Adapter with Multi-modal Transformer for Image Generation [64.8341372591993]
We propose a new approach to unify controllable generation within a single framework. Specifically, we propose the unified image-instruction adapter (UNIC-Adapter), built on the Multi-Modal Diffusion Transformer architecture. Our UNIC-Adapter effectively extracts multi-modal instruction information by incorporating both conditional images and task instructions.
arXiv Detail & Related papers (2024-12-25T15:19:02Z)
- Generating Compositional Scenes via Text-to-image RGBA Instance Generation [82.63805151691024]
Text-to-image diffusion models can generate high-quality images at the cost of tedious prompt engineering. We propose a novel multi-stage generation paradigm designed for fine-grained control, flexibility, and interactivity. Our experiments show that our RGBA diffusion model is capable of generating diverse, high-quality instances with precise control over object attributes.
arXiv Detail & Related papers (2024-11-16T23:44:14Z)
- OmniBooth: Learning Latent Control for Image Synthesis with Multi-modal Instruction [32.08995899903304]
We present OmniBooth, an image generation framework that enables spatial control with instance-level multi-modal customization. Our approach significantly expands the scope of text-to-image generation, making it markedly more versatile and practical in its controllability.
arXiv Detail & Related papers (2024-10-07T11:26:13Z)