CoLoGen: Progressive Learning of Concept-Localization Duality for Unified Image Generation
- URL: http://arxiv.org/abs/2602.22150v2
- Date: Thu, 26 Feb 2026 06:25:16 GMT
- Title: CoLoGen: Progressive Learning of Concept-Localization Duality for Unified Image Generation
- Authors: YuXin Song, Yu Lu, Haoyuan Sun, Huanjin Yao, Fanglong Liu, Yifan Sun, Haocheng Feng, Hang Zhou, Jingdong Wang
- Abstract summary: CoLoGen is a unified diffusion framework that progressively learns and reconciles concept-localization duality.
CoLoGen uses a staged curriculum that first builds core conceptual and localization abilities, then adapts them to diverse visual conditions, and finally refines their synergy for complex instruction-driven tasks.
Experiments on editing, controllable generation, and customized generation show that CoLoGen achieves competitive or superior performance.
- Score: 55.409963941827044
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Unified conditional image generation remains difficult because different tasks depend on fundamentally different internal representations. Some require conceptual understanding for semantic synthesis, while others rely on localization cues for spatial precision. Forcing these heterogeneous tasks to share a single representation leads to concept-localization representational conflict. To address this issue, we propose CoLoGen, a unified diffusion framework that progressively learns and reconciles this concept-localization duality. CoLoGen uses a staged curriculum that first builds core conceptual and localization abilities, then adapts them to diverse visual conditions, and finally refines their synergy for complex instruction-driven tasks. Central to this process is the Progressive Representation Weaving (PRW) module, which dynamically routes features to specialized experts and stably integrates their outputs across stages. Experiments on editing, controllable generation, and customized generation show that CoLoGen achieves competitive or superior performance, offering a principled representational perspective for unified image generation.
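The abstract describes the PRW module as dynamically routing features to specialized experts and blending their outputs. The paper does not give implementation details, so the following is only a minimal illustrative sketch of gated two-expert routing in that spirit: a learned gate assigns each token soft weights over a "concept" expert and a "localization" expert, and the outputs are blended by those weights. All names, shapes, and the numpy formulation are assumptions, not the authors' code.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

class TwoExpertRouter:
    """Illustrative gated routing over two specialized experts.

    A gate produces per-token weights over the experts; each expert
    transforms the features; outputs are blended by the gate weights.
    """

    def __init__(self, dim):
        self.gate = rng.normal(0, 0.02, (dim, 2))         # token -> 2 gate logits
        self.concept_w = rng.normal(0, 0.02, (dim, dim))  # "concept" expert (assumed)
        self.local_w = rng.normal(0, 0.02, (dim, dim))    # "localization" expert (assumed)

    def __call__(self, x):  # x: (tokens, dim)
        w = softmax(x @ self.gate)                        # (tokens, 2) routing weights
        y = np.stack([x @ self.concept_w, x @ self.local_w], axis=1)  # (tokens, 2, dim)
        return (w[..., None] * y).sum(axis=1)             # gate-weighted blend

router = TwoExpertRouter(dim=8)
x = rng.normal(size=(4, 8))
out = router(x)
print(out.shape)  # (4, 8): same shape as the input features
```

In a trained system the gate and expert weights would be learned end to end; the point of the sketch is only that routing lets heterogeneous tasks draw on different representations without forcing one shared pathway.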
Related papers
- PosterOmni: Generalized Artistic Poster Creation via Task Distillation and Unified Reward Feedback [30.88155039139322]
PosterOmni is a generalized artistic poster creation framework.
It integrates the two regimes, namely local editing and global creation, within a single system.
It significantly enhances reference adherence, global composition quality, and aesthetic harmony.
arXiv Detail & Related papers (2026-02-12T16:16:38Z)
- UniReason 1.0: A Unified Reasoning Framework for World Knowledge Aligned Image Generation and Editing [44.071171929398076]
Multimodal models often struggle with complex synthesis tasks that demand deep reasoning.
We propose UniReason, a unified framework that harmonizes text-to-image generation and image editing.
We support this framework by systematically constructing a large-scale reasoning-centric dataset.
arXiv Detail & Related papers (2026-02-02T18:34:35Z)
- HiCoGen: Hierarchical Compositional Text-to-Image Generation in Diffusion Models via Reinforcement Learning [66.99487505369254]
HiCoGen is built upon a novel Chain of Synthesis paradigm.
It decomposes complex prompts into minimal semantic units.
It then synthesizes these units iteratively, where the image generated in each step provides crucial visual context for the next.
Experiments show our approach significantly outperforms existing methods in both concept coverage and compositional accuracy.
arXiv Detail & Related papers (2025-11-25T06:24:25Z)
- Ming-UniVision: Joint Image Understanding and Generation with a Unified Continuous Tokenizer [50.69959748410398]
We introduce MingTok, a new family of visual tokenizers with a continuous latent space for unified autoregressive generation and understanding.
MingTok adopts a three-stage sequential architecture involving low-level encoding, semantic expansion, and visual reconstruction.
Built on top of it, Ming-UniVision eliminates the need for task-specific visual representations, and unifies diverse vision-language tasks under a single autoregressive prediction paradigm.
arXiv Detail & Related papers (2025-10-08T02:50:14Z)
- Neural Scene Designer: Self-Styled Semantic Image Manipulation [67.43125248646653]
We introduce the Neural Scene Designer (NSD), a novel framework that enables photo-realistic manipulation of user-specified scene regions.
NSD ensures both semantic alignment with user intent and stylistic consistency with the surrounding environment.
To capture fine-grained style representations, we propose the Progressive Self-style Representational Learning (PSRL) module.
arXiv Detail & Related papers (2025-09-01T11:59:03Z)
- Subject-Consistent and Pose-Diverse Text-to-Image Generation [36.67159307721023]
We propose a subject-Consistent and pose-Diverse T2I framework, dubbed CoDi.
It enables consistent subject generation with diverse pose and layout.
CoDi achieves both better visual perception and stronger performance across all metrics.
arXiv Detail & Related papers (2025-07-11T08:15:56Z)
- Harmonizing Visual Representations for Unified Multimodal Understanding and Generation [53.01486796503091]
We present Harmon, a unified autoregressive framework that harmonizes understanding and generation tasks with a shared MAR encoder.
Harmon achieves state-of-the-art image generation results on the GenEval, MJHQ-30K and WISE benchmarks.
arXiv Detail & Related papers (2025-03-27T20:50:38Z)
- Unified Autoregressive Visual Generation and Understanding with Continuous Tokens [52.21981295470491]
We present UniFluid, a unified autoregressive framework for joint visual generation and understanding.
Our unified autoregressive architecture processes multimodal image and text inputs, generating discrete tokens for text and continuous tokens for images.
We find that, though there is an inherent trade-off between the image generation and understanding tasks, a carefully tuned training recipe enables them to improve each other.
arXiv Detail & Related papers (2025-03-17T17:58:30Z)
- SpotActor: Training-Free Layout-Controlled Consistent Image Generation [43.2870588035256]
We present a new formalization of dual energy guidance with optimization in a dual semantic-latent space.
We propose a training-free pipeline, SpotActor, which features a layout-conditioned backward update stage and a consistent forward sampling stage.
The results prove that SpotActor fulfills the expectations of this task and showcases the potential for practical applications.
arXiv Detail & Related papers (2024-09-07T11:52:48Z)
This list is automatically generated from the titles and abstracts of the papers in this site.