UniReason 1.0: A Unified Reasoning Framework for World Knowledge Aligned Image Generation and Editing
- URL: http://arxiv.org/abs/2602.02437v3
- Date: Sat, 07 Feb 2026 16:35:02 GMT
- Title: UniReason 1.0: A Unified Reasoning Framework for World Knowledge Aligned Image Generation and Editing
- Authors: Dianyi Wang, Chaofan Ma, Feng Han, Size Wu, Wei Song, Yibin Wang, Zhixiong Zhang, Tianhang Wang, Siyuan Wang, Zhongyu Wei, Jiaqi Wang,
- Abstract summary: multimodal models often struggle with complex synthesis tasks that demand deep reasoning.<n>We propose UniReason, a unified framework that harmonizes text-to-image generation and image editing.<n>We support this framework by systematically constructing a large-scale reasoning-centric dataset.
- Score: 44.071171929398076
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Unified multimodal models often struggle with complex synthesis tasks that demand deep reasoning, and typically treat text-to-image generation and image editing as isolated capabilities rather than interconnected reasoning steps. To address this, we propose UniReason, a unified framework that harmonizes these two tasks through two complementary reasoning paradigms. We incorporate world knowledge-enhanced textual reasoning into generation to infer implicit knowledge, and leverage editing capabilities for fine-grained editing-like visual refinement to further correct visual errors via self-reflection. This approach unifies generation and editing within a shared architecture, mirroring the human cognitive process of planning followed by refinement. We support this framework by systematically constructing a large-scale reasoning-centric dataset (~300k samples) covering five major knowledge domains (e.g., cultural commonsense, physics, etc.) for textual reasoning, alongside an agent-generated corpus for visual refinement. Extensive experiments demonstrate that UniReason achieves advanced performance on reasoning-intensive benchmarks such as WISE, KrisBench and UniREditBench, while maintaining superior general synthesis capabilities.
Related papers
- Forge-and-Quench: Enhancing Image Generation for Higher Fidelity in Unified Multimodal Models [23.529904770014735]
This paper introduces a novel perspective: leveraging understanding to enhance the fidelity and detail richness of generated images.<n>We propose Forge-and-Quench, a new unified framework that puts this principle into practice.<n>Experiments show that Forge-and-Quench significantly improves image fidelity and detail across multiple models.
arXiv Detail & Related papers (2026-01-08T08:18:44Z) - Unified Thinker: A General Reasoning Modular Core for Image Generation [57.665309753609144]
We propose Unified Thinker, a task-agnostic reasoning architecture for general image generation.<n>Unified Thinker decouples a dedicated Thinker from the image Generator, enabling modular upgrades of reasoning without retraining the entire generative model.<n>Experiments on text-to-image generation and image editing show that Unified Thinker substantially improves image reasoning and generation quality.
arXiv Detail & Related papers (2026-01-06T15:59:33Z) - ReViSE: Towards Reason-Informed Video Editing in Unified Models with Self-Reflective Learning [57.08352504712699]
Video unified models exhibit strong capabilities in understanding and generation, yet they struggle with reason-informed visual editing.<n>We introduce the Reason-Informed Video Editing (RVE) task, which requires reasoning about physical plausibility and causal dynamics during editing.<n>We propose ReViSE, a framework that unifies generation and evaluation within a single architecture.
arXiv Detail & Related papers (2025-12-10T18:57:09Z) - UniREditBench: A Unified Reasoning-based Image Editing Benchmark [52.54256348710893]
This work proposes UniREditBench, a unified benchmark for reasoning-based image editing evaluation.<n>It comprises 2,700 meticulously curated samples, covering both real- and game-world scenarios across 8 primary dimensions and 18 sub-dimensions.<n>We fine-tune Bagel on this dataset and develop UniREdit-Bagel, demonstrating substantial improvements in both in-domain and out-of-distribution settings.
arXiv Detail & Related papers (2025-11-03T07:24:57Z) - Uni-MMMU: A Massive Multi-discipline Multimodal Unified Benchmark [69.8473923357969]
Unified multimodal models aim to jointly enable visual understanding and generation, yet current benchmarks rarely examine their true integration.<n>We present Uni-MMMU, a comprehensive benchmark that unfolds the bidirectional synergy between generation and understanding across eight reasoning-centric domains.
arXiv Detail & Related papers (2025-10-15T17:10:35Z) - Factuality Matters: When Image Generation and Editing Meet Structured Visuals [46.627460447235855]
We construct a large-scale dataset of 1.3 million high-quality structured image pairs.<n>We train a unified model that integrates a VLM with FLUX.1 Kontext.<n>A three-stage training curriculum enables progressive feature alignment, knowledge infusion, and reasoning-augmented generation.
arXiv Detail & Related papers (2025-10-06T17:56:55Z) - Unified Multimodal Model as Auto-Encoder [69.38946823657592]
We introduce a paradigm regarding understanding as the encoder (I2T) that compresses images into text, and generation as the decoder (T2I) that reconstructs images from that text.<n>Our empirical results suggest that understanding can largely enhance generation (verified on GenEval), while generation, in turn, notably strengthens fine-grained visual perception.
arXiv Detail & Related papers (2025-09-11T17:57:59Z) - Interleaving Reasoning for Better Text-to-Image Generation [83.69082794730664]
We introduce Interleaving Reasoning Generation (IRG), a framework that alternates between text-based thinking and image synthesis.<n>To train IRG effectively, we propose Interleaving Reasoning Generation Learning (IRGL), which targets two sub-goals.<n>Experiments show SoTA performance, yielding absolute gains of 5-10 points on GenEval, WISE, TIIF, GenAI-Bench, and OneIG-EN.
arXiv Detail & Related papers (2025-09-08T17:56:23Z) - Video Event Reasoning and Prediction by Fusing World Knowledge from LLMs with Vision Foundation Models [10.1080193179562]
Current understanding models excel at recognizing "what" but fall short in high-level cognitive tasks like causal reasoning and future prediction.<n>We propose a novel framework that fuses a powerful Vision Foundation Model for deep visual perception with a Large Language Model (LLM) serving as a knowledge-driven reasoning core.
arXiv Detail & Related papers (2025-07-08T09:43:17Z) - UniFork: Exploring Modality Alignment for Unified Multimodal Understanding and Generation [39.921363034430875]
Unified image understanding and generation has emerged as a promising paradigm in multimodal artificial intelligence.<n>We study the modality alignment behaviors of task-specific expert models for understanding and generation.<n>We introduce UniFork, a novel Y-shaped architecture that shares the shallow layers for cross-task representation learning, while employing task-specific branches in deeper layers to avoid task interference.
arXiv Detail & Related papers (2025-06-20T17:52:31Z) - R-Genie: Reasoning-Guided Generative Image Editing [41.87126578621796]
We introduce a new image editing paradigm: reasoning-guided generative editing, which synthesizes images based on complex, multi-faceted textual queries.<n>R-Genie is a reasoning-guided generative image editor, which synergizes the generation power of diffusion models with advanced reasoning capabilities.
arXiv Detail & Related papers (2025-05-23T11:41:26Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.