Related papers: Unified Thinker: A General Reasoning Modular Core for Image Generation

Unified Thinker: A General Reasoning Modular Core for Image Generation

URL: http://arxiv.org/abs/2601.03127v1
Date: Tue, 06 Jan 2026 15:59:33 GMT
Title: Unified Thinker: A General Reasoning Modular Core for Image Generation
Authors: Sashuai Zhou, Qiang Zhou, Jijin Hu, Hanqing Yang, Yue Cao, Junpeng Ma, Yinchao Ma, Jun Song, Tiezheng Ge, Cheng Yu, Bo Zheng, Zhou Zhao,
Abstract summary: We propose Unified Thinker, a task-agnostic reasoning architecture for general image generation.<n>Unified Thinker decouples a dedicated Thinker from the image Generator, enabling modular upgrades of reasoning without retraining the entire generative model.<n>Experiments on text-to-image generation and image editing show that Unified Thinker substantially improves image reasoning and generation quality.
Score: 57.665309753609144
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Despite impressive progress in high-fidelity image synthesis, generative models still struggle with logic-intensive instruction following, exposing a persistent reasoning--execution gap. Meanwhile, closed-source systems (e.g., Nano Banana) have demonstrated strong reasoning-driven image generation, highlighting a substantial gap to current open-source models. We argue that closing this gap requires not merely better visual generators, but executable reasoning: decomposing high-level intents into grounded, verifiable plans that directly steer the generative process. To this end, we propose Unified Thinker, a task-agnostic reasoning architecture for general image generation, designed as a unified planning core that can plug into diverse generators and workflows. Unified Thinker decouples a dedicated Thinker from the image Generator, enabling modular upgrades of reasoning without retraining the entire generative model. We further introduce a two-stage training paradigm: we first build a structured planning interface for the Thinker, then apply reinforcement learning to ground its policy in pixel-level feedback, encouraging plans that optimize visual correctness over textual plausibility. Extensive experiments on text-to-image generation and image editing show that Unified Thinker substantially improves image reasoning and generation quality.

Related papers

RL-RIG: A Generative Spatial Reasoner via Intrinsic Reflection [18.52946282633359]
RL-RIG is a Reinforcement Learning framework for Reflection-based Image Generation.<n>We develop Reflection-GRPO to train the VLM Actor for edit prompts and the Image Editor for better image quality under a given prompt.<n> Experimental results show that RL-RIG outperforms existing state-of-the-art open-source models by up to 11% in terms of controllable and precise spatial reasoning in image generation.
arXiv Detail & Related papers (2026-02-23T15:39:53Z)
Mind-Brush: Integrating Agentic Cognitive Search and Reasoning into Image Generation [47.97278965762397]
We introduce Mind-Brush, a unified agentic framework that transforms generation into a dynamic, knowledge-driven workflow.<n>Simulating a human-like 'think-research-create' paradigm, Mind-Brush actively retrieves multimodal evidence to ground out-of-distribution concepts.<n>Extensive experiments demonstrate that Mind-Brush significantly enhances the capabilities of unified models.
arXiv Detail & Related papers (2026-02-02T07:42:13Z)
HiCoGen: Hierarchical Compositional Text-to-Image Generation in Diffusion Models via Reinforcement Learning [66.99487505369254]
HiCoGen is built upon a novel Chain of Synthesis paradigm.<n>It decomposes complex prompts into minimal semantic units.<n>It then synthesizes these units iteratively, where the image generated in each step provides crucial visual context for the next.<n>Experiments show our approach significantly outperforms existing methods in both concept coverage and compositional accuracy.
arXiv Detail & Related papers (2025-11-25T06:24:25Z)
ThinkFake: Reasoning in Multimodal Large Language Models for AI-Generated Image Detection [51.93101033997245]
Increasing realism of AI-generated images has raised serious concerns about misinformation and privacy violations.<n>We propose ThinkFake, a novel reasoning-based and generalizable framework for AI-generated image detection.<n>We show that ThinkFake outperforms state-of-the-art methods on the GenImage benchmark and demonstrates strong zero-shot generalization on the challenging LOKI benchmark.
arXiv Detail & Related papers (2025-09-24T07:34:09Z)
Understanding-in-Generation: Reinforcing Generative Capability of Unified Model via Infusing Understanding into Generation [43.98469957837991]
We propose a novel reasoning framework for unified models, Understanding-in-Generation (UiG)<n>The core insight of our UiG is to integrate generative guidance by the strong understanding capabilities during the reasoning process.<n>Our UiG framework demonstrates a significant performance improvement in text-to-image generation over existing text-to-image reasoning methods.
arXiv Detail & Related papers (2025-09-23T04:52:39Z)
Interleaving Reasoning for Better Text-to-Image Generation [83.69082794730664]
We introduce Interleaving Reasoning Generation (IRG), a framework that alternates between text-based thinking and image synthesis.<n>To train IRG effectively, we propose Interleaving Reasoning Generation Learning (IRGL), which targets two sub-goals.<n>Experiments show SoTA performance, yielding absolute gains of 5-10 points on GenEval, WISE, TIIF, GenAI-Bench, and OneIG-EN.
arXiv Detail & Related papers (2025-09-08T17:56:23Z)
ReasonGen-R1: CoT for Autoregressive Image generation models through SFT and RL [54.100889131719626]
Chain-of-thought reasoning and reinforcement learning have driven breakthroughs in NLP.<n>We introduce ReasonGen-R1, a framework that imbues an autoregressive image generator with explicit text-based "thinking" skills.<n>We show that ReasonGen-R1 consistently outperforms strong baselines and prior state-of-the-art models.
arXiv Detail & Related papers (2025-05-30T17:59:48Z)
Muddit: Liberating Generation Beyond Text-to-Image with a Unified Discrete Diffusion Model [118.52589065972795]
We introduce Muddit, a unified discrete diffusion transformer that enables fast and parallel generation across both text and image modalities.<n>Unlike prior unified diffusion models trained from scratch, Muddit integrates strong visual priors from a pretrained text-to-image backbone with a lightweight text decoder.
arXiv Detail & Related papers (2025-05-29T16:15:48Z)
Thinking with Generated Images [30.28526622443551]
We present Thinking with Generated Images, a novel paradigm that transforms how large multimodal models (LMMs) engage with visual reasoning.<n>Our approach enables AI models to engage in the kind of visual imagination and iterative refinement that characterizes human creative, analytical, and strategic thinking.
arXiv Detail & Related papers (2025-05-28T16:12:45Z)
Autoregressive Image Generation with Vision Full-view Prompt [18.569610688433745]
We propose Vision Full-view prompt (VF prompt) to enhance autoregressive image generation.<n>Inspired by prompt engineering from the field of NLP, we propose Vision Full-view prompt (VF prompt) to enhance autoregressive image generation.
arXiv Detail & Related papers (2025-02-24T08:44:01Z)

This list is automatically generated from the titles and abstracts of the papers in this site.