Mind-Brush: Integrating Agentic Cognitive Search and Reasoning into Image Generation
- URL: http://arxiv.org/abs/2602.01756v1
- Date: Mon, 02 Feb 2026 07:42:13 GMT
- Title: Mind-Brush: Integrating Agentic Cognitive Search and Reasoning into Image Generation
- Authors: Jun He, Junyan Ye, Zilong Huang, Dongzhi Jiang, Chenjue Zhang, Leqi Zhu, Renrui Zhang, Xiang Zhang, Weijia Li,
- Abstract summary: We introduce Mind-Brush, a unified agentic framework that transforms generation into a dynamic, knowledge-driven workflow. Simulating a human-like 'think-research-create' paradigm, Mind-Brush actively retrieves multimodal evidence to ground out-of-distribution concepts. Extensive experiments demonstrate that Mind-Brush significantly enhances the capabilities of unified models.
- Score: 47.97278965762397
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: While text-to-image generation has achieved unprecedented fidelity, the vast majority of existing models function fundamentally as static text-to-pixel decoders. Consequently, they often fail to grasp implicit user intentions. Although emerging unified understanding-generation models have improved intent comprehension, they still struggle to accomplish tasks involving complex knowledge reasoning within a single model. Moreover, constrained by static internal priors, these models remain unable to adapt to the evolving dynamics of the real world. To bridge these gaps, we introduce Mind-Brush, a unified agentic framework that transforms generation into a dynamic, knowledge-driven workflow. Simulating a human-like 'think-research-create' paradigm, Mind-Brush actively retrieves multimodal evidence to ground out-of-distribution concepts and employs reasoning tools to resolve implicit visual constraints. To rigorously evaluate these capabilities, we propose Mind-Bench, a comprehensive benchmark comprising 500 distinct samples spanning real-time news, emerging concepts, and reasoning domains such as mathematical and Geo-Reasoning tasks. Extensive experiments demonstrate that Mind-Brush significantly enhances the capabilities of unified models, realizing a zero-to-one capability leap for the Qwen-Image baseline on Mind-Bench, while achieving superior results on established benchmarks such as WISE and RISE.
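The abstract describes the 'think-research-create' workflow only at a high level. Below is a minimal, hypothetical Python sketch of what such an agentic loop could look like; all function names, data structures, and the stubbed retrieval/generation steps are illustrative assumptions, not the paper's implementation.

```python
# Hypothetical sketch of a "think-research-create" agentic generation loop.
# `think`, `research`, and `create` are illustrative stubs, not Mind-Brush's API.

from dataclasses import dataclass, field
from typing import List


@dataclass
class Evidence:
    source: str
    content: str


@dataclass
class WorkingMemory:
    prompt: str
    plan: List[str] = field(default_factory=list)
    evidence: List[Evidence] = field(default_factory=list)


def think(memory: WorkingMemory) -> List[str]:
    """Decompose the prompt into open questions the model's static priors
    cannot answer (e.g. out-of-distribution concepts, implicit constraints)."""
    return [f"What knowledge does '{memory.prompt}' require beyond the model's priors?"]


def research(questions: List[str]) -> List[Evidence]:
    """Actively gather evidence for each open question. A real system would
    call multimodal search and reasoning tools; here a stub returns placeholders."""
    return [Evidence(source="stub-search", content=f"retrieved context for: {q}")
            for q in questions]


def create(memory: WorkingMemory) -> str:
    """Ground the original prompt in the gathered evidence and hand the
    knowledge-augmented prompt to an image generator (stubbed as a string)."""
    grounding = " | ".join(e.content for e in memory.evidence)
    return f"<image from prompt: '{memory.prompt}' grounded in: {grounding}>"


def think_research_create(prompt: str, max_rounds: int = 2) -> str:
    memory = WorkingMemory(prompt=prompt)
    for _ in range(max_rounds):
        memory.plan = think(memory)
        memory.evidence.extend(research(memory.plan))
    return create(memory)


if __name__ == "__main__":
    print(think_research_create("A poster about yesterday's rocket launch"))
```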
Related papers
- DeepImageSearch: Benchmarking Multimodal Agents for Context-Aware Image Retrieval in Visual Histories [52.57197752244638]
We introduce DeepImageSearch, a novel agentic paradigm that reformulates image retrieval as an autonomous exploration task. Models must plan and perform multi-step reasoning over raw visual histories to locate targets based on implicit contextual cues. We construct DISBench, a challenging benchmark built on interconnected visual data.
arXiv Detail & Related papers (2026-02-11T12:51:10Z) - MentisOculi: Revealing the Limits of Reasoning with Mental Imagery [63.285794947638614]
We develop MentisOculi, a suite of multi-step reasoning problems amenable to visual solution. Evaluating visual strategies ranging from latent tokens to explicit generated imagery, we find they generally fail to improve performance. Our findings suggest that despite their inherent appeal, visual thoughts do not yet benefit model reasoning.
arXiv Detail & Related papers (2026-02-02T18:49:06Z) - Reasoning Models Generate Societies of Thought [9.112083442162671]
We show that enhanced reasoning emerges from simulating multi-agent-like interactions. We find that reasoning models like DeepSeek-R1 and QwQ-32B exhibit much greater perspective diversity than instruction-tuned models.
arXiv Detail & Related papers (2026-01-15T19:52:33Z) - Guiding the Inner Eye: A Framework for Hierarchical and Flexible Visual Grounded Reasoning [6.800544911407401]
GRiP (Guided Reasoning and Perception) is a novel training framework for visual grounded reasoning. GRiP cultivates robust and flexible visual grounded reasoning by explicitly guiding the model's perceptual focus and logical pathways. GRiP achieves state-of-the-art results among open-source models on the highly challenging TreeBench and V* Bench.
arXiv Detail & Related papers (2025-11-27T07:18:25Z) - Uni-MMMU: A Massive Multi-discipline Multimodal Unified Benchmark [69.8473923357969]
Unified multimodal models aim to jointly enable visual understanding and generation, yet current benchmarks rarely examine their true integration. We present Uni-MMMU, a comprehensive benchmark that unfolds the bidirectional synergy between generation and understanding across eight reasoning-centric domains.
arXiv Detail & Related papers (2025-10-15T17:10:35Z) - Reinforcing Spatial Reasoning in Vision-Language Models with Interwoven Thinking and Visual Drawing [62.447497430479174]
Drawing to reason in space is a novel paradigm that enables LVLMs to reason through elementary drawing operations in the visual space. Our model, named VILASR, consistently outperforms existing methods across diverse spatial reasoning benchmarks.
arXiv Detail & Related papers (2025-06-11T17:41:50Z) - Thinking with Generated Images [30.28526622443551]
We present Thinking with Generated Images, a novel paradigm that transforms how large multimodal models (LMMs) engage with visual reasoning. Our approach enables AI models to engage in the kind of visual imagination and iterative refinement that characterizes human creative, analytical, and strategic thinking.
arXiv Detail & Related papers (2025-05-28T16:12:45Z) - Enhancing Zero-Shot Image Recognition in Vision-Language Models through Human-like Concept Guidance [41.6755826072905]
In zero-shot image recognition tasks, humans demonstrate remarkable flexibility in classifying unseen categories. Existing vision-language models often underperform in real-world applications because of sub-optimal prompt engineering. We propose a Concept-guided Human-like Bayesian Reasoning framework to address these issues.
arXiv Detail & Related papers (2025-03-20T06:20:13Z) - ViRAC: A Vision-Reasoning Agent Head Movement Control Framework in Arbitrary Virtual Environments [0.13654846342364302]
We propose ViRAC, which exploits the common-sense knowledge and reasoning capabilities of large-scale models. ViRAC produces more natural and context-aware head rotations than recent state-of-the-art techniques.
arXiv Detail & Related papers (2025-02-14T09:46:43Z) - On the Fairness, Diversity and Reliability of Text-to-Image Generative Models [68.62012304574012]
Multimodal generative models have sparked critical discussions on their reliability, fairness and potential for misuse. We propose an evaluation framework to assess model reliability by analyzing responses to global and local perturbations in the embedding space. Our method lays the groundwork for detecting unreliable, bias-injected models and tracing the provenance of embedded biases.
arXiv Detail & Related papers (2024-11-21T09:46:55Z) - WenLan 2.0: Make AI Imagine via a Multimodal Foundation Model [74.4875156387271]
We develop a novel foundation model pre-trained with huge multimodal (visual and textual) data.
We show that state-of-the-art results can be obtained on a wide range of downstream tasks.
arXiv Detail & Related papers (2021-10-27T12:25:21Z)
This list is automatically generated from the titles and abstracts of the papers in this site.