ReasonEdit: Towards Reasoning-Enhanced Image Editing Models
- URL: http://arxiv.org/abs/2511.22625v2
- Date: Mon, 01 Dec 2025 13:51:36 GMT
- Title: ReasonEdit: Towards Reasoning-Enhanced Image Editing Models
- Authors: Fukun Yin, Shiyu Liu, Yucheng Han, Zhibo Wang, Peng Xing, Rui Wang, Wei Cheng, Yingming Wang, Aojie Li, Zixin Yin, Pengtao Chen, Xiangyu Zhang, Daxin Jiang, Xianfang Zeng, Gang Yu
- Abstract summary: A common architectural design couples a multimodal large language model (MLLM) encoder with a diffusion decoder. We show that unlocking the reasoning capabilities of the MLLM can push the boundaries of editing models. Our proposed framework enables image editing in a thinking-editing-reflection loop.
- Score: 60.902953259781675
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent image editing models have shown remarkable progress. A common architectural design couples a multimodal large language model (MLLM) encoder with a diffusion decoder, as seen in systems such as Step1X-Edit and Qwen-Image-Edit, where the MLLM encodes both the reference image and the instruction but remains frozen during training. In this work, we demonstrate that unlocking the reasoning capabilities of the MLLM can further push the boundaries of editing models. Specifically, we explore two reasoning mechanisms, thinking and reflection, which enhance instruction understanding and editing accuracy. Building on these, our proposed framework enables image editing in a thinking-editing-reflection loop: the thinking mechanism leverages the world knowledge of the MLLM to interpret abstract instructions, while the reflection mechanism reviews editing results, automatically corrects unintended manipulations, and identifies the stopping round. Extensive experiments demonstrate that our reasoning approach achieves significant performance gains, with improvements on ImgEdit (+4.3%), GEdit (+4.7%), and Kris (+8.2%) when initializing our DiT from Step1X-Edit (ReasonEdit-S); it also outperforms previous open-source methods on both GEdit and Kris when integrated with Qwen-Image-Edit (ReasonEdit-Q).
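To make the loop described in the abstract concrete, here is a minimal Python sketch of a thinking-editing-reflection cycle. It is a sketch under stated assumptions, not the paper's implementation: the names `mllm_think`, `diffusion_edit`, `mllm_reflect`, and the `Reflection` structure are hypothetical placeholders.

```python
# Minimal sketch of the thinking-editing-reflection loop described in the
# abstract. All names below (mllm_think, diffusion_edit, mllm_reflect,
# Reflection) are hypothetical placeholders, not the paper's actual API.

from dataclasses import dataclass
from typing import Optional


@dataclass
class Reflection:
    accepted: bool             # did the edit satisfy the instruction?
    correction: Optional[str]  # revised instruction if unintended changes were found


def mllm_think(image, instruction: str) -> str:
    """Thinking: use the MLLM's world knowledge to rewrite an abstract
    instruction into a concrete, directly editable one (stub)."""
    ...


def diffusion_edit(image, concrete_instruction: str):
    """Editing: run the diffusion decoder conditioned on the MLLM's
    encoding of the image and instruction (stub)."""
    ...


def mllm_reflect(reference, edited, instruction: str) -> Reflection:
    """Reflection: review the edit against the original instruction and
    either accept it or propose a corrective instruction (stub)."""
    ...


def reason_edit(image, instruction: str, max_rounds: int = 3):
    concrete = mllm_think(image, instruction)              # thinking
    current = image
    for _ in range(max_rounds):
        edited = diffusion_edit(current, concrete)         # editing
        review = mllm_reflect(image, edited, instruction)  # reflection
        if review.accepted:  # reflection identifies the stopping round
            return edited
        # Assumption: corrections are applied on top of the latest edit;
        # the paper may instead restart from the reference image.
        concrete = review.correction or concrete
        current = edited
    return current
```

In this sketch the reflection step both gates termination and supplies the corrective instruction for the next round, mirroring the abstract's description of reflection correcting unintended manipulations and identifying the stopping round.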
Related papers
- RetouchIQ: MLLM Agents for Instruction-Based Image Retouching with Generalist Reward [64.78078130943489]
We introduce RetouchIQ, a framework that performs instruction-based executable image editing through MLLM agents guided by a reward model. We show that RetouchIQ substantially improves both semantic consistency and perceptual quality over previous MLLM-based and diffusion-based editing systems.
arXiv Detail & Related papers (2026-02-19T17:11:59Z) - EditThinker: Unlocking Iterative Reasoning for Any Image Editor [72.28251670314451]
We propose a deliberative editing framework that allows image editors to 'think' while they edit. We train a single MLLM, EditThinker, to act as the reasoning engine of this framework. We employ reinforcement learning to align EditThinker's thinking with its editing, thereby generating more targeted instruction improvements.
arXiv Detail & Related papers (2025-12-05T18:58:09Z) - An LLM-LVLM Driven Agent for Iterative and Fine-Grained Image Editing [5.192553173010677]
RefineEdit-Agent is a novel, training-free intelligent agent framework for complex, iterative, and context-aware image editing. Our framework comprises an LLM-driven instruction and scene understanding module, a multi-level editing planner, an iterative image editing module, and a crucial LVLM-driven feedback and evaluation loop.
arXiv Detail & Related papers (2025-08-24T16:28:18Z) - MIND-Edit: MLLM Insight-Driven Editing via Language-Vision Projection [13.467269066605452]
We propose MIND-Edit, an end-to-end image-editing framework that integrates a pretrained diffusion model with an MLLM. MIND-Edit introduces two complementary strategies: (1) a text instruction optimization strategy that clarifies ambiguous user instructions based on semantic reasoning from the MLLM, and (2) an MLLM insight-driven editing strategy that explicitly leverages the intrinsic visual understanding capability of the MLLM to infer editing intent. Extensive experiments demonstrate that MIND-Edit outperforms state-of-the-art image editing methods in both quantitative metrics and visual quality, particularly under complex and challenging scenarios.
arXiv Detail & Related papers (2025-05-25T13:54:31Z) - FireEdit: Fine-grained Instruction-based Image Editing via Region-aware Vision Language Model [54.693572837423226]
FireEdit is an innovative Fine-grained Instruction-based image editing framework that exploits a REgion-aware VLM. FireEdit is designed to accurately comprehend user instructions and ensure effective control over the editing process. Our approach surpasses the state-of-the-art instruction-based image editing methods.
arXiv Detail & Related papers (2025-03-25T16:59:42Z) - BrushEdit: All-In-One Image Inpainting and Editing [76.93556996538398]
BrushEdit is a novel inpainting-based instruction-guided image editing paradigm. We devise a system enabling free-form instruction editing by integrating MLLMs and a dual-branch image inpainting model. Our framework effectively combines MLLMs and inpainting models, achieving superior performance across seven metrics.
arXiv Detail & Related papers (2024-12-13T17:58:06Z) - SmartEdit: Exploring Complex Instruction-based Image Editing with Multimodal Large Language Models [91.22477798288003]
This paper introduces SmartEdit, a novel approach to instruction-based image editing.
It exploits Multimodal Large Language Models (MLLMs) to enhance its understanding and reasoning capabilities.
We show that a small amount of complex instruction editing data can effectively stimulate SmartEdit's editing capabilities for more complex instructions.
arXiv Detail & Related papers (2023-12-11T17:54:11Z)