RetouchIQ: MLLM Agents for Instruction-Based Image Retouching with Generalist Reward
- URL: http://arxiv.org/abs/2602.17558v1
- Date: Thu, 19 Feb 2026 17:11:59 GMT
- Title: RetouchIQ: MLLM Agents for Instruction-Based Image Retouching with Generalist Reward
- Authors: Qiucheng Wu, Jing Shi, Simon Jenni, Kushal Kafle, Tianyu Wang, Shiyu Chang, Handong Zhao
- Abstract summary: We introduce RetouchIQ, a framework that performs instruction-based executable image editing through MLLM agents guided by a reward model. We show that RetouchIQ substantially improves both semantic consistency and perceptual quality over previous MLLM-based and diffusion-based editing systems.
- Score: 64.78078130943489
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advances in multimodal large language models (MLLMs) have shown great potential for extending vision-language reasoning to professional tool-based image editing, enabling intuitive and creative editing. A promising direction is to use reinforcement learning (RL) to enable MLLMs to reason about and execute optimal tool-use plans within professional image-editing software. However, training remains challenging due to the lack of reliable, verifiable reward signals that can reflect the inherently subjective nature of creative editing. In this work, we introduce RetouchIQ, a framework that performs instruction-based executable image editing through MLLM agents guided by a generalist reward model. RetouchIQ interprets user-specified editing intentions and generates corresponding, executable image adjustments, bridging high-level aesthetic goals with precise parameter control. To move beyond conventional, rule-based rewards that compute similarity against a fixed reference image using handcrafted metrics, we propose a generalist reward model, an RL-fine-tuned MLLM that evaluates edited results through a set of generated metrics on a case-by-case basis. The reward model then provides scalar feedback through multimodal reasoning, enabling reinforcement learning with high-quality, instruction-consistent gradients. We curate an extended dataset with 190k instruction-reasoning pairs and establish a new benchmark for instruction-based image editing. Experiments show that RetouchIQ substantially improves both semantic consistency and perceptual quality over previous MLLM-based and diffusion-based editing systems. Our findings demonstrate the potential of generalist reward-driven MLLM agents as flexible, explainable, and executable assistants for professional image editing.
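The pipeline the abstract describes (an agent that turns an instruction into executable adjustment parameters, and a reward model that returns a scalar score for the result) can be sketched as below. This is a minimal toy sketch, not the paper's implementation: `Adjustments`, `propose_adjustments`, and `score_edit` are hypothetical names, and the heuristic reward stands in for the RL-fine-tuned MLLM judge.

```python
# Toy sketch of an instruction -> executable-edit -> scalar-reward loop.
# All names are hypothetical; the paper's agent and reward model are MLLMs.
from dataclasses import dataclass

@dataclass
class Adjustments:
    exposure: float    # EV offset applied to the image
    contrast: float    # multiplicative contrast factor
    saturation: float  # multiplicative saturation factor

def propose_adjustments(instruction: str) -> Adjustments:
    """Stand-in for the MLLM agent: map an instruction to tool parameters.
    The real agent also reasons over the input image."""
    if "brighter" in instruction:
        return Adjustments(exposure=0.5, contrast=1.0, saturation=1.0)
    return Adjustments(exposure=0.0, contrast=1.1, saturation=1.05)

def score_edit(instruction: str, edit: Adjustments) -> float:
    """Stand-in for the generalist reward model: the paper's version is an
    RL-fine-tuned MLLM that generates case-specific metrics and reasons
    multimodally before emitting a scalar; here it is a toy heuristic."""
    target_exposure = 0.5 if "brighter" in instruction else 0.0
    return 1.0 - abs(edit.exposure - target_exposure)

instruction = "make the photo brighter but keep it natural"
edit = propose_adjustments(instruction)
reward = score_edit(instruction, edit)  # scalar feedback for RL updates
print(f"{edit} -> reward {reward:.2f}")
```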
Related papers
- Human-Aligned MLLM Judges for Fine-Grained Image Editing Evaluation: A Benchmark, Framework, and Analysis [95.89328387635176]
We introduce a fine-grained Multimodal Large Language Model (MLLM)-as-a-Judge framework for image editing. We present a new human-validated benchmark that integrates human judgments, MLLM-based evaluations, model outputs, and traditional metrics.
arXiv Detail & Related papers (2026-02-13T15:34:32Z)
- EditThinker: Unlocking Iterative Reasoning for Any Image Editor [72.28251670314451]
We propose a deliberative editing framework that enables image editors to 'think' while they edit. We train a single MLLM, EditThinker, to act as the reasoning engine of this framework. We employ reinforcement learning to align EditThinker's thinking with its editing, thereby generating more targeted instruction improvements.
arXiv Detail & Related papers (2025-12-05T18:58:09Z)
- ReasonEdit: Towards Reasoning-Enhanced Image Editing Models [60.902953259781675]
A common architectural design couples a multimodal large language model (MLLM) encoder with a diffusion decoder. We show that unlocking the reasoning capabilities of MLLMs can push the boundaries of editing models. Our proposed framework enables image editing in a thinking-editing-reflection loop.
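The thinking-editing-reflection loop can be illustrated with a runnable toy. `think`, `edit`, and `reflect` below are stand-ins for the MLLM planner, the diffusion decoder, and the MLLM reflector; none of this is ReasonEdit's actual interface.

```python
# Minimal, runnable illustration of a thinking-editing-reflection loop.
# All functions are toy stand-ins for the MLLM/diffusion components.

def think(state: float, target: float) -> float:
    """Stand-in planner: decide how much to adjust (a scalar 'brightness')."""
    return (target - state) * 0.7  # deliberately under-corrects

def edit(state: float, delta: float) -> float:
    """Stand-in editor: apply the planned adjustment."""
    return state + delta

def reflect(state: float, target: float) -> bool:
    """Stand-in reflector: is the result close enough to the goal?"""
    return abs(state - target) < 0.05

state, target = 0.2, 0.8  # current vs. desired "brightness"
for round_ in range(5):
    state = edit(state, think(state, target))
    if reflect(state, target):
        break
print(f"converged to {state:.3f} after {round_ + 1} rounds")
```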
arXiv Detail & Related papers (2025-11-27T17:02:48Z)
- Uniworld-V2: Reinforce Image Editing with Diffusion Negative-aware Finetuning and MLLM Implicit Feedback [41.41713036839503]
We introduce Edit-R1, a novel post-training framework for instruction-based image editing based on policy optimization. We employ a Multimodal Large Language Model (MLLM) as a unified, training-free reward model, leveraging its output logits to provide fine-grained feedback. Our framework is model-agnostic, delivering substantial performance gains when applied to diverse base models.
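One common way to read a reward off a judge model's output logits, in the spirit of the implicit-feedback idea above, is to ask a yes/no question and compare the logits of the two answer tokens. The sketch below uses Hugging Face transformers with a placeholder model id ("judge-model-id" is an assumption), and is text-only; Edit-R1's judge is multimodal and also sees the images.

```python
# Hedged sketch: turn a judge model's next-token logits into a scalar
# reward by comparing "Yes" vs "No". Model id and prompt are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("judge-model-id")  # placeholder
model = AutoModelForCausalLM.from_pretrained("judge-model-id")

def logit_reward(prompt: str) -> float:
    """Relative probability of 'Yes' vs 'No' at the next-token position."""
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        next_logits = model(**inputs).logits[0, -1]  # next-token logits
    yes_id = tokenizer(" Yes", add_special_tokens=False).input_ids[0]
    no_id = tokenizer(" No", add_special_tokens=False).input_ids[0]
    probs = torch.softmax(next_logits[[yes_id, no_id]], dim=0)
    return probs[0].item()  # in [0, 1]; higher means the judge approves

reward = logit_reward(
    "Does the edited image follow the instruction? Answer Yes or No."
)
```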
arXiv Detail & Related papers (2025-10-19T15:38:06Z)
- EditScore: Unlocking Online RL for Image Editing via High-Fidelity Reward Modeling [71.8265422228785]
Reinforcement Learning (RL) offers a promising solution, but its adoption in image editing has been hindered by the lack of a high-fidelity, efficient reward signal. We present a comprehensive methodology to overcome this barrier, centered on the development of a state-of-the-art, specialized reward model.
arXiv Detail & Related papers (2025-09-28T14:28:24Z)
- An LLM-LVLM Driven Agent for Iterative and Fine-Grained Image Editing [5.192553173010677]
RefineEdit-Agent is a novel, training-free intelligent agent framework for complex, iterative, and context-aware image editing. Our framework comprises an LLM-driven instruction and scene understanding module, a multi-level editing planner, an iterative image editing module, and a crucial LVLM-driven feedback and evaluation loop.
arXiv Detail & Related papers (2025-08-24T16:28:18Z)
- MonetGPT: Solving Puzzles Enhances MLLMs' Image Retouching Skills [37.48977077142813]
We show that a multimodal large language model (MLLM) can be taught to critique raw photographs. We demonstrate that MLLMs can first be made aware of the underlying image processing operations. We then synthesize a reasoning dataset by procedurally manipulating expert-edited photos.
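The procedural synthesis idea in the summary above can be sketched as a runnable toy: corrupt an expert-edited photo with a known operation so that the inverse adjustment becomes a verifiable training target. The names and the single "exposure" operation are illustrative assumptions, not MonetGPT's actual pipeline.

```python
# Toy sketch of puzzle-style data synthesis: degrade a known-good setting
# with a recorded perturbation; the inverse edit is the ground truth.
import random

def synthesize_puzzle(expert_exposure: float) -> dict:
    """Apply a known, procedural corruption and record the fix."""
    offset = random.uniform(-1.0, 1.0)  # known corruption magnitude
    return {
        "degraded_exposure": expert_exposure + offset,
        "target_adjustment": -offset,  # ground-truth inverse edit
    }

sample = synthesize_puzzle(expert_exposure=0.0)
print(sample)  # e.g. {'degraded_exposure': 0.42, 'target_adjustment': -0.42}
```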
arXiv Detail & Related papers (2025-05-09T16:38:27Z)
- Guiding Instruction-based Image Editing via Multimodal Large Language Models [102.82211398699644]
Multimodal large language models (MLLMs) show promising capabilities in cross-modal understanding and visual-aware response generation. We investigate how MLLMs facilitate edit instructions and present MLLM-Guided Image Editing (MGIE). MGIE learns to derive expressive instructions and provides explicit guidance.
arXiv Detail & Related papers (2023-09-29T10:01:50Z)