An LLM-LVLM Driven Agent for Iterative and Fine-Grained Image Editing
- URL: http://arxiv.org/abs/2508.17435v1
- Date: Sun, 24 Aug 2025 16:28:18 GMT
- Title: An LLM-LVLM Driven Agent for Iterative and Fine-Grained Image Editing
- Authors: Zihan Liang, Jiahao Sun, Haoran Ma
- Abstract summary: RefineEdit-Agent is a novel, training-free intelligent agent framework for complex, iterative, and context-aware image editing. Our framework comprises an LVLM-driven instruction and scene understanding module, a multi-level editing planner, an iterative image editing module, and a crucial LVLM-driven feedback and evaluation loop.
- Score: 5.192553173010677
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Despite the remarkable capabilities of text-to-image (T2I) generation models, real-world applications often demand fine-grained, iterative image editing that existing methods struggle to provide. Key challenges include granular instruction understanding, robust context preservation during modifications, and the lack of intelligent feedback mechanisms for iterative refinement. This paper introduces RefineEdit-Agent, a novel, training-free intelligent agent framework designed to address these limitations by enabling complex, iterative, and context-aware image editing. RefineEdit-Agent leverages the powerful planning capabilities of Large Language Models (LLMs) and the advanced visual understanding and evaluation prowess of Vision-Language Large Models (LVLMs) within a closed-loop system. Our framework comprises an LVLM-driven instruction parser and scene understanding module, a multi-level LLM-driven editing planner for goal decomposition, tool selection, and sequence generation, an iterative image editing module, and a crucial LVLM-driven feedback and evaluation loop. To rigorously evaluate RefineEdit-Agent, we propose LongBench-T2I-Edit, a new benchmark featuring 500 initial images with complex, multi-turn editing instructions across nine visual dimensions. Extensive experiments demonstrate that RefineEdit-Agent significantly outperforms state-of-the-art baselines, achieving an average score of 3.67 on LongBench-T2I-Edit, compared to 2.29 for Direct Re-Prompting, 2.91 for InstructPix2Pix, 3.16 for GLIGEN-based Edit, and 3.39 for ControlNet-XL. Ablation studies, human evaluations, and analyses of iterative refinement, backbone choices, tool usage, and robustness to instruction complexity further validate the efficacy of our agentic design in delivering superior edit fidelity and context preservation.
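The abstract describes a closed-loop pipeline: an LVLM parses the instruction and scene, an LLM plans edit steps, tools apply the edits, and an LVLM scores the result to drive re-planning. A minimal sketch of that control flow is below; all function and class names (`lvlm_parse`, `llm_plan`, `apply_tool`, `lvlm_evaluate`, `llm_replan`, `EditStep`) are hypothetical stand-ins for the paper's components, not its actual API.

```python
from dataclasses import dataclass

@dataclass
class EditStep:
    tool: str   # e.g. an inpainting or layout-editing tool
    args: dict

# Hypothetical stubs for the LVLM/LLM components described in the abstract.
def lvlm_parse(image, instruction):
    # Instruction parsing + scene understanding: one goal per instruction here.
    return [instruction]

def llm_plan(goals):
    # Goal decomposition, tool selection, and sequencing.
    return [EditStep(tool="inpaint", args={"goal": g}) for g in goals]

def apply_tool(image, step):
    # Stand-in editor: record each applied tool on a list acting as the "image".
    return image + [step.tool]

def lvlm_evaluate(image, instruction):
    # Dummy 1-5 score that rises with each applied edit.
    score = min(5.0, 2.0 + len(image))
    return score, "increase fidelity"

def llm_replan(plan, feedback):
    # Placeholder: a real agent would revise the plan using the feedback.
    return plan

def refine_edit(image, instruction, max_iters=5, threshold=4.0):
    """Closed loop: plan, edit, evaluate, re-plan until the score clears threshold."""
    goals = lvlm_parse(image, instruction)
    plan = llm_plan(goals)
    score = 0.0
    for _ in range(max_iters):
        for step in plan:
            image = apply_tool(image, step)
        score, feedback = lvlm_evaluate(image, instruction)
        if score >= threshold:
            break
        plan = llm_replan(plan, feedback)
    return image, score
```

With the stubs above, `refine_edit([], "make the sky violet")` terminates once the dummy evaluator's score reaches the threshold; the value of the loop in the real system comes from `lvlm_evaluate` and `llm_replan` being genuine model calls.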
Related papers
- RetouchIQ: MLLM Agents for Instruction-Based Image Retouching with Generalist Reward [64.78078130943489]
We introduce RetouchIQ, a framework that performs instruction-based executable image editing through MLLM agents guided by a reward model. We show that RetouchIQ substantially improves both semantic consistency and perceptual quality over previous MLLM-based and diffusion-based editing systems.
arXiv Detail & Related papers (2026-02-19T17:11:59Z) - I2E: From Image Pixels to Actionable Interactive Environments for Text-Guided Image Editing [59.434028565445885]
I2E is a novel "Decompose-then-Action" paradigm that revisits image editing as an actionable interaction process within a structured environment. I2E utilizes a Decomposer to transform unstructured images into discrete, manipulable object layers and then introduces a physics-aware Vision-Language-Action Agent to parse complex instructions. I2E significantly outperforms state-of-the-art methods in handling complex compositional instructions, maintaining physical plausibility, and ensuring multi-turn editing stability.
arXiv Detail & Related papers (2026-01-07T09:29:57Z) - I2I-Bench: A Comprehensive Benchmark Suite for Image-to-Image Editing Models [78.62380562116135]
Existing image editing benchmarks suffer from limited task scopes, insufficient evaluation dimensions, and heavy reliance on manual annotations. We propose I2I-Bench, a comprehensive benchmark for image-to-image editing models, which features 10 task categories across both single-image and multi-image editing tasks. Using I2I-Bench, we benchmark numerous mainstream image editing models, investigating the gaps and trade-offs between editing models across various dimensions.
arXiv Detail & Related papers (2025-12-04T10:44:07Z) - ReasonEdit: Towards Reasoning-Enhanced Image Editing Models [60.902953259781675]
A common architectural design couples a multimodal large language model (MLLM) encoder with a diffusion decoder. We show that unlocking the reasoning capabilities of the MLLM can push the boundaries of editing models. Our proposed framework enables image editing in a thinking-editing-reflection loop.
arXiv Detail & Related papers (2025-11-27T17:02:48Z) - LumiGen: An LVLM-Enhanced Iterative Framework for Fine-Grained Text-to-Image Generation [1.124958340749622]
Large Vision-Language Models (LVLMs) have demonstrated powerful capabilities in cross-modal understanding and instruction following. LumiGen is a novel LVLM-enhanced iterative framework designed to elevate T2I model performance. LumiGen achieves a superior average score of 3.08, outperforming state-of-the-art baselines.
arXiv Detail & Related papers (2025-08-05T20:53:43Z) - UniEdit-I: Training-free Image Editing for Unified VLM via Iterative Understanding, Editing and Verifying [64.5307229755533]
We introduce a novel training-free framework named UniEdit-I to enable the unified VLM with image editing capability. We implement our method based on the latest BLIP3-o and achieve state-of-the-art (SOTA) performance on the GEdit-Bench benchmark.
arXiv Detail & Related papers (2025-08-05T06:42:09Z) - Reinforcing Multimodal Understanding and Generation with Dual Self-rewards [56.08202047680044]
Large multimodal models (LMMs) unify cross-modal understanding and generation into a single framework. Current solutions require external supervision (e.g., human feedback or reward models) and only address unidirectional tasks. We introduce a self-supervised dual reward mechanism to reinforce the understanding and generation capabilities of LMMs.
arXiv Detail & Related papers (2025-06-09T17:38:45Z) - GraPE: A Generate-Plan-Edit Framework for Compositional T2I Synthesis [10.47359822447001]
We present an alternate paradigm for T2I synthesis, decomposing the task of complex multi-step generation into three steps. Our approach derives its strength from the fact that it is modular in nature, is training-free, and can be applied over any combination of image generation and editing models.
arXiv Detail & Related papers (2024-12-08T22:29:56Z) - ReEdit: Multimodal Exemplar-Based Image Editing with Diffusion Models [11.830273909934688]
Modern Text-to-Image (T2I) Diffusion models have revolutionized image editing by enabling the generation of high-quality images.
We propose ReEdit, a modular and efficient end-to-end framework that captures edits in both text and image modalities.
Our results demonstrate that ReEdit consistently outperforms contemporary approaches both qualitatively and quantitatively.
arXiv Detail & Related papers (2024-11-06T15:19:24Z) - SmartEdit: Exploring Complex Instruction-based Image Editing with Multimodal Large Language Models [91.22477798288003]
This paper introduces SmartEdit, a novel approach to instruction-based image editing.
It exploits Multimodal Large Language Models (MLLMs) to enhance its understanding and reasoning capabilities.
We show that a small amount of complex instruction editing data can effectively stimulate SmartEdit's editing capabilities for more complex instructions.
arXiv Detail & Related papers (2023-12-11T17:54:11Z) - LLMGA: Multimodal Large Language Model based Generation Assistant [53.150283805515926]
We introduce a Multimodal Large Language Model-based Generation Assistant (LLMGA) to assist users in image generation and editing.
We train the MLLM to grasp the properties of image generation and editing, enabling it to generate detailed prompts.
Extensive results show that LLMGA has promising generation and editing capabilities and can enable more flexible and expansive applications.
arXiv Detail & Related papers (2023-11-27T13:37:26Z)
This list is automatically generated from the titles and abstracts of the papers in this site.