RePlan: Reasoning-guided Region Planning for Complex Instruction-based Image Editing
- URL: http://arxiv.org/abs/2512.16864v1
- Date: Thu, 18 Dec 2025 18:34:23 GMT
- Title: RePlan: Reasoning-guided Region Planning for Complex Instruction-based Image Editing
- Authors: Tianyuan Qu, Lei Ke, Xiaohang Zhan, Longxiang Tang, Yuqi Liu, Bohao Peng, Bei Yu, Dong Yu, Jiaya Jia
- Abstract summary: RePlan is a plan-then-execute framework that couples a vision-language planner with a diffusion editor. The planner decomposes instructions via step-by-step reasoning and explicitly grounds them to target regions. The editor then applies changes using a training-free attention-region injection mechanism, enabling precise, parallel multi-region edits without iterative inpainting.
- Score: 80.70169829264812
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Instruction-based image editing enables natural-language control over visual modifications, yet existing models falter under Instruction-Visual Complexity (IV-Complexity), where intricate instructions meet cluttered or ambiguous scenes. We introduce RePlan (Region-aligned Planning), a plan-then-execute framework that couples a vision-language planner with a diffusion editor. The planner decomposes instructions via step-by-step reasoning and explicitly grounds them to target regions; the editor then applies changes using a training-free attention-region injection mechanism, enabling precise, parallel multi-region edits without iterative inpainting. To strengthen planning, we apply GRPO-based reinforcement learning using 1K instruction-only examples, yielding substantial gains in reasoning fidelity and format reliability. We further present IV-Edit, a benchmark focused on fine-grained grounding and knowledge-intensive edits. Across IV-Complex settings, RePlan consistently outperforms strong baselines trained on far larger datasets, improving regional precision and overall fidelity. Our project page: https://replan-iv-edit.github.io
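The abstract describes a two-stage pipeline: a vision-language planner decomposes an instruction into region-grounded sub-instructions, and a diffusion editor applies them in one pass via attention-region injection. The sketch below is only an illustration of that flow; every name (plan_regions, RegionEdit, region_attention_bias), the planner's output schema, and the exact masking rule are assumptions, not RePlan's published API.

```python
# Illustrative plan-then-execute sketch, not the paper's implementation.
from dataclasses import dataclass
import torch


@dataclass
class RegionEdit:
    sub_instruction: str                      # one atomic edit, e.g. "make the mug red"
    bbox: tuple[int, int, int, int]           # (x0, y0, x1, y1) in image pixels


def plan_regions(vlm, image, instruction):
    """Hypothetical planner call: the VLM reasons step by step and returns
    sub-instructions grounded to bounding boxes (output schema assumed here)."""
    response = vlm.generate(image=image, prompt=f"Decompose and ground: {instruction}")
    return [RegionEdit(r["edit"], tuple(r["bbox"])) for r in response["regions"]]


def bbox_to_latent_mask(bbox, image_size, latent_size):
    """Downscale a pixel-space box to a binary mask over latent tokens."""
    H, W = image_size
    h, w = latent_size
    x0, y0, x1, y1 = bbox
    mask = torch.zeros(h, w)
    mask[int(y0 / H * h):int(y1 / H * h), int(x0 / W * w):int(x1 / W * w)] = 1.0
    return mask.flatten()                     # one entry per latent token


def region_attention_bias(edits, image_size, latent_size, tokens_per_edit):
    """Additive cross-attention bias that confines each sub-instruction's text
    tokens to its own region (one possible reading of "attention-region
    injection"; the paper's exact mechanism may differ)."""
    biases = []
    for edit, n_tok in zip(edits, tokens_per_edit):
        mask = bbox_to_latent_mask(edit.bbox, image_size, latent_size)
        bias = torch.where(mask.bool(), 0.0, float("-inf"))   # block out-of-region attention
        biases.append(bias.expand(n_tok, -1))                 # (text_tokens, latent_tokens)
    return torch.cat(biases, dim=0)           # added to cross-attention logits while denoising
```

Because each sub-instruction's text tokens are biased toward their own region, all edits can be denoised together in a single pass, which is the property the abstract contrasts with iterative inpainting.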
Related papers
- Generative Visual Chain-of-Thought for Image Editing [48.64933075232273]
Existing image editing methods struggle to perceive where to edit, especially under complex scenes and nuanced spatial instructions. To address this issue, we propose Generative Visual Chain-of-Thought (GVCoT). GVCoT performs native visual reasoning by first generating spatial cues to localize the target region and then executing the edit.
arXiv Detail & Related papers (2026-03-02T14:12:52Z)
- InterCoG: Towards Spatially Precise Image Editing with Interleaved Chain-of-Grounding Reasoning [60.799998743918955]
We propose a novel text-vision Interleaved Chain-of-Grounding reasoning framework for fine-grained image editing in complex real-world scenes. The key insight of InterCoG is to first perform object position reasoning solely within text. We also propose two auxiliary training modules: multimodal grounding reconstruction supervision and multimodal grounding reasoning alignment.
arXiv Detail & Related papers (2026-03-02T08:13:16Z)
- LocateEdit-Bench: A Benchmark for Instruction-Based Editing Localization [21.62979058692505]
We propose a large-scale dataset comprising 231K edited images to benchmark forgery localization methods. Our dataset incorporates four cutting-edge editing models and covers three common edit types. Our work establishes a foundation to keep pace with the evolving landscape of image editing, thereby facilitating the development of effective methods for future forgery localization.
arXiv Detail & Related papers (2026-02-05T12:01:09Z)
- SketchAssist: A Practical Assistant for Semantic Edits and Precise Local Redrawing [13.733328072282049]
We present SketchAssist, an interactive sketch drawing assistant that accelerates creation by unifying instruction-guided global edits with line-guided region redrawing. To enable this assistant at scale, we introduce a controllable data generation pipeline that (i) constructs attribute-addition sequences from attribute-free base sketches, (ii) forms multi-step edit chains via cross-sequence sampling, and (iii) expands stylistic coverage with a style-preserving attribute-removal model.
arXiv Detail & Related papers (2025-12-16T06:50:44Z)
- Follow-Your-Shape: Shape-Aware Image Editing via Trajectory-Guided Region Control [52.87568958372421]
Follow-Your-Shape is a training-free and mask-free framework that supports precise and controllable editing of object shapes. We compute a Trajectory Divergence Map (TDM) by comparing token-wise velocity differences between the inversion and denoising paths. Our method achieves superior editability and visual fidelity, particularly in tasks requiring large-scale shape replacement. (A minimal sketch of the TDM computation appears after this related-papers list.)
arXiv Detail & Related papers (2025-08-11T16:10:00Z)
- CannyEdit: Selective Canny Control and Dual-Prompt Guidance for Training-Free Image Editing [10.535939265557895]
CannyEdit is a novel training-free framework for regional image editing. It applies structural guidance from a Canny ControlNet only to the unedited regions, preserving the original image's details. CannyEdit offers exceptional flexibility: it operates effectively with rough masks or even single-point hints in object-addition tasks.
arXiv Detail & Related papers (2025-08-09T11:06:58Z)
- Beyond Simple Edits: X-Planner for Complex Instruction-Based Image Editing [43.3517273862321]
X-Planner is a planning system that bridges user intent with editing model capabilities. X-Planner employs chain-of-thought reasoning to systematically decompose complex instructions into simpler, clear sub-instructions. For each sub-instruction, X-Planner automatically generates precise edit types and segmentation masks, eliminating manual intervention and ensuring localized, identity-preserving edits.
arXiv Detail & Related papers (2025-07-07T17:59:56Z)
- SPIE: Semantic and Structural Post-Training of Image Editing Diffusion Models with AI feedback [28.807572302899004]
SPIE is a novel approach for semantic and structural post-training of instruction-based image editing diffusion models. We introduce an online reinforcement learning framework that aligns the diffusion model with human preferences without relying on extensive human annotations. Experimental results demonstrate that SPIE can perform intricate edits in complex scenes after just 10 training steps.
arXiv Detail & Related papers (2025-04-17T10:46:39Z)
- DCEdit: Dual-Level Controlled Image Editing via Precisely Localized Semantics [71.78350994830885]
We present a novel approach to improving text-guided image editing using diffusion-based models. Our method uses visual and textual self-attention to enhance the cross-attention map, which can serve as regional cues to improve editing performance. To fully compare our method with other DiT-based approaches, we construct the RW-800 benchmark, featuring high-resolution images, long descriptive texts, real-world images, and a new text editing task. (A minimal sketch of this cross-attention refinement appears after this related-papers list.)
arXiv Detail & Related papers (2025-03-21T02:14:03Z)
- Learning by Planning: Language-Guided Global Image Editing [53.72807421111136]
We develop a text-to-operation model to map the vague editing language request into a series of editing operations.
The only supervision in the task is the target image, which is insufficient for stable training of sequential decisions.
We propose a novel operation planning algorithm to generate possible editing sequences from the target image as pseudo ground truth.
arXiv Detail & Related papers (2021-06-24T16:30:03Z)
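For concreteness, here is a minimal sketch of the Trajectory Divergence Map (TDM) mentioned in the Follow-Your-Shape entry above. The abstract only states that the map compares token-wise velocity differences between the inversion and denoising paths; the tensor layout, the averaging over timesteps, and the normalization below are assumptions, not the paper's exact recipe.

```python
# Illustrative TDM sketch under assumed tensor shapes, not Follow-Your-Shape's code.
import torch


def trajectory_divergence_map(v_inversion: torch.Tensor,
                              v_denoising: torch.Tensor) -> torch.Tensor:
    """v_inversion, v_denoising: (timesteps, tokens, channels) velocity fields
    recorded along the inversion and denoising trajectories (layout assumed)."""
    per_step = (v_inversion - v_denoising).norm(dim=-1)         # (timesteps, tokens)
    tdm = per_step.mean(dim=0)                                   # average over timesteps
    tdm = (tdm - tdm.min()) / (tdm.max() - tdm.min() + 1e-8)     # normalize to [0, 1]
    return tdm  # high values mark tokens whose trajectories diverge, i.e. regions to edit
```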
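Similarly, the DCEdit entry describes enhancing the cross-attention map with visual and textual self-attention so it yields regional cues. One common way to realize this, sketched below, propagates cross-attention scores through the visual self-attention matrix; DCEdit's actual formulation may differ, and all names here are illustrative.

```python
# Illustrative cross-attention refinement sketch, not DCEdit's implementation.
import torch


def refine_cross_attention(cross_attn: torch.Tensor,
                           visual_self_attn: torch.Tensor,
                           steps: int = 2) -> torch.Tensor:
    """cross_attn: (image_tokens, text_tokens) attention from image patches to prompt words.
    visual_self_attn: (image_tokens, image_tokens) self-attention among image patches."""
    refined = cross_attn
    for _ in range(steps):
        # spread each word's relevance along visually similar patches
        refined = visual_self_attn @ refined
    # renormalize per word so the map can be thresholded into regional cues
    return refined / refined.sum(dim=0, keepdim=True).clamp_min(1e-8)
```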