RePlan: Reasoning-guided Region Planning for Complex Instruction-based Image Editing
- URL: http://arxiv.org/abs/2512.16864v1
- Date: Thu, 18 Dec 2025 18:34:23 GMT
- Title: RePlan: Reasoning-guided Region Planning for Complex Instruction-based Image Editing
- Authors: Tianyuan Qu, Lei Ke, Xiaohang Zhan, Longxiang Tang, Yuqi Liu, Bohao Peng, Bei Yu, Dong Yu, Jiaya Jia
- Abstract summary: RePlan is a plan-then-execute framework that couples a vision-language planner with a diffusion editor. The planner decomposes instructions via step-by-step reasoning and explicitly grounds them to target regions. The editor then applies changes using a training-free attention-region injection mechanism, enabling precise, parallel multi-region edits without iterative inpainting.
- Score: 80.70169829264812
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Instruction-based image editing enables natural-language control over visual modifications, yet existing models falter under Instruction-Visual Complexity (IV-Complexity), where intricate instructions meet cluttered or ambiguous scenes. We introduce RePlan (Region-aligned Planning), a plan-then-execute framework that couples a vision-language planner with a diffusion editor. The planner decomposes instructions via step-by-step reasoning and explicitly grounds them to target regions; the editor then applies changes using a training-free attention-region injection mechanism, enabling precise, parallel multi-region edits without iterative inpainting. To strengthen planning, we apply GRPO-based reinforcement learning using 1K instruction-only examples, yielding substantial gains in reasoning fidelity and format reliability. We further present IV-Edit, a benchmark focused on fine-grained grounding and knowledge-intensive edits. Across IV-Complex settings, RePlan consistently outperforms strong baselines trained on far larger datasets, improving regional precision and overall fidelity. Our project page: https://replan-iv-edit.github.io
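The abstract describes a two-stage pipeline: a vision-language planner decomposes an instruction into region-grounded sub-instructions, and a diffusion editor applies them in one pass via attention-region injection. The sketch below is only an illustration of that flow; every name (plan_regions, RegionEdit, region_attention_bias), the planner's output schema, and the exact masking rule are assumptions, not RePlan's published API.

```python
# Illustrative plan-then-execute sketch, not the paper's implementation.
from dataclasses import dataclass
import torch


@dataclass
class RegionEdit:
    sub_instruction: str                      # one atomic edit, e.g. "make the mug red"
    bbox: tuple[int, int, int, int]           # (x0, y0, x1, y1) in image pixels


def plan_regions(vlm, image, instruction):
    """Hypothetical planner call: the VLM reasons step by step and returns
    sub-instructions grounded to bounding boxes (output schema assumed here)."""
    response = vlm.generate(image=image, prompt=f"Decompose and ground: {instruction}")
    return [RegionEdit(r["edit"], tuple(r["bbox"])) for r in response["regions"]]


def bbox_to_latent_mask(bbox, image_size, latent_size):
    """Downscale a pixel-space box to a binary mask over latent tokens."""
    H, W = image_size
    h, w = latent_size
    x0, y0, x1, y1 = bbox
    mask = torch.zeros(h, w)
    mask[int(y0 / H * h):int(y1 / H * h), int(x0 / W * w):int(x1 / W * w)] = 1.0
    return mask.flatten()                     # one entry per latent token


def region_attention_bias(edits, image_size, latent_size, tokens_per_edit):
    """Additive cross-attention bias that confines each sub-instruction's text
    tokens to its own region (one possible reading of "attention-region
    injection"; the paper's exact mechanism may differ)."""
    biases = []
    for edit, n_tok in zip(edits, tokens_per_edit):
        mask = bbox_to_latent_mask(edit.bbox, image_size, latent_size)
        bias = torch.where(mask.bool(), 0.0, float("-inf"))   # block out-of-region attention
        biases.append(bias.expand(n_tok, -1))                 # (text_tokens, latent_tokens)
    return torch.cat(biases, dim=0)           # added to cross-attention logits while denoising
```

Because each sub-instruction's text tokens are biased toward their own region, all edits can be denoised together in a single pass, which is the property the abstract contrasts with iterative inpainting.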
Related papers
- Generative Visual Chain-of-Thought for Image Editing [48.64933075232273]
Existing image editing methods struggle to perceive where to edit, especially under complex scenes and nuanced spatial instructions. To address this issue, we propose Generative Visual Chain-of-Thought (GVCoT). GVCoT performs native visual reasoning by first generating spatial cues to localize the target region and then executing the edit.
arXiv Detail & Related papers (2026-03-02T14:12:52Z)
- InterCoG: Towards Spatially Precise Image Editing with Interleaved Chain-of-Grounding Reasoning [60.799998743918955]
We propose a novel text-vision Interleaved Chain-of-Grounding reasoning framework for fine-grained image editing in complex real-world scenes. The key insight of InterCoG is to first perform object position reasoning solely within text. We also propose two auxiliary training modules: multimodal grounding reconstruction supervision and multimodal grounding reasoning alignment.
arXiv Detail & Related papers (2026-03-02T08:13:16Z)
- LocateEdit-Bench: A Benchmark for Instruction-Based Editing Localization [21.62979058692505]
We propose a large-scale dataset comprising 231K edited images to benchmark forgery localization methods. Our dataset incorporates four cutting-edge editing models and covers three common edit types. Our work establishes a foundation to keep pace with the evolving landscape of image editing, thereby facilitating the development of effective methods for future forgery localization.
arXiv Detail & Related papers (2026-02-05T12:01:09Z)
- SketchAssist: A Practical Assistant for Semantic Edits and Precise Local Redrawing [13.733328072282049]
We present SketchAssist, an interactive sketch drawing assistant that accelerates creation by unifying instruction-guided global edits with line-guided region redrawing. To enable this assistant at scale, we introduce a controllable data generation pipeline that (i) constructs attribute-addition sequences from attribute-free base sketches, (ii) forms multi-step edit chains via cross-sequence sampling, and (iii) expands stylistic coverage with a style-preserving attribute-removal model.
arXiv Detail & Related papers (2025-12-16T06:50:44Z)
- Follow-Your-Shape: Shape-Aware Image Editing via Trajectory-Guided Region Control [52.87568958372421]
Follow-Your-Shape is a training-free and mask-free framework that supports precise and controllable editing of object shapes. We compute a Trajectory Divergence Map (TDM) by comparing token-wise velocity differences between the inversion and denoising paths. Our method achieves superior editability and visual fidelity, particularly in tasks requiring large-scale shape replacement. (A minimal sketch of the TDM computation appears after this related-papers list.)
arXiv Detail & Related papers (2025-08-11T16:10:00Z)
- CannyEdit: Selective Canny Control and Dual-Prompt Guidance for Training-Free Image Editing [10.535939265557895]
CannyEdit is a novel training-free framework for regional image editing. It applies structural guidance from a Canny ControlNet only to the unedited regions, preserving the original image's details. CannyEdit offers exceptional flexibility: it operates effectively with rough masks or even single-point hints in object-addition tasks.
arXiv Detail & Related papers (2025-08-09T11:06:58Z)
- Beyond Simple Edits: X-Planner for Complex Instruction-Based Image Editing [43.3517273862321]
X-Planner is a planning system that bridges user intent with editing model capabilities. X-Planner employs chain-of-thought reasoning to systematically decompose complex instructions into simpler, clear sub-instructions. For each sub-instruction, X-Planner automatically generates precise edit types and segmentation masks, eliminating manual intervention and ensuring localized, identity-preserving edits.
arXiv Detail & Related papers (2025-07-07T17:59:56Z)
- SPIE: Semantic and Structural Post-Training of Image Editing Diffusion Models with AI feedback [28.807572302899004]
SPIE is a novel approach for semantic and structural post-training of instruction-based image editing diffusion models. We introduce an online reinforcement learning framework that aligns the diffusion model with human preferences without relying on extensive human annotations. Experimental results demonstrate that SPIE can perform intricate edits in complex scenes after just 10 training steps.
arXiv Detail & Related papers (2025-04-17T10:46:39Z)
- DCEdit: Dual-Level Controlled Image Editing via Precisely Localized Semantics [71.78350994830885]
We present a novel approach to improving text-guided image editing using diffusion-based models. Our method uses visual and textual self-attention to enhance the cross-attention map, which can serve as regional cues to improve editing performance. To fully compare our method with other DiT-based approaches, we construct the RW-800 benchmark, featuring high-resolution images, long descriptive texts, real-world images, and a new text editing task. (A minimal sketch of this cross-attention refinement appears after this related-papers list.)
arXiv Detail & Related papers (2025-03-21T02:14:03Z)
- Learning by Planning: Language-Guided Global Image Editing [53.72807421111136]
We develop a text-to-operation model to map the vague editing language request into a series of editing operations.
The only supervision in the task is the target image, which is insufficient for stable training of sequential decisions.
We propose a novel operation planning algorithm to generate possible editing sequences from the target image as pseudo ground truth.
arXiv Detail & Related papers (2021-06-24T16:30:03Z)
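For concreteness, here is a minimal sketch of the Trajectory Divergence Map (TDM) mentioned in the Follow-Your-Shape entry above. The abstract only states that the map compares token-wise velocity differences between the inversion and denoising paths; the tensor layout, the averaging over timesteps, and the normalization below are assumptions, not the paper's exact recipe.

```python
# Illustrative TDM sketch under assumed tensor shapes, not Follow-Your-Shape's code.
import torch


def trajectory_divergence_map(v_inversion: torch.Tensor,
                              v_denoising: torch.Tensor) -> torch.Tensor:
    """v_inversion, v_denoising: (timesteps, tokens, channels) velocity fields
    recorded along the inversion and denoising trajectories (layout assumed)."""
    per_step = (v_inversion - v_denoising).norm(dim=-1)         # (timesteps, tokens)
    tdm = per_step.mean(dim=0)                                   # average over timesteps
    tdm = (tdm - tdm.min()) / (tdm.max() - tdm.min() + 1e-8)     # normalize to [0, 1]
    return tdm  # high values mark tokens whose trajectories diverge, i.e. regions to edit
```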
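Similarly, the DCEdit entry describes enhancing the cross-attention map with visual and textual self-attention so it yields regional cues. One common way to realize this, sketched below, propagates cross-attention scores through the visual self-attention matrix; DCEdit's actual formulation may differ, and all names here are illustrative.

```python
# Illustrative cross-attention refinement sketch, not DCEdit's implementation.
import torch


def refine_cross_attention(cross_attn: torch.Tensor,
                           visual_self_attn: torch.Tensor,
                           steps: int = 2) -> torch.Tensor:
    """cross_attn: (image_tokens, text_tokens) attention from image patches to prompt words.
    visual_self_attn: (image_tokens, image_tokens) self-attention among image patches."""
    refined = cross_attn
    for _ in range(steps):
        # spread each word's relevance along visually similar patches
        refined = visual_self_attn @ refined
    # renormalize per word so the map can be thresholded into regional cues
    return refined / refined.sum(dim=0, keepdim=True).clamp_min(1e-8)
```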