Beyond Simple Edits: X-Planner for Complex Instruction-Based Image Editing
- URL: http://arxiv.org/abs/2507.05259v1
- Date: Mon, 07 Jul 2025 17:59:56 GMT
- Title: Beyond Simple Edits: X-Planner for Complex Instruction-Based Image Editing
- Authors: Chun-Hsiao Yeh, Yilin Wang, Nanxuan Zhao, Richard Zhang, Yuheng Li, Yi Ma, Krishna Kumar Singh,
- Abstract summary: X-Planner is a planning system that bridges user intent with editing model capabilities. X-Planner employs chain-of-thought reasoning to systematically decompose complex instructions into simpler, clear sub-instructions. For each sub-instruction, X-Planner automatically generates precise edit types and segmentation masks, eliminating manual intervention and ensuring localized, identity-preserving edits.
- Score: 43.3517273862321
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent diffusion-based image editing methods have significantly advanced text-guided tasks but often struggle to interpret complex, indirect instructions. Moreover, current models frequently suffer from poor identity preservation, unintended edits, or rely heavily on manual masks. To address these challenges, we introduce X-Planner, a Multimodal Large Language Model (MLLM)-based planning system that effectively bridges user intent with editing model capabilities. X-Planner employs chain-of-thought reasoning to systematically decompose complex instructions into simpler, clear sub-instructions. For each sub-instruction, X-Planner automatically generates precise edit types and segmentation masks, eliminating manual intervention and ensuring localized, identity-preserving edits. Additionally, we propose a novel automated pipeline for generating large-scale data to train X-Planner, which achieves state-of-the-art results on both existing benchmarks and our newly introduced complex editing benchmark.
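The plan-then-edit idea in the abstract — decomposing a complex instruction into simple sub-instructions, each paired with an edit type and a target region for mask generation — can be sketched as below. This is a minimal, illustrative sketch only: the `SubEdit` structure, the `plan` function, and its rule-based lookup are hypothetical stand-ins for the MLLM planner, not X-Planner's actual API.

```python
from dataclasses import dataclass

@dataclass
class SubEdit:
    """One step of a decomposed editing plan (illustrative, not X-Planner's schema)."""
    instruction: str  # a simple, explicit sub-instruction
    edit_type: str    # e.g. "add", "remove", "local", "global"
    target: str       # region to segment into a mask, or "everywhere"

def plan(complex_instruction: str) -> list[SubEdit]:
    """Toy rule-based stand-in for the chain-of-thought MLLM planner."""
    plans = {
        "make the photo look like winter": [
            SubEdit("add snow to the ground", "add", "ground"),
            SubEdit("remove leaves from the trees", "remove", "trees"),
            SubEdit("shift the color tone colder", "global", "everywhere"),
        ],
    }
    # Instructions the toy planner cannot decompose pass through as a single global edit.
    return plans.get(
        complex_instruction,
        [SubEdit(complex_instruction, "global", "everywhere")],
    )

def execute(image, sub_edits: list[SubEdit]):
    """Apply each sub-edit in sequence (segmentation/editing models stubbed out)."""
    for step in sub_edits:
        # mask = segment(image, step.target)              # segmentation model
        # image = edit(image, step.instruction, mask)     # diffusion-based editor
        pass
    return image
```

The point of the decomposition is that each sub-edit is localized by its mask, so the editor only touches the named region, which is what keeps identities outside the mask intact.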
Related papers
- Instruction-based Image Editing with Planning, Reasoning, and Generation [52.0364486403062]
Prior work utilizes a chain of large language models, object segmentation models, and editing models for this task. We aim to bridge understanding and generation via a new multi-modality model that brings intelligent reasoning abilities to instruction-based image editing models. Our method has competitive editing abilities on complex real-world images.
arXiv Detail & Related papers (2026-02-26T04:56:02Z) - RePlan: Reasoning-guided Region Planning for Complex Instruction-based Image Editing [80.70169829264812]
RePlan is a plan-then-execute framework that couples a vision-language planner with a diffusion editor. The planner decomposes instructions via step-by-step reasoning and explicitly grounds them to target regions. The editor then applies changes using a training-free attention-region injection mechanism, enabling precise, parallel multi-region edits without iterative inpainting.
arXiv Detail & Related papers (2025-12-18T18:34:23Z) - Understanding the Implicit User Intention via Reasoning with Large Language Model for Image Editing [23.69189799564107]
Existing image editing methods can handle simple editing instructions very well. To deal with complex editing instructions, they often need to jointly fine-tune the large language models (LLMs) and diffusion models (DMs). We propose a new method, called Complex Image Editing via LLM Reasoning (CIELR), which converts a complex user instruction into a set of simple and explicit editing actions.
arXiv Detail & Related papers (2025-10-31T10:06:28Z) - Image Editing As Programs with Diffusion Models [69.05164729625052]
We introduce Image Editing As Programs (IEAP), a unified image editing framework built upon the Diffusion Transformer (DiT) architecture. IEAP approaches instructional editing through a reductionist lens, decomposing complex editing instructions into sequences of atomic operations. Our framework delivers superior accuracy and semantic fidelity, particularly for complex, multi-step instructions.
arXiv Detail & Related papers (2025-06-04T16:57:24Z) - Beyond Editing Pairs: Fine-Grained Instructional Image Editing via Multi-Scale Learnable Regions [20.617718631292696]
We develop a novel paradigm for instruction-driven image editing that leverages the enormous corpus of widely available text-image pairs. Our approach introduces a multi-scale learnable region to localize and guide the editing process. By treating the alignment between images and their textual descriptions as supervision and learning to generate task-specific editing regions, our method achieves high-fidelity, precise, and instruction-consistent image editing.
arXiv Detail & Related papers (2025-05-25T22:40:59Z) - CompBench: Benchmarking Complex Instruction-guided Image Editing [63.347846732450364]
CompBench is a large-scale benchmark for complex instruction-guided image editing. We propose an MLLM-human collaborative framework with tailored task pipelines. We propose an instruction decoupling strategy that disentangles editing intents into four key dimensions.
arXiv Detail & Related papers (2025-05-18T02:30:52Z) - SmartFreeEdit: Mask-Free Spatial-Aware Image Editing with Complex Instruction Understanding [45.79481252237092]
SmartFreeEdit is an end-to-end framework that integrates a multimodal large language model (MLLM) with a hypergraph-enhanced inpainting architecture. Key innovations of SmartFreeEdit include region-aware tokens and a mask embedding paradigm. Experiments on the Reason-Edit benchmark demonstrate that SmartFreeEdit surpasses current state-of-the-art methods.
arXiv Detail & Related papers (2025-04-17T07:17:49Z) - XY-Cut++: Advanced Layout Ordering via Hierarchical Mask Mechanism on a Novel Benchmark [1.9020548287019097]
XY-Cut++ is a layout ordering method that integrates pre-mask processing, multi-granularity segmentation, and cross-modal matching. It achieves state-of-the-art performance (98.8 BLEU overall) while maintaining simplicity and efficiency.
arXiv Detail & Related papers (2025-04-14T14:19:57Z) - Disentangling Instruction Influence in Diffusion Transformers for Parallel Multi-Instruction-Guided Image Editing [26.02149948089938]
Instruction Influence Disentanglement (IID) is a novel framework enabling parallel execution of multiple instructions in a single denoising process. We analyze self-attention mechanisms in DiTs and derive instruction-specific attention masks to disentangle each instruction's influence. IID reduces diffusion steps while improving fidelity and instruction completion compared to existing baselines.
arXiv Detail & Related papers (2025-04-07T07:26:25Z) - BrushEdit: All-In-One Image Inpainting and Editing [76.93556996538398]
BrushEdit is a novel inpainting-based instruction-guided image editing paradigm. We devise a system enabling free-form instruction editing by integrating MLLMs and a dual-branch image inpainting model. Our framework effectively combines MLLMs and inpainting models, achieving superior performance across seven metrics.
arXiv Detail & Related papers (2024-12-13T17:58:06Z) - Task-Oriented Diffusion Inversion for High-Fidelity Text-based Editing [60.730661748555214]
We introduce Task-Oriented Diffusion Inversion (TODInv), a novel framework that inverts and edits real images tailored to specific editing tasks.
TODInv seamlessly integrates inversion and editing through reciprocal optimization, ensuring both high fidelity and precise editability.
arXiv Detail & Related papers (2024-08-23T22:16:34Z) - SmartEdit: Exploring Complex Instruction-based Image Editing with Multimodal Large Language Models [91.22477798288003]
This paper introduces SmartEdit, a novel approach to instruction-based image editing.
It exploits Multimodal Large Language Models (MLLMs) to enhance their understanding and reasoning capabilities.
We show that a small amount of complex instruction editing data can effectively stimulate SmartEdit's editing capabilities for more complex instructions.
arXiv Detail & Related papers (2023-12-11T17:54:11Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences.