I2E: From Image Pixels to Actionable Interactive Environments for Text-Guided Image Editing
- URL: http://arxiv.org/abs/2601.03741v1
- Date: Wed, 07 Jan 2026 09:29:57 GMT
- Title: I2E: From Image Pixels to Actionable Interactive Environments for Text-Guided Image Editing
- Authors: Jinghan Yu, Junhao Xiao, Chenyu Zhu, Jiaming Li, Jia Li, HanMing Deng, Xirui Wang, Guoli Jia, Jianjun Li, Zhiyuan Ma, Xiang Bai, Bowen Zhou
- Abstract summary: I2E is a novel "Decompose-then-Action" paradigm that revisits image editing as an actionable interaction process within a structured environment. I2E utilizes a Decomposer to transform unstructured images into discrete, manipulable object layers and then introduces a physics-aware Vision-Language-Action Agent to parse complex instructions. I2E significantly outperforms state-of-the-art methods in handling complex compositional instructions, maintaining physical plausibility, and ensuring multi-turn editing stability.
- Score: 59.434028565445885
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Existing text-guided image editing methods primarily rely on an end-to-end pixel-level inpainting paradigm. Despite its success in simple scenarios, this paradigm still struggles significantly with compositional editing tasks that require precise local control and complex multi-object spatial reasoning. It is severely limited by 1) the implicit coupling of planning and execution, 2) the lack of object-level control granularity, and 3) the reliance on unstructured, pixel-centric modeling. To address these limitations, we propose I2E, a novel "Decompose-then-Action" paradigm that revisits image editing as an actionable interaction process within a structured environment. I2E utilizes a Decomposer to transform unstructured images into discrete, manipulable object layers and then introduces a physics-aware Vision-Language-Action Agent to parse complex instructions into a series of atomic actions via Chain-of-Thought reasoning. We further construct I2E-Bench, a benchmark designed for multi-instance spatial reasoning and high-precision editing. Experimental results on I2E-Bench and multiple public benchmarks demonstrate that I2E significantly outperforms state-of-the-art methods in handling complex compositional instructions, maintaining physical plausibility, and ensuring multi-turn editing stability.
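As a rough illustration of the "Decompose-then-Action" idea in the abstract, the sketch below models an image as a set of object layers and applies a short sequence of atomic actions to them. All names here (ObjectLayer, Move, Remove, step) are hypothetical and not the paper's API; the actual Decomposer and Vision-Language-Action agent are learned models, which are stubbed out.

```python
from dataclasses import dataclass, replace
from typing import List, Tuple, Union

# Hypothetical structured environment: each object the Decomposer
# extracts becomes a manipulable layer (illustrative names, not the
# authors' interface; the learned Decomposer itself is omitted).
@dataclass(frozen=True)
class ObjectLayer:
    name: str
    position: Tuple[int, int]  # (x, y) of the layer's top-left corner
    z_order: int               # stacking order used when compositing

# Atomic actions a Vision-Language-Action agent might emit after
# Chain-of-Thought parsing of a compositional instruction.
@dataclass(frozen=True)
class Move:
    target: str
    delta: Tuple[int, int]

@dataclass(frozen=True)
class Remove:
    target: str

Action = Union[Move, Remove]

def step(layers: List[ObjectLayer], action: Action) -> List[ObjectLayer]:
    """Apply one atomic action to the layer environment."""
    if isinstance(action, Move):
        return [
            replace(l, position=(l.position[0] + action.delta[0],
                                 l.position[1] + action.delta[1]))
            if l.name == action.target else l
            for l in layers
        ]
    if isinstance(action, Remove):
        return [l for l in layers if l.name != action.target]
    raise ValueError(f"unknown action: {action!r}")

# A plan such as "move the cup up and remove the spoon" decomposed into
# atomic actions (hard-coded here; in I2E the agent would produce this
# sequence via Chain-of-Thought reasoning).
env = [ObjectLayer("cup", (40, 200), 1), ObjectLayer("spoon", (80, 210), 2)]
for act in [Move("cup", (0, -120)), Remove("spoon")]:
    env = step(env, act)
print(env)  # final layer state; compositing back to pixels is omitted
```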
Related papers
- InterCoG: Towards Spatially Precise Image Editing with Interleaved Chain-of-Grounding Reasoning [60.799998743918955]
We propose a novel text-vision Interleaved Chain-of-Grounding reasoning framework for fine-grained image editing in complex real-world scenes. The key insight of InterCoG is to first perform object position reasoning solely within text. We also propose two auxiliary training modules: multimodal grounding reconstruction supervision and multimodal grounding reasoning alignment.
arXiv Detail & Related papers (2026-03-02T08:13:16Z)
- MCIE: Multimodal LLM-Driven Complex Instruction Image Editing with Spatial Guidance [16.97760861651234]
MCIE-E1 is a large language model-driven complex instruction image editing method. It integrates two key modules: a spatial-aware cross-attention module and a background-consistent cross-attention module. It consistently outperforms previous state-of-the-art methods in both quantitative and qualitative assessments.
arXiv Detail & Related papers (2026-02-08T14:40:54Z)
- MIRA: Multimodal Iterative Reasoning Agent for Image Editing [48.41212094929379]
We propose MIRA (Multimodal Iterative Reasoning Agent), a lightweight, plug-and-play multimodal reasoning agent. Instead of issuing a single prompt or static plan, MIRA predicts atomic edit instructions step by step, using visual feedback to make its decisions. Our 150K multimodal tool-use dataset, MIRA-Editing, combined with a two-stage SFT + GRPO training pipeline, enables MIRA to perform reasoning and editing over complex editing instructions.
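As a loose sketch of the step-by-step loop this summary describes (not MIRA's actual interface), the snippet below alternates between predicting one atomic instruction from the current image and applying it, stopping when the agent signals completion. `propose_step` and `apply_edit` are hypothetical stand-ins for the learned agent and the underlying editor.

```python
from typing import Optional, Tuple

# Hypothetical stubs: in MIRA these would be the multimodal agent and
# the downstream editing model; trivial stand-ins keep the loop runnable.
def propose_step(image: str, goal: str, history: list) -> Optional[str]:
    """Return the next atomic instruction, or None when the goal is met."""
    plan = ["segment the red car", "recolor it blue", "blur the background"]
    return plan[len(history)] if len(history) < len(plan) else None

def apply_edit(image: str, instruction: str) -> str:
    """Apply one atomic edit (stubbed as a string transformation)."""
    return f"{image} -> [{instruction}]"

def edit_iteratively(image: str, goal: str, max_steps: int = 8) -> Tuple[str, list]:
    """Closed-loop editing: predict a step, apply it, observe, repeat."""
    history = []
    for _ in range(max_steps):
        step = propose_step(image, goal, history)  # decision uses visual feedback
        if step is None:                           # agent signals: done
            break
        image = apply_edit(image, step)
        history.append(step)
    return image, history

final, steps = edit_iteratively("photo.png", "make the red car blue")
print(final)
print(steps)
```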
arXiv Detail & Related papers (2025-11-26T06:13:32Z)
- Understanding the Implicit User Intention via Reasoning with Large Language Model for Image Editing [23.69189799564107]
Existing image editing methods can handle simple editing instructions very well. To deal with complex editing instructions, they often need to jointly fine-tune the large language models (LLMs) and diffusion models (DMs). We propose a new method, called Complex Image Editing via LLM Reasoning (CIELR), which converts a complex user instruction into a set of simple and explicit editing actions.
arXiv Detail & Related papers (2025-10-31T10:06:28Z)
- Image Editing As Programs with Diffusion Models [69.05164729625052]
We introduce Image Editing As Programs (IEAP), a unified image editing framework built upon the Diffusion Transformer (DiT) architecture. IEAP approaches instructional editing through a reductionist lens, decomposing complex editing instructions into sequences of atomic operations. Our framework delivers superior accuracy and semantic fidelity, particularly for complex, multi-step instructions.
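To make the "editing as programs" framing above concrete, here is a minimal sketch in which an edit is literally a list of atomic operations dispatched through a registry. The operation names and the registry are illustrative only; IEAP's real operators are diffusion-based, not string stubs.

```python
from typing import Callable, Dict, List, Tuple

# Illustrative operation registry: each atomic operation maps an image
# to an image. IEAP's actual operators run on a DiT backbone; strings
# stand in for images so the program machinery is runnable.
OPS: Dict[str, Callable[[str, dict], str]] = {
    "add":    lambda img, a: f"{img} + add({a['object']})",
    "remove": lambda img, a: f"{img} + remove({a['object']})",
    "move":   lambda img, a: f"{img} + move({a['object']}, {a['to']})",
}

# An edit "program": a sequence of (op_name, arguments) pairs, e.g. the
# decomposition of "put a hat on the dog and remove the leash".
Program = List[Tuple[str, dict]]

def run_program(image: str, program: Program) -> str:
    """Execute atomic operations in order, threading the image through."""
    for op_name, args in program:
        image = OPS[op_name](image, args)  # KeyError means unknown atomic op
    return image

program: Program = [
    ("add", {"object": "hat"}),
    ("remove", {"object": "leash"}),
]
print(run_program("dog_photo.png", program))
```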
arXiv Detail & Related papers (2025-06-04T16:57:24Z)
- CompBench: Benchmarking Complex Instruction-guided Image Editing [63.347846732450364]
CompBench is a large-scale benchmark for complex instruction-guided image editing. We propose an MLLM-human collaborative framework with tailored task pipelines. We also propose an instruction decoupling strategy that disentangles editing intents into four key dimensions.
arXiv Detail & Related papers (2025-05-18T02:30:52Z)
- BrushEdit: All-In-One Image Inpainting and Editing [76.93556996538398]
BrushEdit is a novel inpainting-based, instruction-guided image editing paradigm. We devise a system enabling free-form instruction editing by integrating MLLMs and a dual-branch image inpainting model. Our framework effectively combines MLLMs and inpainting models, achieving superior performance across seven metrics.
arXiv Detail & Related papers (2024-12-13T17:58:06Z)
- SGEdit: Bridging LLM with Text2Image Generative Model for Scene Graph-based Image Editing [42.23117201457898]
We introduce a new framework that integrates a large language model (LLM) with a Text2Image generative model for scene graph-based image editing.
Our framework significantly outperforms existing image editing methods in terms of editing precision and scene aesthetics.
arXiv Detail & Related papers (2024-10-15T17:40:48Z)
- FunEditor: Achieving Complex Image Edits via Function Aggregation with Diffusion Models [15.509233098264513]
Diffusion models have demonstrated outstanding performance in generative tasks, making them ideal candidates for image editing. We introduce FunEditor, an efficient diffusion model designed to learn atomic editing functions and perform complex edits by aggregating simpler functions. With only 4 steps of inference, FunEditor achieves 5-24x inference speedups over existing popular methods.
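As a schematic of the function-aggregation idea in this summary (not FunEditor's actual mechanism, which fuses learned diffusion edits), the snippet below composes several atomic edit functions into a single aggregated function applied in one pass, rather than running each edit as a separate full inference.

```python
from functools import reduce
from typing import Callable, List

# Atomic edit functions (illustrative string stubs standing in for
# FunEditor's learned diffusion-based editing functions).
def move_object(img: str) -> str:
    return f"move_obj({img})"

def blur_background(img: str) -> str:
    return f"blur_bg({img})"

def harmonize_colors(img: str) -> str:
    return f"harmonize({img})"

Edit = Callable[[str], str]

def aggregate(edits: List[Edit]) -> Edit:
    """Fuse a list of atomic edits into one composite function, so a
    complex edit runs as a single call instead of one pass per edit."""
    return lambda img: reduce(lambda acc, f: f(acc), edits, img)

composite = aggregate([move_object, blur_background, harmonize_colors])
print(composite("scene.png"))  # move_obj -> blur_bg -> harmonize, one call
```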
arXiv Detail & Related papers (2024-08-16T02:33:55Z)
- SmartEdit: Exploring Complex Instruction-based Image Editing with Multimodal Large Language Models [91.22477798288003]
This paper introduces SmartEdit, a novel approach to instruction-based image editing.
It exploits Multimodal Large Language Models (MLLMs) to enhance instruction understanding and reasoning capabilities.
We show that a small amount of complex instruction editing data can effectively stimulate SmartEdit's editing capabilities for more complex instructions.
arXiv Detail & Related papers (2023-12-11T17:54:11Z)