I2E: From Image Pixels to Actionable Interactive Environments for Text-Guided Image Editing
- URL: http://arxiv.org/abs/2601.03741v1
- Date: Wed, 07 Jan 2026 09:29:57 GMT
- Title: I2E: From Image Pixels to Actionable Interactive Environments for Text-Guided Image Editing
- Authors: Jinghan Yu, Junhao Xiao, Chenyu Zhu, Jiaming Li, Jia Li, HanMing Deng, Xirui Wang, Guoli Jia, Jianjun Li, Zhiyuan Ma, Xiang Bai, Bowen Zhou
- Abstract summary: I2E is a novel "Decompose-then-Action" paradigm that revisits image editing as an actionable interaction process within a structured environment. I2E utilizes a Decomposer to transform unstructured images into discrete, manipulable object layers and then introduces a physics-aware Vision-Language-Action Agent to parse complex instructions. I2E significantly outperforms state-of-the-art methods in handling complex compositional instructions, maintaining physical plausibility, and ensuring multi-turn editing stability.
- Score: 59.434028565445885
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Existing text-guided image editing methods primarily rely on an end-to-end pixel-level inpainting paradigm. Despite its success in simple scenarios, this paradigm still struggles significantly with compositional editing tasks that require precise local control and complex multi-object spatial reasoning. It is severely limited by 1) the implicit coupling of planning and execution, 2) the lack of object-level control granularity, and 3) the reliance on unstructured, pixel-centric modeling. To address these limitations, we propose I2E, a novel "Decompose-then-Action" paradigm that revisits image editing as an actionable interaction process within a structured environment. I2E utilizes a Decomposer to transform unstructured images into discrete, manipulable object layers and then introduces a physics-aware Vision-Language-Action Agent to parse complex instructions into a series of atomic actions via Chain-of-Thought reasoning. We further construct I2E-Bench, a benchmark designed for multi-instance spatial reasoning and high-precision editing. Experimental results on I2E-Bench and multiple public benchmarks demonstrate that I2E significantly outperforms state-of-the-art methods in handling complex compositional instructions, maintaining physical plausibility, and ensuring multi-turn editing stability.
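As a rough illustration of the "Decompose-then-Action" idea in the abstract, the sketch below models an image as a set of object layers and applies a short sequence of atomic actions to them. All names here (ObjectLayer, Move, Remove, step) are hypothetical and not the paper's API; the actual Decomposer and Vision-Language-Action agent are learned models, which are stubbed out.

```python
from dataclasses import dataclass, replace
from typing import List, Tuple, Union

# Hypothetical structured environment: each object the Decomposer
# extracts becomes a manipulable layer (illustrative names, not the
# authors' interface; the learned Decomposer itself is omitted).
@dataclass(frozen=True)
class ObjectLayer:
    name: str
    position: Tuple[int, int]  # (x, y) of the layer's top-left corner
    z_order: int               # stacking order used when compositing

# Atomic actions a Vision-Language-Action agent might emit after
# Chain-of-Thought parsing of a compositional instruction.
@dataclass(frozen=True)
class Move:
    target: str
    delta: Tuple[int, int]

@dataclass(frozen=True)
class Remove:
    target: str

Action = Union[Move, Remove]

def step(layers: List[ObjectLayer], action: Action) -> List[ObjectLayer]:
    """Apply one atomic action to the layer environment."""
    if isinstance(action, Move):
        return [
            replace(l, position=(l.position[0] + action.delta[0],
                                 l.position[1] + action.delta[1]))
            if l.name == action.target else l
            for l in layers
        ]
    if isinstance(action, Remove):
        return [l for l in layers if l.name != action.target]
    raise ValueError(f"unknown action: {action!r}")

# A plan such as "move the cup up and remove the spoon" decomposed into
# atomic actions (hard-coded here; in I2E the agent would produce this
# sequence via Chain-of-Thought reasoning).
env = [ObjectLayer("cup", (40, 200), 1), ObjectLayer("spoon", (80, 210), 2)]
for act in [Move("cup", (0, -120)), Remove("spoon")]:
    env = step(env, act)
print(env)  # final layer state; compositing back to pixels is omitted
```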
Related papers
- InterCoG: Towards Spatially Precise Image Editing with Interleaved Chain-of-Grounding Reasoning [60.799998743918955]
We propose a novel text-vision Interleaved Chain-of-Grounding reasoning framework for fine-grained image editing in complex real-world scenes. The key insight of InterCoG is to first perform object position reasoning solely within text. We also propose two auxiliary training modules: multimodal grounding reconstruction supervision and multimodal grounding reasoning alignment.
arXiv Detail & Related papers (2026-03-02T08:13:16Z)
- MCIE: Multimodal LLM-Driven Complex Instruction Image Editing with Spatial Guidance [16.97760861651234]
MCIE-E1 is a large language model-driven complex instruction image editing method. It integrates two key modules: a spatial-aware cross-attention module and a background-consistent cross-attention module. It consistently outperforms previous state-of-the-art methods in both quantitative and qualitative assessments.
arXiv Detail & Related papers (2026-02-08T14:40:54Z)
- MIRA: Multimodal Iterative Reasoning Agent for Image Editing [48.41212094929379]
We propose MIRA (Multimodal Iterative Reasoning Agent), a lightweight, plug-and-play multimodal reasoning agent. Instead of issuing a single prompt or static plan, MIRA predicts atomic edit instructions step by step, using visual feedback to make its decisions. Our 150K multimodal tool-use dataset, MIRA-Editing, combined with a two-stage SFT + GRPO training pipeline, enables MIRA to perform reasoning and editing over complex editing instructions.
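As a loose sketch of the step-by-step loop this summary describes (not MIRA's actual interface), the snippet below alternates between predicting one atomic instruction from the current image and applying it, stopping when the agent signals completion. `propose_step` and `apply_edit` are hypothetical stand-ins for the learned agent and the underlying editor.

```python
from typing import Optional, Tuple

# Hypothetical stubs: in MIRA these would be the multimodal agent and
# the downstream editing model; trivial stand-ins keep the loop runnable.
def propose_step(image: str, goal: str, history: list) -> Optional[str]:
    """Return the next atomic instruction, or None when the goal is met."""
    plan = ["segment the red car", "recolor it blue", "blur the background"]
    return plan[len(history)] if len(history) < len(plan) else None

def apply_edit(image: str, instruction: str) -> str:
    """Apply one atomic edit (stubbed as a string transformation)."""
    return f"{image} -> [{instruction}]"

def edit_iteratively(image: str, goal: str, max_steps: int = 8) -> Tuple[str, list]:
    """Closed-loop editing: predict a step, apply it, observe, repeat."""
    history = []
    for _ in range(max_steps):
        step = propose_step(image, goal, history)  # decision uses visual feedback
        if step is None:                           # agent signals: done
            break
        image = apply_edit(image, step)
        history.append(step)
    return image, history

final, steps = edit_iteratively("photo.png", "make the red car blue")
print(final)
print(steps)
```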
arXiv Detail & Related papers (2025-11-26T06:13:32Z)
- Understanding the Implicit User Intention via Reasoning with Large Language Model for Image Editing [23.69189799564107]
Existing image editing methods can handle simple editing instructions very well. To deal with complex editing instructions, they often need to jointly fine-tune the large language models (LLMs) and diffusion models (DMs). We propose a new method, called Complex Image Editing via LLM Reasoning (CIELR), which converts a complex user instruction into a set of simple and explicit editing actions.
arXiv Detail & Related papers (2025-10-31T10:06:28Z)
- Image Editing As Programs with Diffusion Models [69.05164729625052]
We introduce Image Editing As Programs (IEAP), a unified image editing framework built upon the Diffusion Transformer (DiT) architecture. IEAP approaches instructional editing through a reductionist lens, decomposing complex editing instructions into sequences of atomic operations. Our framework delivers superior accuracy and semantic fidelity, particularly for complex, multi-step instructions.
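To make the "editing as programs" framing above concrete, here is a minimal sketch in which an edit is literally a list of atomic operations dispatched through a registry. The operation names and the registry are illustrative only; IEAP's real operators are diffusion-based, not string stubs.

```python
from typing import Callable, Dict, List, Tuple

# Illustrative operation registry: each atomic operation maps an image
# to an image. IEAP's actual operators run on a DiT backbone; strings
# stand in for images so the program machinery is runnable.
OPS: Dict[str, Callable[[str, dict], str]] = {
    "add":    lambda img, a: f"{img} + add({a['object']})",
    "remove": lambda img, a: f"{img} + remove({a['object']})",
    "move":   lambda img, a: f"{img} + move({a['object']}, {a['to']})",
}

# An edit "program": a sequence of (op_name, arguments) pairs, e.g. the
# decomposition of "put a hat on the dog and remove the leash".
Program = List[Tuple[str, dict]]

def run_program(image: str, program: Program) -> str:
    """Execute atomic operations in order, threading the image through."""
    for op_name, args in program:
        image = OPS[op_name](image, args)  # KeyError means unknown atomic op
    return image

program: Program = [
    ("add", {"object": "hat"}),
    ("remove", {"object": "leash"}),
]
print(run_program("dog_photo.png", program))
```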
arXiv Detail & Related papers (2025-06-04T16:57:24Z)
- CompBench: Benchmarking Complex Instruction-guided Image Editing [63.347846732450364]
CompBench is a large-scale benchmark for complex instruction-guided image editing. We propose an MLLM-human collaborative framework with tailored task pipelines. We also propose an instruction decoupling strategy that disentangles editing intents into four key dimensions.
arXiv Detail & Related papers (2025-05-18T02:30:52Z)
- BrushEdit: All-In-One Image Inpainting and Editing [76.93556996538398]
BrushEdit is a novel inpainting-based, instruction-guided image editing paradigm. We devise a system enabling free-form instruction editing by integrating MLLMs and a dual-branch image inpainting model. Our framework effectively combines MLLMs and inpainting models, achieving superior performance across seven metrics.
arXiv Detail & Related papers (2024-12-13T17:58:06Z)
- SGEdit: Bridging LLM with Text2Image Generative Model for Scene Graph-based Image Editing [42.23117201457898]
We introduce a new framework that integrates a large language model (LLM) with a Text2Image generative model for scene graph-based image editing.
Our framework significantly outperforms existing image editing methods in terms of editing precision and scene aesthetics.
arXiv Detail & Related papers (2024-10-15T17:40:48Z)
- FunEditor: Achieving Complex Image Edits via Function Aggregation with Diffusion Models [15.509233098264513]
Diffusion models have demonstrated outstanding performance in generative tasks, making them ideal candidates for image editing. We introduce FunEditor, an efficient diffusion model designed to learn atomic editing functions and perform complex edits by aggregating simpler functions. With only 4 steps of inference, FunEditor achieves 5-24x inference speedups over existing popular methods.
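As a schematic of the function-aggregation idea in this summary (not FunEditor's actual mechanism, which fuses learned diffusion edits), the snippet below composes several atomic edit functions into a single aggregated function applied in one pass, rather than running each edit as a separate full inference.

```python
from functools import reduce
from typing import Callable, List

# Atomic edit functions (illustrative string stubs standing in for
# FunEditor's learned diffusion-based editing functions).
def move_object(img: str) -> str:
    return f"move_obj({img})"

def blur_background(img: str) -> str:
    return f"blur_bg({img})"

def harmonize_colors(img: str) -> str:
    return f"harmonize({img})"

Edit = Callable[[str], str]

def aggregate(edits: List[Edit]) -> Edit:
    """Fuse a list of atomic edits into one composite function, so a
    complex edit runs as a single call instead of one pass per edit."""
    return lambda img: reduce(lambda acc, f: f(acc), edits, img)

composite = aggregate([move_object, blur_background, harmonize_colors])
print(composite("scene.png"))  # move_obj -> blur_bg -> harmonize, one call
```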
arXiv Detail & Related papers (2024-08-16T02:33:55Z)
- SmartEdit: Exploring Complex Instruction-based Image Editing with Multimodal Large Language Models [91.22477798288003]
This paper introduces SmartEdit, a novel approach to instruction-based image editing.
It exploits Multimodal Large Language Models (MLLMs) to enhance instruction understanding and reasoning capabilities.
We show that a small amount of complex instruction editing data can effectively stimulate SmartEdit's editing capabilities for more complex instructions.
arXiv Detail & Related papers (2023-12-11T17:54:11Z)