MCIE: Multimodal LLM-Driven Complex Instruction Image Editing with Spatial Guidance
- URL: http://arxiv.org/abs/2602.07993v1
- Date: Sun, 08 Feb 2026 14:40:54 GMT
- Title: MCIE: Multimodal LLM-Driven Complex Instruction Image Editing with Spatial Guidance
- Authors: Xuehai Bai, Xiaoling Gu, Akide Liu, Hangjie Yuan, YiFan Zhang, Jack Ma,
- Abstract summary: MCIE-E1 is a multimodal large language model-driven complex instruction image editing method. It integrates two key modules: a spatial-aware cross-attention module and a background-consistent cross-attention module. It consistently outperforms previous state-of-the-art methods in both quantitative and qualitative assessments.
- Score: 16.97760861651234
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent advances in instruction-based image editing have shown remarkable progress. However, existing methods remain limited to relatively simple editing operations, hindering real-world applications that require complex and compositional instructions. In this work, we address these limitations from the perspectives of architectural design, data, and evaluation protocols. Specifically, we identify two key challenges in current models: insufficient instruction compliance and background inconsistency. To this end, we propose MCIE-E1, a Multimodal Large Language Model-Driven Complex Instruction Image Editing method that integrates two key modules: a spatial-aware cross-attention module and a background-consistent cross-attention module. The former enhances instruction-following capability by explicitly aligning semantic instructions with spatial regions through spatial guidance during the denoising process, while the latter preserves features in unedited regions to maintain background consistency. To enable effective training, we construct a dedicated data pipeline to mitigate the scarcity of complex instruction-based image editing datasets, combining fine-grained automatic filtering via a powerful MLLM with rigorous human validation. Finally, to comprehensively evaluate complex instruction-based image editing, we introduce CIE-Bench, a new benchmark with two new evaluation metrics. Experimental results on CIE-Bench demonstrate that MCIE-E1 consistently outperforms previous state-of-the-art methods in both quantitative and qualitative assessments, achieving a 23.96% improvement in instruction compliance.
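To make the two modules concrete, below is a minimal, hypothetical PyTorch sketch of the two ideas the abstract names: cross-attention gated by a grounded edit-region mask, and a blend that copies source features back into unedited regions during denoising. The function names, tensor shapes, and the assumption that a region mask is available (e.g. from a grounding model) are all illustrative; the paper's actual module designs are not reproduced here.

```python
# Hypothetical sketch only; not the paper's actual MCIE-E1 modules.
import torch

def spatially_guided_cross_attention(q, k, v, region_mask):
    """Cross-attention gated by a grounded edit-region mask.

    q:           (B, Nq, D) image-latent queries (one per spatial token)
    k, v:        (B, Nk, D) instruction-token keys/values
    region_mask: (B, Nq) with 1.0 inside the instruction's target region
    """
    scale = q.shape[-1] ** -0.5
    attn = torch.softmax(q @ k.transpose(-2, -1) * scale, dim=-1)  # (B, Nq, Nk)
    out = attn @ v                                                 # (B, Nq, D)
    # Route instruction features only into the grounded edit region.
    return out * region_mask.unsqueeze(-1)

def background_consistent_blend(edited, source, region_mask):
    """At each denoising step, keep source features in unedited regions."""
    m = region_mask.unsqueeze(-1)                 # (B, Nq, 1)
    return m * edited + (1.0 - m) * source
```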
Related papers
- I2E: From Image Pixels to Actionable Interactive Environments for Text-Guided Image Editing [59.434028565445885]
I2E is a novel "Decompose-then-Action" paradigm that revisits image editing as an actionable interaction process within a structured environment. I2E utilizes a Decomposer to transform unstructured images into discrete, manipulable object layers, and then introduces a physics-aware Vision-Language-Action Agent to parse complex instructions. I2E significantly outperforms state-of-the-art methods in handling complex compositional instructions, maintaining physical plausibility, and ensuring multi-turn editing stability.
arXiv Detail & Related papers (2026-01-07T09:29:57Z)
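As a way to picture the "Decompose-then-Action" setup, here is a speculative Python sketch of a layered scene state an agent could act on. All class, field, and method names are invented for illustration and are not I2E's actual interface.

```python
# Speculative illustration of an editable, layered scene; names are invented.
from dataclasses import dataclass, field

@dataclass
class ObjectLayer:
    label: str          # e.g. "red mug"
    bbox: tuple         # (x0, y0, x1, y1) in pixel coordinates
    depth_order: int    # occlusion / stacking order

@dataclass
class SceneEnvironment:
    width: int
    height: int
    layers: list = field(default_factory=list)

    def move(self, label: str, dx: int, dy: int) -> None:
        """One discrete 'action' an agent could issue against the scene."""
        for layer in self.layers:
            if layer.label == label:
                x0, y0, x1, y1 = layer.bbox
                layer.bbox = (x0 + dx, y0 + dy, x1 + dx, y1 + dy)
```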
- CogniEdit: Dense Gradient Flow Optimization for Fine-Grained Image Editing [88.9067184995168]
We propose a unified framework, CogniEdit, combining multi-modal reasoning with dense reward optimization. Our method achieves state-of-the-art performance in balancing fine-grained instruction following with visual quality and editability preservation.
arXiv Detail & Related papers (2025-12-15T12:36:50Z)
- DreamOmni2: Multimodal Instruction-based Editing and Generation [77.997848231822]
We propose two novel tasks: multimodal instruction-based editing and generation. These tasks support both text and image instructions and extend the scope to include both concrete and abstract concepts. Our data synthesis pipeline consists of three steps: (1) using a feature mixing method to create extraction data for both abstract and concrete concepts, (2) generating multimodal instruction-based editing training data using the editing and extraction models, and (3) further applying the extraction model to create training data for multimodal instruction-based generation.
arXiv Detail & Related papers (2025-10-08T06:07:14Z)
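The three-step pipeline above could be wired together roughly as follows. This is a loose sketch under stated assumptions: the record fields and the callable signatures for the mixing, editing, and extraction models are placeholders, not DreamOmni2's actual code.

```python
# Loose sketch of the three-step synthesis pipeline; signatures are assumptions.
def build_dreamomni2_style_data(mix_features, edit_model, extract_model, images):
    extraction_data = []          # step (1): feature mixing -> extraction pairs
    for img_a, img_b in images:
        mixed = mix_features(img_a, img_b)
        extraction_data.append({"input": mixed, "target_concept": img_a})

    editing_data = []             # step (2): editing + extraction models
    for rec in extraction_data:
        instruction = "make the subject match the reference"  # placeholder text
        edited = edit_model(rec["input"], rec["target_concept"], instruction)
        editing_data.append({"source": rec["input"],
                             "reference": rec["target_concept"],
                             "instruction": instruction,
                             "target": edited})

    generation_data = []          # step (3): extraction applied to edited pairs
    for rec in editing_data:
        concept = extract_model(rec["target"])
        generation_data.append({"reference": concept,
                                "instruction": rec["instruction"],
                                "target": rec["target"]})
    return editing_data, generation_data
```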
- Draw-In-Mind: Rebalancing Designer-Painter Roles in Unified Multimodal Models Benefits Image Editing [53.197392152109636]
We introduce Draw-In-Mind (DIM), a dataset consisting of two complementary subsets: (i) DIM-T2I, containing 14M long-context image-text pairs to enhance complex instruction comprehension; and (ii) DIM-Edit, consisting of 233K chain-of-thought imaginations generated by GPT-4o, serving as explicit design blueprints for image edits. DIM-4.6B-T2I/Edit achieves competitive performance on the ImgEdit and GEdit-Bench benchmarks, outperforming much larger models such as UniWorld-V1 and Step1X-Edit.
arXiv Detail & Related papers (2025-09-02T06:06:52Z)
- $\texttt{Complex-Edit}$: CoT-Like Instruction Generation for Complexity-Controllable Image Editing Benchmark [36.58090024531738]
We introduce $\texttt{Complex-Edit}$, a comprehensive benchmark designed to evaluate instruction-based image editing models. We harness GPT-4o to automatically collect a diverse set of editing instructions at scale. We introduce a suite of metrics to assess various aspects of editing performance, along with a VLM-based auto-evaluation pipeline.
arXiv Detail & Related papers (2025-04-17T17:51:59Z)
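One plausible reading of "CoT-like, complexity-controllable" generation is chaining atomic edits and asking an LLM to fuse them into one compound instruction. The sketch below illustrates that idea; `ask_llm` is a hypothetical stand-in for a GPT-4o call, and the atomic-edit list is invented.

```python
# Hedged sketch of complexity-controllable instruction composition.
import random

ATOMIC_EDITS = [
    "change the sky to sunset colors",
    "add a small dog on the left",
    "make the photo look like an oil painting",
    "remove the background people",
]

def make_complex_instruction(ask_llm, complexity: int) -> str:
    # `complexity` = number of chained atomic edits (<= len(ATOMIC_EDITS) here).
    steps = random.sample(ATOMIC_EDITS, k=complexity)
    chain = "; then ".join(steps)
    return ask_llm(f"Rewrite as one coherent editing instruction: {chain}")
```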
- MIGE: Mutually Enhanced Multimodal Instruction-Based Image Generation and Editing [25.118495616895597]
MIGE is a unified framework that standardizes task representations using multimodal instructions. It first treats subject-driven generation as creation on a blank canvas and instruction-based editing as modification of an existing image. MIGE excels in both subject-driven generation and instruction-based editing while setting a SOTA in the new task of instruction-based subject-driven editing.
arXiv Detail & Related papers (2025-02-28T18:21:08Z)
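The "blank canvas" unification can be illustrated with a small input-construction sketch: both tasks reduce to a (canvas, multimodal instruction) pair, where generation starts from an empty canvas and editing starts from the source image. All names here are assumptions, not MIGE's API.

```python
# Illustrative sketch of a unified task input; names are assumptions.
import numpy as np

def build_unified_input(instruction: str, ref_images: list, source=None,
                        size=(512, 512)):
    """Subject-driven generation edits a blank canvas; editing edits `source`."""
    canvas = source if source is not None else np.zeros((*size, 3), np.uint8)
    # Interleave the text with image placeholders the model can attend to.
    tokens = [instruction] + [f"<image_{i}>" for i in range(len(ref_images))]
    return {"canvas": canvas, "instruction_tokens": tokens, "images": ref_images}
```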
- MuSC: Improving Complex Instruction Following with Multi-granularity Self-Contrastive Training [36.483136685734735]
We propose a Multi-granularity Self-Contrastive Training (MuSC) framework to improve complex instruction alignment without relying on a stronger model. Our method is evaluated on open-source models, and experimental results show that it achieves significant improvements on both complex and general instruction-following benchmarks.
arXiv Detail & Related papers (2025-02-17T08:12:49Z)
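One plausible reading of "self-contrastive" is that the model's own output on the full constraint set serves as the positive, while its output on a relaxed instruction (one constraint dropped) serves as the negative. The sketch below is an assumption about the mechanism, not the paper's exact recipe.

```python
# Rough sketch of self-contrastive preference-pair construction (assumed mechanism).
def build_preference_pairs(model, instruction: str, constraints: list):
    full = instruction + " " + " ".join(constraints)
    chosen = model(full)                       # satisfies all constraints
    pairs = []
    for i in range(len(constraints)):
        relaxed = instruction + " " + " ".join(constraints[:i] + constraints[i+1:])
        rejected = model(relaxed)              # misses constraint i
        pairs.append({"prompt": full, "chosen": chosen, "rejected": rejected})
    return pairs                               # feed to DPO-style training
```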
- Compositional Image Retrieval via Instruction-Aware Contrastive Learning [40.54022628032561]
Composed Image Retrieval (CIR) involves retrieving a target image based on a composed query of an image paired with text that specifies modifications or changes to the visual reference. In practice, due to the scarcity of annotated data in downstream tasks, Zero-Shot CIR (ZS-CIR) is desirable. We propose a novel embedding method utilizing an instruction-tuned Multimodal LLM (MLLM) to generate composed representations.
arXiv Detail & Related papers (2024-12-07T22:46:52Z)
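At retrieval time, a composed-embedding method like the one above typically scores a gallery by cosine similarity against the composed query vector. A minimal sketch, assuming hypothetical `embed_composed` (MLLM: image + modification text, producing a vector) and a precomputed gallery in the same embedding space:

```python
# Minimal composed-embedding retrieval sketch; encoders are hypothetical.
import numpy as np

def retrieve(query_image, modification_text, gallery_vectors, embed_composed):
    q = embed_composed(query_image, modification_text)        # (D,)
    q = q / np.linalg.norm(q)
    g = gallery_vectors / np.linalg.norm(gallery_vectors, axis=1, keepdims=True)
    scores = g @ q                                            # cosine similarity
    return np.argsort(-scores)                                # best matches first
```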
- SmartEdit: Exploring Complex Instruction-based Image Editing with Multimodal Large Language Models [91.22477798288003]
This paper introduces SmartEdit, a novel approach to instruction-based image editing. It exploits Multimodal Large Language Models (MLLMs) to enhance their understanding and reasoning capabilities. We show that a small amount of complex instruction editing data can effectively stimulate SmartEdit's editing capabilities for more complex instructions.
arXiv Detail & Related papers (2023-12-11T17:54:11Z)
- USER: Unified Semantic Enhancement with Momentum Contrast for Image-Text Retrieval [115.28586222748478]
Image-Text Retrieval (ITR) aims at searching for the target instances that are semantically relevant to the given query from the other modality.
Existing approaches typically suffer from two major limitations.
arXiv Detail & Related papers (2023-01-17T12:42:58Z)
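For readers unfamiliar with the momentum contrast named in USER's title, here is a generic MoCo-style sketch of the general technique applied to image-text pairs: a key encoder trails the query encoder as an exponential moving average, and an InfoNCE loss contrasts each positive pair against a queue of negatives. This illustrates the standard technique only, not the paper's exact architecture or hyperparameters.

```python
# Generic MoCo-style momentum contrast sketch; not USER's exact design.
import torch

@torch.no_grad()
def momentum_update(query_encoder, key_encoder, m: float = 0.999):
    """Key encoder trails the query encoder as an exponential moving average."""
    for q_param, k_param in zip(query_encoder.parameters(),
                                key_encoder.parameters()):
        k_param.data.mul_(m).add_(q_param.data, alpha=1.0 - m)

def info_nce(q, k_pos, queue, temperature: float = 0.07):
    """q: (B, D) image embeddings; k_pos: (B, D) matching text embeddings;
    queue: (K, D) negatives from earlier batches. All rows L2-normalized."""
    l_pos = (q * k_pos).sum(dim=1, keepdim=True)        # (B, 1)
    l_neg = q @ queue.t()                               # (B, K)
    logits = torch.cat([l_pos, l_neg], dim=1) / temperature
    labels = torch.zeros(q.size(0), dtype=torch.long)   # positives at index 0
    return torch.nn.functional.cross_entropy(logits, labels)
```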