SmartEdit: Exploring Complex Instruction-based Image Editing with
Multimodal Large Language Models
- URL: http://arxiv.org/abs/2312.06739v1
- Date: Mon, 11 Dec 2023 17:54:11 GMT
- Title: SmartEdit: Exploring Complex Instruction-based Image Editing with
Multimodal Large Language Models
- Authors: Yuzhou Huang, Liangbin Xie, Xintao Wang, Ziyang Yuan, Xiaodong Cun,
Yixiao Ge, Jiantao Zhou, Chao Dong, Rui Huang, Ruimao Zhang, Ying Shan
- Abstract summary: This paper introduces SmartEdit, a novel approach to instruction-based image editing.
It exploits Multimodal Large Language Models (MLLMs) to enhance its understanding and reasoning capabilities.
We show that a small amount of complex instruction editing data can effectively stimulate SmartEdit's editing capabilities for more complex instructions.
- Score: 91.22477798288003
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Current instruction-based editing methods, such as InstructPix2Pix, often
fail to produce satisfactory results in complex scenarios because they depend
on the simple CLIP text encoder in diffusion models. To address this, this
paper introduces SmartEdit, a novel approach to instruction-based image
editing that leverages Multimodal Large Language Models (MLLMs) to
enhance its understanding and reasoning capabilities. However, direct
integration of these elements still faces challenges in situations requiring
complex reasoning. To mitigate this, we propose a Bidirectional Interaction
Module that enables comprehensive bidirectional information interactions
between the input image and the MLLM output. During training, we initially
incorporate perception data to boost the perception and understanding
capabilities of diffusion models. Subsequently, we demonstrate that a small
amount of complex instruction editing data can effectively stimulate
SmartEdit's editing capabilities for more complex instructions. We further
construct a new evaluation dataset, Reason-Edit, specifically tailored for
complex instruction-based image editing. Both quantitative and qualitative
results on this evaluation dataset indicate that our SmartEdit surpasses
previous methods, paving the way for the practical application of complex
instruction-based image editing.
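
A note on the architecture: the abstract does not specify how the Bidirectional Interaction Module is built, only that the input image features and the MLLM output exchange information in both directions. The PyTorch sketch below is a minimal illustration under the assumption of two cross-attention passes with residual connections; the class name, dimensions, and layout are hypothetical, not the paper's actual design.

import torch
import torch.nn as nn

class BidirectionalInteraction(nn.Module):
    # Hypothetical sketch: image tokens and MLLM output tokens attend to
    # each other, so each stream is updated with information from the other.
    def __init__(self, dim: int = 768, heads: int = 8):
        super().__init__()
        self.img_to_llm = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.llm_to_img = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_img = nn.LayerNorm(dim)
        self.norm_llm = nn.LayerNorm(dim)

    def forward(self, img_feats: torch.Tensor, llm_feats: torch.Tensor):
        # img_feats: (B, N_img, dim) visual tokens from the input image
        # llm_feats: (B, N_llm, dim) hidden states of the MLLM output
        img_upd, _ = self.img_to_llm(img_feats, llm_feats, llm_feats)
        llm_upd, _ = self.llm_to_img(llm_feats, img_feats, img_feats)
        img_feats = self.norm_img(img_feats + img_upd)  # residual + norm
        llm_feats = self.norm_llm(llm_feats + llm_upd)
        # Both fused streams would then condition the diffusion editor.
        return img_feats, llm_feats
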
Related papers
- InsightEdit: Towards Better Instruction Following for Image Editing [12.683378605956024]
We focus on the task of instruction-based image editing. Previous works like InstructPix2Pix, InstructDiffusion, and SmartEdit have explored end-to-end editing.
We introduce a two-stream bridging mechanism that utilizes both the textual and visual features produced by a powerful Multimodal Large Language Model (MLLM).
Our approach, InsightEdit, achieves state-of-the-art performance, excelling in complex instruction following and maintaining high background consistency with the original image.
arXiv Detail & Related papers (2024-11-26T11:11:10Z)
- ReEdit: Multimodal Exemplar-Based Image Editing with Diffusion Models [11.830273909934688]
Modern Text-to-Image (T2I) Diffusion models have revolutionized image editing by enabling the generation of high-quality images.
We propose ReEdit, a modular and efficient end-to-end framework that captures edits in both text and image modalities.
Our results demonstrate that ReEdit consistently outperforms contemporary approaches both qualitatively and quantitatively.
arXiv Detail & Related papers (2024-11-06T15:19:24Z)
- Achieving Complex Image Edits via Function Aggregation with Diffusion Models [15.509233098264513]
Diffusion models have demonstrated strong performance in generative tasks, making them ideal candidates for image editing.
We introduce FunEditor, an efficient diffusion model designed to learn atomic editing functions and perform complex edits by aggregating simpler functions.
FunEditor achieves 5 to 24 times faster inference than existing methods on complex tasks such as object movement.
arXiv Detail & Related papers (2024-08-16T02:33:55Z)
- A Survey of Multimodal-Guided Image Editing with Text-to-Image Diffusion Models [117.77807994397784]
Image editing aims to modify a given synthetic or real image to meet users' specific requirements.
Recent significant advancement in this field is based on the development of text-to-image (T2I) diffusion models.
T2I-based image editing methods significantly enhance editing performance and offer a user-friendly interface for modifying content guided by multimodal inputs.
arXiv Detail & Related papers (2024-06-20T17:58:52Z)
- ReasonPix2Pix: Instruction Reasoning Dataset for Advanced Image Editing [77.12834553200632]
We introduce ReasonPix2Pix, a comprehensive reasoning-attentive instruction editing dataset.
The dataset is characterized by 1) reasoning instructions, 2) more realistic images from fine-grained categories, and 3) greater variance between input and edited images.
When fine-tuned with our dataset under supervised conditions, the model demonstrates superior performance in instructional editing tasks, regardless of whether the tasks require reasoning.
arXiv Detail & Related papers (2024-05-18T06:03:42Z)
- SEED-Data-Edit Technical Report: A Hybrid Dataset for Instructional Image Editing [53.00272278754867]
SEED-Data-Edit is a hybrid dataset for instruction-guided image editing that combines three sources: high-quality editing data produced by an automated pipeline, real-world scenario data collected from the internet, and high-precision multi-turn editing data annotated by humans.
arXiv Detail & Related papers (2024-05-07T04:55:47Z)
- Guiding Instruction-based Image Editing via Multimodal Large Language Models [102.82211398699644]
Multimodal large language models (MLLMs) show promising capabilities in cross-modal understanding and visual-aware response generation.
We investigate how MLLMs facilitate edit instructions and present MLLM-Guided Image Editing (MGIE).
MGIE learns to derive expressive instructions and provides explicit guidance.
arXiv Detail & Related papers (2023-09-29T10:01:50Z)
- LEDITS: Real Image Editing with DDPM Inversion and Semantic Guidance [0.0]
LEDITS is a lightweight approach for real-image editing that combines the Edit Friendly DDPM inversion technique with Semantic Guidance.
This approach achieves versatile edits, both subtle and extensive, as well as alterations in composition and style, while requiring neither optimization nor architectural extensions.
arXiv Detail & Related papers (2023-07-02T09:11:09Z)
- InstructPix2Pix: Learning to Follow Image Editing Instructions [103.77092910685764]
We propose a method for editing images from human instructions: given an input image and a written instruction that tells the model what to do, our model follows the instruction to edit the image.
We show compelling editing results for a diverse collection of input images and written instructions; a minimal usage sketch follows this list.
arXiv Detail & Related papers (2022-11-17T18:58:43Z)
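
For readers who want a concrete reference point for the InstructPix2Pix baseline that SmartEdit is compared against, below is a minimal usage sketch with the Hugging Face diffusers pipeline. The pipeline class and the timbrooks/instruct-pix2pix checkpoint are the publicly released ones; the prompt, file names, and guidance values are illustrative only.

import torch
from PIL import Image
from diffusers import StableDiffusionInstructPix2PixPipeline

# Load the released InstructPix2Pix checkpoint (fp16 on GPU).
pipe = StableDiffusionInstructPix2PixPipeline.from_pretrained(
    "timbrooks/instruct-pix2pix", torch_dtype=torch.float16
).to("cuda")

image = Image.open("input.png").convert("RGB")
edited = pipe(
    "make the sky look like a sunset",  # written editing instruction
    image=image,
    num_inference_steps=20,
    image_guidance_scale=1.5,  # higher = stay closer to the input image
    guidance_scale=7.5,        # higher = follow the instruction more strictly
).images[0]
edited.save("edited.png")

The two guidance scales reflect InstructPix2Pix's dual classifier-free guidance over the image and text conditions; tuning their balance trades background preservation against edit strength.
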
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.