SmartEdit: Exploring Complex Instruction-based Image Editing with
Multimodal Large Language Models
- URL: http://arxiv.org/abs/2312.06739v1
- Date: Mon, 11 Dec 2023 17:54:11 GMT
- Authors: Yuzhou Huang, Liangbin Xie, Xintao Wang, Ziyang Yuan, Xiaodong Cun,
Yixiao Ge, Jiantao Zhou, Chao Dong, Rui Huang, Ruimao Zhang, Ying Shan
- Abstract summary: This paper introduces SmartEdit, a novel approach to instruction-based image editing.
It leverages Multimodal Large Language Models (MLLMs) to enhance the understanding and reasoning capabilities of the editing pipeline.
We show that a small amount of complex instruction editing data can effectively stimulate SmartEdit's editing capabilities for more complex instructions.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Current instruction-based editing methods, such as InstructPix2Pix, often
fail to produce satisfactory results in complex scenarios due to their
dependence on the simple CLIP text encoder in diffusion models. To rectify
this, this paper introduces SmartEdit, a novel approach to instruction-based
image editing that leverages Multimodal Large Language Models (MLLMs) to
enhance their understanding and reasoning capabilities. However, direct
integration of these elements still faces challenges in situations requiring
complex reasoning. To mitigate this, we propose a Bidirectional Interaction
Module that enables comprehensive bidirectional information interactions
between the input image and the MLLM output. During training, we initially
incorporate perception data to boost the perception and understanding
capabilities of diffusion models. Subsequently, we demonstrate that a small
amount of complex instruction editing data can effectively stimulate
SmartEdit's editing capabilities for more complex instructions. We further
construct a new evaluation dataset, Reason-Edit, specifically tailored for
complex instruction-based image editing. Both quantitative and qualitative
results on this evaluation dataset indicate that our SmartEdit surpasses
previous methods, paving the way for the practical application of complex
instruction-based image editing.
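The abstract's Bidirectional Interaction Module, which lets image features and MLLM output features inform each other, can be illustrated with a minimal sketch. This is not the paper's actual architecture (which uses learned projections inside a diffusion pipeline); it is a conceptual numpy toy in which each token stream cross-attends to the other with a residual connection, and all shapes and names are assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys_values):
    # Single-head scaled dot-product cross-attention, no learned projections.
    scale = 1.0 / np.sqrt(queries.shape[-1])
    attn = softmax(queries @ keys_values.T * scale, axis=-1)
    return attn @ keys_values

def bidirectional_interaction(image_tokens, mllm_tokens):
    # Each stream attends to the other; residual connections keep the
    # original signal. Both updated streams are returned.
    img_out = image_tokens + cross_attention(image_tokens, mllm_tokens)
    txt_out = mllm_tokens + cross_attention(mllm_tokens, image_tokens)
    return img_out, txt_out

rng = np.random.default_rng(0)
img = rng.standard_normal((16, 64))  # 16 hypothetical image tokens, dim 64
txt = rng.standard_normal((8, 64))   # 8 hypothetical MLLM output tokens
img2, txt2 = bidirectional_interaction(img, txt)
print(img2.shape, txt2.shape)  # (16, 64) (8, 64)
```

The point of the two-way exchange is that text-side reasoning can be grounded in image content while image features are modulated by the instruction, rather than conditioning flowing in one direction only.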
Related papers
- BrushEdit: All-In-One Image Inpainting and Editing [79.55816192146762]
BrushEdit is a novel inpainting-based instruction-guided image editing paradigm.
We devise a system enabling free-form instruction editing by integrating MLLMs and a dual-branch image inpainting model.
Our framework effectively combines MLLMs and inpainting models, achieving superior performance across seven metrics.
arXiv Detail & Related papers (2024-12-13T17:58:06Z)
- Compositional Image Retrieval via Instruction-Aware Contrastive Learning [40.54022628032561]
Composed Image Retrieval (CIR) involves retrieving a target image based on a composed query of an image paired with text that specifies modifications or changes to the visual reference.
In practice, due to the scarcity of annotated data in downstream tasks, Zero-Shot CIR (ZS-CIR) is desirable.
We propose a novel embedding method utilizing an instruction-tuned Multimodal LLM (MLLM) to generate composed representation.
arXiv Detail & Related papers (2024-12-07T22:46:52Z)
- InsightEdit: Towards Better Instruction Following for Image Editing [12.683378605956024]
We focus on the task of instruction-based image editing. Previous works like InstructPix2Pix, InstructDiffusion, and SmartEdit have explored end-to-end editing.
We introduce a two-stream bridging mechanism that utilizes both the textual and visual features reasoned by powerful Multimodal Large Language Models (MLLMs).
Our approach, InsightEdit, achieves state-of-the-art performance, excelling in complex instruction following and maintaining high background consistency with the original image.
arXiv Detail & Related papers (2024-11-26T11:11:10Z)
- ReEdit: Multimodal Exemplar-Based Image Editing with Diffusion Models [11.830273909934688]
Modern Text-to-Image (T2I) Diffusion models have revolutionized image editing by enabling the generation of high-quality images.
We propose ReEdit, a modular and efficient end-to-end framework that captures edits in both text and image modalities.
Our results demonstrate that ReEdit consistently outperforms contemporary approaches both qualitatively and quantitatively.
arXiv Detail & Related papers (2024-11-06T15:19:24Z)
- A Survey of Multimodal-Guided Image Editing with Text-to-Image Diffusion Models [117.77807994397784]
Image editing aims to edit the given synthetic or real image to meet the specific requirements from users.
Recent significant advancement in this field is based on the development of text-to-image (T2I) diffusion models.
T2I-based image editing methods significantly enhance editing performance and offer a user-friendly interface for modifying content guided by multimodal inputs.
arXiv Detail & Related papers (2024-06-20T17:58:52Z)
- SEED-Data-Edit Technical Report: A Hybrid Dataset for Instructional Image Editing [53.00272278754867]
SEED-Data-Edit is a hybrid dataset for instruction-guided image editing. It combines three sources: high-quality editing data produced by an automated pipeline, real-world scenario data collected from the internet, and high-precision multi-turn editing data annotated by humans.
arXiv Detail & Related papers (2024-05-07T04:55:47Z)
- Guiding Instruction-based Image Editing via Multimodal Large Language Models [102.82211398699644]
Multimodal large language models (MLLMs) show promising capabilities in cross-modal understanding and visual-aware response generation.
We investigate how MLLMs facilitate edit instructions and present MLLM-Guided Image Editing (MGIE).
MGIE learns to derive expressive instructions and provides explicit guidance.
arXiv Detail & Related papers (2023-09-29T10:01:50Z)
- InstructPix2Pix: Learning to Follow Image Editing Instructions [103.77092910685764]
We propose a method for editing images from human instructions.
Given an input image and a written instruction that tells the model what to do, our model follows these instructions to edit the image.
We show compelling editing results for a diverse collection of input images and written instructions.
arXiv Detail & Related papers (2022-11-17T18:58:43Z)
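At inference time, InstructPix2Pix combines three denoiser predictions (unconditional, image-conditioned, and image-plus-instruction-conditioned) with separate image and text guidance scales. The sketch below shows that blending step in numpy with dummy arrays standing in for U-Net outputs; the function name, default scale values, and array shapes are illustrative assumptions, not the paper's code.

```python
import numpy as np

def combine_guidance(eps_uncond, eps_image, eps_full, s_image=1.5, s_text=7.5):
    # Dual classifier-free guidance: blend three noise predictions,
    # conditioned on (nothing), (image only), and (image + instruction).
    return (eps_uncond
            + s_image * (eps_image - eps_uncond)  # pull toward the input image
            + s_text * (eps_full - eps_image))    # pull toward the instruction

# Dummy noise predictions standing in for three denoiser forward passes.
rng = np.random.default_rng(0)
eps_u, eps_i, eps_f = (rng.standard_normal((4, 4)) for _ in range(3))
guided = combine_guidance(eps_u, eps_i, eps_f)
print(guided.shape)  # (4, 4)
```

Separating the two scales is what lets the user trade off faithfulness to the input image against strength of the edit instruction independently.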
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.