InstructEdit: Improving Automatic Masks for Diffusion-based Image
Editing With User Instructions
- URL: http://arxiv.org/abs/2305.18047v1
- Date: Mon, 29 May 2023 12:24:58 GMT
- Title: InstructEdit: Improving Automatic Masks for Diffusion-based Image
Editing With User Instructions
- Authors: Qian Wang, Biao Zhang, Michael Birsak, Peter Wonka
- Abstract summary: We propose a framework termed InstructEdit that can do fine-grained editing based on user instructions.
Our method outperforms previous editing methods in fine-grained editing applications.
- Score: 46.88926203020054
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent works have explored text-guided image editing using diffusion models
and generated edited images based on text prompts. However, the models struggle
to accurately locate the regions to be edited and faithfully perform precise
edits. In this work, we propose a framework termed InstructEdit that can do
fine-grained editing based on user instructions. Our proposed framework has
three components: language processor, segmenter, and image editor. The first
component, the language processor, processes the user instruction using a large
language model. The goal of this processing is to parse the user instruction
and output prompts for the segmenter and captions for the image editor. We
adopt ChatGPT and optionally BLIP2 for this step. The second component, the
segmenter, uses the segmentation prompt provided by the language processor. We
employ a state-of-the-art segmentation framework Grounded Segment Anything to
automatically generate a high-quality mask based on the segmentation prompt.
The third component, the image editor, uses the captions from the language
processor and the masks from the segmenter to compute the edited image. We
adopt Stable Diffusion and the mask-guided generation from DiffEdit for this
purpose. Experiments show that our method outperforms previous editing methods
in fine-grained editing applications where the input image contains a complex
object or multiple objects. We improve the mask quality over DiffEdit and thus
improve the quality of edited images. We also show that our framework can
accept multiple forms of user instructions as input. We provide the code at
https://github.com/QianWangX/InstructEdit.
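The three-stage pipeline described in the abstract can be summarized in code. The sketch below is a minimal illustration, not the authors' implementation: the `grounded_sam_mask` function is a hypothetical placeholder for Grounded Segment Anything, and the model names, prompts, and JSON schema are assumptions; only the OpenAI chat API and the diffusers inpainting pipeline are real library calls, and the latter stands in for DiffEdit's mask-guided generation.

```python
# Minimal sketch of the three-stage InstructEdit pipeline described above.
# NOT the authors' implementation: `grounded_sam_mask` is a hypothetical
# placeholder for Grounded Segment Anything, and the model names, prompts,
# and JSON schema are illustrative assumptions.
import json

import torch
from PIL import Image
from diffusers import StableDiffusionInpaintPipeline
from openai import OpenAI


def parse_instruction(instruction: str) -> dict:
    """Language processor: ask an LLM to split the user instruction into a
    segmentation prompt and a target caption for the image editor."""
    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
    reply = client.chat.completions.create(
        model="gpt-4o-mini",  # stand-in; the paper uses ChatGPT
        messages=[{
            "role": "user",
            "content": (
                f"Instruction: {instruction!r}. Respond with JSON containing "
                "'segment_prompt' (the object to segment) and "
                "'target_caption' (a caption describing the edited image)."
            ),
        }],
    )
    return json.loads(reply.choices[0].message.content)


def grounded_sam_mask(image: Image.Image, segment_prompt: str) -> Image.Image:
    """Segmenter (placeholder): in the paper this step is Grounded Segment
    Anything, which returns a binary mask for the region named by the prompt."""
    raise NotImplementedError("plug in Grounded Segment Anything here")


def edit_image(image: Image.Image, mask: Image.Image, caption: str) -> Image.Image:
    """Image editor: mask-guided generation with Stable Diffusion. A standard
    inpainting pipeline stands in for DiffEdit's mask-guided sampling."""
    pipe = StableDiffusionInpaintPipeline.from_pretrained(
        "runwayml/stable-diffusion-inpainting", torch_dtype=torch.float16
    ).to("cuda")
    return pipe(prompt=caption, image=image, mask_image=mask).images[0]


def instruct_edit(image_path: str, instruction: str) -> Image.Image:
    image = Image.open(image_path).convert("RGB").resize((512, 512))
    parts = parse_instruction(instruction)                     # 1. language processor
    mask = grounded_sam_mask(image, parts["segment_prompt"])   # 2. segmenter
    return edit_image(image, mask, parts["target_caption"])    # 3. image editor
```

As an illustrative usage example (not taken from the paper), an instruction such as "change the black dog into a cat" would yield a segmentation prompt like "black dog" and a target caption like "a photo of a cat"; the segmenter would then mask only the dog, and the editor would regenerate just the masked region.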
Related papers
- FireEdit: Fine-grained Instruction-based Image Editing via Region-aware Vision Language Model [54.693572837423226]
FireEdit is an innovative Fine-grained Instruction-based image editing framework that exploits a REgion-aware VLM.
FireEdit is designed to accurately comprehend user instructions and ensure effective control over the editing process.
Our approach surpasses the state-of-the-art instruction-based image editing methods.
arXiv Detail & Related papers (2025-03-25T16:59:42Z) - SGEdit: Bridging LLM with Text2Image Generative Model for Scene Graph-based Image Editing [42.23117201457898]
We introduce a new framework that integrates a large language model (LLM) with a Text2Image generative model for scene graph-based image editing.
Our framework significantly outperforms existing image editing methods in terms of editing precision and scene aesthetics.
arXiv Detail & Related papers (2024-10-15T17:40:48Z) - FreeEdit: Mask-free Reference-based Image Editing with Multi-modal Instruction [31.95664918050255]
FreeEdit is a novel approach for achieving reference-based image editing.
It can accurately reproduce the visual concept from the reference image based on user-friendly language instructions.
arXiv Detail & Related papers (2024-09-26T17:18:39Z) - InstructBrush: Learning Attention-based Instruction Optimization for Image Editing [54.07526261513434]
InstructBrush is an inversion method for instruction-based image editing methods.
It extracts editing effects from image pairs as editing instructions, which are further applied for image editing.
Our approach achieves superior performance in editing and is more semantically consistent with the target editing effects.
arXiv Detail & Related papers (2024-03-27T15:03:38Z) - An Item is Worth a Prompt: Versatile Image Editing with Disentangled Control [21.624984690721842]
D-Edit is a framework to disentangle the comprehensive image-prompt interaction into several item-prompt interactions.
It is based on pretrained diffusion models with cross-attention layers disentangled and adopts a two-step optimization to build item-prompt associations.
We demonstrate state-of-the-art results in four types of editing operations including image-based, text-based, mask-based editing, and item removal.
arXiv Detail & Related papers (2024-03-07T20:06:29Z) - Text-Driven Image Editing via Learnable Regions [74.45313434129005]
We introduce a method for region-based image editing driven by textual prompts, without the need for user-provided masks or sketches.
We show that this simple approach enables flexible editing that is compatible with current image generation models.
Experiments demonstrate the competitive performance of our method in manipulating images with high fidelity and realism that correspond to the provided language descriptions.
arXiv Detail & Related papers (2023-11-28T02:27:31Z) - Visual Instruction Inversion: Image Editing via Visual Prompting [34.96778567507126]
We present a method for image editing via visual prompting.
We leverage the rich, pretrained editing capabilities of text-to-image diffusion models by inverting visual prompts into editing instructions.
arXiv Detail & Related papers (2023-07-26T17:50:10Z) - Imagen Editor and EditBench: Advancing and Evaluating Text-Guided Image
Inpainting [53.708523312636096]
We present Imagen Editor, a cascaded diffusion model built by fine-tuning on text-guided image inpainting.
Its edits are faithful to the text prompts, which is accomplished by using object detectors to propose inpainting masks during training.
To improve qualitative and quantitative evaluation, we introduce EditBench, a systematic benchmark for text-guided image inpainting.
arXiv Detail & Related papers (2022-12-13T21:25:11Z) - DiffEdit: Diffusion-based semantic image editing with mask guidance [64.555930158319]
DiffEdit is a method to take advantage of text-conditioned diffusion models for the task of semantic image editing.
Its main contribution is the ability to automatically generate a mask highlighting the regions of the input image that need to be edited (a minimal sketch of this mask-estimation idea follows the list below).
arXiv Detail & Related papers (2022-10-20T17:16:37Z) - EditGAN: High-Precision Semantic Image Editing [120.49401527771067]
EditGAN is a novel method for high quality, high precision semantic image editing.
We show that EditGAN can manipulate images with an unprecedented level of detail and freedom.
We can also easily combine multiple edits and perform plausible edits beyond EditGAN training data.
arXiv Detail & Related papers (2021-11-04T22:36:33Z)
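Because InstructEdit improves on DiffEdit's automatically generated masks, the following sketch illustrates the mask-estimation idea behind DiffEdit: contrast the noise predictions a text-conditioned diffusion model makes under a source caption and a target caption, then threshold the averaged difference. This is an assumed reconstruction of the technique, not the authors' code; the prompts, noise level, and threshold are illustrative, and a diffusers StableDiffusionPipeline is assumed as the backbone.

```python
# Illustrative sketch of DiffEdit-style automatic mask estimation (assumed
# reconstruction, not the authors' code): contrast the noise predictions a
# text-conditioned diffusion model makes under a source and a target caption,
# then threshold the averaged difference into a binary mask.
import torch


@torch.no_grad()
def estimate_edit_mask(pipe, latents, source_prompt, target_prompt,
                       n_samples=10, t_frac=0.5, threshold=0.5):
    """pipe: a diffusers StableDiffusionPipeline; latents: VAE-encoded image
    latents of shape (1, 4, 64, 64). Returns a latent-resolution binary mask."""
    device = latents.device

    def encode(prompt):
        tokens = pipe.tokenizer(
            prompt, padding="max_length",
            max_length=pipe.tokenizer.model_max_length,
            truncation=True, return_tensors="pt",
        )
        return pipe.text_encoder(tokens.input_ids.to(device))[0]

    src_emb, tgt_emb = encode(source_prompt), encode(target_prompt)
    # Noise the latents to an intermediate timestep and compare predictions.
    t = torch.tensor(int(t_frac * pipe.scheduler.config.num_train_timesteps),
                     device=device)
    diffs = []
    for _ in range(n_samples):
        noise = torch.randn_like(latents)
        noisy = pipe.scheduler.add_noise(latents, noise, t)
        eps_src = pipe.unet(noisy, t, encoder_hidden_states=src_emb).sample
        eps_tgt = pipe.unet(noisy, t, encoder_hidden_states=tgt_emb).sample
        diffs.append((eps_src - eps_tgt).abs().mean(dim=1))  # average channels
    diff = torch.stack(diffs).mean(dim=0)
    diff = (diff - diff.min()) / (diff.max() - diff.min() + 1e-8)  # normalize to [0, 1]
    return (diff > threshold).float()
```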
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.