Specify and Edit: Overcoming Ambiguity in Text-Based Image Editing
- URL: http://arxiv.org/abs/2407.20232v1
- Date: Mon, 29 Jul 2024 17:59:57 GMT
- Title: Specify and Edit: Overcoming Ambiguity in Text-Based Image Editing
- Authors: Ekaterina Iakovleva, Fabio Pizzati, Philip Torr, Stéphane Lathuilière
- Abstract summary: We propose a zero-shot inference pipeline for diffusion-based editing systems.
We use a large language model (LLM) to decompose the input instruction into specific instructions.
Our pipeline improves the interpretability of editing models and boosts output diversity.
- Score: 24.316956641791034
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Text-based editing diffusion models exhibit limited performance when the user's input instruction is ambiguous. To solve this problem, we propose $\textit{Specify ANd Edit}$ (SANE), a zero-shot inference pipeline for diffusion-based editing systems. We use a large language model (LLM) to decompose the input instruction into specific instructions, i.e. well-defined interventions to apply to the input image to satisfy the user's request. We benefit from the LLM-derived instructions along with the original one, thanks to a novel denoising guidance strategy specifically designed for the task. Our experiments with three baselines on two datasets demonstrate the benefits of SANE in all setups. Moreover, our pipeline improves the interpretability of editing models and boosts output diversity. We also demonstrate that our approach can be applied to any edit, whether ambiguous or not. Our code is publicly available at https://github.com/fabvio/SANE.
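The core idea, decomposing an ambiguous instruction with an LLM and injecting the resulting specific instructions into the denoising guidance, can be illustrated with a minimal sketch. This is a hypothetical, simplified illustration rather than the authors' released implementation: `llm_decompose`, `editing_model`, and the guidance weights `w_orig`/`w_sub` are placeholder assumptions, and the editor is assumed to expose InstructPix2Pix-style conditional noise predictions.

```python
# Minimal sketch (assumption, not the SANE reference code): an LLM rewrites an
# ambiguous edit instruction into specific sub-instructions, and their
# classifier-free-guidance terms are averaged with the original instruction's
# term at each denoising step.
import torch


def llm_decompose(instruction: str) -> list[str]:
    # Placeholder: in practice, prompt an LLM to rewrite the ambiguous
    # instruction (e.g. "make it cozier") as well-defined edits
    # (e.g. "add a fireplace", "warm up the lighting").
    raise NotImplementedError


@torch.no_grad()
def guided_noise_estimate(editing_model, latents, t, image_cond,
                          instruction, sub_instructions,
                          w_orig=7.5, w_sub=7.5):
    # Unconditional and instruction-conditional noise estimates
    # (InstructPix2Pix-style editor assumed as a callable).
    eps_uncond = editing_model(latents, t, image_cond, text="")
    eps_orig = editing_model(latents, t, image_cond, text=instruction)

    # Guidance term for the original (possibly ambiguous) instruction.
    eps = eps_uncond + w_orig * (eps_orig - eps_uncond)

    # Add averaged guidance from the LLM-derived specific instructions.
    if sub_instructions:
        sub_terms = [
            editing_model(latents, t, image_cond, text=s) - eps_uncond
            for s in sub_instructions
        ]
        eps = eps + w_sub * torch.stack(sub_terms).mean(dim=0)
    return eps
```

Because the sub-instructions enter only through extra guidance terms at inference time, the sketch stays zero-shot: no fine-tuning of the editing model is assumed.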
Related papers
- Disentangling Instruction Influence in Diffusion Transformers for Parallel Multi-Instruction-Guided Image Editing [26.02149948089938]
Instruction Influence Disentanglement (IID) is a novel framework enabling parallel execution of multiple instructions in a single denoising process.
We analyze self-attention mechanisms in DiTs and derive instruction-specific attention masks to disentangle each instruction's influence.
IID reduces diffusion steps while improving fidelity and instruction completion compared to existing baselines.
arXiv Detail & Related papers (2025-04-07T07:26:25Z) - FireEdit: Fine-grained Instruction-based Image Editing via Region-aware Vision Language Model [54.693572837423226]
FireEdit is an innovative Fine-grained Instruction-based image editing framework that exploits a REgion-aware VLM.
FireEdit is designed to accurately comprehend user instructions and ensure effective control over the editing process.
Our approach surpasses the state-of-the-art instruction-based image editing methods.
arXiv Detail & Related papers (2025-03-25T16:59:42Z) - A Survey of Multimodal-Guided Image Editing with Text-to-Image Diffusion Models [117.77807994397784]
Image editing aims to edit the given synthetic or real image to meet the specific requirements from users.
Recent significant advances in this field are based on the development of text-to-image (T2I) diffusion models.
T2I-based image editing methods significantly enhance editing performance and offer a user-friendly interface for modifying content guided by multimodal inputs.
arXiv Detail & Related papers (2024-06-20T17:58:52Z) - ZONE: Zero-Shot Instruction-Guided Local Editing [56.56213730578504]
We propose a Zero-shot instructiON-guided local image Editing approach, termed ZONE.
We first convert the editing intent from the user-provided instruction into specific image editing regions through InstructPix2Pix.
We then propose a Region-IoU scheme for precise image layer extraction from an off-the-shelf segmentation model.
arXiv Detail & Related papers (2023-12-28T02:54:34Z) - InstructAny2Pix: Flexible Visual Editing via Multimodal Instruction Following [26.457571615782985]
InstructAny2Pix is a flexible multi-modal instruction-following system that enables users to edit an input image using instructions involving audio, images, and text.
We demonstrate that our system can perform a series of novel instruction-guided editing tasks.
arXiv Detail & Related papers (2023-12-11T17:53:45Z) - From Language Modeling to Instruction Following: Understanding the Behavior Shift in LLMs after Instruction Tuning [63.63840740526497]
We investigate how instruction tuning adjusts pre-trained models with a focus on intrinsic changes.
The impact of instruction tuning is then studied by comparing the explanations derived from the pre-trained and instruction-tuned models.
Our findings reveal three significant impacts of instruction tuning.
arXiv Detail & Related papers (2023-09-30T21:16:05Z) - XATU: A Fine-grained Instruction-based Benchmark for Explainable Text Updates [7.660511135287692]
This paper introduces XATU, the first benchmark specifically designed for fine-grained instruction-based explainable text editing.
XATU considers finer-grained text editing tasks of varying difficulty, incorporating lexical, syntactic, semantic, and knowledge-intensive edit aspects.
We demonstrate the effectiveness of instruction tuning and the impact of underlying architecture across various editing tasks.
arXiv Detail & Related papers (2023-09-20T04:58:59Z) - InstructEdit: Improving Automatic Masks for Diffusion-based Image Editing With User Instructions [46.88926203020054]
We propose a framework termed InstructEdit that performs fine-grained editing based on user instructions.
Our method outperforms previous editing methods in fine-grained editing applications.
arXiv Detail & Related papers (2023-05-29T12:24:58Z) - StyleDiffusion: Prompt-Embedding Inversion for Text-Based Editing [115.49488548588305]
A significant research effort is focused on exploiting the capabilities of pretrained diffusion models for image editing.
They either finetune the model, or invert the image in the latent space of the pretrained model.
They suffer from two problems: unsatisfactory results for selected regions and unexpected changes in non-selected regions.
arXiv Detail & Related papers (2023-03-28T00:16:45Z) - SKED: Sketch-guided Text-based 3D Editing [49.019881133348775]
We present SKED, a technique for editing 3D shapes represented by NeRFs.
Our technique utilizes as few as two guiding sketches from different views to alter an existing neural field.
We propose novel loss functions to generate the desired edits while preserving the density and radiance of the base instance.
arXiv Detail & Related papers (2023-03-19T18:40:44Z) - Zero-shot Image-to-Image Translation [57.46189236379433]
We propose pix2pix-zero, an image-to-image translation method that can preserve the original image without manual prompting.
We propose cross-attention guidance, which aims to retain the cross-attention maps of the input image throughout the diffusion process.
Our method does not need additional training for these edits and can directly use the existing text-to-image diffusion model.
arXiv Detail & Related papers (2023-02-06T18:59:51Z) - DiffEdit: Diffusion-based semantic image editing with mask guidance [64.555930158319]
DiffEdit is a method to take advantage of text-conditioned diffusion models for the task of semantic image editing.
Our main contribution is the ability to automatically generate a mask highlighting the regions of the input image that need to be edited.
arXiv Detail & Related papers (2022-10-20T17:16:37Z)