Related papers: FreeEdit: Mask-free Reference-based Image Editing with Multi-modal Instruction

FreeEdit: Mask-free Reference-based Image Editing with Multi-modal Instruction

URL: http://arxiv.org/abs/2409.18071v1
Date: Thu, 26 Sep 2024 17:18:39 GMT
Title: FreeEdit: Mask-free Reference-based Image Editing with Multi-modal Instruction
Authors: Runze He, Kai Ma, Linjiang Huang, Shaofei Huang, Jialin Gao, Xiaoming Wei, Jiao Dai, Jizhong Han, Si Liu,
Abstract summary: FreeEdit is a novel approach for achieving reference-based image editing. It can accurately reproduce the visual concept from the reference image based on user-friendly language instructions.
Score: 31.95664918050255
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Introducing user-specified visual concepts in image editing is highly practical as these concepts convey the user's intent more precisely than text-based descriptions. We propose FreeEdit, a novel approach for achieving such reference-based image editing, which can accurately reproduce the visual concept from the reference image based on user-friendly language instructions. Our approach leverages the multi-modal instruction encoder to encode language instructions to guide the editing process. This implicit way of locating the editing area eliminates the need for manual editing masks. To enhance the reconstruction of reference details, we introduce the Decoupled Residual ReferAttention (DRRA) module. This module is designed to integrate fine-grained reference features extracted by a detail extractor into the image editing process in a residual way without interfering with the original self-attention. Given that existing datasets are unsuitable for reference-based image editing tasks, particularly due to the difficulty in constructing image triplets that include a reference image, we curate a high-quality dataset, FreeBench, using a newly developed twice-repainting scheme. FreeBench comprises the images before and after editing, detailed editing instructions, as well as a reference image that maintains the identity of the edited object, encompassing tasks such as object addition, replacement, and deletion. By conducting phased training on FreeBench followed by quality tuning, FreeEdit achieves high-quality zero-shot editing through convenient language instructions. We conduct extensive experiments to evaluate the effectiveness of FreeEdit across multiple task types, demonstrating its superiority over existing methods. The code will be available at: https://freeedit.github.io/.

Related papers

SliderEdit: Continuous Image Editing with Fine-Grained Instruction Control [50.76070785417023]
We introduce SliderEdit, a framework for continuous image editing with fine-grained, interpretable instruction control.<n>Given a multi-part edit instruction, SliderEdit disentangles the individual instructions and exposes each as a globally trained slider.<n>Our results pave the way for interactive, instruction-driven image manipulation with continuous and compositional control.
arXiv Detail & Related papers (2025-11-12T20:21:37Z)
Understanding the Implicit User Intention via Reasoning with Large Language Model for Image Editing [23.69189799564107]
Existing image editing methods can handle simple editing instructions very well.<n>To deal with complex editing instructions, they often need to jointly fine-tune the large language models (LLMs) and diffusion models (DMs)<n>We propose a new method, called textbfComplex textbfImage textbfEditing via textbfLLM textbfReasoning (CIELR), which converts a complex user instruction into a set of simple and explicit editing actions.
arXiv Detail & Related papers (2025-10-31T10:06:28Z)
Kontinuous Kontext: Continuous Strength Control for Instruction-based Image Editing [76.44219733285898]
Kontinuous Kontext is an instruction-driven editing model that provides a new dimension of control over edit strength.<n>A lightweight projector network maps the input scalar and the edit instruction to coefficients in the model's modulation space.<n>For training our model, we synthesize a diverse dataset of image-edit-instruction-strength quadruplets using existing generative models.
arXiv Detail & Related papers (2025-10-09T17:51:03Z)
CAMILA: Context-Aware Masking for Image Editing with Language Alignment [19.448726702919416]
We propose a context-aware method for image editing named as CAMILA.<n> CAMILA is designed to validate the contextual coherence between instructions and the image.<n>Our method achieves better performance and higher semantic alignment than state-of-the-art models.
arXiv Detail & Related papers (2025-09-24T03:20:44Z)
Describe, Don't Dictate: Semantic Image Editing with Natural Language Intent [38.61468007698179]
We propose a descriptive-prompt-based editing framework, named DescriptiveEdit.<n>The core idea is to re-frame instruction-based image editing' as reference-image-based text-to-image generation'
arXiv Detail & Related papers (2025-08-28T07:45:08Z)
Concept Lancet: Image Editing with Compositional Representation Transplant [58.9421919837084]
Concept Lancet is a zero-shot plug-and-play framework for principled representation manipulation in image editing. We decompose the source input in the latent (text embedding or diffusion score) space as a sparse linear combination of the representations of the collected visual concepts. We perform a customized concept transplant process to impose the corresponding editing direction.
arXiv Detail & Related papers (2025-04-03T17:59:58Z)
FireEdit: Fine-grained Instruction-based Image Editing via Region-aware Vision Language Model [54.693572837423226]
FireEdit is an innovative Fine-grained Instruction-based image editing framework that exploits a REgion-aware VLM. FireEdit is designed to accurately comprehend user instructions and ensure effective control over the editing process. Our approach surpasses the state-of-the-art instruction-based image editing methods.
arXiv Detail & Related papers (2025-03-25T16:59:42Z)
BrushEdit: All-In-One Image Inpainting and Editing [79.55816192146762]
BrushEdit is a novel inpainting-based instruction-guided image editing paradigm. We devise a system enabling free-form instruction editing by integrating MLLMs and a dual-branch image inpainting model. Our framework effectively combines MLLMs and inpainting models, achieving superior performance across seven metrics.
arXiv Detail & Related papers (2024-12-13T17:58:06Z)
AnyEdit: Mastering Unified High-Quality Image Editing for Any Idea [88.79769371584491]
We present AnyEdit, a comprehensive multi-modal instruction editing dataset. We ensure the diversity and quality of the AnyEdit collection through three aspects: initial data diversity, adaptive editing process, and automated selection of editing results. Experiments on three benchmark datasets show that AnyEdit consistently boosts the performance of diffusion-based editing models.
arXiv Detail & Related papers (2024-11-24T07:02:56Z)
ReEdit: Multimodal Exemplar-Based Image Editing with Diffusion Models [11.830273909934688]
Modern Text-to-Image (T2I) Diffusion models have revolutionized image editing by enabling the generation of high-quality images. We propose ReEdit, a modular and efficient end-to-end framework that captures edits in both text and image modalities. Our results demonstrate that ReEdit consistently outperforms contemporary approaches both qualitatively and quantitatively.
arXiv Detail & Related papers (2024-11-06T15:19:24Z)
Zero-shot Image Editing with Reference Imitation [50.75310094611476]
We present a new form of editing, termed imitative editing, to help users exercise their creativity more conveniently. We propose a generative training framework, dubbed MimicBrush, which randomly selects two frames from a video clip, masks some regions of one frame, and learns to recover the masked regions using the information from the other frame. We experimentally show the effectiveness of our method under various test cases as well as its superiority over existing alternatives.
arXiv Detail & Related papers (2024-06-11T17:59:51Z)
InstructBrush: Learning Attention-based Instruction Optimization for Image Editing [54.07526261513434]
InstructBrush is an inversion method for instruction-based image editing methods. It extracts editing effects from image pairs as editing instructions, which are further applied for image editing. Our approach achieves superior performance in editing and is more semantically consistent with the target editing effects.
arXiv Detail & Related papers (2024-03-27T15:03:38Z)
An Item is Worth a Prompt: Versatile Image Editing with Disentangled Control [21.624984690721842]
D-Edit is a framework to disentangle the comprehensive image-prompt interaction into several item-prompt interactions. It is based on pretrained diffusion models with cross-attention layers disentangled and adopts a two-step optimization to build item-prompt associations. We demonstrate state-of-the-art results in four types of editing operations including image-based, text-based, mask-based editing, and item removal.
arXiv Detail & Related papers (2024-03-07T20:06:29Z)
SpecRef: A Fast Training-free Baseline of Specific Reference-Condition Real Image Editing [0.0]
We propose a new task called Specific Reference Condition Real Image Editing. It allows user to provide a reference image to further control the outcome, such as replacing an object with a particular one. Specifically, we design a Specific Reference Attention Controller to incorporate features from the reference image, and adopt a mask mechanism to prevent interference between editing and non-editing regions.
arXiv Detail & Related papers (2024-01-07T09:23:06Z)
Optimisation-Based Multi-Modal Semantic Image Editing [58.496064583110694]
We propose an inference-time editing optimisation to accommodate multiple editing instruction types. By allowing to adjust the influence of each loss function, we build a flexible editing solution that can be adjusted to user preferences. We evaluate our method using text, pose and scribble edit conditions, and highlight our ability to achieve complex edits.
arXiv Detail & Related papers (2023-11-28T15:31:11Z)
Emu Edit: Precise Image Editing via Recognition and Generation Tasks [62.95717180730946]
We present Emu Edit, a multi-task image editing model which sets state-of-the-art results in instruction-based image editing. We train it to multi-task across an unprecedented range of tasks, such as region-based editing, free-form editing, and Computer Vision tasks. We show that Emu Edit can generalize to new tasks, such as image inpainting, super-resolution, and compositions of editing tasks, with just a few labeled examples.
arXiv Detail & Related papers (2023-11-16T18:55:58Z)
EditGAN: High-Precision Semantic Image Editing [120.49401527771067]
EditGAN is a novel method for high quality, high precision semantic image editing. We show that EditGAN can manipulate images with an unprecedented level of detail and freedom. We can also easily combine multiple edits and perform plausible edits beyond EditGAN training data.
arXiv Detail & Related papers (2021-11-04T22:36:33Z)

This list is automatically generated from the titles and abstracts of the papers in this site.