iEdit: Localised Text-guided Image Editing with Weak Supervision
- URL: http://arxiv.org/abs/2305.05947v1
- Date: Wed, 10 May 2023 07:39:14 GMT
- Title: iEdit: Localised Text-guided Image Editing with Weak Supervision
- Authors: Rumeysa Bodur, Erhan Gundogdu, Binod Bhattarai, Tae-Kyun Kim, Michael
Donoser, Loris Bazzani
- Abstract summary: We propose a novel learning method for text-guided image editing.
It generates images conditioned on a source image and a textual edit prompt.
- It shows favourable results against its counterparts in terms of image fidelity and CLIP alignment score, and qualitatively for editing both generated and real images.
- Score: 53.082196061014734
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Diffusion models (DMs) can generate realistic images with text guidance using
large-scale datasets. However, they demonstrate limited controllability in the
output space of the generated images. We propose a novel learning method for
text-guided image editing, namely \texttt{iEdit}, that generates images
conditioned on a source image and a textual edit prompt. As a fully-annotated
dataset with target images does not exist, previous approaches perform
subject-specific fine-tuning at test time or adopt contrastive learning without
a target image, leading to issues on preserving the fidelity of the source
image. We propose to automatically construct a dataset derived from LAION-5B,
containing pseudo-target images with their descriptive edit prompts given input
image-caption pairs. This dataset gives us the flexibility of introducing a
weakly-supervised loss function to generate the pseudo-target image from the
latent noise of the source image conditioned on the edit prompt. To encourage
localised editing and preserve or modify spatial structures in the image, we
propose a loss function that uses segmentation masks to guide the editing
during training and optionally at inference. Our model is trained on the
constructed dataset with 200K samples and constrained GPU resources. It shows
favourable results against its counterparts in terms of image fidelity and CLIP
alignment score, and qualitatively for editing both generated and real images.
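
Below is a minimal PyTorch sketch of the kind of weakly-supervised, mask-guided objective the abstract describes: the source latents are noised, the model denoises them conditioned on the edit prompt, and the reconstruction is pulled towards the pseudo-target latents, with the segmentation mask up-weighting the edited region. It assumes a diffusers-style UNet/scheduler interface; all names, the x0-reconstruction form of the loss, and the weighting scheme are illustrative assumptions rather than the authors' implementation.

```python
import torch
import torch.nn.functional as F


def mask_guided_edit_loss(unet, scheduler, source_latents, target_latents,
                          edit_prompt_embeds, edit_mask, lambda_edit=2.0):
    """Weakly-supervised, mask-guided editing loss (illustrative sketch).

    source_latents / target_latents: VAE latents of the source and pseudo-target
    images, shape (B, C, h, w). edit_prompt_embeds: text-encoder states for the
    edit prompt, (B, L, D). edit_mask: (B, 1, H, W) mask of the region to edit.
    """
    b, device = source_latents.shape[0], source_latents.device

    # Noise the *source* latents at a random training timestep.
    t = torch.randint(0, scheduler.config.num_train_timesteps, (b,), device=device)
    noise = torch.randn_like(source_latents)
    noisy_latents = scheduler.add_noise(source_latents, noise, t)

    # Predict the noise conditioned on the edit prompt.
    noise_pred = unet(noisy_latents, t, encoder_hidden_states=edit_prompt_embeds).sample

    # Convert the epsilon prediction into an estimate of the clean latent
    # (standard DDPM reparameterisation), i.e. what the model would reconstruct.
    abar = scheduler.alphas_cumprod.to(device)[t].view(-1, 1, 1, 1)
    x0_hat = (noisy_latents - (1.0 - abar).sqrt() * noise_pred) / abar.sqrt()

    # Per-pixel reconstruction error towards the pseudo-target latents.
    per_pixel = F.mse_loss(x0_hat, target_latents, reduction="none")

    # Resize the segmentation mask to latent resolution and up-weight the
    # edited region, encouraging localised changes while preserving the rest.
    mask = F.interpolate(edit_mask, size=per_pixel.shape[-2:], mode="nearest")
    weights = 1.0 + (lambda_edit - 1.0) * mask
    return (weights * per_pixel).mean()
```

At inference time, the same mask could optionally be used to restrict updates to the edited region, mirroring the optional mask guidance mentioned in the abstract.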
Related papers
- DM-Align: Leveraging the Power of Natural Language Instructions to Make Changes to Images [55.546024767130994]
We propose a novel model to enhance the text-based control of an image editor by explicitly reasoning about which parts of the image to alter or preserve.
It relies on word alignments between a description of the original source image and the instruction that reflects the needed updates, as well as on the input image.
It is evaluated on a subset of the Bison dataset and a self-defined dataset dubbed Dream.
arXiv Detail & Related papers (2024-04-27T22:45:47Z) - Dynamic Prompt Learning: Addressing Cross-Attention Leakage for
Text-Based Image Editing [23.00202969969574]
We propose Dynamic Prompt Learning (DPL) to force cross-attention maps to focus on correct noun words in the text prompt.
We show improved prompt editing results for Word-Swap, Prompt Refinement, and Attention Re-weighting, especially for complex multi-object scenes.
arXiv Detail & Related papers (2023-09-27T13:55:57Z) - Text-to-image Editing by Image Information Removal [19.464349486031566]
We propose a text-to-image editing model with an Image Information Removal module (IIR) that selectively erases color-related and texture-related information from the original image.
Our experiments on CUB, Outdoor Scenes, and COCO show that our edited images are preferred 35% more often than prior work.
arXiv Detail & Related papers (2023-05-27T14:48:05Z) - Imagen Editor and EditBench: Advancing and Evaluating Text-Guided Image
Inpainting [53.708523312636096]
We present Imagen Editor, a cascaded diffusion model built by fine-tuning Imagen on text-guided image inpainting.
Imagen Editor's edits are faithful to the text prompts, which is accomplished by using object detectors to propose inpainting masks during training.
To improve qualitative and quantitative evaluation, we introduce EditBench, a systematic benchmark for text-guided image inpainting.
arXiv Detail & Related papers (2022-12-13T21:25:11Z) - ClipCrop: Conditioned Cropping Driven by Vision-Language Model [90.95403416150724]
We take advantage of vision-language models as a foundation for creating robust and user-intentional cropping algorithms.
We develop a method to perform cropping with a text or image query that reflects the user's intention as guidance.
Our pipeline design allows the model to learn text-conditioned aesthetic cropping with a small dataset.
arXiv Detail & Related papers (2022-11-21T14:27:07Z) - DiffEdit: Diffusion-based semantic image editing with mask guidance [64.555930158319]
DiffEdit is a method to take advantage of text-conditioned diffusion models for the task of semantic image editing.
Its main contribution is the ability to automatically generate a mask highlighting the regions of the input image that need to be edited (an illustrative sketch of this kind of mask estimation appears after this list).
arXiv Detail & Related papers (2022-10-20T17:16:37Z) - Blended Diffusion for Text-driven Editing of Natural Images [18.664733153082146]
We introduce the first solution for performing local (region-based) edits in generic natural images.
We achieve our goal by leveraging and combining a pretrained language-image model (CLIP) with a denoising diffusion probabilistic model (DDPM).
To seamlessly fuse the edited region with the unchanged parts of the image, we spatially blend noised versions of the input image with the local text-guided diffusion latent (an illustrative sketch of this blending step also appears after the list).
arXiv Detail & Related papers (2021-11-29T18:58:49Z) - Semantic Image Manipulation Using Scene Graphs [105.03614132953285]
We introduce a semantic scene graph network that does not require direct supervision for constellation changes or image edits.
This makes it possible to train the system from existing real-world datasets with no additional annotation effort.
arXiv Detail & Related papers (2020-04-07T20:02:49Z)
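
As noted in the DiffEdit entry above, the edit mask can be estimated automatically by contrasting the diffusion model's noise predictions under the original caption and the edit query. The sketch below illustrates that idea under an assumed diffusers-style interface; the function name, number of noise samples, timestep fraction, and threshold are illustrative choices, not the authors' implementation.

```python
import torch


@torch.no_grad()
def estimate_edit_mask(unet, scheduler, image_latents, ref_embeds, query_embeds,
                       n_samples=10, t_frac=0.5, threshold=0.5):
    """Estimate which regions an edit should touch by contrasting noise
    predictions under the reference caption vs. the edit query (illustrative)."""
    t = torch.full((image_latents.shape[0],),
                   int(t_frac * scheduler.config.num_train_timesteps),
                   device=image_latents.device, dtype=torch.long)
    diffs = []
    for _ in range(n_samples):
        noise = torch.randn_like(image_latents)
        noisy = scheduler.add_noise(image_latents, noise, t)
        eps_ref = unet(noisy, t, encoder_hidden_states=ref_embeds).sample
        eps_query = unet(noisy, t, encoder_hidden_states=query_embeds).sample
        # Regions where the two conditionings disagree are likely to change.
        diffs.append((eps_ref - eps_query).abs().mean(dim=1, keepdim=True))
    diff = torch.stack(diffs).mean(dim=0)
    diff = (diff - diff.min()) / (diff.max() - diff.min() + 1e-8)  # normalise to [0, 1]
    return (diff > threshold).float()  # binary mask at latent resolution
```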
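
Similarly, the Blended Diffusion entry describes spatially blending noised versions of the input image with the text-guided intermediate at each denoising step. A minimal sketch of that blending step, with assumed names and a diffusers-style scheduler, could look like:

```python
import torch


def blend_step(edited_xt, source_image, mask, scheduler, t):
    """Keep the text-guided result inside the mask and re-inject a noised copy
    of the source everywhere else, so unmasked content is preserved
    (illustrative sketch; names and interface are assumptions)."""
    noise = torch.randn_like(source_image)
    noisy_source = scheduler.add_noise(source_image, noise, t)
    return mask * edited_xt + (1.0 - mask) * noisy_source
```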