Related papers: Text Guided Image Editing with Automatic Concept Locating and Forgetting

Text Guided Image Editing with Automatic Concept Locating and Forgetting

URL: http://arxiv.org/abs/2405.19708v1
Date: Thu, 30 May 2024 05:36:32 GMT
Title: Text Guided Image Editing with Automatic Concept Locating and Forgetting
Authors: Jia Li, Lijie Hu, Zhixian He, Jingfeng Zhang, Tianhang Zheng, Di Wang,
Abstract summary: We propose a novel method called Locate and Forget (LaF) to locate potential target concepts in the image for modification. Compared to the baselines, our method demonstrates its superiority in text-guided image editing tasks both qualitatively and quantitatively.
Score: 27.70615803908037
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: With the advancement of image-to-image diffusion models guided by text, significant progress has been made in image editing. However, a persistent challenge remains in seamlessly incorporating objects into images based on textual instructions, without relying on extra user-provided guidance. Text and images are inherently distinct modalities, bringing out difficulties in fully capturing the semantic intent conveyed through language and accurately translating that into the desired visual modifications. Therefore, text-guided image editing models often produce generations with residual object attributes that do not fully align with human expectations. To address this challenge, the models should comprehend the image content effectively away from a disconnect between the provided textual editing prompts and the actual modifications made to the image. In our paper, we propose a novel method called Locate and Forget (LaF), which effectively locates potential target concepts in the image for modification by comparing the syntactic trees of the target prompt and scene descriptions in the input image, intending to forget their existence clues in the generated image. Compared to the baselines, our method demonstrates its superiority in text-guided image editing tasks both qualitatively and quantitatively.

Related papers

Concept Lancet: Image Editing with Compositional Representation Transplant [58.9421919837084]
Concept Lancet is a zero-shot plug-and-play framework for principled representation manipulation in image editing. We decompose the source input in the latent (text embedding or diffusion score) space as a sparse linear combination of the representations of the collected visual concepts. We perform a customized concept transplant process to impose the corresponding editing direction.
arXiv Detail & Related papers (2025-04-03T17:59:58Z)
Contrastive Learning Guided Latent Diffusion Model for Image-to-Image Translation [7.218556478126324]
diffusion model has demonstrated superior performance in diverse and high-quality images for text-guided image translation. We propose pix2pix-zeroCon, a zero-shot diffusion-based method that eliminates the need for additional training by leveraging patch-wise contrastive loss. Our approach requires no additional training and operates directly on a pre-trained text-to-image diffusion model.
arXiv Detail & Related papers (2025-03-26T12:15:25Z)
Prompt Augmentation for Self-supervised Text-guided Image Manipulation [34.01939157351624]
We introduce prompt augmentation, a method amplifying a single input prompt into several target prompts, strengthening textual context and enabling localised image editing. We propose a Contrastive Loss tailored to driving effective image editing by displacing edited areas and drawing preserved regions closer. New losses are incorporated to the diffusion model, demonstrating improved or competitive image editing results on public datasets and generated images over state-of-the-art approaches.
arXiv Detail & Related papers (2024-12-17T16:54:05Z)
DM-Align: Leveraging the Power of Natural Language Instructions to Make Changes to Images [55.546024767130994]
We propose a novel model to enhance the text-based control of an image editor by explicitly reasoning about which parts of the image to alter or preserve. It relies on word alignments between a description of the original source image and the instruction that reflects the needed updates, and the input image. It is evaluated on a subset of the Bison dataset and a self-defined dataset dubbed Dream.
arXiv Detail & Related papers (2024-04-27T22:45:47Z)
Training-Free Consistent Text-to-Image Generation [80.4814768762066]
Text-to-image models can portray the same subject across diverse prompts. Existing approaches fine-tune the model to teach it new words that describe specific user-provided subjects. We present ConsiStory, a training-free approach that enables consistent subject generation by sharing the internal activations of the pretrained model.
arXiv Detail & Related papers (2024-02-05T18:42:34Z)
AdapEdit: Spatio-Temporal Guided Adaptive Editing Algorithm for Text-Based Continuity-Sensitive Image Editing [24.9487669818162]
We propose atemporal guided adaptive editing algorithm AdapEdit, which realizes adaptive image editing. Our approach has a significant advantage in preserving model priors and does not require model training, fine-tuning extra data, or optimization. We present our results over a wide variety of raw images and editing instructions, demonstrating competitive performance and showing it significantly outperforms the previous approaches.
arXiv Detail & Related papers (2023-12-13T09:45:58Z)
Dynamic Prompt Learning: Addressing Cross-Attention Leakage for Text-Based Image Editing [23.00202969969574]
We propose Dynamic Prompt Learning (DPL) to force cross-attention maps to focus on correct noun words in the text prompt. We show improved prompt editing results for Word-Swap, Prompt Refinement, and Attention Re-weighting, especially for complex multi-object scenes.
arXiv Detail & Related papers (2023-09-27T13:55:57Z)
Text-guided Image Restoration and Semantic Enhancement for Text-to-Image Person Retrieval [11.798006331912056]
The goal of Text-to-Image Person Retrieval (TIPR) is to retrieve specific person images according to the given textual descriptions. We propose a novel TIPR framework to build fine-grained interactions and alignment between person images and the corresponding texts.
arXiv Detail & Related papers (2023-07-18T08:23:46Z)
iEdit: Localised Text-guided Image Editing with Weak Supervision [53.082196061014734]
We propose a novel learning method for text-guided image editing. It generates images conditioned on a source image and a textual edit prompt. It shows favourable results against its counterparts in terms of image fidelity, CLIP alignment score and qualitatively for editing both generated and real images.
arXiv Detail & Related papers (2023-05-10T07:39:14Z)
Zero-shot Image-to-Image Translation [57.46189236379433]
We propose pix2pix-zero, an image-to-image translation method that can preserve the original image without manual prompting. We propose cross-attention guidance, which aims to retain the cross-attention maps of the input image throughout the diffusion process. Our method does not need additional training for these edits and can directly use the existing text-to-image diffusion model.
arXiv Detail & Related papers (2023-02-06T18:59:51Z)
Imagen Editor and EditBench: Advancing and Evaluating Text-Guided Image Inpainting [53.708523312636096]
We present Imagen Editor, a cascaded diffusion model built, by fine-tuning on text-guided image inpainting. edits are faithful to the text prompts, which is accomplished by using object detectors to propose inpainting masks during training. To improve qualitative and quantitative evaluation, we introduce EditBench, a systematic benchmark for text-guided image inpainting.
arXiv Detail & Related papers (2022-12-13T21:25:11Z)
Plug-and-Play Diffusion Features for Text-Driven Image-to-Image Translation [10.39028769374367]
We present a new framework that takes text-to-image synthesis to the realm of image-to-image translation. Our method harnesses the power of a pre-trained text-to-image diffusion model to generate a new image that complies with the target text.
arXiv Detail & Related papers (2022-11-22T20:39:18Z)
DiffEdit: Diffusion-based semantic image editing with mask guidance [64.555930158319]
DiffEdit is a method to take advantage of text-conditioned diffusion models for the task of semantic image editing. Our main contribution is able to automatically generate a mask highlighting regions of the input image that need to be edited.
arXiv Detail & Related papers (2022-10-20T17:16:37Z)

This list is automatically generated from the titles and abstracts of the papers in this site.