DeltaEdit: Exploring Text-free Training for Text-Driven Image
Manipulation
- URL: http://arxiv.org/abs/2303.06285v1
- Date: Sat, 11 Mar 2023 02:38:31 GMT
- Title: DeltaEdit: Exploring Text-free Training for Text-Driven Image
Manipulation
- Authors: Yueming Lyu, Tianwei Lin, Fu Li, Dongliang He, Jing Dong, Tieniu Tan
- Abstract summary: We propose a novel framework named DeltaEdit to address these problems.
Based on the CLIP delta space, the DeltaEdit network is designed to map CLIP visual feature differences to the editing directions of StyleGAN.
- Score: 86.86227840278137
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Text-driven image manipulation remains challenging in terms of training and
inference flexibility. Conditional generative models depend heavily on expensive
annotated training data. Meanwhile, recent frameworks, which leverage
pre-trained vision-language models, are limited by either per text-prompt
optimization or inference-time hyper-parameter tuning. In this work, we
propose a novel framework named \textit{DeltaEdit} to address these problems.
Our key idea is to investigate and identify a space, namely the delta image-and-text
space, in which the CLIP visual feature differences of two images and the CLIP textual
embedding differences of source and target texts are well aligned. Based on this CLIP
delta space, the DeltaEdit network is designed to map CLIP visual feature differences
to the editing directions of StyleGAN during the training phase. Then, in the inference
phase, DeltaEdit predicts StyleGAN's editing directions from the differences of the
CLIP textual features. In this way, DeltaEdit is trained in a text-free manner. Once
trained, it generalizes well to various text prompts for zero-shot inference
without bells and whistles. Code is available at
https://github.com/Yueming6568/DeltaEdit.
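To make the text-free-training / text-driven-inference asymmetry concrete, the following is a minimal PyTorch-style sketch of the delta-space idea described above. The mapper architecture, dimensions, L1 loss, and the clip_image_encoder / clip_text_encoder / StyleGAN-latent plumbing are illustrative assumptions, not the released implementation; refer to the repository linked above for the actual code.

```python
# Illustrative sketch of the DeltaEdit idea (assumptions, not the official code):
# train a mapper on CLIP *image* feature deltas, run it on CLIP *text* deltas.
import torch
import torch.nn as nn
import torch.nn.functional as F


class DeltaMapper(nn.Module):
    """Maps a CLIP-space delta (feature difference) to a StyleGAN editing direction."""

    def __init__(self, clip_dim: int = 512, latent_dim: int = 6048):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(clip_dim, 1024), nn.LeakyReLU(0.2),
            nn.Linear(1024, 1024), nn.LeakyReLU(0.2),
            nn.Linear(1024, latent_dim),
        )

    def forward(self, delta_clip: torch.Tensor) -> torch.Tensor:
        # Normalizing the delta keeps image-image and text-text differences
        # on a comparable scale in CLIP space.
        return self.net(F.normalize(delta_clip, dim=-1))


def training_step(mapper, clip_image_encoder, img_src, img_tgt, s_src, s_tgt):
    """Text-free training: supervision comes from two real images and their latents."""
    with torch.no_grad():
        delta_i = clip_image_encoder(img_tgt) - clip_image_encoder(img_src)
    pred_delta_s = mapper(delta_i)            # predicted editing direction
    return F.l1_loss(pred_delta_s, s_tgt - s_src)


@torch.no_grad()
def edit_with_text(mapper, clip_text_encoder, s_src, src_text, tgt_text, alpha=1.0):
    """Zero-shot inference: the CLIP *text* delta stands in for the image delta."""
    delta_t = clip_text_encoder(tgt_text) - clip_text_encoder(src_text)
    return s_src + alpha * mapper(delta_t)    # edited latent, fed back to StyleGAN
```

The only thing that changes between the two phases is the source of the CLIP delta: image feature differences (paired with known latent differences) during text-free training, and text feature differences at zero-shot inference time.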
Related papers
- VIXEN: Visual Text Comparison Network for Image Difference Captioning [58.16313862434814]
We present VIXEN, a technique that succinctly summarizes in text the visual differences between a pair of images.
Our proposed network linearly maps image features in a pairwise manner, constructing a soft prompt for a pretrained large language model.
arXiv Detail & Related papers (2024-02-29T12:56:18Z)
- Text-Driven Image Editing via Learnable Regions [74.45313434129005]
We introduce a method for region-based image editing driven by textual prompts, without the need for user-provided masks or sketches.
We show that this simple approach enables flexible editing that is compatible with current image generation models.
Experiments demonstrate the competitive performance of our method in manipulating images with high fidelity and realism that correspond to the provided language descriptions.
arXiv Detail & Related papers (2023-11-28T02:27:31Z)
- DeltaSpace: A Semantic-aligned Feature Space for Flexible Text-guided Image Editing [22.354236929932476]
Text-guided image editing faces significant challenges to training and inference flexibility.
We propose a novel framework called DeltaEdit, which maps the CLIP visual feature differences to the latent space directions of a generative model.
Experiments validate the effectiveness and versatility of DeltaEdit with different generative models.
arXiv Detail & Related papers (2023-10-12T15:43:12Z)
- CLIP-Guided StyleGAN Inversion for Text-Driven Real Image Editing [22.40686064568406]
We present CLIPInverter, a new text-driven image editing approach that is able to efficiently and reliably perform multi-attribute changes.
Our method outperforms competing approaches in terms of manipulation accuracy and photo-realism on various domains including human faces, cats, and birds.
arXiv Detail & Related papers (2023-07-17T11:29:48Z)
- Improving Diffusion Models for Scene Text Editing with Dual Encoders [44.12999932588205]
Scene text editing is a challenging task that involves modifying or inserting specified texts in an image.
Recent advances in diffusion models have shown promise in overcoming these limitations with text-conditional image editing.
We propose DIFFSTE to improve pre-trained diffusion models with a dual encoder design.
arXiv Detail & Related papers (2023-04-12T02:08:34Z)
- Exploring Stroke-Level Modifications for Scene Text Editing [86.33216648792964]
Scene text editing (STE) aims to replace text with the desired one while preserving background and styles of the original text.
Previous methods of editing the whole image have to learn different translation rules of background and text regions simultaneously.
We propose a novel network by MOdifying Scene Text images at the strokE Level (MOSTEL).
arXiv Detail & Related papers (2022-12-05T02:10:59Z)
- One Model to Edit Them All: Free-Form Text-Driven Image Manipulation with Semantic Modulations [75.81725681546071]
Free-Form CLIP aims to establish an automatic latent mapping so that one manipulation model handles free-form text prompts.
For one type of image (e.g., human portrait), one FFCLIP model can be learned to handle free-form text prompts.
Both visual and numerical results show that FFCLIP effectively produces semantically accurate and visually realistic images.
arXiv Detail & Related papers (2022-10-14T15:06:05Z)
- Towards Counterfactual Image Manipulation via CLIP [106.94502632502194]
Existing methods can achieve realistic editing of different visual attributes such as age and gender of facial images.
We investigate this problem in a text-driven manner with Contrastive Language-Image Pre-training (CLIP).
We design a novel contrastive loss that exploits predefined CLIP-space directions to guide the editing toward desired directions from different perspectives.
arXiv Detail & Related papers (2022-07-06T17:02:25Z)