StyleCLIP: Text-Driven Manipulation of StyleGAN Imagery
- URL: http://arxiv.org/abs/2103.17249v1
- Date: Wed, 31 Mar 2021 17:51:25 GMT
- Title: StyleCLIP: Text-Driven Manipulation of StyleGAN Imagery
- Authors: Or Patashnik, Zongze Wu, Eli Shechtman, Daniel Cohen-Or, Dani
Lischinski
- Abstract summary: We develop a text-based interface for StyleGAN image manipulation.
We first introduce an optimization scheme that utilizes a CLIP-based loss to modify an input latent vector in response to a user-provided text prompt.
Next, we describe a latent mapper that infers a text-guided latent manipulation step for a given input image, allowing faster and more stable text-based manipulation.
- Score: 71.1862388442953
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Inspired by the ability of StyleGAN to generate highly realistic images in a
variety of domains, much recent work has focused on understanding how to use
the latent spaces of StyleGAN to manipulate generated and real images. However,
discovering semantically meaningful latent manipulations typically involves
painstaking human examination of the many degrees of freedom, or an annotated
collection of images for each desired manipulation. In this work, we explore
leveraging the power of recently introduced Contrastive Language-Image
Pre-training (CLIP) models in order to develop a text-based interface for
StyleGAN image manipulation that does not require such manual effort. We first
introduce an optimization scheme that utilizes a CLIP-based loss to modify an
input latent vector in response to a user-provided text prompt. Next, we
describe a latent mapper that infers a text-guided latent manipulation step for
a given input image, allowing faster and more stable text-based manipulation.
Finally, we present a method for mapping text prompts to input-agnostic
directions in StyleGAN's style space, enabling interactive text-driven image
manipulation. Extensive results and comparisons demonstrate the effectiveness
of our approaches.
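The first of these techniques, a CLIP-guided latent optimization, can be sketched in a few lines of PyTorch. The code below is a minimal, illustrative sketch rather than the paper's implementation: it assumes a pretrained StyleGAN generator `G` with a hypothetical `G(w)` call that renders an image in [-1, 1] from a latent code, uses the open-source `clip` package, approximates the CLIP term as one minus cosine similarity, and omits the identity-preservation loss used in the paper; the hyperparameter values are placeholders.

```python
import torch
import torch.nn.functional as F
import clip  # https://github.com/openai/CLIP

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, _ = clip.load("ViT-B/32", device=device)
clip_model = clip_model.float()  # keep everything in fp32 for simplicity

# Normalization constants expected by CLIP's image encoder.
CLIP_MEAN = torch.tensor([0.48145466, 0.4578275, 0.40821073], device=device).view(1, 3, 1, 1)
CLIP_STD = torch.tensor([0.26862954, 0.26130258, 0.27577711], device=device).view(1, 3, 1, 1)


def clip_distance(image, text_features):
    """One minus cosine similarity between a generated image and the encoded prompt."""
    image = F.interpolate(image, size=224, mode="bilinear", align_corners=False)
    image = ((image + 1.0) / 2.0 - CLIP_MEAN) / CLIP_STD  # [-1, 1] -> CLIP statistics
    image_features = clip_model.encode_image(image)
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    return 1.0 - (image_features * text_features).sum(dim=-1).mean()


def edit_latent(G, w_source, prompt, steps=200, lr=0.01, l2_lambda=0.008):
    """Optimize a copy of the source latent so the rendered image matches the prompt."""
    w_source = w_source.detach()
    tokens = clip.tokenize([prompt]).to(device)
    with torch.no_grad():
        text_features = clip_model.encode_text(tokens).float()
        text_features = text_features / text_features.norm(dim=-1, keepdim=True)

    w = w_source.clone().requires_grad_(True)
    optimizer = torch.optim.Adam([w], lr=lr)
    for _ in range(steps):
        image = G(w)  # hypothetical generator interface: latent code -> RGB image
        # The CLIP term pulls the image toward the prompt; the L2 term keeps the edit
        # close to the source latent (the paper additionally uses an identity loss,
        # omitted here).
        loss = clip_distance(image, text_features) + l2_lambda * ((w - w_source) ** 2).sum()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return w.detach()
```

Under these assumptions, a single edit amounts to something like `w_edit = edit_latent(G, w_source, "a face with blue eyes")`, after which the generator renders the edited image from `w_edit`; the latent mapper and global-direction variants described in the abstract trade this per-image optimization for a single feed-forward step.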
Related papers
- CLIP-Guided StyleGAN Inversion for Text-Driven Real Image Editing [22.40686064568406]
We present CLIPInverter, a new text-driven image editing approach that is able to efficiently and reliably perform multi-attribute changes.
Our method outperforms competing approaches in terms of manipulation accuracy and photo-realism on various domains including human faces, cats, and birds.
arXiv Detail & Related papers (2023-07-17T11:29:48Z)
- TD-GEM: Text-Driven Garment Editing Mapper [15.121103742607383]
We propose a Text-Driven Garment Editing Mapper (TD-GEM) to edit fashion items in a disentangled way.
An optimization scheme based on Contrastive Language-Image Pre-training (CLIP) is then used to guide the latent representation of the fashion image.
Our TD-GEM manipulates the image accurately according to the target attribute expressed in terms of a text prompt.
arXiv Detail & Related papers (2023-05-29T14:31:54Z)
- Towards Arbitrary Text-driven Image Manipulation via Space Alignment [49.3370305074319]
We propose a new Text-driven image Manipulation framework via Space Alignment (TMSA).
TMSA aims to align the same semantic regions in CLIP and StyleGAN spaces.
The framework can support arbitrary image editing mode without additional cost.
arXiv Detail & Related papers (2023-01-25T16:20:01Z)
- One Model to Edit Them All: Free-Form Text-Driven Image Manipulation with Semantic Modulations [75.81725681546071]
Free-Form CLIP aims to establish an automatic latent mapping so that one manipulation model handles free-form text prompts.
For one type of image (e.g., 'human portrait'), one FFCLIP model can be learned to handle free-form text prompts.
Both visual and numerical results show that FFCLIP effectively produces semantically accurate and visually realistic images.
arXiv Detail & Related papers (2022-10-14T15:06:05Z)
- CLIP-PAE: Projection-Augmentation Embedding to Extract Relevant Features for a Disentangled, Interpretable, and Controllable Text-Guided Face Manipulation [4.078926358349661]
Contrastive Language-Image Pre-Training (CLIP) bridges images and text by embedding them into a joint latent space.
Due to the discrepancy between image and text embeddings in the joint space, using text embeddings as the optimization target often introduces undesired artifacts in the resulting images.
We introduce CLIP Projection-Augmentation Embedding (PAE) as an optimization target to improve the performance of text-guided image manipulation.
arXiv Detail & Related papers (2022-10-08T05:12:25Z)
- LDEdit: Towards Generalized Text Guided Image Manipulation via Latent Diffusion Models [12.06277444740134]
Generic image manipulation using a single model with flexible text inputs is highly desirable.
Recent work addresses this task by using pretrained vision-language encoders to guide generative models trained on generic image data.
We propose an optimization-free method for the task of generic image manipulation from text prompts.
arXiv Detail & Related papers (2022-10-05T13:26:15Z)
- FlexIT: Towards Flexible Semantic Image Translation [59.09398209706869]
We propose FlexIT, a novel method which can take any input image and a user-defined text instruction for editing.
First, FlexIT combines the input image and text into a single target point in the CLIP multimodal embedding space.
We iteratively transform the input image toward the target point, ensuring coherence and quality with a variety of novel regularization terms.
arXiv Detail & Related papers (2022-03-09T13:34:38Z)
- Towards Open-World Text-Guided Face Image Generation and Manipulation [52.83401421019309]
We propose a unified framework for both face image generation and manipulation.
Our method supports open-world scenarios, including both image and text, without any re-training, fine-tuning, or post-processing.
arXiv Detail & Related papers (2021-04-18T16:56:07Z)
- TediGAN: Text-Guided Diverse Face Image Generation and Manipulation [52.83401421019309]
TediGAN is a framework for multi-modal image generation and manipulation with textual descriptions.
The StyleGAN inversion module maps real images to the latent space of a well-trained StyleGAN.
Visual-linguistic similarity learning matches images and text by mapping them into a common embedding space.
Instance-level optimization preserves identity during manipulation.
arXiv Detail & Related papers (2020-12-06T16:20:19Z)