Towards Counterfactual Image Manipulation via CLIP
- URL: http://arxiv.org/abs/2207.02812v2
- Date: Thu, 7 Jul 2022 04:57:58 GMT
- Title: Towards Counterfactual Image Manipulation via CLIP
- Authors: Yingchen Yu, Fangneng Zhan, Rongliang Wu, Jiahui Zhang, Shijian Lu,
Miaomiao Cui, Xuansong Xie, Xian-Sheng Hua, Chunyan Miao
- Abstract summary: Existing methods can achieve realistic editing of different visual attributes such as age and gender of facial images.
We investigate this problem in a text-driven manner with Contrastive-Language-Image-Pretraining (CLIP).
We design a novel contrastive loss that exploits predefined CLIP-space directions to guide the editing toward desired directions from different perspectives.
- Score: 106.94502632502194
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Leveraging StyleGAN's expressivity and its disentangled latent codes,
existing methods can achieve realistic editing of different visual attributes
such as age and gender of facial images. An intriguing yet challenging problem
arises: Can generative models achieve counterfactual editing against their
learnt priors? Due to the lack of counterfactual samples in natural datasets,
we investigate this problem in a text-driven manner with
Contrastive-Language-Image-Pretraining (CLIP), which can offer rich semantic
knowledge even for various counterfactual concepts. Different from in-domain
manipulation, counterfactual manipulation requires more comprehensive
exploitation of semantic knowledge encapsulated in CLIP as well as more
delicate handling of editing directions to avoid getting stuck in local
minima or producing undesired edits. To this end, we design a novel contrastive loss
that exploits predefined CLIP-space directions to guide the editing toward
desired directions from different perspectives. In addition, we design a simple
yet effective scheme that explicitly maps CLIP embeddings (of target text) to
the latent space and fuses them with latent codes for effective latent code
optimization and accurate editing. Extensive experiments show that our design
achieves accurate and realistic editing when driven by target texts with
various counterfactual concepts.
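The abstract names two design elements: a contrastive loss over predefined CLIP-space directions, and a scheme that maps the target-text CLIP embedding into the generator's latent space and fuses it with the latent code. The sketch below illustrates the general shape of those two ideas in PyTorch; it is a minimal, hedged reading of the abstract, not the authors' implementation. The function and class names, the temperature, the fusion-by-addition choice, and the assumption that CLIP embeddings and W+ latents are already available as tensors are all this note's own.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def directional_contrastive_loss(img_orig, img_edit, txt_src, txt_tgt,
                                 neg_txt_dirs, temperature=0.1):
    """InfoNCE-style loss over CLIP-space directions (illustrative only).

    img_orig, img_edit: CLIP image embeddings of the original / edited image, shape (d,).
    txt_src, txt_tgt:   CLIP text embeddings of the source / target prompt, shape (d,).
    neg_txt_dirs:       predefined CLIP-space directions used as negatives, shape (N, d).
    """
    # Editing direction induced in CLIP space, and the desired (positive) text direction.
    d_img = F.normalize(img_edit - img_orig, dim=-1)
    d_pos = F.normalize(txt_tgt - txt_src, dim=-1)
    d_neg = F.normalize(neg_txt_dirs, dim=-1)

    # Cosine similarities act as logits: one positive direction vs. N negatives.
    pos = (d_img * d_pos).sum(-1, keepdim=True)      # shape (1,)
    neg = d_img @ d_neg.t()                          # shape (N,)
    logits = torch.cat([pos, neg], dim=-1) / temperature
    target = torch.zeros(1, dtype=torch.long)        # the positive sits at index 0
    return F.cross_entropy(logits.unsqueeze(0), target)


class TextToLatentMapper(nn.Module):
    """Maps a CLIP text embedding into the generator's latent space so it can be
    fused (here simply added) with a W+ latent code before optimisation."""

    def __init__(self, clip_dim=512, latent_dim=512, n_layers=18):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(clip_dim, latent_dim), nn.LeakyReLU(0.2),
            nn.Linear(latent_dim, latent_dim * n_layers),
        )
        self.n_layers, self.latent_dim = n_layers, latent_dim

    def forward(self, txt_emb, w_plus):
        # txt_emb: (B, clip_dim); w_plus: (B, n_layers, latent_dim)
        offset = self.net(txt_emb).view(-1, self.n_layers, self.latent_dim)
        return w_plus + offset
```

In the paper the choice of negatives, the exact fusion, and the optimisation schedule are part of the method itself; the sketch only fixes the overall structure of a direction-based contrastive objective paired with a text-conditioned latent offset.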
Related papers
- Optimisation-Based Multi-Modal Semantic Image Editing [58.496064583110694]
We propose an inference-time editing optimisation to accommodate multiple editing instruction types.
By allowing the influence of each loss function to be adjusted, we build a flexible editing solution that can be tailored to user preferences.
We evaluate our method using text, pose and scribble edit conditions, and highlight our ability to achieve complex edits.
arXiv Detail & Related papers (2023-11-28T15:31:11Z)
- CLIP-Guided StyleGAN Inversion for Text-Driven Real Image Editing [22.40686064568406]
We present CLIPInverter, a new text-driven image editing approach that is able to efficiently and reliably perform multi-attribute changes.
Our method outperforms competing approaches in terms of manipulation accuracy and photo-realism on various domains including human faces, cats, and birds.
arXiv Detail & Related papers (2023-07-17T11:29:48Z)
- Domain-Agnostic Tuning-Encoder for Fast Personalization of Text-To-Image Models [59.094601993993535]
Text-to-image (T2I) personalization allows users to combine their own visual concepts in natural language prompts.
Most existing encoders are limited to a single-class domain, which hinders their ability to handle diverse concepts.
We propose a domain-agnostic method that does not require any specialized dataset or prior information about the personalized concepts.
arXiv Detail & Related papers (2023-07-13T17:46:42Z)
- CLIP-PAE: Projection-Augmentation Embedding to Extract Relevant Features for a Disentangled, Interpretable, and Controllable Text-Guided Face Manipulation [4.078926358349661]
Contrastive Language-Image Pre-Training (CLIP) bridges images and text by embedding them into a joint latent space.
Due to the discrepancy between image and text embeddings in the joint space, using text embeddings as the optimization target often introduces undesired artifacts in the resulting images.
We introduce CLIP Projection-Augmentation Embedding (PAE) as an optimization target to improve the performance of text-guided image manipulation.
arXiv Detail & Related papers (2022-10-08T05:12:25Z)
- Expanding the Latent Space of StyleGAN for Real Face Editing [4.1715767752637145]
A surge of face editing techniques has been proposed that employ the pretrained StyleGAN for semantic manipulation.
To successfully edit a real image, one must first convert the input image into StyleGAN's latent variables.
We present a method to expand the latent space of StyleGAN with additional content features to break down the trade-off between low distortion and high editability.
arXiv Detail & Related papers (2022-04-26T18:27:53Z)
- CLIP2StyleGAN: Unsupervised Extraction of StyleGAN Edit Directions [65.00528970576401]
StyleGAN has enabled unprecedented semantic editing capabilities on both synthesized and real images.
We propose two novel building blocks; one for finding interesting CLIP directions and one for labeling arbitrary directions in CLIP latent space.
We evaluate the effectiveness of the proposed method and demonstrate that extraction of disentangled labeled StyleGAN edit directions is indeed possible.
arXiv Detail & Related papers (2021-12-09T21:26:03Z)
- HairCLIP: Design Your Hair by Text and Reference Image [100.85116679883724]
This paper proposes a new hair editing interaction mode, which enables manipulating hair attributes individually or jointly.
We encode the image and text conditions in a shared embedding space and propose a unified hair editing framework.
With the carefully designed network structures and loss functions, our framework can perform high-quality hair editing.
arXiv Detail & Related papers (2021-12-09T18:59:58Z)
- CLIP-NeRF: Text-and-Image Driven Manipulation of Neural Radiance Fields [33.43993665841577]
We present CLIP-NeRF, a multi-modal 3D object manipulation method for neural radiance fields (NeRF).
We propose a unified framework that allows manipulating NeRF in a user-friendly way, using either a short text prompt or an exemplar image.
We evaluate our approach by extensive experiments on a variety of text prompts and exemplar images.
arXiv Detail & Related papers (2021-12-09T18:59:55Z)
- StyleCLIP: Text-Driven Manipulation of StyleGAN Imagery [71.1862388442953]
We develop a text-based interface for StyleGAN image manipulation.
We first introduce an optimization scheme that utilizes a CLIP-based loss to modify an input latent vector in response to a user-provided text prompt (a minimal sketch of this loop appears after this list).
Next, we describe a latent mapper that infers a text-guided latent manipulation step for a given input image, allowing faster and more stable text-based manipulation.
arXiv Detail & Related papers (2021-03-31T17:51:25Z)
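StyleCLIP above (and, by extension, the latent-optimisation route taken in the main paper) rests on a common loop: freeze a pretrained generator, render an image from a latent code, score it against the target text with CLIP, and back-propagate into the latent code only. The following is a hedged sketch of that loop; the generator and CLIP image encoder are treated as given callables, and the step count, learning rate, and regularisation weight are illustrative placeholders rather than any paper's settings.

```python
import torch
import torch.nn.functional as F


def clip_guided_latent_edit(generator, clip_image_encoder, text_embedding,
                            w_init, steps=200, lr=0.05, lambda_l2=0.008):
    """Optimise a latent code so the generated image matches a text prompt.

    generator:          callable mapping a latent code w to an image tensor.
    clip_image_encoder: callable mapping an image tensor to a CLIP embedding.
    text_embedding:     precomputed CLIP embedding of the target text.
    w_init:             starting latent code (e.g. obtained via GAN inversion).
    """
    w = w_init.clone().detach().requires_grad_(True)
    optimizer = torch.optim.Adam([w], lr=lr)
    txt = F.normalize(text_embedding, dim=-1)

    for _ in range(steps):
        img = generator(w)
        img_emb = F.normalize(clip_image_encoder(img), dim=-1)

        clip_loss = 1.0 - (img_emb * txt).sum(-1).mean()      # cosine distance to the prompt
        reg_loss = (w - w_init.detach()).pow(2).mean()        # stay close to the starting latent
        loss = clip_loss + lambda_l2 * reg_loss

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return w.detach()
```

The L2 term toward the initial latent is one common way to keep the edit close to the input; the papers above each replace or extend it with their own regularisers and losses (identity preservation, mappers, or the directional contrastive objective sketched earlier).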
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the generated content (including all information) and is not responsible for any consequences arising from its use.