Towards Counterfactual Image Manipulation via CLIP
- URL: http://arxiv.org/abs/2207.02812v2
- Date: Thu, 7 Jul 2022 04:57:58 GMT
- Title: Towards Counterfactual Image Manipulation via CLIP
- Authors: Yingchen Yu, Fangneng Zhan, Rongliang Wu, Jiahui Zhang, Shijian Lu,
Miaomiao Cui, Xuansong Xie, Xian-Sheng Hua, Chunyan Miao
- Abstract summary: Existing methods can achieve realistic editing of different visual attributes such as age and gender of facial images.
We investigate this problem in a text-driven manner with Contrastive Language-Image Pre-training (CLIP).
We design a novel contrastive loss that exploits predefined CLIP-space directions to guide the editing toward desired directions from different perspectives.
- Score: 106.94502632502194
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Leveraging StyleGAN's expressivity and its disentangled latent codes,
existing methods can achieve realistic editing of different visual attributes
such as age and gender of facial images. An intriguing yet challenging problem
arises: Can generative models achieve counterfactual editing against their
learnt priors? Due to the lack of counterfactual samples in natural datasets,
we investigate this problem in a text-driven manner with Contrastive
Language-Image Pre-training (CLIP), which can offer rich semantic
knowledge even for various counterfactual concepts. Different from in-domain
manipulation, counterfactual manipulation requires more comprehensive
exploitation of semantic knowledge encapsulated in CLIP as well as more
delicate handling of editing directions to avoid getting stuck in local minima
or undesired edits. To this end, we design a novel contrastive loss
that exploits predefined CLIP-space directions to guide the editing toward
desired directions from different perspectives. In addition, we design a simple
yet effective scheme that explicitly maps CLIP embeddings (of target text) to
the latent space and fuses them with latent codes for effective latent code
optimization and accurate editing. Extensive experiments show that our design
achieves accurate and realistic editing when driven by target texts with
various counterfactual concepts.
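The abstract describes two components: a contrastive loss built on predefined CLIP-space directions, and a scheme that maps the target text's CLIP embedding into the latent space before fusing it with the latent code. The PyTorch sketch below illustrates how such pieces could look; the names directional_contrastive_loss and TextToLatentMapper, the InfoNCE-style loss form, the additive fusion, and the temperature tau are illustrative assumptions rather than the paper's exact formulation, and clip_model is assumed to expose the encode_image/encode_text interface of the open-source CLIP release.

import torch
import torch.nn.functional as F


def directional_contrastive_loss(clip_model, src_img, edit_img,
                                 src_tokens, tgt_tokens, neg_tokens, tau=0.07):
    """Pull the image edit direction toward the source-to-target text direction
    and push it away from directions toward unrelated (negative) prompts."""
    with torch.no_grad():  # text anchors come from the frozen CLIP text encoder
        t_src = F.normalize(clip_model.encode_text(src_tokens).float(), dim=-1)  # (1, D)
        t_tgt = F.normalize(clip_model.encode_text(tgt_tokens).float(), dim=-1)  # (1, D)
        t_neg = F.normalize(clip_model.encode_text(neg_tokens).float(), dim=-1)  # (N, D)

    i_src = F.normalize(clip_model.encode_image(src_img).float(), dim=-1)    # (B, D)
    i_edit = F.normalize(clip_model.encode_image(edit_img).float(), dim=-1)  # (B, D)

    d_img = F.normalize(i_edit - i_src, dim=-1)  # edit direction in CLIP space
    d_pos = F.normalize(t_tgt - t_src, dim=-1)   # desired (target text) direction
    d_neg = F.normalize(t_neg - t_src, dim=-1)   # undesired directions

    pos = d_img @ d_pos.t()                      # (B, 1) similarity to the positive
    neg = d_img @ d_neg.t()                      # (B, N) similarities to negatives
    logits = torch.cat([pos, neg], dim=1) / tau
    labels = torch.zeros(logits.size(0), dtype=torch.long, device=logits.device)
    return F.cross_entropy(logits, labels)       # InfoNCE over CLIP-space directions


class TextToLatentMapper(torch.nn.Module):
    """Hypothetical mapper that projects a CLIP text embedding to a latent
    offset and fuses it additively with a StyleGAN W+ code."""

    def __init__(self, clip_dim=512, w_dim=512, n_layers=18):
        super().__init__()
        self.n_layers, self.w_dim = n_layers, w_dim
        self.net = torch.nn.Sequential(
            torch.nn.Linear(clip_dim, w_dim),
            torch.nn.LeakyReLU(0.2),
            torch.nn.Linear(w_dim, w_dim * n_layers),
        )

    def forward(self, text_emb, w_plus):
        # text_emb: (B, clip_dim); w_plus: (B, n_layers, w_dim)
        delta = self.net(text_emb).view(-1, self.n_layers, self.w_dim)
        return w_plus + delta  # fused code fed to the (frozen) generator

In an optimization loop, CLIP and the StyleGAN generator would stay frozen while the fused latent code (or the mapper) is updated under this loss together with the usual reconstruction and regularization terms.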
Related papers
- Lost in Edits? A $λ$-Compass for AIGC Provenance [119.95562081325552]
We propose a novel latent-space attribution method that robustly identifies and differentiates authentic outputs from manipulated ones.
LambdaTracer is effective across diverse iterative editing processes, whether automated through text-guided editing tools such as InstructPix2Pix or performed manually with editing software such as Adobe Photoshop.
arXiv Detail & Related papers (2025-02-05T06:24:25Z)
- UIP2P: Unsupervised Instruction-based Image Editing via Cycle Edit Consistency [69.33072075580483]
We propose an unsupervised model for instruction-based image editing that eliminates the need for ground-truth edited images during training.
Our method addresses these challenges by introducing a novel editing mechanism called Cycle Edit Consistency (CEC).
CEC applies forward and backward edits in one training step and enforces consistency in image and attention spaces.
arXiv Detail & Related papers (2024-12-19T18:59:58Z)
- Optimisation-Based Multi-Modal Semantic Image Editing [58.496064583110694]
We propose an inference-time editing optimisation to accommodate multiple editing instruction types.
By allowing the influence of each loss function to be adjusted, we build a flexible editing solution that can be tailored to user preferences.
We evaluate our method using text, pose and scribble edit conditions, and highlight our ability to achieve complex edits.
arXiv Detail & Related papers (2023-11-28T15:31:11Z)
- CLIP-Guided StyleGAN Inversion for Text-Driven Real Image Editing [22.40686064568406]
We present CLIPInverter, a new text-driven image editing approach that is able to efficiently and reliably perform multi-attribute changes.
Our method outperforms competing approaches in terms of manipulation accuracy and photo-realism on various domains including human faces, cats, and birds.
arXiv Detail & Related papers (2023-07-17T11:29:48Z)
- CLIP-PAE: Projection-Augmentation Embedding to Extract Relevant Features for a Disentangled, Interpretable, and Controllable Text-Guided Face Manipulation [4.078926358349661]
Contrastive Language-Image Pre-Training (CLIP) bridges images and text by embedding them into a joint latent space.
Due to the discrepancy between image and text embeddings in the joint space, using text embeddings as the optimization target often introduces undesired artifacts in the resulting images.
We introduce CLIP Projection-Augmentation Embedding (PAE) as an optimization target to improve the performance of text-guided image manipulation.
arXiv Detail & Related papers (2022-10-08T05:12:25Z)
- Expanding the Latent Space of StyleGAN for Real Face Editing [4.1715767752637145]
A surge of face editing techniques has been proposed that employ pretrained StyleGAN models for semantic manipulation.
To successfully edit a real image, one must first convert the input image into StyleGAN's latent variables.
We present a method to expand the latent space of StyleGAN with additional content features to break down the trade-off between low-distortion and high-editability.
arXiv Detail & Related papers (2022-04-26T18:27:53Z)
- CLIP2StyleGAN: Unsupervised Extraction of StyleGAN Edit Directions [65.00528970576401]
StyleGAN has enabled unprecedented semantic editing capabilities, on both synthesized and real images.
We propose two novel building blocks: one for finding interesting CLIP directions and one for labeling arbitrary directions in CLIP latent space.
We evaluate the effectiveness of the proposed method and demonstrate that extraction of disentangled labeled StyleGAN edit directions is indeed possible.
arXiv Detail & Related papers (2021-12-09T21:26:03Z)
- HairCLIP: Design Your Hair by Text and Reference Image [100.85116679883724]
This paper proposes a new hair editing interaction mode, which enables manipulating hair attributes individually or jointly.
We encode the image and text conditions in a shared embedding space and propose a unified hair editing framework.
With the carefully designed network structures and loss functions, our framework can perform high-quality hair editing.
arXiv Detail & Related papers (2021-12-09T18:59:58Z)
- CLIP-NeRF: Text-and-Image Driven Manipulation of Neural Radiance Fields [33.43993665841577]
We present CLIP-NeRF, a multi-modal 3D object manipulation method for neural radiance fields (NeRF).
We propose a unified framework that allows manipulating NeRF in a user-friendly way, using either a short text prompt or an exemplar image.
We evaluate our approach by extensive experiments on a variety of text prompts and exemplar images.
arXiv Detail & Related papers (2021-12-09T18:59:55Z)
This list is automatically generated from the titles and abstracts of the papers on this site.