Towards Counterfactual Image Manipulation via CLIP
- URL: http://arxiv.org/abs/2207.02812v2
- Date: Thu, 7 Jul 2022 04:57:58 GMT
- Title: Towards Counterfactual Image Manipulation via CLIP
- Authors: Yingchen Yu, Fangneng Zhan, Rongliang Wu, Jiahui Zhang, Shijian Lu,
Miaomiao Cui, Xuansong Xie, Xian-Sheng Hua, Chunyan Miao
- Abstract summary: Existing methods can achieve realistic editing of different visual attributes such as age and gender of facial images.
We investigate this problem in a text-driven manner with Contrastive Language-Image Pre-training (CLIP).
We design a novel contrastive loss that exploits predefined CLIP-space directions to guide the editing toward desired directions from different perspectives.
- Score: 106.94502632502194
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Leveraging StyleGAN's expressivity and its disentangled latent codes,
existing methods can achieve realistic editing of different visual attributes
such as age and gender of facial images. An intriguing yet challenging problem
arises: Can generative models achieve counterfactual editing against their
learnt priors? Due to the lack of counterfactual samples in natural datasets,
we investigate this problem in a text-driven manner with Contrastive
Language-Image Pre-training (CLIP), which can offer rich semantic
knowledge even for various counterfactual concepts. Different from in-domain
manipulation, counterfactual manipulation requires more comprehensive
exploitation of semantic knowledge encapsulated in CLIP as well as more
delicate handling of editing directions to avoid getting stuck in local minima
or undesired edits. To this end, we design a novel contrastive loss
that exploits predefined CLIP-space directions to guide the editing toward
desired directions from different perspectives. In addition, we design a simple
yet effective scheme that explicitly maps CLIP embeddings (of target text) to
the latent space and fuses them with latent codes for effective latent code
optimization and accurate editing. Extensive experiments show that our design
achieves accurate and realistic editing when driven by target texts with
various counterfactual concepts.
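The abstract describes two components: a contrastive loss built on predefined CLIP-space directions, and a scheme that maps the target text's CLIP embedding into the latent space before fusing it with the latent code. The PyTorch sketch below illustrates how such pieces could look; the names directional_contrastive_loss and TextToLatentMapper, the InfoNCE-style loss form, the additive fusion, and the temperature tau are illustrative assumptions rather than the paper's exact formulation, and clip_model is assumed to expose the encode_image/encode_text interface of the open-source CLIP release.

import torch
import torch.nn.functional as F


def directional_contrastive_loss(clip_model, src_img, edit_img,
                                 src_tokens, tgt_tokens, neg_tokens, tau=0.07):
    """Pull the image edit direction toward the source-to-target text direction
    and push it away from directions toward unrelated (negative) prompts."""
    with torch.no_grad():  # text anchors come from the frozen CLIP text encoder
        t_src = F.normalize(clip_model.encode_text(src_tokens).float(), dim=-1)  # (1, D)
        t_tgt = F.normalize(clip_model.encode_text(tgt_tokens).float(), dim=-1)  # (1, D)
        t_neg = F.normalize(clip_model.encode_text(neg_tokens).float(), dim=-1)  # (N, D)

    i_src = F.normalize(clip_model.encode_image(src_img).float(), dim=-1)    # (B, D)
    i_edit = F.normalize(clip_model.encode_image(edit_img).float(), dim=-1)  # (B, D)

    d_img = F.normalize(i_edit - i_src, dim=-1)  # edit direction in CLIP space
    d_pos = F.normalize(t_tgt - t_src, dim=-1)   # desired (target text) direction
    d_neg = F.normalize(t_neg - t_src, dim=-1)   # undesired directions

    pos = d_img @ d_pos.t()                      # (B, 1) similarity to the positive
    neg = d_img @ d_neg.t()                      # (B, N) similarities to negatives
    logits = torch.cat([pos, neg], dim=1) / tau
    labels = torch.zeros(logits.size(0), dtype=torch.long, device=logits.device)
    return F.cross_entropy(logits, labels)       # InfoNCE over CLIP-space directions


class TextToLatentMapper(torch.nn.Module):
    """Hypothetical mapper that projects a CLIP text embedding to a latent
    offset and fuses it additively with a StyleGAN W+ code."""

    def __init__(self, clip_dim=512, w_dim=512, n_layers=18):
        super().__init__()
        self.n_layers, self.w_dim = n_layers, w_dim
        self.net = torch.nn.Sequential(
            torch.nn.Linear(clip_dim, w_dim),
            torch.nn.LeakyReLU(0.2),
            torch.nn.Linear(w_dim, w_dim * n_layers),
        )

    def forward(self, text_emb, w_plus):
        # text_emb: (B, clip_dim); w_plus: (B, n_layers, w_dim)
        delta = self.net(text_emb).view(-1, self.n_layers, self.w_dim)
        return w_plus + delta  # fused code fed to the (frozen) generator

In an optimization loop, CLIP and the StyleGAN generator would stay frozen while the fused latent code (or the mapper) is updated under this loss together with the usual reconstruction and regularization terms.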
Related papers
- Lost in Edits? A $λ$-Compass for AIGC Provenance [119.95562081325552]
We propose a novel latent-space attribution method that robustly identifies and differentiates authentic outputs from manipulated ones.
LambdaTracer is effective across diverse iterative editing processes, whether automated through text-guided editing tools such as InstructPix2Pix or performed manually with editing software such as Adobe Photoshop.
arXiv Detail & Related papers (2025-02-05T06:24:25Z)
- UIP2P: Unsupervised Instruction-based Image Editing via Cycle Edit Consistency [69.33072075580483]
We propose an unsupervised model for instruction-based image editing that eliminates the need for ground-truth edited images during training.
Our method addresses these challenges by introducing a novel editing mechanism called Cycle Edit Consistency (CEC).
CEC applies forward and backward edits in one training step and enforces consistency in image and attention spaces.
arXiv Detail & Related papers (2024-12-19T18:59:58Z)
- Optimisation-Based Multi-Modal Semantic Image Editing [58.496064583110694]
We propose an inference-time editing optimisation to accommodate multiple editing instruction types.
By allowing the influence of each loss function to be adjusted, we build a flexible editing solution that can be tailored to user preferences.
We evaluate our method using text, pose and scribble edit conditions, and highlight our ability to achieve complex edits.
arXiv Detail & Related papers (2023-11-28T15:31:11Z)
- CLIP-Guided StyleGAN Inversion for Text-Driven Real Image Editing [22.40686064568406]
We present CLIPInverter, a new text-driven image editing approach that is able to efficiently and reliably perform multi-attribute changes.
Our method outperforms competing approaches in terms of manipulation accuracy and photo-realism on various domains including human faces, cats, and birds.
arXiv Detail & Related papers (2023-07-17T11:29:48Z)
- CLIP-PAE: Projection-Augmentation Embedding to Extract Relevant Features for a Disentangled, Interpretable, and Controllable Text-Guided Face Manipulation [4.078926358349661]
Contrastive Language-Image Pre-Training (CLIP) bridges images and text by embedding them into a joint latent space.
Due to the discrepancy between image and text embeddings in the joint space, using text embeddings as the optimization target often introduces undesired artifacts in the resulting images.
We introduce CLIP Projection-Augmentation Embedding (PAE) as an optimization target to improve the performance of text-guided image manipulation.
arXiv Detail & Related papers (2022-10-08T05:12:25Z)
- Expanding the Latent Space of StyleGAN for Real Face Editing [4.1715767752637145]
A surge of face editing techniques has been proposed that employ pretrained StyleGAN models for semantic manipulation.
To successfully edit a real image, one must first convert the input image into StyleGAN's latent variables.
We present a method to expand the latent space of StyleGAN with additional content features to break down the trade-off between low-distortion and high-editability.
arXiv Detail & Related papers (2022-04-26T18:27:53Z)
- CLIP2StyleGAN: Unsupervised Extraction of StyleGAN Edit Directions [65.00528970576401]
StyleGAN has enabled unprecedented semantic editing capabilities, on both synthesized and real images.
We propose two novel building blocks: one for finding interesting CLIP directions and one for labeling arbitrary directions in CLIP latent space.
We evaluate the effectiveness of the proposed method and demonstrate that extraction of disentangled labeled StyleGAN edit directions is indeed possible.
arXiv Detail & Related papers (2021-12-09T21:26:03Z)
- HairCLIP: Design Your Hair by Text and Reference Image [100.85116679883724]
This paper proposes a new hair editing interaction mode, which enables manipulating hair attributes individually or jointly.
We encode the image and text conditions in a shared embedding space and propose a unified hair editing framework.
With the carefully designed network structures and loss functions, our framework can perform high-quality hair editing.
arXiv Detail & Related papers (2021-12-09T18:59:58Z)
- CLIP-NeRF: Text-and-Image Driven Manipulation of Neural Radiance Fields [33.43993665841577]
We present CLIP-NeRF, a multi-modal 3D object manipulation method for neural radiance fields (NeRF).
We propose a unified framework that allows manipulating NeRF in a user-friendly way, using either a short text prompt or an exemplar image.
We evaluate our approach by extensive experiments on a variety of text prompts and exemplar images.
arXiv Detail & Related papers (2021-12-09T18:59:55Z)
This list is automatically generated from the titles and abstracts of the papers on this site.