Predict, Prevent, and Evaluate: Disentangled Text-Driven Image
Manipulation Empowered by Pre-Trained Vision-Language Model
- URL: http://arxiv.org/abs/2111.13333v1
- Date: Fri, 26 Nov 2021 06:49:26 GMT
- Title: Predict, Prevent, and Evaluate: Disentangled Text-Driven Image
Manipulation Empowered by Pre-Trained Vision-Language Model
- Authors: Zipeng Xu, Tianwei Lin, Hao Tang, Fu Li, Dongliang He, Nicu Sebe, Radu
Timofte, Luc Van Gool and Errui Ding
- Abstract summary: We propose a novel framework, i.e., Predict, Prevent, and Evaluate (PPE), for disentangled text-driven image manipulation.
Our method achieves these goals by exploiting the power of the large-scale pre-trained vision-language model CLIP.
Extensive experiments show that the proposed PPE framework achieves much better quantitative and qualitative results than the recent StyleCLIP baseline.
- Score: 168.04947140367258
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: To achieve disentangled image manipulation, previous works depend heavily on
manual annotation. Meanwhile, the available manipulations are limited to a
pre-defined set the models were trained for. In this paper, we propose a novel
framework, i.e., Predict, Prevent, and Evaluate (PPE), for disentangled
text-driven image manipulation, which does not need manual annotation and thus
is not limited to fixed manipulations. Our method achieves these goals by
deeply exploiting the power of the large-scale pre-trained vision-language
model CLIP. Concretely, we first Predict the possibly entangled attributes
for a given text command. Then, based on the predicted attributes, we introduce
an entanglement loss to Prevent entanglements during training. Finally, we
propose a new evaluation metric to Evaluate the disentangled image
manipulation. We verify the effectiveness of our method on the challenging face
editing task. Extensive experiments show that the proposed PPE framework
achieves much better quantitative and qualitative results than the recent
StyleCLIP baseline.
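To make the Predict and Prevent steps concrete, below is a minimal, hypothetical sketch in PyTorch using OpenAI's `clip` package. The candidate attribute vocabulary, the top-k selection, and the exact loss form are illustrative assumptions for a face-editing setting, not the authors' released implementation: possibly entangled attributes are predicted by ranking CLIP text similarity against the edit command, and an entanglement-style loss penalizes shifts in image-attribute similarity between the original and the edited image.

```python
# Sketch of the Predict / Prevent ideas, assuming OpenAI's `clip` package
# (https://github.com/openai/CLIP). Attribute vocabulary and loss form are
# illustrative placeholders, not the paper's released code.
import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)  # CLIP is kept frozen

# Hypothetical attributes that a face edit could accidentally entangle with.
CANDIDATES = ["gender", "age", "skin tone", "hair color", "glasses", "makeup"]

def predict_entangled(command: str, top_k: int = 3):
    """Predict: rank candidate attributes by CLIP text similarity to the command."""
    with torch.no_grad():
        tokens = clip.tokenize([command] + CANDIDATES).to(device)
        emb = model.encode_text(tokens)
        emb = emb / emb.norm(dim=-1, keepdim=True)
        sims = (emb[0:1] @ emb[1:].T).squeeze(0)       # cosine similarities
    ranked = sims.argsort(descending=True)[:top_k]
    return [CANDIDATES[int(i)] for i in ranked]

def entanglement_loss(img_orig: torch.Tensor, img_edit: torch.Tensor, attrs):
    """Prevent: penalize changes in CLIP similarity to the predicted attributes.

    img_orig / img_edit are CLIP-preprocessed image batches of shape (N, 3, 224, 224);
    gradients flow through img_edit (i.e., through the image generator), not CLIP.
    """
    with torch.no_grad():
        txt = model.encode_text(clip.tokenize(attrs).to(device))
        txt = txt / txt.norm(dim=-1, keepdim=True)
    def sim(img):
        feat = model.encode_image(img)
        feat = feat / feat.norm(dim=-1, keepdim=True)
        return feat @ txt.T                             # (N, num_attrs)
    return (sim(img_edit) - sim(img_orig)).abs().mean()
```

In a full pipeline, such a term would be weighted and added to the main CLIP-guided manipulation objective, so that the edit follows the text command while the predicted attributes stay unchanged.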
Related papers
- Dynamic Prompt Optimizing for Text-to-Image Generation [63.775458908172176]
We introduce the Prompt Auto-Editing (PAE) method to improve text-to-image generative models.
We employ an online reinforcement learning strategy to explore the weights and injection time steps of each word, yielding dynamic fine-control prompts.
arXiv Detail & Related papers (2024-04-05T13:44:39Z)
- CLIPtone: Unsupervised Learning for Text-based Image Tone Adjustment [23.36770607997754]
We propose CLIPtone, an unsupervised learning-based approach for text-based image tone adjustment.
Our approach's efficacy is demonstrated through comprehensive experiments, including a user study.
arXiv Detail & Related papers (2024-04-01T13:57:46Z)
- Exploring Annotation-free Image Captioning with Retrieval-augmented Pseudo Sentence Generation [21.54093527562344]
We propose a new strategy where the prior knowledge from large pre-trained models (LPMs) is distilled and leveraged as supervision.
Specifically, we introduce Retrieval-augmented Pseudo Sentence Generation (RaPSG), which can efficiently retrieve highly relevant short region descriptions.
Experimental results indicate that our method outperforms SOTA captioning models across various settings.
arXiv Detail & Related papers (2023-07-27T10:16:13Z)
- Disentangled Pre-training for Image Matting [74.10407744483526]
Image matting requires high-quality pixel-level human annotations to support the training of a deep model.
We propose a self-supervised pre-training approach that can leverage virtually unlimited amounts of data to boost matting performance.
arXiv Detail & Related papers (2023-04-03T08:16:02Z)
- CLIP-PAE: Projection-Augmentation Embedding to Extract Relevant Features for a Disentangled, Interpretable, and Controllable Text-Guided Face Manipulation [4.078926358349661]
Contrastive Language-Image Pre-Training (CLIP) bridges images and text by embedding them into a joint latent space.
Due to the discrepancy between image and text embeddings in the joint space, using text embeddings as the optimization target often introduces undesired artifacts in the resulting images.
We introduce CLIP Projection-Augmentation Embedding (PAE) as an optimization target to improve the performance of text-guided image manipulation.
arXiv Detail & Related papers (2022-10-08T05:12:25Z)
- Exploring CLIP for Assessing the Look and Feel of Images [87.97623543523858]
We introduce Contrastive Language-Image Pre-training (CLIP) models for assessing both the quality perception (look) and abstract perception (feel) of images in a zero-shot manner.
Our results show that CLIP captures meaningful priors that generalize well to different perceptual assessments; a minimal sketch of this zero-shot scoring idea is given after the list.
arXiv Detail & Related papers (2022-07-25T17:58:16Z)
- DenseCLIP: Language-Guided Dense Prediction with Context-Aware Prompting [91.56988987393483]
We present a new framework for dense prediction by implicitly and explicitly leveraging the pre-trained knowledge from CLIP.
Specifically, we convert the original image-text matching problem in CLIP to a pixel-text matching problem and use the pixel-text score maps to guide the learning of dense prediction models.
Our method is model-agnostic, which can be applied to arbitrary dense prediction systems and various pre-trained visual backbones.
arXiv Detail & Related papers (2021-12-02T18:59:32Z)
- Generative Model-Based Loss to the Rescue: A Method to Overcome Annotation Errors for Depth-Based Hand Pose Estimation [76.12736932610163]
We propose to use a model-based generative loss for training hand pose estimators on depth images based on a volumetric hand model.
This additional loss allows training of a hand pose estimator that accurately infers the entire set of 21 hand keypoints while only using supervision for 6 easy-to-annotate keypoints (fingertips and wrist).
arXiv Detail & Related papers (2020-07-06T21:24:25Z)
- Appearance Shock Grammar for Fast Medial Axis Extraction from Real Images [10.943417197085882]
We combine ideas from shock graph theory with more recent appearance-based methods for medial axis extraction from complex natural scenes.
Our experiments on the BMAX500 and SK-LARGE datasets demonstrate the effectiveness of our approach.
arXiv Detail & Related papers (2020-04-06T13:57:27Z)
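Several of the entries above reduce to zero-shot scoring in CLIP's joint image-text space. Below is a minimal sketch of the antonym-prompt idea behind the look-and-feel assessment entry, again using OpenAI's `clip` package; the prompt pair, backbone, and logit scale are illustrative assumptions rather than that paper's exact configuration.

```python
# Sketch of zero-shot perceptual scoring with CLIP; prompts are illustrative.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def quality_score(image_path: str, prompts=("Good photo.", "Bad photo.")) -> float:
    """Return a 0..1 score: the softmax weight assigned to the positive prompt."""
    image = preprocess(Image.open(image_path)).unsqueeze(0).to(device)
    tokens = clip.tokenize(list(prompts)).to(device)
    with torch.no_grad():
        img = model.encode_image(image)
        txt = model.encode_text(tokens)
        img = img / img.norm(dim=-1, keepdim=True)
        txt = txt / txt.norm(dim=-1, keepdim=True)
        logits = 100.0 * img @ txt.T                 # scaled cosine similarities
        probs = logits.softmax(dim=-1).squeeze(0)
    return probs[0].item()                           # weight on "Good photo."
```

Swapping the prompt pair (e.g., "Bright photo." / "Dark photo.") lets the same routine probe other perceptual attributes without any task-specific training.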
This list is automatically generated from the titles and abstracts of the papers on this site.