CLIP2GAN: Towards Bridging Text with the Latent Space of GANs
- URL: http://arxiv.org/abs/2211.15045v1
- Date: Mon, 28 Nov 2022 04:07:17 GMT
- Title: CLIP2GAN: Towards Bridging Text with the Latent Space of GANs
- Authors: Yixuan Wang, Wengang Zhou, Jianmin Bao, Weilun Wang, Li Li, Houqiang
Li
- Abstract summary: We propose a novel framework, i.e., CLIP2GAN, by leveraging the CLIP model and StyleGAN.
The key idea of our CLIP2GAN is to bridge the output feature embedding space of CLIP and the input latent space of StyleGAN.
- Score: 128.47600914674985
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this work, we are dedicated to text-guided image generation and propose a
novel framework, i.e., CLIP2GAN, which leverages the CLIP model and StyleGAN. The
key idea of CLIP2GAN is to bridge the output feature embedding space of CLIP and
the input latent space of StyleGAN, which is realized by introducing a mapping
network. In the training stage, we encode an image with CLIP and map the output
feature to a latent code, which is then used to reconstruct the image. In this
way, the mapping network is optimized in a self-supervised manner. In the
inference stage, since CLIP can embed both images and text into a shared feature
embedding space, we replace the CLIP image encoder in the training architecture
with the CLIP text encoder, while keeping the subsequent mapping network and the
StyleGAN model. As a result, we can flexibly input a text description to generate
an image. Moreover, by simply adding the mapped text feature of an attribute to a
mapped CLIP image feature, we can effectively edit that attribute in the image.
Extensive experiments demonstrate the superior performance of the proposed
CLIP2GAN compared to previous methods.
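The pipeline described in the abstract (self-supervised training of a mapping network on CLIP image features, swapping in the CLIP text encoder at inference, and attribute editing by adding mapped features) can be summarized in a minimal sketch. The snippet below is illustrative only: the pretrained CLIP encoders and the StyleGAN generator are replaced with random stand-in modules so it runs self-contained, and the dimensions, losses, and `mapper` architecture are assumptions rather than the paper's settings.

```python
# Minimal sketch of the pipeline described in the abstract, assuming a PyTorch setup.
# clip_image_encoder, clip_text_encoder, and stylegan_generator are random stand-ins
# for the frozen pretrained models so the snippet runs on its own; the dimensions,
# mapper architecture, and reconstruction loss are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

CLIP_DIM, W_DIM = 512, 512  # assumed CLIP embedding size and StyleGAN latent size

# Frozen stand-ins for the pretrained models (replace with real CLIP / StyleGAN).
clip_image_encoder = nn.Linear(3 * 224 * 224, CLIP_DIM).requires_grad_(False)
clip_text_encoder = nn.Embedding(10_000, CLIP_DIM).requires_grad_(False)
stylegan_generator = nn.Linear(W_DIM, 3 * 256 * 256).requires_grad_(False)

# The only trainable part: the mapping network bridging CLIP space -> StyleGAN latent space.
mapper = nn.Sequential(nn.Linear(CLIP_DIM, 512), nn.ReLU(), nn.Linear(512, W_DIM))
optimizer = torch.optim.Adam(mapper.parameters(), lr=1e-4)

def train_step(images):
    """Self-supervised training: CLIP image feature -> latent code -> reconstruction."""
    with torch.no_grad():
        clip_feat = clip_image_encoder(images.flatten(1))
    w = mapper(F.normalize(clip_feat, dim=-1))
    recon = stylegan_generator(w)
    # Plain pixel loss here; the paper would also use perceptual/identity terms.
    loss = F.mse_loss(recon, F.interpolate(images, size=256).flatten(1))
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()

def generate_from_text(token_ids):
    """Inference: swap in the CLIP text encoder; mapper and generator stay unchanged."""
    with torch.no_grad():
        text_feat = clip_text_encoder(token_ids).mean(dim=1)  # stub "text embedding"
        return stylegan_generator(mapper(F.normalize(text_feat, dim=-1)))

def edit_with_attribute(image, attr_token_ids, strength=1.0):
    """Editing: add the mapped text feature of an attribute to the mapped image feature."""
    with torch.no_grad():
        w_img = mapper(F.normalize(clip_image_encoder(image.flatten(1)), dim=-1))
        w_attr = mapper(F.normalize(clip_text_encoder(attr_token_ids).mean(dim=1), dim=-1))
        return stylegan_generator(w_img + strength * w_attr)

# Toy usage with random data, just to show the shapes involved.
loss = train_step(torch.rand(2, 3, 224, 224))
image_from_text = generate_from_text(torch.randint(0, 10_000, (1, 8)))
```

The point of this structure is that only `mapper` is optimized; because CLIP embeds images and text in a shared space, the text encoder can be swapped in at inference without retraining anything.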
Related papers
- Decoder Pre-Training with only Text for Scene Text Recognition [54.93037783663204]
Scene text recognition (STR) pre-training methods have achieved remarkable progress, primarily relying on synthetic datasets.
We introduce a novel method named Decoder Pre-training with only text for STR (DPTR).
DPTR treats text embeddings produced by the CLIP text encoder as pseudo visual embeddings and uses them to pre-train the decoder.
arXiv Detail & Related papers (2024-08-11T06:36:42Z)
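As a rough illustration of the DPTR idea above (CLIP text embeddings standing in for visual features when pre-training a recognition decoder), here is a minimal, self-contained sketch. The tiny vocabulary, the zero decoder queries, and the stub `clip_text_encoder` are placeholders; the real method uses the frozen CLIP text tower and the paper's own decoder.

```python
# Hedged sketch: pre-train a text-recognition decoder using CLIP-style text embeddings
# as pseudo visual features, so no images are needed. All modules are toy placeholders.
import torch
import torch.nn as nn

VOCAB, EMB = 100, 512

clip_text_encoder = nn.Embedding(VOCAB, EMB).requires_grad_(False)  # stand-in for CLIP's text tower

# A minimal transformer decoder that predicts characters from the "visual" memory.
decoder = nn.TransformerDecoder(
    nn.TransformerDecoderLayer(d_model=EMB, nhead=8, batch_first=True), num_layers=2)
classifier = nn.Linear(EMB, VOCAB)
optimizer = torch.optim.Adam(list(decoder.parameters()) + list(classifier.parameters()), lr=1e-4)

def pretrain_step(char_ids):
    """char_ids: (batch, seq) integer labels of the text. Their CLIP text embeddings
    are treated as pseudo visual embeddings, i.e. the decoder's cross-attention memory."""
    with torch.no_grad():
        pseudo_visual = clip_text_encoder(char_ids)           # (batch, seq, EMB)
    queries = torch.zeros_like(pseudo_visual)                 # learned/positional queries in practice
    logits = classifier(decoder(queries, pseudo_visual))      # (batch, seq, VOCAB)
    loss = nn.functional.cross_entropy(logits.flatten(0, 1), char_ids.flatten())
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()

loss = pretrain_step(torch.randint(0, VOCAB, (4, 12)))  # toy batch of text-only labels
```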
- Selective Vision-Language Subspace Projection for Few-shot CLIP [55.361337202198925]
We introduce a method called Selective Vision-Language Subspace Projection (SSP).
SSP incorporates local image features and utilizes them as a bridge to enhance the alignment between image-text pairs.
Our approach entails only training-free matrix calculations and can be seamlessly integrated into advanced CLIP-based few-shot learning frameworks.
arXiv Detail & Related papers (2024-07-24T03:45:35Z)
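The SSP entry above only mentions training-free matrix calculations that project features onto a subspace built from local image features. The sketch below shows a generic orthogonal subspace projection in that spirit; the projector formula is standard linear algebra, while the choice of basis features and the cosine-similarity matching are assumptions, not the paper's exact procedure.

```python
# Generic, training-free subspace projection used as an illustration of the SSP summary.
import torch
import torch.nn.functional as F

def project_onto_subspace(features, basis):
    """Project `features` (n, d) onto the row space of `basis` (k, d) using the
    closed-form projector P = B^T (B B^T)^+ B -- a training-free matrix calculation."""
    projector = basis.T @ torch.linalg.pinv(basis @ basis.T) @ basis   # (d, d)
    return features @ projector

# Toy few-shot setup: 5 local image features span the subspace; class text features and a
# query image feature are projected into it and compared by cosine similarity.
d = 512
local_feats = torch.randn(5, d)                                  # assumed support-image local features
text_feats = F.normalize(torch.randn(3, d), dim=-1)              # 3 class prompts
query_feat = F.normalize(torch.randn(1, d), dim=-1)

proj_text = F.normalize(project_onto_subspace(text_feats, local_feats), dim=-1)
proj_query = F.normalize(project_onto_subspace(query_feat, local_feats), dim=-1)
scores = proj_query @ proj_text.T                                # higher = more likely class
print(scores)
```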
- Leveraging Cross-Modal Neighbor Representation for Improved CLIP Classification [54.96876797812238]
We present a novel CrOss-moDal nEighbor Representation (CODER) based on the distance structure between images and their neighbor texts.
The key to constructing a high-quality CODER lies in creating a vast amount of high-quality and diverse texts to match with images.
Experiment results across various datasets and models confirm CODER's effectiveness.
arXiv Detail & Related papers (2024-04-27T02:04:36Z)
- Symmetrical Linguistic Feature Distillation with CLIP for Scene Text Recognition [77.93678598476149]
We establish a novel Symmetrical Linguistic Feature Distillation framework (named CLIP-OCR).
By cascading the CLIP image encoder with the reversed CLIP text encoder, a symmetrical structure is built with an image-to-text feature flow.
Extensive experiments demonstrate the effectiveness of CLIP-OCR with 93.8% average accuracy on six popular STR benchmarks.
arXiv Detail & Related papers (2023-10-08T04:00:20Z)
- CgT-GAN: CLIP-guided Text GAN for Image Captioning [48.276753091051035]
We propose CLIP-guided text GAN (CgT-GAN) to enable the model to "see" the real visual modality.
We use adversarial training to teach CgT-GAN to mimic the phrases of an external text corpus.
CgT-GAN outperforms state-of-the-art methods significantly across all metrics.
arXiv Detail & Related papers (2023-08-23T10:25:37Z)
- Robust Text-driven Image Editing Method that Adaptively Explores Directions in Latent Spaces of StyleGAN and CLIP [10.187432367590201]
A pioneering work in text-driven image editing, StyleCLIP, finds an edit direction in the CLIP space and then edits the image by mapping the direction to the StyleGAN space.
However, it is difficult to tune the additional inputs, beyond the original image and the text instruction, that such editing requires.
We propose a method that constructs the edit direction adaptively in the StyleGAN and CLIP spaces using an SVM.
arXiv Detail & Related papers (2023-04-03T13:30:48Z)
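One way to read the entry above is that a linear SVM separating CLIP embeddings of images with and without a target attribute yields an edit direction (the normal of its decision boundary). The sketch below illustrates only that step, with synthetic features; the adaptive construction in the StyleGAN space and the actual CLIP feature extraction are omitted.

```python
# Illustrative only: fit a linear SVM on (placeholder) CLIP embeddings and use the
# normalized decision-boundary normal as an edit direction.
import numpy as np
from sklearn import svm

d = 512
rng = np.random.default_rng(0)
# Placeholder CLIP embeddings: 100 images with the attribute, 100 without.
feats_with = rng.normal(0.1, 1.0, size=(100, d))
feats_without = rng.normal(-0.1, 1.0, size=(100, d))

X = np.concatenate([feats_with, feats_without])
y = np.concatenate([np.ones(100), np.zeros(100)])

clf = svm.LinearSVC(C=1.0).fit(X, y)                        # linear SVM in CLIP space
direction = clf.coef_[0] / np.linalg.norm(clf.coef_[0])     # unit edit direction

# A source embedding could then be moved along the direction before mapping to StyleGAN;
# the step size here is chosen by hand purely for illustration.
edited = feats_without[0] + 2.0 * direction
```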
- Open Vocabulary Semantic Segmentation with Patch Aligned Contrastive Learning [82.70453633641466]
We introduce Patch Aligned Contrastive Learning (PACL), a modified compatibility function for CLIP's contrastive loss.
We show that PACL is also applicable to image-level predictions and, when used with a CLIP backbone, provides a general improvement in zero-shot classification accuracy.
arXiv Detail & Related papers (2022-12-09T17:23:00Z)
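To make the phrase "modified compatibility function" above concrete, here is a hedged sketch of a patch-aligned compatibility score: patch embeddings are weighted by their similarity to the text, and the weighted pool is compared with the text embedding, producing logits for a CLIP-style contrastive loss. Shapes, the temperature, and the exact weighting scheme are assumptions, not the paper's definition.

```python
# Sketch of a patch-aligned compatibility function for a CLIP-style contrastive loss.
import torch
import torch.nn.functional as F

def patch_aligned_compatibility(patch_emb, text_emb):
    """patch_emb: (B, P, D) image patch embeddings; text_emb: (B, D) text embeddings.
    Returns a (B, B) compatibility matrix (rows: texts, columns: images)."""
    patch_emb = F.normalize(patch_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    sim = torch.einsum("td,bpd->tbp", text_emb, patch_emb)       # text-to-patch similarities
    weights = sim.softmax(dim=-1)                                # text-conditioned patch weighting
    pooled = torch.einsum("tbp,bpd->tbd", weights, patch_emb)    # weighted patch pooling
    return torch.einsum("tbd,td->tb", pooled, text_emb)          # compatibility scores

# Toy contrastive step: matching image-text pairs lie on the diagonal.
B, P, D = 4, 49, 512
logits = patch_aligned_compatibility(torch.randn(B, P, D), torch.randn(B, D)) / 0.07
loss = F.cross_entropy(logits, torch.arange(B))
```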
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information (including all content) and is not responsible for any consequences of its use.