Text-Free Learning of a Natural Language Interface for Pretrained Face Generators
- URL: http://arxiv.org/abs/2209.03953v1
- Date: Thu, 8 Sep 2022 17:56:50 GMT
- Title: Text-Free Learning of a Natural Language Interface for Pretrained Face Generators
- Authors: Xiaodan Du, Raymond A. Yeh, Nicholas Kolkin, Eli Shechtman, Greg Shakhnarovich
- Abstract summary: We propose Fast text2StyleGAN, a natural language interface that adapts pre-trained GANs for text-guided human face synthesis.
Our model does not require re-training or fine-tuning of the GANs or CLIP when encountering new text prompts.
- Score: 39.60881623602501
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We propose Fast text2StyleGAN, a natural language interface that adapts pre-trained GANs for text-guided human face synthesis. Leveraging recent advances in Contrastive Language-Image Pre-training (CLIP), no text data is required during training. Fast text2StyleGAN is formulated as a conditional variational autoencoder (CVAE) that provides extra control over and diversity in the generated images at test time. Our model does not require re-training or fine-tuning of the GAN or CLIP when encountering new text prompts, and unlike prior work it does not rely on optimization at test time, making it orders of magnitude faster. Empirically, on the FFHQ dataset, our method generates images from natural language descriptions of varying levels of detail faster and more accurately than prior approaches.
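The mechanism the abstract describes, a CVAE trained only on images and their CLIP image embeddings and queried at test time with CLIP text embeddings, can be sketched roughly as follows. The dimensions, layer sizes, and loss weight below are illustrative assumptions, not the paper's actual architecture:

```python
# Minimal sketch of the text-free idea behind Fast text2StyleGAN.
# All sizes and the KL weight are assumptions for illustration.
import torch
import torch.nn as nn

CLIP_DIM, W_DIM, Z_DIM = 512, 512, 64   # assumed dimensions

class CVAE(nn.Module):
    """Conditional VAE: encode a StyleGAN latent w conditioned on a CLIP
    embedding c; the decoder maps (z, c) back to a latent w_hat."""
    def __init__(self):
        super().__init__()
        self.enc = nn.Sequential(
            nn.Linear(W_DIM + CLIP_DIM, 512), nn.ReLU(),
            nn.Linear(512, 2 * Z_DIM),        # -> (mu, log_var)
        )
        self.dec = nn.Sequential(
            nn.Linear(Z_DIM + CLIP_DIM, 512), nn.ReLU(),
            nn.Linear(512, W_DIM),
        )

    def forward(self, w, c):
        mu, log_var = self.enc(torch.cat([w, c], dim=-1)).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * (0.5 * log_var).exp()  # reparameterize
        w_hat = self.dec(torch.cat([z, c], dim=-1))
        kl = -0.5 * (1 + log_var - mu.pow(2) - log_var.exp()).sum(-1).mean()
        return w_hat, kl

# Training (text-free): c is the CLIP *image* embedding of the image G(w),
# so no captions are needed. Test time: c is a CLIP *text* embedding, and
# sampling z ~ N(0, I) yields diverse faces matching the prompt.
model = CVAE()
w = torch.randn(8, W_DIM)       # stand-in for StyleGAN w latents
c = torch.randn(8, CLIP_DIM)    # stand-in for CLIP image embeddings
w_hat, kl = model(w, c)
loss = (w_hat - w).pow(2).mean() + 1e-3 * kl  # reconstruction + KL (weight assumed)
```

Because CLIP maps images and text into a shared embedding space, a decoder trained only on image embeddings can accept text embeddings at test time, which is what removes the need for captions during training and for per-prompt optimization afterwards.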
Related papers
- ASPIRE: Language-Guided Data Augmentation for Improving Robustness Against Spurious Correlations [43.323791505213634]
ASPIRE (Language-guided Data Augmentation for SPurIous correlation REmoval) is a solution for supplementing the training dataset with images without spurious features.
It can generate non-spurious images without requiring any group labeling or existing non-spurious images in the training set.
It improves the worst-group classification accuracy of prior methods by 1%-38%.
arXiv Detail & Related papers (2023-08-19T20:18:15Z)
- ALIP: Adaptive Language-Image Pre-training with Synthetic Caption [78.93535202851278]
Contrastive Language-Image Pre-training (CLIP) has significantly boosted the performance of various vision-language tasks.
The presence of intrinsic noise and unmatched image-text pairs in web data can potentially affect the performance of representation learning.
We propose Adaptive Language-Image Pre-training (ALIP), a bi-path model that integrates supervision from both the raw text and a synthetic caption; a rough sketch follows this entry.
arXiv Detail & Related papers (2023-08-16T15:19:52Z)
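A rough sketch of the bi-path supervision in the ALIP entry above: a standard CLIP-style contrastive loss is computed against both the raw web text and the synthetic caption, and the two are combined. The fixed weight alpha is an assumption for illustration; the actual method weights the paths adaptively:

```python
# Bi-path contrastive supervision: one InfoNCE loss per text source.
import torch
import torch.nn.functional as F

def clip_loss(img_emb, txt_emb, temperature=0.07):
    """Standard symmetric InfoNCE over a batch of matched pairs."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature
    targets = torch.arange(len(logits))
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

def bipath_loss(img_emb, raw_txt_emb, syn_txt_emb, alpha=0.5):
    # alpha is a fixed stand-in for ALIP's adaptive weighting.
    return alpha * clip_loss(img_emb, raw_txt_emb) + \
           (1 - alpha) * clip_loss(img_emb, syn_txt_emb)
```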
- Being Comes from Not-being: Open-vocabulary Text-to-Motion Generation with Wordless Training [178.09150600453205]
In this paper, we investigate offline open-vocabulary text-to-motion generation in a zero-shot learning manner.
Inspired by the prompt learning in NLP, we pretrain a motion generator that learns to reconstruct the full motion from the masked motion.
Our method reformulates the input text into a masked motion that serves as the prompt for the motion generator to "reconstruct" the motion; a loose sketch follows this entry.
arXiv Detail & Related papers (2022-10-28T06:20:55Z)
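A loose sketch of the wordless pretraining described above, under the assumption that it boils down to masked-sequence reconstruction with a transformer; the sequence length, pose dimensionality, and architecture are illustrative, not the paper's:

```python
# Masked-motion reconstruction pretraining: the generator learns to fill
# in masked frames, so at test time a text-derived prompt can play the
# role of the partially filled input.
import torch
import torch.nn as nn

T, D = 60, 72                            # frames, pose dims (assumed)
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=D, nhead=8, batch_first=True),
    num_layers=4,
)

motion = torch.randn(16, T, D)           # a batch of motion sequences
mask = torch.rand(16, T, 1) < 0.5        # randomly mask half the frames
masked = motion.masked_fill(mask, 0.0)

recon = encoder(masked)                  # predict the full motion
loss = (recon - motion).pow(2).mean()    # reconstruction objective
```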
- Vision-Language Pre-Training for Boosting Scene Text Detectors [57.08046351495244]
We specifically adapt vision-language joint learning for scene text detection.
We propose to learn contextualized, joint representations through vision-language pre-training.
The pre-trained model is able to produce more informative representations with richer semantics.
arXiv Detail & Related papers (2022-04-29T03:53:54Z)
- DenseCLIP: Language-Guided Dense Prediction with Context-Aware Prompting [91.56988987393483]
We present a new framework for dense prediction by implicitly and explicitly leveraging the pre-trained knowledge from CLIP.
Specifically, we convert the original image-text matching problem in CLIP to a pixel-text matching problem and use the pixel-text score maps to guide the learning of dense prediction models.
Our method is model-agnostic and can be applied to arbitrary dense prediction systems and various pre-trained visual backbones; a minimal sketch of the pixel-text matching follows this entry.
arXiv Detail & Related papers (2021-12-02T18:59:32Z)
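In its simplest form, the pixel-text matching in the DenseCLIP entry above amounts to dotting per-pixel visual features with per-class text embeddings to obtain score maps; a minimal sketch with assumed feature shapes:

```python
# Pixel-text score maps: each pixel's visual feature is compared against
# every class text embedding. Shapes are illustrative assumptions.
import torch
import torch.nn.functional as F

B, C, H, W, K = 2, 512, 32, 32, 20      # batch, channels, spatial, classes
pix = F.normalize(torch.randn(B, C, H, W), dim=1)   # per-pixel visual features
txt = F.normalize(torch.randn(K, C), dim=-1)        # one embedding per class

# (B, C, H, W) x (K, C) -> (B, K, H, W) pixel-text score maps
score_maps = torch.einsum('bchw,kc->bkhw', pix, txt)
# The maps can be supervised with segmentation labels or concatenated into
# the features of an arbitrary dense-prediction model, as the entry notes.
```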
- LAFITE: Towards Language-Free Training for Text-to-Image Generation [83.2935513540494]
We present the first work to train text-to-image generation models without any text data.
Our method leverages the well-aligned multi-modal semantic space of the powerful pre-trained CLIP model; a sketch of this trick follows this entry.
We obtain state-of-the-art results in the standard text-to-image generation tasks.
arXiv Detail & Related papers (2021-11-27T01:54:45Z)
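A sketch of the language-free trick behind this line of work: because CLIP aligns image and text embeddings, a noise-perturbed CLIP image embedding can stand in for the missing text embedding during training. The perturbation form and scale below are assumptions for illustration:

```python
# Pseudo text features from CLIP image features, for language-free training.
import torch
import torch.nn.functional as F

def pseudo_text_feature(img_emb, noise_scale=0.1):
    """Perturb a CLIP image embedding so it can substitute for a text
    embedding during training (noise_scale is an assumed hyperparameter)."""
    e = F.normalize(img_emb, dim=-1)
    noise = torch.randn_like(e)
    e = e + noise_scale * e.norm(dim=-1, keepdim=True) * F.normalize(noise, dim=-1)
    return F.normalize(e, dim=-1)   # fed to the generator as the "text" condition
```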
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it contains and is not responsible for any consequences of its use.