GALIP: Generative Adversarial CLIPs for Text-to-Image Synthesis
- URL: http://arxiv.org/abs/2301.12959v1
- Date: Mon, 30 Jan 2023 14:58:23 GMT
- Title: GALIP: Generative Adversarial CLIPs for Text-to-Image Synthesis
- Authors: Ming Tao, Bing-Kun Bao, Hao Tang, Changsheng Xu
- Abstract summary: We propose Generative Adversarial CLIPs to enable high-quality, efficient, fast, and controllable text-to-image synthesis.
Our model achieves 120 times faster synthesis speed and inherits the smooth latent space from GAN.
- Score: 74.71986888051381
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Synthesizing high-fidelity complex images from text is challenging. Based on
large pretraining, the autoregressive and diffusion models can synthesize
photo-realistic images. Although these large models have shown notable
progress, there remain three flaws. 1) These models require tremendous training
data and parameters to achieve good performance. 2) The multi-step generation
design slows the image synthesis process heavily. 3) The synthesized visual
features are difficult to control and require delicately designed prompts. To
enable high-quality, efficient, fast, and controllable text-to-image synthesis,
we propose Generative Adversarial CLIPs, namely GALIP. GALIP leverages the
powerful pretrained CLIP model both in the discriminator and generator.
Specifically, we propose a CLIP-based discriminator. The complex scene
understanding ability of CLIP enables the discriminator to accurately assess
the image quality. Furthermore, we propose a CLIP-empowered generator that
induces the visual concepts from CLIP through bridge features and prompts. The
CLIP-integrated generator and discriminator boost training efficiency, and as a
result, our model only requires about 3% training data and 6% learnable
parameters, achieving comparable results to large pretrained autoregressive and
diffusion models. Moreover, our model achieves 120 times faster synthesis speed
and inherits the smooth latent space from GAN. The extensive experimental
results demonstrate the excellent performance of our GALIP. Code is available
at https://github.com/tobran/GALIP.
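For readers who want a concrete picture of the architecture described above, the PyTorch sketch below shows the rough shape of a CLIP-empowered generator and a CLIP-based discriminator. It is a minimal illustration under stated assumptions, not the authors' implementation: the module names, layer sizes, and the simplified "bridge" projection are inventions of this sketch, and proper CLIP image preprocessing is omitted; see https://github.com/tobran/GALIP for the real code.

```python
# Minimal sketch of the GALIP idea: a frozen CLIP image encoder shared by a
# CLIP-empowered generator and a CLIP-based discriminator. Layer sizes and
# module names are illustrative assumptions, not the paper's architecture.
import torch
import torch.nn as nn
import clip  # OpenAI CLIP: pip install git+https://github.com/openai/CLIP.git


class CLIPEmpoweredGenerator(nn.Module):
    """Maps (noise, sentence embedding) to a 224x224 image."""

    def __init__(self, noise_dim=100, text_dim=512):
        super().__init__()
        # "Bridge" here is just a learned projection of text + noise into a
        # 7x7 feature map; the paper's bridge-feature/prompt mechanism that
        # taps the frozen CLIP ViT inside the generator is more involved.
        self.bridge = nn.Sequential(nn.Linear(noise_dim + text_dim, 256 * 7 * 7), nn.GELU())
        blocks, in_ch = [], 256
        for _ in range(5):  # 7 -> 14 -> 28 -> 56 -> 112 -> 224
            blocks += [nn.Upsample(scale_factor=2, mode="nearest"),
                       nn.Conv2d(in_ch, 64, 3, padding=1), nn.GELU()]
            in_ch = 64
        blocks += [nn.Conv2d(64, 3, 3, padding=1), nn.Tanh()]
        self.decode = nn.Sequential(*blocks)

    def forward(self, noise, text_emb):
        h = self.bridge(torch.cat([noise, text_emb], dim=1)).view(-1, 256, 7, 7)
        return self.decode(h)


class CLIPBasedDiscriminator(nn.Module):
    """Scores realism / text match on top of frozen CLIP image features."""

    def __init__(self, clip_model, text_dim=512):
        super().__init__()
        self.clip_model = clip_model
        for p in self.clip_model.parameters():  # frozen: reuse CLIP's scene understanding
            p.requires_grad_(False)
        self.head = nn.Sequential(nn.Linear(512 + text_dim, 256), nn.GELU(), nn.Linear(256, 1))

    def forward(self, images, text_emb):
        img_feat = self.clip_model.encode_image(images).float()  # gradients still flow to G
        return self.head(torch.cat([img_feat, text_emb], dim=1))


clip_model, _ = clip.load("ViT-B/32", device="cpu")
text_emb = clip_model.encode_text(clip.tokenize(["a photo of a red bird on a branch"])).float().detach()
G, D = CLIPEmpoweredGenerator(), CLIPBasedDiscriminator(clip_model)
fake = G(torch.randn(1, 100), text_emb)  # (1, 3, 224, 224), values in [-1, 1]
logit = D(fake, text_emb)                # adversarial logit for this image-text pair
print(fake.shape, logit.shape)
```

Even in this toy form, the point made in the abstract carries through: the CLIP encoder is frozen in both networks, so only the small decoder and discriminator head are learnable, consistent with the claim that the CLIP-integrated generator and discriminator boost training efficiency.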
Related papers
- TripletCLIP: Improving Compositional Reasoning of CLIP via Synthetic Vision-Language Negatives [65.82577305915643]
Contrastive Language-Image Pretraining (CLIP) models maximize the mutual information between text and visual modalities to learn representations.
We show that generating "hard" negative captions via in-context learning and corresponding negative images with text-to-image generators offers a solution.
We demonstrate that our method, named TripletCLIP, enhances the compositional capabilities of CLIP, resulting in an absolute improvement of over 9% on the SugarCrepe benchmark.
arXiv Detail & Related papers (2024-11-04T19:24:59Z)
- Leveraging Cross-Modal Neighbor Representation for Improved CLIP Classification [54.96876797812238]
We present a novel CrOss-moDal nEighbor Representation (CODER) based on the distance structure between images and their neighbor texts.
The key to constructing a high-quality CODER lies in creating a vast amount of high-quality and diverse texts to match with images.
Experiment results across various datasets and models confirm CODER's effectiveness.
arXiv Detail & Related papers (2024-04-27T02:04:36Z)
- CounterCurate: Enhancing Physical and Semantic Visio-Linguistic Compositional Reasoning via Counterfactual Examples [34.71588837946776]
We propose CounterCurate, a framework to improve visio-linguistic compositional reasoning.
In particular, we identify critical under-explored problems, chief among them the neglect of physically grounded reasoning.
We first spotlight the near-chance performance of multimodal models like CLIP and LLaVA in physically grounded compositional reasoning.
We then apply simple data augmentation using the grounded image generation model GLIGEN to generate fine-tuning data, resulting in significant performance improvements.
arXiv Detail & Related papers (2024-02-20T18:59:55Z)
- Distilling Knowledge from Text-to-Image Generative Models Improves Visio-Linguistic Reasoning in CLIP [57.53087077735303]
We introduce SDS-CLIP, a lightweight and sample-efficient distillation method to enhance CLIP's compositional visio-linguistic reasoning.
Our approach fine-tunes CLIP using a distillation objective borrowed from large text-to-image generative models like Stable-Diffusion.
On the challenging Winoground benchmark, SDS-CLIP improves the visio-linguistic performance of various CLIP models by up to 7%, while on the ARO dataset, it boosts performance by up to 3%.
arXiv Detail & Related papers (2023-07-18T13:10:11Z)
- Image Captions are Natural Prompts for Text-to-Image Models [70.30915140413383]
We analyze the relationship between the training effect of synthetic data and the synthetic data distribution induced by prompts.
We propose a simple yet effective method that prompts text-to-image generative models to synthesize more informative and diverse training data.
Our method significantly improves the performance of models trained on synthetic training data.
arXiv Detail & Related papers (2023-07-17T14:38:11Z)
- ZegCLIP: Towards Adapting CLIP for Zero-shot Semantic Segmentation [35.60888272729273]
Recently, CLIP has been applied to pixel-level zero-shot learning tasks via a two-stage scheme.
While effective, such a scheme requires two image encoders, one for proposal generation and one for CLIP, leading to a complicated pipeline and high computational cost.
We propose a simpler and more efficient one-stage solution that directly extends CLIP's zero-shot prediction capability from the image level to the pixel level.
arXiv Detail & Related papers (2022-12-07T12:05:00Z)
- CALIP: Zero-Shot Enhancement of CLIP with Parameter-free Attention [31.84299688413136]
Contrastive Language-Image Pre-training has been shown to learn visual representations with great transferability.
Existing works propose additional learnable modules upon CLIP and fine-tune them on few-shot training sets.
We introduce a free-lunch enhancement method, CALIP, to boost CLIP's zero-shot performance via a parameter-free Attention module.
arXiv Detail & Related papers (2022-09-28T15:22:11Z)
- Toward Fast, Flexible, and Robust Low-Light Image Enhancement [87.27326390675155]
We develop a new Self-Calibrated Illumination (SCI) learning framework for fast, flexible, and robust brightening of images in real-world low-light scenarios.
Considering the computational burden of the cascaded pattern, we construct a self-calibrated module that realizes convergence between the results of each stage.
We make comprehensive explorations of SCI's inherent properties, including operation-insensitive adaptability and model-irrelevant generality.
arXiv Detail & Related papers (2022-04-21T14:40:32Z)
- FuseDream: Training-Free Text-to-Image Generation with Improved CLIP+GAN Space Optimization [37.318948462348054]
We approach text-to-image generation by combining the power of the pretrained CLIP representation with an off-the-shelf image generator (GAN).
When prompted by different input text, FuseDream can generate high-quality images with varying objects, backgrounds, and artistic styles, and even novel counterfactual concepts that do not appear in the training data we use (see the sketch after this list).
arXiv Detail & Related papers (2021-12-02T19:27:27Z)
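The last entry above (FuseDream) describes a training-free recipe: optimize in the latent space of an off-the-shelf GAN so that the generated image maximizes its CLIP score against the input text. The sketch below illustrates that loop under simplifying assumptions of this write-up, not the authors' released code: the generator is a stand-in for any pretrained GAN mapping a latent vector to an RGB image in [-1, 1], CLIP preprocessing is reduced to a resize, and FuseDream's improved initialization and augmentation strategies are omitted.

```python
# Sketch of CLIP+GAN latent optimization (training-free text-to-image).
# `generator` is an assumed stand-in for any pretrained GAN forward pass.
import torch
import torch.nn.functional as F
import clip  # OpenAI CLIP: pip install git+https://github.com/openai/CLIP.git

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, _ = clip.load("ViT-B/32", device=device)
text_feat = clip_model.encode_text(clip.tokenize(["a blue bird made of glass"]).to(device))
text_feat = F.normalize(text_feat.float(), dim=-1).detach()


def clip_score(images, text_feat):
    """Cosine similarity between CLIP image features and the text feature."""
    # CLIP ViT-B/32 expects 224x224 inputs; full CLIP normalization is
    # omitted here for brevity.
    images = F.interpolate(images, size=224, mode="bilinear", align_corners=False)
    img_feat = F.normalize(clip_model.encode_image(images).float(), dim=-1)
    return (img_feat * text_feat).sum(dim=-1)


def optimize_latent(generator, latent_dim=128, steps=200, lr=5e-2):
    """Gradient-ascend a GAN latent so the generated image matches the text."""
    z = torch.randn(1, latent_dim, device=device, requires_grad=True)
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        image = generator(z)                    # assumed output: (1, 3, H, W) in [-1, 1]
        loss = -clip_score((image + 1) / 2, text_feat).mean()
        loss.backward()
        opt.step()
    return z.detach()

# Hypothetical usage: z_opt = optimize_latent(my_gan); final_image = my_gan(z_opt)
```

Because only the latent vector is updated, no network weights are trained, which is what makes this kind of CLIP+GAN approach training-free.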
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences.