ECLIPSE: A Resource-Efficient Text-to-Image Prior for Image Generations
- URL: http://arxiv.org/abs/2312.04655v1
- Date: Thu, 7 Dec 2023 19:32:39 GMT
- Title: ECLIPSE: A Resource-Efficient Text-to-Image Prior for Image Generations
- Authors: Maitreya Patel and Changhoon Kim and Sheng Cheng and Chitta Baral and
Yezhou Yang
- Abstract summary: Text-to-image (T2I) diffusion models, notably the unCLIP models, achieve state-of-the-art (SOTA) performance on various compositional T2I benchmarks.
We introduce ECLIPSE, a novel contrastive learning method that is both parameter and data-efficient.
We demonstrate that the ECLIPSE-trained prior, with only 3.3% of the parameters and trained on a mere 2.8% of the data, surpasses the baseline T2I priors with an average preference score of 71.6%.
- Score: 67.25974711647481
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Text-to-image (T2I) diffusion models, notably the unCLIP models (e.g.,
DALL-E-2), achieve state-of-the-art (SOTA) performance on various compositional
T2I benchmarks, at the cost of significant computational resources. The unCLIP
stack comprises T2I prior and diffusion image decoder. The T2I prior model
alone adds a billion parameters compared to the Latent Diffusion Models, which
increases the computational and high-quality data requirements. We introduce
ECLIPSE, a novel contrastive learning method that is both parameter and
data-efficient. ECLIPSE leverages pre-trained vision-language models (e.g.,
CLIP) to distill the knowledge into the prior model. We demonstrate that the
ECLIPSE-trained prior, with only 3.3% of the parameters and trained on a mere
2.8% of the data, surpasses the baseline T2I priors with an average preference
score of 71.6% under resource-limited settings. It also attains performance on
par with SOTA big models, achieving an average preference score of 63.36% for
the ability to follow text compositions. Extensive experiments on
two unCLIP diffusion image decoders, Karlo and Kandinsky, affirm that ECLIPSE
priors consistently deliver high performance while significantly reducing
resource dependency.
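As a rough illustration of the idea described in the abstract, the sketch below trains a small non-diffusion prior to map CLIP text embeddings to CLIP image embeddings, combining an embedding-reconstruction term with a CLIP-style contrastive term. The architecture sizes, loss weights, and helper names (TinyPrior, prior_loss) are illustrative assumptions, not the paper's exact recipe.

```python
# Minimal sketch: distill CLIP's vision-language knowledge into a compact,
# non-diffusion text-to-image prior. The prior maps a CLIP text embedding to a
# predicted CLIP image embedding; it is trained with a reconstruction term plus a
# CLIP-style contrastive term. Sizes, weights, and names are illustrative only.
import torch
import torch.nn.functional as F
from torch import nn

class TinyPrior(nn.Module):
    """Small MLP prior: CLIP text embedding -> predicted CLIP image embedding."""
    def __init__(self, dim: int = 768, hidden: int = 2048):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))

    def forward(self, text_emb: torch.Tensor) -> torch.Tensor:
        return self.net(text_emb)

def prior_loss(pred_img_emb, img_emb, text_emb, temperature=0.07, lam=0.2):
    # Reconstruction: predicted embedding should match the real CLIP image embedding.
    recon = F.mse_loss(pred_img_emb, img_emb)
    # Contrastive: predicted image embeddings should align with their own captions
    # and repel the other captions in the batch (symmetric InfoNCE, as in CLIP).
    p = F.normalize(pred_img_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = p @ t.T / temperature
    labels = torch.arange(logits.shape[0], device=logits.device)
    contrast = 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels))
    return recon + lam * contrast
```

At inference time the predicted image embedding would be handed to a frozen unCLIP image decoder such as Karlo or Kandinsky, so only the small prior needs to be trained.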
Related papers
- FastCLIP: A Suite of Optimization Techniques to Accelerate CLIP Training with Limited Resources [45.40926501138365]
We introduce FastCLIP, a general CLIP training framework built on advanced compositional optimization techniques.
Our framework is equipped with an efficient gradient reduction strategy to reduce communication overhead.
We benchmark the performance of FastCLIP and the state-of-the-art training baseline on different compute scales.
arXiv Detail & Related papers (2024-07-01T16:37:18Z)
- $λ$-ECLIPSE: Multi-Concept Personalized Text-to-Image Diffusion Models by Leveraging CLIP Latent Space [61.091910046492345]
$λ$-ECLIPSE works in the latent space of a pre-trained CLIP model without relying on diffusion UNet models.
$λ$-ECLIPSE performs multi-subject-driven P-T2I with just 34M parameters and is trained in a mere 74 GPU hours.
arXiv Detail & Related papers (2024-02-07T19:07:10Z)
- A-SDM: Accelerating Stable Diffusion through Redundancy Removal and Performance Optimization [54.113083217869516]
In this work, we first identify the computationally redundant parts of the network.
We then prune the redundant blocks of the model while maintaining network performance.
Third, we propose a global-regional interactive (GRI) attention to speed up the computationally intensive attention part.
arXiv Detail & Related papers (2023-12-24T15:37:47Z)
- Distilling Knowledge from Text-to-Image Generative Models Improves Visio-Linguistic Reasoning in CLIP [57.53087077735303]
We introduce SDS-CLIP, a lightweight and sample-efficient distillation method to enhance CLIP's compositional visio-linguistic reasoning.
Our approach fine-tunes CLIP using a distillation objective borrowed from large text-to-image generative models like Stable-Diffusion.
On the challenging Winoground benchmark, SDS-CLIP improves the visio-linguistic performance of various CLIP models by up to 7%, while on the ARO dataset, it boosts performance by up to 3%.
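A rough sketch of what such a distillation term can look like is given below: a frozen text-to-image denoiser acts as a critic of CLIP's image embedding. The names frozen_unet and cond_proj, the denoiser signature, and the noise schedule are placeholders; this is an illustration of the general idea, not the exact SDS-CLIP objective.

```python
# Hedged sketch of a diffusion-based distillation regularizer for CLIP fine-tuning:
# a frozen pretrained denoiser should be able to predict the noise added to an image
# latent when conditioned on (a learned projection of) CLIP's image embedding.
# `frozen_unet`, `cond_proj`, and the cosine schedule are illustrative placeholders.
import math
import torch
import torch.nn.functional as F

def distillation_regularizer(image_latents, clip_image_emb, frozen_unet, cond_proj,
                             num_timesteps: int = 1000):
    b = image_latents.shape[0]
    t = torch.randint(0, num_timesteps, (b,), device=image_latents.device)
    noise = torch.randn_like(image_latents)
    # Toy cosine schedule for the signal fraction alpha_bar(t).
    alpha_bar = torch.cos(0.5 * math.pi * t.float() / num_timesteps) ** 2
    alpha_bar = alpha_bar.view(b, 1, 1, 1)
    noisy = alpha_bar.sqrt() * image_latents + (1.0 - alpha_bar).sqrt() * noise
    cond = cond_proj(clip_image_emb)          # learned map into the denoiser's conditioning space
    eps_pred = frozen_unet(noisy, t, cond)    # frozen denoiser; signature is assumed
    return F.mse_loss(eps_pred, noise)

# Total fine-tuning objective (sketch): the standard CLIP contrastive loss plus a small
# weight on the distillation term, e.g. loss = clip_loss + 0.1 * distillation_regularizer(...)
```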
arXiv Detail & Related papers (2023-07-18T13:10:11Z)
- Boosting Visual-Language Models by Exploiting Hard Samples [126.35125029639168]
HELIP is a cost-effective strategy tailored to enhance the performance of existing CLIP models.
Our method allows for effortless integration with existing models' training pipelines.
On comprehensive benchmarks, HELIP consistently boosts existing models to achieve leading performance.
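The sketch below illustrates the general idea of exploiting hard samples in CLIP-style training by adding a margin penalty on the most confusable in-batch negatives. It is a generic illustration, not HELIP's actual pair-mining procedure; margin and hard_weight are made-up hyperparameters.

```python
# Generic sketch of up-weighting hard negatives in a CLIP-style contrastive loss.
# This shows the broad idea of "exploiting hard samples"; HELIP's actual pair-mining
# strategy and loss are described in the paper.
import torch
import torch.nn.functional as F

def contrastive_with_hard_negatives(img_emb, txt_emb, temperature=0.07,
                                    margin=0.2, hard_weight=1.0):
    i = F.normalize(img_emb, dim=-1)
    t = F.normalize(txt_emb, dim=-1)
    logits = i @ t.T / temperature
    labels = torch.arange(logits.shape[0], device=logits.device)
    # Standard symmetric InfoNCE over the batch.
    base = 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels))
    # Extra margin penalty on negatives whose similarity approaches the matched pair's.
    sim = i @ t.T
    pos = sim.diagonal().unsqueeze(1)
    off_diag = ~torch.eye(sim.shape[0], dtype=torch.bool, device=sim.device)
    hard = F.relu(sim - pos + margin)[off_diag].mean()
    return base + hard_weight * hard
```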
arXiv Detail & Related papers (2023-05-09T07:00:17Z)
- Not All Features Matter: Enhancing Few-shot CLIP with Adaptive Prior Refinement [24.108008515395458]
We propose APE, an Adaptive Prior rEfinement method for CLIP's pre-trained knowledge, which achieves superior accuracy with high computational efficiency.
For average accuracy over 11 benchmarks, both APE and APE-T attain state-of-the-art results and respectively outperform the second-best method by +1.59% and +1.99% under 16 shots, with 30x fewer learnable parameters.
arXiv Detail & Related papers (2023-04-03T17:58:54Z)
- Chinese CLIP: Contrastive Vision-Language Pretraining in Chinese [55.95225353842118]
We construct a large-scale dataset of image-text pairs in Chinese, where most data are retrieved from publicly available datasets.
We develop 5 Chinese CLIP models of multiple sizes, spanning from 77 to 958 million parameters.
Our experiments demonstrate that Chinese CLIP can achieve the state-of-the-art performance on MUGE, Flickr30K-CN, and COCO-CN.
arXiv Detail & Related papers (2022-11-02T17:47:23Z)