StyleGAN-T: Unlocking the Power of GANs for Fast Large-Scale Text-to-Image Synthesis
- URL: http://arxiv.org/abs/2301.09515v1
- Date: Mon, 23 Jan 2023 16:05:45 GMT
- Title: StyleGAN-T: Unlocking the Power of GANs for Fast Large-Scale Text-to-Image Synthesis
- Authors: Axel Sauer, Tero Karras, Samuli Laine, Andreas Geiger, Timo Aila
- Abstract summary: StyleGAN-T addresses the specific requirements of large-scale text-to-image synthesis.
It significantly improves over previous GANs and outperforms distilled diffusion models in terms of sample quality and speed.
- Score: 54.39789900854696
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Text-to-image synthesis has recently seen significant progress thanks to
large pretrained language models, large-scale training data, and the
introduction of scalable model families such as diffusion and autoregressive
models. However, the best-performing models require iterative evaluation to
generate a single sample. In contrast, generative adversarial networks (GANs)
only need a single forward pass. They are thus much faster, but they currently
remain far behind the state-of-the-art in large-scale text-to-image synthesis.
This paper aims to identify the necessary steps to regain competitiveness. Our
proposed model, StyleGAN-T, addresses the specific requirements of large-scale
text-to-image synthesis, such as large capacity, stable training on diverse
datasets, strong text alignment, and controllable variation vs. text alignment
tradeoff. StyleGAN-T significantly improves over previous GANs and outperforms
distilled diffusion models - the previous state-of-the-art in fast
text-to-image synthesis - in terms of sample quality and speed.
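To make the speed argument concrete, the sketch below contrasts one-pass GAN sampling with an iterative, diffusion-style sampling loop, and includes the classic StyleGAN truncation trick as one way a variation vs. text alignment tradeoff can be exposed. This is a minimal sketch: the toy modules, step size, and step count are illustrative placeholders, not StyleGAN-T's architecture or any particular diffusion sampler.

```python
# Toy contrast between one-pass GAN sampling and iterative diffusion-style
# sampling. All modules and step sizes are illustrative placeholders, not
# the architectures or samplers from the papers discussed here.
import torch
import torch.nn as nn

IMG_DIM, Z_DIM, TXT_DIM = 3 * 32 * 32, 64, 64

class ToyGenerator(nn.Module):
    """Stand-in for a text-conditional GAN generator: one forward pass."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(Z_DIM + TXT_DIM, 256), nn.ReLU(),
            nn.Linear(256, IMG_DIM), nn.Tanh())

    def forward(self, z, txt):
        return self.net(torch.cat([z, txt], dim=-1))

class ToyDenoiser(nn.Module):
    """Stand-in for a diffusion denoiser: evaluated once per step."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(IMG_DIM + TXT_DIM, 256), nn.ReLU(),
            nn.Linear(256, IMG_DIM))

    def forward(self, x, txt):
        return self.net(torch.cat([x, txt], dim=-1))

def truncate(w, w_mean, psi=0.7):
    # Classic StyleGAN truncation trick: psi < 1 pulls latents toward the
    # mean, trading sample variation for fidelity (and, with a text-
    # conditional generator, for text alignment).
    return w_mean + psi * (w - w_mean)

@torch.no_grad()
def sample_gan(gen, txt):
    z = truncate(torch.randn(txt.shape[0], Z_DIM), torch.zeros(Z_DIM))
    return gen(z, txt)  # exactly 1 network evaluation per batch

@torch.no_grad()
def sample_diffusion(denoiser, txt, steps=50):
    x = torch.randn(txt.shape[0], IMG_DIM)
    for _ in range(steps):  # `steps` network evaluations per batch
        x = x - 0.02 * denoiser(x, txt)  # crude Euler-style update
    return x

txt = torch.randn(4, TXT_DIM)  # placeholder text embeddings
print(sample_gan(ToyGenerator(), txt).shape)       # 1 forward pass
print(sample_diffusion(ToyDenoiser(), txt).shape)  # 50 forward passes
```

The point of the contrast is the evaluation count: the GAN path calls its network once per batch, while the diffusion path calls its network `steps` times, which is why distilling or avoiding the loop dominates work on fast text-to-image synthesis.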
Related papers
- SimpleSpeech 2: Towards Simple and Efficient Text-to-Speech with Flow-based Scalar Latent Transformer Diffusion Models [64.40250409933752]
We build upon our previous publication by implementing a simple and efficient non-autoregressive (NAR) TTS framework, termed SimpleSpeech 2.
SimpleSpeech 2 effectively combines the strengths of both autoregressive (AR) and non-autoregressive (NAR) methods.
We show significant improvements in generation performance and speed compared to our previous work and other state-of-the-art (SOTA) large-scale TTS models.
arXiv Detail & Related papers (2024-08-25T17:07:39Z)
- Synth$^2$: Boosting Visual-Language Models with Synthetic Captions and Image Embeddings [16.28853186016663]
We create synthetic image-text pairs for efficient and effective training of Visual-Language Models (VLMs).
Our method employs a pretrained text-to-image model to synthesize image embeddings from captions generated by an LLM.
Our VLM, finetuned on synthetic data, achieves performance comparable to models trained solely on human-annotated data (a toy sketch of the caption-to-embedding pipeline follows this entry).
arXiv Detail & Related papers (2024-03-12T15:36:42Z)
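As referenced above, a minimal sketch of a Synth$^2$-style pipeline as the summary describes it: an LLM produces captions, a pretrained text-to-image model maps each caption to an image embedding, and the resulting pairs become VLM training data. Every function here (`generate_captions`, `caption_to_image_emb`, the lambda stubs) is a hypothetical placeholder, not the paper's interface.

```python
# Hypothetical sketch of a Synth^2-style synthetic-data pipeline. The model
# stubs below stand in for an LLM, a pretrained text-to-image embedding
# generator, and downstream VLM training; none of this is the paper's code.
from typing import Callable, List, Tuple

def synthesize_pairs(
    generate_captions: Callable[[int], List[str]],       # LLM stub
    caption_to_image_emb: Callable[[str], List[float]],  # text-to-image stub
    n: int,
) -> List[Tuple[str, List[float]]]:
    """Build synthetic (caption, image-embedding) training pairs."""
    captions = generate_captions(n)
    return [(c, caption_to_image_emb(c)) for c in captions]

# Toy stand-ins so the sketch runs end to end.
fake_llm = lambda n: [f"a photo of object {i}" for i in range(n)]
fake_t2i = lambda c: [float(len(c)), float(hash(c) % 97)]

pairs = synthesize_pairs(fake_llm, fake_t2i, n=3)
for caption, emb in pairs:
    print(caption, emb)  # a VLM would be finetuned on such pairs
```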
- Scaling Rectified Flow Transformers for High-Resolution Image Synthesis [22.11487736315616]
Rectified flow is a recent generative model formulation that connects data and noise in a straight line.
We improve existing noise sampling techniques for training rectified flow models by biasing them towards perceptually relevant scales (the interpolant and a biased timestep draw are sketched after this entry).
We present a novel transformer-based architecture for text-to-image generation that uses separate weights for the two modalities.
arXiv Detail & Related papers (2024-03-05T18:45:39Z)
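The sketch below shows the standard rectified-flow interpolant (a straight line from data to noise) and its constant-velocity regression target, plus a logit-normal timestep draw as one way to bias training toward intermediate, perceptually relevant noise scales. Treat the exact sampling distribution as an assumption rather than the paper's final recipe.

```python
# Minimal rectified-flow training target (toy dimensions). The straight-line
# interpolant and velocity target follow the standard rectified-flow setup;
# the logit-normal timestep draw is one assumed way to bias training toward
# intermediate noise scales, as the summary describes.
import torch

def rf_training_target(x0, t):
    """x0: clean data, t in (0, 1). Returns (x_t, velocity target)."""
    x1 = torch.randn_like(x0)                # pure-noise endpoint
    t = t.view(-1, *([1] * (x0.dim() - 1)))  # broadcast t over data dims
    x_t = (1 - t) * x0 + t * x1              # straight line: data -> noise
    v_target = x1 - x0                       # constant velocity on the line
    return x_t, v_target

batch = torch.randn(8, 3, 32, 32)            # toy "images"
t = torch.sigmoid(torch.randn(8))            # logit-normal: favors mid-range t
x_t, v = rf_training_target(batch, t)
print(x_t.shape, v.shape)                    # a model would regress v from (x_t, t)
```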
- eDiffi: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers [87.52504764677226]
Large-scale diffusion-based generative models have led to breakthroughs in text-conditioned high-resolution image synthesis.
We train an ensemble of text-to-image diffusion models, each specialized for a different stage of the synthesis process (a toy expert-routing sketch follows this entry).
Our ensemble of diffusion models, called eDiffi, results in improved text alignment while maintaining the same inference cost.
arXiv Detail & Related papers (2022-11-02T17:43:04Z)
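A toy sketch of the ensemble-of-experts idea from the entry above: each sampling step is routed to a denoiser specialized for a portion of the trajectory, so per-step cost is unchanged even though total capacity grows. The two-expert split, threshold, and Euler-style update are invented for illustration and are not eDiffi's actual experts or sampler.

```python
# Toy sketch of expert routing across synthesis stages: one denoiser handles
# high-noise (early) steps, another low-noise (late) steps. The split point
# and the denoisers themselves are illustrative, not eDiffi's.
import torch
import torch.nn as nn

IMG_DIM = 3 * 32 * 32
make_expert = lambda: nn.Sequential(nn.Linear(IMG_DIM, 256), nn.ReLU(),
                                    nn.Linear(256, IMG_DIM))
early_expert, late_expert = make_expert(), make_expert()

@torch.no_grad()
def sample_with_experts(steps=50, switch_frac=0.5):
    x = torch.randn(4, IMG_DIM)
    for i in range(steps):
        frac_done = i / steps
        # Route each step to the expert for that stage of synthesis:
        expert = early_expert if frac_done < switch_frac else late_expert
        x = x - 0.02 * expert(x)  # crude Euler-style update
    return x

print(sample_with_experts().shape)
```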
- Lafite2: Few-shot Text-to-Image Generation [132.14211027057766]
We propose a novel method for pre-training a text-to-image generation model on image-only datasets.
It uses a retrieval-then-optimization procedure to synthesize pseudo text features (roughly sketched after this entry).
It is beneficial in a wide range of settings, including few-shot, semi-supervised, and fully supervised learning.
arXiv Detail & Related papers (2022-10-25T16:22:23Z)
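One way to picture the retrieval-then-optimization step described above, under loud assumptions: retrieve nearby features from a bank by cosine similarity, average them as an initial pseudo text feature, then optimize it toward the image feature in a shared CLIP-like space. The retrieval rule, objective, and hyperparameters are guesses for illustration, not Lafite2's actual procedure.

```python
# Rough, assumed illustration of a retrieval-then-optimization step for
# pseudo text features in a shared CLIP-like space. The retrieval rule and
# cosine objective are guesses for illustration, not Lafite2's procedure.
import torch
import torch.nn.functional as F

def pseudo_text_feature(img_feat, feat_bank, k=4, iters=50, lr=0.1):
    # Retrieve: k nearest bank features by cosine similarity.
    sims = F.cosine_similarity(feat_bank, img_feat.unsqueeze(0), dim=-1)
    init = feat_bank[sims.topk(k).indices].mean(0)
    # Optimize: refine the retrieved average toward the image feature.
    pseudo = init.clone().requires_grad_(True)
    opt = torch.optim.Adam([pseudo], lr=lr)
    for _ in range(iters):
        loss = 1 - F.cosine_similarity(pseudo, img_feat, dim=0)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return pseudo.detach()

bank = F.normalize(torch.randn(100, 512), dim=-1)  # toy feature bank
img = F.normalize(torch.randn(512), dim=-1)        # toy image feature
print(pseudo_text_feature(img, bank).shape)
```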
- Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding [53.170767750244366]
Imagen is a text-to-image diffusion model with an unprecedented degree of photorealism and a deep level of language understanding.
To assess text-to-image models in greater depth, we introduce DrawBench, a comprehensive and challenging benchmark for text-to-image models.
arXiv Detail & Related papers (2022-05-23T17:42:53Z)
- LAFITE: Towards Language-Free Training for Text-to-Image Generation [83.2935513540494]
We propose the first method for training text-to-image generation models without any text data.
Our method leverages the well-aligned multi-modal semantic space of the powerful pre-trained CLIP model (a schematic of this substitution follows this entry).
We obtain state-of-the-art results in the standard text-to-image generation tasks.
arXiv Detail & Related papers (2021-11-27T01:54:45Z)
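Language-free training exploits the fact that CLIP embeds images and text in one shared space: during training, a (perturbed) CLIP image embedding stands in for the absent text embedding, and at inference a real CLIP text embedding drops in. The schematic below uses toy tensors, and the Gaussian perturbation is an assumed stand-in for LAFITE's exact scheme.

```python
# Schematic of language-free conditioning in a shared CLIP-like space: train
# with (perturbed) image embeddings in place of text embeddings, then swap in
# real text embeddings at inference. Toy tensors only; the perturbation is an
# assumed stand-in for LAFITE's scheme.
import torch
import torch.nn.functional as F

def pseudo_text_from_image(img_emb, noise_scale=0.1):
    # Perturb the image embedding so it behaves like a nearby text embedding.
    noisy = img_emb + noise_scale * torch.randn_like(img_emb)
    return F.normalize(noisy, dim=-1)

clip_img_emb = F.normalize(torch.randn(4, 512), dim=-1)  # from CLIP image encoder
train_cond = pseudo_text_from_image(clip_img_emb)        # conditioning at training
clip_txt_emb = F.normalize(torch.randn(4, 512), dim=-1)  # from CLIP text encoder
infer_cond = clip_txt_emb                                # drop-in at inference
print(train_cond.shape, infer_cond.shape)
```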
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences arising from its use.