Kandinsky: an Improved Text-to-Image Synthesis with Image Prior and
Latent Diffusion
- URL: http://arxiv.org/abs/2310.03502v1
- Date: Thu, 5 Oct 2023 12:29:41 GMT
- Title: Kandinsky: an Improved Text-to-Image Synthesis with Image Prior and
Latent Diffusion
- Authors: Anton Razzhigaev, Arseniy Shakhmatov, Anastasia Maltseva, Vladimir
Arkhipkin, Igor Pavlov, Ilya Ryabov, Angelina Kuts, Alexander Panchenko,
Andrey Kuznetsov and Denis Dimitrov
- Abstract summary: We present Kandinsky, a novel exploration of latent diffusion architecture.
The image prior model is trained separately to map text embeddings to image embeddings of CLIP.
We also deployed a user-friendly demo system that supports diverse generative modes such as text-to-image generation, image fusion, text and image fusion, image variations generation, and text-guided inpainting/outpainting.
- Score: 50.59261592343479
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Text-to-image generation is a significant domain in modern computer vision
and has achieved substantial improvements through the evolution of generative
architectures. Among these, diffusion-based models have demonstrated notable
quality improvements. These models are generally split into two categories:
pixel-level and latent-level approaches. We present
Kandinsky, a novel exploration of latent diffusion architecture, combining the
principles of the image prior models with latent diffusion techniques. The
image prior model is trained separately to map text embeddings to image
embeddings of CLIP. Another distinct feature of the proposed model is the
modified MoVQ implementation, which serves as the image autoencoder component.
Overall, the designed model contains 3.3B parameters. We also deployed a
user-friendly demo system that supports diverse generative modes such as
text-to-image generation, image fusion, text and image fusion, image variations
generation, and text-guided inpainting/outpainting. Additionally, we released
the source code and checkpoints for the Kandinsky models. Experimental
evaluations demonstrate a FID score of 8.03 on the COCO-30K dataset, marking
our model as the top open-source performer in terms of measurable image
generation quality.
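
To make the two-stage design concrete, here is a minimal usage sketch based on the Hugging Face diffusers Kandinsky pipelines; the checkpoint IDs, prompt, and generation settings below are illustrative assumptions, not taken from the paper.

```python
# Minimal sketch of Kandinsky's two-stage pipeline via Hugging Face diffusers.
# Checkpoint IDs and settings are illustrative assumptions.
import torch
from diffusers import KandinskyPriorPipeline, KandinskyPipeline

device = "cuda" if torch.cuda.is_available() else "cpu"

# Stage 1: the image prior maps the text prompt to a CLIP image embedding.
prior = KandinskyPriorPipeline.from_pretrained(
    "kandinsky-community/kandinsky-2-1-prior"
).to(device)
prompt = "a watercolor painting of a lighthouse at dawn"
prior_out = prior(prompt)

# Stage 2: the latent diffusion decoder (with the modified MoVQ autoencoder)
# renders pixels from the predicted image embedding.
decoder = KandinskyPipeline.from_pretrained(
    "kandinsky-community/kandinsky-2-1"
).to(device)
image = decoder(
    prompt,
    image_embeds=prior_out.image_embeds,
    negative_image_embeds=prior_out.negative_image_embeds,
    height=512,
    width=512,
).images[0]
image.save("lighthouse.png")
```

Roughly speaking, the prior/decoder split is also what enables the demo's fusion and variation modes: image embeddings can be mixed or perturbed before decoding.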
Related papers
- Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation [52.509092010267665]
We introduce LlamaGen, a new family of image generation models that apply the original "next-token prediction" paradigm of large language models to the visual generation domain.
It is an affirmative answer to whether vanilla autoregressive models, e.g., Llama, without inductive biases on visual signals, can achieve state-of-the-art image generation performance if scaled properly.
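
As a schematic illustration (not the authors' code) of that paradigm: a VQ tokenizer maps the image to discrete tokens, and a vanilla decoder-only transformer samples them one at a time. All names and shapes below are hypothetical.

```python
# Hypothetical sketch of next-token prediction applied to image tokens.
import torch

@torch.no_grad()
def sample_image_tokens(model, cond_tokens, seq_len=256, temperature=1.0):
    """Autoregressively sample a grid of discrete image tokens.

    model       -- decoder-only transformer returning logits over the VQ
                   codebook, shape (batch, seq, vocab)
    cond_tokens -- conditioning prefix (e.g. class or text tokens), shape (1, c)
    seq_len     -- number of image tokens to generate (e.g. a 16x16 latent grid)
    """
    tokens = cond_tokens
    for _ in range(seq_len):
        logits = model(tokens)[:, -1, :] / temperature  # next-token logits only
        probs = torch.softmax(logits, dim=-1)
        next_tok = torch.multinomial(probs, num_samples=1)
        tokens = torch.cat([tokens, next_tok], dim=1)
    return tokens[:, cond_tokens.shape[1]:]             # drop the prefix

# The token grid would then be decoded to pixels by the VQ decoder, e.g.:
# image = vq_decoder(tokens.reshape(1, 16, 16))
```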
arXiv Detail & Related papers (2024-06-10T17:59:52Z)
- MaxFusion: Plug&Play Multi-Modal Generation in Text-to-Image Diffusion Models [34.611309081801345]
Large diffusion-based Text-to-Image (T2I) models have shown impressive generative capabilities.
In this paper, we propose a novel strategy to scale a generative model across new tasks with minimal compute.
arXiv Detail & Related papers (2024-04-15T17:55:56Z)
- Refining Text-to-Image Generation: Towards Accurate Training-Free Glyph-Enhanced Image Generation [5.55027585813848]
The capability to generate visual text is crucial, offering both academic interest and a wide range of practical applications.
We introduce a benchmark, LenCom-Eval, specifically designed for testing models' capability in generating images with Lengthy and Complex visual text.
We demonstrate notable improvements across a range of evaluation metrics, including CLIPScore, OCR precision, recall, F1 score, accuracy, and edit distance scores.
arXiv Detail & Related papers (2024-03-25T04:54:49Z)
- RenAIssance: A Survey into AI Text-to-Image Generation in the Era of Large Model [93.8067369210696]
Text-to-image generation (TTI) refers to the use of models that process text input and generate high-fidelity images based on text descriptions.
Diffusion models are one prominent type of generative model, producing images by learning to reverse the systematic introduction of noise over repeated steps.
In the era of large models, scaling up model size and the integration with large language models have further improved the performance of TTI models.
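
As a minimal illustration of the mechanism this survey refers to, here is a sketch of the standard DDPM forward noising step (a generic paraphrase, not code from the survey).

```python
# Minimal sketch of the forward diffusion step: x_t is a schedule-weighted mix
# of the clean image and Gaussian noise; a denoiser is trained to reverse it.
import torch

def add_noise(x0, t, alphas_cumprod):
    """Sample x_t ~ q(x_t | x_0) = N(sqrt(abar_t) * x_0, (1 - abar_t) * I)."""
    abar = alphas_cumprod[t]
    noise = torch.randn_like(x0)
    return abar.sqrt() * x0 + (1.0 - abar).sqrt() * noise
```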
arXiv Detail & Related papers (2023-09-02T03:27:20Z)
- Improving Few-shot Image Generation by Structural Discrimination and Textural Modulation [10.389698647141296]
Few-shot image generation aims to produce plausible and diverse images for one category given a few images from this category.
Existing approaches either globally interpolate different images or fuse local representations with pre-defined coefficients.
This paper proposes a novel mechanism to inject external semantic signals into internal local representations.
arXiv Detail & Related papers (2023-08-30T16:10:21Z)
- DiffBlender: Scalable and Composable Multimodal Text-to-Image Diffusion Models [10.744438740060458]
We aim to extend the capabilities of diffusion-based text-to-image (T2I) generation models by incorporating diverse modalities beyond textual description.
We thus design a multimodal T2I diffusion model, coined as DiffBlender, by separating the channels of conditions into three types.
The unique architecture of DiffBlender facilitates adding new input modalities, pioneering a scalable framework for conditional image generation.
arXiv Detail & Related papers (2023-05-24T14:31:20Z)
- Parents and Children: Distinguishing Multimodal DeepFakes from Natural Images [60.34381768479834]
Recent advancements in diffusion models have enabled the generation of realistic deepfakes from textual prompts in natural language.
We pioneer a systematic study of the detection of deepfakes generated by state-of-the-art diffusion models.
arXiv Detail & Related papers (2023-04-02T10:25:09Z)
- LayoutDiffuse: Adapting Foundational Diffusion Models for Layout-to-Image Generation [24.694298869398033]
Our method trains efficiently and generates images with both high perceptual quality and layout alignment.
It significantly outperforms 10 other generative models based on GANs, VQ-VAE, and diffusion models.
arXiv Detail & Related papers (2023-02-16T14:20:25Z)
- eDiffi: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers [87.52504764677226]
Large-scale diffusion-based generative models have led to breakthroughs in text-conditioned high-resolution image synthesis.
We train an ensemble of text-to-image diffusion models, each specialized for a different stage of synthesis.
Our ensemble of diffusion models, called eDiffi, results in improved text alignment while maintaining the same inference cost.
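
A conceptual sketch of the ensemble idea: the reverse diffusion trajectory is partitioned into intervals, each handled by its own specialized denoiser, so per-step inference cost matches a single model. The interval boundaries and names below are hypothetical, not taken from eDiffi.

```python
# Hypothetical sketch of routing diffusion steps to stage-specific experts.
def sample_with_experts(experts, boundaries, x, timesteps, scheduler_step):
    """experts        -- list of denoisers, one per synthesis stage
    boundaries     -- descending thresholds, one per expert, ending at 0;
                      step t is routed to the first expert with t >= boundary
    x              -- initial Gaussian noise
    timesteps      -- descending sequence of timesteps
    scheduler_step -- update rule: (model_output, t, x) -> x_prev
    """
    for t in timesteps:
        idx = next(i for i, b in enumerate(boundaries) if t >= b)
        eps = experts[idx](x, t)       # stage-specific noise prediction
        x = scheduler_step(eps, t, x)  # standard reverse diffusion update
    return x

# e.g. one expert for the high-noise (text-alignment) stage, one for details:
# x0 = sample_with_experts([hi_expert, lo_expert], [500, 0],
#                          noise, range(999, -1, -1), step_fn)
```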
arXiv Detail & Related papers (2022-11-02T17:43:04Z)
- Generating Diverse Structure for Image Inpainting With Hierarchical VQ-VAE [74.29384873537587]
We propose a two-stage model for diverse inpainting, where the first stage generates multiple coarse results each of which has a different structure, and the second stage refines each coarse result separately by augmenting texture.
Experimental results on CelebA-HQ, Places2, and ImageNet datasets show that our method not only enhances the diversity of the inpainting solutions but also improves the visual quality of the generated multiple images.
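
A conceptual sketch of the two-stage split: stage one samples several coarse completions that differ in global structure (in the paper, from a hierarchical VQ-VAE's discrete latent space), and stage two refines the texture of each independently. All names below are hypothetical.

```python
# Hypothetical sketch of two-stage diverse inpainting.
def diverse_inpaint(structure_sampler, texture_refiner, image, mask, n_samples=5):
    """Return n_samples diverse completions of the masked region."""
    results = []
    for _ in range(n_samples):
        # Stage 1: sample a coarse result with its own global structure.
        coarse = structure_sampler(image, mask)
        # Stage 2: refine that coarse result by augmenting texture.
        results.append(texture_refiner(image, mask, coarse))
    return results
```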
arXiv Detail & Related papers (2021-03-18T05:10:49Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.