Kandinsky: an Improved Text-to-Image Synthesis with Image Prior and
Latent Diffusion
- URL: http://arxiv.org/abs/2310.03502v1
- Date: Thu, 5 Oct 2023 12:29:41 GMT
- Title: Kandinsky: an Improved Text-to-Image Synthesis with Image Prior and
Latent Diffusion
- Authors: Anton Razzhigaev, Arseniy Shakhmatov, Anastasia Maltseva, Vladimir
Arkhipkin, Igor Pavlov, Ilya Ryabov, Angelina Kuts, Alexander Panchenko,
Andrey Kuznetsov and Denis Dimitrov
- Abstract summary: We present Kandinsky, a novel exploration of latent diffusion architecture.
The image prior model is trained separately to map text embeddings to CLIP image embeddings.
We also deployed a user-friendly demo system that supports diverse generative modes such as text-to-image generation, image fusion, text and image fusion, image variations generation, and text-guided inpainting/outpainting.
- Score: 50.59261592343479
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Text-to-image generation is a significant domain in modern computer vision
and has achieved substantial improvements through the evolution of generative
architectures. Among these, diffusion-based models have demonstrated notable
quality improvements. These models are generally split into two categories:
pixel-level and latent-level approaches. We present Kandinsky, a novel
exploration of latent diffusion architecture, combining the
principles of the image prior models with latent diffusion techniques. The
image prior model is trained separately to map text embeddings to image
embeddings of CLIP. Another distinct feature of the proposed model is the
modified MoVQ implementation, which serves as the image autoencoder component.
Overall, the designed model contains 3.3B parameters. We also deployed a
user-friendly demo system that supports diverse generative modes such as
text-to-image generation, image fusion, text and image fusion, image variations
generation, and text-guided inpainting/outpainting. Additionally, we released
the source code and checkpoints for the Kandinsky models. Experimental
evaluations demonstrate a FID score of 8.03 on the COCO-30K dataset, marking
our model as the top open-source performer in terms of measurable image
generation quality.
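
To make the two-stage design concrete (an image prior followed by latent diffusion with a MoVQ decoder), a minimal inference sketch is shown below. It assumes the Hugging Face diffusers integration of Kandinsky 2.1 with the kandinsky-community checkpoints; the model IDs, argument names, and defaults are illustrative, may differ across library versions, and are not taken from the paper itself.

```python
# Minimal sketch of Kandinsky's two-stage inference flow, assuming the
# Hugging Face diffusers integration (model IDs and argument names are
# illustrative and may differ across library versions).
import torch
from diffusers import KandinskyPriorPipeline, KandinskyPipeline

device = "cuda" if torch.cuda.is_available() else "cpu"

# Stage 1: the image prior maps the text prompt to a CLIP image embedding.
prior = KandinskyPriorPipeline.from_pretrained(
    "kandinsky-community/kandinsky-2-1-prior"
).to(device)

prompt = "a red cat, 4k photo"
image_embeds, negative_image_embeds = prior(
    prompt, negative_prompt="low quality, blurry", guidance_scale=1.0
).to_tuple()

# Stage 2: the latent diffusion decoder denoises conditioned on the image
# embedding; the MoVQ autoencoder maps the final latents back to pixels.
decoder = KandinskyPipeline.from_pretrained(
    "kandinsky-community/kandinsky-2-1"
).to(device)

image = decoder(
    prompt,
    image_embeds=image_embeds,
    negative_image_embeds=negative_image_embeds,
    height=512,
    width=512,
    num_inference_steps=100,
).images[0]
image.save("red_cat.png")
```

The split mirrors the architecture described in the abstract: the prior handles the mapping from text embeddings to CLIP image embeddings, while the decoder performs latent diffusion and relies on the modified MoVQ autoencoder to reconstruct the output image.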
Related papers
- Image Regeneration: Evaluating Text-to-Image Model via Generating Identical Image with Multimodal Large Language Models [54.052963634384945]
We introduce the Image Regeneration task to assess text-to-image models.
We use GPT4V to bridge the gap between the reference image and the text input for the T2I model.
We also present ImageRepainter framework to enhance the quality of generated images.
arXiv Detail & Related papers (2024-11-14T13:52:43Z) - Kandinsky 3: Text-to-Image Synthesis for Multifunctional Generative Framework [3.7953598825170753]
Kandinsky 3 is a novel T2I model based on latent diffusion, achieving a high level of quality and photorealism.
We extend the base T2I model for various applications and create a multifunctional generation system.
Human evaluations show that Kandinsky 3 demonstrates one of the highest quality scores among open source generation systems.
arXiv Detail & Related papers (2024-10-28T14:22:08Z) - Meissonic: Revitalizing Masked Generative Transformers for Efficient High-Resolution Text-to-Image Synthesis [62.06970466554273]
We present Meissonic, which elevates non-autoregressive masked image modeling (MIM) text-to-image synthesis to a level comparable with state-of-the-art diffusion models like SDXL.
We leverage high-quality training data, integrate micro-conditions informed by human preference scores, and employ feature compression layers to further enhance image fidelity and resolution.
Our model not only matches but often exceeds the performance of existing models like SDXL in generating high-quality, high-resolution images.
arXiv Detail & Related papers (2024-10-10T17:59:17Z) - IterComp: Iterative Composition-Aware Feedback Learning from Model Gallery for Text-to-Image Generation [70.8833857249951]
IterComp is a novel framework that aggregates composition-aware model preferences from multiple models.
We propose an iterative feedback learning method to enhance compositionality in a closed-loop manner.
IterComp opens new research avenues in reward feedback learning for diffusion models and compositional generation.
arXiv Detail & Related papers (2024-10-09T17:59:13Z) - MaxFusion: Plug&Play Multi-Modal Generation in Text-to-Image Diffusion Models [34.611309081801345]
Large diffusion-based Text-to-Image (T2I) models have shown impressive generative powers for text-to-image generation.
In this paper, we propose a novel strategy to scale a generative model across new tasks with minimal compute.
arXiv Detail & Related papers (2024-04-15T17:55:56Z) - DiffBlender: Scalable and Composable Multimodal Text-to-Image Diffusion
Models [10.744438740060458]
We aim to extend the capabilities of diffusion-based text-to-image (T2I) generation models by incorporating diverse modalities beyond textual description.
We thus design a multimodal T2I diffusion model, coined as DiffBlender, by separating the channels of conditions into three types.
The unique architecture of DiffBlender facilitates adding new input modalities, pioneering a scalable framework for conditional image generation.
arXiv Detail & Related papers (2023-05-24T14:31:20Z) - Parents and Children: Distinguishing Multimodal DeepFakes from Natural Images [60.34381768479834]
Recent advancements in diffusion models have enabled the generation of realistic deepfakes from textual prompts in natural language.
We pioneer a systematic study on the detection of deepfakes generated by state-of-the-art diffusion models.
arXiv Detail & Related papers (2023-04-02T10:25:09Z) - LayoutDiffuse: Adapting Foundational Diffusion Models for
Layout-to-Image Generation [24.694298869398033]
Our method trains efficiently and generates images with both high perceptual quality and layout alignment.
Our method significantly outperforms 10 other generative models based on GANs, VQ-VAE, and diffusion models.
arXiv Detail & Related papers (2023-02-16T14:20:25Z) - eDiffi: Text-to-Image Diffusion Models with an Ensemble of Expert
Denoisers [87.52504764677226]
Large-scale diffusion-based generative models have led to breakthroughs in text-conditioned high-resolution image synthesis.
We train an ensemble of text-to-image diffusion models, each specialized for a different stage of synthesis.
Our ensemble of diffusion models, called eDiffi, results in improved text alignment while maintaining the same inference cost.
arXiv Detail & Related papers (2022-11-02T17:43:04Z)