Kandinsky 3: Text-to-Image Synthesis for Multifunctional Generative Framework
- URL: http://arxiv.org/abs/2410.21061v1
- Date: Mon, 28 Oct 2024 14:22:08 GMT
- Title: Kandinsky 3: Text-to-Image Synthesis for Multifunctional Generative Framework
- Authors: Vladimir Arkhipkin, Viacheslav Vasilev, Andrei Filatov, Igor Pavlov, Julia Agafonova, Nikolai Gerasimenko, Anna Averchenkova, Evelina Mironova, Anton Bukashkin, Konstantin Kulikov, Andrey Kuznetsov, Denis Dimitrov,
- Abstract summary: Kandinsky 3 is a novel T2I model based on latent diffusion, achieving a high level of quality and photorealism.
We extend the base T2I model for various applications and create a multifunctional generation system.
Human evaluations show that Kandinsky 3 demonstrates one of the highest quality scores among open source generation systems.
- Score: 3.7953598825170753
- License:
- Abstract: Text-to-image (T2I) diffusion models are popular for introducing image manipulation methods, such as editing, image fusion, inpainting, etc. At the same time, image-to-video (I2V) and text-to-video (T2V) models are also built on top of T2I models. We present Kandinsky 3, a novel T2I model based on latent diffusion, achieving a high level of quality and photorealism. The key feature of the new architecture is the simplicity and efficiency of its adaptation for many types of generation tasks. We extend the base T2I model for various applications and create a multifunctional generation system that includes text-guided inpainting/outpainting, image fusion, text-image fusion, image variations generation, I2V and T2V generation. We also present a distilled version of the T2I model, evaluating inference in 4 steps of the reverse process without reducing image quality and 3 times faster than the base model. We deployed a user-friendly demo system in which all the features can be tested in the public domain. Additionally, we released the source code and checkpoints for the Kandinsky 3 and extended models. Human evaluations show that Kandinsky 3 demonstrates one of the highest quality scores among open source generation systems.
Related papers
- Tencent Hunyuan3D-1.0: A Unified Framework for Text-to-3D and Image-to-3D Generation [22.699173137070883]
Hunyuan3D-1.0 is a two-stage approach that supports text- and image-conditioned generation.
In the first stage, we employ a multi-view diffusion model that efficiently generates multi-view RGB in approximately 4 seconds.
In the second stage, we introduce a feed-forward reconstruction model that rapidly and faithfully reconstructs the 3D asset.
Our framework involves the text-to-image model, i.e., Hunyuan-DiT, making it a unified framework to support both text- and image-conditioned 3D generation.
arXiv Detail & Related papers (2024-11-04T17:21:42Z) - Direct Consistency Optimization for Compositional Text-to-Image
Personalization [73.94505688626651]
Text-to-image (T2I) diffusion models, when fine-tuned on a few personal images, are able to generate visuals with a high degree of consistency.
We propose to fine-tune the T2I model by maximizing consistency to reference images, while penalizing the deviation from the pretrained model.
arXiv Detail & Related papers (2024-02-19T09:52:41Z) - DreamDistribution: Prompt Distribution Learning for Text-to-Image
Diffusion Models [53.17454737232668]
We introduce a solution that allows a pretrained T2I diffusion model to learn a set of soft prompts.
These prompts offer text-guided editing capabilities and additional flexibility in controlling variation and mixing between multiple distributions.
We also show the adaptability of the learned prompt distribution to other tasks, such as text-to-3D.
arXiv Detail & Related papers (2023-12-21T12:11:00Z) - Kandinsky: an Improved Text-to-Image Synthesis with Image Prior and
Latent Diffusion [50.59261592343479]
We present Kandinsky1, a novel exploration of latent diffusion architecture.
The proposed model is trained separately to map text embeddings to image embeddings of CLIP.
We also deployed a user-friendly demo system that supports diverse generative modes such as text-to-image generation, image fusion, text and image fusion, image variations generation, and text-guided inpainting/outpainting.
arXiv Detail & Related papers (2023-10-05T12:29:41Z) - BLIP-Diffusion: Pre-trained Subject Representation for Controllable
Text-to-Image Generation and Editing [73.74570290836152]
BLIP-Diffusion is a new subject-driven image generation model that supports multimodal control.
Unlike other subject-driven generation models, BLIP-Diffusion introduces a new multimodal encoder which is pre-trained to provide subject representation.
arXiv Detail & Related papers (2023-05-24T04:51:04Z) - GlueGen: Plug and Play Multi-modal Encoders for X-to-image Generation [143.81719619351335]
Text-to-image (T2I) models based on diffusion processes have achieved remarkable success in controllable image generation using user-provided captions.
The tight coupling between the current text encoder and image decoder in T2I models makes it challenging to replace or upgrade.
We propose GlueGen, which applies a newly proposed GlueNet model to align features from single-modal or multi-modal encoders with the latent space of an existing T2I model.
arXiv Detail & Related papers (2023-03-17T15:37:07Z) - Tune-A-Video: One-Shot Tuning of Image Diffusion Models for
Text-to-Video Generation [31.882356164068753]
To reproduce the success of text-to-image (T2I) generation, recent works in text-to-video (T2V) generation employ massive dataset for dataset for T2V generation.
We propose Tune-A-Video is capable of producing temporally-coherent videos over various applications.
arXiv Detail & Related papers (2022-12-22T09:43:36Z) - Make-A-Video: Text-to-Video Generation without Text-Video Data [69.20996352229422]
Make-A-Video is an approach for translating the tremendous recent progress in Text-to-Image (T2I) generation to Text-to-Video (T2V)
We design a simple yet effective way to build on T2I models with novel and effective spatial-temporal modules.
In all aspects, spatial and temporal resolution, faithfulness to text, and quality, Make-A-Video sets the new state-of-the-art in text-to-video generation.
arXiv Detail & Related papers (2022-09-29T13:59:46Z) - Implementing and Experimenting with Diffusion Models for Text-to-Image
Generation [0.0]
Two models, DALL-E 2 and Imagen, have demonstrated that highly photorealistic images could be generated from a simple textual description of an image.
Text-to-image models require exceptionally large amounts of computational resources to train, as well as handling huge datasets collected from the internet.
This thesis contributes by reviewing the different approaches and techniques used by these models, and then by proposing our own implementation of a text-to-image model.
arXiv Detail & Related papers (2022-09-22T12:03:33Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.