DiffBlender: Scalable and Composable Multimodal Text-to-Image Diffusion Models
- URL: http://arxiv.org/abs/2305.15194v2
- Date: Thu, 21 Dec 2023 12:55:57 GMT
- Title: DiffBlender: Scalable and Composable Multimodal Text-to-Image Diffusion Models
- Authors: Sungnyun Kim, Junsoo Lee, Kibeom Hong, Daesik Kim, Namhyuk Ahn
- Abstract summary: We aim to extend the capabilities of diffusion-based text-to-image (T2I) generation models by incorporating diverse modalities beyond textual description.
We thus design a multimodal T2I diffusion model, coined as DiffBlender, by separating the channels of conditions into three types.
The unique architecture of DiffBlender facilitates adding new input modalities, pioneering a scalable framework for conditional image generation.
- Score: 10.744438740060458
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this study, we aim to extend the capabilities of diffusion-based
text-to-image (T2I) generation models by incorporating diverse modalities
beyond textual description, such as sketch, box, color palette, and style
embedding, within a single model. We thus design a multimodal T2I diffusion
model, coined as DiffBlender, by separating the channels of conditions into
three types, i.e., image forms, spatial tokens, and non-spatial tokens. The
unique architecture of DiffBlender facilitates adding new input modalities,
pioneering a scalable framework for conditional image generation. Notably, we
achieve this without altering the parameters of the existing generative model,
Stable Diffusion, by updating only a subset of components. Our study establishes
new benchmarks in multimodal generation through quantitative and qualitative
comparisons with existing conditional generation methods. We demonstrate that
DiffBlender faithfully blends all the provided information and showcase its
various applications in detailed image synthesis.
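To make the three-way conditioning split concrete, here is a minimal, hypothetical sketch (not the released DiffBlender code) of how pixel-aligned image-form conditions, spatial tokens, and non-spatial tokens could be encoded separately and handed to a frozen T2I backbone. All class names, dimensions, and the fusion mechanism are illustrative assumptions based only on the abstract.

```python
# Minimal, hypothetical sketch of a DiffBlender-style conditioning split.
# NOT the official implementation: names, dimensions, and the fusion mechanism
# are assumptions for illustration.
import torch
import torch.nn as nn


class ImageFormEncoder(nn.Module):
    """Encodes pixel-aligned conditions (e.g., a sketch map) into feature maps."""
    def __init__(self, in_ch: int = 1, dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, dim, kernel_size=3, padding=1),
            nn.SiLU(),
            nn.Conv2d(dim, dim, kernel_size=3, padding=1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # (B, in_ch, H, W)
        return self.net(x)                               # (B, dim, H, W)


class TokenEncoder(nn.Module):
    """Projects a condition vector (box coordinates, color palette, or a style
    embedding) into a short sequence of conditioning tokens."""
    def __init__(self, in_dim: int, dim: int = 64, n_tokens: int = 8):
        super().__init__()
        self.proj = nn.Linear(in_dim, dim * n_tokens)
        self.n_tokens, self.dim = n_tokens, dim

    def forward(self, c: torch.Tensor) -> torch.Tensor:       # (B, in_dim)
        return self.proj(c).view(-1, self.n_tokens, self.dim)  # (B, T, dim)


class MultimodalConditioner(nn.Module):
    """Separates conditions into the three channel types named in the abstract;
    a frozen T2I backbone would add `feat_map` to its intermediate features and
    attend to `tokens` through newly inserted (trainable) layers."""
    def __init__(self, dim: int = 64):
        super().__init__()
        self.image_form  = ImageFormEncoder(in_ch=1, dim=dim)   # e.g., sketch
        self.spatial     = TokenEncoder(in_dim=4, dim=dim)      # e.g., one box
        self.non_spatial = TokenEncoder(in_dim=512, dim=dim)    # e.g., style

    def forward(self, sketch, box, style):
        feat_map = self.image_form(sketch)                       # (B, dim, H, W)
        tokens = torch.cat([self.spatial(box),
                            self.non_spatial(style)], dim=1)     # (B, 16, dim)
        return feat_map, tokens


# Only the new components above would be trained; the base generative model
# (e.g., Stable Diffusion) stays frozen, roughly:
#   for p in base_unet.parameters():
#       p.requires_grad_(False)
```

Under this reading, adding a new modality amounts to attaching one more encoder of the appropriate channel type, which is how the abstract's claim of a scalable, composable framework without retraining the backbone could be realized.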
Related papers
- TweedieMix: Improving Multi-Concept Fusion for Diffusion-based Image/Video Generation [67.97044071594257]
TweedieMix is a novel method for composing customized diffusion models.
Our framework can be effortlessly extended to image-to-video diffusion models.
arXiv Detail & Related papers (2024-10-08T01:06:01Z)
- Diffusion Models For Multi-Modal Generative Modeling [32.61765315067488]
We propose a principled way to define a diffusion model by constructing a unified multi-modal diffusion model in a common diffusion space.
We propose several multimodal generation settings to verify our framework, including image transition, masked-image training, joint image-label and joint image-representation generative modeling.
arXiv Detail & Related papers (2024-07-24T18:04:17Z)
- MaxFusion: Plug&Play Multi-Modal Generation in Text-to-Image Diffusion Models [34.611309081801345]
Large diffusion-based Text-to-Image (T2I) models have shown impressive generative powers for text-to-image generation.
In this paper, we propose a novel strategy to scale a generative model across new tasks with minimal compute.
arXiv Detail & Related papers (2024-04-15T17:55:56Z)
- Diffusion Cocktail: Mixing Domain-Specific Diffusion Models for Diversified Image Generations [7.604214200457584]
Diffusion Cocktail (Ditail) is a training-free method that transfers style and content information between multiple diffusion models.
Ditail offers fine-grained control of the generation process, which enables flexible manipulations of styles and contents.
arXiv Detail & Related papers (2023-12-12T00:53:56Z)
- Kandinsky: an Improved Text-to-Image Synthesis with Image Prior and Latent Diffusion [50.59261592343479]
We present Kandinsky, a novel exploration of latent diffusion architecture.
The proposed model is trained separately to map text embeddings to image embeddings of CLIP.
We also deployed a user-friendly demo system that supports diverse generative modes such as text-to-image generation, image fusion, text and image fusion, image variations generation, and text-guided inpainting/outpainting.
arXiv Detail & Related papers (2023-10-05T12:29:41Z)
- BLIP-Diffusion: Pre-trained Subject Representation for Controllable Text-to-Image Generation and Editing [73.74570290836152]
BLIP-Diffusion is a new subject-driven image generation model that supports multimodal control.
Unlike other subject-driven generation models, BLIP-Diffusion introduces a new multimodal encoder which is pre-trained to provide subject representation.
arXiv Detail & Related papers (2023-05-24T04:51:04Z)
- LLM-grounded Diffusion: Enhancing Prompt Understanding of Text-to-Image Diffusion Models with Large Language Models [62.75006608940132]
This work proposes to enhance prompt understanding capabilities in text-to-image diffusion models.
Our method leverages a pretrained large language model for grounded generation in a novel two-stage process.
Our method significantly outperforms the base diffusion model and several strong baselines in accurately generating images.
arXiv Detail & Related papers (2023-05-23T03:59:06Z)
- SinDiffusion: Learning a Diffusion Model from a Single Natural Image [159.4285444680301]
We present SinDiffusion, leveraging denoising diffusion models to capture the internal distribution of patches from a single natural image.
It is based on two core designs. First, SinDiffusion is trained with a single model at a single scale instead of multiple models with progressive growing of scales.
Second, we identify that a patch-level receptive field of the diffusion network is crucial and effective for capturing the image's patch statistics.
arXiv Detail & Related papers (2022-11-22T18:00:03Z)
- Versatile Diffusion: Text, Images and Variations All in One Diffusion Model [76.89932822375208]
Versatile Diffusion handles multiple flows of text-to-image, image-to-text, and variations in one unified model.
Our code and models are open-sourced at https://github.com/SHI-Labs/Versatile-Diffusion.
arXiv Detail & Related papers (2022-11-15T17:44:05Z)
This list is automatically generated from the titles and abstracts of the papers on this site.