MultiDiffusion: Fusing Diffusion Paths for Controlled Image Generation
- URL: http://arxiv.org/abs/2302.08113v1
- Date: Thu, 16 Feb 2023 06:28:29 GMT
- Title: MultiDiffusion: Fusing Diffusion Paths for Controlled Image Generation
- Authors: Omer Bar-Tal, Lior Yariv, Yaron Lipman, Tali Dekel
- Abstract summary: MultiDiffusion is a unified framework that enables versatile and controllable image generation.
We show that MultiDiffusion can be readily applied to generate high quality and diverse images.
- Score: 34.61940502872307
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent advances in text-to-image generation with diffusion models present
transformative capabilities in image quality. However, user controllability of
the generated image and fast adaptation to new tasks still remain an open
challenge, currently addressed mostly by costly and lengthy re-training and
fine-tuning, or by ad-hoc adaptations to specific image generation tasks. In this
work, we present MultiDiffusion, a unified framework that enables versatile and
controllable image generation, using a pre-trained text-to-image diffusion
model, without any further training or finetuning. At the center of our
approach is a new generation process, based on an optimization task that binds
together multiple diffusion generation processes with a shared set of
parameters or constraints. We show that MultiDiffusion can be readily applied
to generate high quality and diverse images that adhere to user-provided
controls, such as desired aspect ratio (e.g., panorama), and spatial guiding
signals, ranging from tight segmentation masks to bounding boxes. Project
webpage: https://multidiffusion.github.io
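As a rough illustration of the generation process described in the abstract: at each denoising step, the pretrained model can be applied to overlapping crops of a larger latent, and the per-crop predictions reconciled by per-pixel averaging, which is the closed-form solution of a least-squares fusion objective. The sketch below is a minimal, illustrative version of such a fusion step; `denoise_step` is an assumed placeholder for one reverse step of any pretrained text-to-image diffusion model, not the authors' reference code.

```python
import torch

def multidiffusion_step(latent, t, prompt_emb, denoise_step, window=64, stride=48):
    """One fused reverse-diffusion step over a latent wider than the model's native width."""
    fused = torch.zeros_like(latent)
    counts = torch.zeros_like(latent)
    width = latent.shape[-1]
    starts = list(range(0, max(width - window, 0) + 1, stride))
    if width > window and starts[-1] != width - window:
        starts.append(width - window)           # let the last crop reach the right edge
    for x0 in starts:
        crop = latent[..., x0:x0 + window]
        out = denoise_step(crop, t, prompt_emb)  # one ordinary diffusion path per crop
        fused[..., x0:x0 + window] += out
        counts[..., x0:x0 + window] += 1
    return fused / counts.clamp(min=1)           # average wherever crops overlap
```

Repeating such a fused step over the full sampling schedule is what lets a model trained at a fixed resolution produce, e.g., the panoramas mentioned above; spatial controls such as masks can be accommodated by weighting the same averages per region.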
Related papers
- A Simple Approach to Unifying Diffusion-based Conditional Generation [63.389616350290595]
We introduce a simple, unified framework to handle diverse conditional generation tasks.
Our approach enables versatile capabilities via different inference-time sampling schemes.
Our model supports additional capabilities like non-spatially aligned and coarse conditioning.
arXiv Detail & Related papers (2024-10-15T09:41:43Z)
- MM2Latent: Text-to-facial image generation and editing in GANs with multimodal assistance [32.70801495328193]
We propose a practical framework - MM2Latent - for multimodal image generation and editing.
We use StyleGAN2 as our image generator, FaRL for text encoding, and train autoencoders for spatial modalities such as mask, sketch and 3DMM.
Our method exhibits superior performance in multimodal image generation, surpassing recent GAN- and diffusion-based methods.
arXiv Detail & Related papers (2024-09-17T09:21:07Z)
- Many-to-many Image Generation with Auto-regressive Diffusion Models [59.5041405824704]
This paper introduces a domain-general framework for many-to-many image generation, capable of producing interrelated image series from a given set of images.
We present MIS, a novel large-scale multi-image dataset containing 12M synthetic multi-image samples, each with 25 interconnected images.
We learn M2M, an autoregressive model for many-to-many generation, where each image is modeled within a diffusion framework.
arXiv Detail & Related papers (2024-04-03T23:20:40Z)
- MM-Diff: High-Fidelity Image Personalization via Multi-Modal Condition Integration [7.087475633143941]
MM-Diff is a tuning-free image personalization framework capable of generating high-fidelity images of both single and multiple subjects in seconds.
MM-Diff employs a vision encoder to transform the input image into CLS and patch embeddings.
The CLS embeddings augment the text embeddings and, together with the patch embeddings, are used to derive a small number of detail-rich subject embeddings.
arXiv Detail & Related papers (2024-03-22T09:32:31Z)
- StreamMultiDiffusion: Real-Time Interactive Generation with Region-Based Semantic Control [43.04874003852966]
StreamMultiDiffusion is the first real-time region-based text-to-image generation framework.
Our solution opens up a new paradigm for interactive image generation named semantic palette.
arXiv Detail & Related papers (2024-03-14T02:51:01Z)
- Real-World Image Variation by Aligning Diffusion Inversion Chain [53.772004619296794]
A domain gap exists between generated images and real-world images, which poses a challenge in generating high-quality variations of real-world images.
We propose a novel inference pipeline called Real-world Image Variation by ALignment (RIVAL).
Our pipeline enhances the generation quality of image variations by aligning the image generation process to the source image's inversion chain.
arXiv Detail & Related papers (2023-05-30T04:09:47Z)
- BLIP-Diffusion: Pre-trained Subject Representation for Controllable Text-to-Image Generation and Editing [73.74570290836152]
BLIP-Diffusion is a new subject-driven image generation model that supports multimodal control.
Unlike other subject-driven generation models, BLIP-Diffusion introduces a new multimodal encoder which is pre-trained to provide subject representation.
arXiv Detail & Related papers (2023-05-24T04:51:04Z)
- Unified Multi-Modal Latent Diffusion for Joint Subject and Text Conditional Image Generation [63.061871048769596]
We present a novel Unified Multi-Modal Latent Diffusion (UMM-Diffusion) which takes joint texts and images containing specified subjects as input sequences.
To be more specific, both input texts and images are encoded into one unified multi-modal latent space.
Our method generates high-quality images whose complex semantics are drawn from both the input texts and the input images.
arXiv Detail & Related papers (2023-03-16T13:50:20Z)
- Versatile Diffusion: Text, Images and Variations All in One Diffusion Model [76.89932822375208]
Versatile Diffusion handles multiple flows of text-to-image, image-to-text, and variations in one unified model.
Our code and models are open-sourced at https://github.com/SHI-Labs/Versatile-Diffusion.
arXiv Detail & Related papers (2022-11-15T17:44:05Z)
- DiffusionCLIP: Text-guided Image Manipulation Using Diffusion Models [33.79188588182528]
We present DiffusionCLIP, a novel method that performs text-driven image manipulation with diffusion models using a Contrastive Language-Image Pre-training (CLIP) loss (a minimal sketch of such a loss appears after this list).
Our method performs comparably to modern GAN-based image processing methods on in-domain and out-of-domain image processing tasks.
Our method can be easily used for various novel applications, enabling image translation from an unseen domain to another unseen domain or stroke-conditioned image generation in an unseen domain.
arXiv Detail & Related papers (2021-10-06T12:59:39Z)
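The DiffusionCLIP entry above steers edits with a CLIP loss. Below is a minimal sketch of a directional CLIP loss of the kind commonly used for such text-guided editing, in which the change in image embedding is encouraged to align with the change in text embedding. It assumes the OpenAI `clip` package and already-preprocessed image tensors, and is an illustrative reconstruction rather than the paper's reference implementation.

```python
import torch
import clip  # OpenAI CLIP package (https://github.com/openai/CLIP)

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)  # preprocess prepares PIL images

def directional_clip_loss(src_img, edited_img, src_text, tgt_text):
    """1 - cosine similarity between the image-edit direction and the text-edit
    direction in CLIP space; src_img/edited_img are preprocessed image tensors."""
    with torch.no_grad():  # the text direction is fixed during editing
        tokens = clip.tokenize([src_text, tgt_text]).to(device)
        text_feats = model.encode_text(tokens).float()
    images = torch.stack([src_img, edited_img]).to(device=device, dtype=model.dtype)
    img_feats = model.encode_image(images).float()
    delta_t = text_feats[1] - text_feats[0]
    delta_i = img_feats[1] - img_feats[0]
    return 1 - torch.nn.functional.cosine_similarity(delta_i, delta_t, dim=-1)
```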
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences of its use.