BLIP-Diffusion: Pre-trained Subject Representation for Controllable
Text-to-Image Generation and Editing
- URL: http://arxiv.org/abs/2305.14720v2
- Date: Thu, 22 Jun 2023 02:36:06 GMT
- Title: BLIP-Diffusion: Pre-trained Subject Representation for Controllable
Text-to-Image Generation and Editing
- Authors: Dongxu Li, Junnan Li, Steven C.H. Hoi
- Abstract summary: BLIP-Diffusion is a new subject-driven image generation model that supports multimodal control.
Unlike other subject-driven generation models, BLIP-Diffusion introduces a new multimodal encoder which is pre-trained to provide subject representation.
- Score: 73.74570290836152
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Subject-driven text-to-image generation models create novel renditions of an
input subject based on text prompts. Existing models suffer from lengthy
fine-tuning and difficulty preserving subject fidelity. To overcome these
limitations, we introduce BLIP-Diffusion, a new subject-driven image generation
model that supports multimodal control, consuming subject images and text
prompts as inputs. Unlike other subject-driven generation models, BLIP-Diffusion
introduces a new multimodal encoder which is pre-trained to provide the subject
representation. We first pre-train the multimodal encoder following BLIP-2 to
produce a visual representation aligned with the text. We then design a subject
representation learning task that enables a diffusion model to leverage this
visual representation and generate new subject renditions. Compared with
previous methods such as DreamBooth, our model enables zero-shot subject-driven
generation and efficient fine-tuning for customized subjects with up to 20x
speedup. We also demonstrate that BLIP-Diffusion can be flexibly combined with
existing techniques such as ControlNet and prompt-to-prompt to enable novel
subject-driven generation and editing applications. Code and models will be
released at
https://github.com/salesforce/LAVIS/tree/main/projects/blip-diffusion. Project
page at https://dxli94.github.io/BLIP-Diffusion-website/.
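The abstract describes the mechanism only at a high level: a BLIP-2-style multimodal encoder turns the subject image (plus its category text) into a compact subject representation that the diffusion model consumes alongside the text prompt. The minimal PyTorch sketch below illustrates that idea; all class names, dimensions, and tensor shapes are illustrative assumptions, not the released BLIP-Diffusion code (see the linked repository for the official implementation).

```python
# Conceptual sketch (not the released implementation): a BLIP-2-style
# multimodal encoder maps (subject image features, subject category text)
# to a fixed number of subject query embeddings, which are appended to the
# prompt embedding before being fed to the denoiser's cross-attention.
# All module names and dimensions below are illustrative.
import torch
import torch.nn as nn

class SubjectEncoder(nn.Module):
    """Stand-in for the pre-trained multimodal (BLIP-2-style) encoder."""
    def __init__(self, img_dim=1024, txt_dim=768, num_queries=16):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, txt_dim))
        self.attn = nn.MultiheadAttention(txt_dim, num_heads=8, batch_first=True)
        self.img_proj = nn.Linear(img_dim, txt_dim)

    def forward(self, image_feats, category_emb):
        # Learnable queries attend over image features and category tokens;
        # the output lives in the same space as the text embedding.
        kv = torch.cat([self.img_proj(image_feats), category_emb], dim=1)
        q = self.queries.unsqueeze(0).expand(image_feats.size(0), -1, -1)
        subject_emb, _ = self.attn(q, kv, kv)
        return subject_emb  # (B, num_queries, txt_dim)

def build_diffusion_context(prompt_emb, subject_emb):
    """Concatenate prompt tokens and subject tokens into one conditioning
    sequence for the diffusion model's cross-attention."""
    return torch.cat([prompt_emb, subject_emb], dim=1)

# Toy usage with random tensors standing in for real encoder outputs.
B = 1
image_feats = torch.randn(B, 257, 1024)   # e.g. ViT patch features
category_emb = torch.randn(B, 4, 768)     # e.g. embedded "dog" tokens
prompt_emb = torch.randn(B, 77, 768)      # e.g. CLIP text embedding

encoder = SubjectEncoder()
ctx = build_diffusion_context(prompt_emb, encoder(image_feats, category_emb))
print(ctx.shape)  # torch.Size([1, 93, 768])
```

Because the subject tokens share the prompt's embedding space, producing them for a new subject requires only a forward pass of the encoder, which is consistent with the paper's claim of zero-shot subject-driven generation.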
Related papers
- ACE: All-round Creator and Editor Following Instructions via Diffusion Transformer [40.32254040909614]
We propose ACE, an All-round Creator and Editor, for visual generation tasks.
We first introduce a unified condition format termed Long-context Condition Unit (LCU).
We then propose a novel Transformer-based diffusion model that uses LCU as input, aiming for joint training across various generation and editing tasks.
arXiv Detail & Related papers (2024-09-30T17:56:27Z)
- Generating Images with Multimodal Language Models [78.6660334861137]
We propose a method to fuse frozen text-only large language models with pre-trained image encoder and decoder models.
Our model demonstrates a wide suite of multimodal capabilities: image retrieval, novel image generation, and multimodal dialogue.
arXiv Detail & Related papers (2023-05-26T19:22:03Z)
- LLM-grounded Diffusion: Enhancing Prompt Understanding of Text-to-Image Diffusion Models with Large Language Models [62.75006608940132]
This work proposes to enhance prompt understanding capabilities in text-to-image diffusion models.
Our method leverages a pretrained large language model for grounded generation in a novel two-stage process.
Our method significantly outperforms the base diffusion model and several strong baselines in accurately generating images.
arXiv Detail & Related papers (2023-05-23T03:59:06Z)
- In-Context Learning Unlocked for Diffusion Models [163.54453915874402]
We present Prompt Diffusion, a framework for enabling in-context learning in diffusion-based generative models.
We propose a vision-language prompt that can model a wide range of vision-language tasks and a diffusion model that takes it as input.
The resulting Prompt Diffusion model is the first diffusion-based vision-language foundation model capable of in-context learning.
arXiv Detail & Related papers (2023-05-01T23:03:37Z)
- GlueGen: Plug and Play Multi-modal Encoders for X-to-image Generation [143.81719619351335]
Text-to-image (T2I) models based on diffusion processes have achieved remarkable success in controllable image generation using user-provided captions.
The tight coupling between the text encoder and the image decoder in current T2I models makes the encoder challenging to replace or upgrade.
We propose GlueGen, which applies a newly proposed GlueNet model to align features from single-modal or multi-modal encoders with the latent space of an existing T2I model.
arXiv Detail & Related papers (2023-03-17T15:37:07Z)
- eDiffi: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers [87.52504764677226]
Large-scale diffusion-based generative models have led to breakthroughs in text-conditioned high-resolution image synthesis.
We train an ensemble of text-to-image diffusion models, each specialized for a different stage of the synthesis process.
Our ensemble of diffusion models, called eDiffi, results in improved text alignment while maintaining the same inference cost; a minimal routing sketch follows after this entry.
arXiv Detail & Related papers (2022-11-02T17:43:04Z)
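The eDiffi summary above compresses the key idea: the denoising trajectory is split into noise-level intervals, each handled by its own denoiser ("expert"), so exactly one expert runs per sampling step and inference cost matches a single model. The sketch below illustrates only that routing logic with hypothetical expert modules and interval boundaries; it is not the authors' implementation.

```python
# Illustrative sketch of eDiffi-style expert routing: the timestep range is
# partitioned into intervals and each interval is handled by its own denoiser,
# so one expert is evaluated per step. Experts and boundaries are stand-ins.
import torch
import torch.nn as nn

class TinyDenoiser(nn.Module):
    """Placeholder denoiser; a real expert would be a full text-conditioned UNet."""
    def __init__(self, channels=4):
        super().__init__()
        self.net = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, x, t, text_emb):
        return self.net(x)  # predicted noise; conditioning omitted for brevity

class ExpertEnsemble(nn.Module):
    def __init__(self, boundaries=(1000, 600, 200, 0)):
        super().__init__()
        # boundaries[i] >= t > boundaries[i+1]  ->  expert i
        self.boundaries = boundaries
        self.experts = nn.ModuleList(TinyDenoiser() for _ in range(len(boundaries) - 1))

    def forward(self, x, t, text_emb):
        for i in range(len(self.boundaries) - 1):
            if self.boundaries[i] >= t > self.boundaries[i + 1]:
                return self.experts[i](x, t, text_emb)
        return self.experts[-1](x, t, text_emb)  # t == 0 edge case

# Toy denoising loop: one expert call per timestep.
ensemble = ExpertEnsemble()
x = torch.randn(1, 4, 64, 64)          # latent being denoised
text_emb = torch.randn(1, 77, 768)     # prompt embedding (unused by the stub)
for t in range(999, -1, -50):
    x = x - ensemble(x, t, text_emb)   # schematic update, not a real sampler
```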
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.