GlueGen: Plug and Play Multi-modal Encoders for X-to-image Generation
- URL: http://arxiv.org/abs/2303.10056v2
- Date: Thu, 2 Nov 2023 15:19:12 GMT
- Title: GlueGen: Plug and Play Multi-modal Encoders for X-to-image Generation
- Authors: Can Qin, Ning Yu, Chen Xing, Shu Zhang, Zeyuan Chen, Stefano Ermon,
Yun Fu, Caiming Xiong, Ran Xu
- Abstract summary: Text-to-image (T2I) models based on diffusion processes have achieved remarkable success in controllable image generation using user-provided captions.
The tight coupling between the current text encoder and image decoder in T2I models makes replacing or upgrading either component challenging.
We propose GlueGen, which applies a newly proposed GlueNet model to align features from single-modal or multi-modal encoders with the latent space of an existing T2I model.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Text-to-image (T2I) models based on diffusion processes have achieved
remarkable success in controllable image generation using user-provided
captions. However, the tight coupling between the current text encoder and
image decoder in T2I models makes replacing or upgrading either component
challenging. Such changes often require massive fine-tuning or even training
from scratch, at prohibitive expense. To address this problem, we propose GlueGen, which
applies a newly proposed GlueNet model to align features from single-modal or
multi-modal encoders with the latent space of an existing T2I model. The
approach introduces a new training objective that leverages parallel corpora to
align the representation spaces of different encoders. Empirical results show
that GlueNet can be trained efficiently and enables various capabilities beyond
previous state-of-the-art models: 1) multilingual language models such as
XLM-Roberta can be aligned with existing T2I models, allowing for the
generation of high-quality images from captions beyond English; 2) GlueNet can
align multi-modal encoders such as AudioCLIP with the Stable Diffusion model,
enabling sound-to-image generation; 3) it can also upgrade the current text
encoder of the latent diffusion model for challenging case generation. By
aligning various feature representations, GlueNet allows for flexible and
efficient integration of new functionality into existing T2I models and
sheds light on X-to-image (X2I) generation.
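The core idea of the abstract, aligning a new encoder's representation space with the one a frozen T2I model already expects, using paired data, can be illustrated with a minimal numpy sketch. This is an illustrative toy, not the paper's actual GlueNet architecture or training recipe: the linear adapter, the dimensions, and the plain MSE objective are all simplifying assumptions.

```python
import numpy as np

# Hypothetical sketch of encoder alignment: a linear adapter W maps
# features from a new source encoder (dim d_src) into the space the
# frozen T2I model's original text encoder produces (dim d_tgt).
# On a parallel corpus, each source feature vector is pushed toward
# the frozen encoder's feature vector for the paired input.
rng = np.random.default_rng(0)
d_src, d_tgt, n = 8, 6, 32

W = rng.normal(scale=0.1, size=(d_src, d_tgt))  # adapter weights (trainable)
src = rng.normal(size=(n, d_src))               # new-encoder features
tgt = rng.normal(size=(n, d_tgt))               # frozen-encoder targets

def mse(W):
    """Alignment loss: mean squared error between adapted and target features."""
    return np.mean((src @ W - tgt) ** 2)

# One gradient-descent step on the alignment objective; the T2I model
# itself stays frozen -- only the adapter W is updated.
grad = 2.0 / (n * d_tgt) * src.T @ (src @ W - tgt)
W_new = W - 0.1 * grad
```

In GlueGen this role is played by the (nonlinear) GlueNet model, which lets encoders such as XLM-Roberta or AudioCLIP drive an unchanged Stable Diffusion decoder; the sketch above only shows the alignment principle.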
Related papers
- StreamMultiDiffusion: Real-Time Interactive Generation with Region-Based Semantic Control [43.04874003852966]
StreamMultiDiffusion is the first real-time region-based text-to-image generation framework.
Our solution opens up a new paradigm for interactive image generation named semantic palette.
arXiv Detail & Related papers (2024-03-14T02:51:01Z)
- Mini-DALLE3: Interactive Text to Image by Prompting Large Language Models [71.49054220807983]
A prevalent limitation persists in effective communication with T2I models, such as Stable Diffusion, using natural language descriptions.
Inspired by the recently released DALLE3, we revisit existing T2I systems, endeavoring to align them with human intent, and introduce a new task: interactive text to image (iT2I).
We present a simple approach that augments LLMs for iT2I with prompting techniques and off-the-shelf T2I models.
arXiv Detail & Related papers (2023-10-11T16:53:40Z)
- Scaling Autoregressive Multi-Modal Models: Pretraining and Instruction Tuning [115.50132185963139]
CM3Leon is a decoder-only multi-modal language model capable of generating and infilling both text and images.
It is the first multi-modal model trained with a recipe adapted from text-only language models.
CM3Leon achieves state-of-the-art performance in text-to-image generation with 5x less training compute than comparable methods.
arXiv Detail & Related papers (2023-09-05T21:27:27Z)
- Generating Images with Multimodal Language Models [78.6660334861137]
We propose a method to fuse frozen text-only large language models with pre-trained image encoder and decoder models.
Our model demonstrates a wide suite of multimodal capabilities: image retrieval, novel image generation, and multimodal dialogue.
arXiv Detail & Related papers (2023-05-26T19:22:03Z)
- BLIP-Diffusion: Pre-trained Subject Representation for Controllable Text-to-Image Generation and Editing [73.74570290836152]
BLIP-Diffusion is a new subject-driven image generation model that supports multimodal control.
Unlike other subject-driven generation models, BLIP-Diffusion introduces a new multimodal encoder which is pre-trained to provide subject representation.
arXiv Detail & Related papers (2023-05-24T04:51:04Z)
- LLM-grounded Diffusion: Enhancing Prompt Understanding of Text-to-Image Diffusion Models with Large Language Models [62.75006608940132]
This work proposes to enhance prompt understanding capabilities in text-to-image diffusion models.
Our method leverages a pretrained large language model for grounded generation in a novel two-stage process.
Our method significantly outperforms the base diffusion model and several strong baselines in accurately generating images.
arXiv Detail & Related papers (2023-05-23T03:59:06Z)
- Tune-A-Video: One-Shot Tuning of Image Diffusion Models for Text-to-Video Generation [31.882356164068753]
To reproduce the success of text-to-image (T2I) generation, recent works in text-to-video (T2V) generation employ massive datasets for training.
We propose Tune-A-Video, which is capable of producing temporally coherent videos across various applications.
arXiv Detail & Related papers (2022-12-22T09:43:36Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences arising from its use.