GlueGen: Plug and Play Multi-modal Encoders for X-to-image Generation
- URL: http://arxiv.org/abs/2303.10056v2
- Date: Thu, 2 Nov 2023 15:19:12 GMT
- Title: GlueGen: Plug and Play Multi-modal Encoders for X-to-image Generation
- Authors: Can Qin, Ning Yu, Chen Xing, Shu Zhang, Zeyuan Chen, Stefano Ermon,
Yun Fu, Caiming Xiong, Ran Xu
- Abstract summary: Text-to-image (T2I) models based on diffusion processes have achieved remarkable success in controllable image generation using user-provided captions.
The tight coupling between the current text encoder and image decoder in T2I models makes replacing or upgrading either component challenging.
We propose GlueGen, which applies a newly proposed GlueNet model to align features from single-modal or multi-modal encoders with the latent space of an existing T2I model.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Text-to-image (T2I) models based on diffusion processes have achieved
remarkable success in controllable image generation using user-provided
captions. However, the tight coupling between the current text encoder and
image decoder in T2I models makes replacing or upgrading either component
challenging. Such changes often require massive fine-tuning or even training
from scratch, at prohibitive expense. To address this problem, we propose GlueGen, which
applies a newly proposed GlueNet model to align features from single-modal or
multi-modal encoders with the latent space of an existing T2I model. The
approach introduces a new training objective that leverages parallel corpora to
align the representation spaces of different encoders. Empirical results show
that GlueNet can be trained efficiently and enables various capabilities beyond
previous state-of-the-art models: 1) multilingual language models such as
XLM-Roberta can be aligned with existing T2I models, allowing for the
generation of high-quality images from captions beyond English; 2) GlueNet can
align multi-modal encoders such as AudioCLIP with the Stable Diffusion model,
enabling sound-to-image generation; 3) it can also upgrade the current text
encoder of the latent diffusion model for challenging case generation. By
aligning various feature representations, GlueNet allows for flexible and
efficient integration of new functionality into existing T2I models and
sheds light on X-to-image (X2I) generation.
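The core idea of the abstract, aligning a new encoder's representation space with the one a frozen T2I model already expects, using paired data, can be illustrated with a minimal numpy sketch. This is an illustrative toy, not the paper's actual GlueNet architecture or training recipe: the linear adapter, the dimensions, and the plain MSE objective are all simplifying assumptions.

```python
import numpy as np

# Hypothetical sketch of encoder alignment: a linear adapter W maps
# features from a new source encoder (dim d_src) into the space the
# frozen T2I model's original text encoder produces (dim d_tgt).
# On a parallel corpus, each source feature vector is pushed toward
# the frozen encoder's feature vector for the paired input.
rng = np.random.default_rng(0)
d_src, d_tgt, n = 8, 6, 32

W = rng.normal(scale=0.1, size=(d_src, d_tgt))  # adapter weights (trainable)
src = rng.normal(size=(n, d_src))               # new-encoder features
tgt = rng.normal(size=(n, d_tgt))               # frozen-encoder targets

def mse(W):
    """Alignment loss: mean squared error between adapted and target features."""
    return np.mean((src @ W - tgt) ** 2)

# One gradient-descent step on the alignment objective; the T2I model
# itself stays frozen -- only the adapter W is updated.
grad = 2.0 / (n * d_tgt) * src.T @ (src @ W - tgt)
W_new = W - 0.1 * grad
```

In GlueGen this role is played by the (nonlinear) GlueNet model, which lets encoders such as XLM-Roberta or AudioCLIP drive an unchanged Stable Diffusion decoder; the sketch above only shows the alignment principle.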
Related papers
- StreamMultiDiffusion: Real-Time Interactive Generation with Region-Based Semantic Control [43.04874003852966]
StreamMultiDiffusion is the first real-time region-based text-to-image generation framework.
Our solution opens up a new paradigm for interactive image generation named semantic palette.
arXiv Detail & Related papers (2024-03-14T02:51:01Z)
- Mini-DALLE3: Interactive Text to Image by Prompting Large Language Models [71.49054220807983]
A prevalent limitation persists in effective communication with T2I models, such as Stable Diffusion, using natural language descriptions.
Inspired by the recently released DALLE3, we revisit existing T2I systems, endeavoring to align them with human intent, and introduce a new task: interactive text to image (iT2I).
We present a simple approach that augments LLMs for iT2I with prompting techniques and off-the-shelf T2I models.
arXiv Detail & Related papers (2023-10-11T16:53:40Z)
- Scaling Autoregressive Multi-Modal Models: Pretraining and Instruction Tuning [115.50132185963139]
CM3Leon is a decoder-only multi-modal language model capable of generating and infilling both text and images.
It is the first multi-modal model trained with a recipe adapted from text-only language models.
CM3Leon achieves state-of-the-art performance in text-to-image generation with 5x less training compute than comparable methods.
arXiv Detail & Related papers (2023-09-05T21:27:27Z)
- Generating Images with Multimodal Language Models [78.6660334861137]
We propose a method to fuse frozen text-only large language models with pre-trained image encoder and decoder models.
Our model demonstrates a wide suite of multimodal capabilities: image retrieval, novel image generation, and multimodal dialogue.
arXiv Detail & Related papers (2023-05-26T19:22:03Z)
- BLIP-Diffusion: Pre-trained Subject Representation for Controllable Text-to-Image Generation and Editing [73.74570290836152]
BLIP-Diffusion is a new subject-driven image generation model that supports multimodal control.
Unlike other subject-driven generation models, BLIP-Diffusion introduces a new multimodal encoder which is pre-trained to provide subject representation.
arXiv Detail & Related papers (2023-05-24T04:51:04Z)
- LLM-grounded Diffusion: Enhancing Prompt Understanding of Text-to-Image Diffusion Models with Large Language Models [62.75006608940132]
This work proposes to enhance prompt understanding capabilities in text-to-image diffusion models.
Our method leverages a pretrained large language model for grounded generation in a novel two-stage process.
Our method significantly outperforms the base diffusion model and several strong baselines in accurately generating images.
arXiv Detail & Related papers (2023-05-23T03:59:06Z)
- Tune-A-Video: One-Shot Tuning of Image Diffusion Models for Text-to-Video Generation [31.882356164068753]
To reproduce the success of text-to-image (T2I) generation, recent works in text-to-video (T2V) generation employ massive datasets for training.
We propose Tune-A-Video, which is capable of producing temporally coherent videos across various applications.
arXiv Detail & Related papers (2022-12-22T09:43:36Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences arising from its use.