SUR-adapter: Enhancing Text-to-Image Pre-trained Diffusion Models with
Large Language Models
- URL: http://arxiv.org/abs/2305.05189v4
- Date: Wed, 29 Nov 2023 08:18:14 GMT
- Title: SUR-adapter: Enhancing Text-to-Image Pre-trained Diffusion Models with
Large Language Models
- Authors: Shanshan Zhong, Zhongzhan Huang, Wushao Wen, Jinghui Qin, Liang Lin
- Abstract summary: We propose a simple-yet-effective parameter-efficient fine-tuning approach called the Semantic Understanding and Reasoning adapter (SUR-adapter) for pre-trained diffusion models.
Our approach can make text-to-image diffusion models easier to use with a better user experience.
- Score: 56.88192537044364
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Diffusion models, which have emerged as popular text-to-image
generation models, can produce high-quality and content-rich images guided by
textual prompts. However, existing models have limited semantic understanding
and commonsense reasoning when the input prompts are concise narratives,
resulting in low-quality image generation. To improve their handling of
narrative prompts, we propose a simple-yet-effective parameter-efficient
fine-tuning approach called the Semantic Understanding and Reasoning adapter
(SUR-adapter) for pre-trained diffusion models. To reach this goal, we first
collect and annotate a new dataset SURD which consists of more than 57,000
semantically corrected multi-modal samples. Each sample contains a simple
narrative prompt, a complex keyword-based prompt, and a high-quality image.
Then, we align the semantic representation of narrative prompts to the complex
prompts and transfer knowledge of large language models (LLMs) to our
SUR-adapter via knowledge distillation so that it can acquire the powerful
semantic understanding and reasoning capabilities to build a high-quality
textual semantic representation for text-to-image generation. We conduct
experiments by integrating multiple LLMs and popular pre-trained diffusion
models to show the effectiveness of our approach in enabling diffusion models
to understand and reason about concise natural language without image quality
degradation. Our approach can make text-to-image diffusion models easier to use
with a better user experience, which demonstrates that it has the potential
to further advance the development of user-friendly text-to-image generation
models by bridging the semantic gap between simple narrative prompts and
complex keyword-based prompts. The code is released at
https://github.com/Qrange-group/SUR-adapter.
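The two training signals named in the abstract (aligning narrative-prompt representations with complex-prompt representations, and distilling LLM knowledge into the adapter) can be pictured with a toy objective. The sketch below is illustrative only: the function names, the use of mean squared error for both terms, the weight `alpha`, and the example vectors are all assumptions, and they are not taken from the paper or its released code.

```python
# Toy illustration only. The MSE terms, the weight `alpha`, and the vectors
# are assumptions, not the objective used in the SUR-adapter paper.

def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def adapter_loss(adapted_narrative, complex_prompt, llm_feature, alpha=0.5):
    """Weighted sum of an alignment term and a distillation term.

    adapted_narrative: adapter output for the simple narrative prompt
    complex_prompt:    embedding of the complex keyword-based prompt
    llm_feature:       LLM representation of the narrative prompt (teacher)
    """
    align = mse(adapted_narrative, complex_prompt)  # pull toward complex prompt
    distill = mse(adapted_narrative, llm_feature)   # pull toward LLM feature
    return alpha * align + (1.0 - alpha) * distill


# Hypothetical 2-dimensional embeddings, chosen only to keep the arithmetic small.
loss = adapter_loss([0.0, 1.0], [0.0, 3.0], [2.0, 1.0], alpha=0.5)
print(loss)
```

In a real setup the vectors would be encoder outputs and the loss would be minimized with respect to the adapter parameters only, which is what makes the approach parameter-efficient; the actual loss terms should be checked against the released repository.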
Related papers
- Conditional Text-to-Image Generation with Reference Guidance [81.99538302576302]
This paper explores using an additional image condition that provides visual guidance on the particular subjects for diffusion models to generate.
We develop several small-scale expert plugins that efficiently endow a Stable Diffusion model with the capability to take different references.
Our expert plugins demonstrate results superior to existing methods on all tasks, each containing only 28.55M trainable parameters.
arXiv Detail & Related papers (2024-11-22T21:38:51Z)
- LLM4GEN: Leveraging Semantic Representation of LLMs for Text-to-Image Generation [30.897935761304034]
We propose a novel framework called LLM4GEN, which enhances the semantic understanding of text-to-image diffusion models.
A specially designed Cross-Adapter Module (CAM) integrates the original text features of text-to-image models with LLM features.
DensePrompts, which contains 7,000 dense prompts, provides a comprehensive evaluation for the text-to-image generation task.
arXiv Detail & Related papers (2024-06-30T15:50:32Z)
- ARTIST: Improving the Generation of Text-rich Images with Disentangled Diffusion Models [52.23899502520261]
We introduce a new framework named ARTIST to focus on the learning of text structures.
We finetune a visual diffusion model, enabling it to assimilate textual structure information from the pretrained textual model.
Empirical results on the MARIO-Eval benchmark underscore the effectiveness of the proposed method, showing an improvement of up to 15% in various metrics.
arXiv Detail & Related papers (2024-06-17T19:31:24Z)
- ELLA: Equip Diffusion Models with LLM for Enhanced Semantic Alignment [20.868216061750402]
We introduce an Efficient Large Language Model Adapter, termed ELLA, which equips text-to-image diffusion models with powerful Large Language Models (LLMs).
Our approach adapts semantic features at different stages of the denoising process, assisting diffusion models in interpreting lengthy and intricate prompts over sampling timesteps.
To assess text-to-image models in dense prompt following, we introduce a challenging benchmark consisting of 1K dense prompts.
arXiv Detail & Related papers (2024-03-08T08:08:10Z)
- Seek for Incantations: Towards Accurate Text-to-Image Diffusion
Synthesis through Prompt Engineering [118.53208190209517]
We propose a framework to learn the proper textual descriptions for diffusion models through prompt learning.
Our method can effectively learn the prompts to improve the matches between the input text and the generated images.
arXiv Detail & Related papers (2024-01-12T03:46:29Z)
- LLM-grounded Diffusion: Enhancing Prompt Understanding of Text-to-Image
Diffusion Models with Large Language Models [62.75006608940132]
This work proposes to enhance prompt understanding capabilities in text-to-image diffusion models.
Our method leverages a pretrained large language model for grounded generation in a novel two-stage process.
Our method significantly outperforms the base diffusion model and several strong baselines in accurately generating images.
arXiv Detail & Related papers (2023-05-23T03:59:06Z)
- Unleashing Text-to-Image Diffusion Models for Visual Perception [84.41514649568094]
VPD (Visual Perception with a pre-trained diffusion model) is a new framework that exploits the semantic information of a pre-trained text-to-image diffusion model in visual perception tasks.
We show that VPD adapts faster to downstream visual perception tasks.
arXiv Detail & Related papers (2023-03-03T18:59:47Z)
This list is automatically generated from the titles and abstracts of the papers in this site.