Related papers: A Comprehensive Study of Decoder-Only LLMs for Text-to-Image Generation

A Comprehensive Study of Decoder-Only LLMs for Text-to-Image Generation

URL: http://arxiv.org/abs/2506.08210v1
Date: Mon, 09 Jun 2025 20:29:53 GMT
Title: A Comprehensive Study of Decoder-Only LLMs for Text-to-Image Generation
Authors: Andrew Z. Wang, Songwei Ge, Tero Karras, Ming-Yu Liu, Yogesh Balaji,
Abstract summary: Many text-to-image models still employ the somewhat outdated T5 and CLIP as their text encoders.<n>We build a standardized training and evaluation pipeline that allows us to isolate and evaluate the effect of different text embeddings.<n>Our experiments reveal that the de facto way of using last-layer embeddings as conditioning leads to inferior performance.
Score: 30.041283605038316
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Both text-to-image generation and large language models (LLMs) have made significant advancements. However, many text-to-image models still employ the somewhat outdated T5 and CLIP as their text encoders. In this work, we investigate the effectiveness of using modern decoder-only LLMs as text encoders for text-to-image diffusion models. We build a standardized training and evaluation pipeline that allows us to isolate and evaluate the effect of different text embeddings. We train a total of 27 text-to-image models with 12 different text encoders to analyze the critical aspects of LLMs that could impact text-to-image generation, including the approaches to extract embeddings, different LLMs variants, and model sizes. Our experiments reveal that the de facto way of using last-layer embeddings as conditioning leads to inferior performance. Instead, we explore embeddings from various layers and find that using layer-normalized averaging across all layers significantly improves alignment with complex prompts. Most LLMs with this conditioning outperform the baseline T5 model, showing enhanced performance in advanced visio-linguistic reasoning skills.

Related papers

Think-Then-Generate: Reasoning-Aware Text-to-Image Diffusion with LLM Encoders [46.79030733172859]
We propose a think-then-rewrite (T2G) paradigm for text-to-image (T2I) diffusion models.<n>We show substantial improvements in factual consistency, semantic alignment, and visual realism across reasoning-based image generation and editing benchmarks.<n>Our results constitute a promising step toward next-generation unified models with reasoning, expression, and demonstration capacities.
arXiv Detail & Related papers (2026-01-15T12:19:05Z)
Multimodal LLMs as Customized Reward Models for Text-to-Image Generation [60.164968941945645]
We introduce LLaVA-Reward, an efficient reward model designed to automatically evaluate text-to-image (T2I) generations across multiple perspectives.<n>LLaVA-Reward directly utilizes the hidden states of multimodal large language models (MLLMs)<n>We train LLaVA-Reward on four evaluation perspectives: text-image alignment, fidelity/artifact, safety, and overall ranking.
arXiv Detail & Related papers (2025-07-28T23:52:53Z)
MLLM-Guided VLM Fine-Tuning with Joint Inference for Zero-Shot Composed Image Retrieval [50.062817677022586]
Zero-Shot Image Retrieval (ZS-CIR) methods typically train adapters that convert reference images into pseudo-text tokens.<n>We propose MLLM-Guided VLM Fine-Tuning with Joint Inference (MVFT-JI) to construct two complementary training tasks using only unlabeled images.
arXiv Detail & Related papers (2025-05-26T08:56:59Z)
Decoder-Only LLMs are Better Controllers for Diffusion Models [63.22040456010123]
We propose to enhance text-to-image diffusion models by borrowing the strength of semantic understanding from large language models.<n>Our adapter module is superior to the stat-of-the-art models in terms of text-to-image generation quality and reliability.
arXiv Detail & Related papers (2025-02-06T12:17:35Z)
HoVLE: Unleashing the Power of Monolithic Vision-Language Models with Holistic Vision-Language Embedding [91.0552157725366]
This paper presents a novel high-performance monolithic VLM named HoVLE.<n>It converts visual and textual inputs into a shared space, allowing LLMs to process images in the same way as texts.<n>Our experiments show that HoVLE achieves performance close to leading compositional models on various benchmarks.
arXiv Detail & Related papers (2024-12-20T18:59:59Z)
Seeing Syntax: Uncovering Syntactic Learning Limitations in Vision-Language Models [18.87130615326443]
Vision-language models (VLMs) serve as foundation models for image captioning and text-to-image generation.<n>Recent studies have highlighted limitations in VLM text encoders, particularly in areas like compositionality and semantic understanding.
arXiv Detail & Related papers (2024-12-11T05:37:04Z)
Mimir: Improving Video Diffusion Models for Precise Text Understanding [53.72393225042688]
Text serves as the key control signal in video generation due to its narrative nature.<n>The recent success of large language models (LLMs) showcases the power of decoder-only transformers.<n>This work addresses this challenge with Mimir, an end-to-end training framework featuring a carefully tailored token fuser.
arXiv Detail & Related papers (2024-12-04T07:26:44Z)
An Empirical Study and Analysis of Text-to-Image Generation Using Large Language Model-Powered Textual Representation [21.154973705998945]
Existing methods leverage the text encoder of the CLIP model to represent input prompts. Large Language Models (LLMs) offer multilingual input, accommodate longer context, and achieve superior text representation. We propose a lightweight adapter that enables fast training of the text-to-image model using the textual representations from LLMs.
arXiv Detail & Related papers (2024-05-21T16:35:02Z)
LASER: Tuning-Free LLM-Driven Attention Control for Efficient Text-conditioned Image-to-Animation [52.16008431411513]
LASER is a tuning-free LLM-driven attention control framework.<n>We propose a Text-conditioned Image-to-Animation Benchmark to validate the effectiveness and efficacy of LASER.
arXiv Detail & Related papers (2024-04-21T07:13:56Z)
LLM2Vec: Large Language Models Are Secretly Powerful Text Encoders [34.421335513040795]
Large decoder-only language models (LLMs) are the state-of-the-art models on most of today's NLP tasks and benchmarks. We introduce LLM2Vec, a simple unsupervised approach that can transform any decoder-only LLM into a strong text encoder.
arXiv Detail & Related papers (2024-04-09T02:51:05Z)
Aligned with LLM: a new multi-modal training paradigm for encoding fMRI activity in visual cortex [4.57590454144072]
Recently, there has been a surge in the popularity of pre trained large language models (LLMs) This paper proposes a new multi-modal training paradigm, aligning with LLM, encoding fMRI activity in visual cortex.
arXiv Detail & Related papers (2024-01-08T12:30:23Z)
MLLMs-Augmented Visual-Language Representation Learning [70.5293060238008]
We demonstrate that Multi-modal Large Language Models (MLLMs) can enhance visual-language representation learning. Our approach is simple, utilizing MLLMs to extend multiple diverse captions for each image. We propose "text shearing" to maintain the quality and availability of extended captions.
arXiv Detail & Related papers (2023-11-30T18:05:52Z)

This list is automatically generated from the titles and abstracts of the papers in this site.