Related papers: Self-correcting LLM-controlled Diffusion Models

URL: http://arxiv.org/abs/2311.16090v1
Date: Mon, 27 Nov 2023 18:56:37 GMT
Title: Self-correcting LLM-controlled Diffusion Models
Authors: Tsung-Han Wu, Long Lian, Joseph E. Gonzalez, Boyi Li, Trevor Darrell
Abstract summary: We introduce Self-correcting LLM-controlled Diffusion (SLD). SLD is a framework that generates an image from the input prompt, assesses its alignment with the prompt, and performs self-corrections on the inaccuracies in the generated image. Our approach can rectify a majority of incorrect generations, particularly in generative numeracy, attribute binding, and spatial relationships.
Score: 83.26605445217334
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Text-to-image generation has witnessed significant progress with the advent of diffusion models. Despite the ability to generate photorealistic images, current text-to-image diffusion models still often struggle to accurately interpret and follow complex input text prompts. In contrast to existing models that aim to generate images only with their best effort, we introduce Self-correcting LLM-controlled Diffusion (SLD). SLD is a framework that generates an image from the input prompt, assesses its alignment with the prompt, and performs self-corrections on the inaccuracies in the generated image. Steered by an LLM controller, SLD turns text-to-image generation into an iterative closed-loop process, ensuring correctness in the resulting image. SLD is not only training-free but can also be seamlessly integrated with diffusion models behind API access, such as DALL-E 3, to further boost the performance of state-of-the-art diffusion models. Experimental results show that our approach can rectify a majority of incorrect generations, particularly in generative numeracy, attribute binding, and spatial relationships. Furthermore, by simply adjusting the instructions to the LLM, SLD can perform image editing tasks, bridging the gap between text-to-image generation and image editing pipelines. We will make our code available for future research and applications.
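The closed loop described in the abstract (generate, assess alignment with the prompt, correct, repeat) can be illustrated with a minimal sketch. This is not the authors' code: the "image" is a toy list of object labels, and `Scene`, `assess`, `apply_edits`, and `self_correct` are hypothetical names invented for illustration. In a real system the assessment and the edits would be driven by an LLM controller and a diffusion model, which are replaced here by simple counting.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

Edit = Tuple[str, str]  # (operation, object label); hypothetical edit format


@dataclass
class Scene:
    """Toy stand-in for a generated image: just a list of object labels."""
    objects: List[str] = field(default_factory=list)


def assess(scene: Scene, required: Dict[str, int]) -> List[Edit]:
    """Compare the scene with the required object counts and list the edits needed."""
    edits: List[Edit] = []
    for label, count in required.items():
        have = scene.objects.count(label)
        if have < count:
            edits += [("add", label)] * (count - have)
        elif have > count:
            edits += [("remove", label)] * (have - count)
    # Objects that were never requested are removed as well.
    for label in sorted(set(scene.objects) - set(required)):
        edits += [("remove", label)] * scene.objects.count(label)
    return edits


def apply_edits(scene: Scene, edits: List[Edit]) -> Scene:
    """Return a new scene with the edits applied (stand-in for image correction)."""
    objects = list(scene.objects)
    for op, label in edits:
        if op == "add":
            objects.append(label)
        elif op == "remove":
            objects.remove(label)
    return Scene(objects)


def self_correct(scene: Scene, required: Dict[str, int], max_rounds: int = 3):
    """Iterate assess -> correct until no edits remain or the round budget is spent."""
    rounds = 0
    for _ in range(max_rounds):
        edits = assess(scene, required)
        if not edits:
            break
        scene = apply_edits(scene, edits)
        rounds += 1
    return scene, rounds


# Usage: the initial "generation" has the wrong counts for "two cats and one dog".
final, rounds = self_correct(Scene(["cat", "dog", "dog"]), {"cat": 2, "dog": 1})
print(sorted(final.objects), rounds)
```

The loop stops as soon as the assessment step proposes no further edits, which mirrors the idea of checking the result against the prompt before accepting it. The round budget is an assumption added here so the sketch always terminates.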

Related papers

Learning an Image Editing Model without Image Editing Pairs [83.03646586929638]
Recent image editing models have achieved impressive results while following natural language editing instructions. They rely on supervised fine-tuning with large datasets of input-target pairs. Current workarounds use synthetic training pairs that leverage the zero-shot capabilities of existing models. We present a new training paradigm that eliminates the need for paired data entirely.
arXiv Detail & Related papers (2025-10-16T17:59:57Z)
UniFusion: Vision-Language Model as Unified Encoder in Image Generation [12.811191961286852]
We present UniFusion, a diffusion-based generative model conditioned on a frozen large vision-language model (VLM) that serves as a unified multimodal encoder. We demonstrate that LAP outperforms other shallow fusion architectures on text-image alignment for generation and on faithful transfer of visual information from the VLM to the diffusion model, which is key for editing. We propose VLM-Enabled Rewriting Injection with Flexible Inference (VERIFI), which conditions a diffusion transformer (DiT) only on the text tokens generated by the VLM during in-model prompt rewriting.
arXiv Detail & Related papers (2025-10-14T17:57:56Z)
ILLUME+: Illuminating Unified MLLM with Dual Visual Tokenization and Diffusion Refinement [68.05833403672274]
Existing unified models have struggled to handle the three fundamental capabilities in a unified model: understanding, generation, and editing. ILLUME+ introduces a unified dual visual tokenizer, DualViTok, which preserves fine-grained textures and text-aligned semantics. We also employ a diffusion model as the image detokenizer for enhanced generation quality and efficient super-resolution.
arXiv Detail & Related papers (2025-04-02T17:45:00Z)
Decoder-Only LLMs are Better Controllers for Diffusion Models [63.22040456010123]
We propose to enhance text-to-image diffusion models by borrowing the strength of semantic understanding from large language models. Our adapter module is superior to the state-of-the-art models in terms of text-to-image generation quality and reliability.
arXiv Detail & Related papers (2025-02-06T12:17:35Z)
Exploring the Role of Large Language Models in Prompt Encoding for Diffusion Models [42.891427362223176]
Large language models (LLMs) based on decoder-only transformers have demonstrated superior text understanding capabilities. We propose a novel framework to fully harness the capabilities of LLMs. We further design an LLM-Infused Diffusion Transformer (LI-DiT) based on the framework.
arXiv Detail & Related papers (2024-06-17T17:59:43Z)
LaDiC: Are Diffusion Models Really Inferior to Autoregressive Counterparts for Image-to-Text Generation? [10.72249123249003]
We revisit diffusion models, highlighting their capacity for holistic context modeling and parallel decoding. We introduce a novel architecture, LaDiC, which utilizes a split BERT to create a dedicated latent space for captions. LaDiC achieves state-of-the-art performance for diffusion-based methods on the MS COCO dataset with 38.2 BLEU@4 and 126.2 CIDEr.
arXiv Detail & Related papers (2024-04-16T17:47:16Z)
MirrorDiffusion: Stabilizing Diffusion Process in Zero-shot Image Translation by Prompts Redescription and Beyond [57.14128305383768]
We propose a prompt redescription strategy to realize a mirror effect between the source and reconstructed image in the diffusion model (MirrorDiffusion). MirrorDiffusion achieves superior performance over the state-of-the-art methods on zero-shot image translation benchmarks.
arXiv Detail & Related papers (2024-01-06T14:12:16Z)
LLMGA: Multimodal Large Language Model based Generation Assistant [53.150283805515926]
We introduce a Multimodal Large Language Model-based Generation Assistant (LLMGA) to assist users in image generation and editing. We train the MLLM to grasp the properties of image generation and editing, enabling it to generate detailed prompts. Extensive results show that LLMGA has promising generation and editing capabilities and can enable more flexible and expansive applications.
arXiv Detail & Related papers (2023-11-27T13:37:26Z)
LLM-grounded Video Diffusion Models [57.23066793349706]
Video diffusion models have emerged as a promising tool for neural video generation. Current models struggle with intricate spatiotemporal prompts and often generate restricted or incorrect motion. We introduce LLM-grounded Video Diffusion (LVD). Our results demonstrate that LVD significantly outperforms its base video diffusion model.
arXiv Detail & Related papers (2023-09-29T17:54:46Z)
BLIP-Diffusion: Pre-trained Subject Representation for Controllable Text-to-Image Generation and Editing [73.74570290836152]
BLIP-Diffusion is a new subject-driven image generation model that supports multimodal control. Unlike other subject-driven generation models, BLIP-Diffusion introduces a new multimodal encoder which is pre-trained to provide subject representation.
arXiv Detail & Related papers (2023-05-24T04:51:04Z)
LLM-grounded Diffusion: Enhancing Prompt Understanding of Text-to-Image Diffusion Models with Large Language Models [62.75006608940132]
This work proposes to enhance prompt understanding capabilities in text-to-image diffusion models. Our method leverages a pretrained large language model for grounded generation in a novel two-stage process. Our method significantly outperforms the base diffusion model and several strong baselines in accurately generating images.
arXiv Detail & Related papers (2023-05-23T03:59:06Z)
DiffUTE: Universal Text Editing Diffusion Model [32.384236053455]
We propose a universal self-supervised text editing diffusion model (DiffUTE). It aims to replace or modify words in the source image with other words while maintaining a realistic appearance. Our method achieves impressive performance and enables controllable editing on in-the-wild images with high fidelity.
arXiv Detail & Related papers (2023-05-18T09:06:01Z)
Blended Latent Diffusion [18.043090347648157]
We present an accelerated solution to the task of local text-driven editing of generic images, where the desired edits are confined to a user-provided mask. Our solution leverages a recent text-to-image Latent Diffusion Model (LDM), which speeds up diffusion by operating in a lower-dimensional latent space.
arXiv Detail & Related papers (2022-06-06T17:58:04Z)

This list is automatically generated from the titles and abstracts of the papers on this site.