Towards Understanding and Quantifying Uncertainty for Text-to-Image Generation
- URL: http://arxiv.org/abs/2412.03178v1
- Date: Wed, 04 Dec 2024 10:03:52 GMT
- Title: Towards Understanding and Quantifying Uncertainty for Text-to-Image Generation
- Authors: Gianni Franchi, Dat Nguyen Trong, Nacim Belkhir, Guoxuan Xia, Andrea Pilzer,
- Abstract summary: Uncertainty quantification in text-to-image (T2I) generative models is crucial for understanding model behavior and improving output reliability. We are the first to quantify and evaluate the uncertainty of T2I models with respect to the prompt. We introduce Prompt-based UNCertainty Estimation for T2I models (PUNC) to better address uncertainties arising from the semantics of the prompt and generated images.
- Score: 4.1364578693016325
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Uncertainty quantification in text-to-image (T2I) generative models is crucial for understanding model behavior and improving output reliability. In this paper, we are the first to quantify and evaluate the uncertainty of T2I models with respect to the prompt. Alongside adapting existing approaches designed to measure uncertainty in the image space, we also introduce Prompt-based UNCertainty Estimation for T2I models (PUNC), a novel method leveraging Large Vision-Language Models (LVLMs) to better address uncertainties arising from the semantics of the prompt and generated images. PUNC uses an LVLM to caption a generated image, and then compares the caption with the original prompt in the more semantically meaningful text space. PUNC also enables the disentanglement of both aleatoric and epistemic uncertainties via precision and recall, which image-space approaches are unable to do. Extensive experiments demonstrate that PUNC outperforms state-of-the-art uncertainty estimation techniques across various settings. Uncertainty quantification in text-to-image generation models can be used in various applications, including bias detection, copyright protection, and OOD detection. We also introduce a comprehensive dataset of text prompts and generation pairs to foster further research in uncertainty quantification for generative models. Our findings illustrate that PUNC not only achieves competitive performance but also enables novel applications in evaluating and improving the trustworthiness of text-to-image models.
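The caption-and-compare step at the core of PUNC can be illustrated with a minimal sketch. The snippet below is an illustrative assumption rather than the authors' implementation: it takes captions produced by any LVLM for images generated from a prompt and scores prompt-caption agreement with an off-the-shelf sentence encoder (the model name and the cosine-similarity aggregation are placeholder choices), reading low agreement as high uncertainty.

```python
# Illustrative sketch of PUNC's text-space comparison (not the authors' code).
# Assumes: images were generated by a T2I model from `prompt`, and each image was
# captioned by an LVLM, giving `captions`. Encoder and similarity are placeholders.
from sentence_transformers import SentenceTransformer, util

def punc_uncertainty(prompt: str, captions: list[str]) -> float:
    encoder = SentenceTransformer("all-MiniLM-L6-v2")
    prompt_emb = encoder.encode(prompt, convert_to_tensor=True)
    caption_embs = encoder.encode(captions, convert_to_tensor=True)
    sims = util.cos_sim(prompt_emb, caption_embs)   # shape (1, num_captions)
    return 1.0 - sims.mean().item()                 # low agreement -> high uncertainty
```

Comparing in text space is also what makes the precision/recall-style split plausible: roughly, how much of the prompt's content is recovered in the captions versus how much caption content is not grounded in the prompt, per the abstract's description.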
Related papers
- D2C: Unlocking the Potential of Continuous Autoregressive Image Generation with Discrete Tokens [80.75893450536577]
We propose D2C, a novel two-stage method to enhance model generation capacity.
In the first stage, the discrete-valued tokens representing coarse-grained image features are sampled by employing a small discrete-valued generator.
In the second stage, the continuous-valued tokens representing fine-grained image features are learned conditioned on the discrete token sequence.
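As a rough illustration of the two-stage sampling described above, the following schematic sketch (not the D2C code; `coarse_generator` and `fine_generator` are hypothetical stand-ins) first samples discrete tokens autoregressively and then predicts continuous tokens conditioned on them.

```python
# Schematic two-stage sampler in the spirit of the description above; all model
# interfaces are hypothetical placeholders, not the D2C implementation.
import torch

def two_stage_sample(coarse_generator, fine_generator, prompt_emb, n_coarse: int = 256):
    # Stage 1: sample discrete tokens encoding coarse-grained image features.
    discrete_tokens: list[int] = []
    for _ in range(n_coarse):
        logits = coarse_generator(prompt_emb, discrete_tokens)   # (vocab_size,)
        probs = torch.softmax(logits, dim=-1)
        discrete_tokens.append(int(torch.multinomial(probs, 1)))
    # Stage 2: predict continuous tokens (fine-grained features) conditioned on stage 1.
    continuous_tokens = fine_generator(prompt_emb, discrete_tokens)
    return discrete_tokens, continuous_tokens
```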
arXiv Detail & Related papers (2025-03-21T13:58:49Z)
- RFMI: Estimating Mutual Information on Rectified Flow for Text-to-Image Alignment [51.85242063075333]
Rectified Flow (RF) models trained with a Flow matching framework have achieved state-of-the-art performance on Text-to-Image (T2I) conditional generation.
Yet, multiple benchmarks show that synthetic images can still suffer from poor alignment with the prompt.
We introduce RFMI, a novel Mutual Information (MI) estimator for RF models that uses the pre-trained model itself for the MI estimation.
arXiv Detail & Related papers (2025-03-18T15:41:45Z)
- Text-to-Image Diffusion Models Cannot Count, and Prompt Refinement Cannot Help [18.70937620674227]
We introduce T2ICountBench, a novel benchmark designed to rigorously evaluate the counting ability of state-of-the-art text-to-image diffusion models.
Our evaluations reveal that all state-of-the-art diffusion models fail to generate the correct number of objects, with accuracy dropping significantly as the number of objects increases.
arXiv Detail & Related papers (2025-03-10T03:28:18Z)
- DILLEMA: Diffusion and Large Language Models for Multi-Modal Augmentation [0.13124513975412253]
We present a novel framework for testing vision neural networks that leverages Large Language Models and control-conditioned Diffusion Models.
Our approach begins by translating images into detailed textual descriptions using a captioning model.
These descriptions are then used to produce new test images through a text-to-image diffusion process.
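A rough sketch of that caption-then-regenerate loop is shown below, using off-the-shelf stand-ins (a BLIP captioner and Stable Diffusion) rather than the framework's actual components, which also involve an LLM and control-conditioned diffusion.

```python
# Sketch of image -> textual description -> new test image, with placeholder models.
import torch
from transformers import pipeline
from diffusers import StableDiffusionPipeline

captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")
generator = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

def make_test_variant(image, condition: str = "at night, in heavy rain"):
    description = captioner(image)[0]["generated_text"]        # image -> description
    return generator(f"{description}, {condition}").images[0]  # description -> test image
```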
arXiv Detail & Related papers (2025-02-05T16:35:42Z)
- PromptLA: Towards Integrity Verification of Black-box Text-to-Image Diffusion Models [17.12906933388337]
Malicious actors can fine-tune text-to-image (T2I) diffusion models to generate illegal content.
We propose a novel prompt selection algorithm based on a learning automaton (PromptLA) for efficient and accurate verification.
arXiv Detail & Related papers (2024-12-20T07:24:32Z)
- Safeguarding Text-to-Image Generation via Inference-Time Prompt-Noise Optimization [29.378296359782585]
Text-to-Image (T2I) diffusion models are widely recognized for their ability to generate high-quality and diverse images based on text prompts.
Current efforts to prevent inappropriate image generation for T2I models are easy to bypass and vulnerable to adversarial attacks.
We propose a novel, training-free approach, called Prompt-Noise Optimization (PNO), to mitigate unsafe image generation.
arXiv Detail & Related papers (2024-12-05T05:12:30Z)
- Image Regeneration: Evaluating Text-to-Image Model via Generating Identical Image with Multimodal Large Language Models [54.052963634384945]
We introduce the Image Regeneration task to assess text-to-image models.
We use GPT4V to bridge the gap between the reference image and the text input for the T2I model.
We also present the ImageRepainter framework to enhance the quality of generated images.
arXiv Detail & Related papers (2024-11-14T13:52:43Z)
- VLEU: a Method for Automatic Evaluation for Generalizability of Text-to-Image Models [18.259733507395634]
We introduce a new metric called Visual Language Evaluation Understudy (VLEU).
VLEU quantifies a model's generalizability by computing the Kullback-Leibler divergence between the marginal distribution of the visual text and the conditional distribution of the images generated by the model.
Our experiments demonstrate the effectiveness of VLEU in evaluating the generalization capability of various T2I models.
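Read literally, the description above corresponds to an expected KL divergence of the form below; this is a hedged paraphrase in notation, not necessarily the paper's exact definition, with $t$ ranging over visual-text prompts and $I$ over images generated by the model under evaluation.

```latex
\mathrm{VLEU} \;\approx\; \mathbb{E}_{I \sim p_\theta}\!\left[\, D_{\mathrm{KL}}\big(P(t) \,\|\, P(t \mid I)\big) \right]
```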
arXiv Detail & Related papers (2024-09-23T04:50:36Z)
- Six-CD: Benchmarking Concept Removals for Benign Text-to-image Diffusion Models [58.74606272936636]
Text-to-image (T2I) diffusion models have shown exceptional capabilities in generating images that closely correspond to textual prompts.
The models could be exploited for malicious purposes, such as generating images with violence or nudity, or creating unauthorized portraits of public figures in inappropriate contexts.
Concept removal methods have been proposed to modify diffusion models to prevent the generation of malicious and unwanted concepts.
arXiv Detail & Related papers (2024-06-21T03:58:44Z)
- Information Theoretic Text-to-Image Alignment [49.396917351264655]
We present a novel method that relies on an information-theoretic alignment measure to steer image generation.
Our method is on par with or superior to the state of the art, yet requires nothing but a pre-trained denoising network to estimate mutual information (MI).
arXiv Detail & Related papers (2024-05-31T12:20:02Z)
- Visual Explanations of Image-Text Representations via Multi-Modal Information Bottleneck Attribution [49.762034744605955]
We propose a multi-modal information bottleneck (M2IB) approach to improve the interpretability of vision-language models.
We demonstrate how M2IB can be applied to attribution analysis of vision-language pretrained models.
arXiv Detail & Related papers (2023-12-28T18:02:22Z)
- Towards General Visual-Linguistic Face Forgery Detection [95.73987327101143]
Deepfakes are realistic face manipulations that can pose serious threats to security, privacy, and trust.
Existing methods mostly treat this task as binary classification, which uses digital labels or mask signals to train the detection model.
We propose a novel paradigm named Visual-Linguistic Face Forgery Detection (VLFFD), which uses fine-grained sentence-level prompts as the annotation.
arXiv Detail & Related papers (2023-07-31T10:22:33Z)
- If at First You Don't Succeed, Try, Try Again: Faithful Diffusion-based Text-to-Image Generation by Selection [53.320946030761796]
Diffusion-based text-to-image (T2I) models can lack faithfulness to the text prompt.
We show that large T2I diffusion models are more faithful than usually assumed, and can generate images faithful to even complex prompts.
We introduce a pipeline that generates candidate images for a text prompt and picks the best one according to an automatic scoring system.
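That generate-then-select pipeline can be sketched as follows; CLIP similarity is used here only as a stand-in for the paper's automatic scoring system, and the model names are placeholder choices.

```python
# Minimal best-of-N selection sketch (CLIP as a placeholder scorer).
import torch
from diffusers import StableDiffusionPipeline
from transformers import CLIPModel, CLIPProcessor

t2i = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
clip_proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def best_of_n(prompt: str, n: int = 4):
    candidates = [t2i(prompt).images[0] for _ in range(n)]      # candidate images
    inputs = clip_proc(text=[prompt], images=candidates,
                       return_tensors="pt", padding=True)
    scores = clip(**inputs).logits_per_image.squeeze(1)         # (n,) prompt-image scores
    return candidates[int(scores.argmax())]                     # keep the most faithful
```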
arXiv Detail & Related papers (2023-05-22T17:59:41Z)
- LeftRefill: Filling Right Canvas based on Left Reference through Generalized Text-to-Image Diffusion Model [55.20469538848806]
This paper introduces LeftRefill, an innovative approach to efficiently harness large Text-to-Image (T2I) diffusion models for reference-guided image synthesis.
arXiv Detail & Related papers (2023-05-19T10:29:42Z)
- Uncertainty-aware Generalized Adaptive CycleGAN [44.34422859532988]
Unpaired image-to-image translation refers to learning inter-image-domain mapping in an unsupervised manner.
Existing methods often learn deterministic mappings without explicitly modelling the robustness to outliers or predictive uncertainty.
We propose a novel probabilistic method called Uncertainty-aware Generalized Adaptive Cycle Consistency (UGAC).
arXiv Detail & Related papers (2021-02-23T15:22:35Z)