Words Worth a Thousand Pictures: Measuring and Understanding Perceptual Variability in Text-to-Image Generation
- URL: http://arxiv.org/abs/2406.08482v1
- Date: Wed, 12 Jun 2024 17:59:27 GMT
- Title: Words Worth a Thousand Pictures: Measuring and Understanding Perceptual Variability in Text-to-Image Generation
- Authors: Raphael Tang, Xinyu Zhang, Lixinyu Xu, Yao Lu, Wenyan Li, Pontus Stenetorp, Jimmy Lin, Ferhan Ture
- Abstract summary: Diffusion models are the state of the art in text-to-image generation, but their perceptual variability remains understudied.
We propose W1KP, a human-calibrated measure of variability in a set of images, bootstrapped from existing image-pair perceptual distances.
Our best perceptual distance outperforms nine baselines by up to 18 points in accuracy, and our calibration matches human judgements 78% of the time.
- Score: 58.77994391566484
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Diffusion models are the state of the art in text-to-image generation, but their perceptual variability remains understudied. In this paper, we examine how prompts affect image variability in black-box diffusion-based models. We propose W1KP, a human-calibrated measure of variability in a set of images, bootstrapped from existing image-pair perceptual distances. Current datasets do not cover recent diffusion models, thus we curate three test sets for evaluation. Our best perceptual distance outperforms nine baselines by up to 18 points in accuracy, and our calibration matches graded human judgements 78% of the time. Using W1KP, we study prompt reusability and show that Imagen prompts can be reused for 10-50 random seeds before new images become too similar to already generated images, while Stable Diffusion XL and DALL-E 3 can be reused 50-200 times. Lastly, we analyze 56 linguistic features of real prompts, finding that the prompt's length, CLIP embedding norm, concreteness, and word senses influence variability most. As far as we are aware, we are the first to analyze diffusion variability from a visuolinguistic perspective. Our project page is at http://w1kp.com
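The abstract describes W1KP as a variability measure over a set of images, bootstrapped from image-pair perceptual distances. The paper's exact construction and calibration are not given here; a minimal sketch of the underlying idea, with mean pairwise distance over the set and a placeholder pixel-space distance standing in for a learned perceptual distance:

```python
import itertools

import numpy as np


def perceptual_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Placeholder pairwise distance. W1KP builds on learned perceptual
    distances; this toy normalized L2 on raw pixels is for illustration only."""
    return float(np.linalg.norm(a.astype(float) - b.astype(float)) / a.size)


def set_variability(images: list[np.ndarray]) -> float:
    """Variability of an image set as the mean over all pairwise distances
    (one simple way to lift a pair distance to a set measure)."""
    pairs = list(itertools.combinations(images, 2))
    return sum(perceptual_distance(a, b) for a, b in pairs) / len(pairs)


# Identical images give zero variability; differing images give a positive score.
same = [np.zeros((8, 8)) for _ in range(4)]
mixed = [np.zeros((8, 8)), np.ones((8, 8)), np.full((8, 8), 2.0)]
print(set_variability(same))   # 0.0
print(set_variability(mixed))  # positive
```

In a setup like this, the prompt-reusability question becomes: as more seeds are generated, does the variability of the growing set plateau, indicating new samples merely duplicate earlier ones.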
Related papers
- DreamDistribution: Prompt Distribution Learning for Text-to-Image Diffusion Models [53.17454737232668]
We introduce a solution that allows a pretrained T2I diffusion model to learn a set of soft prompts.
These prompts offer text-guided editing capabilities and additional flexibility in controlling variation and mixing between multiple distributions.
We also show the adaptability of the learned prompt distribution to other tasks, such as text-to-3D.
arXiv Detail & Related papers (2023-12-21T12:11:00Z) - Diversity and Diffusion: Observations on Synthetic Image Distributions with Stable Diffusion [6.491645162078057]
Text-to-image (TTI) systems have made it possible to create realistic images with simple text prompts.
In all of the experiments performed to date, classifiers trained solely with synthetic images perform poorly at inference.
We find four issues that limit the usefulness of TTI systems for this task: ambiguity, adherence to prompt, lack of diversity, and inability to represent the underlying concept.
arXiv Detail & Related papers (2023-10-31T18:05:15Z) - Evaluating Picture Description Speech for Dementia Detection using Image-text Alignment [10.008388878255538]
We propose the first dementia detection models that take both the picture and the description texts as inputs.
We observe the difference between dementia and healthy samples in terms of the text's relevance to the picture and the focused area of the picture.
We propose three advanced models that pre-process the samples based on their relevance to the picture, sub-images, and focused areas.
arXiv Detail & Related papers (2023-08-11T08:42:37Z) - Stable Diffusion is Unstable [21.13934830556678]
We propose Auto-attack on Text-to-image Models (ATM) to efficiently generate small perturbations.
ATM has achieved a 91.1% success rate in short-text attacks and an 81.2% success rate in long-text attacks.
Further empirical analysis revealed four attack patterns based on: 1) the variability in generation speed, 2) the similarity of coarse-grained characteristics, 3) the polysemy of words, and 4) the positioning of words.
arXiv Detail & Related papers (2023-06-05T04:21:43Z) - Parents and Children: Distinguishing Multimodal DeepFakes from Natural Images [60.34381768479834]
Recent advancements in diffusion models have enabled the generation of realistic deepfakes from textual prompts in natural language.
We pioneer a systematic study on the detection of deepfakes generated by state-of-the-art diffusion models.
arXiv Detail & Related papers (2023-04-02T10:25:09Z) - Your Diffusion Model is Secretly a Zero-Shot Classifier [90.40799216880342]
We show that density estimates from large-scale text-to-image diffusion models can be leveraged to perform zero-shot classification.
Our generative approach to classification attains strong results on a variety of benchmarks.
Our results are a step toward using generative over discriminative models for downstream tasks.
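The summary above says density estimates from a text-to-image diffusion model can drive zero-shot classification: the image is assigned the class whose conditional prompt yields the lowest diffusion (noise-prediction) loss. A heavily simplified, runnable sketch with a stub in place of the real model; the prompt strings and the error function are illustrative assumptions, not the paper's implementation:

```python
import numpy as np


def eps_prediction_error(image: np.ndarray, prompt: str) -> float:
    """Stub for the expected denoising (epsilon-prediction) error of a
    text-conditional diffusion model. A real implementation would add noise
    at sampled timesteps and average the model's prediction error; here we
    fake it so the control flow is runnable."""
    target = {"a photo of a cat": 0.0, "a photo of a dog": 1.0}[prompt]
    return float(abs(image.mean() - target))


def zero_shot_classify(image: np.ndarray, prompts: list[str]) -> str:
    """Lower diffusion loss under a class prompt corresponds to higher
    conditional density, so pick the prompt with the smallest error."""
    return min(prompts, key=lambda p: eps_prediction_error(image, p))


prompts = ["a photo of a cat", "a photo of a dog"]
print(zero_shot_classify(np.zeros((4, 4)), prompts))  # a photo of a cat
print(zero_shot_classify(np.ones((4, 4)), prompts))   # a photo of a dog
```

The key design point is that no classifier is trained: the generative model's own conditional likelihoods, compared across prompts, act as the decision rule.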
arXiv Detail & Related papers (2023-03-28T17:59:56Z) - Invariant Learning via Diffusion Dreamed Distribution Shifts [121.71383835729848]
We propose a dataset called Diffusion Dreamed Distribution Shifts (D3S).
D3S consists of synthetic images generated through StableDiffusion using text prompts and image guides obtained by pasting a sample foreground image onto a background template image.
Due to the incredible photorealism of the diffusion model, our images are much closer to natural images than previous synthetic datasets.
arXiv Detail & Related papers (2022-11-18T17:07:43Z) - What the DAAM: Interpreting Stable Diffusion Using Cross Attention [39.97805685586423]
Large-scale diffusion neural networks represent a substantial milestone in text-to-image generation.
They remain poorly understood, lacking explainability and interpretability analyses, largely due to their proprietary, closed-source nature.
We propose DAAM, a novel method based on upscaling and aggregating cross-attention activations in the latent denoising subnetwork.
We show that DAAM performs strongly on caption-generated images, achieving an mIoU of 61.0, and it outperforms supervised models on open-vocabulary segmentation.
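DAAM is described as upscaling and aggregating cross-attention activations from the denoising subnetwork. A toy sketch of that aggregation step, assuming per-layer square attention maps for one token at different resolutions (nearest-neighbour upsampling via `np.kron` stands in for whatever interpolation the method actually uses):

```python
import numpy as np


def upscale(attn: np.ndarray, size: int) -> np.ndarray:
    """Nearest-neighbour upsampling of a square attention map; assumes
    `size` is an integer multiple of the map's side length."""
    factor = size // attn.shape[0]
    return np.kron(attn, np.ones((factor, factor)))


def aggregate_heatmap(attn_maps: list[np.ndarray], size: int = 64) -> np.ndarray:
    """Upscale each layer's cross-attention map for one token to a common
    resolution and average them into a single heat map (the aggregation
    idea behind DAAM, heavily simplified)."""
    return np.mean([upscale(m, size) for m in attn_maps], axis=0)


maps = [np.eye(8), np.eye(16), np.eye(32)]  # toy maps at three scales
heat = aggregate_heatmap(maps, size=64)
print(heat.shape)  # (64, 64)
```

Thresholding such a per-token heat map is what makes the segmentation-style evaluation (mIoU against word-level masks) possible.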
arXiv Detail & Related papers (2022-10-10T17:55:41Z) - DALL-Eval: Probing the Reasoning Skills and Social Biases of Text-to-Image Generation Models [73.12069620086311]
We investigate the visual reasoning capabilities and social biases of text-to-image models.
First, we measure three visual reasoning skills: object recognition, object counting, and spatial relation understanding.
Second, we assess the gender and skin tone biases by measuring the gender/skin tone distribution of generated images.
arXiv Detail & Related papers (2022-02-08T18:36:52Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented (including all listed papers) and is not responsible for any consequences of its use.