Related papers: GRADE: Quantifying Sample Diversity in Text-to-Image Models

URL: http://arxiv.org/abs/2410.22592v1
Date: Tue, 29 Oct 2024 23:10:28 GMT
Title: GRADE: Quantifying Sample Diversity in Text-to-Image Models
Authors: Royi Rassin, Aviv Slobodkin, Shauli Ravfogel, Yanai Elazar, Yoav Goldberg
Abstract summary: We propose GRADE: Granular Attribute Diversity Evaluation, an automatic method for quantifying sample diversity. We measure the overall diversity of 12 T2I models using 400 concept-attribute pairs, revealing that all models display limited variation. Our work proposes a modern, semantically-driven approach to measure sample diversity and highlights the stunning homogeneity in outputs by T2I models.
Score: 66.12068246962762
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Text-to-image (T2I) models are remarkable at generating realistic images based on textual descriptions. However, textual prompts are inherently underspecified: they do not specify all possible attributes of the required image. This raises two key questions: Do T2I models generate diverse outputs on underspecified prompts? How can we automatically measure diversity? We propose GRADE: Granular Attribute Diversity Evaluation, an automatic method for quantifying sample diversity. GRADE leverages the world knowledge embedded in large language models and visual question-answering systems to identify relevant concept-specific axes of diversity (e.g., "shape" and "color" for the concept "cookie"). It then estimates frequency distributions of concepts and their attributes and quantifies diversity using (normalized) entropy. GRADE achieves over 90% human agreement while exhibiting weak correlation to commonly used diversity metrics. We use GRADE to measure the overall diversity of 12 T2I models using 400 concept-attribute pairs, revealing that all models display limited variation. Further, we find that these models often exhibit default behaviors, a phenomenon where the model consistently generates concepts with the same attributes (e.g., 98% of the cookies are round). Finally, we demonstrate that a key reason for low diversity is due to underspecified captions in training data. Our work proposes a modern, semantically-driven approach to measure sample diversity and highlights the stunning homogeneity in outputs by T2I models.
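
The entropy step in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name, the choice to normalize by the log of the number of categories, and the sample answers are assumptions made here for illustration. The only detail taken from the abstract is the example of 98% round cookies.

```python
import math
from collections import Counter


def normalized_entropy(answers, num_categories=None):
    """Shannon entropy of the empirical answer distribution, divided by log(k).

    `answers` is a list of attribute values (e.g. VQA answers, one per image).
    `num_categories` (k) is an assumed count of possible attribute values;
    when omitted, the number of distinct observed values is used instead.
    The result is in [0, 1] as long as k is at least the number of observed values.
    """
    counts = Counter(answers)
    total = sum(counts.values())
    k = num_categories if num_categories is not None else len(counts)
    if total == 0 or k <= 1:
        # No samples, or only one possible value: no diversity to measure.
        return 0.0
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return entropy / math.log(k)


# Hypothetical answers for the concept "cookie" and the attribute "shape":
# 98 of 100 images are round, mirroring the default behavior in the abstract.
shapes = ["round"] * 98 + ["square"] * 2
score = normalized_entropy(shapes, num_categories=4)  # assumes 4 possible shapes
print(f"normalized entropy: {score:.3f}")
```

A score near 0 means almost all samples share one attribute value; a score of 1 means the values are spread uniformly over the assumed categories.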

Related papers

DIMCIM: A Quantitative Evaluation Framework for Default-mode Diversity and Generalization in Text-to-Image Generative Models [11.080727606381524]
We introduce the Does-it/Can-it framework, DIM-CIM, a reference-free measurement of default-mode diversity. We find that widely-used models improve in generalization at the cost of default-mode diversity when scaling from 1.5B to 8.1B parameters. We also use DIMCIM to evaluate the training data of a T2I model and observe a correlation of 0.85 between diversity in training images and default-mode diversity.
arXiv Detail & Related papers (2025-06-05T14:53:34Z)
Image Generation Diversity Issues and How to Tame Them [8.858030256056095]
Generative methods now produce outputs nearly indistinguishable from real data but often fail to fully capture the data distribution. In this paper, we draw attention to the current lack of diversity in generative models and the inability of common metrics to measure this. We achieve this by framing diversity as an image retrieval problem, where we measure how many real images can be retrieved using synthetic data as queries.
arXiv Detail & Related papers (2024-11-25T08:00:21Z)
Kaleido Diffusion: Improving Conditional Diffusion Models with Autoregressive Latent Modeling [49.41822427811098]
We present Kaleido, a novel approach that enhances the diversity of samples by incorporating autoregressive latent priors. Kaleido integrates an autoregressive language model that encodes the original caption and generates latent variables. We show that Kaleido adheres closely to the guidance provided by the generated latent variables, demonstrating its capability to effectively control and direct the image generation process.
arXiv Detail & Related papers (2024-05-31T17:41:11Z)
Semantic Approach to Quantifying the Consistency of Diffusion Model Image Generation [0.40792653193642503]
We identify the need for an interpretable, quantitative score of the repeatability, or consistency, of image generation in diffusion models. We propose a semantic approach, using a pairwise mean CLIP score as our semantic consistency score.
arXiv Detail & Related papers (2024-04-12T20:16:03Z)
DreamDistribution: Prompt Distribution Learning for Text-to-Image Diffusion Models [53.17454737232668]
We introduce a solution that allows a pretrained T2I diffusion model to learn a set of soft prompts. These prompts offer text-guided editing capabilities and additional flexibility in controlling variation and mixing between multiple distributions. We also show the adaptability of the learned prompt distribution to other tasks, such as text-to-3D.
arXiv Detail & Related papers (2023-12-21T12:11:00Z)
Sequential Modeling Enables Scalable Learning for Large Vision Models [120.91839619284431]
We introduce a novel sequential modeling approach which enables learning a Large Vision Model (LVM) without making use of any linguistic data. We define a common format, "visual sentences", in which we can represent raw images and videos as well as annotated data sources.
arXiv Detail & Related papers (2023-12-01T18:59:57Z)
Diversity and Diffusion: Observations on Synthetic Image Distributions with Stable Diffusion [6.491645162078057]
Text-to-image (TTI) systems have made it possible to create realistic images with simple text prompts. In all of the experiments performed to date, classifiers trained solely with synthetic images perform poorly at inference. We find four issues that limit the usefulness of TTI systems for this task: ambiguity, adherence to prompt, lack of diversity, and inability to represent the underlying concept.
arXiv Detail & Related papers (2023-10-31T18:05:15Z)
Diverse Diffusion: Enhancing Image Diversity in Text-to-Image Generation [0.0]
We introduce Diverse Diffusion, a method for boosting image diversity beyond gender and ethnicity. Our approach contributes to the creation of more inclusive and representative AI-generated art.
arXiv Detail & Related papers (2023-10-19T08:48:23Z)
Effective Data Augmentation With Diffusion Models [65.09758931804478]
We address the lack of diversity in data augmentation with image-to-image transformations parameterized by pre-trained text-to-image diffusion models. Our method edits images to change their semantics using an off-the-shelf diffusion model, and generalizes to novel visual concepts from a few labelled examples. We evaluate our approach on few-shot image classification tasks, and on a real-world weed recognition task, and observe an improvement in accuracy in tested domains.
arXiv Detail & Related papers (2023-02-07T20:42:28Z)
Auditing Gender Presentation Differences in Text-to-Image Models [54.16959473093973]
We study how gender is presented differently in text-to-image models. By probing gender indicators in the input text, we quantify the frequency differences of presentation-centric attributes. We propose an automatic method to estimate such differences.
arXiv Detail & Related papers (2023-02-07T18:52:22Z)
Vision Models Are More Robust And Fair When Pretrained On Uncurated Images Without Supervision [38.22842778742829]
Discriminative self-supervised learning allows training models on any random group of internet images. We train models on billions of random images without any data pre-processing or prior assumptions about what we want the model to learn. We extensively study and validate our model performance on over 50 benchmarks including fairness, robustness to distribution shift, geographical diversity, fine-grained recognition, image copy detection and many image classification datasets.
arXiv Detail & Related papers (2022-02-16T22:26:47Z)
DALL-Eval: Probing the Reasoning Skills and Social Biases of Text-to-Image Generation Models [73.12069620086311]
We investigate the visual reasoning capabilities and social biases of text-to-image models. First, we measure three visual reasoning skills: object recognition, object counting, and spatial relation understanding. Second, we assess the gender and skin tone biases by measuring the gender/skin tone distribution of generated images.
arXiv Detail & Related papers (2022-02-08T18:36:52Z)
Diverse Semantic Image Synthesis via Probability Distribution Modeling [103.88931623488088]
We propose a novel diverse semantic image synthesis framework. Our method can achieve superior diversity and comparable quality compared to state-of-the-art methods.
arXiv Detail & Related papers (2021-03-11T18:59:25Z)
Random Network Distillation as a Diversity Metric for Both Image and Text Generation [62.13444904851029]
We develop a new diversity metric that can be applied to data, both synthetic and natural, of any type. We validate and deploy this metric on both images and text.
arXiv Detail & Related papers (2020-10-13T22:03:52Z)

This list is automatically generated from the titles and abstracts of the papers on this site.