Related papers: Benchmarking Diversity in Image Generation via Attribute-Conditional Human Evaluation

URL: http://arxiv.org/abs/2511.10547v1
Date: Fri, 14 Nov 2025 01:57:27 GMT
Title: Benchmarking Diversity in Image Generation via Attribute-Conditional Human Evaluation
Authors: Isabela Albuquerque, Ira Ktena, Olivia Wiles, Ivana Kajić, Amal Rannen-Triki, Cristina Vasconcelos, Aida Nematzadeh
Abstract summary: Current text-to-image (T2I) models often lack diversity, generating homogeneous outputs. This work introduces a framework to address the need for robust diversity evaluation in T2I models.
Score: 11.51556047408882
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Despite advances in generation quality, current text-to-image (T2I) models often lack diversity, generating homogeneous outputs. This work introduces a framework to address the need for robust diversity evaluation in T2I models. Our framework systematically assesses diversity by evaluating individual concepts and their relevant factors of variation. Key contributions include: (1) a novel human evaluation template for nuanced diversity assessment; (2) a curated prompt set covering diverse concepts with their identified factors of variation (e.g. prompt: An image of an apple, factor of variation: color); and (3) a methodology for comparing models in terms of human annotations via binomial tests. Furthermore, we rigorously compare various image embeddings for diversity measurement. Notably, our principled approach enables ranking of T2I models by diversity, identifying categories where they particularly struggle. This research offers a robust methodology and insights, paving the way for improvements in T2I model diversity and metric development.
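The abstract mentions comparing models on human annotations via binomial tests without giving the exact protocol. The following is only a minimal sketch of how such a comparison could look, assuming a hypothetical pairwise setup in which annotators pick which of two models produced the more diverse image set for a given prompt and factor of variation, ties are discarded, and the null hypothesis is that each model wins with probability 0.5. The function name `binomial_two_sided_p` and the example counts are illustrative and not taken from the paper.

```python
from math import comb


def binomial_two_sided_p(wins: int, n: int, p: float = 0.5) -> float:
    """Exact two-sided binomial test p-value.

    wins: number of comparisons won by model A (hypothetical annotation count)
    n:    total number of non-tied comparisons
    p:    win probability under the null hypothesis (assumed 0.5: no difference)
    """
    if n <= 0 or not 0 <= wins <= n:
        raise ValueError("require n > 0 and 0 <= wins <= n")
    # Probability of every possible win count under the null hypothesis.
    pmf = [comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]
    observed = pmf[wins]
    # Sum the probabilities of all outcomes at most as likely as the observed
    # one; the small relative tolerance guards against floating-point ties.
    p_value = sum(q for q in pmf if q <= observed * (1 + 1e-9))
    return min(1.0, p_value)


# Illustrative counts only: model A judged more diverse in 8 of 10 comparisons.
print(binomial_two_sided_p(8, 10))  # prints 0.109375
```

With these made-up counts the p-value is above 0.05, so 8 wins out of 10 would not be enough on its own to claim that one model is more diverse than the other at that level.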

Related papers

DiverseGRPO: Mitigating Mode Collapse in Image Generation via Diversity-Aware GRPO [50.89703227426486]
Reinforcement learning (RL) improves image generation quality significantly by comparing the relative performance of images generated within the same group. In the later stages of training, the model tends to produce homogenized outputs, lacking creativity and visual diversity. This issue can be analyzed from both reward modeling and generation dynamics perspectives.
arXiv Detail & Related papers (2025-12-25T05:37:37Z)
Beyond Overcorrection: Evaluating Diversity in T2I Models with DivBench [26.148022772521493]
Current diversification strategies for text-to-image (T2I) models often ignore contextual appropriateness, leading to over-diversification. This paper introduces DIVBENCH, a benchmark and evaluation framework for measuring both under- and over-diversification in T2I generation.
arXiv Detail & Related papers (2025-07-02T13:14:42Z)
DIMCIM: A Quantitative Evaluation Framework for Default-mode Diversity and Generalization in Text-to-Image Generative Models [11.080727606381524]
We introduce the Does-it/Can-it framework, DIM-CIM, a reference-free measurement of default-mode diversity. We find that widely-used models improve in generalization at the cost of default-mode diversity when scaling from 1.5B to 8.1B parameters. We also use DIMCIM to evaluate the training data of a T2I model and observe a correlation of 0.85 between diversity in training images and default-mode diversity.
arXiv Detail & Related papers (2025-06-05T14:53:34Z)
Evaluating the Diversity and Quality of LLM Generated Content [72.84945252821908]
We introduce a framework for measuring effective semantic diversity--diversity among outputs that meet quality thresholds. Although preference-tuned models exhibit reduced lexical and syntactic diversity, they produce greater effective semantic diversity than SFT or base models. These findings have important implications for applications that require diverse yet high-quality outputs.
arXiv Detail & Related papers (2025-04-16T23:02:23Z)
GRADE: Quantifying Sample Diversity in Text-to-Image Models [66.12068246962762]
GRADE is an automatic method for quantifying sample diversity in text-to-image models. We use GRADE to measure the diversity of 12 models over a total of 720K images.
arXiv Detail & Related papers (2024-10-29T23:10:28Z)
The Factuality Tax of Diversity-Intervened Text-to-Image Generation: Benchmark and Fact-Augmented Intervention [61.80236015147771]
We quantify the trade-off between using diversity interventions and preserving demographic factuality in T2I models. Experiments on DoFaiR reveal that diversity-oriented instructions increase the number of different gender and racial groups. We propose Fact-Augmented Intervention (FAI) to reflect on verbalized or retrieved factual information about gender and racial compositions of generation subjects in history.
arXiv Detail & Related papers (2024-06-29T09:09:42Z)
Measuring Diversity in Co-creative Image Generation [1.4963011898406866]
We propose an alternative based on entropy of neural network encodings for comparing diversity between sets of images. We also compare two pre-trained networks and show how the choice relates to the notion of diversity that we want to evaluate.
arXiv Detail & Related papers (2024-03-06T01:55:14Z)
Stable Bias: Analyzing Societal Representations in Diffusion Models [72.27121528451528]
We propose a new method for exploring the social biases in Text-to-Image (TTI) systems. Our approach relies on characterizing the variation in generated images triggered by enumerating gender and ethnicity markers in the prompts. We leverage this method to analyze images generated by 3 popular TTI systems and find that while all of their outputs show correlations with US labor demographics, they also consistently under-represent marginalized identities to different extents.
arXiv Detail & Related papers (2023-03-20T19:32:49Z)
Interpretable Diversity Analysis: Visualizing Feature Representations In Low-Cost Ensembles [0.0]
This paper introduces several interpretability methods that can be used to qualitatively analyze diversity. We demonstrate these techniques by comparing the diversity of feature representations between child networks using two low-cost ensemble algorithms.
arXiv Detail & Related papers (2023-02-12T00:32:03Z)
Random Network Distillation as a Diversity Metric for Both Image and Text Generation [62.13444904851029]
We develop a new diversity metric that can be applied to data, both synthetic and natural, of any type. We validate and deploy this metric on both images and text.
arXiv Detail & Related papers (2020-10-13T22:03:52Z)
Evaluating the Evaluation of Diversity in Natural Language Generation [43.05127848086264]
We propose a framework for evaluating diversity metrics in natural language generation systems. Our framework can advance the understanding of different diversity metrics, an essential step on the road towards better NLG systems.
arXiv Detail & Related papers (2020-04-06T20:44:10Z)

This list is automatically generated from the titles and abstracts of the papers on this site.