Towards Geographic Inclusion in the Evaluation of Text-to-Image Models
- URL: http://arxiv.org/abs/2405.04457v1
- Date: Tue, 7 May 2024 16:23:06 GMT
- Title: Towards Geographic Inclusion in the Evaluation of Text-to-Image Models
- Authors: Melissa Hall, Samuel J. Bell, Candace Ross, Adina Williams, Michal Drozdzal, Adriana Romero Soriano
- Abstract summary: We study how much annotators in Africa, Europe, and Southeast Asia vary in their perception of geographic representation, visual appeal, and consistency in real and generated images.
For example, annotators in different locations often disagree on whether exaggerated, stereotypical depictions of a region are considered geographically representative.
We recommend steps for improved automatic and human evaluations.
- Score: 25.780536950323683
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Rapid progress in text-to-image generative models, coupled with their deployment for visual content creation, has magnified the importance of thoroughly evaluating their performance and identifying potential biases. In pursuit of models that generate images that are realistic, diverse, visually appealing, and consistent with the given prompt, researchers and practitioners often turn to automated metrics to facilitate scalable and cost-effective performance profiling. However, commonly used metrics often fail to account for the full diversity of human preference, and even in-depth human evaluations face challenges with subjectivity, especially as interpretations of evaluation criteria vary across regions and cultures. In this work, we conduct a large, cross-cultural study of how much annotators in Africa, Europe, and Southeast Asia vary in their perception of geographic representation, visual appeal, and consistency in real and generated images from state-of-the-art public APIs. We collect over 65,000 image annotations and 20 survey responses. We contrast human annotations with common automated metrics, finding that human preferences vary notably across geographic location and that current metrics do not fully account for this diversity. For example, annotators in different locations often disagree on whether exaggerated, stereotypical depictions of a region are considered geographically representative. In addition, the utility of automatic evaluations depends on assumptions about their set-up, such as the alignment of feature extractors with human perception of object similarity, or the definition of "appeal" captured in reference datasets used to ground evaluations. We recommend steps for improved automatic and human evaluations.
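The paper's central comparison is between region-grouped human ratings and automated scores for the same images. Below is a minimal sketch of that kind of analysis, not the authors' released code; the per-image records are invented stand-ins for the collected annotations and for any automated metric.
```python
# Check how well an automated score tracks human ratings within each
# annotator region; the data below is illustrative only.
import numpy as np
from scipy.stats import spearmanr

# Hypothetical per-image records: (region, human_appeal_rating, metric_score)
records = [
    ("Africa", 4.0, 0.71), ("Africa", 2.0, 0.65), ("Africa", 5.0, 0.80),
    ("Europe", 3.0, 0.74), ("Europe", 5.0, 0.79), ("Europe", 1.0, 0.60),
    ("Southeast Asia", 4.0, 0.68), ("Southeast Asia", 2.0, 0.72),
    ("Southeast Asia", 5.0, 0.81),
]

# Rank correlation between human ratings and the metric, per region; a low
# or unstable correlation in some regions would indicate the metric fails
# to capture that region's preferences.
for region in sorted({r for r, _, _ in records}):
    human = [h for r, h, _ in records if r == region]
    metric = [m for r, _, m in records if r == region]
    rho, p = spearmanr(human, metric)
    print(f"{region}: Spearman rho={rho:.2f} (p={p:.2f})")
```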
Related papers
- Balancing the Scales: Enhancing Fairness in Facial Expression Recognition with Latent Alignment [5.784550537553534]
This work leverages representation learning based on latent spaces to mitigate bias in facial expression recognition systems.
It also enhances a deep learning model's fairness and overall accuracy.
arXiv Detail & Related papers (2024-10-25T10:03:10Z) - When Does Perceptual Alignment Benefit Vision Representations? [76.32336818860965]
We investigate how aligning vision model representations to human perceptual judgments impacts their usability.
We find that aligning models to perceptual judgments yields representations that improve upon the original backbones across many downstream tasks.
Our results suggest that injecting an inductive bias about human perceptual knowledge into vision models can contribute to better representations.
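One common way to inject such an inductive bias, sketched below assuming two-alternative human similarity judgments, is a hinge-style triplet objective; `backbone`, the margin value, and the loss form are illustrative assumptions, not the paper's exact training recipe.
```python
# A rough sketch of triplet-style perceptual alignment on human similarity
# judgments. `backbone` is any image encoder returning (batch, d) features.
import torch
import torch.nn.functional as F

def perceptual_triplet_loss(backbone, ref, img_a, img_b, prefers_a, margin=0.05):
    """Pull the reference embedding toward whichever image humans judged
    more similar to it, by at least `margin` in cosine distance."""
    z_ref, z_a, z_b = backbone(ref), backbone(img_a), backbone(img_b)
    d_a = 1 - F.cosine_similarity(z_ref, z_a)  # distance ref <-> A
    d_b = 1 - F.cosine_similarity(z_ref, z_b)  # distance ref <-> B
    sign = prefers_a.float() * 2 - 1           # +1 if humans chose A, else -1
    # Hinge: violated when the human-preferred image is not closer by `margin`.
    return F.relu(margin + sign * (d_a - d_b)).mean()
```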
arXiv Detail & Related papers (2024-10-14T17:59:58Z) - Evaluating Multiview Object Consistency in Humans and Image Models [68.36073530804296]
We leverage an experimental design from the cognitive sciences which requires zero-shot visual inferences about object shape.
We collect 35K trials of behavioral data from over 500 participants.
We then evaluate the performance of common vision models.
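A minimal sketch of how such a zero-shot probe can be run on a vision model: embed each image in a triplet and predict that the two most similar embeddings show the same object. The encoder is abstracted away and the toy features are invented; this is the flavor of evaluation, not the paper's exact protocol.
```python
# Zero-shot "which pair shows the same object" probe over a triplet of
# image embeddings from any pretrained vision encoder.
import numpy as np

def predict_matching_pair(embeddings: np.ndarray) -> tuple[int, int]:
    """embeddings: (3, d) array, one row per image in the triplet.
    Returns the indices of the pair with the highest cosine similarity."""
    z = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = z @ z.T
    pairs = [(0, 1), (0, 2), (1, 2)]
    return max(pairs, key=lambda p: sims[p[0], p[1]])

# Illustrative: images 0 and 2 have the most similar features.
toy = np.array([[1.0, 0.1, 0.0], [0.0, 1.0, 0.2], [0.9, 0.2, 0.1]])
print(predict_matching_pair(toy))  # -> (0, 2)
```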
arXiv Detail & Related papers (2024-09-09T17:59:13Z) - Vision-Language Models under Cultural and Inclusive Considerations [53.614528867159706]
Large vision-language models (VLMs) can assist visually impaired people by describing images from their daily lives.
Current evaluation datasets may not reflect diverse cultural user backgrounds or the situational context of this use case.
We create a survey to determine caption preferences and propose a culture-centric evaluation benchmark by filtering VizWiz, an existing dataset with images taken by people who are blind.
We then evaluate several VLMs, investigating their reliability as visual assistants in a culturally diverse setting.
arXiv Detail & Related papers (2024-07-08T17:50:00Z) - Decomposed evaluations of geographic disparities in text-to-image models [22.491466809896867]
We introduce a new set of metrics, Decomposed Indicators of Disparities in Image Generation (Decomposed-DIG), that allows us to measure geographic disparities in the depiction of objects and backgrounds in generated images.
Using Decomposed-DIG, we audit a widely used latent diffusion model and find that generated images depict objects with better realism than backgrounds.
We use Decomposed-DIG to pinpoint specific examples of disparities, such as stereotypical background generation in Africa, difficulty generating modern vehicles in Africa, and unrealistic placement of some objects in outdoor settings.
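A hedged sketch of the decomposition idea: after a segmentation mask splits each generated image into object and background, score the two parts separately against real reference features. The nearest-neighbor realism proxy and synthetic features below are illustrative assumptions, not the paper's exact indicator.
```python
# Score object vs. background features separately against real references.
import numpy as np

def realism(gen_feats: np.ndarray, real_feats: np.ndarray) -> float:
    """Fraction of generated features whose nearest real feature lies
    within a radius estimated from the real set (precision-style proxy)."""
    d_real = np.linalg.norm(real_feats[:, None] - real_feats[None], axis=-1)
    np.fill_diagonal(d_real, np.inf)
    radius = np.median(d_real.min(axis=1))          # typical real-to-real gap
    d_gen = np.linalg.norm(gen_feats[:, None] - real_feats[None], axis=-1)
    return float((d_gen.min(axis=1) <= radius).mean())

rng = np.random.default_rng(0)
real_obj, real_bg = rng.normal(0, 1, (64, 8)), rng.normal(3, 1, (64, 8))
gen_obj, gen_bg = rng.normal(0, 1, (32, 8)), rng.normal(5, 1, (32, 8))
print("object realism:", realism(gen_obj, real_obj))     # high: same distribution
print("background realism:", realism(gen_bg, real_bg))   # low: shifted distribution
```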
arXiv Detail & Related papers (2024-06-17T18:04:23Z) - Improving Geo-diversity of Generated Images with Contextualized Vendi Score Guidance [12.33170407159189]
State-of-the-art text-to-image generative models struggle to depict everyday objects with the true diversity of the real world.
We introduce an inference time intervention, contextualized Vendi Score Guidance (c-VSG), that guides the backwards steps of latent diffusion models to increase the diversity of a sample.
We find that c-VSG substantially increases the diversity of generated images, both for the worst performing regions and on average, while simultaneously maintaining or improving image quality and consistency.
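For reference, the Vendi Score that c-VSG builds on is the exponentiated entropy of the eigenvalues of a normalized similarity kernel; a minimal sketch over generic embeddings (feature extraction abstracted away), not the guidance procedure itself:
```python
# Vendi Score over a cosine-similarity kernel of sample embeddings.
import numpy as np

def vendi_score(feats: np.ndarray) -> float:
    """feats: (n, d) array of embeddings. Returns a value in [1, n]:
    ~1 if all samples are identical, n if fully dissimilar."""
    z = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    k = z @ z.T / len(z)                       # normalized similarity kernel
    eig = np.linalg.eigvalsh(k)
    eig = eig[eig > 1e-12]                     # drop numerical zeros
    return float(np.exp(-(eig * np.log(eig)).sum()))

rng = np.random.default_rng(0)
diverse = rng.normal(size=(50, 16))                      # spread-out samples
collapsed = np.tile(rng.normal(size=(1, 16)), (50, 1))   # mode collapse
print(vendi_score(diverse), vendi_score(collapsed))      # high vs. ~1.0
```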
arXiv Detail & Related papers (2024-06-06T23:35:51Z) - DIG In: Evaluating Disparities in Image Generations with Indicators for Geographic Diversity [24.887571095245313]
We introduce three indicators to evaluate the realism, diversity and prompt-generation consistency of text-to-image generative systems.
We find that models generate images with less realism and diversity when prompted for Africa and West Asia than for Europe.
Perhaps most interestingly, our indicators suggest that progress in image generation quality has come at the cost of real-world geographic representation.
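Of the three indicators, prompt-generation consistency is the most straightforward to approximate; the sketch below uses CLIP image-text similarity via the Hugging Face transformers API. The checkpoint choice is illustrative, and the paper's exact indicator may differ.
```python
# A hedged sketch of a prompt-generation consistency score using CLIP;
# grouping the scores by prompted region would follow.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def consistency(image: Image.Image, prompt: str) -> float:
    """Cosine similarity between CLIP embeddings of a generated image and
    its prompt; higher means the image better matches the prompt."""
    inputs = processor(text=[prompt], images=image,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return float((img * txt).sum())
```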
arXiv Detail & Related papers (2023-08-11T15:43:37Z) - Social Biases through the Text-to-Image Generation Lens [9.137275391251517]
Text-to-Image (T2I) generation is enabling new applications that support creators, designers, and general end users of productivity software.
We take a multi-dimensional approach to studying and quantifying common social biases as reflected in the generated images.
We present findings for two popular T2I models: DALLE-v2 and Stable Diffusion.
arXiv Detail & Related papers (2023-03-30T05:29:13Z) - Fairness meets Cross-Domain Learning: a new perspective on Models and Metrics [80.07271410743806]
We study the relationship between cross-domain learning (CD) and model fairness.
We introduce a benchmark on face and medical images spanning several demographic groups as well as classification and localization tasks.
Our study covers 14 CD approaches alongside three state-of-the-art fairness algorithms and shows how the former can outperform the latter.
arXiv Detail & Related papers (2023-03-25T09:34:05Z) - Stable Bias: Analyzing Societal Representations in Diffusion Models [72.27121528451528]
We propose a new method for exploring the social biases in Text-to-Image (TTI) systems.
Our approach relies on characterizing the variation in generated images triggered by enumerating gender and ethnicity markers in the prompts.
We leverage this method to analyze images generated by 3 popular TTI systems and find that while all of their outputs show correlations with US labor demographics, they also consistently under-represent marginalized identities to different extents.
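A minimal sketch of the prompt-enumeration setup described above; the marker and profession lists are illustrative stand-ins for the paper's full vocabulary, not its actual prompt set.
```python
# Enumerate identity markers against profession terms to build prompts
# whose generated image sets can then be compared for variation.
from itertools import product

identity_markers = ["woman", "man", "non-binary person",
                    "Black person", "East Asian person", "Latinx person"]
professions = ["doctor", "janitor", "CEO", "nurse"]

prompts = [f"Photo portrait of a {marker} working as a {job}"
           for marker, job in product(identity_markers, professions)]
# One would then generate N images per prompt with each TTI system and
# characterize variation along the marker axis (e.g., cluster visual
# features per profession and inspect marker coverage per cluster).
print(len(prompts), prompts[0])
```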
arXiv Detail & Related papers (2023-03-20T19:32:49Z) - Automatic Main Character Recognition for Photographic Studies [78.88882860340797]
Main characters in images are the humans who most catch the viewer's attention at first glance.
Identifying the main character in images plays an important role in traditional photographic studies and media analysis.
We propose a method for identifying the main characters using machine learning based human pose estimation.
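A rough sketch of one plausible scoring rule on top of pose estimation: rank detected people by the apparent size and centrality of their keypoints. The heuristic and its weights are invented for illustration, not the paper's method.
```python
# Rank detected people by a size-times-centrality heuristic over their
# pose keypoints; any off-the-shelf pose estimator can supply the inputs.
import numpy as np

def main_character(people_keypoints: list[np.ndarray],
                   img_w: int, img_h: int) -> int:
    """people_keypoints: one (k, 2) array of (x, y) keypoints per person.
    Returns the index of the most likely main character."""
    center = np.array([img_w / 2, img_h / 2])
    scores = []
    for kps in people_keypoints:
        area = (kps.max(axis=0) - kps.min(axis=0)).prod()  # bounding-box proxy
        dist = np.linalg.norm(kps.mean(axis=0) - center)   # distance from center
        scores.append(area / (1.0 + dist))                 # big and central wins
    return int(np.argmax(scores))

# Toy example: the second person is larger and nearer the image center.
p0 = np.array([[10.0, 10.0], [40.0, 80.0]])
p1 = np.array([[200.0, 100.0], [440.0, 460.0]])
print(main_character([p0, p1], img_w=640, img_h=480))  # -> 1
```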
arXiv Detail & Related papers (2021-06-16T18:14:45Z)