Deconstructing Bias: A Multifaceted Framework for Diagnosing Cultural and Compositional Inequities in Text-to-Image Generative Models
- URL: http://arxiv.org/abs/2505.01430v1
- Date: Sat, 05 Apr 2025 06:17:43 GMT
- Title: Deconstructing Bias: A Multifaceted Framework for Diagnosing Cultural and Compositional Inequities in Text-to-Image Generative Models
- Authors: Muna Numan Said, Aarib Zaidi, Rabia Usman, Sonia Okon, Praneeth Medepalli, Kevin Zhu, Vasu Sharma, Sean O'Brien
- Abstract summary: This paper benchmarks the Component Inclusion Score (CIS), a metric designed to evaluate the fidelity of image generation across cultural contexts. We quantify biases in terms of compositional fragility and contextual misalignment, revealing significant performance gaps between Western and non-Western cultural prompts.
- Score: 3.6335172274433414
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The transformative potential of text-to-image (T2I) models hinges on their ability to synthesize culturally diverse, photorealistic images from textual prompts. However, these models often perpetuate cultural biases embedded within their training data, leading to systemic misrepresentations. This paper benchmarks the Component Inclusion Score (CIS), a metric designed to evaluate the fidelity of image generation across cultural contexts. Through extensive analysis involving 2,400 images, we quantify biases in terms of compositional fragility and contextual misalignment, revealing significant performance gaps between Western and non-Western cultural prompts. Our findings underscore the impact of data imbalance, attention entropy, and embedding superposition on model fairness. By benchmarking models like Stable Diffusion with CIS, we provide insights into architectural and data-centric interventions for enhancing cultural inclusivity in AI-generated imagery. This work advances the field by offering a comprehensive tool for diagnosing and mitigating biases in T2I generation, advocating for more equitable AI systems.
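As a rough illustration only: the abstract does not spell out the CIS formula, but a component-inclusion style score can be sketched as below, where `detect_component` stands in for any detector (open-vocabulary detection or a VQA model) and the equal-weighted averaging is an assumption rather than the authors' definition.

```python
# Hypothetical sketch of a Component Inclusion Score (CIS)-style metric.
# Assumptions: each prompt is decomposed into named components, and
# `detect_component` returns a confidence in [0, 1] that the component
# appears in the generated image. This is NOT the authors' exact formula.

from typing import Callable, Dict, List


def component_inclusion_score(
    image,
    components: List[str],
    detect_component: Callable[[object, str], float],
    threshold: float = 0.5,
) -> float:
    """Fraction of prompt components judged present in the image."""
    if not components:
        return 1.0
    hits = sum(1 for c in components if detect_component(image, c) >= threshold)
    return hits / len(components)


def cultural_gap(scores_by_region: Dict[str, List[float]]) -> Dict[str, float]:
    """Mean CIS per cultural region, to expose Western vs. non-Western gaps."""
    return {
        region: sum(vals) / len(vals)
        for region, vals in scores_by_region.items()
        if vals
    }
```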
Related papers
- Can Generated Images Serve as a Viable Modality for Text-Centric Multimodal Learning? [3.966028515034415]
This work systematically investigates whether images generated on the fly by text-to-image (T2I) models can serve as a valuable complementary modality for text-centric tasks.
arXiv Detail & Related papers (2025-06-21T07:32:09Z)
- CAIRe: Cultural Attribution of Images by Retrieval-Augmented Evaluation [61.130639734982395]
We introduce CAIRe, a novel evaluation metric that assesses the degree of cultural relevance of an image. Our framework grounds entities and concepts in the image to a knowledge base and uses factual information to give independent graded judgments for each culture label.
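A minimal sketch of the retrieval-augmented idea, assuming image and knowledge-base entries share an embedding space and each entry carries culture labels; the schema and similarity-weighted voting are illustrative, not CAIRe's actual pipeline.

```python
# Illustrative sketch of retrieval-augmented cultural attribution.
# Assumptions: `image_vec` and each KB entry's "embedding" live in the same
# vector space, and each entry carries a list of culture labels. This mirrors
# the idea of grounding image content in a knowledge base, not CAIRe itself.

import numpy as np


def cultural_relevance(image_vec: np.ndarray, kb_entries: list, top_k: int = 5) -> dict:
    """Return a graded score per culture label from the top-k retrieved entries."""
    sims = np.array([
        float(image_vec @ e["embedding"] /
              (np.linalg.norm(image_vec) * np.linalg.norm(e["embedding"]) + 1e-8))
        for e in kb_entries
    ])
    top = sims.argsort()[::-1][:top_k]
    scores: dict = {}
    for idx in top:
        for culture in kb_entries[idx]["cultures"]:
            # Similarity-weighted vote: closer KB entities count more.
            scores[culture] = scores.get(culture, 0.0) + float(sims[idx])
    total = sum(scores.values()) or 1.0
    return {c: s / total for c, s in scores.items()}
```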
arXiv Detail & Related papers (2025-06-10T17:16:23Z)
- Diffusion Models Through a Global Lens: Are They Culturally Inclusive? [15.991121392458748]
We introduce the CultDiff benchmark, evaluating state-of-the-art diffusion models. We show that these models often fail to generate cultural artifacts in architecture, clothing, and food, especially for underrepresented country regions. We develop a neural-based image-image similarity metric, CultDiff-S, to predict human judgment on real and generated images with cultural artifacts.
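A minimal PyTorch sketch of a learned image-image similarity metric in this spirit: a small head over frozen image embeddings regressed against human ratings. The architecture and loss are assumptions, not the CultDiff-S design.

```python
# Sketch of a learned image-image similarity metric trained on human judgments.
# Assumptions: embeddings come from a frozen image encoder; the head and MSE
# objective are illustrative choices, not the CultDiff-S architecture.

import torch
import torch.nn as nn


class PairSimilarityHead(nn.Module):
    def __init__(self, dim: int = 512):
        super().__init__()
        # Input: both embeddings plus their absolute difference.
        self.mlp = nn.Sequential(
            nn.Linear(dim * 3, 256), nn.ReLU(), nn.Linear(256, 1), nn.Sigmoid()
        )

    def forward(self, real_emb: torch.Tensor, gen_emb: torch.Tensor) -> torch.Tensor:
        feats = torch.cat([real_emb, gen_emb, (real_emb - gen_emb).abs()], dim=-1)
        return self.mlp(feats).squeeze(-1)  # predicted human rating in [0, 1]


def train_step(model, loader, optimizer):
    """`loader` yields (real_emb, gen_emb, human_score) batches."""
    loss_fn = nn.MSELoss()
    for real_emb, gen_emb, human_score in loader:
        pred = model(real_emb, gen_emb)
        loss = loss_fn(pred, human_score)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```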
arXiv Detail & Related papers (2025-02-13T03:05:42Z)
- FairT2I: Mitigating Social Bias in Text-to-Image Generation via Large Language Model-Assisted Detection and Attribute Rebalancing [32.01426831450348]
We introduce FairT2I, a novel framework that harnesses large language models to detect and mitigate social biases in T2I generation. Our results show that FairT2I successfully mitigates social biases and enhances the diversity of sensitive attributes in generated images.
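A hedged sketch of the two-stage idea, with `ask_llm` and `generate_image` as placeholder hooks: an LLM flags prompts that leave a sensitive attribute unspecified, then attribute values are sampled uniformly across generations. Prompts and attribute handling are illustrative, not the FairT2I implementation.

```python
# Sketch of LLM-assisted bias detection and attribute rebalancing.
# `ask_llm` and `generate_image` are placeholder hooks for an LLM API and a
# T2I model; the question templates are illustrative assumptions.

import random
from typing import Callable, List


def rebalanced_generation(
    prompt: str,
    ask_llm: Callable[[str], str],
    generate_image: Callable[[str], object],
    n_images: int = 4,
):
    # Stage 1: ask the LLM whether the prompt under-specifies a sensitive attribute.
    question = (
        "Does this text-to-image prompt leave a sensitive attribute "
        "(e.g., gender, ethnicity, age) unspecified? Answer yes or no: " + prompt
    )
    if ask_llm(question).strip().lower().startswith("no"):
        return [generate_image(prompt) for _ in range(n_images)]

    # Stage 2: ask for attribute values and sample them across the batch.
    values: List[str] = ask_llm(
        "List, comma-separated, plausible values of that attribute for: " + prompt
    ).split(",")
    values = [v.strip() for v in values if v.strip()]
    return [
        generate_image(f"{prompt}, {random.choice(values)}")
        for _ in range(n_images)
    ]
```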
arXiv Detail & Related papers (2025-02-06T07:22:57Z)
- KITTEN: A Knowledge-Intensive Evaluation of Image Generation on Visual Entities [93.74881034001312]
We conduct a systematic study on the fidelity of entities in text-to-image generation models.
We focus on their ability to generate a wide range of real-world visual entities, such as landmark buildings, aircraft, plants, and animals.
Our findings reveal that even the most advanced text-to-image models often fail to generate entities with accurate visual details.
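A minimal sketch of one way to score entity fidelity, assuming reference photos of the named entity and a shared image encoder (`embed` is a placeholder); the max-similarity aggregation is an assumption, not the KITTEN protocol.

```python
# Sketch of an entity-fidelity check: compare a generated image against
# reference photos of the real-world entity in a shared embedding space.
# `embed` is a placeholder for any image encoder (e.g., a CLIP image tower).

import numpy as np
from typing import Callable, List


def entity_fidelity(
    generated_image,
    reference_images: List[object],
    embed: Callable[[object], np.ndarray],
) -> float:
    """Max cosine similarity between the generation and any reference photo."""
    g = embed(generated_image)
    g = g / (np.linalg.norm(g) + 1e-8)
    best = -1.0
    for ref in reference_images:
        r = embed(ref)
        r = r / (np.linalg.norm(r) + 1e-8)
        best = max(best, float(g @ r))
    return best  # low values suggest missing or wrong visual details
```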
arXiv Detail & Related papers (2024-10-15T17:50:37Z)
- PhyBench: A Physical Commonsense Benchmark for Evaluating Text-to-Image Models [50.33699462106502]
Text-to-image (T2I) models frequently fail to produce images consistent with physical commonsense.
Current T2I evaluation benchmarks focus on metrics such as accuracy, bias, and safety, neglecting the evaluation of models' internal knowledge.
We introduce PhyBench, a comprehensive T2I evaluation dataset comprising 700 prompts across 4 primary categories: mechanics, optics, thermodynamics, and material properties.
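A sketch of a category-structured evaluation loop under these assumptions: `generate_image` and `judge_physics` are placeholders, and the example prompts are illustrative rather than taken from PhyBench.

```python
# Sketch of a PhyBench-style evaluation loop: generate one image per prompt
# and score physical plausibility with a judge model (e.g., a VQA/VLM scorer).
# Prompts below are illustrative stand-ins, not the benchmark's 700 prompts.

from collections import defaultdict

PROMPTS = {
    "mechanics": ["a book balanced on the edge of a table"],
    "optics": ["a spoon half-submerged in a glass of water"],
    "thermodynamics": ["ice cubes melting on a hot pan"],
    "material properties": ["a glass vase shattering on a stone floor"],
}


def run_benchmark(generate_image, judge_physics):
    scores = defaultdict(list)
    for category, prompts in PROMPTS.items():
        for prompt in prompts:
            image = generate_image(prompt)
            scores[category].append(judge_physics(image, prompt))  # score in [0, 1]
    return {cat: sum(v) / len(v) for cat, v in scores.items()}
```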
arXiv Detail & Related papers (2024-06-17T17:49:01Z)
- Extrinsic Evaluation of Cultural Competence in Large Language Models [53.626808086522985]
We focus on extrinsic evaluation of cultural competence in two text generation tasks.
We evaluate model outputs when an explicit cue of culture, specifically nationality, is perturbed in the prompts.
We find weak correlations between text similarity of outputs for different countries and the cultural values of these countries.
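A sketch of this perturbation protocol, assuming placeholder `generate_text`, `text_similarity`, and `cultural_distance` functions; the correlation statistic and prompt template are illustrative.

```python
# Sketch of an extrinsic cultural-competence probe: vary only the nationality
# cue in a prompt, then check whether output similarity tracks cultural values.
# All three callables are placeholders for a generator, a text-similarity
# measure, and a cultural-values distance (e.g., from survey data).

from itertools import combinations
import numpy as np


def nationality_perturbation_probe(
    template: str,            # e.g., "Write a short story about a {nationality} wedding."
    nationalities: list,
    generate_text,
    text_similarity,          # similarity in [0, 1] between two texts
    cultural_distance,        # distance between two countries' value profiles
):
    outputs = {n: generate_text(template.format(nationality=n)) for n in nationalities}
    sims, dists = [], []
    for a, b in combinations(nationalities, 2):
        sims.append(text_similarity(outputs[a], outputs[b]))
        dists.append(cultural_distance(a, b))
    # A near-zero correlation suggests outputs do not reflect cultural values.
    return float(np.corrcoef(sims, dists)[0, 1])
```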
arXiv Detail & Related papers (2024-06-17T14:03:27Z)
- Lost in Translation? Translation Errors and Challenges for Fair Assessment of Text-to-Image Models on Multilingual Concepts [107.32683485639654]
Benchmarks of the multilingual capabilities of text-to-image (T2I) models compare generated images prompted in a test language to an expected image distribution over a concept set.
One such benchmark, "Conceptual Coverage Across Languages" (CoCo-CroLa), assesses the tangible noun inventory of T2I models by prompting them to generate pictures from a concept list translated to seven languages.
We find that this benchmark contains translation errors of varying severity in Spanish, Japanese, and Chinese.
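A sketch of a back-translation audit that would surface such errors, assuming a placeholder `translate` function; the exact-match flagging rule is an assumption, and flagged items would still need human review.

```python
# Sketch of a back-translation check for a translated concept list, the kind
# of audit that can surface the errors reported for Spanish, Japanese, and
# Chinese. `translate` is a placeholder MT function.

from typing import Callable, Dict, List


def flag_suspect_translations(
    concepts_en: List[str],
    translations: Dict[str, Dict[str, str]],    # lang -> {english_concept: translation}
    translate: Callable[[str, str, str], str],  # (text, src_lang, tgt_lang) -> text
) -> Dict[str, List[str]]:
    """Return, per language, concepts whose back-translation drifts from the source."""
    suspects: Dict[str, List[str]] = {}
    for lang, table in translations.items():
        for concept in concepts_en:
            back = translate(table[concept], lang, "en").strip().lower()
            if back != concept.strip().lower():
                suspects.setdefault(lang, []).append(concept)  # needs human review
    return suspects
```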
arXiv Detail & Related papers (2024-03-17T05:05:11Z)
- Separate-and-Enhance: Compositional Finetuning for Text2Image Diffusion Models [58.46926334842161]
This work illuminates the fundamental reasons for compositional misalignment in text-to-image generation, pinpointing issues related to low attention activation scores and mask overlaps.
We propose two novel objectives, the Separate loss and the Enhance loss, that reduce object mask overlaps and maximize attention scores.
Our method diverges from conventional test-time-adaptation techniques, focusing on finetuning critical parameters, which enhances scalability and generalizability.
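A PyTorch sketch of the two objectives as described here: a Separate loss penalizing overlap between per-object cross-attention maps and an Enhance loss rewarding strong activation for each object token. Normalization and weighting are assumptions, not the paper's exact formulation.

```python
# Sketch of Separate/Enhance-style objectives on cross-attention maps.
# `attn_maps` holds one spatial attention map per object token.

import torch


def separate_loss(attn_maps: torch.Tensor) -> torch.Tensor:
    """attn_maps: (num_objects, H, W). Penalize pairwise spatial overlap."""
    n = attn_maps.shape[0]
    loss = attn_maps.new_zeros(())
    for i in range(n):
        for j in range(i + 1, n):
            loss = loss + (attn_maps[i] * attn_maps[j]).sum()
    return loss


def enhance_loss(attn_maps: torch.Tensor) -> torch.Tensor:
    """Encourage each object's attention map to have at least one strong peak."""
    peaks = attn_maps.flatten(1).max(dim=1).values   # (num_objects,)
    return (1.0 - peaks).clamp(min=0).mean()


def total_loss(attn_maps, lambda_sep: float = 1.0, lambda_enh: float = 1.0):
    # Weights are illustrative; the paper's balancing scheme may differ.
    return lambda_sep * separate_loss(attn_maps) + lambda_enh * enhance_loss(attn_maps)
```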
arXiv Detail & Related papers (2023-12-10T22:07:42Z)
- On the Cultural Gap in Text-to-Image Generation [75.69755281031951]
One challenge in text-to-image (T2I) generation is the inadvertent reflection of cultural gaps present in the training data.
There is no benchmark to systematically evaluate a T2I model's ability to generate cross-cultural images.
We propose a Challenging Cross-Cultural (C3) benchmark with comprehensive evaluation criteria, which can assess how well-suited a model is to a target culture.
arXiv Detail & Related papers (2023-07-06T13:17:55Z)
- Towards Equitable Representation in Text-to-Image Synthesis Models with the Cross-Cultural Understanding Benchmark (CCUB) Dataset [8.006068032606182]
We propose a culturally-aware priming approach for text-to-image synthesis using a small but culturally curated dataset.
Our experiments indicate that priming using both text and image is effective in improving the cultural relevance and decreasing the offensiveness of generated images.
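A minimal sketch of text-and-image priming with a small curated set, where `finetune` and `generate_image` are placeholder hooks and the priming template is illustrative rather than the CCUB recipe.

```python
# Sketch of culturally-aware priming with a small curated dataset: curated
# captions augment the text prompt, and curated image-text pairs can serve
# as a tiny adaptation set. Hooks and template are illustrative assumptions.

import random
from typing import Callable, List, Tuple


def primed_generate(
    prompt: str,
    culture: str,
    curated_pairs: List[Tuple[object, str]],   # (image, caption) pairs for this culture
    generate_image: Callable[[str], object],
    finetune: Callable[[List[Tuple[object, str]]], None] = None,
):
    # Optional image priming: lightly adapt the model on the curated pairs.
    if finetune is not None and curated_pairs:
        finetune(curated_pairs)

    # Text priming: prepend a culturally grounded caption to the prompt.
    caption = random.choice(curated_pairs)[1] if curated_pairs else ""
    primed_prompt = f"In the style and context of {culture}. {caption} {prompt}".strip()
    return generate_image(primed_prompt)
```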
arXiv Detail & Related papers (2023-01-28T03:10:33Z)