CulturalFrames: Assessing Cultural Expectation Alignment in Text-to-Image Models and Evaluation Metrics
- URL: http://arxiv.org/abs/2506.08835v1
- Date: Tue, 10 Jun 2025 14:21:46 GMT
- Title: CulturalFrames: Assessing Cultural Expectation Alignment in Text-to-Image Models and Evaluation Metrics
- Authors: Shravan Nayak, Mehar Bhatia, Xiaofeng Zhang, Verena Rieser, Lisa Anne Hendricks, Sjoerd van Steenkiste, Yash Goyal, Karolina Stańczak, Aishwarya Agrawal
- Abstract summary: We quantify the alignment of text-to-image (T2I) models and evaluation metrics with respect to both explicit and implicit cultural expectations. We introduce CulturalFrames, a novel benchmark designed for rigorous human evaluation of cultural representation in visual generations. We find that T2I models fail to meet not only the more challenging implicit expectations but also the less challenging explicit expectations.
- Score: 23.567641319277943
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The increasing ubiquity of text-to-image (T2I) models as tools for visual content generation raises concerns about their ability to accurately represent diverse cultural contexts. In this work, we present the first study to systematically quantify the alignment of T2I models and evaluation metrics with respect to both explicit and implicit cultural expectations. To this end, we introduce CulturalFrames, a novel benchmark designed for rigorous human evaluation of cultural representation in visual generations. Spanning 10 countries and 5 socio-cultural domains, CulturalFrames comprises 983 prompts, 3637 corresponding images generated by 4 state-of-the-art T2I models, and over 10k detailed human annotations. We find that T2I models fail to meet not only the more challenging implicit expectations but also the less challenging explicit expectations. Across models and countries, cultural expectations are missed an average of 44% of the time. Among these failures, explicit expectations are missed at a surprisingly high average rate of 68%, while implicit expectation failures are also significant, averaging 49%. Furthermore, we demonstrate that existing T2I evaluation metrics correlate poorly with human judgments of cultural alignment, irrespective of their internal reasoning. Collectively, our findings expose critical gaps, providing actionable directions for developing more culturally informed T2I models and evaluation methodologies.
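One standard way to quantify the metric-human (mis)alignment the abstract reports — not necessarily the paper's exact protocol — is a rank correlation between per-image metric scores and human ratings. A minimal sketch follows; the score arrays are illustrative placeholders, not data from CulturalFrames.

```python
# Minimal sketch: rank correlation between an automatic T2I metric and
# human cultural-alignment ratings. All values below are illustrative
# placeholders, not data from the paper.
from scipy.stats import spearmanr

# Hypothetical per-image scores from an automatic metric (e.g., CLIP-based).
metric_scores = [0.31, 0.78, 0.55, 0.62, 0.20, 0.91]

# Hypothetical human ratings of cultural alignment for the same images (1-5).
human_ratings = [2, 4, 5, 3, 1, 4]

rho, p_value = spearmanr(metric_scores, human_ratings)
print(f"Spearman rho = {rho:.3f} (p = {p_value:.3f})")
# A rho near 0 would indicate the metric tracks human cultural judgments poorly.
```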
Related papers
- CAIRe: Cultural Attribution of Images by Retrieval-Augmented Evaluation [61.130639734982395]
We introduce CAIRe, a novel evaluation metric that assesses the degree of cultural relevance of an image. Our framework grounds entities and concepts in the image to a knowledge base and uses factual information to give independent graded judgments for each culture label.
arXiv Detail & Related papers (2025-06-10T17:16:23Z)
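The CAIRe entry above suggests a retrieve-then-judge pipeline: ground what appears in the image to a knowledge base, then score each culture label against the retrieved evidence. A hedged sketch of that idea, where `detect_entities`, `retrieve_facts`, and `grade` are hypothetical stand-ins; the actual CAIRe components are not specified in the abstract.

```python
# Hedged sketch of a retrieval-augmented cultural-relevance score in the
# spirit of CAIRe's abstract. Every helper here is a hypothetical stub;
# the real CAIRe pipeline may differ substantially.
from typing import Callable

def cultural_relevance(
    image_path: str,
    culture_labels: list[str],
    detect_entities: Callable[[str], list[str]],  # image -> entity names
    retrieve_facts: Callable[[str], list[str]],   # entity -> KB facts
    grade: Callable[[str, list[str]], float],     # (culture, facts) -> [0, 1]
) -> dict[str, float]:
    """Give an independent graded judgment for each culture label."""
    entities = detect_entities(image_path)
    # Pool knowledge-base facts for everything grounded in the image.
    facts = [fact for entity in entities for fact in retrieve_facts(entity)]
    # Judge each culture independently against the retrieved evidence.
    return {culture: grade(culture, facts) for culture in culture_labels}
```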
- Multimodal Cultural Safety: Evaluation Frameworks and Alignment Strategies [58.88053690412802]
Large vision-language models (LVLMs) are increasingly deployed in globally distributed applications, such as tourism assistants. CROSS is a benchmark designed to assess the cultural safety reasoning capabilities of LVLMs. We evaluate 21 leading LVLMs, including mixture-of-experts models and reasoning models.
arXiv Detail & Related papers (2025-05-20T23:20:38Z)
- Deconstructing Bias: A Multifaceted Framework for Diagnosing Cultural and Compositional Inequities in Text-to-Image Generative Models [3.6335172274433414]
This paper benchmarks the Component Inclusion Score (CIS), a metric designed to evaluate the fidelity of image generation across cultural contexts. We quantify biases in terms of compositional fragility and contextual misalignment, revealing significant performance gaps between Western and non-Western cultural prompts.
arXiv Detail & Related papers (2025-04-05T06:17:43Z)
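A component-inclusion metric of the kind named in the entry above can be read as: enumerate the components a prompt asks for, check which ones a visual model finds in the generated image, and report the included fraction. A minimal sketch under that reading; the abstract does not give CIS's exact formulation, and `is_present` is a hypothetical visual check (e.g., an object detector or a per-component VQA query).

```python
# Minimal sketch of a component-inclusion style score, assuming a simple
# "fraction of components found" reading. `is_present` is a hypothetical
# visual check; the actual CIS definition may differ.
from typing import Callable

def component_inclusion_score(
    image_path: str,
    prompt_components: list[str],
    is_present: Callable[[str, str], bool],  # (image, component) -> found?
) -> float:
    """Fraction of prompt components visibly realized in the image."""
    if not prompt_components:
        return 1.0  # vacuously complete
    hits = sum(is_present(image_path, c) for c in prompt_components)
    return hits / len(prompt_components)

# Illustrative usage with a placeholder checker:
score = component_inclusion_score(
    "diwali_scene.png",
    ["oil lamps", "rangoli", "marigold garlands"],
    is_present=lambda img, comp: comp != "rangoli",  # stub: "misses" rangoli
)
print(score)  # 0.666... under this stub
```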
- CultureVLM: Characterizing and Improving Cultural Understanding of Vision-Language Models for over 100 Countries [63.00147630084146]
Vision-language models (VLMs) have advanced human-AI interaction but struggle with cultural understanding. CultureVerse is a large-scale multimodal benchmark covering 19,682 cultural concepts, 188 countries/regions, 15 cultural topics, and 3 question types. We propose CultureVLM, a series of VLMs fine-tuned on our dataset to achieve significant performance improvement in cultural understanding.
arXiv Detail & Related papers (2025-01-02T14:42:37Z)
- Beyond Aesthetics: Cultural Competence in Text-to-Image Models [34.98692829036475]
CUBE is a first-of-its-kind benchmark to evaluate the cultural competence of Text-to-Image models. CUBE covers cultural artifacts associated with 8 countries across different geo-cultural regions. CUBE-CSpace is a larger dataset of cultural artifacts that serves as grounding to evaluate cultural diversity.
arXiv Detail & Related papers (2024-07-09T13:50:43Z)
- CulturePark: Boosting Cross-cultural Understanding in Large Language Models [63.452948673344395]
This paper introduces CulturePark, an LLM-powered multi-agent communication framework for cultural data collection.
It generates high-quality cross-cultural dialogues encapsulating human beliefs, norms, and customs.
We evaluate these models across three downstream tasks: content moderation, cultural alignment, and cultural education.
arXiv Detail & Related papers (2024-05-24T01:49:02Z)
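The CulturePark entry above describes LLM-powered agents conversing to elicit cultural beliefs, norms, and customs. A minimal sketch of such a cross-cultural dialogue loop, where `chat` is a hypothetical stand-in for any LLM completion call; the real CulturePark architecture is not detailed in the abstract.

```python
# Hedged sketch of an LLM-powered cross-cultural dialogue loop in the
# spirit of CulturePark's abstract. `chat` is a hypothetical LLM call;
# the actual multi-agent framework may differ substantially.
from typing import Callable

def cross_cultural_dialogue(
    chat: Callable[[str, list[str]], str],  # (persona prompt, history) -> reply
    culture_a: str,
    culture_b: str,
    topic: str,
    turns: int = 4,
) -> list[str]:
    """Generate a dialogue between two culture-conditioned agents."""
    personas = [
        f"You are a person from {culture_a}. Discuss: {topic}.",
        f"You are a person from {culture_b}. Discuss: {topic}.",
    ]
    history: list[str] = []
    for turn in range(turns):
        speaker = personas[turn % 2]  # the two agents alternate turns
        history.append(chat(speaker, history))
    return history  # raw dialogue; a real pipeline would filter and refine it
```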
- Not All Countries Celebrate Thanksgiving: On the Cultural Dominance in Large Language Models [89.94270049334479]
This paper identifies a cultural dominance issue within large language models (LLMs).
LLMs often provide inappropriate English-culture-related answers that are not relevant to the expected culture when users ask in non-English languages.
arXiv Detail & Related papers (2023-10-19T05:38:23Z)
- Navigating Cultural Chasms: Exploring and Unlocking the Cultural POV of Text-To-Image Models [32.99865895211158]
We explore the cultural perception embedded in Text-To-Image (TTI) models by characterizing culture across three tiers.
We propose a comprehensive suite of evaluation techniques, including intrinsic evaluations using the CLIP space.
To bolster our research, we introduce the CulText2I dataset, derived from six diverse TTI models and spanning ten languages.
arXiv Detail & Related papers (2023-10-03T10:13:36Z)
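The CulText2I entry above mentions intrinsic evaluations in CLIP space. A concrete reading is to embed a generated image and culture-specific text probes with CLIP and compare similarities; a minimal sketch using the Hugging Face CLIP API follows, where the probe texts and image path are illustrative placeholders, not the paper's protocol.

```python
# Minimal sketch of an intrinsic evaluation in CLIP space: embed a generated
# image and culture-specific text probes, then compare similarity scores.
# The image path and probe texts below are illustrative placeholders.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("generated_wedding.png")  # hypothetical T2I output
probes = ["a wedding in Japan", "a wedding in Nigeria", "a wedding in the USA"]

inputs = processor(text=probes, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds image-text similarity scores (one per probe);
# a skewed distribution hints at which culture the image most resembles.
scores = outputs.logits_per_image.softmax(dim=-1).squeeze()
for probe, score in zip(probes, scores.tolist()):
    print(f"{probe}: {score:.3f}")
```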
- On the Cultural Gap in Text-to-Image Generation [75.69755281031951]
One challenge in text-to-image (T2I) generation is the inadvertent reflection of culture gaps present in the training data.
There is no benchmark to systematically evaluate a T2I model's ability to generate cross-cultural images.
We propose a Challenging Cross-Cultural (C3) benchmark with comprehensive evaluation criteria, which can assess how well-suited a model is to a target culture.
arXiv Detail & Related papers (2023-07-06T13:17:55Z)