CAIRe: Cultural Attribution of Images by Retrieval-Augmented Evaluation
- URL: http://arxiv.org/abs/2506.09109v1
- Date: Tue, 10 Jun 2025 17:16:23 GMT
- Title: CAIRe: Cultural Attribution of Images by Retrieval-Augmented Evaluation
- Authors: Arnav Yayavaram, Siddharth Yayavaram, Simran Khanuja, Michael Saxon, Graham Neubig
- Abstract summary: We introduce CAIRe, a novel evaluation metric that assesses the degree of cultural relevance of an image. Our framework grounds entities and concepts in the image to a knowledge base and uses factual information to give independent graded judgments for each culture label.
- Score: 61.130639734982395
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: As text-to-image models become increasingly prevalent, ensuring their equitable performance across diverse cultural contexts is critical. Efforts to mitigate cross-cultural biases have been hampered by trade-offs, including a loss in performance, factual inaccuracies, or offensive outputs. Despite widespread recognition of these challenges, an inability to reliably measure these biases has stalled progress. To address this gap, we introduce CAIRe, a novel evaluation metric that assesses the degree of cultural relevance of an image, given a user-defined set of labels. Our framework grounds entities and concepts in the image to a knowledge base and uses factual information to give independent graded judgments for each culture label. On a manually curated dataset of culturally salient but rare items built using language models, CAIRe surpasses all baselines by 28% F1 points. Additionally, we construct two datasets of culturally universal concepts, one comprising T2I-generated outputs and the other retrieved from naturally occurring data. CAIRe achieves Pearson's correlations of 0.56 and 0.66 with human ratings on these sets, based on a 5-point Likert scale of cultural relevance. This demonstrates its strong alignment with human judgment across diverse image sources.
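To make the abstract's pipeline concrete, below is a minimal Python sketch of the scoring loop it describes: entities grounded in the image are linked to factual knowledge-base text, and each user-defined culture label receives an independent graded judgment. The toy knowledge base, the keyword-based `grade_relevance` heuristic, and the use of a 1-5 grading scale here are illustrative assumptions, not the paper's actual grounding or judgment components.

```python
# A minimal sketch of a CAIRe-style evaluation loop. The entity grounding,
# knowledge base, and judgment function below are hypothetical stand-ins
# for the paper's components.
from dataclasses import dataclass

@dataclass
class Entity:
    name: str
    description: str  # factual text, as if retrieved from a knowledge base

# Toy knowledge base mapping grounded entities to factual descriptions
# (assumption: the real system links image content to an external KB).
KNOWLEDGE_BASE = {
    "sari": "A draped garment worn across South Asia, especially India.",
    "wok": "A round-bottomed pan used widely in China and across East Asia.",
}

def grade_relevance(entity: Entity, culture: str) -> int:
    """Graded judgment (1-5) of an entity's relevance to a culture label.
    A naive keyword match stands in for a fact-based judgment model."""
    return 5 if culture.lower() in entity.description.lower() else 1

def caire_scores(entities: list[Entity], culture_labels: list[str]) -> dict[str, float]:
    """One independent score per culture label, averaged over the
    entities grounded in the image."""
    if not entities:
        return {culture: 0.0 for culture in culture_labels}
    return {
        culture: sum(grade_relevance(e, culture) for e in entities) / len(entities)
        for culture in culture_labels
    }

entities = [Entity(name, desc) for name, desc in KNOWLEDGE_BASE.items()]
print(caire_scores(entities, ["India", "China", "Brazil"]))
# e.g. {'India': 3.0, 'China': 3.0, 'Brazil': 1.0}
```

The point of the sketch is the shape of the interface (image, then grounded entities, then an independent graded score per culture label); in the actual framework the per-label judgments draw on retrieved factual evidence rather than substring matches.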
Related papers
- CultureCLIP: Empowering CLIP with Cultural Awareness through Synthetic Images and Contextualized Captions [4.149285362505653]
Pretrained vision-language models (VLMs) excel in general multimodal comprehension but often struggle to capture nuanced, context-dependent visual cues. We design a data curation pipeline leveraging open-sourced VLMs and text-to-image models to construct CulTwin, a synthetic cultural dataset. Then, we fine-tune CLIP on CulTwin to develop CultureCLIP, which aligns cultural concepts with contextually enhanced captions and synthetic images.
arXiv Detail & Related papers (2025-07-08T17:38:56Z) - CulturalFrames: Assessing Cultural Expectation Alignment in Text-to-Image Models and Evaluation Metrics [23.567641319277943]
We quantify the alignment of text-to-image (T2I) models and evaluation metrics with respect to both explicit and implicit cultural expectations. We introduce CulturalFrames, a novel benchmark designed for rigorous human evaluation of cultural representation in visual generations. We find that T2I models fail to meet not only the more challenging implicit expectations but also the less challenging explicit expectations.
arXiv Detail & Related papers (2025-06-10T14:21:46Z) - From Surveys to Narratives: Rethinking Cultural Value Adaptation in LLMs [57.43233760384488]
Adapting cultural values in Large Language Models (LLMs) presents significant challenges. Prior work primarily aligns LLMs with different cultural values using World Values Survey (WVS) data. In this paper, we investigate WVS-based training for cultural value adaptation and find that relying solely on survey data can homogenize cultural norms and interfere with factual knowledge.
arXiv Detail & Related papers (2025-05-22T09:00:01Z) - RAVENEA: A Benchmark for Multimodal Retrieval-Augmented Visual Culture Understanding [79.44246283490665]
We introduce RAVENEA, a new benchmark designed to advance visual culture understanding through retrieval. RAVENEA focuses on two tasks: culture-focused visual question answering (cVQA) and culture-informed image captioning (cIC). We train and evaluate seven multimodal retrievers for each image query, and measure the downstream impact of retrieval-augmented inputs across fourteen state-of-the-art vision-language models.
arXiv Detail & Related papers (2025-05-20T14:57:16Z) - Deconstructing Bias: A Multifaceted Framework for Diagnosing Cultural and Compositional Inequities in Text-to-Image Generative Models [3.6335172274433414]
This paper benchmarks the Component Inclusion Score (CIS), a metric designed to evaluate the fidelity of image generation across cultural contexts. We quantify biases in terms of compositional fragility and contextual misalignment, revealing significant performance gaps between Western and non-Western cultural prompts.
arXiv Detail & Related papers (2025-04-05T06:17:43Z) - Beyond Aesthetics: Cultural Competence in Text-to-Image Models [34.98692829036475]
CUBE is a first-of-its-kind benchmark to evaluate cultural competence of Text-to-Image models. CUBE covers cultural artifacts associated with 8 countries across different geo-cultural regions. CUBE-CSpace is a larger dataset of cultural artifacts that serves as grounding to evaluate cultural diversity.
arXiv Detail & Related papers (2024-07-09T13:50:43Z) - Extrinsic Evaluation of Cultural Competence in Large Language Models [53.626808086522985]
We focus on extrinsic evaluation of cultural competence in two text generation tasks.
We evaluate model outputs when an explicit cue of culture, specifically nationality, is perturbed in the prompts.
We find weak correlations between text similarity of outputs for different countries and the cultural values of these countries.
arXiv Detail & Related papers (2024-06-17T14:03:27Z) - CulturePark: Boosting Cross-cultural Understanding in Large Language Models [63.452948673344395]
This paper introduces CulturePark, an LLM-powered multi-agent communication framework for cultural data collection.
It generates high-quality cross-cultural dialogues encapsulating human beliefs, norms, and customs.
We evaluate these models across three downstream tasks: content moderation, cultural alignment, and cultural education.
arXiv Detail & Related papers (2024-05-24T01:49:02Z) - On the Cultural Gap in Text-to-Image Generation [75.69755281031951]
One challenge in text-to-image (T2I) generation is the inadvertent reflection of culture gaps present in the training data.
There is no benchmark to systematically evaluate a T2I model's ability to generate cross-cultural images.
We propose a Challenging Cross-Cultural (C3) benchmark with comprehensive evaluation criteria, which can assess how well-suited a model is to a target culture.
arXiv Detail & Related papers (2023-07-06T13:17:55Z)
This list is automatically generated from the titles and abstracts of the papers on this site.