Related papers: CuRe: Cultural Gaps in the Long Tail of Text-to-Image Systems

CuRe: Cultural Gaps in the Long Tail of Text-to-Image Systems

URL: http://arxiv.org/abs/2506.08071v1
Date: Mon, 09 Jun 2025 17:54:41 GMT
Title: CuRe: Cultural Gaps in the Long Tail of Text-to-Image Systems
Authors: Aniket Rege, Zinnia Nie, Mahesh Ramesh, Unmesh Raskar, Zhuoran Yu, Aditya Kusupati, Yong Jae Lee, Ramya Korlakai Vinayak,
Abstract summary: CuRe is a benchmarking and scoring suite for cultural representativeness.<n>Our dataset has 300 cultural artifacts across 32 cultural subcategories grouped into six broad cultural axes.<n>We empirically observe much stronger correlations of our class of scorers to human judgments of perceptual similarity, image-text alignment, and cultural diversity.
Score: 28.181690831408833
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Popular text-to-image (T2I) systems are trained on web-scraped data, which is heavily Amero and Euro-centric, underrepresenting the cultures of the Global South. To analyze these biases, we introduce CuRe, a novel and scalable benchmarking and scoring suite for cultural representativeness that leverages the marginal utility of attribute specification to T2I systems as a proxy for human judgments. Our CuRe benchmark dataset has a novel categorical hierarchy built from the crowdsourced Wikimedia knowledge graph, with 300 cultural artifacts across 32 cultural subcategories grouped into six broad cultural axes (food, art, fashion, architecture, celebrations, and people). Our dataset's categorical hierarchy enables CuRe scorers to evaluate T2I systems by analyzing their response to increasing the informativeness of text conditioning, enabling fine-grained cultural comparisons. We empirically observe much stronger correlations of our class of scorers to human judgments of perceptual similarity, image-text alignment, and cultural diversity across image encoders (SigLIP 2, AIMV2 and DINOv2), vision-language models (OpenCLIP, SigLIP 2, Gemini 2.0 Flash) and state-of-the-art text-to-image systems, including three variants of Stable Diffusion (1.5, XL, 3.5 Large), FLUX.1 [dev], Ideogram 2.0, and DALL-E 3. The code and dataset is open-sourced and available at https://aniketrege.github.io/cure/.

Related papers

CAIRe: Cultural Attribution of Images by Retrieval-Augmented Evaluation [61.130639734982395]
We introduce CAIRe, a novel evaluation metric that assesses the degree of cultural relevance of an image.<n>Our framework grounds entities and concepts in the image to a knowledge base and uses factual information to give independent graded judgments for each culture label.
arXiv Detail & Related papers (2025-06-10T17:16:23Z)
Can we Debias Social Stereotypes in AI-Generated Images? Examining Text-to-Image Outputs and User Perceptions [6.87895735248661]
This paper proposes a theory-driven bias detection rubric and a Social Stereotype Index (SSI) to evaluate social biases in T2I outputs.<n>We audited three major T2I model outputs using 100 queries across three categories -- geocultural, occupational, and adjectival.<n>Our findings reveal a key tension -- although prompt refinement can mitigate stereotypes, it can limit contextual alignment.
arXiv Detail & Related papers (2025-05-27T04:01:03Z)
RAVENEA: A Benchmark for Multimodal Retrieval-Augmented Visual Culture Understanding [79.44246283490665]
We introduce RAVENEA, a new benchmark designed to advance visual culture understanding through retrieval.<n>RAVENEA focuses on two tasks: culture-focused visual question answering (cVQA) and culture-informed image captioning (cIC)<n>We train and evaluate seven multimodal retrievers for each image query, and measure the downstream impact of retrieval-augmented inputs across fourteen state-of-the-art vision-language models.
arXiv Detail & Related papers (2025-05-20T14:57:16Z)
Deconstructing Bias: A Multifaceted Framework for Diagnosing Cultural and Compositional Inequities in Text-to-Image Generative Models [3.6335172274433414]
This paper benchmarks the Component Inclusion Score (CIS), a metric designed to evaluate the fidelity of image generation across cultural contexts.<n>We quantify biases in terms of compositional fragility and contextual misalignment, revealing significant performance gaps between Western and non-Western cultural prompts.
arXiv Detail & Related papers (2025-04-05T06:17:43Z)
Seedream 2.0: A Native Chinese-English Bilingual Image Generation Foundation Model [69.09404597939744]
Seedream 2.0 is a native Chinese-English bilingual image generation foundation model.<n>It adeptly manages text prompt in both Chinese and English, supporting bilingual image generation and text rendering.<n>It is integrated with a self-developed bilingual large language model as a text encoder, allowing it to learn native knowledge directly from massive data.
arXiv Detail & Related papers (2025-03-10T17:58:33Z)
What's In My Big Data? [67.04525616289949]
We propose What's In My Big Data? (WIMBD), a platform and a set of sixteen analyses that allow us to reveal and compare the contents of large text corpora. WIMBD builds on two basic capabilities -- count and search -- at scale, which allows us to analyze more than 35 terabytes on a standard compute node. Our analysis uncovers several surprising and previously undocumented findings about these corpora, including the high prevalence of duplicate, synthetic, and low-quality content.
arXiv Detail & Related papers (2023-10-31T17:59:38Z)
On the Cultural Gap in Text-to-Image Generation [75.69755281031951]
One challenge in text-to-image (T2I) generation is the inadvertent reflection of culture gaps present in the training data. There is no benchmark to systematically evaluate a T2I model's ability to generate cross-cultural images. We propose a Challenging Cross-Cultural (C3) benchmark with comprehensive evaluation criteria, which can assess how well-suited a model is to a target culture.
arXiv Detail & Related papers (2023-07-06T13:17:55Z)
Towards Equitable Representation in Text-to-Image Synthesis Models with the Cross-Cultural Understanding Benchmark (CCUB) Dataset [8.006068032606182]
We propose a culturally-aware priming approach for text-to-image synthesis using a small but culturally curated dataset. Our experiments indicate that priming using both text and image is effective in improving the cultural relevance and decreasing the offensiveness of generated images.
arXiv Detail & Related papers (2023-01-28T03:10:33Z)
Minimally-Supervised Structure-Rich Text Categorization via Learning on Text-Rich Networks [61.23408995934415]
We propose a novel framework for minimally supervised categorization by learning from the text-rich network. Specifically, we jointly train two modules with different inductive biases -- a text analysis module for text understanding and a network learning module for class-discriminative, scalable network learning. Our experiments show that given only three seed documents per category, our framework can achieve an accuracy of about 92%.
arXiv Detail & Related papers (2021-02-23T04:14:34Z)
Structure-Augmented Text Representation Learning for Efficient Knowledge Graph Completion [53.31911669146451]
Human-curated knowledge graphs provide critical supportive information to various natural language processing tasks. These graphs are usually incomplete, urging auto-completion of them. graph embedding approaches, e.g., TransE, learn structured knowledge via representing graph elements into dense embeddings. textual encoding approaches, e.g., KG-BERT, resort to graph triple's text and triple-level contextualized representations.
arXiv Detail & Related papers (2020-04-30T13:50:34Z)

This list is automatically generated from the titles and abstracts of the papers in this site.