CulturalBench: A Robust, Diverse, and Challenging Cultural Benchmark by Human-AI CulturalTeaming
- URL: http://arxiv.org/abs/2410.02677v2
- Date: Tue, 03 Jun 2025 01:56:26 GMT
- Title: CulturalBench: A Robust, Diverse, and Challenging Cultural Benchmark by Human-AI CulturalTeaming
- Authors: Yu Ying Chiu, Liwei Jiang, Bill Yuchen Lin, Chan Young Park, Shuyue Stella Li, Sahithya Ravi, Mehar Bhatia, Maria Antoniak, Yulia Tsvetkov, Vered Shwartz, Yejin Choi,
- Abstract summary: CulturalBench is a set of 1,696 human-written and human-verified questions to assess LMs' cultural knowledge. It covers 45 global regions including underrepresented ones like Bangladesh, Zimbabwe, and Peru. We construct CulturalBench using methods inspired by Human-AI Red-Teaming.
- Score: 75.82306181299153
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Robust, diverse, and challenging cultural knowledge benchmarks are essential for measuring our progress towards making LMs that are helpful across diverse cultures. We introduce CulturalBench: a set of 1,696 human-written and human-verified questions to assess LMs' cultural knowledge, covering 45 global regions including underrepresented ones like Bangladesh, Zimbabwe, and Peru. Questions are each verified by five independent annotators and span 17 diverse topics ranging from food preferences to greeting etiquette. We construct CulturalBench using methods inspired by Human-AI Red-Teaming. Compared to human performance (92.4% accuracy), the hard version of CulturalBench is challenging even for the best-performing frontier LMs, which range from 28.7% to 61.5% in accuracy. We find that LMs often struggle with tricky questions that have multiple correct answers (e.g., What utensils do the Chinese usually use?), revealing a tendency to overfit to a single answer. Our results indicate that GPT-4o substantially outperforms other models across cultures, besting even local providers (e.g., Mistral on European culture and DeepSeek on Chinese culture). Across the board, models underperform on questions related to North Africa, South America, and the Middle East.
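For context on how the reported numbers are typically computed, the sketch below scores a model on CulturalBench-style multiple-choice questions and aggregates plain accuracy per region. The JSONL field names ("question", "options", "answer", "region") and the query_model() stub are illustrative assumptions, not the authors' released data format or evaluation harness.

```python
# Sketch: per-region accuracy on CulturalBench-style multiple-choice questions.
# The assumed JSONL fields ("question", "options", "answer", "region") and the
# query_model() stub are hypothetical placeholders, not the official harness.
import json
from collections import defaultdict


def query_model(prompt: str) -> str:
    """Placeholder for a call to the model under test; plug in your provider's API."""
    raise NotImplementedError("replace with an actual model call")


def evaluate(path: str) -> dict[str, float]:
    counts = defaultdict(lambda: [0, 0])  # region -> [correct, total]
    with open(path, encoding="utf-8") as f:
        for line in f:
            item = json.loads(line)
            letters = "ABCD"[: len(item["options"])]
            prompt = (
                item["question"] + "\n"
                + "\n".join(f"{l}. {opt}" for l, opt in zip(letters, item["options"]))
                + "\nAnswer with a single letter."
            )
            pred = query_model(prompt).strip()[:1].upper()
            c = counts[item["region"]]
            c[1] += 1
            c[0] += pred == item["answer"]  # bool adds as 0/1
    return {region: correct / total for region, (correct, total) in counts.items()}
```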
Related papers
- Fluent but Culturally Distant: Can Regional Training Teach Cultural Understanding? [17.231806929840015]
We evaluate five Indic and five global LLMs along two key dimensions: values and practices. Across all four tasks, we find that Indic models do not align more closely with Indian cultural norms than global models. We trace this failure to the scarcity of high-quality, untranslated, and culturally grounded pretraining and fine-tuning data.
arXiv Detail & Related papers (2025-05-25T01:59:23Z) - From Surveys to Narratives: Rethinking Cultural Value Adaptation in LLMs [57.43233760384488]
Adapting cultural values in Large Language Models (LLMs) presents significant challenges. Prior work primarily aligns LLMs with different cultural values using World Values Survey (WVS) data. In this paper, we investigate WVS-based training for cultural value adaptation and find that relying solely on survey data can homogenize cultural norms and interfere with factual knowledge.
arXiv Detail & Related papers (2025-05-22T09:00:01Z) - An Evaluation of Cultural Value Alignment in LLM [27.437888319382893]
We conduct the first large-scale evaluation of LLM cultural values, assessing the cultures and languages of 20 countries across ten LLMs.
Our findings show that the outputs of all models converge on a moderate cultural middle ground.
Deeper investigation sheds light on the influence of model origin, prompt language, and value dimensions on cultural output.
arXiv Detail & Related papers (2025-04-11T09:13:19Z) - CAReDiO: Cultural Alignment of LLM via Representativeness and Distinctiveness Guided Data Optimization [50.90288681622152]
Large Language Models (LLMs) are integrating more deeply into human life across various regions.
Existing approaches develop culturally aligned LLMs through fine-tuning with culture-specific corpora.
We introduce CAReDiO, a novel cultural data construction framework.
arXiv Detail & Related papers (2025-04-09T13:40:13Z) - SaudiCulture: A Benchmark for Evaluating Large Language Models Cultural Competence within Saudi Arabia [0.1499944454332829]
This research focuses on Saudi Arabia, a country characterized by diverse dialects and rich cultural traditions.
We introduce SaudiCulture, a novel benchmark designed to evaluate the cultural competence of Large Language Models (LLMs).
The dataset encompasses a broad spectrum of cultural domains, including food, clothing, entertainment, celebrations, and crafts.
arXiv Detail & Related papers (2025-03-21T18:55:10Z) - When Tom Eats Kimchi: Evaluating Cultural Bias of Multimodal Large Language Models in Cultural Mixture Contexts [15.78054683369659]
We introduce MixCuBe, a cross-cultural bias benchmark, and study elements from five countries and four ethnicities. Our findings reveal that MLLMs achieve both higher accuracy and lower sensitivity to such perturbations for high-resource cultures. GPT-4o, the best-performing model overall, shows up to 58% difference in accuracy between the original and perturbed cultural settings in low-resource cultures.
arXiv Detail & Related papers (2025-03-21T03:50:05Z) - GIMMICK -- Globally Inclusive Multimodal Multitask Cultural Knowledge Benchmarking [29.664707739055068]
We introduce GIMMICK, an extensive benchmark designed to assess a broad spectrum of cultural knowledge across 144 countries.
GIMMICK comprises six tasks built upon three new datasets that span 728 unique cultural events or facets.
We examine (1) regional cultural biases, (2) the influence of model size, (3) input modalities, and (4) external cues.
arXiv Detail & Related papers (2025-02-19T14:27:40Z) - CultureVLM: Characterizing and Improving Cultural Understanding of Vision-Language Models for over 100 Countries [63.00147630084146]
Vision-language models (VLMs) have advanced human-AI interaction but struggle with cultural understanding. CultureVerse is a large-scale multimodal benchmark covering 19,682 cultural concepts, 188 countries/regions, 15 cultural concepts, and 3 question types. We propose CultureVLM, a series of VLMs fine-tuned on our dataset to achieve significant performance improvement in cultural understanding.
arXiv Detail & Related papers (2025-01-02T14:42:37Z) - Self-Pluralising Culture Alignment for Large Language Models [36.689491885394034]
We propose CultureSPA, a framework that allows large language models to align to pluralistic cultures.
By comparing culture-aware/unaware outputs, we are able to detect and collect culture-related instances.
Extensive experiments demonstrate that CultureSPA significantly improves the alignment of LLMs to diverse cultures without compromising general abilities.
arXiv Detail & Related papers (2024-10-16T19:06:08Z) - Cultural Value Differences of LLMs: Prompt, Language, and Model Size [35.176429953825924]
Our study aims to identify behavior patterns in cultural values exhibited by large language models (LLMs).
The studied variants include question ordering, prompting language, and model size.
Our experiments reveal that query language and model size of LLM are the main factors resulting in cultural value differences.
arXiv Detail & Related papers (2024-06-17T12:35:33Z) - BLEnD: A Benchmark for LLMs on Everyday Knowledge in Diverse Cultures and Languages [39.17279399722437]
Large language models (LLMs) often lack culture-specific knowledge of daily life, especially across diverse regions and non-English languages.
We introduce BLEnD, a hand-crafted benchmark designed to evaluate LLMs' everyday knowledge across diverse cultures and languages.
We construct the benchmark to include two formats of questions: short-answer and multiple-choice.
arXiv Detail & Related papers (2024-06-14T11:48:54Z) - CulturePark: Boosting Cross-cultural Understanding in Large Language Models [63.452948673344395]
This paper introduces CulturePark, an LLM-powered multi-agent communication framework for cultural data collection.
It generates high-quality cross-cultural dialogues encapsulating human beliefs, norms, and customs.
We evaluate these models across three downstream tasks: content moderation, cultural alignment, and cultural education.
arXiv Detail & Related papers (2024-05-24T01:49:02Z) - Understanding the Capabilities and Limitations of Large Language Models for Cultural Commonsense [98.09670425244462]
Large language models (LLMs) have demonstrated substantial commonsense understanding.
This paper examines the capabilities and limitations of several state-of-the-art LLMs in the context of cultural commonsense tasks.
arXiv Detail & Related papers (2024-05-07T20:28:34Z) - Does Mapo Tofu Contain Coffee? Probing LLMs for Food-related Cultural Knowledge [47.57055368312541]
We introduce FmLAMA, a multilingual dataset centered on food-related cultural facts and variations in food practices.
We analyze LLMs across various architectures and configurations, evaluating their performance in both monolingual and multilingual settings.
arXiv Detail & Related papers (2024-04-10T08:49:27Z) - CulturalTeaming: AI-Assisted Interactive Red-Teaming for Challenging LLMs' (Lack of) Multicultural Knowledge [69.82940934994333]
We introduce CulturalTeaming, an interactive red-teaming system that leverages human-AI collaboration to build a challenging evaluation dataset.
Our study reveals that CulturalTeaming's various modes of AI assistance support annotators in creating cultural questions.
CULTURALBENCH-V0.1 is a compact yet high-quality evaluation dataset compiled from users' red-teaming attempts.
arXiv Detail & Related papers (2024-04-10T00:25:09Z) - Not All Countries Celebrate Thanksgiving: On the Cultural Dominance in Large Language Models [89.94270049334479]
This paper identifies a cultural dominance issue within large language models (LLMs).
LLMs often provide inappropriate English-culture-related answers that are not relevant to the expected culture when users ask in non-English languages.
arXiv Detail & Related papers (2023-10-19T05:38:23Z) - Large language models can replicate cross-cultural differences in personality [0.0]
We use a large-scale experiment to determine whether GPT-4 can replicate cross-cultural differences in the Big Five. We used the US and South Korea as the cultural pair.
arXiv Detail & Related papers (2023-10-12T11:17:23Z) - Cultural Alignment in Large Language Models: An Explanatory Analysis Based on Hofstede's Cultural Dimensions [10.415002561977655]
This research proposes a Cultural Alignment Test (Hofstede's CAT) to quantify cultural alignment using Hofstede's cultural dimension framework.
We quantitatively evaluate large language models (LLMs) against the cultural dimensions of regions like the United States, China, and Arab countries.
Our results quantify the cultural alignment of LLMs and reveal the difference between LLMs in explanatory cultural dimensions.
arXiv Detail & Related papers (2023-08-25T14:50:13Z)