Related papers: Break the Checkbox: Challenging Closed-Style Evaluations of Cultural Alignment in LLMs

Related papers

Do Large Language Models Truly Understand Cross-cultural Differences? [53.481048019144644]
We develop a scenario-based benchmark to evaluate large language models' cross-cultural understanding and reasoning.<n>Grounded in cultural theory, we categorize cross-cultural capabilities into nine dimensions.<n>The dataset supports continuous expansion, and experiments confirm its transferability to other languages.
arXiv Detail & Related papers (2025-12-08T01:21:58Z)
CURE: Cultural Understanding and Reasoning Evaluation - A Framework for "Thick" Culture Alignment Evaluation in LLMs [24.598338950728234]
Large language models (LLMs) are increasingly deployed in culturally diverse environments.<n>Existing methods focus on de-contextualized correctness or forced-choice judgments.<n>We introduce a set of benchmarks that present models with realistic situational contexts.
arXiv Detail & Related papers (2025-11-15T03:39:13Z)
LLMs and Cultural Values: the Impact of Prompt Language and Explicit Cultural Framing [0.21485350418225244]
Large Language Models (LLMs) are rapidly being adopted by users across the globe, who interact with them in a diverse range of languages.<n>We examine how prompt language and cultural framing influence model responses and their alignment with human values in different countries.
arXiv Detail & Related papers (2025-11-06T02:09:29Z)
Cross-Cultural Transfer of Commonsense Reasoning in LLMs: Evidence from the Arab World [68.19795061447044]
This paper investigates cross-cultural transfer of commonsense reasoning in the Arab world.<n>Using a culturally grounded commonsense reasoning dataset covering 13 Arab countries, we evaluate lightweight alignment methods.<n>Our results show that merely 12 culture-specific examples from one country can improve performance in others by 10% on average.
arXiv Detail & Related papers (2025-09-23T17:24:14Z)
CultureScope: A Dimensional Lens for Probing Cultural Understanding in LLMs [57.653830744706305]
CultureScope is the most comprehensive evaluation framework to date for assessing cultural understanding in large language models.<n>Inspired by the cultural iceberg theory, we design a novel dimensional schema for cultural knowledge classification.<n> Experimental results demonstrate that our method can effectively evaluate cultural understanding.
arXiv Detail & Related papers (2025-09-19T17:47:48Z)
Culture is Everywhere: A Call for Intentionally Cultural Evaluation [36.20861746863831]
We argue for textbfintentionally cultural evaluation: an approach that systematically examines the cultural assumptions embedded in all aspects of evaluation.<n>We discuss implications and future directions for moving beyond current benchmarking practices.
arXiv Detail & Related papers (2025-09-01T09:39:21Z)
MyCulture: Exploring Malaysia's Diverse Culture under Low-Resource Language Constraints [7.822567458977689]
MyCulture is a benchmark designed to comprehensively evaluate Large Language Models (LLMs) on Malaysian culture.<n>Unlike conventional benchmarks, MyCulture employs a novel open-ended multiple-choice question format without predefined options.<n>We analyze structural bias by comparing model performance on structured versus free-form outputs, and assess language bias through multilingual prompt variations.
arXiv Detail & Related papers (2025-08-07T14:17:43Z)
MCEval: A Dynamic Framework for Fair Multilingual Cultural Evaluation of LLMs [25.128936333806678]
Large language models exhibit cultural biases and limited cross-cultural understanding capabilities.<n>We propose MCEval, a novel multilingual evaluation framework that employs dynamic cultural question construction.
arXiv Detail & Related papers (2025-07-13T16:24:35Z)
From Word to World: Evaluate and Mitigate Culture Bias via Word Association Test [48.623761108859085]
We extend the human-centered word association test (WAT) to assess the alignment of large language models with cross-cultural cognition.<n>To mitigate the culture preference, we propose CultureSteer, an innovative approach that integrates a culture-aware steering mechanism.
arXiv Detail & Related papers (2025-05-24T07:05:10Z)
From Surveys to Narratives: Rethinking Cultural Value Adaptation in LLMs [57.43233760384488]
Adapting cultural values in Large Language Models (LLMs) presents significant challenges.<n>Prior work primarily aligns LLMs with different cultural values using World Values Survey (WVS) data.<n>In this paper, we investigate WVS-based training for cultural value adaptation and find that relying solely on survey data cane cultural norms and interfere with factual knowledge.
arXiv Detail & Related papers (2025-05-22T09:00:01Z)
An Evaluation of Cultural Value Alignment in LLM [27.437888319382893]
We conduct the first large-scale evaluation of LLM culture assessing 20 countries' cultures and languages across ten LLMs. Our findings show that the output over all models represents a moderate cultural middle ground. Deeper investigation sheds light on the influence of model origin, prompt language, and value dimensions on cultural output.
arXiv Detail & Related papers (2025-04-11T09:13:19Z)
CAReDiO: Cultural Alignment of LLM via Representativeness and Distinctiveness Guided Data Optimization [50.90288681622152]
Large Language Models (LLMs) more deeply integrate into human life across various regions. Existing approaches develop culturally aligned LLMs through fine-tuning with culture-specific corpora. We introduce CAReDiO, a novel cultural data construction framework.
arXiv Detail & Related papers (2025-04-09T13:40:13Z)
Cultural Learning-Based Culture Adaptation of Language Models [70.1063219524999]
Adapting large language models (LLMs) to diverse cultural values is a challenging task. We present CLCA, a novel framework for enhancing LLM alignment with cultural values based on cultural learning.
arXiv Detail & Related papers (2025-04-03T18:16:26Z)
Randomness, Not Representation: The Unreliability of Evaluating Cultural Alignment in LLMs [7.802103248428407]
We identify and test three assumptions behind current survey-based evaluation methods. We find a high level of instability across presentation formats, incoherence between evaluated versus held-out cultural dimensions, and erratic behavior under prompt steering.
arXiv Detail & Related papers (2025-03-11T17:59:53Z)
Rethinking AI Cultural Evaluation [1.8434042562191815]
Current evaluation methods predominantly rely on multiple-choice question (MCQ) datasets.<n>Our findings highlight significant discrepancies between MCQ-based assessments and the values conveyed in unconstrained interactions.<n>We recommend moving beyond MCQs to adopt more open-ended, context-specific assessments.
arXiv Detail & Related papers (2025-01-13T23:42:37Z)
Global MMLU: Understanding and Addressing Cultural and Linguistic Biases in Multilingual Evaluation [71.59208664920452]
Cultural biases in multilingual datasets pose significant challenges for their effectiveness as global benchmarks. We show that progress on MMLU depends heavily on learning Western-centric concepts, with 28% of all questions requiring culturally sensitive knowledge. We release Global MMLU, an improved MMLU with evaluation coverage across 42 languages.
arXiv Detail & Related papers (2024-12-04T13:27:09Z)
LLM-GLOBE: A Benchmark Evaluating the Cultural Values Embedded in LLM Output [8.435090588116973]
We propose the LLM-GLOBE benchmark for evaluating the cultural value systems of LLMs. We then leverage the benchmark to compare the values of Chinese and US LLMs. Our methodology includes a novel "LLMs-as-a-Jury" pipeline which automates the evaluation of open-ended content.
arXiv Detail & Related papers (2024-11-09T01:38:55Z)
DnA-Eval: Enhancing Large Language Model Evaluation through Decomposition and Aggregation [75.81096662788254]
Large Language Models (LLMs) are scalable and economical evaluators.<n>The question of how reliable these evaluators are has emerged as a crucial research question.<n>We propose Decompose and Aggregate, which breaks down the evaluation process into different stages based on pedagogical practices.
arXiv Detail & Related papers (2024-05-24T08:12:30Z)
Understanding the Capabilities and Limitations of Large Language Models for Cultural Commonsense [98.09670425244462]
Large language models (LLMs) have demonstrated substantial commonsense understanding. This paper examines the capabilities and limitations of several state-of-the-art LLMs in the context of cultural commonsense tasks.
arXiv Detail & Related papers (2024-05-07T20:28:34Z)
CDEval: A Benchmark for Measuring the Cultural Dimensions of Large Language Models [41.885600036131045]
CDEval is a benchmark aimed at evaluating the cultural dimensions of Large Language Models. It is constructed by incorporating both GPT-4's automated generation and human verification, covering six cultural dimensions across seven domains.
arXiv Detail & Related papers (2023-11-28T02:01:25Z)
Not All Countries Celebrate Thanksgiving: On the Cultural Dominance in Large Language Models [89.94270049334479]
This paper identifies a cultural dominance issue within large language models (LLMs) LLMs often provide inappropriate English-culture-related answers that are not relevant to the expected culture when users ask in non-English languages.
arXiv Detail & Related papers (2023-10-19T05:38:23Z)
Cultural Alignment in Large Language Models: An Explanatory Analysis Based on Hofstede's Cultural Dimensions [10.415002561977655]
This research proposes a Cultural Alignment Test (Hoftede's CAT) to quantify cultural alignment using Hofstede's cultural dimension framework. We quantitatively evaluate large language models (LLMs) against the cultural dimensions of regions like the United States, China, and Arab countries. Our results quantify the cultural alignment of LLMs and reveal the difference between LLMs in explanatory cultural dimensions.
arXiv Detail & Related papers (2023-08-25T14:50:13Z)

This list is automatically generated from the titles and abstracts of the papers in this site.