CURE: Cultural Understanding and Reasoning Evaluation - A Framework for "Thick" Culture Alignment Evaluation in LLMs
- URL: http://arxiv.org/abs/2511.12014v1
- Date: Sat, 15 Nov 2025 03:39:13 GMT
- Title: CURE: Cultural Understanding and Reasoning Evaluation - A Framework for "Thick" Culture Alignment Evaluation in LLMs
- Authors: Truong Vo, Sanmi Koyejo
- Abstract summary: Large language models (LLMs) are increasingly deployed in culturally diverse environments. Existing methods focus on de-contextualized correctness or forced-choice judgments. We introduce a set of benchmarks that present models with realistic situational contexts.
- Score: 24.598338950728234
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) are increasingly deployed in culturally diverse environments, yet existing evaluations of cultural competence remain limited. These methods focus on de-contextualized correctness or forced-choice judgments, overlooking the cultural understanding and reasoning required for appropriate responses. To address this gap, we introduce a set of benchmarks that, instead of directly probing abstract norms or isolated statements, present models with realistic situational contexts that require culturally grounded reasoning. In addition to the standard Exact Match metric, we introduce four complementary metrics (Coverage, Specificity, Connotation, and Coherence) to capture different dimensions of a model's response quality. Empirical analysis across frontier models reveals that thin evaluation systematically overestimates cultural competence and produces unstable assessments with high variance. In contrast, thick evaluation exposes differences in reasoning depth, reduces variance, and provides more stable, interpretable signals of cultural understanding.
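The abstract names five metrics (Exact Match plus Coverage, Specificity, Connotation, and Coherence) but does not define their implementations here. The sketch below is a hypothetical illustration of how a response might be scored along such dimensions; the heuristic formulas, function names, and placeholder values are assumptions for illustration only, not the paper's actual scoring procedure, which would likely rely on rubric-based or LLM-judge grading.

```python
from dataclasses import dataclass

@dataclass
class ThickScores:
    """One score per dimension named in the abstract (all in [0, 1])."""
    exact_match: float
    coverage: float
    specificity: float
    connotation: float
    coherence: float

def score_response(response: str, reference: str, key_points: list[str]) -> ThickScores:
    """Toy scorer: simple string heuristics stand in for the paper's metrics.

    Connotation and coherence are left as placeholders because they would
    require pragmatic and discourse-level judgment (e.g. an LLM judge).
    """
    resp = response.lower()
    hits = sum(1 for kp in key_points if kp.lower() in resp)
    return ThickScores(
        exact_match=1.0 if resp.strip() == reference.lower().strip() else 0.0,
        coverage=hits / len(key_points) if key_points else 0.0,
        specificity=min(1.0, len(resp.split()) / 50),  # crude proxy: detail grows with length
        connotation=1.0,  # placeholder: needs sentiment/pragmatics analysis
        coherence=1.0,    # placeholder: needs discourse-level judging
    )

scores = score_response(
    "Remove your shoes before entering the home, as is customary.",
    "remove shoes before entering",
    ["shoes", "entering"],
)
print(scores.coverage)     # both key points present -> 1.0
print(scores.exact_match)  # paraphrase, not verbatim -> 0.0
```

The point the sketch makes is the abstract's own: a response can fail Exact Match (thin evaluation) while fully covering the culturally relevant content, which is exactly the signal the complementary metrics are meant to recover.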
Related papers
- Do Large Language Models Truly Understand Cross-cultural Differences? [53.481048019144644]
We develop a scenario-based benchmark to evaluate large language models' cross-cultural understanding and reasoning. Grounded in cultural theory, we categorize cross-cultural capabilities into nine dimensions. The dataset supports continuous expansion, and experiments confirm its transferability to other languages.
arXiv Detail & Related papers (2025-12-08T01:21:58Z) - Hire Your Anthropologist! Rethinking Culture Benchmarks Through an Anthropological Lens [9.000522371422628]
We introduce a four-part framework that categorizes how benchmarks frame culture. We qualitatively examine 20 cultural benchmarks and identify six recurring methodological issues. Our aim is to guide the development of cultural benchmarks that go beyond static recall tasks.
arXiv Detail & Related papers (2025-10-07T13:42:44Z) - 'Too much alignment; not enough culture': Re-balancing cultural alignment practices in LLMs [0.0]
This paper argues for a shift towards integrating qualitative approaches into AI alignment practices. Drawing inspiration from Clifford Geertz's concept of "thick description," we propose that AI systems must produce outputs that reflect deeper cultural meanings.
arXiv Detail & Related papers (2025-09-30T12:22:53Z) - CultureScope: A Dimensional Lens for Probing Cultural Understanding in LLMs [57.653830744706305]
CultureScope is the most comprehensive evaluation framework to date for assessing cultural understanding in large language models. Inspired by the cultural iceberg theory, we design a novel dimensional schema for cultural knowledge classification. Experimental results demonstrate that our method can effectively evaluate cultural understanding.
arXiv Detail & Related papers (2025-09-19T17:47:48Z) - Culture is Everywhere: A Call for Intentionally Cultural Evaluation [36.20861746863831]
We argue for intentionally cultural evaluation: an approach that systematically examines the cultural assumptions embedded in all aspects of evaluation. We discuss implications and future directions for moving beyond current benchmarking practices.
arXiv Detail & Related papers (2025-09-01T09:39:21Z) - MyCulture: Exploring Malaysia's Diverse Culture under Low-Resource Language Constraints [7.822567458977689]
MyCulture is a benchmark designed to comprehensively evaluate Large Language Models (LLMs) on Malaysian culture. Unlike conventional benchmarks, MyCulture employs a novel open-ended multiple-choice question format without predefined options. We analyze structural bias by comparing model performance on structured versus free-form outputs, and assess language bias through multilingual prompt variations.
arXiv Detail & Related papers (2025-08-07T14:17:43Z) - From Word to World: Evaluate and Mitigate Culture Bias in LLMs via Word Association Test [50.51344198689069]
We extend the human-centered word association test (WAT) to assess the alignment of large language models with cross-cultural cognition. To address culture preference, we propose CultureSteer, an innovative approach that embeds culture-specific semantic associations directly within the model's internal representation space.
arXiv Detail & Related papers (2025-05-24T07:05:10Z) - Cultural Learning-Based Culture Adaptation of Language Models [70.1063219524999]
Adapting large language models (LLMs) to diverse cultural values is a challenging task. We present CLCA, a novel framework for enhancing LLM alignment with cultural values based on cultural learning.
arXiv Detail & Related papers (2025-04-03T18:16:26Z) - Randomness, Not Representation: The Unreliability of Evaluating Cultural Alignment in LLMs [7.802103248428407]
We identify and test three assumptions behind current survey-based evaluation methods. We find a high level of instability across presentation formats, incoherence between evaluated versus held-out cultural dimensions, and erratic behavior under prompt steering.
arXiv Detail & Related papers (2025-03-11T17:59:53Z) - CROPE: Evaluating In-Context Adaptation of Vision and Language Models to Culture-Specific Concepts [45.77570690529597]
We introduce CROPE, a visual question answering benchmark designed to probe the knowledge of culture-specific concepts. Our evaluation of several state-of-the-art open Vision and Language models shows large performance disparities between culture-specific and common concepts. Experiments with contextual knowledge indicate that models struggle to effectively utilize multimodal information and bind culture-specific concepts to their depictions.
arXiv Detail & Related papers (2024-10-20T17:31:19Z) - Extrinsic Evaluation of Cultural Competence in Large Language Models [53.626808086522985]
We focus on extrinsic evaluation of cultural competence in two text generation tasks.
We evaluate model outputs when an explicit cue of culture, specifically nationality, is perturbed in the prompts.
We find weak correlations between text similarity of outputs for different countries and the cultural values of these countries.
arXiv Detail & Related papers (2024-06-17T14:03:27Z)
This list is automatically generated from the titles and abstracts of the papers in this site.