Related papers: Culture is Everywhere: A Call for Intentionally Cultural Evaluation

Culture is Everywhere: A Call for Intentionally Cultural Evaluation

URL: http://arxiv.org/abs/2509.01301v2
Date: Wed, 24 Sep 2025 14:02:11 GMT
Title: Culture is Everywhere: A Call for Intentionally Cultural Evaluation
Authors: Juhyun Oh, Inha Cha, Michael Saxon, Hyunseung Lim, Shaily Bhatt, Alice Oh,
Abstract summary: We argue for textbfintentionally cultural evaluation: an approach that systematically examines the cultural assumptions embedded in all aspects of evaluation.<n>We discuss implications and future directions for moving beyond current benchmarking practices.
Score: 36.20861746863831
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The prevailing ``trivia-centered paradigm'' for evaluating the cultural alignment of large language models (LLMs) is increasingly inadequate as these models become more advanced and widely deployed. Existing approaches typically reduce culture to static facts or values, testing models via multiple-choice or short-answer questions that treat culture as isolated trivia. Such methods neglect the pluralistic and interactive realities of culture, and overlook how cultural assumptions permeate even ostensibly ``neutral'' evaluation settings. In this position paper, we argue for \textbf{intentionally cultural evaluation}: an approach that systematically examines the cultural assumptions embedded in all aspects of evaluation, not just in explicitly cultural tasks. We systematically characterize the what, how, and circumstances by which culturally contingent considerations arise in evaluation, and emphasize the importance of researcher positionality for fostering inclusive, culturally aligned NLP research. Finally, we discuss implications and future directions for moving beyond current benchmarking practices, discovering important applications that we don't know exist, and involving communities in evaluation design through HCI-inspired participatory methodologies.

Related papers

CURE: Cultural Understanding and Reasoning Evaluation - A Framework for "Thick" Culture Alignment Evaluation in LLMs [24.598338950728234]
Large language models (LLMs) are increasingly deployed in culturally diverse environments.<n>Existing methods focus on de-contextualized correctness or forced-choice judgments.<n>We introduce a set of benchmarks that present models with realistic situational contexts.
arXiv Detail & Related papers (2025-11-15T03:39:13Z)
Hire Your Anthropologist! Rethinking Culture Benchmarks Through an Anthropological Lens [9.000522371422628]
We introduce a four-part framework that categorizes how benchmarks frame culture.<n>We qualitatively examine 20 cultural benchmarks and identify six recurring methodological issues.<n>Our aim is to guide the development of cultural benchmarks that go beyond static recall tasks.
arXiv Detail & Related papers (2025-10-07T13:42:44Z)
'Too much alignment; not enough culture': Re-balancing cultural alignment practices in LLMs [0.0]
This paper argues for a shift towards integrating qualitative approaches into AI alignment practices.<n> Drawing inspiration from Clifford Geertz's concept of "thick description," we propose that AI systems must produce outputs that reflect deeper cultural meanings.
arXiv Detail & Related papers (2025-09-30T12:22:53Z)
CultureScope: A Dimensional Lens for Probing Cultural Understanding in LLMs [57.653830744706305]
CultureScope is the most comprehensive evaluation framework to date for assessing cultural understanding in large language models.<n>Inspired by the cultural iceberg theory, we design a novel dimensional schema for cultural knowledge classification.<n> Experimental results demonstrate that our method can effectively evaluate cultural understanding.
arXiv Detail & Related papers (2025-09-19T17:47:48Z)
MCEval: A Dynamic Framework for Fair Multilingual Cultural Evaluation of LLMs [25.128936333806678]
Large language models exhibit cultural biases and limited cross-cultural understanding capabilities.<n>We propose MCEval, a novel multilingual evaluation framework that employs dynamic cultural question construction.
arXiv Detail & Related papers (2025-07-13T16:24:35Z)
CulFiT: A Fine-grained Cultural-aware LLM Training Paradigm via Multilingual Critique Data Synthesis [41.261808170896686]
CulFiT is a novel training paradigm that leverages multilingual data and fine-grained reward modeling to enhance cultural sensitivity and inclusivity.<n>Our approach synthesizes diverse cultural-related questions, constructs critique data in culturally relevant languages, and employs fine-grained rewards to decompose cultural texts into verifiable knowledge units.
arXiv Detail & Related papers (2025-05-26T04:08:26Z)
From Word to World: Evaluate and Mitigate Culture Bias via Word Association Test [48.623761108859085]
We extend the human-centered word association test (WAT) to assess the alignment of large language models with cross-cultural cognition.<n>To mitigate the culture preference, we propose CultureSteer, an innovative approach that integrates a culture-aware steering mechanism.
arXiv Detail & Related papers (2025-05-24T07:05:10Z)
From Surveys to Narratives: Rethinking Cultural Value Adaptation in LLMs [57.43233760384488]
Adapting cultural values in Large Language Models (LLMs) presents significant challenges.<n>Prior work primarily aligns LLMs with different cultural values using World Values Survey (WVS) data.<n>In this paper, we investigate WVS-based training for cultural value adaptation and find that relying solely on survey data cane cultural norms and interfere with factual knowledge.
arXiv Detail & Related papers (2025-05-22T09:00:01Z)
Cultural Learning-Based Culture Adaptation of Language Models [70.1063219524999]
Adapting large language models (LLMs) to diverse cultural values is a challenging task.<n>We present CLCA, a novel framework for enhancing LLM alignment with cultural values based on cultural learning.
arXiv Detail & Related papers (2025-04-03T18:16:26Z)
Randomness, Not Representation: The Unreliability of Evaluating Cultural Alignment in LLMs [7.802103248428407]
We identify and test three assumptions behind current survey-based evaluation methods.<n>We find a high level of instability across presentation formats, incoherence between evaluated versus held-out cultural dimensions, and erratic behavior under prompt steering.
arXiv Detail & Related papers (2025-03-11T17:59:53Z)
Extrinsic Evaluation of Cultural Competence in Large Language Models [53.626808086522985]
We focus on extrinsic evaluation of cultural competence in two text generation tasks. We evaluate model outputs when an explicit cue of culture, specifically nationality, is perturbed in the prompts. We find weak correlations between text similarity of outputs for different countries and the cultural values of these countries.
arXiv Detail & Related papers (2024-06-17T14:03:27Z)

This list is automatically generated from the titles and abstracts of the papers in this site.