Related papers: Randomness, Not Representation: The Unreliability of Evaluating Cultural Alignment in LLMs

Randomness, Not Representation: The Unreliability of Evaluating Cultural Alignment in LLMs

URL: http://arxiv.org/abs/2503.08688v2
Date: Tue, 08 Apr 2025 21:11:19 GMT
Title: Randomness, Not Representation: The Unreliability of Evaluating Cultural Alignment in LLMs
Authors: Ariba Khan, Stephen Casper, Dylan Hadfield-Menell,
Abstract summary: We identify and test three assumptions behind current survey-based evaluation methods.<n>We find a high level of instability across presentation formats, incoherence between evaluated versus held-out cultural dimensions, and erratic behavior under prompt steering.
Score: 7.802103248428407
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Research on the 'cultural alignment' of Large Language Models (LLMs) has emerged in response to growing interest in understanding representation across diverse stakeholders. Current approaches to evaluating cultural alignment through survey-based assessments that borrow from social science methodologies often overlook systematic robustness checks. Here, we identify and test three assumptions behind current survey-based evaluation methods: (1) Stability: that cultural alignment is a property of LLMs rather than an artifact of evaluation design, (2) Extrapolability: that alignment with one culture on a narrow set of issues predicts alignment with that culture on others, and (3) Steerability: that LLMs can be reliably prompted to represent specific cultural perspectives. Through experiments examining both explicit and implicit preferences of leading LLMs, we find a high level of instability across presentation formats, incoherence between evaluated versus held-out cultural dimensions, and erratic behavior under prompt steering. We show that these inconsistencies can cause the results of an evaluation to be very sensitive to minor variations in methodology. Finally, we demonstrate in a case study on evaluation design that narrow experiments and a selective assessment of evidence can be used to paint an incomplete picture of LLMs' cultural alignment properties. Overall, these results highlight significant limitations of current survey-based approaches to evaluating the cultural alignment of LLMs and highlight a need for systematic robustness checks and red-teaming for evaluation results. Data and code are available at https://huggingface.co/datasets/akhan02/cultural-dimension-cover-letters and https://github.com/ariba-k/llm-cultural-alignment-evaluation, respectively.

Related papers

Do Large Language Models Understand Morality Across Cultures? [0.5356944479760104]
This study investigates the extent to which large language models capture cross-cultural differences and similarities in moral perspectives.<n>Our results reveal that current LLMs often fail to reproduce the full spectrum of cross-cultural moral variation.<n>These findings highlight a pressing need for more robust approaches to mitigate biases and improve cultural representativeness in LLMs.
arXiv Detail & Related papers (2025-07-28T20:25:36Z)
MCEval: A Dynamic Framework for Fair Multilingual Cultural Evaluation of LLMs [25.128936333806678]
Large language models exhibit cultural biases and limited cross-cultural understanding capabilities.<n>We propose MCEval, a novel multilingual evaluation framework that employs dynamic cultural question construction.
arXiv Detail & Related papers (2025-07-13T16:24:35Z)
CAIRe: Cultural Attribution of Images by Retrieval-Augmented Evaluation [61.130639734982395]
We introduce CAIRe, a novel evaluation metric that assesses the degree of cultural relevance of an image.<n>Our framework grounds entities and concepts in the image to a knowledge base and uses factual information to give independent graded judgments for each culture label.
arXiv Detail & Related papers (2025-06-10T17:16:23Z)
Disentangling Language and Culture for Evaluating Multilingual Large Language Models [48.06219053598005]
This paper introduces a Dual Evaluation Framework to comprehensively assess the multilingual capabilities of LLMs.<n>By decomposing the evaluation along the dimensions of linguistic medium and cultural context, this framework enables a nuanced analysis of LLMs' ability to process questions cross-lingually.
arXiv Detail & Related papers (2025-05-30T14:25:45Z)
From Word to World: Evaluate and Mitigate Culture Bias via Word Association Test [48.623761108859085]
We extend the human-centered word association test (WAT) to assess the alignment of large language models with cross-cultural cognition.<n>To mitigate the culture preference, we propose CultureSteer, an innovative approach that integrates a culture-aware steering mechanism.
arXiv Detail & Related papers (2025-05-24T07:05:10Z)
From Surveys to Narratives: Rethinking Cultural Value Adaptation in LLMs [57.43233760384488]
Adapting cultural values in Large Language Models (LLMs) presents significant challenges.<n>Prior work primarily aligns LLMs with different cultural values using World Values Survey (WVS) data.<n>In this paper, we investigate WVS-based training for cultural value adaptation and find that relying solely on survey data cane cultural norms and interfere with factual knowledge.
arXiv Detail & Related papers (2025-05-22T09:00:01Z)
Cultural Learning-Based Culture Adaptation of Language Models [70.1063219524999]
Adapting large language models (LLMs) to diverse cultural values is a challenging task. We present CLCA, a novel framework for enhancing LLM alignment with cultural values based on cultural learning.
arXiv Detail & Related papers (2025-04-03T18:16:26Z)
Break the Checkbox: Challenging Closed-Style Evaluations of Cultural Alignment in LLMs [17.673012459377375]
A large number of studies rely on closed-style multiple-choice surveys to evaluate cultural alignment in Large Language Models (LLMs)<n>In this work, we challenge this constrained evaluation paradigm and explore more realistic, unconstrained approaches.
arXiv Detail & Related papers (2025-02-12T01:04:13Z)
Value Compass Leaderboard: A Platform for Fundamental and Validated Evaluation of LLMs Values [76.70893269183684]
Large Language Models (LLMs) achieve remarkable breakthroughs, aligning their values with humans has become imperative.<n>Existing evaluations focus narrowly on safety risks such as bias and toxicity.<n>Existing benchmarks are prone to data contamination.<n>The pluralistic nature of human values across individuals and cultures is largely ignored in measuring LLMs value alignment.
arXiv Detail & Related papers (2025-01-13T05:53:56Z)
ValuesRAG: Enhancing Cultural Alignment Through Retrieval-Augmented Contextual Learning [1.1343849658875087]
We propose ValuesRAG to integrate cultural and demographic knowledge dynamically during text generation. ValuesRAG consistently outperforms baseline methods, both in the main experiment and in the ablation study. Notably, ValuesRAG demonstrates an accuracy of 21% improvement over other baseline methods.
arXiv Detail & Related papers (2025-01-02T03:26:13Z)
Global MMLU: Understanding and Addressing Cultural and Linguistic Biases in Multilingual Evaluation [71.59208664920452]
Cultural biases in multilingual datasets pose significant challenges for their effectiveness as global benchmarks.<n>We show that progress on MMLU depends heavily on learning Western-centric concepts, with 28% of all questions requiring culturally sensitive knowledge.<n>We release Global MMLU, an improved MMLU with evaluation coverage across 42 languages.
arXiv Detail & Related papers (2024-12-04T13:27:09Z)
LLMs as mirrors of societal moral standards: reflection of cultural divergence and agreement across ethical topics [0.5852077003870417]
Large language models (LLMs) have become increasingly pivotal in various domains due to the recent advancements in their performance capabilities.<n>This study investigates whether LLMs accurately reflect cross-cultural variations and similarities in moral perspectives.
arXiv Detail & Related papers (2024-12-01T20:39:42Z)
LLM-GLOBE: A Benchmark Evaluating the Cultural Values Embedded in LLM Output [8.435090588116973]
We propose the LLM-GLOBE benchmark for evaluating the cultural value systems of LLMs. We then leverage the benchmark to compare the values of Chinese and US LLMs. Our methodology includes a novel "LLMs-as-a-Jury" pipeline which automates the evaluation of open-ended content.
arXiv Detail & Related papers (2024-11-09T01:38:55Z)
Navigating the Cultural Kaleidoscope: A Hitchhiker's Guide to Sensitivity in Large Language Models [4.771099208181585]
LLMs are increasingly deployed in global applications, ensuring users from diverse backgrounds feel respected and understood.<n>Cultural harm can arise when these models fail to align with specific cultural norms, resulting in misrepresentations or violations of cultural values.<n>We present two key contributions: A cultural harm test dataset, created to assess model outputs across different cultural contexts through scenarios that expose potential cultural insensitivities, and a culturally aligned preference dataset, aimed at restoring cultural sensitivity through fine-tuning based on feedback from diverse annotators.
arXiv Detail & Related papers (2024-10-15T18:13:10Z)
Extrinsic Evaluation of Cultural Competence in Large Language Models [53.626808086522985]
We focus on extrinsic evaluation of cultural competence in two text generation tasks. We evaluate model outputs when an explicit cue of culture, specifically nationality, is perturbed in the prompts. We find weak correlations between text similarity of outputs for different countries and the cultural values of these countries.
arXiv Detail & Related papers (2024-06-17T14:03:27Z)
Understanding the Capabilities and Limitations of Large Language Models for Cultural Commonsense [98.09670425244462]
Large language models (LLMs) have demonstrated substantial commonsense understanding. This paper examines the capabilities and limitations of several state-of-the-art LLMs in the context of cultural commonsense tasks.
arXiv Detail & Related papers (2024-05-07T20:28:34Z)
CulturalTeaming: AI-Assisted Interactive Red-Teaming for Challenging LLMs' (Lack of) Multicultural Knowledge [69.82940934994333]
We introduce CulturalTeaming, an interactive red-teaming system that leverages human-AI collaboration to build challenging evaluation dataset. Our study reveals that CulturalTeaming's various modes of AI assistance support annotators in creating cultural questions. CULTURALBENCH-V0.1 is a compact yet high-quality evaluation dataset with users' red-teaming attempts.
arXiv Detail & Related papers (2024-04-10T00:25:09Z)
Bias and Fairness in Large Language Models: A Survey [73.87651986156006]
We present a comprehensive survey of bias evaluation and mitigation techniques for large language models (LLMs) We first consolidate, formalize, and expand notions of social bias and fairness in natural language processing. We then unify the literature by proposing three intuitive, two for bias evaluation, and one for mitigation.
arXiv Detail & Related papers (2023-09-02T00:32:55Z)

This list is automatically generated from the titles and abstracts of the papers in this site.