CCD-Bench: Probing Cultural Conflict in Large Language Model Decision-Making
- URL: http://arxiv.org/abs/2510.03553v1
- Date: Fri, 03 Oct 2025 22:55:37 GMT
- Title: CCD-Bench: Probing Cultural Conflict in Large Language Model Decision-Making
- Authors: Hasibur Rahman, Hanan Salam
- Abstract summary: Whether large language models can navigate explicit conflicts between legitimately different cultural value systems remains largely unexamined. CCD-Bench is a benchmark that assesses decision-making under cross-cultural value conflict, shifting evaluation beyond isolated bias detection toward pluralistic decision-making.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Although large language models (LLMs) are increasingly implicated in interpersonal and societal decision-making, their ability to navigate explicit conflicts between legitimately different cultural value systems remains largely unexamined. Existing benchmarks predominantly target cultural knowledge (CulturalBench), value prediction (WorldValuesBench), or single-axis bias diagnostics (CDEval); none evaluate how LLMs adjudicate when multiple culturally grounded values directly clash. We address this gap with CCD-Bench, a benchmark that assesses LLM decision-making under cross-cultural value conflict. CCD-Bench comprises 2,182 open-ended dilemmas spanning seven domains, each paired with ten anonymized response options corresponding to the ten GLOBE cultural clusters. These dilemmas are presented using a stratified Latin square to mitigate ordering effects. We evaluate 17 non-reasoning LLMs. Models disproportionately prefer Nordic Europe (mean 20.2 percent) and Germanic Europe (12.4 percent), while options for Eastern Europe and the Middle East and North Africa are underrepresented (5.6 to 5.8 percent). Although 87.9 percent of rationales reference multiple GLOBE dimensions, this pluralism is superficial: models recombine Future Orientation and Performance Orientation, and rarely ground choices in Assertiveness or Gender Egalitarianism (both under 3 percent). Ordering effects are negligible (Cramer's V less than 0.10), and symmetrized KL divergence shows clustering by developer lineage rather than geography. These patterns suggest that current alignment pipelines promote a consensus-oriented worldview that underserves scenarios demanding power negotiation, rights-based reasoning, or gender-aware analysis. CCD-Bench shifts evaluation beyond isolated bias detection toward pluralistic decision making and highlights the need for alignment strategies that substantively engage diverse worldviews.
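The abstract names three analysis ingredients in passing: a Latin-square rotation of the ten anonymized options, Cramer's V for ordering effects, and symmetrized KL divergence between models' cluster-choice distributions. The sketch below illustrates all three in Python; it is a minimal illustration with invented data, and the cyclic square construction, smoothing constant, and example distributions are assumptions for demonstration, not the authors' released code or exact protocol.

```python
# Illustrative sketch of the quantities named in the CCD-Bench abstract.
# All data here is hypothetical, not taken from the paper.
import numpy as np
from scipy.stats import chi2_contingency

# A cyclic 10x10 Latin square: each of the ten anonymized options appears
# exactly once in every presentation position. (The paper uses a stratified
# variant; this is just the textbook construction.)
latin_square = [[(row + col) % 10 for col in range(10)] for row in range(10)]

def cramers_v(table: np.ndarray) -> float:
    """Cramer's V for a contingency table (presentation position x chosen cluster)."""
    chi2 = chi2_contingency(table)[0]
    n = table.sum()
    r, k = table.shape
    return float(np.sqrt(chi2 / (n * (min(r, k) - 1))))

def symmetrized_kl(p: np.ndarray, q: np.ndarray, eps: float = 1e-12) -> float:
    """Jeffreys divergence KL(p||q) + KL(q||p), eps-smoothed to avoid log(0)."""
    p = (p + eps) / (p + eps).sum()
    q = (q + eps) / (q + eps).sum()
    return float(np.sum(p * np.log(p / q)) + np.sum(q * np.log(q / p)))

# Hypothetical choice distributions over the ten GLOBE clusters for two models
# (index 0 = Nordic Europe, 1 = Germanic Europe, ... purely for illustration).
model_a = np.array([0.202, 0.124, 0.105, 0.095, 0.09, 0.09, 0.09, 0.09, 0.058, 0.056])
model_b = np.array([0.150, 0.150, 0.100, 0.100, 0.10, 0.10, 0.10, 0.08, 0.060, 0.060])
print(f"symmetrized KL(A, B) = {symmetrized_kl(model_a, model_b):.4f}")

# Hypothetical 10x10 contingency table: if option order barely matters, choices
# are near-uniform across positions and Cramer's V stays small.
rng = np.random.default_rng(seed=0)
table = rng.multinomial(2182 // 10, np.full(10, 0.1), size=10)
print(f"Cramer's V = {cramers_v(table):.3f}")
```

On data generated this way, Cramer's V lands well under the 0.10 threshold the paper reports for ordering effects, and computing pairwise symmetrized KL over all 17 models would yield the divergence matrix used to observe clustering by developer lineage.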
Related papers
- Clause-Internal or Clause-External? Testing Turkish Reflexive Binding in Adapted versus Chain of Thought Large Language Models
This study evaluates whether state-of-the-art large language models capture the binding relations of Turkish reflexive pronouns. We construct a balanced set of 100 sentences that pit local against non-local antecedents for the reflexives kendi and kendisi. We test two contrasting systems: an OpenAI chain-of-thought model designed for multi-step reasoning and Trendyol-LLM-7B-base-v0.1, a LLaMA-2-derived model extensively fine-tuned on Turkish data.
arXiv Detail & Related papers (2026-01-30T23:00:04Z) - Mitigating Cultural Bias in LLMs via Multi-Agent Cultural Debate
Large language models (LLMs) exhibit systematic Western-centric bias, yet whether prompting in non-Western languages can mitigate this remains understudied. We introduce CEBiasBench, a Chinese-English bilingual benchmark, and Multi-Agent Vote (MAV), which enables explicit "no bias" judgments. Using this framework, we find that Chinese prompting merely shifts bias toward East Asian perspectives rather than eliminating it.
arXiv Detail & Related papers (2026-01-17T16:00:34Z) - I Am Aligned, But With Whom? MENA Values Benchmark for Evaluating Cultural Alignment and Multilingual Bias in LLMs
We introduce MENAValues, a novel benchmark designed to evaluate the cultural alignment and multilingual biases of large language models (LLMs). Drawing from large-scale, authoritative human surveys, we curate a structured dataset that captures the sociocultural landscape of MENA with population-level response distributions from 16 countries. Our analysis reveals three critical phenomena: "Cross-Lingual Value Shifts", where identical questions yield drastically different responses based on language; "Reasoning-Induced Degradation", where prompting models to explain their reasoning worsens cultural alignment; and "Logit Leakage", where models refuse sensitive questions while internal probabilities reveal strong hidden preferences.
arXiv Detail & Related papers (2025-10-15T05:10:57Z) - MMA-ASIA: A Multilingual and Multimodal Alignment Framework for Culturally-Grounded Evaluation
MMA-ASIA centers on a human-curated, multilingual, and multimodally aligned benchmark covering 8 Asian countries and 10 languages. This is the first dataset aligned at the input level across three modalities: text, image (visual question answering), and speech. We propose a five-dimensional evaluation protocol that measures: (i) cultural-awareness disparities across countries, (ii) cross-lingual consistency, (iii) cross-modal consistency, (iv) cultural knowledge generalization, and (v) grounding validity.
arXiv Detail & Related papers (2025-10-07T14:12:12Z) - Red Lines and Grey Zones in the Fog of War: Benchmarking Legal Risk, Moral Harm, and Regional Bias in Large Language Model Military Decision-Making
This study develops a benchmarking framework for evaluating aspects of legal and moral risk in targeting behaviour. We introduce four metrics grounded in International Humanitarian Law (IHL) and military doctrine. We evaluate three frontier models, GPT-4o, Gemini-2.5, and LLaMA-3.1, through 90 multi-agent, multi-turn crisis simulations.
arXiv Detail & Related papers (2025-10-03T20:55:04Z) - Which Cultural Lens Do Models Adopt? On Cultural Positioning Bias and Agentic Mitigation in LLMs
Large language models (LLMs) have unlocked a wide range of downstream generative applications. We find that they also risk perpetuating subtle fairness issues tied to culture, positioning their generations from the perspective of mainstream US culture. We propose two inference-time mitigation methods to resolve these biases.
arXiv Detail & Related papers (2025-09-25T12:28:25Z) - The Cultural Gene of Large Language Models: A Study on the Impact of Cross-Corpus Training on Model Values and Biases
Large language models (LLMs) are deployed globally, yet their underlying cultural and ethical assumptions remain underexplored. We compare a Western-centric model (GPT-4) and an Eastern-centric model (ERNIE Bot). Human annotation shows significant and consistent divergence across both dimensions.
arXiv Detail & Related papers (2025-08-17T15:54:14Z) - Do Large Language Models Understand Morality Across Cultures?
This study investigates the extent to which large language models capture cross-cultural differences and similarities in moral perspectives. Our results reveal that current LLMs often fail to reproduce the full spectrum of cross-cultural moral variation. These findings highlight a pressing need for more robust approaches to mitigate biases and improve cultural representativeness in LLMs.
arXiv Detail & Related papers (2025-07-28T20:25:36Z) - A Game-Theoretic Negotiation Framework for Cross-Cultural Consensus in LLMs
Large language models (LLMs) exhibit a pronounced WEIRD (Western, Educated, Industrialized, Rich, Democratic) cultural bias. This monocultural perspective may reinforce dominant values and marginalize diverse cultural viewpoints. We introduce a systematic framework designed to boost fair and robust cross-cultural consensus.
arXiv Detail & Related papers (2025-06-16T08:42:39Z) - Multiple LLM Agents Debate for Equitable Cultural Alignment
We introduce a Multi-Agent Debate framework, where two LLM-based agents debate over a cultural scenario and collaboratively reach a final decision. We evaluate these approaches on 7 open-weight LLMs (and 21 LLM combinations) using the NormAd-ETI benchmark for social etiquette norms in 75 countries. Experiments show that debate improves both overall accuracy and cultural group parity over single-LLM baselines.
arXiv Detail & Related papers (2025-05-30T15:01:52Z) - Multimodal Cultural Safety: Evaluation Frameworks and Alignment Strategies
Large vision-language models (LVLMs) are increasingly deployed in globally distributed applications, such as tourism assistants. CROSS is a benchmark designed to assess the cultural safety reasoning capabilities of LVLMs. We evaluate 21 leading LVLMs, including mixture-of-experts models and reasoning models.
arXiv Detail & Related papers (2025-05-20T23:20:38Z) - WorldView-Bench: A Benchmark for Evaluating Global Cultural Perspectives in Large Language Models
Large language models (LLMs) are predominantly trained and aligned in ways that reinforce Western-centric epistemologies and socio-cultural norms. We introduce WorldView-Bench, a benchmark designed to evaluate Global Cultural Inclusivity (GCI) in LLMs by analyzing their ability to accommodate diverse worldviews.
arXiv Detail & Related papers (2025-05-14T17:43:40Z) - Global MMLU: Understanding and Addressing Cultural and Linguistic Biases in Multilingual Evaluation
Cultural biases in multilingual datasets pose significant challenges for their effectiveness as global benchmarks. We show that progress on MMLU depends heavily on learning Western-centric concepts, with 28% of all questions requiring culturally sensitive knowledge. We release Global MMLU, an improved MMLU with evaluation coverage across 42 languages.
arXiv Detail & Related papers (2024-12-04T13:27:09Z) - The Root Shapes the Fruit: On the Persistence of Gender-Exclusive Harms in Aligned Language Models
We center transgender, nonbinary, and other gender-diverse identities to investigate how alignment procedures interact with pre-existing gender-diverse bias. Our findings reveal that DPO-aligned models are particularly sensitive to supervised finetuning. We conclude with recommendations tailored to DPO and broader alignment practices.
arXiv Detail & Related papers (2024-11-06T06:50:50Z)