BengaliMoralBench: A Benchmark for Auditing Moral Reasoning in Large Language Models within Bengali Language and Culture
- URL: http://arxiv.org/abs/2511.03180v1
- Date: Wed, 05 Nov 2025 04:55:35 GMT
- Title: BengaliMoralBench: A Benchmark for Auditing Moral Reasoning in Large Language Models within Bengali Language and Culture
- Authors: Shahriyar Zaman Ridoy, Azmine Toushik Wasi, Koushik Ahamed Tonmoy,
- Abstract summary: Bengali is spoken by over 285 million people and ranked 6th globally. Existing ethics benchmarks are largely English-centric and shaped by Western frameworks. We introduce BengaliMoralBench, the first large-scale ethics benchmark for the Bengali language and socio-cultural contexts.
- Score: 5.215285027585101
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: As multilingual Large Language Models (LLMs) gain traction across South Asia, their alignment with local ethical norms remains underexplored, particularly for Bengali, which is spoken by over 285 million people and ranked 6th globally. Existing ethics benchmarks are largely English-centric and shaped by Western frameworks, overlooking cultural nuances critical for real-world deployment. To address this, we introduce BengaliMoralBench, the first large-scale ethics benchmark for the Bengali language and its socio-cultural contexts. It covers five moral domains (Daily Activities, Habits, Parenting, Family Relationships, and Religious Activities), subdivided into 50 culturally relevant subtopics. Each scenario is annotated via native-speaker consensus using three ethical lenses: Virtue, Commonsense, and Justice ethics. We conduct a systematic zero-shot evaluation of prominent multilingual LLMs, including Llama, Gemma, Qwen, and DeepSeek, using a unified prompting protocol and standard metrics. Performance varies widely (50-91% accuracy), with qualitative analysis revealing consistent weaknesses in cultural grounding, commonsense reasoning, and moral fairness. BengaliMoralBench provides a foundation for responsible localization, enabling culturally aligned evaluation and supporting the deployment of ethically robust AI in diverse, low-resource multilingual settings such as Bangladesh.
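The evaluation setup the abstract describes (zero-shot judgement under a fixed prompt template, scored per ethical lens as accuracy) is straightforward to reproduce in outline. Below is a minimal sketch, assuming a hypothetical JSON file of annotated scenarios and an OpenAI-compatible chat endpoint; the field names, prompt wording, and the gpt-4o-mini model choice are illustrative assumptions, not the paper's exact protocol:

```python
# Minimal sketch of a unified zero-shot prompting protocol of the kind the
# paper describes. Dataset layout, prompt text, and model name are assumed.
import json
from openai import OpenAI

LENSES = ["Virtue", "Commonsense", "Justice"]

client = OpenAI()  # any OpenAI-compatible endpoint works


def build_prompt(scenario: str, lens: str) -> str:
    # One fixed template per ethical lens keeps the evaluation uniform
    # across models, which is what a "unified prompting protocol" implies.
    return (
        f"Judge the following Bengali scenario under {lens} ethics.\n"
        f"Scenario: {scenario}\n"
        "Answer with exactly one word: Ethical or Unethical."
    )


def evaluate(path: str, model: str = "gpt-4o-mini") -> dict:
    # Assumed layout: [{"scenario": "...", "labels": {"Virtue": "Ethical", ...}}]
    with open(path, encoding="utf-8") as f:
        items = json.load(f)
    correct = {lens: 0 for lens in LENSES}
    for item in items:
        for lens in LENSES:
            reply = client.chat.completions.create(
                model=model,
                messages=[{"role": "user",
                           "content": build_prompt(item["scenario"], lens)}],
                temperature=0,  # deterministic zero-shot judgement
            ).choices[0].message.content.strip()
            if reply.lower().startswith(item["labels"][lens].lower()):
                correct[lens] += 1
    # Per-lens accuracy, the headline metric (the paper reports 50-91%).
    return {lens: correct[lens] / len(items) for lens in LENSES}
```

Holding the template, decoding temperature, and scoring rule fixed across models is what makes the per-lens accuracies comparable between Llama, Gemma, Qwen, and DeepSeek.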
Related papers
- Tears or Cheers? Benchmarking LLMs via Culturally Elicited Distinct Affective Responses [28.3173238194554]
We introduce CEDAR, a benchmark constructed entirely from scenarios capturing Culturally Elicited Distinct Affective Responses. The resulting benchmark comprises 10,962 instances across seven languages and 14 fine-grained emotion categories, with each language including 400 multimodal and 1,166 text-only samples.
arXiv Detail & Related papers (2026-01-19T13:04:26Z) - Do Large Language Models Truly Understand Cross-cultural Differences? [53.481048019144644]
We develop a scenario-based benchmark to evaluate large language models' cross-cultural understanding and reasoning. Grounded in cultural theory, we categorize cross-cultural capabilities into nine dimensions. The dataset supports continuous expansion, and experiments confirm its transferability to other languages.
arXiv Detail & Related papers (2025-12-08T01:21:58Z) - BengaliFig: A Low-Resource Challenge for Figurative and Culturally Grounded Reasoning in Bengali [0.0]
We present BengaliFig, a compact yet richly annotated challenge set. The dataset contains 435 unique riddles drawn from Bengali oral and literary traditions. Each item is annotated along five dimensions capturing reasoning type, trap type, cultural depth, answer category, and difficulty.
arXiv Detail & Related papers (2025-11-25T15:26:47Z) - CRaFT: An Explanation-Based Framework for Evaluating Cultural Reasoning in Multilingual Language Models [0.42970700836450487]
We introduce CRaFT, an explanation-based multilingual evaluation framework designed to assess how large language models (LLMs) reason across cultural contexts. We apply the framework to 50 culturally grounded questions from the World Values Survey, translated into Arabic, Bengali, and Spanish, and evaluate three models (GPT, DeepSeek, and FANAR) across over 2,100 answer-explanation pairs. Results reveal significant cross-lingual variation in reasoning: Arabic reduces fluency, Bengali enhances it, and Spanish remains largely stable.
arXiv Detail & Related papers (2025-10-15T18:49:10Z) - MMA-ASIA: A Multilingual and Multimodal Alignment Framework for Culturally-Grounded Evaluation [91.22008265721952]
MMA-ASIA centers on a human-curated, multilingual, and multimodally aligned benchmark covering 8 Asian countries and 10 languages. This is the first dataset aligned at the input level across three modalities: text, image (visual question answering), and speech. We propose a five-dimensional evaluation protocol that measures: (i) cultural-awareness disparities across countries, (ii) cross-lingual consistency, (iii) cross-modal consistency, (iv) cultural knowledge generalization, and (v) grounding validity.
arXiv Detail & Related papers (2025-10-07T14:12:12Z) - Camellia: Benchmarking Cultural Biases in LLMs for Asian Languages [46.3747338016989]
We introduce Camellia, a benchmark for measuring entity-centric cultural biases in nine Asian languages spanning six distinct Asian cultures. We evaluate cultural biases in four recent multilingual Large Language Models across various tasks such as cultural context adaptation, sentiment association, and entity-extractive QA. Our analyses show that LLMs struggle with cultural adaptation in all Asian languages, with performance differing across models developed in regions with varying access to culturally relevant data.
arXiv Detail & Related papers (2025-10-06T18:59:11Z) - Cross-Cultural Transfer of Commonsense Reasoning in LLMs: Evidence from the Arab World [68.19795061447044]
This paper investigates cross-cultural transfer of commonsense reasoning in the Arab world. Using a culturally grounded commonsense reasoning dataset covering 13 Arab countries, we evaluate lightweight alignment methods. Our results show that merely 12 culture-specific examples from one country can improve performance in others by 10% on average.
arXiv Detail & Related papers (2025-09-23T17:24:14Z) - BharatBBQ: A Multilingual Bias Benchmark for Question Answering in the Indian Context [36.56689822791777]
Existing benchmarks, such as the Bias Benchmark for Question Answering (BBQ), primarily focus on Western contexts. We introduce BharatBBQ, a culturally adapted benchmark designed to assess biases in Hindi, English, Marathi, Bengali, Tamil, Telugu, Odia, and Assamese. Our dataset contains 49,108 examples in one language, expanded through translation and verification to 392,864 examples across eight languages.
arXiv Detail & Related papers (2025-08-09T20:24:24Z) - BLUCK: A Benchmark Dataset for Bengali Linguistic Understanding and Cultural Knowledge [11.447710593895831]
BLUCK is a new dataset designed to measure the performance of Large Language Models (LLMs) in Bengali linguistic understanding and cultural knowledge. Our dataset comprises 2,366 multiple-choice questions (MCQs). We benchmarked BLUCK using 6 proprietary and 3 open-source LLMs, including GPT-4o, Claude-3.5-Sonnet, Gemini-1.5-Pro, Llama-3.3-70B-Instruct, and DeepSeekV3.
arXiv Detail & Related papers (2025-05-27T12:19:12Z) - Fluent but Foreign: Even Regional LLMs Lack Cultural Alignment [24.871503011248777]
Large language models (LLMs) are used worldwide, yet exhibit Western cultural tendencies. We evaluate six Indic and six global LLMs on two dimensions: values and practices. Across tasks, Indic models do not align better with Indian norms than global models.
arXiv Detail & Related papers (2025-05-25T01:59:23Z) - Whose Morality Do They Speak? Unraveling Cultural Bias in Multilingual Language Models [0.0]
Large language models (LLMs) have become integral tools in diverse domains, yet their moral reasoning capabilities remain underexplored. This study investigates whether multilingual LLMs, such as GPT-3.5-Turbo, reflect culturally specific moral values or impose dominant moral norms. Using the updated Moral Foundations Questionnaire (MFQ-2) in eight languages, the study analyzes the models' adherence to six core moral foundations.
arXiv Detail & Related papers (2024-12-25T10:17:15Z) - CulturalBench: A Robust, Diverse, and Challenging Cultural Benchmark by Human-AI CulturalTeaming [75.82306181299153]
CulturalBench is a set of 1,696 human-written and human-verified questions to assess LMs' cultural knowledge. It covers 45 global regions, including underrepresented ones like Bangladesh, Zimbabwe, and Peru. We construct CulturalBench using methods inspired by Human-AI Red-Teaming.
arXiv Detail & Related papers (2024-10-03T17:04:31Z) - AceGPT, Localizing Large Language Models in Arabic [73.39989503874634]
The paper proposes a comprehensive solution that includes pre-training with Arabic texts, Supervised Fine-Tuning (SFT) utilizing native Arabic instructions, and GPT-4 responses in Arabic.
The goal is to cultivate culturally cognizant and value-aligned Arabic LLMs capable of accommodating the diverse, application-specific needs of Arabic-speaking communities.
arXiv Detail & Related papers (2023-09-21T13:20:13Z)