BLUCK: A Benchmark Dataset for Bengali Linguistic Understanding and Cultural Knowledge
- URL: http://arxiv.org/abs/2505.21092v1
- Date: Tue, 27 May 2025 12:19:12 GMT
- Title: BLUCK: A Benchmark Dataset for Bengali Linguistic Understanding and Cultural Knowledge
- Authors: Daeen Kabir, Minhajur Rahman Chowdhury Mahim, Sheikh Shafayat, Adnan Sadik, Arian Ahmed, Eunsu Kim, Alice Oh
- Abstract summary: BLUCK is a new dataset designed to measure the performance of Large Language Models (LLMs) in Bengali linguistic understanding and cultural knowledge. The dataset comprises 2366 multiple-choice questions (MCQs). We benchmarked BLUCK using 6 proprietary and 3 open-source LLMs, including GPT-4o, Claude-3.5-Sonnet, Gemini-1.5-Pro, Llama-3.3-70B-Instruct, and DeepSeekV3.
- Score: 11.447710593895831
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: In this work, we introduce BLUCK, a new dataset designed to measure the performance of Large Language Models (LLMs) in Bengali linguistic understanding and cultural knowledge. Our dataset comprises 2366 multiple-choice questions (MCQs) carefully curated from compiled collections of several college- and job-level examinations, spanning 23 categories that cover Bangladesh's culture and history as well as Bengali linguistics. We benchmarked BLUCK using 6 proprietary and 3 open-source LLMs, including GPT-4o, Claude-3.5-Sonnet, Gemini-1.5-Pro, Llama-3.3-70B-Instruct, and DeepSeekV3. Our results show that while these models perform reasonably well overall, they struggle in some areas of Bengali phonetics. Although current LLMs' performance on Bengali cultural and linguistic contexts is still not comparable to that on mainstream languages like English, our results indicate that Bengali stands as a mid-resource language. Importantly, BLUCK is also the first MCQ-based evaluation benchmark centered on native Bengali culture, history, and linguistics.
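Benchmarks like BLUCK are typically scored by prompting a model with each MCQ, parsing an option letter out of the free-form response, and computing accuracy against the gold key. The sketch below is a hypothetical illustration of that common pattern, not the authors' actual evaluation code; the question items and the `extract_choice` helper are invented for the example.

```python
# Minimal sketch of MCQ-style benchmark scoring, as used by datasets such as
# BLUCK. Hypothetical illustration only -- not the authors' evaluation code.

def extract_choice(response, options=("A", "B", "C", "D")):
    """Return the first option letter found in a model's free-form answer."""
    for token in response.replace(".", " ").replace(")", " ").split():
        if token.upper() in options:
            return token.upper()
    return None  # the model gave no parseable option

def score_mcq(items, responses):
    """Accuracy: fraction of items where the parsed letter matches the gold key."""
    correct = sum(
        1 for item, resp in zip(items, responses)
        if extract_choice(resp) == item["answer"]
    )
    return correct / len(items)

if __name__ == "__main__":
    items = [{"answer": "B"}, {"answer": "D"}]  # gold keys only, for brevity
    responses = ["The correct option is B.", "A) seems right."]
    print(score_mcq(items, responses))  # 0.5
```

Real harnesses usually add log-likelihood scoring over the option letters as a fallback when the generated text contains no parseable choice.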
Related papers
- Bengali Text Classification: An Evaluation of Large Language Model Approaches [0.0]
Unlike English, Bengali faces challenges due to the lack of extensive annotated datasets and pre-trained language models. This study explores the effectiveness of large language models (LLMs) in classifying Bengali newspaper articles.
arXiv Detail & Related papers (2026-01-17T18:25:19Z) - BengaliFig: A Low-Resource Challenge for Figurative and Culturally Grounded Reasoning in Bengali [0.0]
We present BengaliFig, a compact yet richly annotated challenge set. The dataset contains 435 unique riddles drawn from Bengali oral and literary traditions. Each item is annotated along five dimensions capturing reasoning type, trap type, cultural depth, answer category, and difficulty.
arXiv Detail & Related papers (2025-11-25T15:26:47Z) - BengaliMoralBench: A Benchmark for Auditing Moral Reasoning in Large Language Models within Bengali Language and Culture [5.215285027585101]
Bengali is spoken by over 285 million people and ranks 6th globally. Existing ethics benchmarks are largely English-centric and shaped by Western frameworks. We introduce BengaliMoralBench, the first large-scale ethics benchmark for the Bengali language and socio-cultural contexts.
arXiv Detail & Related papers (2025-11-05T04:55:35Z) - BnMMLU: Measuring Massive Multitask Language Understanding in Bengali [0.0]
We introduce BnMMLU, a benchmark to evaluate the Bengali language understanding capabilities of language models. The dataset spans 23 domains, including science, humanities, mathematics, and general knowledge. We benchmark several proprietary and open-source large language models (LLMs) on the BnMMLU test set.
arXiv Detail & Related papers (2025-05-25T02:54:31Z) - PolyMath: Evaluating Mathematical Reasoning in Multilingual Contexts [79.84059473102778]
PolyMath is a multilingual mathematical reasoning benchmark covering 18 languages and 4 easy-to-hard difficulty levels. Our benchmark ensures difficulty comprehensiveness, language diversity, and high-quality translation.
arXiv Detail & Related papers (2025-04-25T15:39:04Z) - Centurio: On Drivers of Multilingual Ability of Large Vision-Language Model [66.17354128553244]
Most Large Vision-Language Models (LVLMs) to date are trained predominantly on English data. We investigate how different training mixes tip the scale for different groups of languages. We train Centurio, a 100-language LVLM, offering state-of-the-art performance in an evaluation covering 14 tasks and 56 languages.
arXiv Detail & Related papers (2025-01-09T10:26:14Z) - Pangea: A Fully Open Multilingual Multimodal LLM for 39 Languages [55.36534539177367]
This paper introduces Pangea, a multilingual multimodal large language model (MLLM) trained on a diverse 6M instruction dataset spanning 39 languages. Pangea significantly outperforms existing open-source models in multilingual settings and diverse cultural contexts. We fully open-source our data, code, and trained checkpoints to facilitate the development of inclusive and robust multilingual MLLMs.
arXiv Detail & Related papers (2024-10-21T16:19:41Z) - BEnQA: A Question Answering and Reasoning Benchmark for Bengali and English [18.217122567176585]
We introduce BEnQA, a dataset comprising parallel Bengali and English exam questions for middle and high school levels in Bangladesh.
Our dataset consists of approximately 5K questions covering several subjects in science with different types of questions, including factual, application, and reasoning-based questions.
We benchmark several Large Language Models (LLMs) with our parallel dataset and observe a notable performance disparity between the models in Bengali and English.
arXiv Detail & Related papers (2024-03-16T11:27:42Z) - CIF-Bench: A Chinese Instruction-Following Benchmark for Evaluating the Generalizability of Large Language Models [53.9835961434552]
We introduce the Chinese Instruction-Following Benchmark (CIF-Bench) to evaluate the generalizability of large language models (LLMs) to the Chinese language.
CIF-Bench comprises 150 tasks and 15,000 input-output pairs, developed by native speakers to test complex reasoning and Chinese cultural nuances.
To mitigate data contamination, we release only half of the dataset publicly, with the remainder kept private, and introduce diversified instructions to minimize score variance.
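The public/private split described above is a simple contamination-control protocol: shuffle the item pool with a fixed seed and release only one half. The sketch below is a hypothetical illustration of that idea under stated assumptions (a 50/50 split, a fixed seed); it is not the CIF-Bench authors' code.

```python
# Hypothetical sketch of a public/private benchmark split for contamination
# control, as CIF-Bench describes: release half the items, hold back the rest.
import random

def split_public_private(items, seed=0, public_frac=0.5):
    """Deterministically shuffle items and split them into a released
    (public) portion and a withheld (private) portion."""
    rng = random.Random(seed)       # fixed seed makes the split reproducible
    shuffled = items[:]             # copy; leave the caller's list untouched
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * public_frac)
    return shuffled[:cut], shuffled[cut:]

public, private = split_public_private(list(range(15000)))
# len(public) == len(private) == 7500
```

Scores on the private half can then serve as a check: a large public-private gap for a given model is a signal that the public items may have leaked into its training data.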
arXiv Detail & Related papers (2024-02-20T16:02:12Z) - ArabicMMLU: Assessing Massive Multitask Language Understanding in Arabic [51.922112625469836]
We present ArabicMMLU, the first multi-task language understanding benchmark for the Arabic language.
Our data comprises 40 tasks and 14,575 multiple-choice questions in Modern Standard Arabic (MSA) and is carefully constructed by collaborating with native speakers in the region.
Our evaluations of 35 models reveal substantial room for improvement, particularly among the best open-source models.
arXiv Detail & Related papers (2024-02-20T09:07:41Z) - BenLLMEval: A Comprehensive Evaluation into the Potentials and Pitfalls of Large Language Models on Bengali NLP [17.362068473064717]
Large Language Models (LLMs) have emerged as one of the most important breakthroughs in NLP.
This paper introduces BenLLM-Eval, which consists of a comprehensive evaluation of LLMs to benchmark their performance in the Bengali language.
Our experimental results demonstrate that in some Bengali NLP tasks, zero-shot LLMs can achieve performance on par with, or even better than, current SOTA fine-tuned models.
arXiv Detail & Related papers (2023-09-22T20:29:34Z) - BHASA: A Holistic Southeast Asian Linguistic and Cultural Evaluation Suite for Large Language Models [0.06597195879147556]
BHASA is a holistic linguistic and cultural evaluation suite for Large Language Models (LLMs) in Southeast Asian languages.
It comprises three components: (1) a NLP benchmark covering eight tasks across Natural Language Understanding (NLU), Generation (NLG) and Reasoning (NLR) tasks, (2) LINDSEA, a linguistic diagnostic toolkit that spans the gamut of linguistic phenomena including syntax, semantics and pragmatics, and (3) a cultural diagnostics dataset that probes for both cultural representation and sensitivity.
arXiv Detail & Related papers (2023-09-12T09:31:25Z) - Zero-Shot Cross-Lingual Summarization via Large Language Models [108.30673793281987]
Cross-lingual summarization (CLS) generates a summary of a document in a different target language.
Recent emergence of Large Language Models (LLMs) has attracted wide attention from the computational linguistics community.
In this report, we empirically use various prompts to guide LLMs to perform zero-shot CLS from different paradigms.
arXiv Detail & Related papers (2023-02-28T01:27:37Z) - Simple or Complex? Learning to Predict Readability of Bengali Texts [6.860272388539321]
We present a readability analysis tool capable of analyzing text written in the Bengali language.
Despite being the 7th most spoken language in the world with 230 million native speakers, Bengali suffers from a lack of fundamental resources for natural language processing.
arXiv Detail & Related papers (2020-12-09T01:41:35Z)
This list is automatically generated from the titles and abstracts of the papers on this site.