BnMMLU: Measuring Massive Multitask Language Understanding in Bengali
- URL: http://arxiv.org/abs/2505.18951v1
- Date: Sun, 25 May 2025 02:54:31 GMT
- Title: BnMMLU: Measuring Massive Multitask Language Understanding in Bengali
- Authors: Saman Sarker Joy
- Abstract summary: We introduce BnMMLU, a benchmark to evaluate the language understanding capabilities of language models in Bengali. The dataset spans 23 domains, including science, humanities, mathematics and general knowledge. We benchmark several proprietary and open-source large language models (LLMs) on the BnMMLU test set.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The Massive Multitask Language Understanding (MMLU) benchmark has been widely used to evaluate language models across various domains. However, existing MMLU datasets primarily focus on high-resource languages such as English, which leaves low-resource languages like Bengali underrepresented. In this paper, we introduce BnMMLU, a benchmark to evaluate the multitask language understanding capabilities of language models in Bengali. The dataset spans 23 domains, including science, humanities, mathematics and general knowledge, and is structured in a multiple-choice format to assess the factual knowledge, application-based problem-solving and reasoning abilities of language models. It consists of 138,949 question-option pairs. We benchmark several proprietary and open-source large language models (LLMs) on the BnMMLU test set. Additionally, we annotate the test set with three cognitive categories (factual knowledge, procedural application and reasoning) to gain deeper insights into model strengths and weaknesses across various cognitive tasks. The results reveal significant performance gaps, highlighting the need for improved pre-training and fine-tuning strategies tailored to Bengali data. We release the dataset and benchmark results to facilitate further research in this area.
Related papers
- Evaluating LLMs' Multilingual Capabilities for Bengali: Benchmark Creation and Performance Analysis [0.0]
Bengali is an underrepresented language in NLP research. We systematically investigate the challenges that hinder Bengali NLP performance. Our findings reveal consistent performance gaps for Bengali compared to English.
arXiv Detail & Related papers (2025-07-31T05:16:43Z)
- Language Matters: How Do Multilingual Input and Reasoning Paths Affect Large Reasoning Models? [59.970391602080205]
Despite multilingual training, LRMs tend to default to reasoning in high-resource languages at test time. Cultural reasoning degrades performance on reasoning tasks but benefits cultural tasks, while safety evaluations exhibit language-specific behavior.
arXiv Detail & Related papers (2025-05-23T02:46:18Z)
- Evaluating Large Language Model with Knowledge Oriented Language Specific Simple Question Answering [73.73820209993515]
We introduce KoLasSimpleQA, the first benchmark evaluating the multilingual factual ability of Large Language Models (LLMs). Inspired by existing research, we created the question set with features such as single knowledge point coverage, absolute objectivity, unique answers, and temporal stability. Results show significant performance differences between the two domains.
arXiv Detail & Related papers (2025-05-22T12:27:02Z)
- TUMLU: A Unified and Native Language Understanding Benchmark for Turkic Languages [2.115206401188031]
We propose two benchmarks for Turkic language MMLU: TUMLU and TUMLU-mini. TUMLU consists of middle- and high-school level questions spanning 11 academic subjects in Azerbaijani, Crimean Tatar, Karakalpak, Kazakh, Tatar, Turkish, Uyghur, and Uzbek. We also present TUMLU-mini, a more concise, balanced, and manually verified subset of the dataset.
arXiv Detail & Related papers (2025-02-16T07:07:38Z)
- IndicMMLU-Pro: Benchmarking Indic Large Language Models on Multi-Task Language Understanding [2.062076715606512]
Known by more than 1.5 billion people in the Indian subcontinent, Indic languages present unique challenges and opportunities for natural language processing (NLP) research. IndicMMLU-Pro is a benchmark designed to evaluate Large Language Models (LLMs) across Indic languages.
arXiv Detail & Related papers (2025-01-27T03:19:03Z)
- MILU: A Multi-task Indic Language Understanding Benchmark [7.652738829153342]
We introduce MILU, a comprehensive evaluation benchmark designed to assess Large Language Models in Indic languages. With an India-centric design, MILU incorporates material from regional and state-level examinations, covering topics such as local history, arts, festivals, and laws, alongside standard subjects like science and mathematics. Open multilingual models outperform language-specific fine-tuned models, which perform only slightly better than random baselines.
arXiv Detail & Related papers (2024-11-04T19:17:17Z)
- MLaKE: Multilingual Knowledge Editing Benchmark for Large Language Models [65.10456412127405]
MLaKE is a benchmark for the adaptability of knowledge editing methods across five languages. MLaKE aggregates fact chains from Wikipedia across languages and generates questions in both free-form and multiple-choice. We evaluate the multilingual knowledge editing generalization capabilities of existing methods on MLaKE.
arXiv Detail & Related papers (2024-04-07T15:23:28Z)
- Natural Language Processing for Dialects of a Language: A Survey [56.93337350526933]
State-of-the-art natural language processing (NLP) models are trained on massive training corpora and report superlative performance on evaluation datasets. This survey delves into an important attribute of these datasets: the dialect of a language. Motivated by the performance degradation of NLP models for dialectal datasets and its implications for the equity of language technologies, we survey past research in NLP for dialects in terms of datasets and approaches.
arXiv Detail & Related papers (2024-01-11T03:04:38Z)
- The Belebele Benchmark: a Parallel Reading Comprehension Dataset in 122 Language Variants [80.4837840962273]
We present Belebele, a dataset spanning 122 language variants.
This dataset enables the evaluation of text models in high-, medium-, and low-resource languages.
arXiv Detail & Related papers (2023-08-31T17:43:08Z)
- IndicSUPERB: A Speech Processing Universal Performance Benchmark for Indian languages [16.121708272597154]
We release the IndicSUPERB benchmark for speech recognition in 12 Indian languages.
We train and evaluate different self-supervised models alongside a commonly used baseline benchmark.
We show that language-specific fine-tuned models are more accurate than baseline on most of the tasks.
arXiv Detail & Related papers (2022-08-24T20:14:52Z)
- X-FACTR: Multilingual Factual Knowledge Retrieval from Pretrained Language Models [103.75890012041366]
Language models (LMs) have proven surprisingly successful at capturing factual knowledge.
However, studies on LMs' factual representation ability have almost invariably been performed on English.
We create a benchmark of cloze-style probes for 23 typologically diverse languages.
arXiv Detail & Related papers (2020-10-13T05:29:56Z)