Large Language Models Only Pass Primary School Exams in Indonesia: A
Comprehensive Test on IndoMMLU
- URL: http://arxiv.org/abs/2310.04928v2
- Date: Sat, 21 Oct 2023 17:13:05 GMT
- Authors: Fajri Koto and Nurul Aisyah and Haonan Li and Timothy Baldwin
- Abstract summary: IndoMMLU is the first multi-task language understanding benchmark for Indonesian culture and languages.
It consists of questions from primary school to university entrance exams in Indonesia.
- Score: 31.555098850095817
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Although large language models (LLMs) are often pre-trained on large-scale
multilingual texts, their reasoning abilities and real-world knowledge are
mainly evaluated based on English datasets. Assessing LLM capabilities beyond
English is increasingly vital but is hindered by the lack of suitable
datasets. In this work, we introduce IndoMMLU, the first multi-task language
understanding benchmark for Indonesian culture and languages, which consists of
questions from primary school to university entrance exams in Indonesia. By
employing professional teachers, we obtain 14,981 questions across 64 tasks and
education levels, with 46% of the questions focusing on assessing proficiency
in the Indonesian language and knowledge of nine local languages and cultures
in Indonesia. Our empirical evaluations show that GPT-3.5 only manages to pass
the Indonesian primary school level, with limited knowledge of local Indonesian
languages and culture. Other smaller models such as BLOOMZ and Falcon perform
at even lower levels.
Related papers
- OmniEduBench: A Comprehensive Chinese Benchmark for Evaluating Large Language Models in Education [72.40048732210055]
We introduce OmniEduBench, a comprehensive Chinese educational benchmark. The data is meticulously divided into two core dimensions: the knowledge dimension and the cultivation dimension. The dataset features a rich variety of question formats, including 11 common exam question types.
arXiv Detail & Related papers (2025-10-30T12:16:29Z)
- SinhalaMMLU: A Comprehensive Benchmark for Evaluating Multitask Language Understanding in Sinhala [39.525952729268994]
We introduce SinhalaMMLU, the first multiple-choice question answering benchmark designed specifically for Sinhala. The dataset includes over 7,000 questions spanning secondary to collegiate education levels, aligned with the Sri Lankan national curriculum. We evaluate 26 LLMs on SinhalaMMLU and observe that, while Claude 3.5 Sonnet and GPT-4o achieve the highest average accuracies at 67% and 62% respectively, overall model performance remains limited.
arXiv Detail & Related papers (2025-09-03T09:22:39Z)
- MELAC: Massive Evaluation of Large Language Models with Alignment of Culture in Persian Language [0.8182812460605992]
This study focuses on the Persian language and Iranian culture. We introduce 19 new evaluation datasets specifically designed to assess LLMs on topics such as Iranian law, Persian grammar, Persian idioms, and university entrance exams. Using these datasets, we benchmarked 41 prominent LLMs, aiming to bridge the existing cultural and linguistic evaluation gap in the field.
arXiv Detail & Related papers (2025-08-01T14:46:57Z)
- MultiNRC: A Challenging and Native Multilingual Reasoning Evaluation Benchmark for LLMs [56.87573414161703]
We introduce the Multilingual Native Reasoning Challenge (MultiNRC), a benchmark to assess Large Language Models (LLMs). MultiNRC covers four core reasoning categories: language-specific linguistic reasoning, wordplay & riddles, cultural/tradition reasoning, and math reasoning with cultural relevance. For cultural/tradition reasoning and math reasoning with cultural relevance, we also provide English equivalent translations of the multilingual questions by manual translation from native speakers fluent in English.
arXiv Detail & Related papers (2025-07-23T12:56:31Z)
- Multilingual Performance Biases of Large Language Models in Education [39.14806026620442]
Large language models (LLMs) are increasingly being adopted in educational settings.
This work ascertains whether their use in non-English-language educational settings is warranted.
arXiv Detail & Related papers (2025-04-24T16:32:31Z)
- LAG-MMLU: Benchmarking Frontier LLM Understanding in Latvian and Giriama [4.533057394214656]
OpenAI's o1 model outperforms others across all languages, scoring 92.8% in English, 88.8% in Latvian, and 70.8% in Giriama on 0-shot tasks.
Our results underscore the need for localized benchmarks and human evaluations in advancing cultural AI contextualization.
arXiv Detail & Related papers (2025-03-14T22:50:50Z)
- Global MMLU: Understanding and Addressing Cultural and Linguistic Biases in Multilingual Evaluation [71.59208664920452]
Cultural biases in multilingual datasets pose significant challenges for their effectiveness as global benchmarks.
We show that progress on MMLU depends heavily on learning Western-centric concepts, with 28% of all questions requiring culturally sensitive knowledge.
We release Global MMLU, an improved MMLU with evaluation coverage across 42 languages.
arXiv Detail & Related papers (2024-12-04T13:27:09Z)
- MILU: A Multi-task Indic Language Understanding Benchmark [7.652738829153342]
Existing benchmarks predominantly focus on English, leaving substantial gaps in assessing Large Language Models in Indic languages.
We introduce MILU, a comprehensive evaluation benchmark designed to address this gap.
With an India-centric design, MILU incorporates material from regional and state-level examinations, covering topics such as local history, arts, festivals, and laws, alongside standard subjects like science and mathematics.
arXiv Detail & Related papers (2024-11-04T19:17:17Z)
- MLaKE: Multilingual Knowledge Editing Benchmark for Large Language Models [65.10456412127405]
MLaKE is a benchmark for the adaptability of knowledge editing methods across five languages.
MLaKE aggregates fact chains from Wikipedia across languages and generates questions in both free-form and multiple-choice.
We evaluate the multilingual knowledge editing generalization capabilities of existing methods on MLaKE.
arXiv Detail & Related papers (2024-04-07T15:23:28Z)
- ArabicMMLU: Assessing Massive Multitask Language Understanding in Arabic [51.922112625469836]
We present ArabicMMLU, the first multi-task language understanding benchmark for the Arabic language.
Our data comprises 40 tasks and 14,575 multiple-choice questions in Modern Standard Arabic (MSA) and is carefully constructed by collaborating with native speakers in the region.
Our evaluations of 35 models reveal substantial room for improvement, particularly among the best open-source models.
arXiv Detail & Related papers (2024-02-20T09:07:41Z)
- LLaMA Beyond English: An Empirical Study on Language Capability Transfer [49.298360366468934]
We focus on how to effectively transfer the capabilities of language generation and following instructions to a non-English language.
We analyze the impact of key factors such as vocabulary extension, further pretraining, and instruction tuning on transfer.
We employ four widely used standardized testing benchmarks: C-Eval, MMLU, AGI-Eval, and GAOKAO-Bench.
arXiv Detail & Related papers (2024-01-02T06:29:02Z)
- BHASA: A Holistic Southeast Asian Linguistic and Cultural Evaluation Suite for Large Language Models [0.06597195879147556]
BHASA is a holistic linguistic and cultural evaluation suite for Large Language Models (LLMs) in Southeast Asian languages.
It comprises three components: (1) an NLP benchmark covering eight tasks across Natural Language Understanding (NLU), Generation (NLG), and Reasoning (NLR), (2) LINDSEA, a linguistic diagnostic toolkit that spans the gamut of linguistic phenomena, including syntax, semantics, and pragmatics, and (3) a cultural diagnostics dataset that probes for both cultural representation and sensitivity.
arXiv Detail & Related papers (2023-09-12T09:31:25Z)
- M3Exam: A Multilingual, Multimodal, Multilevel Benchmark for Examining Large Language Models [76.88692952308084]
M3Exam is a benchmark for evaluating large language models (LLMs) in a multilingual, multimodal, and multilevel context.
M3Exam contains 12,317 questions in 9 diverse languages with three educational levels.
We assess the performance of top-performing LLMs on M3Exam and find that current models, including GPT-4, still struggle with multilingual text.
arXiv Detail & Related papers (2023-06-08T13:21:29Z)
- One Country, 700+ Languages: NLP Challenges for Underrepresented Languages and Dialects in Indonesia [60.87739250251769]
We provide an overview of the current state of NLP research for Indonesia's 700+ languages.
We highlight challenges in Indonesian NLP and how these affect the performance of current NLP systems.
arXiv Detail & Related papers (2022-03-24T22:07:22Z)
- IndoLEM and IndoBERT: A Benchmark Dataset and Pre-trained Language Model for Indonesian NLP [41.57622648924415]
The Indonesian language is spoken by almost 200 million people and is the 10th most spoken language in the world.
Previous work on Indonesian has been hampered by a lack of annotated datasets, a sparsity of language resources, and a lack of resource standardization.
We release the IndoLEM dataset comprising seven tasks for the Indonesian language, spanning morpho-syntax, semantics, and discourse.
We additionally release IndoBERT, a new pre-trained language model for Indonesian, and evaluate it over IndoLEM.
arXiv Detail & Related papers (2020-11-02T01:54:56Z)
- IndoNLU: Benchmark and Resources for Evaluating Indonesian Natural Language Understanding [41.691861010118394]
We introduce the first-ever vast resource for training, evaluating, and benchmarking Indonesian natural language understanding tasks.
IndoNLU includes twelve tasks, ranging from single sentence classification to pair-sentences sequence labeling with different levels of complexity.
The datasets for the tasks lie in different domains and styles to ensure task diversity.
We also provide a set of Indonesian pre-trained models (IndoBERT) trained from a large and clean Indonesian dataset Indo4B.
arXiv Detail & Related papers (2020-09-11T12:21:41Z)
This list is automatically generated from the titles and abstracts of the papers in this site.