Related papers: LAG-MMLU: Benchmarking Frontier LLM Understanding in Latvian and Giriama

URL: http://arxiv.org/abs/2503.11911v2
Date: Tue, 18 Mar 2025 04:01:37 GMT
Title: LAG-MMLU: Benchmarking Frontier LLM Understanding in Latvian and Giriama
Authors: Naome A. Etori, Kevin Lu, Randu Karisa, Arturs Kanepajs,
Abstract summary: OpenAI's o1 model outperforms others across all languages, scoring 92.8% in English, 88.8% in Latvian, and 70.8% in Giriama on 0-shot tasks. Our results underscore the need for localized benchmarks and human evaluations in advancing cultural AI contextualization.
Score: 4.533057394214656
License: http://creativecommons.org/licenses/by/4.0/
Abstract: As large language models (LLMs) rapidly advance, evaluating their performance is critical. LLMs are trained on multilingual data, but their reasoning abilities are mainly evaluated using English datasets. Hence, robust evaluation frameworks are needed using high-quality non-English datasets, especially for low-resource languages (LRLs). This study evaluates eight state-of-the-art (SOTA) LLMs on Latvian and Giriama using a Massive Multitask Language Understanding (MMLU) subset curated with native speakers for linguistic and cultural relevance. Giriama is benchmarked for the first time. Our evaluation shows that OpenAI's o1 model outperforms others across all languages, scoring 92.8% in English, 88.8% in Latvian, and 70.8% in Giriama on 0-shot tasks. Mistral-large (35.6%) and Llama-70B IT (41%) perform weakly on both Latvian and Giriama. Our results underscore the need for localized benchmarks and human evaluations in advancing cultural AI contextualization.

Related papers

Evaluating Large Language Model with Knowledge Oriented Language Specific Simple Question Answering [73.73820209993515]
We introduce KoLasSimpleQA, the first benchmark evaluating the multilingual factual ability of Large Language Models (LLMs). Inspired by existing research, we created the question set with features such as single knowledge point coverage, absolute objectivity, unique answers, and temporal stability. Results show significant performance differences between the two domains.
arXiv Detail & Related papers (2025-05-22T12:27:02Z)
MMLU-ProX: A Multilingual Benchmark for Advanced Large Language Model Evaluation [86.7047714187813]
MMLU-ProX is a benchmark covering 29 languages, built on an English benchmark. Each language version consists of 11,829 identical questions, enabling direct cross-linguistic comparisons. To meet efficient evaluation needs, we provide a lite version containing 658 questions per language.
arXiv Detail & Related papers (2025-03-13T15:59:20Z)
HellaSwag-Pro: A Large-Scale Bilingual Benchmark for Evaluating the Robustness of LLMs in Commonsense Reasoning [56.221060995324436]
Large language models (LLMs) have shown remarkable capabilities in commonsense reasoning. Do these models truly understand commonsense knowledge, or just memorize expression patterns? We introduce HellaSwag-Pro, a large-scale bilingual benchmark consisting of 11,200 cases.
arXiv Detail & Related papers (2025-02-17T03:24:02Z)
INJONGO: A Multicultural Intent Detection and Slot-filling Dataset for 16 African Languages [15.983678567785004]
Slot-filling and intent detection are well-established tasks in Conversational AI. We introduce Injongo -- a multicultural, open-source benchmark dataset for 16 African languages. We show the advantage of leveraging African-cultural utterances over Western-centric utterances for improving cross-lingual transfer.
arXiv Detail & Related papers (2025-02-13T23:17:10Z)
Bridging the Gap: Enhancing LLM Performance for Low-Resource African Languages with New Benchmarks, Fine-Tuning, and Cultural Adjustments [0.9214083577876088]
This paper creates approximately 1 million human-translated words of new benchmark data in 8 low-resource African languages. Our benchmarks are translations of Winogrande and three sections of MMLU: college medicine, clinical knowledge, and virology. Using the translated benchmarks, we report previously unknown performance gaps between state-of-the-art (SOTA) LLMs in English and African languages.
arXiv Detail & Related papers (2024-12-16T23:50:21Z)
One Language, Many Gaps: Evaluating Dialect Fairness and Robustness of Large Language Models in Reasoning Tasks [68.33068005789116]
We present the first study aimed at objectively assessing the fairness and robustness of Large Language Models (LLMs) in handling dialects in canonical reasoning tasks. We hire AAVE speakers, including experts with computer science backgrounds, to rewrite seven popular benchmarks, such as HumanEval and GSM8K. Our findings reveal that almost all of these widely used models show significant brittleness and unfairness to queries in AAVE.
arXiv Detail & Related papers (2024-10-14T18:44:23Z)
IrokoBench: A New Benchmark for African Languages in the Age of Large Language Models [18.083861654053585]
This paper introduces IrokoBench -- a human-translated benchmark dataset for 17 typologically-diverse low-resource African languages. We use IrokoBench to evaluate zero-shot, few-shot, and translate-test settings (where test sets are translated into English) across 10 open and six proprietary language models. We observe a significant performance gap between open and proprietary models, with the highest-performing open model, Gemma 2 27B, reaching only 63% of the performance of the best-performing proprietary model, GPT-4o.
arXiv Detail & Related papers (2024-06-05T15:23:08Z)
Benchmarking Large Language Models on CFLUE -- A Chinese Financial Language Understanding Evaluation Dataset [7.954348293179786]
We propose CFLUE, a benchmark to assess the capability of large language models (LLMs) across various dimensions. In knowledge assessment, it consists of 38K+ multiple-choice questions with associated solution explanations. In application assessment, it features 16K+ test instances across distinct groups of NLP tasks such as text classification, machine translation, relation extraction, reading comprehension, and text generation.
arXiv Detail & Related papers (2024-05-17T05:03:40Z)
OMGEval: An Open Multilingual Generative Evaluation Benchmark for Large Language Models [59.54423478596468]
We introduce OMGEval, the first Open-source Multilingual Generative test set that can assess the capability of LLMs in different languages. For each language, OMGEval provides 804 open-ended questions, covering a wide range of important capabilities of LLMs. Specifically, the current version of OMGEval includes 5 languages (i.e., Zh, Ru, Fr, Es, Ar).
arXiv Detail & Related papers (2024-02-21T04:42:41Z)
LLaMA Beyond English: An Empirical Study on Language Capability Transfer [49.298360366468934]
We focus on how to effectively transfer the capabilities of language generation and following instructions to a non-English language. We analyze the impact of key factors such as vocabulary extension, further pretraining, and instruction tuning on transfer. We employ four widely used standardized testing benchmarks: C-Eval, MMLU, AGI-Eval, and GAOKAO-Bench.
arXiv Detail & Related papers (2024-01-02T06:29:02Z)
Breaking Language Barriers in Multilingual Mathematical Reasoning: Insights and Observations [59.056367787688146]
This paper pioneers exploring and training powerful Multilingual Math Reasoning (xMR) LLMs. By utilizing translation, we construct the first multilingual math reasoning instruction dataset, MGSM8KInstruct, encompassing ten distinct languages.
arXiv Detail & Related papers (2023-10-31T08:09:20Z)
ChatGPT MT: Competitive for High- (but not Low-) Resource Languages [62.178282377729566]
Large language models (LLMs) implicitly learn to perform a range of language tasks, including machine translation (MT). We present the first experimental evidence for an expansive set of 204 languages, along with MT cost analysis. Our analysis reveals that a language's resource level is the most important feature in determining ChatGPT's relative ability to translate it.
arXiv Detail & Related papers (2023-09-14T04:36:00Z)
Extrapolating Large Language Models to Non-English by Aligning Languages [109.09051737966178]
Existing large language models show disparate capability across different languages. In this paper, we empower pre-trained LLMs on non-English languages by building semantic alignment across languages.
arXiv Detail & Related papers (2023-08-09T13:32:06Z)
Multilingual Machine Translation with Large Language Models: Empirical Results and Analysis [103.89753784762445]
Large language models (LLMs) have demonstrated remarkable potential in handling multilingual machine translation (MMT). This paper systematically investigates the advantages and challenges of LLMs for MMT. We thoroughly evaluate eight popular LLMs, including ChatGPT and GPT-4.
arXiv Detail & Related papers (2023-04-10T15:51:30Z)

This list is automatically generated from the titles and abstracts of the papers on this site.