ZNO-Eval: Benchmarking reasoning capabilities of large language models in Ukrainian
- URL: http://arxiv.org/abs/2501.06715v1
- Date: Sun, 12 Jan 2025 04:49:06 GMT
- Title: ZNO-Eval: Benchmarking reasoning capabilities of large language models in Ukrainian
- Authors: Mykyta Syromiatnikov, Victoria Ruvinskaya, Anastasiya Troynina
- Abstract summary: This paper presents the ZNO-Eval benchmark based on real exam tasks from Ukraine's standardized educational testing system.
It paves the way toward a thorough analysis of reasoning capabilities across different domains and complexities.
The paper evaluates several well-known language models, including GPT-3.5-Turbo, GPT-4o, GPT-4-Turbo, Mistral Large, Claude 3 Opus, and Gemini-1.5 Pro.
- Abstract: As the usage of large language models for problems outside of simple text understanding or generation increases, assessing their abilities and limitations becomes crucial. While significant progress has been made in this area over the last few years, most research has focused on benchmarking English, leaving other languages underexplored. This makes evaluating the reasoning and robustness of language models in Ukrainian particularly challenging. The purpose of this work is to establish a comprehensive benchmark for evaluating the reasoning capabilities of large language models in the Ukrainian language. This paper presents the ZNO-Eval benchmark based on real exam tasks from Ukraine's standardized educational testing system: the External Independent Evaluation and the National Multi-subject Test. With single-answer, multiple-choice, matching, and open-ended questions from diverse subjects, including Ukrainian language, mathematics, history, and geography, this dataset paves the way toward a thorough analysis of reasoning capabilities across different domains and complexities. Evaluation of several well-known language models, such as GPT-3.5-Turbo, GPT-4o, GPT-4-Turbo, Mistral Large, Claude 3 Opus, and Gemini-1.5 Pro, on this benchmark demonstrated the superiority of GPT-4o in both common knowledge reasoning and intricate language tasks. At the same time, Gemini-1.5 Pro and GPT-4-Turbo excelled in the arithmetic domain, leading in single-answer and open-ended math problems. While all models were close to maximum performance in text-only common knowledge tasks like history and geography, there is still a gap for Ukrainian language and math, highlighting the importance of developing specialized language benchmarks for more accurate assessments of model capabilities and limitations across different languages and contexts.
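The abstract describes scoring models on several question formats (single-answer, multiple-choice, matching, and open-ended) across subjects. A minimal sketch of such a scoring loop is shown below; the question schema, field names, and partial-credit rule for matching questions are illustrative assumptions, not the authors' actual evaluation harness.

```python
# Hypothetical sketch of scoring exam-style benchmark answers per subject.
# Question format and scoring rules are assumptions for illustration,
# not the actual ZNO-Eval pipeline.

def score_answer(question_type: str, predicted: str, gold) -> float:
    """Return a score in [0, 1] for a single question."""
    if question_type in ("single_answer", "multiple_choice", "open_ended"):
        # Exact match after light normalization.
        return float(predicted.strip().lower() == str(gold).strip().lower())
    if question_type == "matching":
        # gold is a dict like {"1": "A", "2": "B"}; predicted is "1-A,2-B".
        # Partial credit: fraction of correctly matched pairs.
        pairs = dict(p.split("-") for p in predicted.replace(" ", "").split(","))
        correct = sum(pairs.get(k) == v for k, v in gold.items())
        return correct / len(gold)
    raise ValueError(f"unknown question type: {question_type}")

def evaluate(questions, model_answer):
    """Average score per subject; model_answer(q) returns the model's text."""
    totals, counts = {}, {}
    for q in questions:
        s = score_answer(q["type"], model_answer(q), q["gold"])
        totals[q["subject"]] = totals.get(q["subject"], 0.0) + s
        counts[q["subject"]] = counts.get(q["subject"], 0) + 1
    return {subj: totals[subj] / counts[subj] for subj in totals}
```

Per-subject averaging like this is what makes the kind of domain-level comparison in the abstract (e.g. history vs. math performance) possible.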
Related papers
- TUMLU: A Unified and Native Language Understanding Benchmark for Turkic Languages [2.115206401188031]
We propose TUMLU, a unified and native language understanding benchmark for Turkic languages, consisting of middle- and high-school level questions spanning 11 academic subjects in Azerbaijani, Crimean Tatar, Karakalpak, Kazakh, Tatar, Turkish, Uyghur, and Uzbek.
We also present TUMLU-mini, a more concise, balanced, and manually verified subset of the dataset.
arXiv Detail & Related papers (2025-02-16T07:07:38Z) - BenchMAX: A Comprehensive Multilingual Evaluation Suite for Large Language Models [44.759766566414626]
We introduce BenchMAX, a multi-way multilingual evaluation benchmark.
To maintain high quality, three distinct native-speaking annotators independently annotate each sample.
Extensive experiments on BenchMAX reveal varying effectiveness of core capabilities across languages.
arXiv Detail & Related papers (2025-02-11T08:17:19Z) - Benchmarking Multimodal Models for Ukrainian Language Understanding Across Academic and Cultural Domains [0.0]
We introduce ZNO-Vision, a comprehensive multimodal Ukrainian-centric benchmark derived from the standardized university entrance examination (ZNO).
The benchmark consists of over 4,300 expert-crafted questions spanning 12 academic disciplines, including mathematics, physics, chemistry, and humanities.
Alongside the new benchmark, we performed the first evaluation study of multimodal text generation for the Ukrainian language.
arXiv Detail & Related papers (2024-11-22T00:37:49Z) - MultiPragEval: Multilingual Pragmatic Evaluation of Large Language Models [0.5822010906632046]
This study introduces MultiPragEval, the first multilingual pragmatic evaluation of Large Language Models (LLMs).
Comprising 1200 question units categorized according to Grice's Cooperative Principle, MultiPragEval enables an in-depth assessment of LLMs' contextual awareness and their ability to infer implied meanings.
Our findings demonstrate that Claude3-Opus significantly outperforms other models in all tested languages, establishing a state-of-the-art in the field.
arXiv Detail & Related papers (2024-06-11T21:46:03Z) - MLaKE: Multilingual Knowledge Editing Benchmark for Large Language Models [65.10456412127405]
MLaKE is a benchmark for the adaptability of knowledge editing methods across five languages.
MLaKE aggregates fact chains from Wikipedia across languages and generates questions in both free-form and multiple-choice.
We evaluate the multilingual knowledge editing generalization capabilities of existing methods on MLaKE.
arXiv Detail & Related papers (2024-04-07T15:23:28Z) - Breaking Language Barriers in Multilingual Mathematical Reasoning: Insights and Observations [59.056367787688146]
This paper pioneers exploring and training powerful Multilingual Math Reasoning (xMR) LLMs.
By utilizing translation, we construct the first multilingual math reasoning instruction dataset, MGSM8KInstruct, encompassing ten distinct languages.
arXiv Detail & Related papers (2023-10-31T08:09:20Z) - HAE-RAE Bench: Evaluation of Korean Knowledge in Language Models [0.0]
We introduce the HAE-RAE Bench, a dataset curated to challenge models lacking Korean cultural and contextual depth.
The dataset encompasses six downstream tasks across four domains: vocabulary, history, general knowledge, and reading comprehension.
arXiv Detail & Related papers (2023-09-06T04:38:16Z) - Disco-Bench: A Discourse-Aware Evaluation Benchmark for Language Modelling [70.23876429382969]
We propose a benchmark that can evaluate intra-sentence discourse properties across a diverse set of NLP tasks.
Disco-Bench consists of 9 document-level testsets in the literature domain, which contain rich discourse phenomena.
For linguistic analysis, we also design a diagnostic test suite that can examine whether the target models learn discourse knowledge.
arXiv Detail & Related papers (2023-07-16T15:18:25Z) - Lila: A Unified Benchmark for Mathematical Reasoning [59.97570380432861]
LILA is a unified mathematical reasoning benchmark consisting of 23 diverse tasks along four dimensions.
We construct our benchmark by extending 20 existing datasets, collecting task instructions and solutions in the form of Python programs.
We introduce BHASKARA, a general-purpose mathematical reasoning model trained on LILA.
arXiv Detail & Related papers (2022-10-31T17:41:26Z) - IGLUE: A Benchmark for Transfer Learning across Modalities, Tasks, and Languages [87.5457337866383]
We introduce the Image-Grounded Language Understanding Evaluation benchmark.
IGLUE brings together visual question answering, cross-modal retrieval, grounded reasoning, and grounded entailment tasks across 20 diverse languages.
We find that translate-test transfer is superior to zero-shot transfer and that few-shot learning is hard to harness for many tasks.
arXiv Detail & Related papers (2022-01-27T18:53:22Z) - AM2iCo: Evaluating Word Meaning in Context across Low-Resource Languages with Adversarial Examples [51.048234591165155]
We present AM2iCo, Adversarial and Multilingual Meaning in Context.
It aims to faithfully assess the ability of state-of-the-art (SotA) representation models to understand the identity of word meaning in cross-lingual contexts.
Results reveal that current SotA pretrained encoders substantially lag behind human performance.
arXiv Detail & Related papers (2021-04-17T20:23:45Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.