TurkishMMLU: Measuring Massive Multitask Language Understanding in Turkish
- URL: http://arxiv.org/abs/2407.12402v2
- Date: Thu, 3 Oct 2024 15:45:52 GMT
- Title: TurkishMMLU: Measuring Massive Multitask Language Understanding in Turkish
- Authors: Arda Yüksel, Abdullatif Köksal, Lütfi Kerem Şenel, Anna Korhonen, Hinrich Schütze
- Abstract summary: We introduce the first multitask, multiple-choice Turkish QA benchmark, TurkishMMLU.
TurkishMMLU includes over 10,000 questions, covering 9 different subjects from Turkish high-school education curricula.
We evaluate over 20 LLMs, including multilingual open-source (e.g., Gemma, Llama, mT5), closed-source (GPT-4o, Claude, Gemini), and Turkish-adapted (e.g., Trendyol) models.
- Score: 54.51310112013655
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multiple-choice question answering tasks evaluate the reasoning, comprehension, and mathematical abilities of Large Language Models (LLMs). While existing benchmarks employ automatic translation for multilingual evaluation, this approach is error-prone and potentially introduces culturally biased questions, especially in social sciences. We introduce the first multitask, multiple-choice Turkish QA benchmark, TurkishMMLU, to evaluate LLMs' understanding of the Turkish language. TurkishMMLU includes over 10,000 questions, covering 9 different subjects from Turkish high-school education curricula. These questions are written by curriculum experts and aligned with the high-school curricula in Turkey, covering subjects ranging from natural sciences and math to more culturally representative topics such as Turkish Literature and the history of the Turkish Republic. We evaluate over 20 LLMs, including multilingual open-source (e.g., Gemma, Llama, mT5), closed-source (GPT-4o, Claude, Gemini), and Turkish-adapted (e.g., Trendyol) models. We provide an extensive evaluation, including zero-shot and few-shot evaluation of LLMs, chain-of-thought reasoning, and question difficulty analysis alongside model performance. We also offer an in-depth analysis of the Turkish capabilities and limitations of current LLMs to provide insights for future LLMs for the Turkish language. We publicly release our code for the dataset and evaluation: https://github.com/ArdaYueksel/TurkishMMLU.
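Since the paper releases its evaluation code, a schematic view of the setup may help. Below is a minimal sketch of a zero-shot multiple-choice evaluation of the kind the abstract describes; the field names (`question`, `choices`, `answer`) and the Turkish prompt wording are illustrative assumptions, not the released repository's actual schema.

```python
# Minimal sketch of zero-shot multiple-choice evaluation, assuming a simple
# item schema; see https://github.com/ArdaYueksel/TurkishMMLU for the
# authors' actual code and data format.

def format_prompt(question: str, choices: list[str]) -> str:
    """Render one multiple-choice question as a zero-shot prompt."""
    letters = "ABCDE"
    lines = [f"Soru: {question}"]  # "Soru" = "Question"
    lines += [f"{letters[i]}) {c}" for i, c in enumerate(choices)]
    lines.append("Cevap:")  # "Cevap" = "Answer"
    return "\n".join(lines)

def evaluate(model_answer_fn, dataset) -> float:
    """Accuracy of a model over items with question/choices/answer fields.

    `model_answer_fn` maps a prompt string to a predicted letter;
    any LLM call can be swapped in here.
    """
    correct = 0
    for item in dataset:
        prompt = format_prompt(item["question"], item["choices"])
        pred = model_answer_fn(prompt).strip().upper()[:1]
        correct += pred == item["answer"]
    return correct / len(dataset)

if __name__ == "__main__":
    toy = [{"question": "2 + 2 kaçtır?", "choices": ["3", "4", "5"], "answer": "B"}]
    print(evaluate(lambda _prompt: "B", toy))  # prints 1.0 with a stub model
```

Few-shot evaluation and chain-of-thought prompting fit the same skeleton: prepend worked examples to the prompt, or ask the model to reason before its final letter and parse that letter out.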
Related papers
- TUMLU: A Unified and Native Language Understanding Benchmark for Turkic Languages [2.115206401188031]
We propose two benchmarks for Turkic language MMLU: TUMLU and TUMLU-mini.
TUMLU consists of middle- and high-school level questions spanning 11 academic subjects in Azerbaijani, Crimean Tatar, Karakalpak, Kazakh, Tatar, Turkish, Uyghur, and Uzbek.
We also present TUMLU-mini, a more concise, balanced, and manually verified subset of the dataset.
arXiv Detail & Related papers (2025-02-16T07:07:38Z)
- Setting Standards in Turkish NLP: TR-MMLU for Large Language Model Evaluation [0.29687381456163997]
The Turkish MMLU (TR-MMLU) benchmark is designed to assess the linguistic and conceptual capabilities of large language models (LLMs) in Turkish.
TR-MMLU is constructed from a dataset comprising 6,200 multiple-choice questions across 62 sections, selected from a pool of 280,000 questions spanning 67 disciplines and over 800 topics within the Turkish education system.
Our findings reveal critical challenges, such as the impact of tokenization and fine-tuning strategies, and highlight areas for improvement in model design.
arXiv Detail & Related papers (2024-12-31T18:43:49Z)
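The TR-MMLU entry above selects 6,200 questions across 62 sections from a pool of 280,000. A construction like that is naturally expressed as a stratified sample; the sketch below illustrates the idea under that assumption and is not the authors' actual procedure.

```python
import random
from collections import defaultdict

def stratified_sample(pool, per_section=100, seed=0):
    """Draw an equal-sized random sample of questions from each section.

    `pool` is a list of dicts with at least a "section" key; 62 sections
    at 100 questions each would yield a 6,200-question benchmark.
    """
    rng = random.Random(seed)
    by_section = defaultdict(list)
    for q in pool:
        by_section[q["section"]].append(q)
    sample = []
    for _section, questions in sorted(by_section.items()):
        k = min(per_section, len(questions))  # tolerate small sections
        sample.extend(rng.sample(questions, k))
    return sample
```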
- CaLMQA: Exploring culturally specific long-form question answering across 23 languages [58.18984409715615]
CaLMQA is a collection of 1.5K culturally specific questions spanning 23 languages and 51 culturally agnostic questions translated from English into 22 other languages.
We collect naturally-occurring questions from community web forums and hire native speakers to write questions to cover under-studied languages such as Fijian and Kirundi.
Our dataset contains diverse, complex questions that reflect cultural topics (e.g. traditions, laws, news) and the language usage of native speakers.
arXiv Detail & Related papers (2024-06-25T17:45:26Z)
- Do LLMs Recognize me, When I is not me: Assessment of LLMs Understanding of Turkish Indexical Pronouns in Indexical Shift Contexts [0.0]
This study focuses on the Indexical Shift problem in Turkish.
The Indexical Shift problem involves resolving pronouns in indexical shift contexts, a grammatical challenge not present in high-resource languages like English.
We present the first study examining indexical shift in any language, releasing a Turkish dataset specifically designed for this purpose.
arXiv Detail & Related papers (2024-06-08T20:30:53Z)
- Automating Turkish Educational Quiz Generation Using Large Language Models [10.876144855651608]
We present a specialized dataset, named Turkish-Quiz-Instruct, comprising an extensive collection of Turkish educational texts.
This research leverages the capabilities of Large Language Models (LLMs), including GPT-4-Turbo, GPT-3.5-Turbo, Llama-2-7b-chat-hf, and Llama-2-13b-chat-hf, to automatically generate quiz questions and answers.
arXiv Detail & Related papers (2024-06-05T15:54:50Z)
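The quiz-generation entry above pairs Turkish educational texts with LLM-written questions. As a rough illustration of that pipeline (not the paper's prompts or code), the sketch below asks a chat model for one multiple-choice question about a passage; the prompt wording and model choice are assumptions.

```python
# Hypothetical sketch of LLM-based quiz generation in the spirit of
# Turkish-Quiz-Instruct; prompt text and parameters are illustrative.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def generate_quiz_question(passage: str, model: str = "gpt-4-turbo") -> str:
    """Ask a chat model to write one four-option multiple-choice question
    (in Turkish) about the given educational passage."""
    prompt = (
        # "Write a four-option multiple-choice question from the text
        # below and indicate the correct answer."
        "Aşağıdaki metinden dört seçenekli bir çoktan seçmeli soru yaz "
        "ve doğru cevabı belirt.\n\n"
        f"Metin: {passage}"
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```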
- MLaKE: Multilingual Knowledge Editing Benchmark for Large Language Models [65.10456412127405]
MLaKE is a benchmark for evaluating the adaptability of knowledge editing methods across five languages.
MLaKE aggregates fact chains from Wikipedia across languages and generates questions in both free-form and multiple-choice formats.
We evaluate the multilingual knowledge editing generalization capabilities of existing methods on MLaKE.
arXiv Detail & Related papers (2024-04-07T15:23:28Z)
- OMGEval: An Open Multilingual Generative Evaluation Benchmark for Large Language Models [59.54423478596468]
We introduce OMGEval, the first Open-source Multilingual Generative test set that can assess the capability of LLMs in different languages.
For each language, OMGEval provides 804 open-ended questions, covering a wide range of important capabilities of LLMs.
Specifically, the current version of OMGEval includes 5 languages (i.e., Zh, Ru, Fr, Es, Ar).
arXiv Detail & Related papers (2024-02-21T04:42:41Z)
- Benchmarking Procedural Language Understanding for Low-Resource Languages: A Case Study on Turkish [2.396465363376008]
We conduct a case study on Turkish procedural texts.
We first expand the number of tutorials in Turkish wikiHow from 2,000 to 52,000 using automated translation tools.
We generate several downstream tasks on the corpus, such as linking actions, goal inference, and summarization.
arXiv Detail & Related papers (2023-09-13T03:42:28Z)
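The procedural-language entry above grows the Turkish corpus by machine-translating English wikiHow tutorials. The sketch below outlines that expansion step; `machine_translate` is a purely hypothetical stand-in for whatever MT system or API is actually used.

```python
# Hypothetical sketch of corpus expansion via machine translation.
# `machine_translate` is a placeholder, not a real library call.

def machine_translate(text: str, src: str = "en", tgt: str = "tr") -> str:
    """Placeholder: plug a concrete MT model or API in here."""
    raise NotImplementedError("swap in a real translation backend")

def expand_corpus(native_tutorials: list[str],
                  english_tutorials: list[str]) -> list[str]:
    """Combine existing Turkish tutorials with translated English ones,
    e.g. growing ~2,000 native texts toward a ~52,000-text corpus."""
    translated = [machine_translate(t) for t in english_tutorials]
    return native_tutorials + translated
```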
- CMMLU: Measuring massive multitask language understanding in Chinese [133.70911295934746]
This paper introduces a comprehensive Chinese benchmark that covers various subjects, including natural science, social sciences, engineering, and humanities.
CMMLU fills the gap in evaluating the knowledge and reasoning capabilities of large language models within the Chinese context.
arXiv Detail & Related papers (2023-06-15T15:49:51Z)
- M3Exam: A Multilingual, Multimodal, Multilevel Benchmark for Examining Large Language Models [76.88692952308084]
M3Exam is a benchmark for evaluating large language models (LLMs) in a multilingual, multimodal, and multilevel context.
M3Exam contains 12,317 questions in 9 diverse languages with three educational levels.
We assess the performance of top-performing LLMs on M3Exam and find that current models, including GPT-4, still struggle with multilingual text.
arXiv Detail & Related papers (2023-06-08T13:21:29Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences arising from its use.