TurkBench: A Benchmark for Evaluating Turkish Large Language Models
- URL: http://arxiv.org/abs/2601.07020v1
- Date: Sun, 11 Jan 2026 18:28:23 GMT
- Title: TurkBench: A Benchmark for Evaluating Turkish Large Language Models
- Authors: Çağrı Toraman, Ahmet Kaan Sever, Ayse Aysu Cengiz, Elif Ecem Arslan, Görkem Sevinç, Mete Mert Birdal, Yusuf Faruk Güldemir, Ali Buğra Kanburoğlu, Sezen Felekoğlu, Osman Gürlek, Sarp Kantar, Birsen Şahin Kütük, Büşra Tufan, Elif Genç, Serkan Coşkun, Gupse Ekin Demir, Muhammed Emin Arayıcı, Olgun Dursun, Onur Gungor, Susan Üsküdarlı, Abdullah Topraksoy, Esra Darıcı
- Abstract summary: TurkBench is a benchmark designed to assess the capabilities of generative large language models in the Turkish language. It comprises 8,151 data samples across 21 distinct subtasks. The diverse range of tasks and the culturally relevant data provide researchers and developers with a valuable tool for evaluating their models.
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: With the recent surge in the development of large language models, the need for comprehensive and language-specific evaluation benchmarks has become critical. While significant progress has been made in evaluating English language models, benchmarks for other languages, particularly those with unique linguistic characteristics such as Turkish, remain less developed. Our study introduces TurkBench, a comprehensive benchmark designed to assess the capabilities of generative large language models in the Turkish language. TurkBench comprises 8,151 data samples across 21 distinct subtasks, organized under six main evaluation categories: Knowledge, Language Understanding, Reasoning, Content Moderation, Turkish Grammar and Vocabulary, and Instruction Following. The diverse range of tasks and the culturally relevant data provide researchers and developers with a valuable tool for evaluating their models and identifying areas for improvement. We further publish our benchmark for online submissions at https://huggingface.co/turkbench
Related papers
- Cetvel: A Unified Benchmark for Evaluating Language Understanding, Generation and Cultural Capacity of LLMs for Turkish [9.111556632499472]
Cetvel is a benchmark designed to evaluate large language models (LLMs) in Turkish. It combines a broad range of both discriminative and generative tasks, ensuring content that reflects the linguistic and cultural richness of the Turkish language.
arXiv Detail & Related papers (2025-08-22T14:42:50Z) - TUMLU: A Unified and Native Language Understanding Benchmark for Turkic Languages [2.115206401188031]
We propose two benchmarks for Turkic-language MMLU: TUMLU and TUMLU-mini. TUMLU consists of middle- and high-school level questions spanning 11 academic subjects in Azerbaijani, Crimean Tatar, Karakalpak, Kazakh, Tatar, Turkish, Uyghur, and Uzbek. TUMLU-mini is a more concise, balanced, and manually verified subset of the dataset.
arXiv Detail & Related papers (2025-02-16T07:07:38Z) - Optimizing Large Language Models for Turkish: New Methodologies in Corpus Selection and Training [0.0]
We adapt datasets generated by large language models, as well as English datasets translated into Turkish. This approach led to substantial enhancements in model accuracy for both few-shot and zero-shot learning scenarios.
arXiv Detail & Related papers (2024-12-03T19:17:18Z) - Navigating Text-to-Image Generative Bias across Indic Languages [53.92640848303192]
This research investigates biases in text-to-image (TTI) models for the Indic languages widely spoken across India.
It evaluates and compares the generative performance and cultural relevance of leading TTI models in these languages against their performance in English.
arXiv Detail & Related papers (2024-08-01T04:56:13Z) - TurkishMMLU: Measuring Massive Multitask Language Understanding in Turkish [54.51310112013655]
We introduce the first multitask, multiple-choice Turkish QA benchmark, TurkishMMLU.
TurkishMMLU includes over 10,000 questions, covering 9 different subjects from Turkish high-school education curricula.
We evaluate over 20 LLMs, including multilingual open-source (e.g., Gemma, Llama, MT5), closed-source (e.g., GPT-4o, Claude, Gemini), and Turkish-adapted (e.g., Trendyol) models.
arXiv Detail & Related papers (2024-07-17T08:28:55Z) - Introducing cosmosGPT: Monolingual Training for Turkish Language Models [0.0]
This study introduces the cosmosGPT models that we created with this alternative method.
We then introduce new fine-tuning datasets that enable base language models to fulfill user requests, and new evaluation datasets for measuring the capabilities of Turkish language models.
The results show that the language models we built with the monolingual corpus have promising performance despite being about 10 times smaller than the others.
arXiv Detail & Related papers (2024-04-26T11:34:11Z) - Performance Comparison of Turkish Language Models [0.0]
A comparison is made among seven selected language models based on their contextual learning and question-answering abilities.
The results show that for question-answering, continuing pretraining before fine-tuning with instructional datasets is more successful in adapting multilingual models to Turkish.
arXiv Detail & Related papers (2024-04-25T20:10:14Z) - Disco-Bench: A Discourse-Aware Evaluation Benchmark for Language Modelling [70.23876429382969]
We propose a benchmark that can evaluate intra-sentence discourse properties across a diverse set of NLP tasks.
Disco-Bench consists of 9 document-level testsets in the literature domain, which contain rich discourse phenomena.
For linguistic analysis, we also design a diagnostic test suite that can examine whether the target models learn discourse knowledge.
arXiv Detail & Related papers (2023-07-16T15:18:25Z) - SEAHORSE: A Multilingual, Multifaceted Dataset for Summarization Evaluation [52.186343500576214]
We introduce SEAHORSE, a dataset for multilingual, multifaceted summarization evaluation.
SEAHORSE consists of 96K summaries with human ratings along 6 dimensions of text quality.
We show that metrics trained with SEAHORSE achieve strong performance on the out-of-domain meta-evaluation benchmarks TRUE and mFACE.
arXiv Detail & Related papers (2023-05-22T16:25:07Z) - CUGE: A Chinese Language Understanding and Generation Evaluation Benchmark [144.05723617401674]
General-purpose language intelligence evaluation has been a longstanding goal for natural language processing.
We argue that for general-purpose language intelligence evaluation, the benchmark itself needs to be comprehensive and systematic.
We propose CUGE, a comprehensive and systematic Chinese Language Understanding and Generation Evaluation benchmark.
arXiv Detail & Related papers (2021-12-27T11:08:58Z) - XTREME: A Massively Multilingual Multi-task Benchmark for Evaluating Cross-lingual Generalization [128.37244072182506]
XTREME (Cross-lingual TRansfer Evaluation of Multilinguals) is a benchmark for evaluating the cross-lingual generalization capabilities of multilingual representations across 40 languages and 9 tasks.
We demonstrate that while models tested on English reach human performance on many tasks, there is still a sizable gap in the performance of cross-lingually transferred models.
arXiv Detail & Related papers (2020-03-24T19:09:37Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.