Cetvel: A Unified Benchmark for Evaluating Language Understanding, Generation and Cultural Capacity of LLMs for Turkish
- URL: http://arxiv.org/abs/2508.16431v1
- Date: Fri, 22 Aug 2025 14:42:50 GMT
- Title: Cetvel: A Unified Benchmark for Evaluating Language Understanding, Generation and Cultural Capacity of LLMs for Turkish
- Authors: Yakup Abrek Er, Ilker Kesen, Gözde Gül Şahin, Aykut Erdem
- Abstract summary: Cetvel is a benchmark designed to evaluate large language models (LLMs) in Turkish. It combines a broad range of discriminative and generative tasks, ensuring content that reflects the linguistic and cultural richness of the Turkish language.
- Score: 9.111556632499472
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: We introduce Cetvel, a comprehensive benchmark designed to evaluate large language models (LLMs) in Turkish. Existing Turkish benchmarks often lack task diversity, culturally relevant content, or both. Cetvel addresses these gaps by combining a broad range of discriminative and generative tasks with content that reflects the linguistic and cultural richness of the Turkish language. Cetvel covers 23 tasks grouped into seven categories, including grammatical error correction, machine translation, and question answering rooted in Turkish history and idiomatic language. We evaluate 33 open-weight LLMs (up to 70B parameters) spanning different model families and instruction paradigms. Our experiments reveal that Turkish-centric instruction-tuned models generally underperform relative to multilingual or general-purpose models (e.g., Llama 3 and Mistral), despite being tailored for the language. Moreover, we show that tasks such as grammatical error correction and extractive question answering are particularly discriminative in differentiating model capabilities. Cetvel offers a comprehensive and culturally grounded evaluation suite for advancing the development and assessment of LLMs in Turkish.
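The "discriminative" tasks the abstract mentions are conventionally scored by comparing the log-likelihood a model assigns to each candidate answer. The sketch below illustrates that generic protocol only; it is not Cetvel's official evaluation code, and the model identifier, the helper function, and the example question are placeholders chosen for illustration.

```python
# Minimal sketch of log-likelihood multiple-choice scoring, a standard way to
# evaluate "discriminative" benchmark tasks with open-weight causal LMs.
# NOT Cetvel's official harness; model name and example item are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Meta-Llama-3-8B"  # placeholder; any causal LM works

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

def choice_logprob(prompt: str, choice: str) -> float:
    """Sum of log-probabilities the model assigns to `choice` given `prompt`.
    (Boundary tokenization effects at the prompt/choice seam are ignored.)"""
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(prompt + choice, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    # Row t of the logits predicts token t+1, so drop the last row and align.
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    continuation = full_ids[0, prompt_len:]  # tokens belonging to the choice
    token_logps = log_probs[prompt_len - 1:].gather(1, continuation.unsqueeze(1))
    return token_logps.sum().item()

# Hypothetical Turkish multiple-choice item, for illustration only.
prompt = "Soru: Türkiye'nin başkenti neresidir?\nCevap: "
choices = ["Ankara", "İstanbul", "İzmir"]
scores = [choice_logprob(prompt, c) for c in choices]
print(choices[scores.index(max(scores))])  # predict the highest-likelihood choice
```

Generative tasks such as grammatical error correction and machine translation are instead scored by decoding a completion and comparing it against references (e.g., with exact-match or edit-based metrics).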
Related papers
- TurkBench: A Benchmark for Evaluating Turkish Large Language Models [0.0]
TurkBench is a benchmark designed to assess the capabilities of generative large language models in the Turkish language. It comprises 8,151 data samples across 21 distinct subtasks. The diverse range of tasks and the culturally relevant data provide researchers and developers with a valuable tool for evaluating their models.
arXiv Detail & Related papers (2026-01-11T18:28:23Z) - Decoding Memes: Benchmarking Narrative Role Classification across Multilingual and Multimodal Models [26.91963265869296]
This work investigates the challenging task of identifying narrative roles in Internet memes. It builds on an annotated dataset originally skewed toward the 'Other' class. Comprehensive lexical and structural analyses highlight the nuanced, culture-specific, and context-rich language used in real memes.
arXiv Detail & Related papers (2025-06-29T07:12:11Z) - TurBLiMP: A Turkish Benchmark of Linguistic Minimal Pairs [10.156237643034123]
TurBLiMP is the first Turkish benchmark of linguistic minimal pairs. Covering 16 linguistic phenomena with 1,000 minimal pairs each, TurBLiMP fills an important gap in linguistic evaluation resources for Turkish.
arXiv Detail & Related papers (2025-06-16T13:45:30Z) - Disentangling Language and Culture for Evaluating Multilingual Large Language Models [48.06219053598005]
This paper introduces a Dual Evaluation Framework to comprehensively assess the multilingual capabilities of LLMs. By decomposing the evaluation along the dimensions of linguistic medium and cultural context, this framework enables a nuanced analysis of LLMs' ability to process questions cross-lingually.
arXiv Detail & Related papers (2025-05-30T14:25:45Z) - Language Mixing in Reasoning Language Models: Patterns, Impact, and Internal Causes [49.770097731093216]
Reasoning language models (RLMs) excel at complex tasks by leveraging a chain-of-thought process to generate structured intermediate steps. Language mixing, i.e., reasoning steps containing tokens from languages other than the prompt, has been observed in their outputs and shown to affect performance. We present the first systematic study of language mixing in RLMs, examining its patterns, impact, and internal causes across 15 languages.
arXiv Detail & Related papers (2025-05-20T18:26:53Z) - MMLU-ProX: A Multilingual Benchmark for Advanced Large Language Model Evaluation [86.7047714187813]
MMLU-ProX is a benchmark covering 29 languages, built on an English benchmark. Each language version consists of 11,829 identical questions, enabling direct cross-linguistic comparisons. For efficient evaluation, we also provide a lite version containing 658 questions per language.
arXiv Detail & Related papers (2025-03-13T15:59:20Z) - TUMLU: A Unified and Native Language Understanding Benchmark for Turkic Languages [2.115206401188031]
We propose two benchmarks for Turkic-language MMLU: TUMLU and TUMLU-mini. TUMLU consists of middle- and high-school level questions spanning 11 academic subjects in Azerbaijani, Crimean Tatar, Karakalpak, Kazakh, Tatar, Turkish, Uyghur, and Uzbek. TUMLU-mini is a more concise, balanced, and manually verified subset of the dataset.
arXiv Detail & Related papers (2025-02-16T07:07:38Z) - Setting Standards in Turkish NLP: TR-MMLU for Large Language Model Evaluation [0.29687381456163997]
The Turkish MMLU (TR-MMLU) benchmark is designed to assess the linguistic and conceptual capabilities of large language models (LLMs) in Turkish. TR-MMLU is constructed from a dataset comprising 6,200 multiple-choice questions across 62 sections, selected from a pool of 280,000 questions spanning 67 disciplines and over 800 topics within the Turkish education system. Our findings reveal critical challenges, such as the impact of tokenization and fine-tuning strategies, and highlight areas for improvement in model design.
arXiv Detail & Related papers (2024-12-31T18:43:49Z) - Global MMLU: Understanding and Addressing Cultural and Linguistic Biases in Multilingual Evaluation [71.59208664920452]
Cultural biases in multilingual datasets pose significant challenges for their effectiveness as global benchmarks. We show that progress on MMLU depends heavily on learning Western-centric concepts, with 28% of all questions requiring culturally sensitive knowledge. We release Global MMLU, an improved MMLU with evaluation coverage across 42 languages.
arXiv Detail & Related papers (2024-12-04T13:27:09Z) - TurkishMMLU: Measuring Massive Multitask Language Understanding in Turkish [54.51310112013655]
We introduce the first multitask, multiple-choice Turkish QA benchmark, TurkishMMLU.
TurkishMMLU includes over 10,000 questions, covering 9 different subjects from Turkish high-school education curricula.
We evaluate over 20 LLMs, including multilingual open-source (e.g., Gemma, Llama, MT5), closed-source (GPT-4o, Claude, Gemini), and Turkish-adapted (e.g., Trendyol) models.
arXiv Detail & Related papers (2024-07-17T08:28:55Z) - Simple Linguistic Inferences of Large Language Models (LLMs): Blind Spots and Blinds [59.71218039095155]
We evaluate language understanding capacities on simple inference tasks that most humans find trivial.
We target (i) grammatically-specified entailments, (ii) premises with evidential adverbs of uncertainty, and (iii) monotonicity entailments.
The models exhibit moderate to low performance on these evaluation sets.
arXiv Detail & Related papers (2023-05-24T06:41:09Z) - Analyzing the Limits of Self-Supervision in Handling Bias in Language [52.26068057260399]
We evaluate how well language models capture the semantics of four tasks for bias: diagnosis, identification, extraction, and rephrasing.
Our analyses indicate that language models are capable of performing these tasks to widely varying degrees across different bias dimensions, such as gender and political affiliation.
arXiv Detail & Related papers (2021-12-16T05:36:08Z)