TurBLiMP: A Turkish Benchmark of Linguistic Minimal Pairs
- URL: http://arxiv.org/abs/2506.13487v1
- Date: Mon, 16 Jun 2025 13:45:30 GMT
- Title: TurBLiMP: A Turkish Benchmark of Linguistic Minimal Pairs
- Authors: Ezgi Başar, Francesca Padovani, Jaap Jumelet, Arianna Bisazza
- Abstract summary: TurBLiMP is the first Turkish benchmark of linguistic minimal pairs. Covering 16 linguistic phenomena with 1000 minimal pairs each, TurBLiMP fills an important gap in linguistic evaluation resources for Turkish.
- Score: 4.476339707463773
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce TurBLiMP, the first Turkish benchmark of linguistic minimal pairs, designed to evaluate the linguistic abilities of monolingual and multilingual language models (LMs). Covering 16 linguistic phenomena with 1000 minimal pairs each, TurBLiMP fills an important gap in linguistic evaluation resources for Turkish. In designing the benchmark, we give extra attention to two properties of Turkish that remain understudied in current syntactic evaluations of LMs, namely word order flexibility and subordination through morphological processes. Our experiments on a wide range of LMs and a newly collected set of human acceptability judgments reveal that even cutting-edge Large LMs still struggle with grammatical phenomena that are not challenging for humans, and may also exhibit different sensitivities to word order and morphological complexity compared to humans.
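Evaluation with minimal pairs typically reduces to a forced-choice probability comparison: a model is credited for a pair if it assigns a higher log-probability to the grammatical sentence than to its ungrammatical counterpart, and benchmark accuracy is the fraction of pairs scored this way. The sketch below illustrates that protocol under the assumption of a Hugging Face causal LM; the model name and the Turkish example pair are illustrative placeholders, not items from TurBLiMP.

```python
# Minimal sketch of minimal-pair evaluation with a causal LM.
# The model name and the example pair are illustrative, not taken from TurBLiMP.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # placeholder; a Turkish or multilingual LM would be used in practice
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

def sentence_logprob(sentence: str) -> float:
    """Summed log-probability of a sentence under the causal LM."""
    ids = tokenizer(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        # Passing labels=ids returns the mean cross-entropy over predicted tokens.
        mean_nll = model(ids, labels=ids).loss.item()
    return -mean_nll * (ids.shape[1] - 1)  # undo the mean to get a summed log-prob

# Hypothetical minimal pair: grammatical vs. person-agreement violation.
grammatical = "Ali kitabı okudu."    # "Ali read the book."
ungrammatical = "Ali kitabı okudun."  # same sentence with mismatched agreement

correct = sentence_logprob(grammatical) > sentence_logprob(ungrammatical)
print("model prefers the grammatical sentence:", correct)
```

Per-phenomenon accuracy is then the proportion of its 1000 pairs for which this comparison favors the grammatical member of the pair.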
Related papers
- IMPACT: Inflectional Morphology Probes Across Complex Typologies [0.0]
IMPACT is a synthetically generated evaluation framework focused on inflectional morphology. It is designed to evaluate performance across five morphologically rich languages: Arabic, Russian, Finnish, Turkish, and Hebrew. We assess eight multilingual LLMs that, despite strong English performance, struggle with other languages and uncommon morphological patterns.
arXiv Detail & Related papers (2025-06-30T14:58:23Z)
- TUMLU: A Unified and Native Language Understanding Benchmark for Turkic Languages [2.115206401188031]
We propose two benchmarks for Turkic language MMLU: TUMLU and TUMLU-mini. TUMLU consists of middle- and high-school level questions spanning 11 academic subjects in Azerbaijani, Crimean Tatar, Karakalpak, Kazakh, Tatar, Turkish, Uyghur, and Uzbek. TUMLU-mini is a more concise, balanced, and manually verified subset of this dataset.
arXiv Detail & Related papers (2025-02-16T07:07:38Z)
- Controlled Evaluation of Syntactic Knowledge in Multilingual Language Models [16.414150004715367]
This study develops targeted syntactic evaluation tests for three low-resource languages. We use them to evaluate five families of open-access multilingual Transformer LMs. We find that some syntactic tasks prove relatively easy for LMs while others are challenging.
arXiv Detail & Related papers (2024-11-12T01:26:41Z)
- TurkishMMLU: Measuring Massive Multitask Language Understanding in Turkish [54.51310112013655]
We introduce the first multitask, multiple-choice Turkish QA benchmark, TurkishMMLU.
TurkishMMLU includes over 10,000 questions, covering 9 different subjects from Turkish high-school education curricula.
We evaluate over 20 LLMs, including multilingual open-source (e.g., Gemma, Llama, MT5), closed-source (GPT-4o, Claude, Gemini), and Turkish-adapted (e.g., Trendyol) models.
arXiv Detail & Related papers (2024-07-17T08:28:55Z)
- Understanding and Mitigating Language Confusion in LLMs [76.96033035093204]
We evaluate 15 typologically diverse languages with existing and newly-created English and multilingual prompts. We find that Llama Instruct and Mistral models exhibit high degrees of language confusion. We find that language confusion can be partially mitigated via few-shot prompting, multilingual SFT and preference tuning.
arXiv Detail & Related papers (2024-06-28T17:03:51Z)
- RuBLiMP: Russian Benchmark of Linguistic Minimal Pairs [2.9521383230206966]
This paper introduces the Russian Benchmark of Linguistic Minimal Pairs (RuBLiMP).
RuBLiMP includes 45k pairs of sentences that differ in grammaticality and isolate a morphological, syntactic, or semantic phenomenon.
We find that the widely used language models for Russian are sensitive to morphological and agreement-oriented contrasts but fall behind humans on phenomena requiring understanding of structural relations, negation, transitivity, and tense.
arXiv Detail & Related papers (2024-06-27T14:55:19Z)
- Holmes: A Benchmark to Assess the Linguistic Competence of Language Models [59.627729608055006]
We introduce Holmes, a new benchmark designed to assess the linguistic competence of language models (LMs).
We use computation-based probing to examine LMs' internal representations regarding distinct linguistic phenomena.
As a result, we meet recent calls to disentangle LMs' linguistic competence from other cognitive abilities.
arXiv Detail & Related papers (2024-04-29T17:58:36Z)
- Revisiting non-English Text Simplification: A Unified Multilingual Benchmark [14.891068432456262]
This paper introduces the MultiSim benchmark, a collection of 27 resources in 12 distinct languages containing over 1.7 million complex-simple sentence pairs.
Our experiments using MultiSim with pre-trained multilingual language models reveal exciting performance improvements from multilingual training in non-English settings.
arXiv Detail & Related papers (2023-05-25T03:03:29Z)
- Discovering Representation Sprachbund For Multilingual Pre-Training [139.05668687865688]
We generate language representations from multilingual pre-trained models and conduct linguistic analysis.
We cluster all the target languages into multiple groups and name each group a representation sprachbund.
Experiments are conducted on cross-lingual benchmarks and significant improvements are achieved compared to strong baselines.
arXiv Detail & Related papers (2021-09-01T09:32:06Z)
- AM2iCo: Evaluating Word Meaning in Context across Low-Resource Languages with Adversarial Examples [51.048234591165155]
We present AM2iCo, Adversarial and Multilingual Meaning in Context.
It aims to faithfully assess the ability of state-of-the-art (SotA) representation models to understand the identity of word meaning in cross-lingual contexts.
Results reveal that current SotA pretrained encoders substantially lag behind human performance.
arXiv Detail & Related papers (2021-04-17T20:23:45Z)
- Knowledge Distillation for Multilingual Unsupervised Neural Machine Translation [61.88012735215636]
Unsupervised neural machine translation (UNMT) has recently achieved remarkable results for several language pairs.
However, UNMT can only translate between a single language pair and cannot produce translation results for multiple language pairs at the same time.
In this paper, we empirically introduce a simple method to translate between thirteen languages using a single encoder and a single decoder.
arXiv Detail & Related papers (2020-04-21T17:26:16Z)