Mukayese: Turkish NLP Strikes Back
- URL: http://arxiv.org/abs/2203.01215v1
- Date: Wed, 2 Mar 2022 16:18:44 GMT
- Title: Mukayese: Turkish NLP Strikes Back
- Authors: Ali Safaya, Emirhan Kurtuluş, Arda Göktoğan, Deniz Yuret
- Abstract summary: We demonstrate that languages such as Turkish lag behind the state of the art in NLP applications.
We present Mukayese, a set of NLP benchmarks for the Turkish language.
We present four new benchmarking datasets in Turkish for language modeling, sentence segmentation, and spell checking.
- Score: 0.19116784879310023
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Having sufficient resources for language X lifts it from the under-resourced
languages class, but not necessarily from the under-researched class. In this
paper, we address the problem of the absence of organized benchmarks in the
Turkish language. We demonstrate that languages such as Turkish lag behind the
state of the art in NLP applications. As a solution, we present Mukayese, a
set of NLP benchmarks for the Turkish language that contains several NLP tasks.
We work on one or more datasets for each benchmark and present two or more
baselines. Moreover, we present four new benchmarking datasets in Turkish for
language modeling, sentence segmentation, and spell checking. All datasets and
baselines are available under: https://github.com/alisafaya/mukayese
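The repository above packages the datasets together with baseline implementations. As a rough illustration of what evaluating the language-modeling benchmark involves, the sketch below computes perplexity for a causal language model over a Turkish test file; the model name and file path are placeholders and do not refer to Mukayese's actual baselines.

```python
# Minimal, hypothetical sketch: perplexity of a causal LM on a Turkish test split.
# MODEL_NAME and TEST_FILE are placeholders, not the Mukayese baselines themselves.
import math

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"            # placeholder; substitute a Turkish causal LM checkpoint
TEST_FILE = "trwiki_test.txt"  # placeholder path to a Turkish test split

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

with open(TEST_FILE, encoding="utf-8") as f:
    ids = tokenizer(f.read(), return_tensors="pt").input_ids

window = model.config.n_positions  # model context length (e.g. 1024 for GPT-2)
nll_sum, n_tokens = 0.0, 0
for start in range(0, ids.size(1), window):
    chunk = ids[:, start:start + window]
    if chunk.size(1) < 2:  # need at least two tokens to form a prediction target
        continue
    with torch.no_grad():
        loss = model(chunk, labels=chunk).loss  # mean next-token cross-entropy
    nll_sum += loss.item() * (chunk.size(1) - 1)
    n_tokens += chunk.size(1) - 1

print(f"perplexity: {math.exp(nll_sum / n_tokens):.2f}")
```

Analogous scripts would score the sentence segmentation and spell checking benchmarks with their own metrics; for reproducing the paper's reported numbers, the repository's own evaluation code should be preferred.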
Related papers
- TurkishMMLU: Measuring Massive Multitask Language Understanding in Turkish [54.51310112013655]
We introduce the first multitask, multiple-choice Turkish QA benchmark, TurkishMMLU.
TurkishMMLU includes over 10,000 questions, covering 9 different subjects from Turkish high-school education curricula.
We evaluate over 20 LLMs, including multilingual open-source (e.g., Gemma, Llama, mT5), closed-source (GPT-4o, Claude, Gemini), and Turkish-adapted (e.g., Trendyol) models.
arXiv Detail & Related papers (2024-07-17T08:28:55Z)
- Do LLMs Recognize me, When I is not me: Assessment of LLMs Understanding of Turkish Indexical Pronouns in Indexical Shift Contexts [0.0]
This study focuses on the Indexical Shift problem in Turkish.
The Indexical Shift problem involves resolving pronouns in indexical shift contexts, a grammatical challenge not present in high-resource languages like English.
We present the first study examining indexical shift in any language, releasing a Turkish dataset specifically designed for this purpose.
arXiv Detail & Related papers (2024-06-08T20:30:53Z)
- A Novel Cartography-Based Curriculum Learning Method Applied on RoNLI: The First Romanian Natural Language Inference Corpus [71.77214818319054]
Natural language inference is a proxy for natural language understanding.
There is no publicly available NLI corpus for the Romanian language.
We introduce the first Romanian NLI corpus (RoNLI) comprising 58K training sentence pairs.
arXiv Detail & Related papers (2024-05-20T08:41:15Z)
- Can a Multichoice Dataset be Repurposed for Extractive Question Answering? [52.28197971066953]
We repurposed the Belebele dataset (Bandarkar et al., 2023), which was designed for multiple-choice question answering (MCQA).
We present annotation guidelines and a parallel EQA dataset for English and Modern Standard Arabic (MSA).
Our aim is to enable others to adapt our approach for the 120+ other language variants in Belebele, many of which are deemed under-resourced.
arXiv Detail & Related papers (2024-04-26T11:46:05Z)
- DIALECTBENCH: A NLP Benchmark for Dialects, Varieties, and Closely-Related Languages [49.38663048447942]
We propose DIALECTBENCH, the first-ever large-scale benchmark for NLP on varieties.
This allows for a comprehensive evaluation of NLP system performance on different language varieties.
We provide substantial evidence of performance disparities between standard and non-standard language varieties.
arXiv Detail & Related papers (2024-03-16T20:18:36Z)
- Natural Language Processing for Dialects of a Language: A Survey [56.93337350526933]
State-of-the-art natural language processing (NLP) models are trained on massive training corpora, and report a superlative performance on evaluation datasets.
This survey delves into an important attribute of these datasets: the dialect of a language.
Motivated by the performance degradation of NLP models for dialectal datasets and its implications for the equity of language technologies, we survey past research in NLP for dialects in terms of datasets and approaches.
arXiv Detail & Related papers (2024-01-11T03:04:38Z)
- NusaWrites: Constructing High-Quality Corpora for Underrepresented and Extremely Low-Resource Languages [54.808217147579036]
We conduct a case study on Indonesian local languages.
We compare the effectiveness of online scraping, human translation, and paragraph writing by native speakers in constructing datasets.
Our findings demonstrate that datasets generated through paragraph writing by native speakers exhibit superior quality in terms of lexical diversity and cultural content.
arXiv Detail & Related papers (2023-09-19T14:42:33Z)
- Benchmarking Procedural Language Understanding for Low-Resource Languages: A Case Study on Turkish [2.396465363376008]
We conduct a case study on Turkish procedural texts.
We first expand the number of tutorials in Turkish wikiHow from 2,000 to 52,000 using automated translation tools.
We generate several downstream tasks on the corpus, such as linking actions, goal inference, and summarization.
arXiv Detail & Related papers (2023-09-13T03:42:28Z)
- This is the way: designing and compiling LEPISZCZE, a comprehensive NLP benchmark for Polish [5.8090623549313944]
We introduce LEPISZCZE, a new, comprehensive benchmark for Polish NLP.
We use five datasets from the Polish benchmark and add eight novel datasets.
We provide insights and experiences learned while creating the benchmark for Polish as the blueprint to design similar benchmarks for other low-resourced languages.
arXiv Detail & Related papers (2022-11-23T16:51:09Z)
- Data and Representation for Turkish Natural Language Inference [6.135815931215188]
We offer a positive response for natural language inference (NLI) in Turkish.
We translated two large English NLI datasets into Turkish and had a team of experts validate their translation quality and fidelity to the original labels.
We find that in-language embeddings are essential and that morphological parsing can be avoided where the training set is large.
arXiv Detail & Related papers (2020-04-30T17:12:52Z)
- CLUE: A Chinese Language Understanding Evaluation Benchmark [41.86950255312653]
We introduce the first large-scale Chinese Language Understanding Evaluation (CLUE) benchmark.
CLUE brings together 9 tasks spanning several well-established single-sentence/sentence-pair classification tasks, as well as machine reading comprehension.
We report scores using an exhaustive set of current state-of-the-art pre-trained Chinese models.
arXiv Detail & Related papers (2020-04-13T15:02:29Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this content (including all information) and is not responsible for any consequences.