TurkicNLP: An NLP Toolkit for Turkic Languages
- URL: http://arxiv.org/abs/2602.19174v2
- Date: Sun, 01 Mar 2026 15:46:23 GMT
- Title: TurkicNLP: An NLP Toolkit for Turkic Languages
- Authors: Sherzod Hakimov,
- Abstract summary: TurkicNLP is a Python library providing a single, consistent NLP pipeline for Turkic languages.<n>It covers tokenization, morphological analysis, part-of-speech tagging, dependency parsing, named entity recognition, bidirectional script transliterations, and machine translation.
- Score: 6.156016907917316
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Natural language processing for the Turkic language family, spoken by over 200 million people across Eurasia, remains fragmented, with most languages lacking unified tooling and resources. We present TurkicNLP, an open-source Python library providing a single, consistent NLP pipeline for Turkic languages across four script families: Latin, Cyrillic, Perso-Arabic, and Old Turkic Runic. The library covers tokenization, morphological analysis, part-of-speech tagging, dependency parsing, named entity recognition, bidirectional script transliteration, cross-lingual sentence embeddings, and machine translation through one language-agnostic API. A modular multi-backend architecture integrates rule-based finite-state transducers and neural models transparently, with automatic script detection and routing between script variants. Outputs follow the CoNLL-U standard for full interoperability and extension. Code and documentation are hosted at https://github.com/turkic-nlp/turkicnlp .
Related papers
- Breaking the Script Barrier in Multilingual Pre-Trained Language Models with Transliteration-Based Post-Training Alignment [50.27950279695363]
The transfer performance is often hindered when a low-resource target language is written in a different script than the high-resource source language.
Inspired by recent work that uses transliteration to address this problem, our paper proposes a transliteration-based post-pretraining alignment (PPA) method.
arXiv Detail & Related papers (2024-06-28T08:59:24Z) - CMULAB: An Open-Source Framework for Training and Deployment of Natural Language Processing Models [59.91221728187576]
This paper introduces the CMU Linguistic Linguistic Backend, an open-source framework that simplifies model deployment and continuous human-in-the-loop fine-tuning of NLP models.
CMULAB enables users to leverage the power of multilingual models to quickly adapt and extend existing tools for speech recognition, OCR, translation, and syntactic analysis to new languages.
arXiv Detail & Related papers (2024-04-03T02:21:46Z) - VNLP: Turkish NLP Package [0.0]
VNLP is a state-of-the-art Natural Language Processing (NLP) package for the Turkish language.
It contains a wide variety of tools, ranging from the simplest tasks, such as sentence splitting and text normalization, to the more advanced ones, such as text and token classification models.
VNLP has an open-source GitHub repository, ReadtheDocs documentation, PyPi package for convenient installation, Python and command-line API.
arXiv Detail & Related papers (2024-03-02T20:46:56Z) - Benchmarking Procedural Language Understanding for Low-Resource
Languages: A Case Study on Turkish [2.396465363376008]
We conduct a case study on Turkish procedural texts.
We first expand the number of tutorials in Turkish wikiHow from 2,000 to 52,000 using automated translation tools.
We generate several downstream tasks on the corpus, such as linking actions, goal inference, and summarization.
arXiv Detail & Related papers (2023-09-13T03:42:28Z) - Multilingual Text-to-Speech Synthesis for Turkic Languages Using
Transliteration [3.0122461286351796]
This work aims to build a multilingual text-to-speech (TTS) synthesis system for ten lower-resourced Turkic languages.
We specifically target the zero-shot learning scenario, where a TTS model trained using the data of one language is applied to synthesise speech for other, unseen languages.
An end-to-end TTS system based on the Tacotron 2 architecture was trained using only the available data of the Kazakh language.
arXiv Detail & Related papers (2023-05-25T05:57:54Z) - Beyond Arabic: Software for Perso-Arabic Script Manipulation [67.31374614549237]
We provide a set of finite-state transducer (FST) components and corresponding utilities for manipulating the writing systems of languages that use the Perso-Arabic script.
The library also provides simple FST-based romanization and transliteration.
arXiv Detail & Related papers (2023-01-26T20:37:03Z) - Mukayese: Turkish NLP Strikes Back [0.19116784879310023]
We demonstrate that languages such as Turkish are left behind the state-of-the-art in NLP applications.
We present Mukayese, a set of NLP benchmarks for the Turkish language.
We present four new benchmarking datasets in Turkish for language modeling, sentence segmentation, and spell checking.
arXiv Detail & Related papers (2022-03-02T16:18:44Z) - Continual Learning in Multilingual NMT via Language-Specific Embeddings [92.91823064720232]
It consists in replacing the shared vocabulary with a small language-specific vocabulary and fine-tuning the new embeddings on the new language's parallel data.
Because the parameters of the original model are not modified, its performance on the initial languages does not degrade.
arXiv Detail & Related papers (2021-10-20T10:38:57Z) - Revisiting Language Encoding in Learning Multilingual Representations [70.01772581545103]
We propose a new approach called Cross-lingual Language Projection (XLP) to replace language embedding.
XLP projects the word embeddings into language-specific semantic space, and then the projected embeddings will be fed into the Transformer model.
Experiments show that XLP can freely and significantly boost the model performance on extensive multilingual benchmark datasets.
arXiv Detail & Related papers (2021-02-16T18:47:10Z) - Phonological Features for 0-shot Multilingual Speech Synthesis [50.591267188664666]
We show that code-switching is possible for languages unseen during training, even within monolingual models.
We generate intelligible, code-switched speech in a new language at test time, including the approximation of sounds never seen in training.
arXiv Detail & Related papers (2020-08-06T18:25:18Z) - Stanza: A Python Natural Language Processing Toolkit for Many Human
Languages [44.8226642800919]
We introduce Stanza, an open-source Python natural language processing toolkit supporting 66 human languages.
Stanza features a language-agnostic fully neural pipeline for text analysis, including tokenization, multi-word token expansion, lemmatization, part-of-speech and morphological feature tagging.
We have trained Stanza on a total of 112 datasets, including the Universal Dependencies treebanks and other multilingual corpora.
arXiv Detail & Related papers (2020-03-16T09:05:53Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.