Recent Advancements and Challenges of Turkic Central Asian Language Processing
- URL: http://arxiv.org/abs/2407.05006v1
- Date: Sat, 6 Jul 2024 08:58:26 GMT
- Title: Recent Advancements and Challenges of Turkic Central Asian Language Processing
- Authors: Yana Veitsman,
- Abstract summary: This paper focuses on the NLP sphere of the Turkic counterparts of Central Asian languages, namely Kazakh, Uzbek, Kyrgyz, and Turkmen.
It gives a broad, high-level overview of the linguistic properties of the languages, the current coverage and performance of already developed technology, and availability of labeled and unlabeled data for each language.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Research in the NLP sphere of the Turkic counterparts of Central Asian languages, namely Kazakh, Uzbek, Kyrgyz, and Turkmen, comes with the typical challenges of low-resource languages, like data scarcity and a general lack of linguistic resources. However, in the recent years research has greatly advanced via collection of language-specific datasets and development of downstream task technologies. Aiming to summarize this research up until May 2024, this paper also seeks to identify potential areas of future work. To achieve this, the paper gives a broad, high-level overview of the linguistic properties of the languages, the current coverage and performance of already developed technology, application of transfer learning techniques from higher-resource languages, and availability of labeled and unlabeled data for each language. Providing a summary of the current state of affairs, we hope that further research will be facilitated with the considerations we provide in the current paper.
Related papers
- A Survey on Spoken Italian Datasets and Corpora [0.3222802562733787]
This survey provides a comprehensive analysis of 66 spoken Italian datasets.
The datasets are categorized by speech type, source and context, and demographic and linguistic features.
Challenges related to dataset scarcity, representativeness, and accessibility are discussed.
arXiv Detail & Related papers (2025-01-11T14:33:57Z) - From Statistical Methods to Pre-Trained Models; A Survey on Automatic Speech Recognition for Resource Scarce Urdu Language [41.272055304311905]
This paper focuses on the resource-constrained Urdu language, which is widely spoken across South Asian nations.
It outlines current research trends, technological advancements, and potential directions for future studies in Urdu ASR.
arXiv Detail & Related papers (2024-11-20T17:39:56Z) - LIMBA: An Open-Source Framework for the Preservation and Valorization of Low-Resource Languages using Generative Models [62.47865866398233]
This white paper proposes a framework to generate linguistic tools for low-resource languages.
By addressing the data scarcity that hinders intelligent applications for such languages, we contribute to promoting linguistic diversity.
arXiv Detail & Related papers (2024-11-20T16:59:41Z) - Monolingual and Multilingual Misinformation Detection for Low-Resource Languages: A Comprehensive Survey [2.5459710368096586]
This survey provides a comprehensive overview of the current research on low-resource language misinformation detection.
We review the existing datasets, methodologies, and tools used in these domains, identifying key challenges related to: data resources, model development, cultural and linguistic context, real-world applications, and research efforts.
Our findings underscore the need for robust, inclusive systems capable of addressing misinformation across diverse linguistic and cultural contexts.
arXiv Detail & Related papers (2024-10-24T03:02:03Z) - A Survey on Large Language Models with Multilingualism: Recent Advances and New Frontiers [48.314619377988436]
The rapid development of Large Language Models (LLMs) demonstrates remarkable multilingual capabilities in natural language processing.
Despite the breakthroughs of LLMs, the investigation into the multilingual scenario remains insufficient.
This survey aims to help the research community address multilingual problems and provide a comprehensive understanding of the core concepts, key techniques, and latest developments in multilingual natural language processing based on LLMs.
arXiv Detail & Related papers (2024-05-17T17:47:39Z) - The Ghanaian NLP Landscape: A First Look [9.17372840572907]
Ghanaian languages, in particular, face an alarming decline, with documented extinction and several at risk.
This study pioneers a comprehensive survey of Natural Language Processing (NLP) research focused on Ghanaian languages.
arXiv Detail & Related papers (2024-05-10T21:39:09Z) - Multilingual Large Language Model: A Survey of Resources, Taxonomy and Frontiers [81.47046536073682]
We present a review and provide a unified perspective to summarize the recent progress as well as emerging trends in multilingual large language models (MLLMs) literature.
We hope our work can provide the community with quick access and spur breakthrough research in MLLMs.
arXiv Detail & Related papers (2024-04-07T11:52:44Z) - Natural Language Processing for Dialects of a Language: A Survey [56.93337350526933]
State-of-the-art natural language processing (NLP) models are trained on massive training corpora, and report a superlative performance on evaluation datasets.
This survey delves into an important attribute of these datasets: the dialect of a language.
Motivated by the performance degradation of NLP models for dialectic datasets and its implications for the equity of language technologies, we survey past research in NLP for dialects in terms of datasets, and approaches.
arXiv Detail & Related papers (2024-01-11T03:04:38Z) - NusaWrites: Constructing High-Quality Corpora for Underrepresented and
Extremely Low-Resource Languages [54.808217147579036]
We conduct a case study on Indonesian local languages.
We compare the effectiveness of online scraping, human translation, and paragraph writing by native speakers in constructing datasets.
Our findings demonstrate that datasets generated through paragraph writing by native speakers exhibit superior quality in terms of lexical diversity and cultural content.
arXiv Detail & Related papers (2023-09-19T14:42:33Z) - Neural Machine Translation for the Indigenous Languages of the Americas:
An Introduction [102.13536517783837]
Most languages from the Americas are among them, having a limited amount of parallel and monolingual data, if any.
We discuss the recent advances and findings and open questions, product of an increased interest of the NLP community in these languages.
arXiv Detail & Related papers (2023-06-11T23:27:47Z) - Computational historical linguistics and language diversity in South
Asia [1.5293427903448025]
South Asia is home to a plethora of languages, many of which severely lack access to new language technologies.
This linguistic diversity also results in a research environment conducive to the study of comparative, contact, and historical linguistics.
We claim that data scatteredness is the primary obstacle in the development of South Asian language technology.
arXiv Detail & Related papers (2022-03-23T16:36:24Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.