Related papers: Recent Advancements and Challenges of Turkic Central Asian Language Processing

URL: http://arxiv.org/abs/2407.05006v2
Date: Sat, 23 Nov 2024 12:34:59 GMT
Title: Recent Advancements and Challenges of Turkic Central Asian Language Processing
Authors: Yana Veitsman, Mareike Hartmann
Abstract summary: Research in NLP for Central Asian Turkic languages faces typical low-resource language challenges. Recent advancements have included the collection of language-specific datasets and the development of models for downstream tasks.
Score: 4.189204855014775
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Research in NLP for Central Asian Turkic languages - Kazakh, Uzbek, Kyrgyz, and Turkmen - faces typical low-resource language challenges like data scarcity, limited linguistic resources and technology development. However, recent advancements have included the collection of language-specific datasets and the development of models for downstream tasks. Thus, this paper aims to summarize recent progress and identify future research directions. It provides a high-level overview of each language's linguistic features, the current technology landscape, the application of transfer learning from higher-resource languages, and the availability of labeled and unlabeled data. By outlining the current state, we hope to inspire and facilitate future research.

Related papers

From Data Scarcity to Data Care: Reimagining Language Technologies for Serbian and other Low-Resource Languages [0.0]
This study examines the structural, historical, and sociotechnical factors shaping language technology development for low resource languages in the AI age. It traces challenges rooted in historical destruction of Serbian textual heritage, intensified by contemporary issues. To address these challenges, the study proposes Data Care, a framework grounded in CARE principles.
arXiv Detail & Related papers (2025-12-11T13:29:25Z)
Tibetan Language and AI: A Comprehensive Survey of Resources, Methods and Challenges [27.73456704472439]
Tibetan is one of the major low-resource languages in Asia. Despite increasing interest in developing AI systems for underrepresented languages, Tibetan has received limited attention due to a lack of accessible data resources. This paper provides a comprehensive survey of the current state of Tibetan AI in the AI domain.
arXiv Detail & Related papers (2025-10-22T00:29:35Z)
A Survey on Spoken Italian Datasets and Corpora [0.3222802562733787]
This survey provides a comprehensive analysis of 66 spoken Italian datasets. The datasets are categorized by speech type, source and context, and demographic and linguistic features. Challenges related to dataset scarcity, representativeness, and accessibility are discussed.
arXiv Detail & Related papers (2025-01-11T14:33:57Z)
From Statistical Methods to Pre-Trained Models; A Survey on Automatic Speech Recognition for Resource Scarce Urdu Language [41.272055304311905]
This paper focuses on the resource-constrained Urdu language, which is widely spoken across South Asian nations. It outlines current research trends, technological advancements, and potential directions for future studies in Urdu ASR.
arXiv Detail & Related papers (2024-11-20T17:39:56Z)
LIMBA: An Open-Source Framework for the Preservation and Valorization of Low-Resource Languages using Generative Models [62.47865866398233]
This white paper proposes a framework to generate linguistic tools for low-resource languages. By addressing the data scarcity that hinders intelligent applications for such languages, we contribute to promoting linguistic diversity.
arXiv Detail & Related papers (2024-11-20T16:59:41Z)
Monolingual and Multilingual Misinformation Detection for Low-Resource Languages: A Comprehensive Survey [2.5459710368096586]
This survey provides a comprehensive overview of the current research on low-resource language misinformation detection. We review the existing datasets, methodologies, and tools used in these domains, identifying key challenges related to: data resources, model development, cultural and linguistic context, real-world applications, and research efforts. Our findings underscore the need for robust, inclusive systems capable of addressing misinformation across diverse linguistic and cultural contexts.
arXiv Detail & Related papers (2024-10-24T03:02:03Z)
A Survey on Large Language Models with Multilingualism: Recent Advances and New Frontiers [48.314619377988436]
The rapid development of Large Language Models (LLMs) demonstrates remarkable multilingual capabilities in natural language processing. Despite the breakthroughs of LLMs, the investigation into the multilingual scenario remains insufficient. This survey aims to help the research community address multilingual problems and provide a comprehensive understanding of the core concepts, key techniques, and latest developments in multilingual natural language processing based on LLMs.
arXiv Detail & Related papers (2024-05-17T17:47:39Z)
The Ghanaian NLP Landscape: A First Look [9.17372840572907]
Ghanaian languages, in particular, face an alarming decline, with documented extinction and several at risk. This study pioneers a comprehensive survey of Natural Language Processing (NLP) research focused on Ghanaian languages.
arXiv Detail & Related papers (2024-05-10T21:39:09Z)
Multilingual Large Language Model: A Survey of Resources, Taxonomy and Frontiers [81.47046536073682]
We present a review and provide a unified perspective to summarize the recent progress as well as emerging trends in multilingual large language models (MLLMs) literature. We hope our work can provide the community with quick access and spur breakthrough research in MLLMs.
arXiv Detail & Related papers (2024-04-07T11:52:44Z)
Natural Language Processing for Dialects of a Language: A Survey [56.93337350526933]
State-of-the-art natural language processing (NLP) models are trained on massive training corpora, and report a superlative performance on evaluation datasets. This survey delves into an important attribute of these datasets: the dialect of a language. Motivated by the performance degradation of NLP models for dialectic datasets and its implications for the equity of language technologies, we survey past research in NLP for dialects in terms of datasets, and approaches.
arXiv Detail & Related papers (2024-01-11T03:04:38Z)
NusaWrites: Constructing High-Quality Corpora for Underrepresented and Extremely Low-Resource Languages [54.808217147579036]
We conduct a case study on Indonesian local languages. We compare the effectiveness of online scraping, human translation, and paragraph writing by native speakers in constructing datasets. Our findings demonstrate that datasets generated through paragraph writing by native speakers exhibit superior quality in terms of lexical diversity and cultural content.
arXiv Detail & Related papers (2023-09-19T14:42:33Z)
Neural Machine Translation for the Indigenous Languages of the Americas: An Introduction [102.13536517783837]
Most languages from the Americas are low-resource, having a limited amount of parallel and monolingual data, if any. We discuss the recent advances, findings, and open questions arising from the increased interest of the NLP community in these languages.
arXiv Detail & Related papers (2023-06-11T23:27:47Z)
Computational historical linguistics and language diversity in South Asia [1.5293427903448025]
South Asia is home to a plethora of languages, many of which severely lack access to new language technologies. This linguistic diversity also results in a research environment conducive to the study of comparative, contact, and historical linguistics. We claim that data scatteredness is the primary obstacle in the development of South Asian language technology.
arXiv Detail & Related papers (2022-03-23T16:36:24Z)

This list is automatically generated from the titles and abstracts of the papers on this site.