Recent Advancements and Challenges of Turkic Central Asian Language Processing
- URL: http://arxiv.org/abs/2407.05006v2
- Date: Sat, 23 Nov 2024 12:34:59 GMT
- Title: Recent Advancements and Challenges of Turkic Central Asian Language Processing
- Authors: Yana Veitsman, Mareike Hartmann
- Abstract summary: Research in NLP for Central Asian Turkic languages faces typical low-resource language challenges.
Recent advancements have included the collection of language-specific datasets and the development of models for downstream tasks.
- Score: 4.189204855014775
- Abstract: Research in NLP for Central Asian Turkic languages - Kazakh, Uzbek, Kyrgyz, and Turkmen - faces typical low-resource language challenges like data scarcity, limited linguistic resources and technology development. However, recent advancements have included the collection of language-specific datasets and the development of models for downstream tasks. Thus, this paper aims to summarize recent progress and identify future research directions. It provides a high-level overview of each language's linguistic features, the current technology landscape, the application of transfer learning from higher-resource languages, and the availability of labeled and unlabeled data. By outlining the current state, we hope to inspire and facilitate future research.
Related papers
- From Statistical Methods to Pre-Trained Models; A Survey on Automatic Speech Recognition for Resource Scarce Urdu Language [41.272055304311905]
This paper focuses on the resource-constrained Urdu language, which is widely spoken across South Asian nations.
It outlines current research trends, technological advancements, and potential directions for future studies in Urdu ASR.
arXiv Detail & Related papers (2024-11-20T17:39:56Z)
- LIMBA: An Open-Source Framework for the Preservation and Valorization of Low-Resource Languages using Generative Models [62.47865866398233]
This white paper proposes a framework to generate linguistic tools for low-resource languages.
By addressing the data scarcity that hinders intelligent applications for such languages, we contribute to promoting linguistic diversity.
arXiv Detail & Related papers (2024-11-20T16:59:41Z)
- Monolingual and Multilingual Misinformation Detection for Low-Resource Languages: A Comprehensive Survey [2.5459710368096586]
This survey provides a comprehensive overview of the current research on low-resource language misinformation detection.
We review the existing datasets, methodologies, and tools used in these domains, identifying key challenges related to: data resources, model development, cultural and linguistic context, real-world applications, and research efforts.
Our findings underscore the need for robust, inclusive systems capable of addressing misinformation across diverse linguistic and cultural contexts.
arXiv Detail & Related papers (2024-10-24T03:02:03Z)
- A Survey on Large Language Models with Multilingualism: Recent Advances and New Frontiers [48.314619377988436]
The rapid development of Large Language Models (LLMs) demonstrates remarkable multilingual capabilities in natural language processing.
Despite the breakthroughs of LLMs, the investigation into the multilingual scenario remains insufficient.
This survey aims to help the research community address multilingual problems and provide a comprehensive understanding of the core concepts, key techniques, and latest developments in multilingual natural language processing based on LLMs.
arXiv Detail & Related papers (2024-05-17T17:47:39Z)
- The Ghanaian NLP Landscape: A First Look [9.17372840572907]
Ghanaian languages, in particular, face an alarming decline, with documented extinction and several at risk.
This study pioneers a comprehensive survey of Natural Language Processing (NLP) research focused on Ghanaian languages.
arXiv Detail & Related papers (2024-05-10T21:39:09Z)
- Multilingual Large Language Model: A Survey of Resources, Taxonomy and Frontiers [81.47046536073682]
We present a review and provide a unified perspective to summarize the recent progress as well as emerging trends in multilingual large language models (MLLMs) literature.
We hope our work can provide the community with quick access and spur breakthrough research in MLLMs.
arXiv Detail & Related papers (2024-04-07T11:52:44Z)
- Natural Language Processing for Dialects of a Language: A Survey [56.93337350526933]
State-of-the-art natural language processing (NLP) models are trained on massive training corpora, and report a superlative performance on evaluation datasets.
This survey delves into an important attribute of these datasets: the dialect of a language.
Motivated by the performance degradation of NLP models on dialectal datasets and its implications for the equity of language technologies, we survey past research in NLP for dialects in terms of datasets and approaches.
arXiv Detail & Related papers (2024-01-11T03:04:38Z)
- NusaWrites: Constructing High-Quality Corpora for Underrepresented and Extremely Low-Resource Languages [54.808217147579036]
We conduct a case study on Indonesian local languages.
We compare the effectiveness of online scraping, human translation, and paragraph writing by native speakers in constructing datasets.
Our findings demonstrate that datasets generated through paragraph writing by native speakers exhibit superior quality in terms of lexical diversity and cultural content.
arXiv Detail & Related papers (2023-09-19T14:42:33Z)
- Neural Machine Translation for the Indigenous Languages of the Americas: An Introduction [102.13536517783837]
Most languages of the Americas are low-resource, having a limited amount of parallel and monolingual data, if any.
We discuss recent advances, findings, and open questions that have emerged from the NLP community's increased interest in these languages.
arXiv Detail & Related papers (2023-06-11T23:27:47Z)
- Computational historical linguistics and language diversity in South Asia [1.5293427903448025]
South Asia is home to a plethora of languages, many of which severely lack access to new language technologies.
This linguistic diversity also results in a research environment conducive to the study of comparative, contact, and historical linguistics.
We claim that data scatteredness is the primary obstacle in the development of South Asian language technology.
arXiv Detail & Related papers (2022-03-23T16:36:24Z)