One Country, 700+ Languages: NLP Challenges for Underrepresented
Languages and Dialects in Indonesia
- URL: http://arxiv.org/abs/2203.13357v1
- Date: Thu, 24 Mar 2022 22:07:22 GMT
- Authors: Alham Fikri Aji, Genta Indra Winata, Fajri Koto, Samuel Cahyawijaya,
Ade Romadhony, Rahmad Mahendra, Kemal Kurniawan, David Moeljadi, Radityo Eko
Prasojo, Timothy Baldwin, Jey Han Lau, Sebastian Ruder
- Abstract summary: We provide an overview of the current state of NLP research for Indonesia's 700+ languages.
We highlight challenges in Indonesian NLP and how these affect the performance of current NLP systems.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: NLP research is impeded by a lack of resources and awareness of the
challenges presented by underrepresented languages and dialects. Focusing on
the languages spoken in Indonesia, the second most linguistically diverse and
the fourth most populous nation of the world, we provide an overview of the
current state of NLP research for Indonesia's 700+ languages. We highlight
challenges in Indonesian NLP and how these affect the performance of current
NLP systems. Finally, we provide general recommendations to help develop NLP
technology not only for languages of Indonesia but also other underrepresented
languages.
Related papers
- Towards Systematic Monolingual NLP Surveys: GenA of Greek NLP [2.3499129784547663]
This study fills the gap by introducing a method for creating systematic and comprehensive monolingual NLP surveys.
Characterized by a structured search protocol, it can be used to select publications and organize them through a taxonomy of NLP tasks.
By applying our method, we conducted a systematic literature review of Greek NLP from 2012 to 2022.
arXiv Detail & Related papers (2024-07-13T12:01:52Z)
- NusaBERT: Teaching IndoBERT to be Multilingual and Multicultural [0.0]
NusaBERT builds upon IndoBERT by incorporating vocabulary expansion and leveraging a diverse multilingual corpus that includes regional languages and dialects.
Through rigorous evaluation across a range of benchmarks, NusaBERT demonstrates state-of-the-art performance in tasks involving multiple languages of Indonesia.
arXiv Detail & Related papers (2024-03-04T08:05:34Z)
- Natural Language Processing for Dialects of a Language: A Survey [56.93337350526933]
State-of-the-art natural language processing (NLP) models are trained on massive training corpora, and report a superlative performance on evaluation datasets.
This survey delves into an important attribute of these datasets: the dialect of a language.
Motivated by the performance degradation of NLP models for dialectic datasets and its implications for the equity of language technologies, we survey past research in NLP for dialects in terms of datasets, and approaches.
arXiv Detail & Related papers (2024-01-11T03:04:38Z)
- NusaWrites: Constructing High-Quality Corpora for Underrepresented and Extremely Low-Resource Languages [54.808217147579036]
We conduct a case study on Indonesian local languages.
We compare the effectiveness of online scraping, human translation, and paragraph writing by native speakers in constructing datasets.
Our findings demonstrate that datasets generated through paragraph writing by native speakers exhibit superior quality in terms of lexical diversity and cultural content.
arXiv Detail & Related papers (2023-09-19T14:42:33Z)
- NusaCrowd: Open Source Initiative for Indonesian NLP Resources [104.5381571820792]
NusaCrowd is a collaborative initiative to collect and unify existing resources for Indonesian languages.
Our work strives to advance natural language processing (NLP) research for languages that are under-represented despite being widely spoken.
arXiv Detail & Related papers (2022-12-19T17:28:22Z)
- NusaCrowd: A Call for Open and Reproducible NLP Research in Indonesian Languages [20.051904366350293]
NusaCrowd strives to provide the largest crowdsourcing aggregation with standardized data loading for NLP tasks in all Indonesian languages.
By enabling open and centralized access to Indonesian NLP resources, we hope NusaCrowd can tackle the data scarcity problem hindering NLP progress in Indonesia.
arXiv Detail & Related papers (2022-07-21T15:05:42Z)
- NusaX: Multilingual Parallel Sentiment Dataset for 10 Indonesian Local Languages [100.59889279607432]
We focus on developing resources for languages in Indonesia.
Most languages in Indonesia are categorized as endangered and some are even extinct.
We develop the first-ever parallel resource for 10 low-resource languages in Indonesia.
arXiv Detail & Related papers (2022-05-31T17:03:50Z)
- Systematic Inequalities in Language Technology Performance across the World's Languages [94.65681336393425]
We introduce a framework for estimating the global utility of language technologies.
Our analyses involve the field at large, but also more in-depth studies on both user-facing technologies and more linguistic NLP tasks.
arXiv Detail & Related papers (2021-10-13T14:03:07Z)
- The State and Fate of Linguistic Diversity and Inclusion in the NLP World [12.936270946393483]
Language technologies contribute to promoting multilingualism and linguistic diversity around the world.
Only a very small number of the over 7000 languages of the world are represented in the rapidly evolving language technologies and applications.
arXiv Detail & Related papers (2020-04-20T07:19:22Z)
This list is automatically generated from the titles and abstracts of the papers on this site.