Related papers: One Country, 700+ Languages: NLP Challenges for Underrepresented Languages and Dialects in Indonesia

URL: http://arxiv.org/abs/2203.13357v1
Date: Thu, 24 Mar 2022 22:07:22 GMT
Title: One Country, 700+ Languages: NLP Challenges for Underrepresented Languages and Dialects in Indonesia
Authors: Alham Fikri Aji, Genta Indra Winata, Fajri Koto, Samuel Cahyawijaya, Ade Romadhony, Rahmad Mahendra, Kemal Kurniawan, David Moeljadi, Radityo Eko Prasojo, Timothy Baldwin, Jey Han Lau, Sebastian Ruder
Abstract summary: We provide an overview of the current state of NLP research for Indonesia's 700+ languages. We highlight challenges in Indonesian NLP and how these affect the performance of current NLP systems.
Score: 60.87739250251769
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: NLP research is impeded by a lack of resources and awareness of the challenges presented by underrepresented languages and dialects. Focusing on the languages spoken in Indonesia, the second most linguistically diverse and the fourth most populous nation of the world, we provide an overview of the current state of NLP research for Indonesia's 700+ languages. We highlight challenges in Indonesian NLP and how these affect the performance of current NLP systems. Finally, we provide general recommendations to help develop NLP technology not only for languages of Indonesia but also other underrepresented languages.

Related papers

LoraxBench: A Multitask, Multilingual Benchmark Suite for 20 Indonesian Languages [45.640417004733166]
We introduce LoraxBench, a benchmark that focuses on low-resource languages of Indonesia. Our dataset covers 20 languages, with the addition of two formality registers for three languages. We show that a change in register affects model performance, especially with registers not commonly found in social media.
arXiv Detail & Related papers (2025-08-17T18:07:57Z)
Charting the Landscape of African NLP: Mapping Progress and Shaping the Road Ahead [24.670007883062475]
Africa represents one of the richest linguistic regions in the world with over 2,000 languages. This diversity is scarcely reflected in state-of-the-art natural language processing systems. We analyze 734 research papers on NLP for African languages published over the past five years.
arXiv Detail & Related papers (2025-05-27T15:13:08Z)
NaijaNLP: A Survey of Nigerian Low-Resource Languages [0.0]
Three languages -- Hausa, Yorùbá and Igbo -- account for about 60% of the spoken languages in Nigeria. These languages are categorised as low-resource due to insufficient resources to support tasks in computational linguistics. This study presents the first comprehensive review of advancements in low-resource NLP (LR-NLP) research across the three major Nigerian languages.
arXiv Detail & Related papers (2025-02-27T05:48:51Z)
Towards Systematic Monolingual NLP Surveys: GenA of Greek NLP [2.3499129784547663]
This study fills the gap by introducing a method for creating systematic and comprehensive monolingual NLP surveys. Characterized by a structured search protocol, it can be used to select publications and organize them through a taxonomy of NLP tasks. By applying our method, we conducted a systematic literature review of Greek NLP from 2012 to 2022.
arXiv Detail & Related papers (2024-07-13T12:01:52Z)
NusaBERT: Teaching IndoBERT to be Multilingual and Multicultural [0.0]
NusaBERT builds upon IndoBERT by incorporating vocabulary expansion and leveraging a diverse multilingual corpus that includes regional languages and dialects. Through rigorous evaluation across a range of benchmarks, NusaBERT demonstrates state-of-the-art performance in tasks involving multiple languages of Indonesia.
arXiv Detail & Related papers (2024-03-04T08:05:34Z)
Natural Language Processing for Dialects of a Language: A Survey [56.93337350526933]
State-of-the-art natural language processing (NLP) models are trained on massive training corpora and report superlative performance on evaluation datasets. This survey delves into an important attribute of these datasets: the dialect of a language. Motivated by the performance degradation of NLP models for dialectic datasets and its implications for the equity of language technologies, we survey past research in NLP for dialects in terms of datasets and approaches.
arXiv Detail & Related papers (2024-01-11T03:04:38Z)
NusaWrites: Constructing High-Quality Corpora for Underrepresented and Extremely Low-Resource Languages [54.808217147579036]
We conduct a case study on Indonesian local languages. We compare the effectiveness of online scraping, human translation, and paragraph writing by native speakers in constructing datasets. Our findings demonstrate that datasets generated through paragraph writing by native speakers exhibit superior quality in terms of lexical diversity and cultural content.
arXiv Detail & Related papers (2023-09-19T14:42:33Z)
NusaCrowd: Open Source Initiative for Indonesian NLP Resources [104.5381571820792]
NusaCrowd is a collaborative initiative to collect and unify existing resources for Indonesian languages. Our work strives to advance natural language processing (NLP) research for languages that are under-represented despite being widely spoken.
arXiv Detail & Related papers (2022-12-19T17:28:22Z)
NusaCrowd: A Call for Open and Reproducible NLP Research in Indonesian Languages [20.051904366350293]
NusaCrowd strives to provide the largest crowdsourcing aggregation with standardized data loading for NLP tasks in all Indonesian languages. By enabling open and centralized access to Indonesian NLP resources, we hope NusaCrowd can tackle the data scarcity problem hindering NLP progress in Indonesia.
arXiv Detail & Related papers (2022-07-21T15:05:42Z)
NusaX: Multilingual Parallel Sentiment Dataset for 10 Indonesian Local Languages [100.59889279607432]
We focus on developing resources for languages in Indonesia. Most languages in Indonesia are categorized as endangered and some are even extinct. We develop the first-ever parallel resource for 10 low-resource languages in Indonesia.
arXiv Detail & Related papers (2022-05-31T17:03:50Z)
Systematic Inequalities in Language Technology Performance across the World's Languages [94.65681336393425]
We introduce a framework for estimating the global utility of language technologies. Our analyses involve the field at large, but also more in-depth studies on both user-facing technologies and more linguistic NLP tasks.
arXiv Detail & Related papers (2021-10-13T14:03:07Z)
Including Signed Languages in Natural Language Processing [48.62744923724317]
Signed languages are the primary means of communication for many deaf and hard of hearing individuals. This position paper calls on the NLP community to include signed languages as a research area with high social and scientific impact.
arXiv Detail & Related papers (2021-05-11T17:37:55Z)
The State and Fate of Linguistic Diversity and Inclusion in the NLP World [12.936270946393483]
Language technologies contribute to promoting multilingualism and linguistic diversity around the world. Only a very small number of the over 7000 languages of the world are represented in the rapidly evolving language technologies and applications.
arXiv Detail & Related papers (2020-04-20T07:19:22Z)

This list is automatically generated from the titles and abstracts of the papers on this site.